Tinyhack.com

Teensy LC U2F key

Around the beginning of last month, GitHub users could buy special edition U2F security keys for 5 USD (5000 keys were available), and I got two of them. Universal 2nd Factor (U2F) is an open authentication standard that strengthens and simplifies two-factor authentication using specialized USB or NFC devices.

A U2F USB key is a second-factor authentication device, so it doesn’t replace our password. To log in to a website, we need to enter our username and password AND present the U2F USB key. To check for user presence (to prevent malware from accessing the key without user consent), the device usually has a button that needs to be pressed when logging in.

Currently Google (Gmail, Google Drive, etc.), GitHub, and Dropbox support U2F devices, and we can also add support to our own sites or apps using plugins or by accessing the API directly (a plugin for WordPress is available).

After receiving the keys, I got curious and started to read the U2F specifications. The protocol is quite simple, but so far I haven’t been able to find an implementation of a U2F key device using existing microcontrollers (Arduino or anything else). The U2F protocol uses ECC signing, and I found that there is already a small ECC library for AVR and ARM (micro-ecc). It supports ECDSA with the P-256 curve required by U2F.

A U2F device is actually just a USB HID device, so I need something that I can easily program as an HID device. The easiest device to program that I have is a Teensy LC. I tested compiling the micro-ecc library and found that it results in about 15 kilobytes of code, so the Teensy LC should be OK (it has 64 KB of flash and 8 KB of RAM). The Teensy LC is also very small, which makes it ideal if someday I want to put a case around it.

I couldn’t find an easy way to add a new USB device type using Teensyduino, so I decided to just patch usb_desc.h. The only changes needed were to set RAWHID_USAGE_PAGE to 0xf1d0 and RAWHID_USAGE to 0x01. I also changed PRODUCT_NAME to “Teensyduino U2FHID” just to make it easy to check that this works. The nice thing is that this doesn’t break anything (all code using RawHID still runs with these changes), and we can still see our code’s output using the virtual serial port provided by Teensyduino.

#elif defined(USB_RAWHID)
  #define VENDOR_ID		0x16C0
  #define PRODUCT_ID		0x0486
//  #define RAWHID_USAGE_PAGE	0xFFAB  // recommended: 0xFF00 to 0xFFFF
//  #define RAWHID_USAGE		0x0200  // recommended: 0x0100 to 0xFFFF
  #define RAWHID_USAGE_PAGE	0xf1d0  // FIDO Alliance usage page (was 0xFFAB)
  #define RAWHID_USAGE		0x01  // U2FHID usage (was 0x0200)

  #define MANUFACTURER_NAME	{'T','e','e','n','s','y','d','u','i','n','o'}
  #define MANUFACTURER_NAME_LEN	11
  #define PRODUCT_NAME		{'T','e','e','n','s','y','d','u','i','n','o',' ','U','2','F','H','I','D'}

The U2F protocol is actually quite simple. When we want to use the hardware U2F key in a webapp (or desktop app), we need to add our USB key to the app’s database. In practice, on the website you would choose a menu item that says “Add device” or “Register new device”.

When you choose register/add device, the app sends a REGISTER request to the hardware U2F USB key with a unique appid (for a web app, this consists of the domain name and port). The hardware U2F key generates a private/public key pair specific to this appid and responds by sending a “key handle” and a “public key” to the app. If we have several usernames on an app/website, we can use a single hardware U2F key for all accounts (the “key handle” will be different for each account).

The next time the user wants to log in, the app/webapp sends an authentication request to the hardware U2F key. In practice, when logging in, the website asks you to plug in the hardware U2F key and press its button.

The app sends a random challenge, the appid (to identify which app it is), and the “key handle” (so the hardware U2F key knows which private key to use to sign the request). The hardware U2F key replies with the same random challenge signed with the private key corresponding to the “key handle”, and it also increases a counter (the counter is there to prevent replay and cloning attacks).

There are two ways the hardware U2F key can keep track of which private key to use for a “key handle”. The first is to store a mapping from key handle to private key in storage on the hardware U2F key, so that when an app asks for a specific key handle, the device can look up the private key. The second method is easier and doesn’t require any storage, but is slightly less secure: the “key handle” actually contains the private key itself (in encrypted form, otherwise anyone could send the request). Since the Teensy LC only has 128 bytes of EEPROM, I used the second approach.

Google provides U2F reference code, including tools to test USB U2F keys. I started by using these to test my implementation step by step with HidTest and U2Ftest. In retrospect this was not really necessary to get a working U2F key for websites. The tests cover cases that just wouldn’t happen normally, and sometimes they make strange assumptions (for example, as far as I know nothing in the specification says that the key handle must be at least 64 bytes in size).

The Teensy LC doesn’t provide a user button (just a reset button), and I don’t want to add a button to it (it wouldn’t be as portable anymore). So I just implemented everything without a button press. This is insecure, but it’s OK for me for testing. For the “key handle” I use a very simple XOR encryption with a fixed key, which is not very secure. If you want a more secure approach, you can use a more complicated method.

Most of the time, implementing your own device is not more secure than buying a commercial solution, but sometimes it has some advantages. For example, most devices that I know of don’t have a ‘reset’ mechanism. So if, for instance, you are caught with a device and your captors have access to a website’s data, they can prove from your device that you have an account on that site (there is a protocol to check whether a given key handle was generated by a particular hardware U2F device).

With our custom solution we can reset/reflash our own device (or just change the encryption key) and have plausible deniability that we are not related to that site (the suggestion in the U2F specification is to destroy a device if you no longer want to associate a website with it and the device doesn’t have a reset mechanism).


I have published my source on GitHub in case someone wants to implement something similar for other devices (or to improve my implementation). I have included the micro-ecc source because I want to experiment with removing some unneeded functions to reduce the code size (for example, we always use the uncompressed point representation for U2F, we only use a single specific curve, we never need to verify a signature, etc.). You should change the key “-YOHANES-NUGROHO-YOHANES-NUGROHO-” for your own device (it must be 64 characters if you want security). There are still a lot of things that I want to explore regarding U2F security, and having a device that I can hack will make things easier.

Update: some people are really worried about my XOR method. You can change the key and make it 64 bytes long. It’s basically a one-time pad (XORing 64 bytes with some unknown 64 bytes). If you want it to be more secure, change the XOR into anything else that you want (this is something that is not specified in the standard). Even a Yubico U2F device is compromised if you know the master key. In their blog post, they only mention that the master key is generated during manufacturing and don’t say whether they also keep a record of the keys.

Update again: this is not secure, see http://www.makomk.com/2015/11/10/breaking-a-teensy-u2f-implementation/.

Regarding the buttonless approach: it’s really easy to add a button. In my code, there is an ifdef for SIMULATE_BUTTON. It just pretends that the button was not pressed on the first request and pressed on the second request. Just change it so that it really reads a physical button.

Exploiting the Futex Bug and uncovering Towelroot

The futex bug (CVE-2014-3153) is a serious bug that affects most Linux kernel versions and was made popular by geohot in his towelroot exploit. You can read the original comex report at HackerOne. Others have successfully implemented this (this one for example), but no public exploit source code is available.

This post describes in detail what exactly the futex bug is and how to exploit it, and also explains how towelroot works and what the modstring in Towelroot v3 actually does. Following in the footsteps of other security researchers, I will not give full source code for the exploit. By giving enough details, I hope programmers can learn from and appreciate the exploit created for this bug. By not releasing the source, I hope to stop most script kiddies. There are some small details that I will gloss over (about the priority list manipulation), so it will require some thinking and experimentation to implement the exploit.

One thing to note: I have done some kernel programming but had never written a kernel exploit before, so this is my first. I hope this is a good write-up for newbie exploit writers like me. Distributing the exploit source code would be useful only to a handful of people, but writing about it should be useful to all programmers interested in this.

Towelroot is not open source, and the binary is protected from reverse engineering by compiling it with llvm-obfuscator. When I started, I tried using a 64-bit kernel on my desktop and was not successful because I couldn’t find a syscall that can alter the stack in the correct location. So I decided to do black-box reverse engineering by looking at the syscalls used by towelroot. Since (I think) I know how towelroot works, I will discuss it, and I hope it will help people understand/modify the modstrings used in towelroot v3.

If you are an experienced exploit writer, just jump to the “On to the Kernel” part. The initial part is only for those inexperienced in writing exploits.

Before exploiting this bug we need to understand what the bug is. In short, the bug is that a data structure on the stack, which is part of a priority list, is left there and can be manipulated. This is very vague for most programmers, so let’s break it down. You need to understand what a stack is and how it works.

The Stack

A stack is a block of memory set aside for local variables, for parameter passing, and for storing the return address of a procedure call. Usually when we talk about the stack and exploits, we try to alter the return address and redirect it to another address, and probably do ROP (Return Oriented Programming). This is not the case with this bug, so forget about that. Even though this bug is about futexes, it is also not a race condition bug.

Stack memory is reused across procedure calls (not cleared). See this simple example:

You can compile then run it:

gcc test.c -o mytest

Note: just compile normally, don’t use any optimization level (-Ox):

$./mytest
local foo is 10
local foo is 12

As you can see, both procedures use exactly the same stack layout. Both locals use the same stack location. When bar is called, it writes to the same stack location used by “foo” for its local variable. This simple concept will play a role in understanding the bug.

The next topic is linked lists. Usually a pointer-based data structure uses the heap for storing its elements (we don’t use the stack because we want the elements to be “permanent” across calls), but sometimes we can just use the stack, as long as we know that the element is going to be removed before the function exits. This saves time on allocation/deallocation. Here is an example of a made-up problem where we put an element located on the stack into a list and then remove it again before returning.

In that example, if we don’t remove the element before we return, the app will likely crash when the stack content is altered and the list is then manipulated. That is the simplified version of the bug in userland.

To exploit the bug, we need a good understanding of how futexes work, especially PI (Priority Inheritance) futexes. There are very good papers that you can read about this topic. The first one is Futexes Are Tricky, which will give you an idea of what a futex is and how it works. The other one is Requeue-PI: Making Glibc Condvars PI-Aware, which gives very thorough details about the PI futex implementation in the kernel. I suggest that anyone trying to implement the exploit read at least the second paper.

For those of you who are not curious enough to read the papers, I will try to simplify it: when there are tasks waiting for a PI futex, the kernel creates a priority list of those tasks (to be precise, it creates a waiter structure for each task). A priority list is used because we want to maintain a property of a PI futex: a task with a high priority will be woken first even if it started waiting after a low-priority task.

The node of this priority list is stored on the kernel stack. Note that when a task waits for a futex, it waits in kernel context and does not return to userland, so the use of the kernel stack here makes complete sense. Please note that before kernel 3.13 the kernel uses a plist, but after 3.13 it uses an rb_node for storing the list. If you want to exploit the latest kernels, you will need to handle that.

This is actually where the bug is: there is a case where the waiter is still linked in the waiter list when the function returns. Please note that the kernel stack is completely separate from the user stack. You cannot influence the kernel stack just by calling your own function in userspace. You can manipulate kernel stack values (as in the first example) by making syscalls.

If you don’t understand much about kernel, you can imagine a syscall is like a remote call to another machine: the execution on the other “machine” will use a separate stack from the one that you use in your local (user mode code). Note that this analogy doesn’t really hold when we talk about exploiting memory space.

A simpler bug

Before going into the details of how the bug can be exploited in the kernel, I will present a much smaller and easier to understand userland program that has a similar bug. After you understand this, I will show how the kernel exploit works based on the same principle:

First, you will need to compile this in 32-bit mode so that the size of an int is the same as the size of a pointer (32 bits). Don’t use any optimization when compiling.

gcc -m32 list1.c -o mylist

Then run the resulting executable without any parameters. It will print something like this:

$ ./mylist 
we will use pos: -1
Not a buggy function
Value = Alpha
Value = Beta
Value = --END OF LIST--
a buggy function, here is the location of value on stack 0xffd724a4
Value = Alpha
Value = Beta
Value = --END OF LIST--
location of buf[0] is 0xffd72488
location of buf[1] is 0xffd7248c
location of buf[2] is 0xffd72490
location of buf[3] is 0xffd72494
location of buf[4] is 0xffd72498
location of buf[5] is 0xffd7249c
location of buf[6] is 0xffd724a0
location of buf[7] is 0xffd724a4
location of buf[8] is 0xffd724a8
location of buf[9] is 0xffd724ac
Value = Alpha
Value = Beta
Value = --END OF LIST--

Notice the location of value on stack, which in this example is 0xffd724a4. Notice that it is the same as the address of buf[7]. Now run the test again, like this:

$./mylist 7 HACKED
we will use pos: 7
Not a buggy function
Value = Alpha
Value = Beta
Value = --END OF LIST--
a buggy function, here is the location of value on stack 0xffd3bd34
Value = Alpha
Value = Beta
Value = --END OF LIST--
Value = Alpha
Value = Beta
Value = HACKED

There is no assignment of the value “HACKED” to the list element, but it printed the word “HACKED”. Why does this work? Because we assign to the exact memory location on the stack where “value” was stored. Also note that the location printed is now different because of ASLR, but because the position within the stack is the same, element number 7 is also relocated to the same address.

Now try values other than 7, including a very small or a very large value. You can also try this with the stack protector on:

gcc -m32 -fstack-protector list1.c -o mylist

The element that matches the address will be different (maybe 7 becomes 8 or 6), but the exploit will still work if you adjust the element number to match the address.

In this example, I made a very convenient function that makes the “exploit” easy (named a_function_to_exploit). In the real world, if we want to modify a certain location on the stack, we need to find a function that has the correct “depth” (that is, its stack usage is at least as large as that of the function we are trying to exploit), and we need to be able to manipulate the value at that stack depth. To understand stack depth, you can comment out dummy_var1/dummy_var2/dummy_var3, compile, and see the stack address change. You can also see that if the function is optimized, certain variables are no longer on the stack (they are moved to registers if possible).

Writing using list

Once you know how to manipulate an element of the list, you can write to a chosen memory address. In most exploits, what you want to do is write something to an address. To make this part short and to show my point, I will give an example for a simple linked list. If we have this:

Assume that we can control the content of n; then we can write an almost arbitrary value to an arbitrary address. Let’s assume that prev is at offset 4 of the node and next is at offset 8. Please note that this structure assignment:

A->B = C;

is the same as:

*(A + offset_B) = C;

Let’s assume we want to overwrite memory at location X with value Y. First prepare a fake node for the “next”; this fakenext is located at memory location Y (the location of the fake node is the value that we want to write). Of course this is limited to accessible memory space (so that a segmentation fault will not happen).

If we just want to change a value from 0 (for example: a variable containing number 0 if something is not allowed) to any non zero number (something is allowed), then we can use any number (we don’t even need a fake element, just a valid pointer to the “next” element).

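n->next = fakenext;
n->prev = X-8;

so when the node is unlinked from the list, this is what happens:

(n->next)->prev = n->prev;
(n->prev)->next = n->next;

since n->next points to our fakenext and n->prev is X-8:

(fakenext)->prev = X-8;
(X-8)->next = fakenext;

The second assignment writes to address (X-8)+8 = X, so location X now holds the address of fakenext, which is Y.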

You can read more about this kind of list manipulation by searching Google for heap exploits. For example, there are several Phrack articles about exploiting memory allocators that use lists in their implementation (Vudo Malloc Tricks, Once upon a free, Advanced Doug lea’s malloc exploits and Malloc Des-Maleficarum).

These two things, that stack content can be manipulated and that a manipulated list can be used to write to any address, are the basis of the futex exploit.

On to the Kernel

Now let’s go to where the bug is in the kernel.

You can also see the full code for futex_wait_requeue_pi().

In my simple userland code, there is a variable called instack that we want to manipulate; this time, we are interested in rt_waiter. In the case where futex requeue is called with uaddr1==uaddr2, the code path leaves q.rt_waiter set to NULL even though the waiter is still linked in the waiter list.

static int futex_requeue(u32 __user *uaddr1, unsigned int flags,u32 __user *uaddr2, int nr_wake, int nr_requeue,u32 *cmpval, int requeue_pi)

And the source code for futex_requeue()

Now we need a corresponding function for a_function_to_exploit. This syscall must be “deep” enough to touch the rt_waiter, so a syscall that doesn’t have a local variable, and doesn’t call other function is not usable.

First let’s examine the function when we call it. From now on, to show certain things, I will show how it looks in gdb connected to qemu for kernel debugging. Since I want to show things about towelroot, I am using qemu with an ARM kernel. I am using this official guide for building the Android kernel, combined with a Stack Overflow answer. When compiling the kernel, don’t forget to enable debug symbols. I created an Android 4.3 AVD and started the emulator with this:

~/adt-bundle-linux/sdk/tools/emulator-arm -show-kernel -kernel arch/arm/boot/zImage -avd joeavd -no-boot-anim -no-skin -no-audio -no-window -logcat *:v -qemu -monitor telnet::4444,server -s

I can control the virtual machine via telnet localhost 4444, and debug using gdb:

$ arm-eabi-gdb vmlinux

To connect and start debugging:

(gdb) target remote :1234

Why do I use GDB? Because this is the easiest way to get the sizes of structures and to know what optimizations the compiler did. Let’s break on futex_wait_requeue_pi:

Breakpoint 6, futex_wait_requeue_pi (uaddr=0xb6e8be68, flags=1, val=0, abs_time=0x0, bitset=4294967295, uaddr2=0xb6e8be6c) at kernel/futex.c:2266
2266    {
(gdb) list
2261     * <0 - On error
2262     */
2263    static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
2264                                     u32 val, ktime_t *abs_time, u32 bitset,
2265                                     u32 __user *uaddr2)
2266    {
2267            struct hrtimer_sleeper timeout, *to = NULL;
2268            struct rt_mutex_waiter rt_waiter;
2269            struct rt_mutex *pi_mutex = NULL;
2270            struct futex_hash_bucket *hb;
(gdb) list
2271            union futex_key key2 = FUTEX_KEY_INIT;
2272            struct futex_q q = futex_q_init;
2273            int res, ret;

We can see the structures

(gdb) ptype timeout
type = struct hrtimer_sleeper {
    struct hrtimer timer;
    struct task_struct *task;
}

(gdb) ptype rt_waiter
type = struct rt_mutex_waiter {
    struct plist_node list_entry;
    struct plist_node pi_list_entry;
    struct task_struct *task;
    struct rt_mutex *lock;
}

We can also check the backtrace of this function

(gdb) bt
#0  futex_wait_requeue_pi (uaddr=0xb6e8be68, flags=1, val=0, abs_time=0x0, bitset=4294967295, uaddr2=0xb6e8be6c) at kernel/futex.c:2266
#1  0xc0050e54 in do_futex (uaddr=0xb6e8be68, op=<optimized out>, val=0, timeout=<optimized out>, uaddr2=0xb6e8be6c, val2=0, val3=3068706412)
    at kernel/futex.c:2668
#2  0xc005100c in sys_futex (uaddr=0xb6e8be68, op=11, val=0, utime=<optimized out>, uaddr2=0xb6e8be6c, val3=0) at kernel/futex.c:2707
#3  0xc000d680 in ?? ()
#4  0xc000d680 in ?? ()

We can print out the size of the structures:

(gdb) print sizeof(rt_waiter)
$1 = 48
(gdb) print sizeof(timeout)
$2 = 56
(gdb) print sizeof(to)
$3 = 4
(gdb) print sizeof(q)
$4 = 56
(gdb) print sizeof(hb)
$5 = 4

We can look at the addresses of things. In this case, the variable pi_mutex has been optimized into a register, so we can’t take its address.

(gdb) print &key2
$6 = (union futex_key *) 0xd801de04
(gdb) print &timeout
$7 = (struct hrtimer_sleeper *) 0xd801de40
(gdb) print *&to
Can't take address of "to" which isn't an lvalue.
(gdb) print &rt_waiter
$8 = (struct rt_mutex_waiter *) 0xd801de10
(gdb) print &pi_mutex
Can't take address of "pi_mutex" which isn't an lvalue.
(gdb) print &hb
$9 = (struct futex_hash_bucket **) 0xd801ddf8
(gdb) print &q
$10 = (struct futex_q *) 0xd801de78

In the Linux kernel source code, there is a very convenient script that can measure the stack usage of functions. Unfortunately it doesn’t work very well on ARM, but here is an example of its output on i386.

objdump -d vmlinux | ./scripts/checkstack.pl


...
0xc10ff286 core_sys_select [vmlinux]:			296
0xc10ff4ac core_sys_select [vmlinux]:			296
...
0xc106a236 futex_wait_requeue_pi.constprop.21 [vmlinux]:212
0xc106a380 futex_wait_requeue_pi.constprop.21 [vmlinux]:212
...
0xc14a406b sys_recvfrom [vmlinux]:			180
0xc14a4165 sys_recvfrom [vmlinux]:			180
0xc14a4438 __sys_sendmmsg [vmlinux]:			180
0xc14a4524 __sys_sendmmsg [vmlinux]:			180
...

The numbers on the right show the maximum stack usage of each function. The rt_waiter is not the last variable on the stack, so we don’t really need to go deeper than 212. The deeper the stack, the lower the addresses used. In our case we can ignore the hash bucket, key2, and q, which total 64 bytes. Any syscall that has a stack usage of more than 212-64 is a candidate.

Learning from geohot’s exploit, there are four very convenient syscalls that can be used: sendmmsg, recvmmsg, sendmsg, and recvmsg. The choice among them is the first value in the towelroot modstring (they correspond to methods 0-3).

Knowing the address of rt_waiter, let’s look at the kernel stack when towelroot calls sys_sendmmsg, and in particular at the iovstack array.

(gdb) print &iovstack[0]
$17 = (struct iovec *) 0xd801ddf0
(gdb) print &iovstack[1]
$18 = (struct iovec *) 0xd801ddf8
(gdb) print &iovstack[2]
$19 = (struct iovec *) 0xd801de00
(gdb) print &iovstack[3]
$20 = (struct iovec *) 0xd801de08
(gdb) print &iovstack[4]
$21 = (struct iovec *) 0xd801de10

Look at that: iovstack[4] is at the same address as rt_waiter.

(gdb) print iovstack
$22 = {{iov_base = 0xa0000800, iov_len = 125}, {iov_base = 0xa0000800, iov_len = 125}, {iov_base = 0xa0000800, iov_len = 125}, {
    iov_base = 0xa0000800, iov_len = 125}, {iov_base = 0xa0000800, iov_len = 1050624}, {iov_base = 0xa0000800, iov_len = 125}, {
    iov_base = 0xa0000800, iov_len = 125}, {iov_base = 0xa0000800, iov_len = 125}}

And now let’s view the iov as a struct rt_mutex_waiter:

(gdb) print *(struct rt_mutex_waiter*)&iovstack[4]
$23 = {list_entry = {prio = -1610610688, prio_list = {next = 0x100800, prev = 0xa0000800}, node_list = {next = 0x7d, prev = 0xa0000800}},
  pi_list_entry = {prio = 125, prio_list = {next = 0xa0000800, prev = 0x7d}, node_list = {next = 0xa0000800, prev = 0xa0000800}},
  task = 0xa0000800, lock = 0xa0000800}

Of course, this function could return immediately once the data is received on the other side. So what towelroot does is this: it creates a thread that accepts connections on localhost, and after it accept()s one, it never reads the data, so the sendmmsg call just hangs there waiting for the data to be sent. The receiver thread looks like this:

bind()
listen()
while (1) {
      s = accept();
      log("i have a client like hookers");
}

As you saw in the simple list example, compiler optimization can cause the address to change slightly, so another parameter in the modstring is hit_iov. In the towelroot code, it looks something like this:

for (i =0 ;i < 8; i++) {
  iov[i].iov_base = (void *)0xa0000800;
  iov[i].iov_len = 0x7d;
  if (i==TARGET_IOV) {
     iov[i].iov_len = 0x100800;
  }
}

The other parameter related to iov is align. I am not completely sure what this does or when it is needed. From my observation, it sets all iov_len values to 0x100800.

On the other thread, basically what towelroot does is this:

void sender_thread(){
 setpriority(N);
 connect(to_the_listener);
 futex_wait_requeue_pi(...);
 sendmmsg(...) ;
}

Unfortunately, having multiple cores can ruin things, so to be safe, before starting, towelroot sets the process affinity so that the process runs on only one core.

The details of manipulating the waiter priority list are left to the reader (the objective is to write to a memory address), but I can give you some pointers: to add a new waiter, use FUTEX_LOCK_PI, and to control where the item will be put, call setpriority prior to waiting. The baseline value is 120 (for nice value 0), so if you set priority 12 using setpriority, you will see the priority as 132 in kernel land. plist.h and plist.c total only around 500 lines, and you only need to go into detail for plist_add and plist_del. Depending on what we want to overwrite, we don’t even need to be able to set a specific value.

To modify the list, you will need multiple threads with different priorities. To be sure that a thread you started is really waiting inside a syscall (there is some delay between creating the thread that calls the syscall and the moment it waits inside the syscall), you can use the trick used by towelroot: it reads /proc/PID/task/TID/status for the task with the TID that you want to check. When the task is inside a syscall, voluntary_ctxt_switches will keep increasing (and nonvoluntary_ctxt_switches will stay the same, but towelroot doesn’t check this). A voluntary context switch happens when a task is waiting inside a syscall.

Usually what people do when exploiting the kernel is write to a function pointer. We can get the location where these pointers are stored by reading /proc/kallsyms or by looking at the System.map generated when compiling the kernel. Neither is usually (easily) available on the latest Android (this is one of the security mitigations introduced in Android 4.1). You may be able to get the address by recompiling the exact same kernel image using the same compiler and kernel configuration. On most PC distributions, you can find the symbols easily (System.map is available and /proc/kallsyms is not restricted).

Assuming that you can somehow get the address of a function pointer to overwrite, you can write a function that sets the current process credentials to have root access and redirect the pointer so that your function is called instead of the original. But there is another way to change the process credentials without writing any code that runs in kernel mode, just by writing to kernel memory. You can read the full presentation here. In the next part, I will only discuss the part needed to implement the exploit.

The Kernel Stack

Every thread is assigned a kernel stack space, and part of the stack space contains the thread_info for that task. The stack address is different for every thread (the size is 8 KB per thread on ARM) and you cannot predict the address. So there will be a bunch of 8 KB stack blocks, one allocated for every thread. The thread_info is stored in the stack, at the lowest address (the stack begins at the top/high address and grows down). This thread_info contains information such as the pointer to the task_struct. Inside a syscall, the task_struct for the current thread is accessible by using the current macro:

#define get_current() (current_thread_info()->task)
#define current get_current()

In ARM, current_thread_info() is defined as:

#define THREAD_SIZE             8192
static inline struct thread_info *current_thread_info(void){
 register unsigned long sp asm ("sp");
 return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}

Or if you want a constant number, the thread_info is located at $sp & 0xffffe000. Here is an example of a thread_info when I am inside a syscall:

(gdb) print  *((struct thread_info*)(((uint32_t)$sp & 0xffffe000)))
$23 = {flags = 0, preempt_count = 0, addr_limit = 3204448256, task = 0xd839d000, exec_domain = 0xc047e9ec, cpu = 0, cpu_domain = 21, cpu_context = {
    r4 = 3626094784, r5 = 3627667456, r6 = 3225947936, r7 = 3626094784, r8 = 3561353216, r9 = 3563364352, sl = 3563364352, fp = 3563372460, 
    sp = 3563372408, pc = 3224755408, extra = {0, 0}}, syscall = 0, used_cp = "\000\000\000\000\000\000\000\000\000\000\001\001\000\000\000", 
  tp_value = 3066633984, crunchstate = {mvdx = {{0, 0} }, mvax = {{0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}}, dspsc = {0, 0}}, 
  fpstate = {hard = {save = {0 }}, soft = {save = {0 }}}, vfpstate = {hard = {fpregs = {334529072224223236, 0, 
        0, 0, 0, 4740737484919406592, 4562254508917369340, 4724999426224703930, 0 }, fpexc = 1073741824, fpscr = 16, 
      fpinst = 3725516880, fpinst2 = 2816}}, restart_block = {fn = 0xc0028ca8 , {futex = {uaddr = 0x0, val = 0, flags = 0, 
        bitset = 0, time = 0, uaddr2 = 0x0}, nanosleep = {clockid = 0, rmtp = 0x0, expires = 0}, poll = {ufds = 0x0, nfds = 0, has_timeout = 0, 
        tv_sec = 0, tv_nsec = 0}}}}

Before continuing, you need to get a feeling for the memory mapping. This document doesn’t help much, but it gives an idea. Since every kernel can be compiled to have a different mapping, let’s assume a common mapping for a 32-bit ARM kernel:

a very low memory range (0x0-0x1000) is restricted from mapping (for security purposes)
addresses 0x1000-0xbf000000 are the user space
around 0xc0000000-0xcfffffff is where the kernel code is located
the area around 0xdxxxxxxx-0xfefffffe is where the kernel data is (stack and heap)
the high memory range (0xfeffffff-0xffffffff) is reserved by the kernel.

The addr_limit is a nice target for an overwrite. The default value for ARM is 3204448256 (0xbf000000), and this value is checked in every kernel operation that copies values from and to user space. You can read more about this in this post (Linux kernel user to kernel space range checks). The addr_limit is per task (every thread_info can have a different limit).

The area below addr_limit is considered to be valid memory that a user space process can pass to the kernel as a parameter. Just to be clear here: if we modify addr_limit, it doesn’t mean that user space can suddenly access memory in kernel space (for example, you can’t just dereference an absolute address like this: *(uint32_t*)0xc0000000 to access kernel space from user space). The kernel can always read all the memory, so what this limit does is make sure that when user space gives a parameter to a syscall, the address given is in user space. For example, the kernel will not allow write(fd, addr, len) if addr is above addr_limit.
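To make the idea concrete, here is a simplified user space model of that range check. It is a sketch only: the real kernel check is architecture specific and written differently, and the name user_range_ok is hypothetical. It only shows why raising addr_limit changes which syscall arguments are accepted:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (not the kernel's actual code): the whole range
   [addr, addr + size) must end at or below the per-task limit.
   Written so that addr + size cannot overflow. */
static bool user_range_ok(uint32_t addr, uint32_t size, uint32_t addr_limit)
{
    return size <= addr_limit && addr <= addr_limit - size;
}
```

With the default limit 0xbf000000, a 4-byte buffer at 0xc0000000 is rejected; with the limit set to 0xffffffff, the same buffer is accepted.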

If we can somehow increase this limit, we can then read and modify kernel structures (using the read/write syscalls). Using the list manipulation, we can overwrite an address, and the location of addr_limit is the one we want to overwrite.

Once we overwrite the addr_limit (on ARM it is located at offset 8 in thread_info, after flags and preempt_count), we can then (from userspace) read the address of the task_struct field, then from there read the address of our credential (cred) field, then set our uid/gid to 0 and spawn a root shell (or do anything that you like, for example: just chown root/chmod suid a certain file).

Kernel stack location

So basically we have two problems here: how to find the address to overwrite, and how to overwrite it. The part about “how to overwrite” is done by the list manipulation. The part about finding the address to overwrite: use another kernel vulnerability to leak kernel memory. There are a lot of bugs in the Linux kernel where it leaks memory to userspace (here are all of them in that category; there are 26 this year, and 194 in total), but not all of these bugs can be used to leak the location of the stack.

Because the thread_info is always located at the beginning of the stack, we can always find it if we know any address that is located in the stack (just use addr & 0xffffe000). So we don’t really care about the exact leaked address. In the kernel, there is a lot of code where a stack variable points to another variable. Just for an example, here is code in futex.c that does this:

q.rt_waiter = &rt_waiter;
q.requeue_pi_key = &key2;

Both rt_waiter and key2 are located on the stack. If there is code in the kernel that copies data to user space from an uninitialized variable on the stack, it will copy whatever values were previously on that stack; this is what we call an information leak. For a specific kernel version with a specific compiler we can get a reliable address, but with a different kernel version and a different compiler version (and optimization), most of the time this is not very reliable (it really depends on the previous usage of the stack and the stack layout created by the compiler). We can check the leaked value: if we see a value above the addr_limit, what we get is a possible stack location.

Towelroot uses CVE-2013-2141, which is still in most Android kernels. You can check whether your kernel is affected by looking at the patch that corresponds to that bug (just check if info is initialized like this: struct siginfo info = {}). You can experiment a little bit to get a (mostly) reliable kernel stack address leak. This is the value that is printed by towelroot (“xxxx is a good number”). Please note that this is not 100% reliable, so sometimes towelroot will try to modify an invalid address (it doesn’t seem to check if the address is above bf000000). Let’s say we find the (possible) memory and name this POSSIBLE_STACK from now on.

Update: I was wrong, towelroot uses the stack address from the unlinked waiter address. The waiter is unlinked because of the tgkill() call. So the address should always be valid.

Overwrite

Ideally we want to be able to read the whole kernel space, but we can start small and increase our limit a little bit. You may have noticed several things by now: the rt_waiter is stored on the stack (let’s say at location X), the addr_limit is also stored on the stack (let’s say at Y), and X is always going to be greater than Y for that thread. We know that the thread_info is always located at sp & 0xffffe000, so we have POSSIBLE_THREAD_INFO = (POSSIBLE_STACK & 0xffffe000). So if we can take the address of any rt_waiter from another task and write it to (POSSIBLE_THREAD_INFO + 8), we have increased the address limit from bf000000 to some higher value (usually dxxxxxxx).

I am not entirely sure (since I don’t have a Samsung S5), but it seems that the addr_limit is not always at offset 8, so towelroot has limit_offset to fix the address.

How do we know if we have successfully changed the address limit for a task? From inside that task, we can use the write syscall. For example, we can try: write(fd, (void *)0xc0000000, 4). What we are trying to do is to write the content of that memory address to a file descriptor (you can replace 0xc0000000 with any address above 0xbf000000). If we can do this successfully, then we can continue because our limit has been changed.

In my experiments, the memory leak is sometimes very predictable (it often leaks the kernel address of one particular task, but not always). If we know exactly which thread’s address we modified, we can ask that thread to continue our work; otherwise, we need to ask all the threads that we have (for example by sending a signal): can you check if the address limit has been changed for you? If a thread can read a kernel memory address, then we know the address of that thread’s thread_info, and we can then read and write POSSIBLE_THREAD_INFO + 8. The first thing that we want to do is to set *(POSSIBLE_THREAD_INFO + 8) = 0xffffffff. Now this thread can read and write anywhere.

Here is an example where a thread’s addr_limit has been successfully changed to 0xffffffff:

(gdb) print  *((struct thread_info*)(((uint32_t)$sp & 0xffffe000)))
$22 = {flags = 0, preempt_count = 0, addr_limit = 4294967295, task = 0xd45d6000, exec_domain = 0xc047e9ec, cpu = 0, cpu_domain = 21, cpu_context = {
    r4 = 3561369728, r5 = 3562889216, r6 = 3225947936, r7 = 3561369728, r8 = 3562677248, r9 = 3069088276, sl = 3561955328, fp = 3561963004, 
    sp = 3561962952, pc = 3224755408, extra = {0, 0}}, syscall = 0, used_cp = "\000\000\000\000\000\000\000\000\000\000\001\001\000\000\000", 
  tp_value = 3061055232, crunchstate = {mvdx = {{0, 0} }, mvax = {{0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}}, dspsc = {0, 0}}, 
  fpstate = {hard = {save = {0 }}, soft = {save = {0 }}}, vfpstate = {hard = {fpregs = {1067681969331808106, 0, 
        0, 0, 0, 4738224773140578304, 4562254508917369340, 4732617274506747052, 0 }, fpexc = 1073741824, fpscr = 16, 
      fpinst = 3725475920, fpinst2 = 2816}}, restart_block = {fn = 0xc0028ca8 , {futex = {uaddr = 0x0, val = 0, flags = 0, 
        bitset = 0, time = 0, uaddr2 = 0x0}, nanosleep = {clockid = 0, rmtp = 0x0, expires = 0}, poll = {ufds = 0x0, nfds = 0, has_timeout = 0, 
        tv_sec = 0, tv_nsec = 0}}}}

Once you can read/write anywhere, you can do anything; it’s game over. First you may want to fix the plist so that the kernel will not crash (something that was not done by towelroot v1), although towelroot v3 sometimes stops because it was unable to fix the list even though it could do the rooting process. Even though you can read/write using a file, it is easier to use a pipe, and write/read to/from that pipe.

If we have a pipe, with rfd as the read descriptor on one side of the pipe and wfd as the write descriptor on the other, we can do this. To read memory from the kernel: do a write(wfd, KERNEL_ADDR, size), and read the result with read(rfd, LOCAL_ADDRESS, size). To write to kernel memory: do a write(wfd, LOCAL_ADDRESS, size), and then read(rfd, KERNEL_ADDRESS, size).
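The pipe trick is just a memory copy where the kernel does the copying and checks both pointers against the caller’s addr_limit. Here is a minimal sketch of that copy, shown only with ordinary user space buffers; the helper name copy_via_pipe is my own for this illustration and is not taken from towelroot:

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy len bytes from src to dst by pushing them through a pipe.
   The kernel validates both pointers against the caller's addr_limit.
   Returns 0 on success, -1 on failure. Keep len well below the pipe
   capacity (64 KB by default on Linux), otherwise write() would block. */
static int copy_via_pipe(void *dst, const void *src, size_t len)
{
    int fds[2];  /* fds[0] is the read side (rfd), fds[1] the write side (wfd) */
    if (pipe(fds) != 0)
        return -1;
    int ok = write(fds[1], src, len) == (ssize_t)len &&
             read(fds[0], dst, len) == (ssize_t)len;
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

In a normal process, passing a kernel address as src or dst simply makes write() or read() fail with EFAULT, which is the behaviour the addr_limit check is there to enforce.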

If you want to play around without creating an exploit, you can also try out the addr_limit by setting the value manually in your debugger.

The last modstring in towelroot is temp_root, which is not related to the exploit itself; it only creates a temporary root for devices that have a non-writeable /system.

So that’s all there is to it. I have shown you where the bug is, which syscall you can use to manipulate the stack, how to write to an arbitrary address (although not in detail, but with enough pointers), what to write there, and what to do after you write there.

Implementing a web server in a single printf() call

A guy just forwarded a joke that most of us will already know: Jeff Dean Facts (also here and here). Every time I read that list, this part stands out:

Jeff Dean once implemented a web server in a single printf() call. Other engineers added thousands of lines of explanatory comments but still don’t understand exactly how it works. Today that program is the front-end to Google Search.

It is really possible to implement a web server using a single printf call, but I haven’t found anyone doing it. So this time after reading the list, I decided to implement it. So here is the code, a pure single printf call, without any extra variables or macros (don’t worry, I will explain how this code works):

#include <stdio.h>

int main(int argc, char *argv[])
{
 printf("%*c%hn%*c%hn"
  "\xeb\x3d\x48\x54\x54\x50\x2f\x31\x2e\x30\x20\x32"
  "\x30\x30\x0d\x0a\x43\x6f\x6e\x74\x65\x6e\x74\x2d"
  "\x74\x79\x70\x65\x3a\x74\x65\x78\x74\x2f\x68\x74"
  "\x6d\x6c\x0d\x0a\x0d\x0a\x3c\x68\x31\x3e\x48\x65"
  "\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21\x3c\x2f"
  "\x68\x31\x3e\x4c\x8d\x2d\xbc\xff\xff\xff\x48\x89"
  "\xe3\x48\x83\xeb\x10\x48\x31\xc0\x50\x66\xb8\x1f"
  "\x90\xc1\xe0\x10\xb0\x02\x50\x31\xd2\x31\xf6\xff"
  "\xc6\x89\xf7\xff\xc7\x31\xc0\xb0\x29\x0f\x05\x49"
  "\x89\xc2\x31\xd2\xb2\x10\x48\x89\xde\x89\xc7\x31"
  "\xc0\xb0\x31\x0f\x05\x31\xc0\xb0\x05\x89\xc6\x4c"
  "\x89\xd0\x89\xc7\x31\xc0\xb0\x32\x0f\x05\x31\xd2"
  "\x31\xf6\x4c\x89\xd0\x89\xc7\x31\xc0\xb0\x2b\x0f"
  "\x05\x49\x89\xc4\x48\x31\xd2\xb2\x3d\x4c\x89\xee"
  "\x4c\x89\xe7\x31\xc0\xff\xc0\x0f\x05\x31\xf6\xff"
  "\xc6\xff\xc6\x4c\x89\xe7\x31\xc0\xb0\x30\x0f\x05"
  "\x4c\x89\xe7\x31\xc0\xb0\x03\x0f\x05\xeb\xc3",
  ((((unsigned long int)0x4005c8 + 12) >> 16) & 0xffff), 
  0, 0x00000000006007D8 + 2, 
  (((unsigned long int)0x4005c8 + 12) & 0xffff)-
  ((((unsigned long int)0x4005c8 + 12) >> 16) & 0xffff), 
  0, 0x00000000006007D8 );
}

This code only works on a Linux AMD64 system, with a particular compiler (gcc version 4.8.2 (Debian 4.8.2-16)). To compile it:

gcc -g web1.c -o webserver

As some of you may have guessed: I cheated by using a special format string. That code may not run on your machine because I have hardcoded two addresses.

The following version is a little bit more user friendly (easier to change), but you are still going to need to change 2 values: FUNCTION_ADDR and DESTADDR which I will explain later:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define FUNCTION_ADDR ((uint64_t)0x4005c8 + 12)
#define DESTADDR 0x00000000006007D8
#define a (FUNCTION_ADDR & 0xffff)
#define b ((FUNCTION_ADDR >> 16) & 0xffff)

int main(int argc, char *argv[])
{
	printf("%*c%hn%*c%hn"
		"\xeb\x3d\x48\x54\x54\x50\x2f\x31\x2e\x30\x20\x32"
		"\x30\x30\x0d\x0a\x43\x6f\x6e\x74\x65\x6e\x74\x2d"
		"\x74\x79\x70\x65\x3a\x74\x65\x78\x74\x2f\x68\x74"
		"\x6d\x6c\x0d\x0a\x0d\x0a\x3c\x68\x31\x3e\x48\x65"
		"\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21\x3c\x2f"
		"\x68\x31\x3e\x4c\x8d\x2d\xbc\xff\xff\xff\x48\x89"
		"\xe3\x48\x83\xeb\x10\x48\x31\xc0\x50\x66\xb8\x1f"
		"\x90\xc1\xe0\x10\xb0\x02\x50\x31\xd2\x31\xf6\xff"
		"\xc6\x89\xf7\xff\xc7\x31\xc0\xb0\x29\x0f\x05\x49"
		"\x89\xc2\x31\xd2\xb2\x10\x48\x89\xde\x89\xc7\x31"
		"\xc0\xb0\x31\x0f\x05\x31\xc0\xb0\x05\x89\xc6\x4c"
		"\x89\xd0\x89\xc7\x31\xc0\xb0\x32\x0f\x05\x31\xd2"
		"\x31\xf6\x4c\x89\xd0\x89\xc7\x31\xc0\xb0\x2b\x0f"
		"\x05\x49\x89\xc4\x48\x31\xd2\xb2\x3d\x4c\x89\xee"
		"\x4c\x89\xe7\x31\xc0\xff\xc0\x0f\x05\x31\xf6\xff"
		"\xc6\xff\xc6\x4c\x89\xe7\x31\xc0\xb0\x30\x0f\x05"
		"\x4c\x89\xe7\x31\xc0\xb0\x03\x0f\x05\xeb\xc3"
	, b, 0, DESTADDR + 2, a-b, 0, DESTADDR );
}

I will explain how the code works through a series of short C programs. The first one shows how we can start another piece of code without a function call. See this simple code:

#include <stdlib.h>
#include <stdio.h>

#define ADDR 0x00000000600720

void hello()
{
        printf("hello world\n");
}

int main(int argc, char *argv[])
{
        (*((unsigned long int*)ADDR))= (unsigned long int)hello;
}

You can compile it, but it may not run on your system. You need to do these steps:

1. Compile the code:

gcc run-finalizer.c -o run-finalizer

2. Examine the address of fini_array

objdump -h -j .fini_array run-finalizer

And find the VMA of it:

run-finalizer:     file format elf64-x86-64
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 18 .fini_array   00000008  0000000000600720  0000000000600720  00000720  2**3
                  CONTENTS, ALLOC, LOAD, DATA

Note that you need a recent GCC to do this; older versions of gcc use a different mechanism for storing finalizers.

3. Change the value of ADDR on the code to the correct address

4. Compile the code again

5. Run it

And now you will see “hello world” printed to your screen. How does this work exactly?

According to Chapter 11 of Linux Standard Base Core Specification 3.1

.fini_array
This section holds an array of function pointers that contributes to a single termination array for the executable or shared object containing the section.

We are overwriting the array so that our hello function is called instead of the default handler. If you are trying to compile the webserver code, the value of ADDR is obtained the same way (using objdump).

OK, now that we know how to execute a function by overwriting a certain address, we need to know how we can overwrite an address using printf. You can find many tutorials on how to exploit format string bugs, but I will try to give a short explanation.

The printf function has a feature that lets us know how many characters have been printed so far, using the “%n” format:

#include <stdio.h>
int main(){
        int count;
        printf("AB%n", &count);
        printf("\n%d characters printed\n", count);
}

You will see that the output is:

AB
2 characters printed

Of course we can pass any address as the count pointer to overwrite that address. But to overwrite an address with a large value we need to print a large amount of text. Fortunately there is another format string, “%hn”, that works on short instead of int. We can overwrite the value 2 bytes at a time to form the 4-byte value that we want.

Let’s try to use two printf calls to put the value that we want (in this case the pointer to the function “hello”) into the fini_array:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define FUNCTION_ADDR ((uint64_t)hello)
#define DESTADDR 0x0000000000600948

void hello()
{
	printf("\n\n\n\nhello world\n\n");
}

int main(int argc, char *argv[])
{
	short a = FUNCTION_ADDR & 0xffff;
	short b = (FUNCTION_ADDR >> 16) & 0xffff;
	printf("a = %04x b = %04x\n", a, b);
	uint64_t *p = (uint64_t*)DESTADDR;
	printf("before: %08lx\n", *p);
	printf("%*c%hn", b, 0, DESTADDR + 2 );
	printf("after1: %08lx\n", *p);
	printf("%*c%hn", a, 0, DESTADDR);
	printf("after2: %08lx\n", *p);
	return 0;
}

The important lines are:

	short a= FUNCTION_ADDR & 0xffff;
	short b = (FUNCTION_ADDR >> 16) & 0xffff;
	printf("%*c%hn", b, 0, DESTADDR + 2 );
	printf("%*c%hn", a, 0, DESTADDR);

The a and b are just the two halves of the function address. We could construct strings of length a and b to be given to printf, but I chose to use the “%*” formatting, which controls the length of the output through a parameter.

For example, this code:

   printf("%*c", 10, 'A');

Will print 9 spaces followed by A, so in total, 10 characters will be printed.
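A quick way to convince yourself of the character count is to format into a buffer instead of the screen. This is just a small sketch using snprintf (which accepts the same “%*c” directive as printf); the helper name pad_char is made up for this example:

```c
#include <stddef.h>
#include <stdio.h>

/* Format one character right-aligned in a field whose width is passed
   as an argument. Returns the number of characters produced. */
static int pad_char(char *out, size_t out_size, int width, char c)
{
    return snprintf(out, out_size, "%*c", width, c);
}
```

Calling pad_char(buf, sizeof buf, 10, 'A') with a 32-byte buffer returns 10, and the buffer then holds nine spaces followed by A.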

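If we want to use just one printf, we need to take into account that b bytes have already been printed when the second “%hn” is reached (the counter is cumulative), so we only need to print another a-b bytes. This assumes that a is larger than b, which is the case for the addresses used here.

  printf("%*c%hn%*c%hn", b, 0, DESTADDR + 2, a-b, 0, DESTADDR );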
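The cumulative counter can be tried safely by pointing the two “%hn” writes at a local array instead of at the fini_array. This is only a sketch under a few assumptions: the function name store_with_one_printf is made up, the high half must be greater than zero and smaller than the low half, and the padding spaces really are printed to stdout. It uses the values from the web server listing (0x4005c8 + 12 = 0x4005d4, so the high half is 0x0040 and the low half is 0x05d4):

```c
#include <stdint.h>
#include <stdio.h>

/* Build a 32-bit value with a single printf call and two %hn writes.
   halves[1] plays the role of DESTADDR + 2 and halves[0] of DESTADDR.
   Assumes 0 < hi < lo, so the second width (lo - hi) is positive. */
static uint32_t store_with_one_printf(uint16_t hi, uint16_t lo)
{
    short halves[2] = {0, 0};
    printf("%*c%hn%*c%hn\n",
           hi, ' ', &halves[1],        /* hi characters printed so far */
           lo - hi, ' ', &halves[0]);  /* lo characters printed in total */
    return ((uint32_t)(uint16_t)halves[1] << 16) | (uint16_t)halves[0];
}
```

With hi = 0x0040 and lo = 0x05d4 the second width is 0x594, the same constant that shows up in the disassembly of the web server, and the function returns 0x004005d4.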

Currently we are calling the “hello” function, but we can call any function (or any address). I have written a shellcode that acts as a web server that just prints “Hello World!”. This is the shellcode that I made:

unsigned char hello[] = 
		"\xeb\x3d\x48\x54\x54\x50\x2f\x31\x2e\x30\x20\x32"
		"\x30\x30\x0d\x0a\x43\x6f\x6e\x74\x65\x6e\x74\x2d"
		"\x74\x79\x70\x65\x3a\x74\x65\x78\x74\x2f\x68\x74"
		"\x6d\x6c\x0d\x0a\x0d\x0a\x3c\x68\x31\x3e\x48\x65"
		"\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21\x3c\x2f"
		"\x68\x31\x3e\x4c\x8d\x2d\xbc\xff\xff\xff\x48\x89"
		"\xe3\x48\x83\xeb\x10\x48\x31\xc0\x50\x66\xb8\x1f"
		"\x90\xc1\xe0\x10\xb0\x02\x50\x31\xd2\x31\xf6\xff"
		"\xc6\x89\xf7\xff\xc7\x31\xc0\xb0\x29\x0f\x05\x49"
		"\x89\xc2\x31\xd2\xb2\x10\x48\x89\xde\x89\xc7\x31"
		"\xc0\xb0\x31\x0f\x05\x31\xc0\xb0\x05\x89\xc6\x4c"
		"\x89\xd0\x89\xc7\x31\xc0\xb0\x32\x0f\x05\x31\xd2"
		"\x31\xf6\x4c\x89\xd0\x89\xc7\x31\xc0\xb0\x2b\x0f"
		"\x05\x49\x89\xc4\x48\x31\xd2\xb2\x3d\x4c\x89\xee"
		"\x4c\x89\xe7\x31\xc0\xff\xc0\x0f\x05\x31\xf6\xff"
		"\xc6\xff\xc6\x4c\x89\xe7\x31\xc0\xb0\x30\x0f\x05"
		"\x4c\x89\xe7\x31\xc0\xb0\x03\x0f\x05\xeb\xc3";

If we remove the function hello and insert that shell code, that code will be called.

That code is just a string, so we can append it to the “%*c%hn%*c%hn” format string. This string is unnamed, so we will need to find the address after we compile it. To obtain the address, we need to compile the code, then disassemble it:

objdump -d webserver

00000000004004fd <main>:
  4004fd:	55                   	push   %rbp
  4004fe:	48 89 e5             	mov    %rsp,%rbp
  400501:	48 83 ec 20          	sub    $0x20,%rsp
  400505:	89 7d fc             	mov    %edi,-0x4(%rbp)
  400508:	48 89 75 f0          	mov    %rsi,-0x10(%rbp)
  40050c:	c7 04 24 d8 07 60 00 	movl   $0x6007d8,(%rsp)
  400513:	41 b9 00 00 00 00    	mov    $0x0,%r9d
  400519:	41 b8 94 05 00 00    	mov    $0x594,%r8d
  40051f:	b9 da 07 60 00       	mov    $0x6007da,%ecx
  400524:	ba 00 00 00 00       	mov    $0x0,%edx
  400529:	be 40 00 00 00       	mov    $0x40,%esi
  40052e:	bf c8 05 40 00       	mov    $0x4005c8,%edi
  400533:	b8 00 00 00 00       	mov    $0x0,%eax
  400538:	e8 a3 fe ff ff       	callq  4003e0 <printf@plt>
  40053d:	c9                   	leaveq 
  40053e:	c3                   	retq   
  40053f:	90                   	nop

We only need to care about this line:

mov    $0x4005c8,%edi

That is the address that we need in:

#define FUNCTION_ADDR ((uint64_t)0x4005c8 + 12)

The +12 is needed because our shell code starts after the string “%*c%hn%*c%hn” which is 12 characters long.
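The same arithmetic can be written down as a few small helpers. This is only a sketch to check the numbers; the helper names are my own and the address 0x4005c8 is the one from the disassembly above, which will be different on your machine:

```c
#include <stdint.h>
#include <string.h>

#define FORMAT_PREFIX "%*c%hn%*c%hn"  /* the directives in front of the shellcode */

/* The shellcode starts right after the format directives, so its address
   is the address of the string literal plus the length of the directives. */
static uint64_t shellcode_addr(uint64_t string_addr)
{
    return string_addr + strlen(FORMAT_PREFIX);
}

/* The two 16-bit halves that the %hn writes have to produce. */
static uint16_t low_half(uint64_t addr)  { return (uint16_t)(addr & 0xffff); }
static uint16_t high_half(uint64_t addr) { return (uint16_t)((addr >> 16) & 0xffff); }
```

For 0x4005c8 this gives 0x4005d4, with a low half of 0x05d4 and a high half of 0x0040; their difference is 0x594, which matches the value loaded into %r8d in the disassembly.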

If you are curious about the shell code, it was created from the following C code.

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<unistd.h>
#include<sys/types.h>
#include<sys/stat.h>
#include<sys/socket.h>
#include<arpa/inet.h>
#include<netdb.h>
#include<signal.h>
#include<fcntl.h>

int main(int argc, char *argv[])
{
	int sockfd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in serv_addr;
	bzero((char *)&serv_addr, sizeof(serv_addr));
        serv_addr.sin_family = AF_INET;
        serv_addr.sin_addr.s_addr = INADDR_ANY;
        serv_addr.sin_port = htons(8080);
	bind(sockfd, (struct sockaddr *)&serv_addr, sizeof(serv_addr));
	listen(sockfd, 5);
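	while (1) {
		int cfd  = accept(sockfd, 0, 0);
		char *s = "HTTP/1.0 200\r\nContent-type:text/html\r\n\r\n<h1>Hello world!</h1>"; 
		if (fork()==0) {
			/* child: send the fixed response, then exit */
			write(cfd, s, strlen(s));
			shutdown(cfd, SHUT_RDWR);
			close(cfd);
			_exit(0);
		}
		/* parent: drop its copy of the client socket and keep accepting */
		close(cfd);
	}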

	return 0;
}

I made an extra effort (although it is not really necessary in this case) to remove all NUL characters from the shellcode (since I couldn’t find one for X86-64 in the Shellcodes database).

Jeff Dean once implemented a web server in a single printf() call. Other engineers added thousands of lines of explanatory comments but still don’t understand exactly how it works. Today that program is the front-end to Google Search.

It is left as an exercise for the reader to scale the web server to be able to handle the Google Search load.

Source code for this post is available at https://github.com/yohanes/printf-webserver

For people who think that this is useless: yes, it is useless. I just happen to like this challenge, and it has refreshed my memory and knowledge of the following topics: shellcode writing (haven’t done this in years), AMD64 assembly (calling convention, preserved registers, etc), syscalls, objdump, fini_array (last time I checked, gcc still used .dtors), printf format exploiting, gdb tricks (like writing a memory block to a file), and low level socket code (I have been using boost’s for the past few years).

Update: Ubuntu adds a security feature that provides a read-only relocation table area in the final ELF. To be able to run the examples on Ubuntu, add this to the command line when compiling:

-Wl,-z,norelro

e.g:

gcc -Wl,-z,norelro test.c

Raspberry Pi for Out of Band Linux PC management

Just a day before I left for Indonesia for my brother’s wedding, I got worried about my headless Linux PC server: it might freeze while I was away. It has happened before because of a kernel panic and a hardware error, and it can happen again. I want to be able to reset the PC in case of errors and to power it down in case the error is not recoverable (for example, last year my disk drive went bad).

Just a note before reading this: in case you just want to turn a PC on or off using a Raspberry Pi, just use wake on LAN (WOL) to turn on your PC and SSH access to turn it off. Wake on LAN works most of the time, but it cannot handle a PC that is not responding.

I soldered 2 optocouplers that I have (4N25) to a small perfboard with a pin header. Then I use solderless breadboard cables to connect the board to Raspberry Pi, and to the PC power and reset button (so I can manually turn on/off using the power/reset button on the PC).

4n25

Pay careful attention to the + and – on the motherboard (PW is for power, and RES is for reset; the polarity is important for the optocoupler. In the picture above, red goes to + and black goes to -):

I could just have used one optocoupler to connect to the power button (the reset is not really necessary, because we can turn off and on the PC again to reset), but I just want to use the extra 4N25 that I have (it’s really cheap, 5.5 baht or around 17 cents USD).

To reset the PC, I just set the GPIO pin to high for about one second, then set it low again. To power up the PC, I set the GPIO pin to high for about 5 seconds, and set it low again (the same can be used to forcefully power down the PC).
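The pulse itself is easy to script. Below is a minimal sketch of one way to do it in C through the kernel’s sysfs GPIO files; it assumes the pin has already been exported and configured as an output, and both the function names and the example path /sys/class/gpio/gpio17/value are placeholders of mine (use whatever pin the optocoupler is actually wired to):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Write a short string to a file such as /sys/class/gpio/gpio17/value. */
static int write_value(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fputs(value, f) >= 0;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}

/* Drive the pin high for hold_ms milliseconds, then set it low again.
   Returns 0 on success, -1 if the value file could not be written. */
static int pulse_pin(const char *value_path, long hold_ms)
{
    struct timespec hold = { hold_ms / 1000, (hold_ms % 1000) * 1000000L };
    if (write_value(value_path, "1") != 0)
        return -1;
    nanosleep(&hold, NULL);
    return write_value(value_path, "0");
}
```

With this, a reset would be pulse_pin(path, 1000) on the pin wired to RES, and a power on (or forced power off) would be pulse_pin(path, 5000) on the pin wired to PW.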

Resetting and powering the PC is easy; the next task is to find out what happened if the PC crashed. To do this, I need a serial connection. If my PC had a serial port and I had a USB to serial cable, then everything would be much easier, but since I don’t have a USB to serial cable, and my PC doesn’t have a serial port on the back, it gets a bit complicated.

I still have a small board based on MAX3232CPE to convert from 3.3V serial to 5.0 V, so I plugged that board into the Raspberry Pi and connected it directly to the PC motherboard. This page helped me in finding the pin names (I only need to connect RX, TX and GND).

On the Raspberry side, I need to set up so that it will not use the serial port for kernel output and log in. You can follow the guide here.

On the PC side, I need to activate serial output in three places: to accept login (getty), to get the kernel output (kernel parameters in grub), and in grub itself (to show the boot selection dialog). This guide for Debian works for me, but I was not able to see the GRUB output on screen when I connected my screen (I can only see the output on my serial console, but this was not a problem for me).

I experimented a little bit with SGABios, hoping that I would be able to see the BIOS output from my serial port. It didn’t work as expected. I cannot see the initial BIOS screen, and I cannot send a key to enter the BIOS settings, but if I connect a keyboard and press a button to enter the BIOS, I can see the BIOS menu via the serial port and I can interact with it.

Here are the steps that I tried to get the BIOS serial output: I downloaded the BIOS for my motherboard (an AWARD BIOS). Then on a Windows machine, I modified the BIOS using CBROM (cbrom bios.bin /isa sgabios.bin). Then I flashed the BIOS from Linux using Flashrom.

I didn’t solve the BIOS problem due to time constraints. There are several solutions that I can think of: one is to use CoreBoot (but unfortunately my motherboard is not supported by coreboot), another one is to do more hacking on the BIOS (maybe removing the VGA ROM to force the output to the serial port), and the other one is to simulate a keypress to enter the BIOS. The first two methods may not be portable across BIOSes, but the last one should be portable. The key simulation can be done by simulating a PS2 device (using bitbanging on the Raspberry Pi), or a USB HID device. A super simple USB HID device can be made by using the V-USB library (you can see this as an example).

Just a few hours before I left, I had an idea to connect a temperature sensor just to see if the temperature around the PC case is too high. We are entering the summer here in Chiang Mai and the outside temperature is getting higher every day (from November to the beginning of February, the temperature was around 8-20 Celsius, and now it is around 17-37 Celsius). It was quite easy to add the temperature sensor, I just used the guide and driver from adafruit. Next time I may add an infrared LED on the Raspberry Pi to turn on the air conditioner when it gets really hot.

With everything set up, nothing happened while I was away. The PC was running nicely (and I could access the PC via SSH and the serial console).

RFID based toy/game for toddlers

Hardware

Inspired by this toy from LeapFrog that we got for free at a yard sale, I made this toy for my son:

This is a simple toy, he can pick a card from this set of alphabet cards:

And put it above this device:

And then the alphabet will be shown on the screen:

The letter is received by the Raspberry Pi via Bluetooth and displayed through HDMI:

Why wireless? I want to have distance between the device and the TV screen. I could have just used cables (I can even plug the RFID reader directly to Raspberry Pi), but it is not toddler-safe. My son would occasionally run to the TV screen to point at something, and I don’t want him to trip on the wires.

The implementation is quite simple. The cards are actually RFID cards (50 cards for $11.47), and they are read using this cheap 9.9 USD RFID reader. To make the cards look good, my wife printed the letters of the alphabet on sticker paper and stuck them to the cards.

And to make the data available to the Raspberry Pi, I used the same Bluetooth module as the one I used in my previous post (you can find a similar one here). The baud rate for the RFID reader is 9600 bps, so we need to set the same baud rate for the Bluetooth module.

For the power source, I could have used AA batteries, but I have this USB powerbank (that also acts as a USB/Wifi router) that I don’t use very often:

I didn’t do any soldering for this project, I used a breadboard

And in case you are wondering, I just use this device (5v to 3.3v serial converter) to connect the USB power to breadboard (just because I don’t want to solder anything, and this device fits nicely):

Software

For the software part, I wrote a small python script that uses pygame.

To prepare the raspberry pi to run the app, you need to install these packages:

sudo apt-get install python-pygame python-serial bluez-utils sox

Then find the device bluetooth address using:

hcitool scan

Create a file named “pincodes” to enable automatic pairing:

echo "DEVICEADDRESS PIN" >> /var/lib/bluetooth/YOURMACHINEADDRESS/pincodes

The default device PIN is 1234. For example, this is what I do on my laptop:

echo "00:12:03:09:17:55 1234" >> /var/lib/bluetooth/E0\:B9\:A5\:45\:15\:1B/pincodes

And for the serial connection, create /etc/bluetooth/rfcomm.conf file:

#
# RFCOMM configuration file.
#

rfcomm0 {
	# Automatically bind the device at startup
	bind yes;

	# Bluetooth address of the device
	device DEVICEADDRESS;

	# RFCOMM channel for the connection
	channel	1;
}

You can checkout the source code at github:

git clone git://github.com/yohanes/rfid-abc.git

I don’t have a license to redistribute the wav files for the alphabet sound that I own, but fortunately you can find a collection of wav files from Voxeo site: http://evolution.voxeo.com/library/audio/prompts/alphabet/index.jsp (download audio-alphabet.zip)

To be usable in pygame, you need to convert the files to raw 44.1 kHz WAV using sox:

cd rfid-abc
wget -c http://evolution.voxeo.com/library/audio/prompts/alphabet/audio-alphabet.zip
unzip audio-alphabet.zip -d original
cd original
for i in *.wav; do sox $i -r 44100 -e un ../$i; done
cd ..

And to run it:

python game.py

Oh wait, you need to edit the card id mapping in map.txt first. If you don’t touch the file, the app will store unknown card ids in “unknown.txt”.

Future improvements

The software is still very simple. I am planning to make it multilingual (my son needs to know Indonesian, English and Thai), and more interesting (for example: the computer can ask “find me the letter C” or it can be changed into a spelling game).

Adding Bluetooth Serial Port to Asus RT-N16

I am running DebWrt on my Asus RT-N16 and it works well. The only problem that I have is: in case I misconfigure something and the device is inaccessible via the network, I need to open the case and then connect a serial port to fix it. Because the configuration is on a USB disk, I don’t have to open the case very often: in most cases, I can unplug the USB disk, mount it on my Linux machine, try to fix the configuration, plug the USB disk in again, restart the router, and hope that my fix works. Either way, both are such a hassle.

I could have added a serial port just like my DIR-300 mod, but I think it’s not the best solution, because I would still need to bring down my router, find my serial cable, plug it in and connect to it. I wish the device had Bluetooth capability, so I could connect to it (via the Bluetooth serial port profile) and fix any problem that it has without moving or plugging in anything, and hopefully without even needing to restart the router and wait for it to boot.

So I bought a 7.32 USD Bluetooth module from Aliexpress and installed it on my RT-N16. Some of you may think that it is a bad idea because Bluetooth interferes (somewhat) with WIFI, but I don’t plan to keep a constant connection via Bluetooth, and when I do make the connection, the data that I am transferring is very small (maybe just several kilobytes per minute). So far in my testing, when connected via Bluetooth, I didn’t notice any difference in WIFI transfer speed (even when transferring large files via WIFI while typing furiously in my Bluetooth terminal). The Asus RT-N16 only supports 2.4 GHz, but if your router supports 5 GHz, I think you should use that band to eliminate any possible interference.

Asus RT-N16 has a serial header ready to be connected (it even has labels on it, GND, RX, TX, VCC):

But before plugging in the module, I need to set the speed of the Bluetooth serial module to 115200, because the default speed is 9600. To set up the Bluetooth serial module, we need to connect it to a computer via a serial port (I am using a Bus Pirate for this).

My Bluetooth module is version H-C07. This version doesn’t use \r or \n to terminate a command; it uses a timeout instead (a complete command must be received within a few hundred milliseconds). Typing very quickly in your terminal won’t work, so just copy and paste the command from your text editor. The command needed to set the Bluetooth module to 115200 is “AT+BAUD8”. These Bluetooth modules usually don’t come with documentation, so you need to search the internet for information on your specific version.
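The copy-and-paste trick can also be scripted. The following is a minimal sketch, not the procedure I actually used: the function name `send_at_command` and the `settle_seconds` delay are hypothetical, and the only facts taken from the module are that the command carries no \r or \n and must arrive in one burst. It writes to any object with `write` and `flush` methods; with real hardware that would be a serial port object opened at 9600 (for example from the third-party pyserial package, which is an assumption here).

```python
import io
import time


def send_at_command(port, command, settle_seconds=1.0):
    """Send a command in a single write, with no CR/LF terminator.

    The module ends a command by timeout, so all bytes must be written
    back to back and nothing may be appended to the command.
    """
    payload = command.encode("ascii")
    port.write(payload)          # one write: the bytes go out as one burst
    port.flush()
    time.sleep(settle_seconds)   # assumed pause so the module's timeout expires
    return payload


# Stand-in for a real serial port so the sketch runs without hardware.
fake_port = io.BytesIO()
sent = send_at_command(fake_port, "AT+BAUD8", settle_seconds=0.0)
```

After the baud rate changes, the port has to be reopened at 115200 before sending anything else.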

One of the nice things about Bluetooth is that it is accessible from non-PC devices. After connecting the cables, I can connect to my router using Android Bluetooth SPP.

One thing to note: the Bluetooth module needs time to initialize, so it cannot be used for accessing the bootloader. To restart the router, I need to unplug and replug the power cable, so the Bluetooth module loses its connection whenever I restart the device. By the time the Bluetooth module is ready again, the bootloader has already finished and you will be in the middle of the Linux boot.

If you really want to use the Bluetooth module to access the bootloader, you will either need a separate power source for the Bluetooth module, or make a special reset button for the router (that doesn’t involve unplugging and plugging the device, and doesn’t cut power to the Bluetooth module).

LocalBar: Install signed BAR files directly from PlayBook

I’ve reverse engineered the protocol used by blackberry-deploy to install app files (BAR files) on the PlayBook. Then I made an app to install signed BAR files directly from the PlayBook itself. You can find my work here:

http://yohan.es/playbook/localbar/

I am using the https://localhost method. To put it simply: it works like other desktop installers that connect via network or USB, sending commands to an HTTP service on the PlayBook. The only difference is that it runs on the PlayBook itself.
It is possible that in the future RIM may block requests from localhost.
I don’t have time to develop a nice GUI for this, so I just use the basic GUI API that is accessible from the NDK. For example: in the NDK there is a “login dialog” but no “password dialog”, so for the password dialog I use the “login dialog”, which also shows a “user” field (which I don’t need).
This works on OS 1.0.7 and on 2.0 (developer beta).
With this you can sort of do OTA installs through the PlayBook. From your PlayBook, just go to a website that has some BAR files (for example this forum), download them using the built-in PlayBook browser, then run LocalBar to install the downloaded BAR files.

EZ430-Chronos OTP

After wanting the EZ430 Chronos watch for a long time, I finally ordered one on February 20th from the TI eStore, and I got the watch on February 24th (tax free). So this is another item in my long list of “things to hack”.

I had a good idea to use my EZ430 Chronos as an OTP generator for Google 2-factor authentication. Before my long weekend, I did my research on Thursday (24 February), and at that time no one had implemented it. So I wrote a small modification to OpenChronos, and just before I finished my implementation on Sunday (I was quite busy during the long weekend helping to move our company’s office), I looked at the Chronos Wiki again to find some links to the Chronos documentation, and found out that Huan Truong had just implemented his version of OTP by modifying OpenChronos.

After learning that in his version the clock function doesn’t work yet (his readme says “THIS FIRMWARE CURRENTLY HAS A YET-TO-IMPLEMENT CLOCK FUNCTIONALITY, SO IT WONT DISPLAY TIME PROPERLY”), I decided to continue my implementation. My implementation doesn’t change the time logic, so you can still use the stock Control Center provided by TI (Huan Truong changed the OpenChronos code to use an epoch-based implementation, and he modified the Control Center). Instead of replacing all algorithms to use a timestamp, I use a simple mktime implementation to convert the existing year/month/date data to a Unix timestamp.
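To illustrate the idea, here is a minimal Python sketch of such a conversion. It is not the code from the firmware (which is C), and the names `simple_mktime` and `tz_offset_hours` are made up for this example. It counts whole days since 1970-01-01, adds the time of day, and subtracts the timezone offset to get from watch-local time to UTC.

```python
# Days elapsed before the first day of each month in a non-leap year.
DAYS_BEFORE_MONTH = (0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334)


def is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def simple_mktime(year, month, day, hour, minute, second, tz_offset_hours=0):
    """Convert local calendar fields to seconds since 1970-01-01 00:00:00 UTC."""
    days = 0
    for y in range(1970, year):            # whole years before this one
        days += 366 if is_leap(y) else 365
    days += DAYS_BEFORE_MONTH[month - 1]   # whole months before this one
    if month > 2 and is_leap(year):        # Feb 29 has already passed
        days += 1
    days += day - 1                        # whole days before today
    local_seconds = ((days * 24 + hour) * 60 + minute) * 60 + second
    # The watch shows local time (+N from UTC), so subtract the offset.
    return local_seconds - tz_offset_hours * 3600


midnight_utc = simple_mktime(2011, 2, 24, 0, 0, 0)
```

On a microcontroller the yearly loop is cheap enough, since it only runs for a few dozen iterations per conversion.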

After flashing the image to the watch, a new menu is added to the second line after “rFbSL”. It shows a heart icon and the first 2 digits of the OTP (I will never buy a heart monitor for this watch, so I use that icon just to show that I am in OTP mode). Pressing the “#” key for a few seconds will show the remaining 4 digits. Just for your information, enabling CONFIG_OTP adds 2914 bytes to the code size.
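For reference, Google’s codes are time-based OTPs as described in RFC 6238 (HMAC-SHA1, 30-second steps, 6 digits). The sketch below shows that calculation in Python and then splits the result the way the watch displays it. It is an illustration only, not the firmware code; the function name `totp` is my own, and the key used here is the test key from the RFC, not a real secret.

```python
import hashlib
import hmac
import struct


def totp(key, unix_time, step=30, digits=6):
    """Compute an RFC 6238 time-based OTP using HMAC-SHA1."""
    counter = unix_time // step                   # number of 30 s steps since epoch
    message = struct.pack(">Q", counter)          # 8-byte big-endian counter
    digest = hmac.new(key, message, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                    # dynamic truncation offset
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# RFC 6238 test key and test time (59 seconds after the epoch).
code = totp(b"12345678901234567890", 59)
first_two, remaining_four = code[:2], code[2:]    # the two parts shown on the watch
```

The Unix timestamp fed into the function is the part that depends on the watch clock and the timezone offset being set correctly.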

So here is my version of Google OTP ~~(If many people are interested, I can put it in github):~~

http://tinyhack.com/files/OpenChronos-joe-otp.zip

I am too lazy to implement “make config”, so just edit otp.h with your key and fill in the timezone offset (+N from UTC). You can get the key from the base32 encoded string using the codegen script that I made, for example:

bash$ python codegen.py pf xwqy lomvz wu 33f
\x79\x6f\x68\x61\x6e\x65\x73\x6a\x6f\x65
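The conversion itself is a plain base32 decode. The following Python sketch reproduces the output above; it is a reconstruction of what such a script needs to do, not necessarily the contents of codegen.py, and the function name `base32_to_c_escapes` is made up for this example.

```python
import base64


def base32_to_c_escapes(*parts):
    """Decode a base32 key (given in space-separated groups) to C-style \\x escapes."""
    text = "".join(parts).replace(" ", "").upper()
    text += "=" * (-len(text) % 8)      # restore any stripped base32 padding
    raw = base64.b32decode(text)
    return "".join("\\x%02x" % byte for byte in raw)


escaped = base32_to_c_escapes("pf", "xwqy", "lomvz", "wu", "33f")
```

The resulting string can be pasted into a C string literal as the key.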

https://github.com/yohanes/OpenChronos

You can use make config to set your secret key in base32 (that means you can just copy and paste the auth code presented by Google), and you can set the timezone offset.

New Adventures

My last post was about 6 months ago. Now I am back with some new adventures. The first one is Jonathan, my first baby:

And the next one is BeagleBoard-xM from John Nicholls.

About a month ago I found a promotion and got this free MSP430 USB development tool:

It got me interested in the MSP430 in general, and I bought some LaunchPads (only 4.30 USD each). My first project is to control the power plug so I can plug and unplug the BeagleBoard-xM from a PC (so I can control it remotely via SSH). With this, I should be able to work on the BeagleBoard remotely (like when I am in my room holding my baby boy).

And I have updated the CNS21XX code in my gitorious repo with the latest head. Hopefully I can put the code to SVN HEAD in the near future.

CNS21XX port completed

About six months ago, Stefan Bethke donated some money to me to buy a device from dealextreme so I could port FreeBSD to that device (you can see the pictures here). This device uses the ARM Cavium Econa CNS21XX (formerly known as STR8132). Within a few days I had completed the drivers for the serial port, interrupt controller, and EHCI/OHCI. Then I stopped working on it. Three months later I continued and finished the network driver, then I stopped again.

The last part that wasn’t finished was the SPI controller and the SPI flash driver, so this weekend I spent some time to finish it. So now, I can say that the port is finished, all drivers have been written for the device. With SPI flash support, I can now write the kernel to the device, and boot it from there (I don’t need to boot from network anymore).

Actually, I am not really finished yet, since I still need to reformat the code according to the FreeBSD standard, and there might still be bugs in my code, so I invite everyone that has this device to try it out. There is also a feature in the network driver that is not implemented yet (multicast filtering), because the datasheet is not very clear ( ~~I would be very happy if someone could help me to complete this~~, wait now I suddenly understand the documentation).

For the bootloader, I am still using the default boot loader. This bootloader will load the kernel from memory 0x600000, and since I can’t change the bootloader configuration in this particular device, I modified the kernel configuration to match this. The latest code can be accessed at http://gitorious.org/freebsd-arm.

To do initial boot, you will need a serial port. You will need to put your kernel on your tftp server. Hit any key during boot, and type:

setenv serverip 172.17.1.1
setenv ipaddr 172.17.1.2
tftpboot 0x600000 kernel.bin
go 0x600000

and to make it permanent:

dd if=kernel.gz.tramp.bin of=/dev/flash/spi0 obs=4k conv=osync seek=96

Please note that the block size is 4k, and 96 means the offset is 0x60000 (96*4096), which will be mapped to 0x600000 by the bootloader. If you are brave, you can just compile the image and dd it from the default Linux, but I don’t recommend this, since you may have different hardware (especially the SPI flash chip).
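If you need a different offset, the seek value is simply the byte offset divided by the block size. Here is a small Python sketch of that arithmetic; the helper name `seek_for_offset` is made up for this example, and it only covers the dd parameters, not the bootloader’s address mapping.

```python
BLOCK_SIZE = 4 * 1024    # obs=4k in the dd command
SEEK_BLOCKS = 96         # seek=96 in the dd command

flash_offset = SEEK_BLOCKS * BLOCK_SIZE
offset_hex = hex(flash_offset)


def seek_for_offset(byte_offset, block_size=BLOCK_SIZE):
    """Return the dd seek= value for a byte offset that is block aligned."""
    if byte_offset % block_size != 0:
        raise ValueError("offset is not a multiple of the block size")
    return byte_offset // block_size


seek_value = seek_for_offset(0x60000)
```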

In other news: I have completed the driver for the ThinLinx Hot-e NAND using the NAND2 framework. I also completed the SPI part and support for the SPI flash (read-only).