World’s First MIDI Shellcode
Jan 2025 · 45 min read
I gained remote code execution via MIDI messages to trick my synth into playing Bad Apple on its LCD. This blog post is about my journey with this reverse engineering project.
The beginning
I’ve had this Yamaha PSR-E433 synth for a very long time, and a couple of years ago I decided to open it up – partly because it was in need of cleaning, and partly because I was really curious about its internals. After removing some screws and digging up the main circuit board (labeled “DMLCD”), I was quite amused to find two flash chips, one RAM chip and an absolute unit of a chip labeled “YAMAHA SWL01U”, which I guessed had to be the brains of the operation. Using that part number I wasn’t able to find any information about the chip online apart from an article that claimed it was based around a “SuperH” CPU core – an ISA that I had never encountered before reading that article. So, after finishing the cleanup I just put the synth back together, still wondering what that mysterious chip really had under the hood.
Fast forward to a few months ago, when I took apart the poor synth again – this time purely out of curiosity. What sparked that curiosity was a service manual for a similar synth (the E443, I own an E433) that I found online, which among other things featured a pinout of that main chip that listed pin descriptions so enticing (“TESTN – Test Mode”, “PROTN – Determines if the product is a prototype”) that I just had to get a look at what was going on. There were also two bidirectional UART interfaces, and by looking at the schematic I could see that one of the two transmit pins wasn’t connected anywhere, suggesting that the chip maybe emits some kind of log via that pin. Oh, and it also had JTAG test points nicely broken out on the board - basically a 5-pin interface for various production line testing and debugging-adjacent tasks.
So, what were my options at that point? I could:
- Play around with the TESTN and PROTN pins and see how the synth behaves;
- Solder to the UART Tx pin and see what the chip outputs;
- Connect to the JTAG interface and read the chip’s identification code;
- Desolder one of the two flash chips and dump the firmware.
Let’s begin with the first approach. Both of the boot mode select pins end with an N, suggesting that these pins are active low, meaning that the signal is considered active when the voltage is close to zero, as opposed to the power rail, which in this case is 3.3 volts. The schematic says that both of these pins are pulled up to 3.3 volts with a resistor, so we can just short the pins to ground in order to activate them. That’s exactly what I did; unfortunately, it appeared as though activating the TESTN pin just prevented the synth from booting, and activating the PROTN pin didn’t change the synth’s behavior at all. Hey, at least I didn’t brick it!
Next up, let’s try looking at the UART interface. That pin that I mentioned didn’t lead anywhere, not even a test point, which means that I had to solder directly to a 0.3mm wide pin of the chip. No success this time either, as the chip didn’t output anything in any of the 4 combinations of the TESTN and PROTN signals.
It was now JTAG’s turn. Even though the next option (desoldering a flash chip) was quite scary as it meant that I had to build a flash dumper (I didn’t have one), messing around with the JTAG was even scarier for another reason. The thing is that JTAG is quite an abstract interface that vendors can build whatever they want on top of. In order to talk to a device via JTAG, you have to have a detailed description of the circuitry that builds on top of it, which usually comes in the form of a BSDL file. There’s basically only one command that almost every device supports, and that is reading the IDCODE – a 32-bit number that acts as an identifier for the type of device you’re talking to. Let’s hook a J-Link up to our board and try to read that identification code using OpenOCD.
Well, that’s something. The IDCODE is reported as 0x3f0f0f0f, which is suspiciously pretty. So suspicious that I triple-checked my wiring, but nope, looks like that’s the actual IDCODE of the device, which after a quick Google search seemed like it belonged to either an STMicroelectronics STR7xxx or an Atmel SAM7xxx microcontroller, both of which were based around an ARM7 CPU core. My only option was to assume that I was dealing with an actual ARM7TDMI core like the one that these MCUs are based on. On the other hand, incorrectly talking to a device via JTAG risks catastrophic damage, as some implementations of the interface grant very low-level access to the hardware, even lower than what the machine code running on the CPU core has. There’s a small chance of letting the magic smoke out when you instruct the device incorrectly at such a low level, provided the circumstances turn against you. Anyways, I did it; I told OpenOCD that I’m dealing with an ARM7TDMI core and it happily complied.
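For reference, the IDCODE isn’t an opaque number: IEEE 1149.1 splits it into a marker bit, a manufacturer code, a part number and a version. Here’s a minimal Python sketch that pulls those fields out of the value reported above. The function name is made up for illustration, and the sketch says nothing about what the vendor actually meant by each field.

```python
def decode_idcode(idcode: int) -> dict:
    """Split a 32-bit JTAG IDCODE into its IEEE 1149.1 fields."""
    return {
        "marker": idcode & 0x1,                 # bit 0, always 1 for an IDCODE
        "manufacturer": (idcode >> 1) & 0x7FF,  # bits 11:1, JEDEC-style code
        "part": (idcode >> 12) & 0xFFFF,        # bits 27:12, part number
        "version": (idcode >> 28) & 0xF,        # bits 31:28, revision
    }


fields = decode_idcode(0x3F0F0F0F)
print({name: hex(value) for name, value in fields.items()})
```

The regular bit pattern is what makes the value look so “pretty”: the part number comes out as 0xf0f0 and the version as 0x3.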
At least at this point, the magic smoke was still contained within the chip. I nervously connected to OpenOCD via GDB and tried pausing and resuming execution of the program. I was very surprised and excited to witness the current draw reported by my lab bench power supply reacting predictably to my commands. The entire circuit board was drawing about 115 mA when running and about 98 mA when paused, which was a very good sign that what I was talking to was, in fact, an ARM7TDMI core. At that point I had no other way to verify whether the thing’s CPU was really stopping or not.
Dumping the firmware
Wow, it looks like I won’t even have to desolder the flash chip in order to dump the firmware! And I already know what ISA the chip is based on, so I won’t have to go digging around in the firmware image in order to find that out! Looking in the documentation for ARM7TDMI, the reset vector is located at address 0, so let’s see what kind of data there is at that address.
Yeah, okay, it’s a jump, just as I expected. The very next instruction is some other vector, and it’s a jump as well. That looks about right. Yeah, we’re definitely on the right track! I know the size of the flash chip (16 MiBytes), so let’s just dump 16 MiBytes of data starting at address 0 into a file, load it up into Cutter and see what secrets it contains.
I’m very inexperienced when it comes to reverse engineering, but one thing that I do know is that strings are a goldmine of easily digestible information about a piece of software. That’s why the first thing that I do when starting an RE project is look at the “Strings” section in an RE tool. This project was no exception, and I was very pleased to see strings such as “This code can only run on a Thumb compatible processor”, “Illegal address (e.g. wildly outside array bounds)”, “Abnormal termination (e.g. abort() function)”, and most of all, “SWL01U Internal”.
What I didn’t like is how the very few strings that were there in the image repeated every 64 KiBytes. So, for instance, the string “SWL01U Internal” was contained at addresses 0x0000bfd0, 0x0001bfd0, 0x0002bfd0 and so on. Both this repetition (likely caused by a primitive design of the address decoder inside the chip) and that string itself hinted that I took a dump of some kind of memory inside the chip itself, and not one of the external flash chips like I had originally imagined. I concluded that this SWL01U chip contains a 64 KiByte ROM.
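The same conclusion can be reached without eyeballing strings. Below is a minimal Python sketch, assuming the mirrored region has a power-of-two size and the dump length is a multiple of it. It looks for the smallest block that the whole dump is a repetition of. The fake ROM is random stand-in data, not the real dump.

```python
import random


def mirror_period(dump: bytes, smallest: int = 1024) -> int:
    """Return the smallest power-of-two block size the dump repeats with."""
    period = smallest
    while period < len(dump):
        first = dump[:period]
        # Every later block must be identical to the first one.
        if all(dump[off:off + period] == first
               for off in range(period, len(dump), period)):
            return period
        period *= 2
    return len(dump)  # no repetition found


rng = random.Random(0)
rom = bytes(rng.getrandbits(8) for _ in range(64 * 1024))  # stand-in 64 KiB ROM
dump = rom * 16                                            # mirrored 16 times
print(hex(mirror_period(dump)))
```

If the function returns the full length of the dump, nothing repeats, which is what you’d hope to see for a real 16 MiByte flash image.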
The instruction at the reset vector was a jump to address 0x02000000, which I thought might actually be the external flash chip this time. I once again took a 16 MiByte dump starting at that address, and was pleased not to find any repetitions this time. Also, I observed a large amount of strings that I could recognize just from using the synth, such as “GrandPno”, “Tr1 will be OverWritten!” and “BogiWogi”.
So, what do we know so far? We know that the chip itself contains a 64 KiByte ROM that immediately transfers control over to the external 16 MiByte flash chip upon startup. The ROM is located at address 0x00000000, and the flash starts at 0x02000000. We have dumps of both memories and can now start reversing the firmware of this synth to hopefully gain more information about its main chip.
Reversing the firmware
After staring at the flash image for about an hour in Cutter, it became very obvious to me that this RE tool just wasn’t going to cut it (pun intended) and that I needed to switch it out for something more powerful. I’m happy to report that Ghidra met my expectations.
Now, we have to get a little philosophical here. In my eyes, RE is like a game of minesweeper. You start with an empty field not knowing the state of any of the cells, i.e. not knowing whether each individual cell contains a landmine or not. When you discover the state of a cell, you have the context to deduce the state of its neighbor cells. In minesweeper, you don’t have a particular direction in which you progress. You never say “In this game of minesweeper, I want to go up no matter what”, you just let the numbers nudge you in the direction that is the easiest to go in at the moment. I assert that this is also true for RE. Once you find out what a function or a variable does, you suddenly understand a little more about functions and variables that depend on the ones whose meaning you’ve just inferred. It may be beneficial not to set any particular goal with an RE project, and instead let the complex network of intertwined functions and variables guide you towards understanding the system as a whole.
So, where do we start? Right now we have two entry points from which we could begin prying the firmware apart: the reset vector and the strings. I tried both, just spending night after night learning more about the next function based on new insights gained from learning more about the previous one. This process is not very exciting to witness from the outside, so I don’t feel the need to retrace and describe my steps here. It’s just a chain of simple logical conclusions which propagate through the codebase. Like those little flags propagating through the field in a game of minesweeper.
There’s one subsystem in the firmware that I think is worth mentioning as it plays an instrumental role in the whole “Bad Apple” thing: The Shell. As I was digging around in the “Defined Strings” section of Ghidra, I noticed a small cluster of strings that looked like a list of commands for some kind of shell:
In RE, so-called “xrefs” (cross-references) take center stage. When you’re looking at a symbol (a function or a global variable), xrefs tell you what other symbols use (reference) the symbol that you’re looking at. In the screenshot above, most of our strings have one xref. Let’s follow each of them and see where they lead us to:
What we’re seeing here is a sequence of pairs of references, where the first item in the pair is always the name of a command, and the second item is a pointer to some function. Only the first element in this sequence is referenced directly, which leads me to believe that this is an ordinary C array of C structs with two members. Let’s name this array, so that when we encounter this variable being used somewhere in the future we instantly know what it is.
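To make the “array of two-member structs” idea concrete, here’s a minimal Python sketch that walks such a table in a flash dump. Only the 0x02000000 base comes from this post. The little-endian byte order, the null entry that ends the table, and the command names in the synthetic image are all assumptions for illustration.

```python
import struct

FLASH_BASE = 0x02000000  # where the flash is mapped, per the dump above


def c_string(image: bytes, addr: int) -> str:
    """Read a NUL-terminated ASCII string at a flash address."""
    start = addr - FLASH_BASE
    return image[start:image.index(b"\0", start)].decode("ascii")


def parse_command_table(image: bytes, table_addr: int, endian: str = "<"):
    """Walk (name pointer, handler pointer) pairs until a null name pointer."""
    commands = []
    offset = table_addr - FLASH_BASE
    while True:
        name_ptr, handler_ptr = struct.unpack_from(endian + "II", image, offset)
        if name_ptr == 0:  # assumed terminator
            break
        commands.append((c_string(image, name_ptr), handler_ptr))
        offset += 8
    return commands


# Tiny synthetic image: two hypothetical command names plus a 3-entry table.
image = bytearray(64)
image[0:5] = b"help\0"
image[8:12] = b"ver\0"
struct.pack_into("<IIIIII", image, 16,
                 FLASH_BASE + 0, FLASH_BASE + 0x1001,
                 FLASH_BASE + 8, FLASH_BASE + 0x2001,
                 0, 0)
table = parse_command_table(bytes(image), FLASH_BASE + 16)
print([(name, hex(handler)) for name, handler in table])
```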
Let’s now look at some code! Normal programs (like .exe or ELF files) consist of sections with clear designations for what sort of data they contain. For example, the .text section contains executable code and the .rodata section contains read-only data that the code requires. Unfortunately, embedded systems don’t typically use these files, and instead throw the code and data together in one large pile. This also means that there’s absolutely no hope of recovering symbol names and locations. Without symbol metadata, the stream of instructions is just that: a stream. Fortunately for us Ghidra has been programmed to at least recognize the boundaries of most functions, which it tends to do really well.
As this was my first time dealing with ARM assembly, the C decompiler feature of Ghidra turned out to be very useful for me. Unfortunately, due to a total lack of symbols its output is still quite hard for me to process. Take a look at this function which references the array that we looked at earlier. Don’t read into it, just skim over it:
Like I said, because Ghidra has absolutely no type or symbol information, the resulting C code is not something that you’d typically write and keep your job afterwards. Functions and global variables don’t have any meaningful names and are instead referred to by their addresses. Local variables don’t have meaningful names either, and they’re scoped to the entire function, as opposed to any particular block. Sometimes Ghidra thinks something is a local variable when really it’s better represented as a temporary result from an expression. It’s absolutely not the fault of the tool: all this information that makes code easy to understand is erased when it’s compiled and the symbols are stripped away.
Making sense of this heavily processed code is what’s so hard about RE, and it’s one of those things that you learn by doing a lot of. From now on, for the sake of clarity, I’ll only be presenting you the cleaned up C code after I’ve made sense of it. Anyways, we’re clearly dealing with some kind of state machine. Notice the outline of this function:
There are two states in which the function does very little, and one state in which the function does a lot. Judging by the strings that the first two states reference (“login” and “Passwd Error”), this function implements some kind of login interface and only lets us run a command if we’re logged in. This function is only ever called by one other function, so let’s inspect that one:
This function is going through some sort of buffer and calling another function for each character that it fetches from the buffer, and only calls the function that we looked at in the previous paragraph for every ‘\r’ (carriage return) character. Furthermore, the buffer appears to be a circular one with a size of 256. Let’s name some of the variables and functions to what I think they do based on those new insights:
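Here’s a rough Python model of the behaviour described above, not the firmware’s actual code. The class and method names are invented, and the per-character step is assumed to simply collect characters into a line until a carriage return shows up.

```python
RX_SIZE = 256  # ring buffer size seen in the decompiled code


class ShellInput:
    def __init__(self, run_line):
        self.ring = bytearray(RX_SIZE)
        self.head = 0  # next slot the receive side writes to
        self.tail = 0  # next slot the shell reads from
        self.line = bytearray()
        self.run_line = run_line  # called once per completed line

    def receive(self, data: bytes) -> None:
        """Producer side: store incoming bytes, wrapping around at 256."""
        for byte in data:
            self.ring[self.head] = byte
            self.head = (self.head + 1) % RX_SIZE

    def poll(self) -> None:
        """Consumer side: drain the ring, dispatching on carriage return."""
        while self.tail != self.head:
            byte = self.ring[self.tail]
            self.tail = (self.tail + 1) % RX_SIZE
            if byte == 0x0D:  # '\r' ends the line
                self.run_line(bytes(self.line))
                self.line.clear()
            else:
                self.line.append(byte)


lines = []
shell = ShellInput(lines.append)
shell.receive(b"login\r#00")
shell.poll()
shell.receive(b"00\r")
shell.poll()
print(lines)
```

Note that a line may arrive in several pieces: the partial “#00” simply waits in the line buffer until the rest shows up.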
Let’s name some variables once again and dive back into our “shell_run_command” function, this time with even more symbols labeled (I’ve glossed over most of the boring straightforward symbols):
If we dive into the “shell_print” function, we see lots of yet unknown data transfers into global variables. These global variables are referenced by other pieces of code (both in the flash and internal ROM) which write data into mysterious addresses located at 0xfxxxxxxx, which I’m assuming is the memory region that’s used to talk to various peripherals inside of the chip.
Okay, so what do we know about this shell?
- It won’t respond to our commands unless we say “login” and type in the password “#0000”;
- It has quite a limited set of commands and is potentially uninteresting;
- We still don’t know how to access that shell.
Let’s list out potential candidates for various interfaces that this shell could be running on top of:
- UART. There are two documented UART interfaces. Based on the schematic, both receive pins and one of the two transmit pins are used as GPIOs, and the other transmit pin doesn’t do anything (remember the previous section?).
- USB. There are two USB interfaces on this synth: one is a device interface implemented by the SWL01U chip itself, and the other is a host interface for connecting pen drives and such, implemented by an external host controller chip. If a shell is running on top of one of them, it’s probably the device interface, not the host one. However, if we connect the synth to a PC and run “lsusb” to dump its USB descriptor, we see that it has nothing but MIDI, an interface widely used in the music industry for transferring various music-related stuff such as “note on” and “note off” events. No serial ports or anything like that.
- JTAG. The documentation for ARM7TDMI says that its JTAG implementation features something ARM calls the DCC, which lets a program running on the chip and an external debug probe exchange custom data. It’s bidirectional and could thus be very well used for a shell. The DCC is accessed via special coprocessor data transfer instructions (MCR and MRC) in 32-bit words.
If it’s UART, then it’s definitely not accessible on our variant of the board, but nevertheless the code shouldn’t be greatly modifying the data that it wants to send, as UART operates on a byte level. If it’s USB, then it must be running on top of MIDI and must thus be manipulating the data in a way that’s suitable to send over MIDI in one way or another. If it’s JTAG, then it must be running on top of the DCC and must be using special instructions that access the DCC. Let’s look deeper into how exactly our “shell_print” function mutilates the data:
It seems to be breaking up each byte of data into two 4-bit nibbles and wrapping each of the two in its own byte. Every block of data that it passes on to the next stage in this data transfer pipeline starts with the same 8 bytes of data, followed by the payload, finally ending with an 0xf7 byte. Let’s use GDB to look at what those constant 8 bytes are:
All in all, a shell packet containing the string “> ” looks like this:
Here’s some context for those of you who don’t know how MIDI works. MIDI is a really simple protocol that emerged in the 80s and to this day allows various digital musical instruments to interoperate by sending and receiving messages such as “Please play the note C#4 with a loudness of 40 out of 127”, or “Please set the reverb level to 14 out of 127”, or “This is a tick. Assume that the period of time between the current and last tick corresponds to 1/24th of a quarter note”. MIDI has a few different message types, but they weren’t enough to describe every aspect of sound generation, so they introduced a special message called the System Exclusive message, or simply SysEx. In the words of the specification, “This message type allows manufacturers to create their own messages”.
Sooooo.... it was MIDI, right? Every SysEx message starts with an 0xf0 byte (just like our shell packets do), followed by 1 or 3 bytes of the manufacturer ID, followed by the payload, finally ending with an 0xf7 byte (again, like our packets do). The SysEx payload can only contain bytes in which the MSB is 0 because MIDI uses the MSB to differentiate between command and data bytes: 1 means it’s a command, and 0 means it’s data associated with the last command – this is exactly why “shell_print” is cutting the bytes up into 4-bit nibbles. Let’s look at the first data byte that the synth sends out (0x43) and see what manufacturer that corresponds to.
So yeah, 0x43 is Yamaha’s manufacturer ID, which means these madlads made a shell that runs on top of MIDI SysEx messages on top of USB. Very cool. Let’s cook up a Python script that acts as a translation layer between the terminal and the synth’s twisted little shell protocol and try talking to it.
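The framing half of such a translation layer can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the original script: only the first two prefix bytes (0xf0 and 0x43) are known from the text above, so the other six are zero placeholders, and the high-nibble-first order with the nibble stored as-is in the low four bits is a guess.

```python
SYSEX_START = 0xF0
SYSEX_END = 0xF7
YAMAHA_ID = 0x43
# The real prefix is 8 bytes long; the last six bytes here are placeholders.
HEADER = bytes([SYSEX_START, YAMAHA_ID, 0, 0, 0, 0, 0, 0])


def encode(text: bytes, header: bytes = HEADER) -> bytes:
    """Wrap shell text in a SysEx message, one nibble per data byte."""
    body = bytearray()
    for byte in text:
        body.append(byte >> 4)    # high nibble first (assumed order)
        body.append(byte & 0x0F)  # low nibble
    return header + bytes(body) + bytes([SYSEX_END])


def decode(packet: bytes, header_len: int = len(HEADER)) -> bytes:
    """Recover the shell text from a SysEx message built by encode()."""
    if packet[0] != SYSEX_START or packet[-1] != SYSEX_END:
        raise ValueError("not a SysEx message")
    nibbles = packet[header_len:-1]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


packet = encode(b"> ")
print(packet.hex(" "), decode(packet))
```

Every byte between 0xf0 and 0xf7 stays below 0x80, which is the whole point of the nibble trick.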
This is extraordinarily cool! I wasn’t really expecting this to work, as there’s a possibility that the format of the incoming messages is different from that of the outgoing ones. Fortunately, that turned out not to be the case. Although I have to say that the available commands are quite boring. Apart from your standard help and version information, the most interesting commands that we have are arbitrary memory read/write commands. So, if we really wanted to, we could just peek and poke the memory of the synth via MIDI. We don’t need JTAG for that.
Shellcode
Now, what can we do with arbitrary memory poke commands? We could inject executable code into RAM, but we could never execute it. Right? Wrong! If we overwrite the call stack of the program, we can trick the synth into executing it once it finishes handling the command. This is binary exploitation 101, except we don’t have to find any buffer overflow vulnerabilities, the memory poke commands are right there!
Let’s talk about data transfer speed. Our 32-bit memory write command takes the form of “m/l AAAAAAAA DDDDDDDD\r”, where A and D are the address and data respectively, expressed in hexadecimal. Each byte of the command is transformed into two bytes containing 4-bit nibbles of the original byte. It’s also extended with 9 additional bytes of the SysEx message. Then, every 3 bytes are wrapped in a 4 byte long USB-MIDI packet. In total, if we want to write 4 bytes into the memory, we have to send the synth 72 bytes, which is 18x larger than the payload. But that’s not all! The synth will read the command back to us, with every individual character nicely wrapped in its own SysEx transfer, and finish off with the “> ” prompt. In total, the synth and I exchange 396 bytes, which is almost 100 times larger than the 4-byte payload! This low transfer efficiency definitely shows and will become a problem if we ever want to send large amounts of data (foreshadowing?)
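The 72-byte figure for the outgoing command follows directly from the numbers above, as this small Python sketch shows. The function name is made up, and it only covers what we send, not what the synth echoes back.

```python
import math

SYSEX_OVERHEAD = 9   # 8-byte prefix plus the 0xf7 terminator
USB_MIDI_PACKET = 4  # each USB-MIDI packet carries up to 3 MIDI bytes


def usb_bytes_for(text: str) -> int:
    """Bytes on the USB wire for one shell command sent as a single SysEx."""
    midi_bytes = 2 * len(text) + SYSEX_OVERHEAD  # two nibble bytes per char
    return math.ceil(midi_bytes / 3) * USB_MIDI_PACKET


command = "m/l AAAAAAAA DDDDDDDD\r"
print(len(command), usb_bytes_for(command))  # 22 72
```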
I found a region of RAM which looks like it’s not used by anything and might thus be safe to put arbitrary data into. Let’s write a little assembly snippet that nicely asks the firmware to print “HeloWrld” to the 8 character long text portion of the LCD:
Let’s write a Python script that takes our assembled snippet, transforms it into memory write commands and sends them via MIDI over to the synth, following up with another write in order to trick the firmware into running that snippet.
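The “transforms it into memory write commands” step could look roughly like the sketch below. It’s a minimal illustration rather than the script used here: the base address is hypothetical, and both the little-endian word order and the lowercase hex digits are assumptions about what the shell expects.

```python
import struct


def write_commands(blob: bytes, base: int, endian: str = "<") -> list[str]:
    """Turn a binary blob into one 32-bit "m/l" write command per word."""
    padded = blob + b"\0" * (-len(blob) % 4)  # pad to a whole number of words
    commands = []
    for offset in range(0, len(padded), 4):
        (word,) = struct.unpack_from(endian + "I", padded, offset)
        commands.append(f"m/l {base + offset:08x} {word:08x}\r")
    return commands


# Two ARM instructions (mov r0, #1; bx lr) plus one stray byte of data.
blob = bytes.fromhex("0100a0e3" "1eff2fe1" "aa")
cmds = write_commands(blob, base=0x00001000)  # hypothetical RAM address
for cmd in cmds:
    print(repr(cmd))
```

The final return-address overwrite is just one more command of the same shape, sent last so that the code is fully in place before the firmware jumps to it.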
This took me quite a few tries to get right, but hey, it works! The nice part about this hack is that it doesn’t depend on any special interfaces like JTAG or UART. If we wanted to, we could write these messages to a MIDI file and play it on the synth like any other MIDI file. Hey, that gives me an idea.....
Ladies and gentlemen, I present to you: World’s First MIDI Shellcode.
Here’s the MIDI file in case you want to do the same thing with a Yamaha PSR-E433 running firmware version 1.02. DO NOT play this MIDI file on ANY other Yamaha device, or on a PSR-E433 running a different version of the firmware, as it’s going to act unpredictably. You have been warned.
Bad Apple
Displaying graphics turned out to be way, way, way harder than displaying text. First, let’s look in the datasheet for our LCD controller (ML9040A) to decide whether that’s even possible from a hardware standpoint. Turns out, not really – it can only handle text characters on a dot matrix. Our LCD definitely has a dot matrix part, but it also has this note notation part, and a 7-segment part in the middle, and another 7-segment part on the right, and a chord notation part below it, and finally a keyboard display at the very bottom.
How does the firmware light these segments up in a custom pattern if the controller only supports text? Let’s look at the block diagram of our display controller.
We can see three memories:
- The Display Data RAM (DDRAM) is written to by the host (in this case, SWL01U) to change the text displayed on the display. The host never writes the image that it wants the controller to display; instead, it sends it plain old ASCII (with some extra characters), and the controller is responsible for translating ASCII into an image that can be displayed on a dot matrix.
- The Character Generation ROM (CGROM) is what actually performs this translation. This ROM is a simple lookup table. It spits out a graphical pattern that must be displayed at a particular row in order to form a particular character.
- The Character Generation RAM (CGRAM) allows the host to define up to 8 custom characters, which can be called up by using character codes 0 through 7 or 8 through 15.
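To get a feel for what a CGRAM upload looks like, here’s a minimal Python sketch that packs 8 custom characters into 64 bytes. It assumes the usual layout of this family of character LCD controllers: 8 rows per character, one byte per row, 5 pixels per row with the leftmost pixel in bit 4. Check the ML9040A datasheet before relying on that.

```python
def pack_cgram(glyphs) -> bytes:
    """glyphs: 8 glyphs, each 8 rows of 5 pixels ('#' = lit) -> 64 bytes."""
    out = bytearray()
    for glyph in glyphs:
        for row in glyph:
            value = 0
            for pixel in row:  # leftmost pixel ends up in bit 4
                value = (value << 1) | (pixel == "#")
            out.append(value)
    return bytes(out)


# A checker pattern repeated across all 8 custom characters.
checker = ["#.#.#" if y % 2 == 0 else ".#.#." for y in range(8)]
data = pack_cgram([checker] * 8)
print(len(data), data[:2].hex())
```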
The CGRAM is how the synth displays non-textual data and what we can use to display custom graphics in the dot matrix part of the LCD panel as well. Let’s use the assembly snippet from before to display the 8 custom characters in the dot matrix area.
No, it’s not displaying garbage. When I press down a key on the keyboard, two dots light up in the dot matrix area which correspond to a note in the notation area and a key in the keyboard area. When I let go of the key, those segments get extinguished. This confirms that the firmware manipulates the CGRAM in order to display its stuff below the dot matrix area.
From the countless sleepless nights of digging around in the firmware I’ve discovered a function that sends arbitrary data to the LCD controller. Let’s write another assembly snippet that exploits this function to upload some custom data to the CGRAM.
When I run this snippet, I can definitely see the data that I want displayed (in this case, a checker pattern) getting actually displayed in the dot matrix area. However, it’s quickly replaced with what the synth wants to display in the custom area. We definitely can’t play a video with this; we have to find a way to disable the part of the firmware responsible for updating the CGRAM. One way we could do this is to find the function responsible for that (which I’ve already done) and just replace it with an immediate return, causing it to not do anything. The problem is that this requires me to overwrite the synth’s flash chip, which I don’t want to do out of fear of bricking it. I specifically set out to make every experiment of mine instantly reversible through power cycling, which means that I’m only allowing myself to manipulate the RAM.
I remember noticing that this firmware runs what appears to be some sort of a primitive RTOS with some parts of it contained in the ROM of the SWL01U chip. There’s a set of constant global variables in the flash which define the callback functions for the tasks, as well as their stacks and other attributes which I couldn’t figure out the meaning of. So, if we could a) find out which of these 64 tasks is responsible for constantly updating the CGRAM, and b) find a way to overwrite the corresponding entry in the task table so that it points to a no-op function, we could effectively disable that part of the firmware.
The key to this puzzle is the fact that the ROM and the flash are very loosely coupled. On startup, the firmware in the flash tells the ROM where its task table is located, and the ROM remembers this information in a global variable located in the embedded SRAM. If we make a copy of this task table in the RAM, and then tell the ROM that the task table has moved to a new location, we could coerce it into using this new table which we can modify in an instantly reversible way. So I did just that! I figured out which task was responsible for updating the display and replaced its callback with the default idle task callback, effectively preventing the firmware from continuously updating the CGRAM of the display controller.
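The trick is easier to see in a toy model. The Python sketch below simulates it with a dictionary standing in for memory. Everything except the flash window is hypothetical: the addresses, the callback values, the 4-entry table, and above all the idea that a task entry is a single callback pointer (the real entries also carry stacks and other attributes).

```python
FLASH = range(0x02000000, 0x03000000)  # 16 MiByte flash window, per the post
TABLE_IN_FLASH = 0x02010000            # hypothetical
TABLE_IN_RAM = 0x00800000              # hypothetical unused RAM
TABLE_PTR_VAR = 0x00400000             # hypothetical SRAM variable kept by the ROM
TASKS = 4                              # toy size; the real table is larger
IDLE_CB, SHELL_CB, DISPLAY_CB = 0x02000100, 0x02000200, 0x02000300  # hypothetical


class Memory:
    """Word-addressed toy memory that refuses writes to flash."""

    def __init__(self, flash_words):
        self.words = dict(flash_words)

    def read(self, addr):
        return self.words.get(addr, 0)

    def write(self, addr, value):
        if addr in FLASH:
            raise PermissionError("flash stays untouched")
        self.words[addr] = value


def relocate_and_patch(mem, task_index, new_callback):
    """Copy the task table to RAM, patch one entry, then repoint the ROM."""
    old_table = mem.read(TABLE_PTR_VAR)
    for i in range(TASKS):
        mem.write(TABLE_IN_RAM + 4 * i, mem.read(old_table + 4 * i))
    mem.write(TABLE_IN_RAM + 4 * task_index, new_callback)
    mem.write(TABLE_PTR_VAR, TABLE_IN_RAM)  # last, so the ROM never sees half a table


flash_table = [IDLE_CB, SHELL_CB, DISPLAY_CB, IDLE_CB]
mem = Memory({TABLE_IN_FLASH + 4 * i: cb for i, cb in enumerate(flash_table)})
mem.write(TABLE_PTR_VAR, TABLE_IN_FLASH)  # what the flash firmware does at startup
relocate_and_patch(mem, task_index=2, new_callback=IDLE_CB)
print(hex(mem.read(TABLE_PTR_VAR)), hex(mem.read(TABLE_IN_RAM + 8)))
```

Since only RAM is touched, a power cycle puts the pointer back at the flash copy, which is exactly the “instantly reversible” property I was after.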
You can see that the first iteration has some artifacts, but the biggest problem is that the frame rate is very low. The reason for that is the extremely low data transfer efficiency that I was talking about. Even if we upload the executable snippet once and only replace its data section when we want to display a new frame, that’s still 6732 bytes of data transferred per 68 bytes of payload (64 bytes of CGRAM data plus a 32-bit return address overwrite), i.e. 17 four-byte writes at 396 bytes each. And it turns out that these transfers are really slow, which in our case translates to a low frame rate.
The two biggest contributors to this low payload efficiency are: a) the fact that this data has to be wrapped in a command, and b) the fact that the synth reads the command back character by character in these enormous packets. If we could manipulate the task table once again in order to assign our own callback for the shell task, we could capture raw data and choose not to respond with anything, which would eliminate both of these problems. This, together with another packing optimization, brings the total transfer size per frame down from 6732 bytes to 92 bytes – a 73-fold decrease! The artifacting is still there, but we’re now able to play video at a tolerable frame rate.
Now, what causes this artifacting? The synth uses the same 8 GPIO lines for both talking to the display and scanning the panel with button controls and LEDs. One of the tasks is responsible for intertwining LCD accesses with panel scanning, and sometimes while we’re transferring our data to the LCD unbeknownst to this task, it decides to interrupt us and do a scan of the panel, which messes with the same data lines that the display is currently actively listening to, which causes these artifacts. To avoid this, we could stop talking to the display directly, and instead nicely ask that multiplexing task to send the data that we want once it’s done with the panel scan.
So there you go! The algorithm to display video on the LCD of this synth over MIDI is as follows:
- Log into the shell;
- Write executable code into RAM using memory write commands provided by the shell;
- Execute the code from RAM by overwriting the return address on the stack;
- Make a copy of the task tables in RAM;
- Fix those new tables up so that they point to each other;
- Tell the ROM to use our new task tables;
- Replace the display task callback with the default idle callback;
- Replace the shell task callback with our own callback;
- In that callback, unpack data arriving via MIDI and transfer it over to the display/panel multiplexing task;
- Feed our synth video frames via MIDI.
This project is not quite done yet. I have a very limited understanding of the chip’s MMIO region, and absolutely no understanding of its most interesting part - the DSP that’s separate from the main ARM core. Stay tuned for when I figure those things out :)