January 31, 2021 #108 Software

GeoRAM, DigiMAX and NMI Programming

Welcome to 2021, let's hope it turns out better than 2020.

In August of last year—which feels literally like yesterday—I published a weblog post about Playing Back PCM Audio on our beloved brown friend, the Commodore 64. I wrote that in response to an email exchange. I thought a blog post made more sense than just replying in an email.

I walked through how to play digital audio on the C64, in theory, and tried to come up with suggestions for how to work with an REU. I recommended supporting GeoRAM as an option because it's still commercially available in a few forms (GGLabs, Shareware Plus, Garrett's Workshop).

Until recently I didn't know anything about GeoRAM. When I looked into it, I realized that it might be better suited to digital audio playback than a traditional 17xx REU. While working on a forthcoming update to the Commodore 8-Bit Buyer's Guide I was looking for new gear that's appeared on eBay since the last update, and I stumbled upon a new 2MB GeoRAM Clone from a maker called Garrett's Workshop.

It comes in a translucent, injection-molded stubby cart produced by The Future Was 8-Bit, and has a nice professional feel. It was only $45 so I decided to spring for it. It arrived remarkably quickly and I was stoked. Here it is:

2MB GeoRAM Clone from Garrett's Workshop.

What else could I do besides spend a few days implementing what had just been some theorizing from 5 months ago? That turned into GeoRAM Digi. I'm releasing it on GitHub with the MIT License. Use the code, fix it, change it, incorporate it, share it, even sell it, just give me credit for the work I've put into it by keeping the MIT License and copyright information attached to it if you do make use of it.

Okay! Let's dig into this, shall we?

What is GeoRAM Digi?

GeoRAM Digi is a program for the C64 for playing RIFF/WAVE files. It uses a GeoRAM for its RAM expansion and outputs 8-bit stereo audio via a DigiMAX. GeoRAM plugs into the Cartridge Port and DigiMAX plugs into the User Port. (DigiMAX is commercially available too from Shareware Plus.)

The goal was to make a minimalist NMI-based player, to see how it can be done, and to use both the GeoRAM and the DigiMAX. It also shows how to support the RIFF/WAVE file format and how to support more than one channel, more than one bit depth and more than one sample rate.

I won't be building a complex user interface for GeoRAM Digi, because that is a job for a proper C64 OS application sometime down the road. C64 OS excels at making rich object-oriented mouse-driven UIs. GeoRAM Digi should be considered more like a command line tool, than like an application. This also keeps it trim so you can use the essential routines in your own application, which I hope someone will.

What does GeoRAM Digi support?

  • GeoRAM detection
    • Cartridge presence
    • 512K capacity
    • 1MB capacity
    • 2MB capacity
    • 4MB capacity
  • Audio output via DigiMAX
  • RIFF/WAVE file format
    • Chunk decoding
    • Skips unsupported chunks
    • Uncompressed PCM data
    • Mono and stereo samples
    • 8-bit and 16-bit samples
    • 22kHz, 16kHz, 11kHz, 8kHz sample rates *1
  • NTSC and PAL timing
  • Loadtime downsampling from 16- to 8-bit *2
  • Audio dithering via SID noise
  • Outputs mono samples to both channels
  • Displays detected metadata
  • Displays loading progress
  • Truncates clips that exceed available memory
  • Keyboard support
    • SPACE = Pause/resume playback
    • s = Stop
    • ← = Rewind
    • F1 = Skip back 5 seconds
    • F7 = Skip forward 10 seconds
    • +/- = Toggle screen on/off
    • r = Repeat mode on/off
    • CTRL+Q = Quit to READY.
*1 Sort of. We'll see what this means.

*2 This might not be the correct meaning of downsample. Downsample might only be for changing the sample rate. I don't know the right word for when you change the number of bits per sample, but I know some people refer to this, loosely, as downsampling.

What is GeoRAM?

GeoRAM is a misleadingly named RAM expansion for the C64 and C128. It was called GEORAM because it was first made by Berkeley Softworks, the people behind GEOS. It was an inexpensive way to expand the C64's RAM to speed it up and improve GEOS.

By the name you might think it can only be used by GEOS. Remember WinModems? They were completely useless on a Commodore because the hardware was half-implemented in Windows-only drivers. This is not the case with GeoRAM. It is different than Commodore's 17xx REUs but any C64 software could easily make full use of it. You can even use it from BASIC with a few POKEs.

Original GEORAM from Berkeley Softworks, 1989.

How is GeoRAM different from a 17xx REU?

There seems to be a lot of confusion over the difference between these two types of RAM.

Commodore released the first 17xx REU alongside the Commodore 128. And they released the 1764 REU shortly thereafter targeting the C64. Truth be told, all of the 17xx REUs can be used on a C64 as long as you also upgrade to a more robust power supply. There are currently many options for new, compact and robust power supplies for the C64.

A 17xx REU relies on a custom MOS chip, called the REC or RAM Expansion Controller. It's quite sophisticated, a kind of special purpose processor. The REC is mapped into I/O, and you program it via its registers. You give the REC an internal range of its own memory plus a range of memory in the C64's main RAM, and then you give it a command to begin a transfer.

There are a few different options, but a transfer either goes from the C64 to the REU, or from the REU to the C64, or it does a full swap at half the speed. When the transfer begins the REC chip asserts the /DMA line which puts the C64's CPU on pause. The REC takes over the bus and uses it to read and write to main RAM much more efficiently than the CPU can. The CPU is general purpose and needs to execute several instructions to copy just a single byte. The REC is designed specifically to move blocks of memory and is preloaded with all the information it needs to do the move. The REC can thus transfer close to 1 byte per clock cycle.

The REC's job is made slightly more complicated because the VIC-II needs to steal cycles from the CPU to fetch an extra ~40 bytes every 8th rasterline of the bitmapped screen area. (The so-called bad lines.) The VIC-II asserts BA (Bus Available) when it needs the bus during the phase when the CPU usually has rights to the bus. This line allows the REC to share the bus with the VIC-II just as the 6510 does. Needless to say, the REC is a sophisticated chip with many nuances and strict timing requirements.


GeoRAM is totally different. It is much simpler. GeoRAM is easier to implement in hardware, but it is less powerful, with fewer features than a 17xx REU. On the other hand, in certain situations those differences can have a few advantages.

The C64 reserves 2 pages of addressing space (256 bytes each) to be used by external I/O devices. These are I/O 1 and I/O 2. I/O 1 is from $DE00 to $DEFF and I/O 2 is from $DF00 to $DFFF. GeoRAM takes over all of I/O 1 and uses it as a 256-byte sliding window into its own RAM. It also maps two write-only control registers into the last 2 bytes of I/O 2, $DFFE and $DFFF. By using only the last two bytes of I/O 2, it can usually play nicely with another cartridge that uses the lower portion of I/O 2.

To use it, here's an example scenario: You read or write to any address from $DE00 to $DEFF just as though it were regular RAM. However, internally you are writing by default to Page 0 of Bank 0 of the GeoRAM's memory. Then you write $01 to $DFFE, and instantaneously the page visible inside the $DExx window is different. Everything you just wrote to Page 0 has been slid one conceptual frame out of view. Like this:

GeoRAM's Sliding Window

Now anything you read or write to any address from $DE00 to $DEFF is going into Page 1 of Bank 0 of the GeoRAM's memory. To bring back Page 0, write $00 to $DFFE. Instantly, Page 1 slides out of view and Page 0 slides back into view. This is literally everything that GeoRAM does. It has no special features of any kind. Just a sliding 256-byte window visible at $DExx (I/O 1) and two address bytes at $DFFE and $DFFF to control which GeoRAM page should be visible through the window. The end. It's very simple.
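To make the sliding window concrete, here is a toy model in Python. It is not C64 code and not part of GeoRAM Digi. The class name GeoRAMModel and its poke/peek methods are made up for illustration, and it assumes a 512K unit whose registers simply ignore the bits they don't need.

```python
class GeoRAMModel:
    """Toy model: a 256-byte window at $DExx plus two write-only registers."""

    PAGE_SIZE = 256
    PAGES_PER_BANK = 64          # 64 pages * 256 bytes = 16K per bank

    def __init__(self, banks=32):            # 32 banks * 16K = 512K
        self.banks = banks
        self.ram = bytearray(banks * self.PAGES_PER_BANK * self.PAGE_SIZE)
        self.page = 0            # last value written to $DFFE
        self.bank = 0            # last value written to $DFFF

    def _window_base(self):
        # Assumption: bits beyond what the unit needs are ignored.
        bank = self.bank % self.banks
        page = self.page % self.PAGES_PER_BANK
        return (bank * self.PAGES_PER_BANK + page) * self.PAGE_SIZE

    def poke(self, addr, value):
        if addr == 0xDFFE:
            self.page = value & 0xFF
        elif addr == 0xDFFF:
            self.bank = value & 0xFF
        elif 0xDE00 <= addr <= 0xDEFF:
            self.ram[self._window_base() + (addr - 0xDE00)] = value & 0xFF
        else:
            raise ValueError("address is outside the GeoRAM's I/O range")

    def peek(self, addr):
        if 0xDE00 <= addr <= 0xDEFF:
            return self.ram[self._window_base() + (addr - 0xDE00)]
        raise ValueError("only the $DExx window can be read")


geo = GeoRAMModel()
geo.poke(0xDE00, 0x41)      # lands in page 0 of bank 0
geo.poke(0xDFFE, 0x01)      # slide the window to page 1
geo.poke(0xDE00, 0x42)      # same address, different page
geo.poke(0xDFFE, 0x00)      # slide back to page 0
page0_value = geo.peek(0xDE00)
```

After sliding back to page 0, page0_value holds the byte that was written before the window moved.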

GeoRAM from BASIC

You can test out GeoRAM from BASIC.

The hexadecimal range from $DE00 to $DEFF is 56832 to 57087 in decimal. Write some data to these addresses like this:

10 open5,8,5,"mytext.txt"
20 for i=0 to 255
30 get#5,a$
40 poke56832+i,asc(a$+chr$(0))
50 nexti
60 close5

Boom, you just copied 256 bytes from the file mytext.txt into a page of GeoRAM.

To be certain of what page you're writing into, it's safest to write the bank and page numbers to the control registers first. Hexadecimal $DFFE and $DFFF are 57342 and 57343 respectively. Want to change to Page 5? Try this:

poke57342,5

Now the text data you copied in is out of view in page 0, and page 5 is in view. You can repeat the process to copy a different text into page 5. Need to get back to page 0? No problem.

poke57342,0

By the way, if you want to play around with this but don't have a GeoRAM, but you have a 1541 Ultimate II or an Ultimate 64, then guess what? You also have a GeoRAM! You can go into the settings menu of the 1541U or U64, then into the C64 settings, cartridge menu, and turn on GeoRAM. Beautiful. GeoRAM is also supported by VICE.


GeoRAM Banking and Addressing

There is a complication. (When isn't there a complication, right?) For reasons I don't fully understand, but which probably relate to how the RAM was wired up in the original GEORAM, each bank contains only 16K of memory, not the theoretical addressable maximum of 64K. Let's think about what this means for programming it.

Imagine a single contiguous range of memory that can be addressed by 24-bits. Each address would consist of 3 bytes:

  • A Low Byte
  • A Mid Byte, and
  • A High Byte

This sounds like GeoRAM, the low byte addresses the 256 bytes of the page visible at $DExx. The mid byte slides the window left and right to show different 256-byte pages. And the high byte selects banks, which are collections of pages. With a true 24-bit address, each byte has 256 possible values: 256 bytes in a page * 256 pages * 256 banks = 16,777,216 bytes, or 16MB. But a GeoRAM cannot have up to 16MB, it maxes out at only 4MB. Why?

With GeoRAM, the low byte addresses the 256 bytes of the page visible at $DExx and the high byte can select up to 256 different banks. The problem is that the mid byte isn't a full 8-bit, it's only 6-bit. It can't specify up to 256 pages, but only 64 pages. 64 pages * 256 bytes = 16,384 or 16K per bank. In other words, to address any given byte within GeoRAM you don't have a 24-bit address but a 22-bit address. Like this:

GeoRAM's 22-Bit Addressing Scheme

If you only want to deal with banks of no more than 16K, then this isn't so bad.

But when we're dealing with digital audio, we don't want to be capped at 16K units of memory. That is enough for only about 1.5 seconds of 8-bit mono audio at 11025Hz. We need hundreds of kilobytes, perhaps megabytes, of contiguous space to support long audio files. In this case, we really want to conceptualize the GeoRAM's full capacity as though it were one contiguous range. Even if it maxes out at 2MB or 4MB, we just want them to all be together as one block.
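The arithmetic behind that figure is simple enough to check. This is just a Python calculation, not code from GeoRAM Digi, and the function name seconds_of_audio is my own invention for the example.

```python
def seconds_of_audio(capacity_bytes, sample_rate, channels=1, bytes_per_sample=1):
    """How many seconds of uncompressed PCM fit in capacity_bytes."""
    return capacity_bytes / (sample_rate * channels * bytes_per_sample)


one_bank = seconds_of_audio(16 * 1024, 11025)            # a single 16K bank
whole_2mb = seconds_of_audio(2 * 1024 * 1024, 11025)     # a full 2MB GeoRAM
```

One bank comes out to roughly a second and a half, while a whole 2MB unit holds a little over three minutes of 8-bit mono at 11025Hz.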

GeoRAM's 6-bit mid byte is actually rather inconvenient for that because the full range is not in fact contiguous. Those two grey bits in the middle represent 48KB gaps in the addressing range after every 16KB. What we want is for every bit to be used without any gaps, up to the total capacity. It would still need 3 addressing bytes. If it were a 512K unit, the first 19 bits in a row would be used. 1MB? The first 20 bits in a row would be used, with all the unused bits coming at the very end.

These 3 bytes could be incremented, multiplied, divided, or have offsets added or subtracted without any weird manipulations in the middle. GeoRAM's addressing is problematic because you cannot simply increment the 3 address bytes to progress sequentially through the space. As we will see later, this will matter when we want to implement audio skipping, such as back 5 seconds or forward 10 seconds.

To rectify this, we are going to abstract the addressing with 3 normalized address bytes that can be easily manipulated. Then, at the end, the normalized address will be translated into the GeoRAM's discontiguous addressing scheme by sliding all the bits after the first 14-bits up two positions. We'll return to this later, as it will have both benefits and ramifications.

Normalizing GeoRAM's Addressing Scheme
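Here is the translation sketched in Python rather than 6502 assembly, so it is only a model of the idea and not the routine from GeoRAM Digi. The function name normalized_to_georam is hypothetical. The low 8 bits stay put, the next 6 bits become the page, and everything above bit 13 becomes the bank.

```python
def normalized_to_georam(addr):
    """Split a contiguous (normalized) address into GeoRAM's bank, page, offset."""
    offset = addr & 0xFF            # byte within the $DExx window
    page = (addr >> 8) & 0x3F       # 6-bit page number, for $DFFE
    bank = (addr >> 14) & 0xFF      # 8-bit bank number, for $DFFF
    return bank, page, offset


last_of_bank0 = normalized_to_georam(0x003FFF)    # last byte of the first 16K
first_of_bank1 = normalized_to_georam(0x004000)   # one byte later
five_seconds = normalized_to_georam(5 * 11025)    # 5 seconds in at 11025 bytes/second
```

Because the normalized address is an ordinary number, adding 1 to $003FFF rolls over cleanly into bank 1, page 0, offset 0. The awkward part is confined to this one translation step.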

Outputting Messages to Screen

We're going to walk through the whole program, so let's start with outputting messages to the screen.

The code is in assembly, but the BASIC ROM contains many useful routines that can be called from assembly code. One is the string out routine, at $AB1E. It takes a pointer to a null-terminated PETSCII string in Y and A and prints it to the screen.

A "string" include file will give us a useful string lookup routine. We will be outputting canned messages, as well as strings whose pointers we will manually provide. To make this easy we'll make a couple of helper routines: msgout and strout. These are in the flavor of the KERNAL's chrout routine. They also include an extra carriage return so each message out or string out will end up on its own line.

There is one additional messaging routine we'll see later, errout. This just calls msgout then closes any open file. This is a useful way to bail out of the program early if we encounter an error condition.

We need a collection of messages to tell the user about error conditions or its progress processing and loading the file. Here's how we can include a set of canned messages.

With this, it's now super easy to print a message to the screen. Load the message code into the accumulator, JSR to msgout, and boom, the message is on the screen. Very easy.


Detecting GeoRAM: Presence and Capacity

It is essential that a GeoRAM be present. We'll start by detecting the GeoRAM's presence, and then use a simple trick to test for and discover its available capacity. The original came in 512K, and capacities double up to the maximum of 4MB. So we only need to detect 4 possible capacities:

  • 512K
  • 1MB
  • 2MB
  • 4MB

If no GeoRAM is present, the program will errout with that message. Otherwise, we'll print out the capacity that was detected.

Additionally, the detected capacity will be configured in the 4-byte maxmem variable. RIFF files use 32-bits for sizes.1 Although that is enough to support up to 4GB, and we have a mere 4MB (or less), we'll keep our numbers in 32-bits too, because it'll make the comparisons easier.

Bytes of a multi-byte variable are numbered starting from 0. In a 32-bit (4-byte) variable, the low byte is 0, the mid-low byte is 1, the mid-high byte is 2, and the high byte is 3. The mempgs table holds values that will be plugged into byte 2 of maxmem.

Note, maxmem is in the normalized addressing scheme. That is, all the bytes are a full 8-bit. The conceptual banks are 64K not 16K. Therefore, a 512K GeoRAM will have 8 banks of 64K, (not 32 banks of 16K). And a 4MB GeoRAM will have 64 banks of 64K, (not 256 banks of 16K). Thus, the $08, $10, $20 and $40 in the mempgs table are the number of 64K banks.

Here's the detection routine:

The act of detecting GeoRAM and its banks is currently changing the data. If the RAM already contains data, this will corrupt it. Therefore, this routine should be improved by reading the existing values first, backing them up, and restoring them after the detection is complete.

Here's how the detection works.

For all capacities of GeoRAM, banks $00 to $1F are available. For all capacities except 512K, banks $00 to $3F are available. But on the 512K unit, if you select a bank from $20 to $3F the upper bits are ignored and it actually selects the corresponding bank from $00 to $1F.

This pattern continues. For 2MB and 4MB units, the banks from $00 to $7F are available, but on 1MB and 512K units the upper bits are ignored and lower banks are actually selected. Personally, I find it easier to visualize this using binary instead of hexadecimal.

Capacity   Hex Range    Binary Range
512K       $00 to $1F   %xxx00000 to %xxx11111
1MB        $00 to $3F   %xx000000 to %xx111111
2MB        $00 to $7F   %x0000000 to %x1111111
4MB        $00 to $FF   %00000000 to %11111111

The X's in the binary range show the bits that are ignored. We can take advantage of this.

Start by selecting page $00 (write $00 to $DFFE), so page 0 will be used in each bank. Then we'll write capacity markers into selected banks that support that capacity.

Select bank %11111111 (write %11111111 to $DFFF) and write $04 into $DE00.

If we in fact have only a 2MB GeoRAM, the high bit of the bank number is ignored, and $04 actually goes into bank %01111111. Similarly, if we have only a 1MB GeoRAM then $04 actually goes into bank %00111111, and so on. We continue to write smaller markers into lower banks.

Select bank %01111111 and write $03 into $DE00.

When $03 is written to bank %01111111 it will leave $04 in the highest bank of a true 4MB GeoRAM, but it will overwrite $04 with $03 in a 2MB GeoRAM. We continue this process writing $02 into bank %00111111 and $01 into bank %00011111. On a 512K GeoRAM all the upper bits were masked, so first a $04 went into bank %00011111, then a $03 overwrote it, then a $02 overwrote it and finally $01 overwrote it. Whatever marker is left in when we read from the highest bank, that's the true capacity. Clever.
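The marker trick can be simulated in a few lines of Python. This is a model for checking the logic, not the assembly routine itself. The function name surviving_marker is made up, and the model assumes what the text describes: a smaller unit ignores the upper bank bits. Only byte 0 of page 0 in each bank is modeled.

```python
def surviving_marker(real_banks):
    """real_banks: number of 16K banks actually fitted (32, 64, 128 or 256)."""
    first_byte = [0] * real_banks        # byte 0 of page 0 in every real bank
    mask = real_banks - 1                # upper bank bits are ignored

    def write(bank, value):
        first_byte[bank & mask] = value

    def read(bank):
        return first_byte[bank & mask]

    # Write the largest marker first, then progressively smaller ones.
    for marker, bank in ((4, 0b11111111), (3, 0b01111111),
                         (2, 0b00111111), (1, 0b00011111)):
        write(bank, marker)

    return read(0b11111111)              # whatever survived is the capacity


markers = {banks: surviving_marker(banks) for banks in (32, 64, 128, 256)}
```

Running it for 512K, 1MB, 2MB and 4MB (32, 64, 128 and 256 banks) leaves markers 1, 2, 3 and 4 respectively.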

But what if there is no GeoRAM at all? In that case, when we read from $DE00, we're not reading byte 0 of page 0 of RAM from some configured GeoRAM bank. We might actually be reading a register value from some other cartridge occupying I/O 1. If that register value happens to come back as $01, $02, $03 or $04 we'd misinterpret that as a valid GeoRAM capacity.

To detect the GeoRAM's presence I write some magic values into I/O 1, and then read them back to see if what I wrote was stored and could be retrieved. This will certainly notice that a GeoRAM is not there if nothing occupies I/O 1, but I'm not sure this is good enough if some other cartridge occupies I/O 1.

A more robust way would be to choose a specific page, by writing to $DFFE, then write several magic bytes to $DExx, then change the page by writing another page number to $DFFE. Write a different set of magic bytes to the same places in $DExx. Then read back all the magic bytes from both pages to make sure the second set didn't overwrite the first set. This would virtually guarantee that either a GeoRAM—or something that works just like it—is present. Of course, it would also be prudent to restore any bytes modified by the detection routine.
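As a sketch of that more robust check, here it is in Python against two fake devices. None of this is in GeoRAM Digi. The names looks_like_georam, PagedDevice and FlatDevice, the magic byte values, and the choice of pages 0 and 1 are all assumptions for the example. I also select bank 0 first so the test runs in a known bank.

```python
MAGIC_A = (0x47, 0x45, 0x4F)     # arbitrary test bytes for the first page
MAGIC_B = (0xB8, 0xBA, 0xB0)     # different bytes for the second page


def looks_like_georam(poke, peek, page_a=0, page_b=1):
    poke(0xDFFF, 0)                                   # known bank
    plan = ((page_a, MAGIC_A), (page_b, MAGIC_B))
    saved = {}
    for page, magic in plan:                          # back up what is there
        poke(0xDFFE, page)
        saved[page] = [peek(0xDE00 + i) for i in range(len(magic))]
    for page, magic in plan:                          # write both sets
        poke(0xDFFE, page)
        for i, value in enumerate(magic):
            poke(0xDE00 + i, value)
    found = True
    for page, magic in plan:                          # read both sets back
        poke(0xDFFE, page)
        found = found and all(peek(0xDE00 + i) == v for i, v in enumerate(magic))
    for page in (page_b, page_a):                     # restore the originals
        poke(0xDFFE, page)
        for i, value in enumerate(saved[page]):
            poke(0xDE00 + i, value)
    return found


class PagedDevice:
    """Behaves like one bank of GeoRAM: $DFFE selects the visible page."""
    def __init__(self):
        self.pages = [bytearray(256) for _ in range(64)]
        self.page = 0
    def poke(self, addr, value):
        if addr == 0xDFFE:
            self.page = value & 0x3F
        elif 0xDE00 <= addr <= 0xDEFF:
            self.pages[self.page][addr - 0xDE00] = value
    def peek(self, addr):
        return self.pages[self.page][addr - 0xDE00]


class FlatDevice:
    """Some other cartridge: 256 bytes of RAM at $DExx, no paging at all."""
    def __init__(self):
        self.mem = bytearray(256)
    def poke(self, addr, value):
        if 0xDE00 <= addr <= 0xDEFF:
            self.mem[addr - 0xDE00] = value
    def peek(self, addr):
        return self.mem[addr - 0xDE00]
```

The flat device passes a naive "write then read back" test, but fails this one because the second set of magic bytes lands on top of the first.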

Once we're sure there is a GeoRAM there, the only thing left to do is read the capacity marker from bank %11111111. On a 4MB unit it will be $04, on a 2MB unit it will be $03, on a 1MB unit it'll be $02 and lastly on a 512K version it will be $01. We can use the capacity marker as an index into the mempgs table to copy the number of 64K banks into byte 2 of maxmem.
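In Python terms, the marker-to-maxmem step looks something like the sketch below. The dictionary stands in for the mempgs table (how the real table is indexed is an assumption on my part), and the helper names maxmem_for and clamp_to_memory are hypothetical.

```python
# Marker left in the top bank -> number of 64K banks (byte 2 of maxmem).
MEMPGS = {1: 0x08, 2: 0x10, 3: 0x20, 4: 0x40}


def maxmem_for(marker):
    """Build the 4-byte little endian maxmem value for a capacity marker."""
    return bytes([0x00, 0x00, MEMPGS[marker], 0x00])


def clamp_to_memory(data_size, marker):
    """Truncate a clip's size to what the detected GeoRAM can hold."""
    capacity = int.from_bytes(maxmem_for(marker), "little")
    return min(data_size, capacity)


two_meg = int.from_bytes(maxmem_for(3), "little")
```

Keeping maxmem in the same 32-bit little endian layout as RIFF sizes means the two can be compared byte for byte.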

To output a string of the detected capacity, use the marker as an index into memtablo and memtabhi to get a pointer to the correct string, and then display it with a JSR to strout. This is done from lines 70 to 75 above.


Opening Files and Reading Data Structures

We'll use the KERNAL to open and read data from the RIFF/WAVE file. The KERNAL is quite low-level, its routines generally operate on single bytes at a time. To read in data, you need to call the KERNAL's routines in a loop, and decide where in memory to put each byte.

A handful of helper routines will nicely abstract the jobs to be done.

  • openfile
  • errout
  • closfile
  • loadntoa
  • loadnton

openfile

A proper application or program would let you specify what file you want to open and from where. But, as I said at the beginning, that work will be the job of C64 OS and its frameworks. I don't want to spend any time on that for a simple tool like this.

GeoRAM Digi always opens a file named sample.wav from the current device. Someone else can embed these routines in a program that has a more robust way of selecting files. But, for this, you can just rename the file you want to open to sample.wav before running GeoRAM Digi. The current device is stored in $BA, it is set by the KERNAL when you load a program. You can either load GeoRAM Digi from the same device where sample.wav is found, or you can poke an alternative device number into $BA like:

poke 186,9  (To set the current device # to 9.)

Also, if you have JiffyDOS, pushing CTRL+D cycles through the current device numbers. As you cycle through it updates $BA with the device number it prints on the screen.

openfile opens the file, assigns it a default logical file number and then does a call to chkin so you are ready to make calls to chrin to read single bytes from the file.

errout

As mentioned earlier, errout calls msgout then falls through to closfile. When an error is encountered, rather than JSRing to errout, it JMPs to it instead. That prints the error message, falls through to close the file, and returns the user to the READY. prompt.

closfile

Very little to see here. It closes the file that was opened by openfile. It then calls clrchn to restore the keyboard and screen as the default input and output devices.

loadntoa

This routine stands for "Load N-Bytes to Address." It's a sort of fread but with an implicit file handle, the logical file number that was used automatically by openfile.

This is handy for loading structures from the RIFF file. You pass the number of bytes to read (1 to 256, 0 = 256 in this case) in the accumulator, and a pointer to where you want the bytes to be stored via Y/X (high byte/low byte.) We'll see how this is used shortly.

loadnton

There will be times when we need to skip some number of bytes in the file. The standard Commodore DOS doesn't support scanning ahead in the file. Instead, we'll call this, which means "Load N-Bytes to NULL." It takes only the number of bytes to read in the accumulator.
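To show the shape of these two helpers, here is a Python model that reads from a byte stream one byte at a time, the way repeated chrin calls would. It is not the assembly from GeoRAM Digi. Passing the stream and a memory buffer as arguments is my own convention for the example, and applying the "0 means 256" rule to loadnton as well is an assumption.

```python
import io


def loadntoa(stream, count, memory, address):
    """Load N bytes to an address. count is 1..255, and 0 means 256."""
    total = 256 if count == 0 else count
    for i in range(total):
        byte = stream.read(1)            # stands in for one chrin call
        if not byte:
            return i                     # file ended early
        memory[address + i] = byte[0]
    return total


def loadnton(stream, count):
    """Load N bytes to NULL: read them and throw them away."""
    total = 256 if count == 0 else count
    skipped = 0
    for _ in range(total):
        if not stream.read(1):
            break
        skipped += 1
    return skipped


memory = bytearray(64)
stream = io.BytesIO(b"RIFF\x28\x00\x00\x00WAVE")
loadntoa(stream, 4, memory, 0x10)        # chunk ID goes to memory[0x10..0x13]
loadnton(stream, 4)                      # skip the 4 size bytes
loadntoa(stream, 4, memory, 0x20)        # media type goes to memory[0x20..0x23]
```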


The RIFF Wave File Format

In my post from August 2020, I sort of glossed over RIFF, merely linking to the Wikipedia article for the curious reader. It's probably the most common format for uncompressed audio clips. It was created by Microsoft and used extensively in early versions of Windows, before compressed audio like MP3, AAC and others. The Mac and Amiga used other formats, like AIFF, IFF, AU and SND, but because of the commercial success of Windows the RIFF format is much more common.

RIFF stands for Resource Interchange File Format. According to Wikipedia, the RIFF format that Microsoft introduced is more or less a little endian version of IFF that was first introduced in 1985 on… the Amiga!2 So, we need not turn up our noses at using something pushed by Microsoft on Windows.

When I first read a description of the RIFF format, over 20 years ago, I admit that I found it confusing. When I look at it now, I understand how it works, but I still find the way it is presented by some documentation to be confusing. In fact it's quite well done. It is designed to be easy to use and to provide conveniences to the programmer.

Chunked Format

The RIFF format is a container for different types of media. The file is divided into sections called chunks. All chunks begin with the exact same 8-byte header structure. Chunks can also be nested, which sounds more confusing than it is.

Every chunk starts with 8 bytes, a 4-byte identifier and a 4-byte (little endian) unsigned integer. The ID identifies the type of chunk and the 4-byte integer is the size of the chunk's data. The chunk's data immediately follows the 8-byte header. In a nutshell, that's it. That's the whole format.

We can visualize it like this:

RIFF File Chunked Format

Here's the thing though, the data portion of a chunk can be arbitrarily structured. The structure of a given chunk's data is defined by the chunk's ID. Looking at the diagram above, the left column shows the basic structure. The first 4-bytes (in blue) have the ID "RIFF" (all the textual identifiers are in ASCII, not PETSCII. We have to pay attention to that on the C64.) The next 4-bytes are the little endian integer size of the data that follows. In this case it's $28,$00,$00,$00 which is just $28 or 40 bytes. Therefore, the data section (in grey) in that first column is 40 bytes.

The "RIFF" identifier tells us that the whole file is a RIFF file, but it's also the identifier of the first chunk. The integer size tells us how much data there is all the way to the end of the file. GeoRAM Digi doesn't use this first chunk's size info, but it is actually pretty convenient, especially on the C64. Why is that? Because Commodore File Systems don't typically tell us how big a file is down to the exact byte. They give us the number of 254-byte blocks the file occupies on disk. Calculating the exact size of a file to the byte is actually a huge pain, you have to follow block pointers all the way to the final block, then read the partial size on the final block. With a RIFF file it's much easier. Just load in the first 8 bytes. If the first 4 bytes are "RIFF" then the next 4 bytes specify a size that is precisely 8 bytes less than the file's total size, down to the byte. Done.
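Decoding that 8-byte header is a one-liner on a modern machine. The Python below is only an illustration of the layout, using the standard struct module. The function names are hypothetical.

```python
import struct


def parse_chunk_header(header):
    """Split an 8-byte chunk header into its 4-byte ID and little endian size."""
    chunk_id, size = struct.unpack("<4sI", header)
    return chunk_id, size


def total_file_size(riff_header):
    """The RIFF chunk's size covers everything after its own 8-byte header."""
    chunk_id, size = parse_chunk_header(riff_header)
    if chunk_id != b"RIFF":
        raise ValueError("not a RIFF file")
    return size + 8


chunk_id, size = parse_chunk_header(b"RIFF\x28\x00\x00\x00")
file_size = total_file_size(b"RIFF\x28\x00\x00\x00")
```

With the example bytes from the diagram, the data size is 40 and the whole file is 48 bytes.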

Since a chunk's ID defines the structure of its data, how is the "RIFF" chunk structured? The first 4 bytes are a media type ID followed by more data whose structure is defined by that media ID. So that middle column is not a proper chunk, it's the structure of the data of the "RIFF" chunk. Therefore, in the middle column, the 4-byte media ID, plus all the grey data that follows, together make up the 40 bytes of the "RIFF" chunk's data. Thus the third set of 4 bytes in the file is always a media ID, even though the media ID is not part of any chunk header. Why is that? Because the first chunk is always a "RIFF" chunk, and the data segment of the "RIFF" chunk always begins with a 4-byte media ID.

So how is the rest of the data structured, if the media ID is "WAVE" ? It's composed of an arbitrary number of chunks. Each of those chunks is just like every other chunk, it has an 8-byte header. The first 4 bytes are the chunk's ID, the next 4 bytes are the little endian unsigned integer for the length of the data of that chunk which immediately follows the chunk's header. In other words, the structure of one of these sub-chunks is identical to the main containing chunk.

Frankly, other descriptions of RIFF's chunked format are more confusing than they need to be. They often roll the media type ID into a special 12-byte file header, but then they dance around the fact that the size in this special header includes the 4-byte media ID, as though the size includes part of the file header, but not all of it. And then subchunks have a different header format that doesn't include a media type ID. Ackk! That makes it sound way more confusing than it actually is. The Media ID is just part of the data segment of a standard chunk whose ID is "RIFF".

WAVE Media Type's Chunks

The "WAVE" media type ID defines that there will be a "fmt " chunk and a "data" chunk. And that the "fmt " chunk will come before the "data" chunk. But there could be other chunks in there too.

Don't confuse the "data" chunk, (which is a proper chunk whose ID is "data",) with the data segment (in grey) of a chunk. As you can see in the diagram's rightmost column, there are two proper chunks. Each has a 4-byte ID, each has a 4-byte size, and each has a grey data segment whose length is declared by the chunk header's 4-byte size.

Note that it is impossible for the size of the sub-chunks to be greater than the size of the "RIFF" chunk. If the numbers indicated that, then the file would be invalidly structured.

The "fmt " chunk contains the metadata that defines the properties of the audio data. And the "data" chunk contains the audio data itself. So how is the "fmt " chunk's data structured? It is defined as exactly 16 bytes, and its header also states that it is 16 bytes.

Field                Size      Value
Audio Format         2 Bytes   1 = Uncompressed PCM
Audio Channels       2 Bytes   1 = Mono, 2 = Stereo
Sample Rate (Hz)     4 Bytes   11025, 22050, 44100, etc.
Byte Rate            4 Bytes   Bytes of data per second of audio
Block Align          2 Bytes   Sample Size (in bytes) * Audio Channels
Sample Size (Bits)   2 Bytes   8 = 8-bits, 16 = 16-bits

As always, the multi-byte numbers are little endian. Audio format is a 16-bit number. In order for it to be "uncompressed PCM" the first of those 2 bytes must be $01 and the second must be $00. The sample rate is a 32-bit little endian integer for the sample frequency in Hz. Therefore, 11025Hz would appear in this field as $11,$2B,$00,$00. And 22050Hz would appear as $22,$56,$00,$00.
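The six fields map directly onto a little endian struct. Here is a Python sketch of that layout. The function name parse_fmt and the dictionary keys are my own labels for the fields in the table, not names from GeoRAM Digi.

```python
import struct

FMT_FIELDS = ("audio_format", "channels", "sample_rate",
              "byte_rate", "block_align", "bits_per_sample")


def parse_fmt(data):
    """Decode the 16-byte data segment of a PCM "fmt " chunk."""
    values = struct.unpack("<HHIIHH", data)     # 2+2+4+4+2+2 = 16 bytes
    return dict(zip(FMT_FIELDS, values))


# 8-bit mono PCM at 11025Hz, built by hand for the example.
example = struct.pack("<HHIIHH", 1, 1, 11025, 11025, 1, 8)
info = parse_fmt(example)
rate_bytes = example[4:8]       # the sample rate as it sits in the file
```

The rate_bytes slice shows the same $11,$2B,$00,$00 byte order described above.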

The byte rate is actually pretty cool. It's a precomputed number that represents the number of bytes per second of audio data. This can be calculated as Sample Size (in bytes) * Audio Channels * Sample Rate. It's precomputed as a convenience. We'll see how to use it when we implement audio skipping.
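As a rough preview of why the byte rate is handy for skipping, here is the arithmetic in Python. This is a guess at the general approach, not the routine GeoRAM Digi uses. The function names and the clamping to the start and end of the clip are assumptions.

```python
def byte_rate(bits_per_sample, channels, sample_rate):
    """Bytes of audio data per second of playback."""
    return (bits_per_sample // 8) * channels * sample_rate


def skip(position, seconds, rate, length):
    """Move a playback position by a number of seconds, staying inside the clip."""
    target = position + seconds * rate
    return max(0, min(target, length))


stereo_8bit = byte_rate(8, 2, 22050)
back_5 = skip(100000, -5, 11025, 500000)       # skip back 5 seconds
forward_10 = skip(490000, 10, 11025, 500000)   # skip forward 10 seconds, near the end
```

A skip is just the byte rate times the number of seconds, added to or subtracted from the current position.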

Audio Channels is typically 1 for mono or 2 for stereo, but it could be more than 2. GeoRAM Digi only supports 1 or 2 channels. If the file contains more than 2 channels, an error will be output and the program will exit.

Sample Size is the number of bits per sample. Typically this will be 8 or 16. If any other value appears here an error will be output and the program will exit. Since the DigiMAX only supports 8-bit output it can't actually make use of 16-bit samples. 16-bit samples are very common though, we don't want to throw an error that says, "Sorry 16-bit samples not supported." So GeoRAM Digi downsamples them to 8-bit at load time (there is no time to do this during playback). This has the side benefit of occupying only half as much RAM, although at a reduced quality. We'll see how to do this and also how to make the most of the byte that gets thrown away.
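One common way to reduce 16-bit samples to 8-bit is to keep only the high byte. In WAVE files, 16-bit samples are signed and 8-bit samples are unsigned, so the high byte also needs its top bit flipped. The Python below sketches that under the assumption that the player wants unsigned 8-bit values. It is not the load routine from GeoRAM Digi, and the function name is hypothetical.

```python
def sixteen_to_eight(data):
    """Convert little endian signed 16-bit samples to unsigned 8-bit samples."""
    out = bytearray()
    for i in range(0, len(data) - 1, 2):
        # data[i] is the low byte, which is simply dropped here.
        high = data[i + 1]
        out.append(high ^ 0x80)      # signed -> unsigned: flip the top bit
    return bytes(out)


# Three samples: -32768 (lowest), 0 (silence), 32767 (highest).
converted = sixteen_to_eight(b"\x00\x80\x00\x00\xff\x7f")
```

The three samples come out as $00, $80 and $FF, so silence sits at the midpoint just as it does in native 8-bit data.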


Parsing the RIFF File Format

Now we know how a RIFF file is structured into chunks and sub-chunks. We need to load the data into structures in memory. To do this, we'll need appropriately sized blocks of memory. These can be statically allocated, no need for any fancy dynamic memory allocation.

The first four labels above are what we should expect to find at certain places in the RIFF file. These will be used to compare against the actual data to make sure we in fact are dealing with a RIFF/WAVE file.

Remember, the textual IDs are in ASCII. I code natively in TurboMacroPro, and the strings typed into the source code on a C64 are in PETSCII. In the samples above, the source code has been migrated to a Mac and converted to ASCII. It's important that however the code gets assembled, the filetype string (the main chunk's ID) must be "RIFF" in all caps in ASCII. It so happens that the PETSCII lowercase letters map to ASCII's uppercase letters. So, in the C64 source we can put "riff" with a note that this is "RIFF" in ASCII.

Same goes with the media type, (above labeled as datatype,) the "WAVE" must be all caps in ASCII. The IDs of the two sub-chunks, however, are in lowercase ASCII. This is Block 4 of PETSCII, which is undefined. (See Commodore 64 PETSCII Codes for reference.) I've put in the byte values manually with a comment about what they represent.
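If you need the raw byte values for your own source, they are easy to generate on a modern machine. This little Python helper (the name ascii_hex is made up) prints each ID as assembler-style hex bytes.

```python
def ascii_hex(text):
    """Return the ASCII byte values of text as $xx,$xx,... for pasting into source."""
    return ",".join(f"${byte:02X}" for byte in text.encode("ascii"))


ids = {name: ascii_hex(name) for name in ("RIFF", "WAVE", "fmt ", "data")}
```

Note that "fmt " ends with a space ($20), which is part of the 4-byte ID.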

To load the data into the structures:

We use our data loading helper routines. openfile will open sample.wav from the current device and configure the input channel to input data from it.

loadntoa is our version of fread. Load a pointer to rifffile into X/Y, put the number of bytes to read (12) in the accumulator and call loadntoa. That should load the main chunk's ID, the main chunk's size, and, assuming the file is a RIFF file, the first 4 data bytes of the main chunk, which will be the data type (or media type, whatever you want to call it).

Now we compare filetype to rifffile, 4 bytes each. If they differ, this is not a RIFF file! Load #$00 (Error code) into the accumulator and JMP (not JSR) to errout. This will print the $00 error message, close the file, and exit the program.

We'll keep going, progressing always to the next step, until any unhandlable condition is encountered.

Although the RIFF chunk's size could be useful, GeoRAM Digi doesn't need it. Next we compare datatype to rifftype, 4 bytes each. If they differ, this RIFF file does not contain WAVE media. This results in the same error, $00, which generically prints out "Unrecognized File Type". Yeah it's RIFF but it's not RIFF/WAVE.
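For readers who want to try the same checks off the C64, here is a minimal Python sketch of the header test. It is only a model of the logic described above, not the 6502 code from GeoRAM Digi; the function name check_riff_header is made up for illustration, and the single error message stands in for error code $00.

```python
import struct

def check_riff_header(header: bytes) -> int:
    """Validate the 12-byte RIFF/WAVE header and return the RIFF chunk size."""
    if len(header) < 12:
        raise ValueError("Unrecognized File Type")
    # 4-byte ASCII ID, 32-bit little-endian size, 4-byte ASCII media type.
    chunk_id, chunk_size, media_type = struct.unpack("<4sI4s", header[:12])
    if chunk_id != b"RIFF":       # models the filetype vs. rifffile compare
        raise ValueError("Unrecognized File Type")
    if media_type != b"WAVE":     # models the datatype vs. rifftype compare
        raise ValueError("Unrecognized File Type")
    return chunk_size

header = b"RIFF" + struct.pack("<I", 36) + b"WAVE"
print(check_riff_header(header))
```

Both failed comparisons raise the same error, mirroring how both cases share the generic "Unrecognized File Type" message.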

WAVE Sub-Chunks

Because the media type is WAVE, we know that a series of proper chunks will follow, all the way to the end of the file. We can expect to find a "fmt " chunk, and we can expect to find a "data" chunk, and in that order. But we cannot assume that because the "fmt " chunk is an "8-byte header + 16-byte data segment" that therefore the "data" chunk must begin after precisely 24 bytes. Why not? Because the RIFF format permits additional chunks for other metadata.

In fact all of the sample .wav files I have (which are linked to at the end of this post,) have an additional "LIST" chunk. The LIST chunk's data segment, in turn, is composed of a series of sub-sub-chunks. However, because the LIST chunk is a proper chunk, it has a size for its entire data segment regardless of how it is further subdivided. Thus, to skip any chunk whose ID we don't recognize we just use loadnton to "Load to NULL" the size of the chunk's data.

We know the "fmt " chunk will come before the "data" chunk, and we don't care about any chunks that may come after the "data" chunk, therefore we'll loop and continue to process chunks until we've processed the "data" chunk. At that point we have everything we can use, and we'll just close the file and move on.

So let's see this in code:

In memory we only have a single 8-byte subchunk header structure. Each time we load in a new chunk header we'll load it over top of the old one. So that's the first step, loadntoa 8 bytes to subchnk.

We have our comparison strings for the "fmt " chunk ID and the "data" chunk ID. Compare this chunk's ID to fmt_chnk. If it's a match, we'll load and process its data, and then loop to fetch the next chunk. Otherwise, we compare it to datachnk. If it's a match, we'll load and process its data and then leave the chunk processing loop by JMPing to datadone. If it's not a "data" chunk, well then we don't really care what kind of a chunk it is, we don't support it. The unknown chunk could be any size, so we loop, calling loadnton as many times as necessary while decrementing the chunk size, until all of it has been skipped. Then we loop back up to next to fetch and process the next chunk.
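The shape of that loop can be sketched in Python as follows. This is an illustration under assumptions, not a port: walk_chunks and make_chunk are invented names, and the 255-byte limit per skip call assumes loadnton takes its byte count in the accumulator the way loadntoa does. The sketch also ignores the pad byte that RIFF adds after odd-sized chunks.

```python
import io
import struct

def walk_chunks(f):
    """Read sub-chunks until "data" is handled; skip any unknown chunk."""
    fmt = None
    skipped = []
    while True:
        header = f.read(8)                 # one 8-byte header, reused each time
        if len(header) < 8:
            raise ValueError("no data chunk found")
        chunk_id, size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = f.read(size)
        elif chunk_id == b"data":
            return fmt, f.read(size), skipped
        else:
            skipped.append(chunk_id)
            while size:                    # "load to NULL" in small pieces
                step = min(size, 255)      # assumed 8-bit count per call
                f.read(step)
                size -= step

def make_chunk(chunk_id, payload):
    return chunk_id + struct.pack("<I", len(payload)) + payload

body = (make_chunk(b"fmt ", bytes(16))
        + make_chunk(b"LIST", bytes(300))
        + make_chunk(b"data", b"\x80\x81\x82\x83"))
fmt, data, skipped = walk_chunks(io.BytesIO(body))
print(len(fmt), data, skipped)
```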

Above, I omitted the loading and processing of the "fmt " data because it's a bit long. And I wanted the overall structure of loading chunks to be clearer. Below is just the code for handling the "fmt " chunk. But remember, it's inlined. So if it does a JMP errout, that will exit the program.

The specification defines the "fmt " chunk as 16 bytes. But the RIFF specification also indicates that the chunk size is in the chunk header. Therefore, to load a fixed 16 bytes or to load the number of bytes specified in the chunk header should be the same. I've opted to load the number of bytes in the chunk size. But I'm still making assumptions, for example, I don't bother to check the higher chunk size bytes, because they should all be zero. And, I've only statically allocated enough room for 16 bytes. If the "fmt " chunk's size is >16 bytes, some very bad things will happen, as there will be a buffer overflow.

The safer thing to do here is to confirm that the 3 higher size bytes are all zero, and that the low byte is 16. And if that's not true, to exit with some error message. This is a suggestion for improvement for anyone who might use these routines.

With that caveat, we'll loadntoa the low size byte (16 bytes) to our memory structure that starts at audfmt. Each element of the format structure has its own label. It is now a matter of testing each element to confirm that it falls within supported parameters. Any format element that isn't supported will generate a unique error message so the user knows what went wrong.

Test the audio format for $0001, Uncompressed PCM data.

Test the channels for $0001 or $0002, mono or stereo.

Now, along the way, let's spit out the relevant format details to the user. The relevant details are the ones for which we support variation. I don't bother to say that it's PCM data, because if it's not PCM data the error message will reveal that. GeoRAM Digi supports different channels, sample rates and sample sizes. A series of strings for these options are defined thus:

If we detected 1 channel, we'll load a pointer to str1chn in Y/A and JSR to $AB1E. We could also JSR to strout, but it outputs a trailing carriage return. Thus, strout is to output a full line, whereas the more primitive $AB1E is to output a partial line.

Next, starting at line 40, we need to verify that we support the sample rate. This part gets a bit tricky. So let's pause here and discuss sample rates for a minute.


Handling Digital Audio Sample Rates

In a nutshell, the sample rate is the number of times per second the analog waveform was sampled. The higher the sample rate, the more information about the original waveform got captured and the more accurately it can be reproduced. To reproduce the original waveform the digital samples must be sent to the DigiMAX at as close as possible to the rate at which they were captured.

Sample rate diagram.

To do this on the Commodore 64 we'll use a CIA timer. The timer counts down the time between the output of two samples. We need to know what timer values to use for the sample rates we support. We'll return later to configuring the CIA timer.

We'll set the timer with the starting 16-bit value. It counts down by 1 on every CPU clock cycle until it reaches zero. When it reaches zero it generates an interrupt. The CIA then resets the timer to the starting value automatically and continues counting down regardless of what the CPU is doing to respond to the interrupt. This is one of the most reliable ways to produce a stable and accurate repeating event.

The length of time between samples is very short, so the timer's high byte will always be zero. Each sample rate that we support will need a CIA timer low byte value. One wrinkle exists. NTSC and PAL Commodore 64's don't run at the same clock speed, and therefore the CIA timers count down at different rates, slightly slower on PAL machines than on NTSC. We divide the exact clock frequency of the machine by the sample rate, to find out how many clock cycles will occur between samples.

The NTSC clock frequency is 1022727Hz, slightly over 1MHz. The PAL clock frequency is 985248Hz, slightly under 1MHz. For a sample rate of 11025Hz, we get this:

NTSC: 1022727 / 11025 = ~92.76 cycles per sample

PAL: 985248 / 11025 = ~89.36 cycles per sample

We could put the clock frequencies into constants and do the division in the program. But there are only a few sample rates in common usage, and doing division on the 6502 is kind of a pain. We'll just precompute the values for 4 common sample rates, 8000Hz, 11025Hz, 16000Hz and 22050Hz on both NTSC and PAL. And then put them in a table, as follows:

Each table entry consists of 8 bytes. 4 bytes for the sample rate to be compared against the 4-byte sample rate in the RIFF format data. The comparison is easier when these are the same length, even though the upper two bytes in our table are always $00, $00.

The next 2 bytes are an 8-bit timer value for NTSC and an 8-bit timer value for PAL. And the last two bytes are a pointer to a string representation of this sample rate.
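As a cross-check on the arithmetic, the timer bytes for the four rates can be worked out with a few lines of Python. Rounding to the nearest whole cycle is an assumption made here for illustration; the values actually stored in GeoRAM Digi's table may differ slightly.

```python
NTSC_HZ = 1022727   # NTSC C64 clock frequency
PAL_HZ = 985248     # PAL C64 clock frequency

def cycles_per_sample(clock_hz, sample_rate):
    """Exact (fractional) number of CPU cycles between two samples."""
    return clock_hz / sample_rate

for rate in (8000, 11025, 16000, 22050):
    ntsc = cycles_per_sample(NTSC_HZ, rate)
    pal = cycles_per_sample(PAL_HZ, rate)
    # round() picks the nearest whole cycle count (an assumed policy).
    print(f"{rate:>5}Hz  NTSC {ntsc:6.2f} -> {round(ntsc):3d}"
          f"   PAL {pal:6.2f} -> {round(pal):3d}")
```

Every result fits in a single byte, which is why the table only needs an 8-bit timer value per machine type.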

Back to RIFF "fmt " testing

To test if the RIFF file's sample rate is one that GeoRAM Digi supports, we loop over the sample rate table looking for a match. To extend GeoRAM Digi to support additional sample rates, (like, 4000Hz, 6000Hz, 9000Hz, etc.) add these entries to the sample rate table, add the extra strings for outputting the sample rate to the screen, and make sure the code loops enough times to check all the entries in the table.

Load a zero page pointer with the address of the srtab. Test the first 4 bytes in the table against the format's samprate property. If they don't match increment the pointer to the next table entry. This is done by adding 8 (the size of a table entry) to the zero page pointer. After testing all the table entries if no match is found, we must JMP to errout with the message code $03, "Unsupported Sample Rate."

If a match is found, then the zero page pointer points at the correct table entry. Pull from that pointer the correct timer value, offset 4 for NTSC or offset 5 for PAL. The C64's KERNAL startup/initialization routines determine whether the machine is NTSC or PAL automatically by testing if the VIC-II's raster register reaches a rasterline that only exists on a PAL machine before returning to zero. The result of this test is stored at $02A6. 0 = NTSC, 1 = PAL.

We only need to lookup this CIA timer value once, and save it in a common variable called delay. When it comes time to configure the CIA timer, we can get the value from delay and no longer need to worry about sample rates, clock speeds, or NTSC/PAL differences.
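The table walk can be modelled in Python like this. The entry layout follows the description above (4-byte rate, NTSC byte, PAL byte, 2-byte string pointer), but the timer values are illustrative roundings and the string pointers are zero placeholders, so treat the table contents as hypothetical.

```python
import struct

# (sample rate, NTSC timer byte, PAL timer byte, string pointer placeholder)
ENTRIES = [
    (8000, 128, 123, 0x0000),
    (11025, 93, 89, 0x0000),
    (16000, 64, 62, 0x0000),
    (22050, 46, 45, 0x0000),
]
ENTRY_SIZE = 8
SRTAB = b"".join(struct.pack("<IBBH", *entry) for entry in ENTRIES)

def find_delay(samprate: bytes, is_pal: int) -> int:
    """Return the CIA timer low byte; is_pal mirrors $02A6 (0 = NTSC, 1 = PAL)."""
    ptr = 0                                   # stands in for the zero page pointer
    for _ in range(len(ENTRIES)):             # loop once per table entry
        if SRTAB[ptr:ptr + 4] == samprate:    # compare all 4 rate bytes
            return SRTAB[ptr + 4 + is_pal]    # offset 4 for NTSC, 5 for PAL
        ptr += ENTRY_SIZE                     # step to the next entry
    raise ValueError("Unsupported Sample Rate")

delay = find_delay(struct.pack("<I", 11025), is_pal=0)
print(delay)
```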

From the sample rate table pointer, we can also pull out the string pointer, and JSR to $AB1E to output the human readable sample rate ("8000Hz", "11025Hz", etc.) to the screen.

The last format property we need to worry about is the number of bits per sample.

Test the sampbits for $0008 or $0010, 8-bit or 16-bit samples, respectively. If any other sample size than these, we will have to exit with an error, $04, "Unsupported Sample Size." The vast majority of .wav files will be either 8- or 16-bit. We will also print out a string about the detected sample size. This can be done with a JSR to strout so it'll end our metadata output with a carriage return.
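Pulling the "fmt " tests together, a rough Python equivalent might look like the following. The field order is the standard 16-byte WAVE format layout. The key blockalign and the first two error strings are assumptions made for this sketch; only the sample rate and sample size messages are quoted from the text above.

```python
import struct

SUPPORTED_RATES = (8000, 11025, 16000, 22050)

def check_fmt(fmt: bytes) -> dict:
    """Unpack a 16-byte "fmt " data segment and reject unsupported values."""
    if len(fmt) != 16:
        raise ValueError("fmt chunk is not 16 bytes")
    audfmt, channels, samprate, byterate, blockalign, sampbits = \
        struct.unpack("<HHIIHH", fmt)
    if audfmt != 0x0001:                       # uncompressed PCM only
        raise ValueError("Unsupported Audio Format")    # hypothetical message
    if channels not in (1, 2):                 # mono or stereo
        raise ValueError("Unsupported Channel Count")   # hypothetical message
    if samprate not in SUPPORTED_RATES:
        raise ValueError("Unsupported Sample Rate")
    if sampbits not in (8, 16):
        raise ValueError("Unsupported Sample Size")
    return {"channels": channels, "samprate": samprate,
            "byterate": byterate, "sampbits": sampbits}

info = check_fmt(struct.pack("<HHIIHH", 1, 2, 22050, 88200, 4, 16))
print(info)
```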


Loading RIFF Audio Data

If we've made it through all the format testing without exiting with an error, we will eventually get to the "data" chunk. Remember, this is a proper chunk whose ID is "data" and has a 32-bit size of the chunk's data segment.

By the time we get to the "data" chunk, the "fmt " chunk has already been read in. That means we already have the sample size, and because of the testing we know it's either 8 or 16 bits. The DigiMAX is an 8-bit audio device. So if our samples are 8-bit we can simply load them into memory. But if the samples are 16-bit we will downsample them to 8-bit first. Let's start with the simple case, 8-bit samples. But first, let's think about how to manage GeoRAM's memory paging.

Managing GeoRAM Memory Paging

We're ready to load the audio data into memory. The file pointer is positioned right at the beginning of the audio sample data.

The trick now is, where to put the data. Most probably, it's going to be hundreds of kilobytes or even megabytes of data. Somehow, we have to funnel it into our GeoRAM in an orderly fashion. To help us out we'll use a few memory management routines to abstract the addressing and paging.

Recall now the discussion at the beginning of this post about the unusual 16K bank addressing of GeoRAM, and then the later discussion about detecting its capacity. We configured a variable called maxmem as a full 32-bit value of the total memory capacity, imagining that it has regular 64K banks.

The RIFF/WAVE "data" chunk has a data size that is a 32-bit number. It is easy to compare this against the amount of memory we have because maxmem is in the same format as the data chunk's size.

While loading in data, we want to load page after page after page without worrying about any strange missing-bits/discontiguous addressing scheme. The first routine is setrampg (Set RAM Page, which, every time I say it, I think, "Set Rampage!"). maxmem is 32-bit, but the GeoRAM is only 22-bit, which we address with 3 bytes. The lo-byte is the index into the $DExx page, and the mid-byte and high-byte are written to $DFFE and $DFFF respectively.

Set RAM Page takes the upper 16 bits of the conceptually contiguous range. (X for the mid-byte, A for the high-byte.) It shifts those into the GeoRAM's discontiguous upper 14 bits, as we saw earlier, by rolling the 8 bits, shown below in green, up by 2 bit positions. Like this:

Normalizing GeoRAM's Addressing Scheme.

X, the mid-byte, is written to a zero page value. Then we ASL (Arithmetic Shift Left) that mid-byte, which pushes its upper bit into the carry. Then we ROL (Rotate Left) the accumulator. That shifts the high-byte's bits left and draws the carry bit in. Repeat this once more, and now .X and .A are in the discontiguous scheme required by GeoRAM, as shown in the bottom row above.3

This routine completely abstracts the GeoRAM's unusual addressing scheme. It treats it like an ordinary 24-bit addressing space, (maxed out, of course, at 4MB or 2MB, or however much RAM is physically available.)

To make things one step easier, I added two more memory helper routines: cnframpg (Config RAM Page) and incrampg (Increment RAM Page). Config RAM Page takes the same arguments as Set RAM Page, but it saves them locally before falling through to Set RAM Page. After calling Config RAM Page once, subsequent calls to Increment RAM Page don't require any arguments to be passed in. The locally stored, normalized mid- and high- bytes are pulled, incremented in-place, and passed on to Set RAM Page. You just call incrampg repeatedly and don't worry about what values are being set where. You just know that the page visible at $DExx is the next one available. Beautiful.
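To make the bit shuffling concrete, here is a small Python model of the three routines. It is a sketch of the behavior described above, not the assembly itself. The class name is invented, and masking $DFFE to its low 6 bits is an assumption about how the hardware treats the two unused bits.

```python
class GeoRAMPager:
    """Models setrampg, cnframpg and incrampg on a pretend GeoRAM."""

    def __init__(self):
        self.mid = 0      # saved mid-byte of the contiguous address
        self.high = 0     # saved high-byte of the contiguous address
        self.dffe = 0     # last value written to $DFFE
        self.dfff = 0     # last value written to $DFFF

    def setrampg(self, mid, high):
        tmp, a = mid, high
        for _ in range(2):
            carry = tmp >> 7                 # ASL: top bit falls into carry
            tmp = (tmp << 1) & 0xFF
            a = ((a << 1) | carry) & 0xFF    # ROL: carry enters bit 0
        self.dffe = mid & 0x3F               # assumption: only 6 bits are used
        self.dfff = a

    def cnframpg(self, mid, high):
        self.mid, self.high = mid, high      # save, then fall through
        self.setrampg(mid, high)

    def incrampg(self):
        self.mid = (self.mid + 1) & 0xFF     # 16-bit increment in place
        if self.mid == 0:
            self.high = (self.high + 1) & 0xFF
        self.setrampg(self.mid, self.high)

pager = GeoRAMPager()
pager.cnframpg(0x00, 0x00)
for _ in range(64):
    pager.incrampg()
print(pager.dffe, pager.dfff)
```

After 64 increments the 64-page (16K) block is used up, so the page register wraps to 0 and the block register moves on to 1.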

Loading 8-Bit Audio Data

Now we can use this to start loading in the 8-bit audio data.

The first problem is that the data size may exceed available memory. It's a simple check by comparing subchsz against maxmem (all 4 bytes). If subchsz is bigger than maxmem, we will copy maxmem over top of subchsz, and then output the message "Memory Limit, Sample Data Truncated".

Whether or not subchsz was reduced, it gets copied again to datasz. Data Size will remain intact so we know how much audio data there is. subchsz, meanwhile, will get decremented as the data is loaded in, so we know when it reaches zero that we're finished loading data.

Starting at line 65, above, we kick off the process by calling cnframpg with $00 and $00.

Although subchsz comes from the RIFF file as a 32-bit number, we now know that it fits in 24 bits: if it was greater than maxmem it got reduced to maxmem, and maxmem can't be larger than our GeoRAM's detected capacity.

The low byte, subchsz+0, holds some arbitrary number of bytes less than 256, we'll call this "One Partial Page." Additionally, the two high bytes, subchsz+1 and subchsz+2, hold the number of "Full Pages." The loading will be done then in two phases, all the full pages first. After each full page we call incrampg to move GeoRAM to its next page, and we do a 16-bit decrement on subchsz+1 and subchsz+2. Repeat to load in the next full page.

When the number of full pages reaches zero, we'll leave the full page loop and read in the last partial page, close the file and we're done. All data has been loaded into the GeoRAM.

A nice addition is a progress meter so the user sees something going on. Load times are not exactly fast coming over the IEC bus. During the loop that loads full pages, we're decrementing subchsz+1. Each of these represents 256 bytes. If we mask away the upper 6 bits, the result will be zero every 4th page, or, every 1 kilobyte. Masking the upper 6 bits is basically a modulus 4. We can write out to screen a single character every kilobyte, just to show progress.
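The bookkeeping for the 8-bit load can be traced with a short Python sketch. The function name load_plan is invented, the actual reading of bytes is left out, and testing the mask right after the decrement is an assumption about where in the loop the check sits.

```python
def load_plan(subchsz, maxmem):
    """Return (datasz, full_pages, partial, dots, truncated) for an 8-bit load."""
    truncated = subchsz > maxmem
    if truncated:
        subchsz = maxmem               # "Memory Limit, Sample Data Truncated"
    datasz = subchsz                   # stays intact for later use
    partial = subchsz & 0xFF           # subchsz+0: one partial page
    full_pages = subchsz >> 8          # subchsz+1 and subchsz+2: full pages
    remaining = full_pages
    dots = 0
    while remaining:
        # ... load 256 bytes into $DExx, then incrampg ...
        remaining -= 1
        if remaining & 0x03 == 0:      # low 2 bits zero: every 4th page, 1KB
            dots += 1
    return datasz, full_pages, partial, dots, truncated

print(load_plan(2600, 2 * 1024 * 1024))
```

A 2600 byte sample is 10 full pages plus 40 bytes, and the meter ticks once per kilobyte of full pages loaded.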


Loading 16-Bit Audio Data

Loading the 8-bit samples was very straightforward. Load a sample, stick it in memory, loop for a full page, increment the GeoRAM page, and carry on.

What happens if the samples are 16-bit? The DigiMAX can only output in 8-bit, so loading the full 16-bit samples into memory would not only be a waste of memory but an added burden dealing with 16-bit data in memory at playback time. It would be much better to convert them to 8-bit at load time. Thus, load 2 bytes from the file, mix them together down to an 8-bit sample and write only 1 byte into memory.

Converting 16-bit Digital Audio to 8-bit

There are two ways to convert a 16-bit number into an 8-bit number, and which way you use depends on what the data represent. Imagine you have a series of 16-bit numbers that hold the number of employees at businesses of various sizes. But for some reason you need to reduce to an 8-bit number. You have to decide whether you want to keep the most significant byte (the high byte) or the least significant byte (the low byte.) Which to keep is not 100% straightforward. If 99% of the businesses in your list have fewer than 256 employees, but you keep only the high byte, then you lose basically all of the information, reducing every company to 0, which means "less than 256." On the other hand, if you keep only the low byte, then you know down to the individual person how many employees there are, but only if there are fewer than 255. Otherwise, if there are more than 254, (aka, if the high byte is >0) you could set the low byte to 255 and take that to mean "more than 254."

There is nothing inherently wrong with either way. If you keep the low byte, you get absolutely perfect precision, but within a much smaller range and with the risk that some values will be saturated. If you keep the high byte, you maintain the magnitude of the original value but with a loss of precision.

When it comes to digital audio samples, we want to maintain the magnitude of the original value, and are willing to suffer the loss of precision because our memory is limited and our audio hardware can't handle the higher precision anyway.

The simple way to downsample then is to just toss the low byte and store the high byte. The low byte might be almost full, though, which means the high byte is much closer to +1 than it is to +0. The obvious way to deal with this is just basic rounding. If the low byte is greater than 127, increment the high byte. This works fine for some applications, but it also creates a sharp edge. If the low byte happens to be fluctuating around 127-128, then what was originally a small difference gets magnified, and this results in an audible noise.

The better solution is to do something called Audio Dithering. I can't really do justice to an explanation of how or why it works, but you can read more about it here: What is Audio Dithering? And Why it’s Used on Mastering.

Diagram of the effects of audio dithering on quantization errors.

The gist is that you can improve the quality of the downsampling by reducing the effects of quantization errors by introducing noise to the signal. Sounds fancy, right? It's super easy to do. Rather than hard rounding the low byte at 127, you take the low byte and add to it a random 8-bit number. If combined they overflow 8-bits, then you increment the high byte. The result is that, sometimes even if the low byte is quite large, like say, 250, you could still by chance pull a very small random number, like 2, and combined they are just 252. That's less than 256 so you wouldn't increment the high byte, even though the low byte was really big. But in the same way, there will be times when the low byte is quite small, say 5, but you pull a random number that by chance is really big, like 252. Combined they exceed 255, and the high byte gets incremented, even though the low byte was very small. It sounds weird, but when applied over the whole set of samples, the effect improves the quality. This is why the technique is used by professional tools when downsampling to master to, say, CD.
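The two examples above, and the reason the trick works on average, can be checked with a few lines of Python. This sketch treats the 16-bit value as a plain unsigned number and takes the random byte as a parameter so the result is repeatable; the function names are made up for illustration.

```python
def round_to_high_byte(value16):
    """Hard rounding: bump the high byte when the low byte is over 127."""
    return (value16 >> 8) + int((value16 & 0xFF) > 127)

def dither_to_high_byte(value16, noise):
    """Dithering: bump the high byte when low byte + noise overflows 8 bits."""
    low = value16 & 0xFF
    high = value16 >> 8
    carry = int(low + noise > 0xFF)
    return high + carry

# The two cases from the text: 250 + 2 stays put, 5 + 252 carries.
print(dither_to_high_byte(0x12FA, 2), dither_to_high_byte(0x1205, 252))

# Across every possible noise byte, a low byte of N carries exactly N times,
# so the average output equals the original value divided by 256.
total = sum(dither_to_high_byte(0x12FA, n) for n in range(256))
print(total == 0x12FA)
```

Hard rounding always maps a given input to the same output, while dithering spreads the rounding decision out in proportion to the low byte.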

Let's now see how we can implement audio dithering in our load16bit routine.

The overall structure of load16bit is the same as load8bit, but with a few key differences. The size of the data in memory will ultimately be just half the size of the data in the RIFF file. Therefore, to know whether the sample data will need to be truncated we will divide subchsz by 2, and copy that size into datasz. Then we'll compare the new datasz to maxmem. And if it's still bigger, then maxmem will get copied back to both subchsz and datasz.

There is one other thing to take care of. The format property's byte rate is the number of bytes per second of audio. Because we're cutting the data in half, we need to divide the byte rate in half too. This will become important later on when we implement playback controls.
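In Python terms, the size adjustments for the 16-bit case come down to something like this. The function name is hypothetical and the numbers in the example are arbitrary.

```python
def plan_16bit_load(subchsz, byterate, maxmem):
    """Halve the sizes for 16-bit input, then clamp to available memory."""
    subchsz //= 2                  # one output byte per 2 input bytes
    datasz = subchsz
    if datasz > maxmem:            # still too big: truncate both
        subchsz = datasz = maxmem
    byterate //= 2                 # bytes per second of 8-bit audio in memory
    return subchsz, datasz, byterate

# 6MB of 16-bit mono at 22050Hz, loading into a 2MB GeoRAM.
print(plan_16bit_load(6 * 2**20, 44100, 2 * 2**20))
```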

We'll output a message to screen indicating that the data is being loaded and downsampled.

For our source of random numbers we'll use the SID's noise voice. Noise can be generated on voice 3. Set the voice 3 frequency to the maximum by writing $FF to $D40E and $D40F. Then activate the noise waveform on voice 3 but don't actually turn the voice on. $D412 controls voice 3. Bit7 sets the noise waveform and bit0 turns the voice on or off. Write %10000000 to $D412 to generate the noise data, without actually playing the noise out loud.

Once the noise is being generated, you can sample a pseudo-random 8-bit number by reading from the SID register at $D41B.

We now proceed to load in the full pages. subchsz has already been divided by 2, so we need to read 2 bytes from the data file on each loop. The downsampling and dithering could not be easier. Remember, RIFF data is little endian. So the first byte read is the 16-bit sample's low byte. Clear the carry and add (ADC) directly from $D41B. If the two values combined overflow, the carry will go high. Thus, the carry contains our dither bit. Push the processor flags to the stack to preserve the dither bit.

Next we load the second byte from the file, that's the sample's high byte. Restore the dither bit by pulling the processor flags from the stack and add with carry (ADC) #0. If the carry is high, the high byte gets incremented. Supporting 16-bit downsampling—even including audio dithering—is so easy, there is no excuse not to.

There is one more complication with handling 16-bit samples.

Converting Signed to Unsigned in Two's Complement

In RIFF/WAVE files, 8-bit samples are unsigned ints, 0 to 255, but 16-bit samples are signed ints, ranging from -32768 to 32767. We need to convert the 16-bit signed ints to 8-bit unsigned ints.

First, the conversion from 16-bit to 8-bit. The sign-bit is always the highest bit of the highest byte. And because our conversion to 8-bit keeps the high byte, this converts from signed 16-bit to signed 8-bit. In the new 8-bit sample, the dithering affects the lowest bits and the sign-bit remains the highest bit. Now we're converting a signed 8-bit to an unsigned 8-bit int.

Just as there was more than one way to convert a 16-bit number to an 8-bit number, there is more than one way to convert a signed int to an unsigned int, and neither way is canonically the right way. It depends again on what the data represent.

Suppose you have a signed int that is being used to hold either dollars had (positive value) or dollars owed (negative value.) And then for some reason you have to convert that to an unsigned int. If the original value was negative, you have a problem, because there is no way to represent a negative value in the unsigned int. If it's a positive value, on the other hand, there is nothing even to change, because the bit pattern of a positive signed int (0 to 127, 00000000 to 01111111) is identical to the bit pattern for that value range in an unsigned int. The conversion then consists of checking the high bit. If it's set, you have a problem that needs to be dealt with by the business logic. If it's unset, there is nothing else to do.

If you were going in the opposite direction, from unsigned to signed, you have precisely the same problem from the bit pattern perspective, but with a different semantic interpretation. The unsigned int can range from 0 to 255. If the value is from 0 to 127, conversion is no problem, and no bits need to change. But if the value is from 128 to 255, now you have a problem, because there is no way to represent a value that high in a signed int.

But there is another way altogether to convert signed to unsigned and vice versa, if the value represents a magnitude and you want to preserve that magnitude but on a different scale. In that case, a signed 8-bit int has a range of values, the smallest value is -128, the midpoint is approximately 0, and the largest value is 127. An unsigned 8-bit int has a range of values, the smallest value is 0, the midpoint is approximately 128, and the largest value is 255. A digital audio sample is clearly a magnitude. Therefore, we want to map the largest value to the largest value, and the smallest value to the smallest value, without caring how the precise numeric value changes.

            Signed               Unsigned
Smallest    %1000 0000 (-128)    %0000 0000 (0)
Bigger      %1000 0001 (-127)    %0000 0001 (1)
Bigger      %1000 0010 (-126)    %0000 0010 (2)
Bigger      %1000 0011 (-125)    %0000 0011 (3)
Midpoint    %0000 0000 (0)       %1000 0000 (128)
Bigger      %0111 1100 (124)     %1111 1100 (252)
Bigger      %0111 1101 (125)     %1111 1101 (253)
Bigger      %0111 1110 (126)     %1111 1110 (254)
Biggest     %0111 1111 (127)     %1111 1111 (255)

We can see the pattern here. The only difference between the two columns is that the high bit is inverted. I love two's complement!

This makes good sense. If we ignore bits completely, the range -128 to 127 gets rescaled to 0 to 255 simply by sliding all the values up, adding 128 to everything. Now back to bits: you take the bit pattern of the signed value, simply reinterpret it as an unsigned int, and add 128. If the high bit is clear when you add 128, the high bit will be set. But if the high bit is already set, the sum will overflow 255 and wrap around to the same value but with the high bit clear. Which is the same thing as just inverting the high bit, except that it's faster to EOR the high bit than to do an ADC, because the EOR doesn't need to worry about the state of the carry.4

The same logic applies when conceptualizing it the other way around. If you want to take a value from 0 to 255 and rescale it to -128 to 127, you are sliding all values down by subtracting 128. If the original value started out with the high bit set, subtracting 128 will just unset the high bit. But if the value started out less than 128, with the high bit unset, the subtraction will wrap the value back around in the other direction, and the high bit will be set.
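The claim that flipping the high bit is the same as adding 128 is easy to verify exhaustively. A quick Python check, with helper names invented for this sketch:

```python
def to_unsigned(pattern):
    """What EOR #$80 does to an 8-bit pattern."""
    return pattern ^ 0x80

def as_signed(pattern):
    """Interpret an 8-bit pattern as a two's complement value."""
    return pattern - 256 if pattern & 0x80 else pattern

for pattern in range(256):
    # Flipping the high bit equals adding 128 with 8-bit wraparound...
    assert to_unsigned(pattern) == (pattern + 128) & 0xFF
    # ...which slides the signed range -128..127 up onto 0..255.
    assert to_unsigned(pattern) == as_signed(pattern) + 128

print(to_unsigned(0x80), to_unsigned(0x00), to_unsigned(0x7F))
```

The three printed values are the smallest, midpoint and biggest rows of the table: -128 becomes 0, 0 becomes 128, and 127 becomes 255.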

Two's complement is cool.

The whole downsampling process then goes like this:

  1. Read the low byte from the file.
  2. Add to it a byte of noise from the SID.
  3. The carry now holds the dither bit.
  4. Push the dither bit (PHP) to the stack.
  5. Read the high byte from the file.
  6. Convert it from signed to unsigned by EOR #$80.
  7. Pull the dither bit (PLP) from the stack, and
  8. ADC #0 to incorporate the dither bit into the new unsigned 8-bit sample.
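The eight steps can be modelled in Python as below. This is a sketch rather than the load16bit routine itself: the noise source is passed in as a function, with Python's random module standing in for a read of the SID register at $D41B, and the names are made up for illustration.

```python
import random
import struct

def sid_noise():
    """Stand-in for reading a pseudo-random byte from $D41B."""
    return random.getrandbits(8)

def downsample(raw: bytes, noise=sid_noise) -> bytes:
    """Convert little-endian signed 16-bit samples to unsigned 8-bit."""
    out = bytearray()
    for i in range(0, len(raw) - 1, 2):
        low = raw[i]                          # 1. low byte comes first
        carry = int(low + noise() > 0xFF)     # 2-4. add noise, keep the carry
        high = raw[i + 1]                     # 5. then the high byte
        sample = high ^ 0x80                  # 6. signed to unsigned
        sample = (sample + carry) & 0xFF      # 7-8. add the dither bit; wraps
        out.append(sample)                    #      at $FF like an 8-bit ADC
    return bytes(out)

raw = struct.pack("<3h", -32768, 0, 32767)
print(downsample(raw, noise=lambda: 0).hex())
```

With the noise forced to zero the dither bit never fires, so the output is simply each high byte with its top bit flipped.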

Configuring the NMI-Based Play Routine

Whether we loaded 8-bit samples or loaded and converted 16-bit samples, as soon as the data loading is done the file is closed and execution leaves the RIFF chunk processing loop by JMPing to datadone.

Besides printing a message that loading data is done, several steps need to be taken to prepare to play the sample.

The C64 has two CIA chips. Both have two 8-bit data ports. Both have two 16-bit timers. Both have an interrupt pin. CIA 1 uses both of its ports to scan the keyboard matrix, and one of its timers to drive the main system interrupt. CIA 1's interrupt pin connects to the CPU's IRQ pin. Thus, 60 times a second, CIA 1 triggers a CPU IRQ which performs some common tasks. The task we care about most is that it scans the keyboard, which it does by interacting with CIA 1's ports. There's a lot of CIA 1 action going on here, and the IRQ already has one job.

CIA 2 uses its ports for a mixture of purposes. Port A is used to control the VIC-II's memory bank, to implement the IEC serial bus, plus 2 bits go to the User Port. Port B, on the other hand, is used completely for driving something on the User Port. The DigiMAX is connected to the User Port, so CIA 2 is responsible for outputting the audio samples. CIA 2's interrupt pin connects to the CPU's NMI pin. NMIs have a higher priority than IRQs, and nothing else is using the NMI. So we'll use one of CIA 2's timers to drive the audio playback.

Besides CIA 2, two other things connect to the NMI pin: A line on the Cartridge Port, and the Restore key. The Cartridge Port will be occupied with the GeoRAM, which doesn't affect the NMI line. But the user could have a cartridge port expander and additional cartridges plugged in. We're going to state as a caveat, this routine won't work right if some other cartridge starts driving the NMI line, or pulls it low and holds it low.

As for the Restore key, we don't have to worry about that. First, it cannot be used to hold the NMI line low, because it goes through a 556 timer that automatically brings the line high even if you hold the key down. And you can't push it fast enough to have an appreciable effect considering that the CIA will drive the NMI at a minimum of 8000 times a second already. I worried that you might hear a small click at the moment you push the Restore key, but in my testing I haven't noticed that.

We'll get into the playback controls at the end of this post, but suffice it to say, pausing playback will consist of masking the interrupt bit from the timer inside the CIA. You can't mask an NMI (Non-Maskable Interrupt) on the CPU, but you can mask the timer in the CIA so the timer keeps cycling, but it doesn't pull its interrupt pin low when it reaches zero.

To prevent a possible crash when we adjust the CPU's NMI vector, we'll call the pause routine first. A crash would occur if by some fluke an NMI was triggered after we changed one byte of the NMI vector but before we'd changed the next. Between changing the high and low bytes the vector is momentarily pointing to something unexpected.

Here's the code for datadone:

CIA 2's ports will be used for driving the DigiMAX, so the data direction of all bits is set to output. The VIC-II's memory bank bits need to be outputs, and the IEC serial bus no longer needs to be used.

Next up is to configure CIA 2's timer. We'll configure Timer A. It's 16-bit but the number of cycles between samples at our lowest supported sample rate is only about 128. So the timer's high byte can take a hardcoded zero, and the timer's low byte is loaded from the delay variable that was pulled earlier from the sample rate table. It already contains the correct value for this RIFF/WAVE file's sample rate and the clock rate of this NTSC or PAL machine.

The timer has various configurable properties. We'll set it to load the latched value, ignore any inputs from Port B, use "continuous firing mode," which means as soon as it hits zero it resets to the latched start value and continues counting down again automatically. And we'll set the bit to start the timer ticking.

Lastly, we need to assign the routine to call when an NMI occurs. The CPU's hardware NMI vector is at $FFFA/$FFFB, near the very end of the CPU's addressing range. When the KERNAL is patched in, the KERNAL configures the NMI vector to pass through an indirect JMP() through a RAM vector at $0318/$0319, and it also executes an SEI to mask interrupts. The KERNAL's redirect adds 7 cycles of overhead to every NMI. We'll see soon how precious these cycles are. We could recover at least 5 of these cycles by patching out the KERNAL and pointing the hardware NMI vector directly at our play routine. But then the KERNAL can't be used to scan the keyboard and GeoRAM Digi would have to implement that and other stuff to handle the IRQ. I've opted instead to leave the KERNAL patched in, and we'll see where that leaves us in the play routine.

Theoretically a single NMI play routine could check to see how many channels of audio this sample has and branch depending on whether it's mono or stereo. But, time inside the NMI routine is incredibly precious, and we don't want to do any work there that could be done ahead of time. Therefore, there are two different play routines: play1chn and play2chn. The former is assigned to the KERNAL's NMI RAM Vector if there is one channel, and the latter if there are two channels.

Playing a Sample via NMI

Our data is in the GeoRAM, it's in a format that can be played by the DigiMAX. And we have the CIA 2 Timer A configured to generate an NMI at a regular interval, the right interval to output one sample per call and it will sound right. What do we have to do in that call to send the sample to the DigiMAX?

Here's where the rubber really hits the road for the Commodore 64. In my first post about this from August 2020, I greatly overestimated how much free CPU time there would be left over. A typical IRQ handler, the one that scans the keyboard and blinks the cursor, and in C64 OS moves the mouse pointer, is called 60 times a second. That felt like a lot to me because even a slow old CPU is unintuitively fast. 1,022,727Hz divided by 60 is 17,045 cycles per IRQ. The normal IRQ handler has to run its code in the first few hundred cycles of a block of ~17,000 cycles. This is clearly not a problem.

But a digital audio sample that's just a crappy 8000Hz has to interrupt the CPU 8000 times a second, on top of the IRQ that's already interrupting it 60 times a second. As we saw in the Sample Rate Table—but didn't spend much time thinking about—at 8000Hz the play1chn routine has 127 cycles on NTSC and 123 cycles on PAL. At a more respectable but still pretty low quality 11025Hz, it has 93 cycles on NTSC and 89 cycles on PAL. The routine must not take even one cycle more than these limits, or another NMI will be triggered before the first one is even finished.
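The budget arithmetic is easy to check with a few lines of Python. The clock rates are the standard NTSC and PAL C64 values. Rounding to the nearest cycle is my assumption, and it reproduces the 11025Hz, 16000Hz and 22050Hz figures quoted in this post. The sample rate table may round differently at other rates.

```python
# C64 CPU clock rates in Hz.
NTSC_HZ = 1_022_727
PAL_HZ = 985_248

def cycles_per_sample(clock_hz: int, sample_rate: int) -> int:
    """CPU cycles between two samples, rounded to the nearest whole cycle."""
    return round(clock_hz / sample_rate)

def cycles_per_irq(clock_hz: int, irq_hz: int = 60) -> int:
    """CPU cycles between two IRQs at the given IRQ frequency."""
    return clock_hz // irq_hz

print("cycles per IRQ (NTSC):", cycles_per_irq(NTSC_HZ))
for rate in (11025, 16000, 22050):
    print(rate, cycles_per_sample(NTSC_HZ, rate), cycles_per_sample(PAL_HZ, rate))
```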

Consider now that some instructions take 6 or even more cycles to execute. For example, incrementing an absolute memory address takes 6 cycles. We're talking about a hard limit of just 15 instructions like this, and the CPU would be 100% saturated at only 11025Hz. With this in mind, now let's look at what needs to be done.

Entering and Exiting the NMI Handler

Before we get to do any work of our own, there is overhead that can be counted up in cycles for getting the CPU to switch contexts, to handle the NMI task and then switch back.

Interrupt Sequence.

I found this diagram in a brilliant article about interrupts on the 6502 by Garth Wilson. (Investigating Interrupts.) I learned that an NMI, IRQ and BRK all share the same internal mechanism, with some minor variation. When the NMI line goes low the current instruction is given time to finish. This could take several cycles. Then the CPU goes through a 7-cycle sequence before the first instruction of the NMI handler is executed.

Of our 93 cycles, 7 are reliably taken right away, and possibly a few more. The first instruction of the NMI handler is in the KERNAL ROM, which does an SEI and a JMP() through the RAM vector. Together these take up an additional 7 cycles. Plus at the end we must do an RTI, which takes up 6 cycles. Now we're up to 20 cycles of overhead, possibly a few more.

At least we're into our own code where we have some measure of control over what we can optimize. But I wouldn't quite say the overhead is finished yet, because there are still some steps that must be taken. For example, the CIA Timer sets an internal interrupt bit. It won't generate another interrupt until we read from its interrupt status register. This is an absolute address ($DD0D), but fortunately the read can be performed by a BIT which will not affect the main registers. It will however take 4 cycles.

Any of the A, X or Y registers that will be changed must be backed up (presumably to the stack) and must be restored at the end. The processor status flags are backed up and restored for us as part of the initial 7-cycle interrupt sequence and the RTI, respectively. The 6502/6510 cannot push the X or Y registers directly to the stack, nor directly pull them from the stack. They have to be transferred to and from the A register. A push takes 3 cycles, a pull takes 4 cycles. That's 7 cycles to push and pull the A register. TXA, TAX, TYA and TAY each take 2 cycles. So to back up and restore either X or Y takes 2 to transfer, 3 to push, 4 to pull and 2 to transfer, or 11 cycles total. If we need to use all three registers, that means 7 for A, 11 for X and 11 for Y or an additional 29 cycles of overhead!!! Together with the 20 cycles of entry and exit and the 4 cycles to acknowledge the CIA, this would bring us up to a whopping 53 cycles of overhead. On 93 cycles total that's 57% of total available time.

Fuck a duck! Pardon my language.

When I said I overestimated the time available, I meant it. There is, literally, no time. I assumed 22050Hz would be the fastest we could probably support. On an NTSC machine there are only 46 cycles between samples, or only 45 cycles on PAL. But there are around 53 cycles of overhead just to get into and out of the NMI routine. We haven't even started to talk about reading samples from GeoRAM, sending to DigiMAX, incrementing addresses or changing GeoRAM pages, and we're anywhere from 8 to 15 cycles in the hole. At 16000Hz, there are 64 or 62 cycles for NTSC and PAL respectively. The routine would have about 10 cycles to work with, which is enough for 2 to 5 instructions at most.
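Here is the same tally as a short Python sketch, using only the cycle counts quoted above. It assumes a handler that saves all three registers and ignores the extra cycles an interrupted instruction needs to finish.

```python
# Fixed costs of an NMI handler that saves A, X and Y, using the counts quoted in the text.
OVERHEAD = {
    "interrupt sequence": 7,
    "KERNAL SEI + JMP through $0318": 7,
    "BIT $DD0D to acknowledge the CIA": 4,
    "save/restore A (push, pull)": 3 + 4,
    "save/restore X (transfer, push, pull, transfer)": 2 + 3 + 4 + 2,
    "save/restore Y (transfer, push, pull, transfer)": 2 + 3 + 4 + 2,
    "RTI": 6,
}
TOTAL_OVERHEAD = sum(OVERHEAD.values())

# (NTSC, PAL) cycles between samples, as quoted in the text.
BUDGETS = {11025: (93, 89), 16000: (64, 62), 22050: (46, 45)}

def headroom(cycles_between_samples: int) -> int:
    """Cycles left for real work once the fixed overhead is paid (negative = impossible)."""
    return cycles_between_samples - TOTAL_OVERHEAD

print("overhead:", TOTAL_OVERHEAD)
for rate, (ntsc, pal) in BUDGETS.items():
    print(rate, "NTSC:", headroom(ntsc), "PAL:", headroom(pal))
```

A negative headroom means the handler cannot even enter and exit before the next sample is due.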

I won't say that handling 16000Hz or 22050Hz on the C64 is impossible. But I'm going to go out on a limb here and say it's impossible to do it by means of an NMI handler and a CIA Timer. There is simply too much overhead involved that robs the last precious few cycles necessary to pull it off. I think the only way to pull off 22050Hz would be with a single tight continuous loop. Given that you get all the NMI overhead back, there might be time to scan the keyboard to see if anything is pressed. That can be done pretty fast.

But frankly, I'm not that into demo-style hacking and slashing to pull off the incredible. Using an NMI for playback and allowing the IRQ to keep processing the keyboard, (and mouse,) and possibly eking out a few cycles in between to update the screen, is a technique that is much more plausibly applicable to use within a C64 OS app or some other program someone might want to write. What it means is that we eat a little 1MHz-sized humble pie, and settle for 8000Hz and 11025Hz as our two primarily supported sample rates.

Just like with converting 16-bit samples to 8-bit, downsampling to 11025Hz at loadtime is always an option. It would at least allow us to hear the file, albeit with reduced quality, rather than throwing an error and hearing nothing.

So let us move forward with 11025Hz, and not worry about 22050Hz. Where were we? 93 cycles minus 53 cycles of overhead, leaves us with 40 cycles to process the sample. Here's what I came up with:

Whooo! This is the exciting part. Starting with 93 cycles for NTSC or 89 cycles for PAL, the above routine takes 76 cycles total (plus possibly a few more, based on what was being executed when the NMI dropped,) on the calls that play a sample from within a GeoRAM page. Or 79 cycles (plus...) total on the call that needs to switch the GeoRAM to the next page. It fits, just, and it took some work to figure out.

Here's the first thing I figured out. It is possible to do this without using the Y register. Therefore, by not backing up and restoring it, we get back 11 cycles. All the variables that will be referenced are in zero page, because that shaves a cycle or two off each reference.

In general, we need a play index. The play index (playidx) is a 32-bit number that maintains the current position within the data. The low byte playidx+0 gets incremented on every sample, and is used to index into $DE00 to fetch the sample byte from the current GeoRAM page. The routine is divided into two halves, the top half that outputs bytes from the current page, and the bottom half that increments the GeoRAM page. Let's take the top half first.

It takes 14 cycles to get into the routine. 4 cycles to clear the CIA Interrupt byte. 8 cycles to back up the X and A registers. 3 cycles to load the current playidx into X. 5 cycles to increment playidx for next time. By the way, it is no faster to INX and STX, those take 2 cycles and 3 cycles respectively, for a total of 5, but then X has the wrong value. Next, if the INC playidx resulted in zero, that means the current X is $ff, the last byte of the GeoRAM page. This will branch to the bottom half of the routine. But on most calls this will not branch, and a branch not taken is 2 cycles. 4 cycles loads the sample from $DE00,X.

Next we have a nice-to-have, though not strictly necessary. We'll output the same mono sample to both DigiMAX channels so we hear the audio in both our ears (especially if you're wearing headphones.) It takes 10 cycles to load a channel number and write it to the DigiMAX then write the sample byte to that channel. Do that again for the other channel for 10 more cycles. And finally, 10 cycles to restore X and A registers, and 6 to RTI. For a total of: 76 Cycles, leaving some buffer space for the extra cycles needed for a long instruction that gets interrupted to finish.
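The running total is easier to follow as a table. This Python sketch simply lists the steps described above with their quoted cycle counts. The step labels paraphrase the prose rather than quote the routine's instructions.

```python
# Per-step cycle counts for the common (no page change) path of play1chn,
# taken from the prose description. Labels are paraphrases, not actual source lines.
PLAY1CHN_COMMON = [
    ("interrupt sequence + KERNAL redirect", 14),
    ("clear CIA interrupt flag", 4),
    ("back up A and X", 8),
    ("load playidx low byte into X", 3),
    ("increment playidx low byte", 5),
    ("branch not taken (not the last byte of the page)", 2),
    ("load sample from $DE00,X", 4),
    ("select first DigiMAX channel, write sample", 10),
    ("select second DigiMAX channel, write sample", 10),
    ("restore X and A", 10),
    ("RTI", 6),
]

def total_cycles(steps) -> int:
    return sum(cycles for _, cycles in steps)

def slack(steps, budget: int) -> int:
    """Cycles left over per sample for the main program."""
    return budget - total_cycles(steps)

print("total:", total_cycles(PLAY1CHN_COMMON))
print("slack NTSC:", slack(PLAY1CHN_COMMON, 93), "PAL:", slack(PLAY1CHN_COMMON, 89))
```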

Here's an interesting hardware discovery I've made. The DigiMAX employs a quad-8-bit DAC chip. There are actually 4 independent 8-bit DACs. However, 2 are hardware mixed to the left channel, and 2 are hardware mixed to the right channel. If one of those DACs were designated as a special mono output, it could be hardware split to output on both channels. This way, in software, when we know we're outputting mono data, rather than killing 10 cycles writing the same sample to two different channels, we could write one sample to the 3rd DAC and still hear it in both our ears. That's just a thought I had, since we don't really have time to make full use of 4 DACs in the same routine anyway.

You might think saving 10 cycles isn't that much. But it's 10 * 11025, which is 110250 cycles per second or nearly 11% of total CPU time!

Changing the GeoRAM Page

Every 256 samples we have to move to the next GeoRAM page. To do this we have to increment the playidx mid-byte, and if that rolls over increment the playidx high-byte too. Each of these takes 5 cycles, plus the testing branch for the rollover will take either 2 or 3 cycles. It'll take 3 cycles to skip the second increment, so it's 12 cycles with the rollover or 8 cycles without. The next problem is that, they're in the normalized addressing mode, not the GeoRAM's discontiguous addressing mode. We have to load them into A and X, 3 cycles each, and write the final outcome addresses to $DFFE and $DFFF, these are absolute writes, 4 cycles each.

So far that's 12 + 6 + 8 = 26 cycles, not including the address translation. But the first half of the routine took 76 cycles. We can't add 26 more cycles to change the GeoRAM page, that would put us at 102 cycles, 9 cycles more than we've got on NTSC and 13 more on PAL. My solution to this is to skip the output of every 256th sample. In the top half of the routine we grabbed X and incremented playidx. If the increment results in zero, it skips outputting that sample by branching to the bottom half of the routine. The branch takes 3 cycles which puts us at 37 cycles up to that point. 37 cycles plus 26 is 63 cycles, and the bottom half of the routine reimplements the code that restores the X and A registers and does the RTI for 16 more cycles, which puts us at 79 cycles.

In my testing, skipping one sample out of every 256 is not audible. But we still have a problem. The address bytes aren't in the right addressing scheme to be written to the GeoRAM's page registers. We talked about this earlier. This is the disadvantage of normalizing and translating the GeoRAM's addressing. The translation requires writing the mid-byte to a temporary zero page address for 3 cycles, two zero page left shifts, (ASL zp) 5 cycles each, and two accumulator left rolls, 2 cycles each, for a total of 17 cycles. 79 plus 17 is 96, more than the allotted space for NTSC or PAL, and no buffer room for the long instruction that gets interrupted. Even skipping the sample output on this round, there isn't enough time to compute the new GeoRAM page.
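The page-change path can be tallied the same way. This Python sketch uses the cycle counts from the last few paragraphs and assumes the worst case, where the mid byte rolls over into the high byte.

```python
# Cycle counts quoted in the text for the NMI call that changes the GeoRAM page.
CYCLES_TO_BRANCH = 14 + 4 + 8 + 3 + 5 + 3   # entry, acknowledge, save A/X, load, increment, branch taken
ADVANCE_PAGE = 12 + 6 + 8                   # two-byte increment with rollover, two loads, two stores
EXIT = 10 + 6                               # restore X and A, RTI
TRANSLATE = 3 + 2 * 5 + 2 * 2               # store to zero page, two ASL zp, two ROL A

def page_change_cycles(translate_in_nmi: bool) -> int:
    """Total cycles for the page-change call, with or without inline address translation."""
    total = CYCLES_TO_BRANCH + ADVANCE_PAGE + EXIT
    return total + TRANSLATE if translate_in_nmi else total

for inline in (False, True):
    total = page_change_cycles(inline)
    verdict = "fits" if total <= 89 else "over budget"
    print("translate in NMI:", inline, "cycles:", total, verdict)
```

Without the translation the call fits even PAL's 89-cycle window. With it, the call overruns both machines.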

The golden rule here is, don't do anything inside the NMI that you don't have to.

Let's say that for 255 samples we could get this routine down to 76 cycles, and 79 cycles for 1. That means that for 255 samples in a row, we get 93 - 76 = 17 cycles per sample, 255 * 17 = 4335 cycles—chopped into 255 pieces—but 4335 cycles nonetheless before we need to change the GeoRAM page. Also, the IRQ will only interrupt that 4335 cycles less than twice on average and only needs a couple hundred cycles anyway, if that. In other words, there is actually a lot of CPU time free while playing 11025Hz audio, because the spare cycles add up. The math is straightforward, 76 / 93 = 0.82, or ~82% of the CPU is used to play 11025Hz mono audio. Which means we have ~18% of the CPU time left over. Let's see now what sort of UI we can implement in that left over 18%.
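For anyone who wants to play with the numbers, here is that calculation as a Python sketch. The function names are my own. The weighted average, which folds in the one slower page-change call per 256 samples, is an extra step the paragraph above doesn't spell out.

```python
def cpu_share(handler_cycles: int, cycles_per_sample: int) -> float:
    """Fraction of CPU time spent inside the NMI handler."""
    return handler_cycles / cycles_per_sample

def spare_cycles_per_page(handler_cycles: int, cycles_per_sample: int, samples: int = 255) -> int:
    """Cycles left to the main program while one GeoRAM page plays."""
    return (cycles_per_sample - handler_cycles) * samples

def average_share(common: int, page_change: int, cycles_per_sample: int) -> float:
    """CPU share averaged over 255 common calls and 1 page-change call."""
    return (255 * common + page_change) / (256 * cycles_per_sample)

print(f"NTSC share: {cpu_share(76, 93):.1%}, spare per page: {spare_cycles_per_page(76, 93)}")
print(f"PAL share:  {cpu_share(76, 89):.1%}, spare per page: {spare_cycles_per_page(76, 89)}")
print(f"NTSC average including page changes: {average_share(76, 79, 93):.1%}")
```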


User Interface and Keyboard Controls

When introducing GeoRAM Digi and listing its features, I gave a list of supported keyboard keys for playback control.

Keyboard support
  • SPACE = Pause/resume playback
  • s = Stop
  • ← = Rewind
  • F1 = Skip back 5 seconds
  • F7 = Skip forward 10 seconds
  • +/- = Toggle screen on/off
  • r = Repeat mode on/off
  • CTRL+Q = Quit to READY.

Remember, the regular IRQ is still running. It scarcely interferes because it runs only 60 times a second. While playing 11025Hz audio it runs only once every 183 samples. And the NMI takes priority over the IRQ, so even if the IRQ is running, it won't interfere with the audio timing.

The IRQ scans the keyboard so the keyboard buffer is being populated with key presses. Therefore, after hooking up the NMI routine, the main code can enter an infinite loop and keep checking the keyboard buffer with getin on every loop. If there is nothing in the keyboard buffer it just keeps looping until there is.

Once a key is detected, it does a series of compares to check if it's a supported key, and then performs some simple task for each key. For instance, CTRL+Q (which puts one byte in the keyboard buffer) JMP()'s through the hardware reset vector. The KERNAL takes over from there and resets all the I/O chips, etc. effectively stopping the audio and returning us to the READY. prompt.

"+" and "-" flip the "screen on/off" bit in the VIC-II's $D011 register. When the screen is on, though, the VIC-II puts the CPU to sleep for ~40 cycles every 8th rasterline of the bitmapped screen area. Given the high frequency of the NMI, the CPU will often be asleep and get delayed in processing it. This has a noticeable but not catastrophic impact on the audio quality. It's much better to leave the screen off, but it's worth a temporary decrease in quality to be able to flip the screen on and see what might be going on, or, in the case of a C64 OS app, move the mouse and click a button.

"r" toggles a repeat variable between 0 and 1.

The other keys, SPACE, s, ←, F1 and F7 call a series of playback control routines. SPACE toggles between calling pause and resume. S calls dostop ("stop" is a KERNAL call), ← calls rewind, F1 and F7 call back5 and fwrd10.

Playback Control Routines

The playidx determines where the audio is playing from. Additionally, there's a variable for playing that is toggled between 0 and 1 to indicate whether the audio is playing.

Resume

Resume checks to see if it's already playing, and if not sets playing to 1, turns off the screen (for optimal sound quality), sets the GeoRAM's current page with a call to setrampg (Set RAM Page) with playidx+1 and playidx+2. It sets these because while playback was paused something else may have moved the play index. And lastly, it enables CIA 2 Timer A's interrupt bit. While paused or stopped the timer keeps ticking, but its interrupt is masked.

Pause

Pause checks to see if it's already paused, and if not sets playing to 0, and masks CIA 2 Timer A's interrupt. It doesn't move the play index, and it doesn't reenable the screen.
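Masking and unmasking relies on how the CIA's Interrupt Control Register treats writes: bit 7 of the written value chooses whether the other set bits are added to or removed from the mask. Below is a minimal Python model of that behaviour. The values $81 and $01 follow from the register layout. Whether pause and resume write exactly those values is my assumption.

```python
TIMER_A = 0x01    # bit 0 of the CIA Interrupt Control Register
SET_BITS = 0x80   # bit 7 of a written value: 1 = set the named mask bits, 0 = clear them

class InterruptControl:
    """Minimal model of the write side of a 6526 ICR such as CIA 2's $DD0D."""

    def __init__(self) -> None:
        self.mask = 0

    def write(self, value: int) -> None:
        bits = value & 0x1F
        if value & SET_BITS:
            self.mask |= bits
        else:
            self.mask &= ~bits

    def timer_a_raises_nmi(self) -> bool:
        return bool(self.mask & TIMER_A)

icr = InterruptControl()
icr.write(SET_BITS | TIMER_A)    # resume: write $81
print("after resume:", icr.timer_a_raises_nmi())
icr.write(TIMER_A)               # pause: write $01
print("after pause:", icr.timer_a_raises_nmi())
```

The timer itself is untouched by either write, which matches the behaviour described above: it keeps ticking while paused, it just stops pulling the NMI line.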

DoStop

I would have called this stop but the KERNAL ROM already has a stop routine, that scans the keyboard matrix for the STOP key. Dostop turns the screen on, calls pause and also calls rewind.

Rewind

The heart of rewind is to reset the play index (all 4 bytes) to zero. It's a bad idea to do this while the sample is playing however, because the NMI could interrupt while in the middle of changing the play index bytes. It's just like changing the NMI vector without stopping the source of NMIs first. Rewind wouldn't cause a crash, but it could output data from somewhere unexpected.

After resetting the play index we should then resume playback. But we can also hit rewind while currently paused. And we want to remain paused. So, it backs up the playing variable, calls pause, changes the play index, pulls the playing state from the stack and if it was playing before it calls resume. The main reason I don't have the pause routine enable the screen is because I don't want the screen to be turned on momentarily when pushing buttons that pause, move the play index, and resume.

Back5

The gist of back5 is, like rewind, to move the play index. It will pause, move the play index, and resume just like rewind. The trick is, we want it to decrement the play index by 5 seconds worth of samples. How many samples is that? It depends on the sample rate, the sample size, and the number of channels. Fortunately, the RIFF file's "fmt " chunk includes the Byte Rate property. Byte Rate is a precomputed number of bytes per 1 second of audio. Brilliant.

If the sample is 8000Hz, 16-bit, Stereo, Byte Rate will be 8000 * 2 * 2 = 32000 bytes. We downsample from 16-bit to 8-bit samples, but, if you'll recall, we also divide Byte Rate by 2 to account for that. To skip back 5 seconds then, we just have to subtract Byte Rate from the play index 5 times. If, however, we are closer to the beginning of the sample than 5 seconds, this subtraction will cause an underflow, and the unsigned play index wraps around to a very large number. So if playidx ends up bigger than datasz, then playidx is set back to zero.

Fwrd10

We know what back5 does, fwrd10 does basically the same thing. It pauses playback, adds Byte Rate to playidx 10 times. If you were near to the end of the sample this could easily overflow. So, once again it checks if playidx is greater than datasz, except this time if it is, it sets playidx equal to datasz (the end of the sample) rather than zero (the beginning of the sample.)
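The clamping in both routines can be sketched in Python by emulating the 4-byte unsigned play index with a mask. This models only the arithmetic and leaves out the pause and resume calls. `byte_rate` is assumed to be the already-adjusted bytes per second of the 8-bit data in GeoRAM.

```python
MASK32 = 0xFFFFFFFF  # playidx is a 4-byte unsigned value

def back5(playidx: int, byte_rate: int, datasz: int) -> int:
    """Move back 5 seconds, landing on 0 if that would pass the start."""
    for _ in range(5):
        playidx = (playidx - byte_rate) & MASK32   # wraps like a 32-bit subtraction
    return 0 if playidx > datasz else playidx

def fwrd10(playidx: int, byte_rate: int, datasz: int) -> int:
    """Move forward 10 seconds, landing on datasz if that would pass the end."""
    for _ in range(10):
        playidx = (playidx + byte_rate) & MASK32
    return datasz if playidx > datasz else playidx

print(back5(100_000, 8_000, 500_000))    # comfortably inside the data
print(back5(10_000, 8_000, 500_000))     # less than 5 seconds in: underflow path
print(fwrd10(480_000, 8_000, 500_000))   # near the end: clamped
```

Because an underflow wraps to a value far larger than any GeoRAM capacity, one "greater than datasz" comparison catches both running off the start and running off the end.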


Ending Playback, Repeat, and Intermediate Calculations

The NMI handler routine is so close to the limit that there well and truly is no time to do anything that doesn't absolutely have to be done in the NMI routine.

One of the most obvious is, what happens when playback reaches the end of the audio data? The NMI handler (play1chn or play2chn) increments the play index, but it doesn't check to see if it's reached the end.

Remember that the infinite loop that's processing keyboard commands has ~4300 cycles, split up into many short bursts by the NMI interrupting it, before a GeoRAM page has to be changed. Therefore, the top of this loop has the spare time to handle various intermediate calculations.

It pulls playidx+1 and playidx+2 into X and Y, and increments them as registers (leaving the original play index unmodified). It writes the low byte X into a zero page address, and transfers the high byte Y into the accumulator. It then does the ASL zp/ROL A pair twice to convert the play index addressing into the GeoRAM addressing, and saves those bytes to geopghi and geopglo. So, these two bytes are in the correct addressing scheme for the next GeoRAM page.

Somewhat unnecessarily, it keeps recomputing these on every loop, but it's not that bad because there is nothing else to do besides check the keyboard for presses. What happens though is that when the NMI hits that finds itself on the last byte of the current GeoRAM page, it skips playing that sample (to save time), it increments the play index, but rather than mapping the new play index to the GeoRAM addressing in realtime, which it has no time for, it grabs these precomputed bytes and writes them to the GeoRAM page registers. Sneaky.
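The precomputation can be modelled in Python with 8-bit helpers that mimic ASL and ROL. This is a sketch under two assumptions: that $DFFE selects one of 64 pages inside a 16K block while $DFFF selects the block, and that, as footnote 3 suggests, the unshifted mid byte can be written to $DFFE because its upper two bits are ignored. The helper names are mine.

```python
def asl(value: int):
    """8-bit arithmetic shift left; returns (result, carry out)."""
    return (value << 1) & 0xFF, value >> 7

def rol(value: int, carry_in: int):
    """8-bit rotate left through carry; returns (result, carry out)."""
    return ((value << 1) | carry_in) & 0xFF, value >> 7

def next_georam_page(mid: int, high: int):
    """Return (value for $DFFE, value for $DFFF) for the page after the current one."""
    page = ((high << 8 | mid) + 1) & 0xFFFF   # incremented copy; playidx itself is untouched
    x, a = page & 0xFF, page >> 8
    dffe = x                                  # upper two bits are ignored by the page register
    zp = x                                    # scratch copy that gets shifted away
    for _ in range(2):                        # ASL zp / ROL A, twice
        zp, carry = asl(zp)
        a, _ = rol(a, carry)
    return dffe, a

dffe, dfff = next_georam_page(0x3F, 0x00)     # last page of block 0
print(f"$DFFE={dffe & 0x3F:#04x} $DFFF={dfff:#04x}")
```

The two shift-and-rotate pairs move the top two bits of the mid byte into the bottom of the high byte, which is the same as dividing the 16-bit page number by 64 to get the block.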

Additionally, inside this infinite loop, since it's got time to kill, it checks to see if the playidx is equal or greater than datasz. If it is, then playback has reached the end of the sample. It checks the repeat variable, if repeat is on then it calls rewind and the sample carries on playing from the start. If repeat is off, then it calls dostop. These checks don't need to be done in the NMI handler, so, they're not done there.
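In outline, the end-of-data check amounts to the following Python sketch. The function and its return strings are hypothetical stand-ins for the calls to rewind and dostop.

```python
def end_of_playback_action(playidx: int, datasz: int, repeat: int) -> str:
    """Decide what the main loop should do on this pass (names are illustrative)."""
    if playidx < datasz:
        return "continue"                    # still inside the sample data
    return "rewind" if repeat else "dostop"  # reached or passed the end

for args in ((1000, 5000, 0), (5000, 5000, 1), (5000, 5000, 0)):
    print(args, "->", end_of_playback_action(*args))
```

The NMI handler never makes this comparison, so the play index can run slightly past datasz before the main loop notices. That is why the test is "equal or greater" rather than just "equal".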


Supporting Stereo Playback

We're almost done now. The last thing to discuss is, how to support playing back stereo audio data.

Loading mono and stereo data is done in exactly the same way. If the audio samples are mono, then every sample gets written to both channels. If the audio samples are stereo, then left and right channel samples alternate. Left channel, right channel, left channel, right channel, etc.

The play2chn routine is very similar to play1chn. Here it is:

We have exactly the same timing constraints. We cannot exceed 93 cycles on NTSC or 89 cycles on PAL. Plus, we should have a few buffer cycles in case the NMI interrupts a long running instruction that takes a few cycles to complete.

The main difference when playing stereo data is that the play index advances twice as fast. On each NMI playidx gets incremented twice, once for the left channel, once for the right channel.

I monkeyed around with this for a while to try to figure out the fewest number of cycles to do it in. And maybe someone a lot more clever than me will see something that I don't see, but here's how I've done it. Some of this is just comparing speeds of different ways to accomplish the same thing.

In play1chn, where the play index increments once, the following:

	LDX zp  ;3 cycles
	INC zp  ;5 cycles

is not faster or slower than:

	LDX zp  ;3 cycles
	INX     ;2 cycles
	STX zp  ;3 cycles

Both take the same number of cycles but X ends up one bigger after the second way. On the other hand, if you need to increment that zero page address twice, such as play2chn has to do, then the following:

	LDX zp  ;3 cycles
	INC zp  ;5 cycles
	INC zp  ;5 cycles

is slower than:

	
	LDX zp  ;3 cycles
	INX     ;2 cycles
	INX     ;2 cycles
	STX zp  ;3 cycles

It's slower by 3 cycles. But the faster way leaves the X index 2 values bigger than the slower way.

Now, if the INX's land us back on zero, we're on the last 2 samples of this page, we'll skip the output of these samples so there's time to change the GeoRAM page. If we're not on the last two samples and we need to output them, then we have to: Set DigiMAX left channel, read left sample, write to DigiMAX, Set DigiMAX right channel, read right sample, write to DigiMAX. But because X is two bytes ahead, the sample reads have to come from $DE00-2,X for left channel and $DE00-1,X for right channel. These now cross a page boundary, for a 1 cycle penalty for each read, a net +2. But we saved 3 cycles with the INX INX method above, so we're 1 cycle better off despite crossing the page boundary. And, if we're on one of the NMI calls that will skip the sample output, we don't even suffer the page boundary crossing penalty and are a full 3 cycles ahead for the second half of the routine.
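The trade-off can be checked with a short Python model of the 6502's absolute,X read timing (4 cycles, plus 1 when the indexed address lands in a different page than the base). The "INC INC" alternative reading from $DE00,X and $DE01,X is my assumption about what the slower version would look like.

```python
def abs_x_read_cycles(base: int, x: int) -> int:
    """Cycles for a 6502 absolute,X read: 4, plus 1 if base + X crosses a page."""
    crossed = (base & 0xFF00) != ((base + x) & 0xFF00)
    return 4 + (1 if crossed else 0)

def inc_inc_path(x: int) -> int:
    """LDX zp, INC zp, INC zp, then read both samples with X unchanged."""
    index_update = 3 + 5 + 5
    return index_update + abs_x_read_cycles(0xDE00, x) + abs_x_read_cycles(0xDE01, x)

def inx_inx_path(x: int) -> int:
    """LDX zp, INX, INX, STX zp, then read both samples with X two ahead."""
    index_update = 3 + 2 + 2 + 3
    x_after = x + 2
    return (index_update
            + abs_x_read_cycles(0xDE00 - 2, x_after)
            + abs_x_read_cycles(0xDE00 - 1, x_after))

# Every even index whose pair of samples is actually output (the last pair is skipped).
savings = {inc_inc_path(x) - inx_inx_path(x) for x in range(0, 254, 2)}
print("cycles saved by INX INX:", savings)
```

Both reads in the INX version always cross from page $DD into page $DE, so the penalty is constant and the net saving is the same on every call.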

The second half of the play2chn routine is identical to the second half of the play1chn routine.

The final score is: 84 cycles on 127 NMIs that output the samples, and 81 cycles on the 1 NMI that increments the GeoRAM page. 84 / 93 = 0.90, so playing stereo data at 11025Hz takes up ~90% of the CPU time. This still leaves a healthy ~10% of CPU time to handle the UI and do those intermediate calculations.

Crazy.


Wrapping up, and Final Thoughts

From the looks of it, a program like this is pretty bare bones. Hardly any user interface, more like the output of a UNIX-y command line tool. And the final binary isn't very big. 2,400 bytes exactly, including all the human readable strings. This is not a big program. But when you get into the details of what it takes to play digital audio this close to bare metal,[5] there are quite a few components that go into the whole.

Here's a list of the major sections of GeoRAM Digi:

  • Variables, structures, strings and messages
  • GeoRAM detection, presence and capacity
  • RIFF header loading and checking
  • RIFF fmt chunk loading and checking
  • Sample rate lookup
  • CIA 2 configuration
  • NMI vector configuration
  • Intermediate memory and play index calculations
  • Keyboard press UI
  • Opening and closing a file
  • Data loading helpers
  • RIFF 8-bit data loading
  • RIFF 16-bit data loading and downsampling
  • Playback Controls
  • Mono NMI handler routine
  • Stereo NMI handler routine
  • GeoRAM memory management routines

This blog post is 22,000 words and 20 GitHub Gists. Below is a video showing GeoRAM Digi working, along with links to the GitHub repository for the full source code and pre-assembled binary.

I hope that someone will find this deep dive to be an interesting and educational tutorial on a wide variety of topics, from GeoRAM and DigiMAX, to CIA and NMI programming, to the RIFF file format, to two's complement and 6502 assembly programming and optimization. Even if no one reads this,[6] I totally love this stuff.

Here are some sample .wav files I've been using as tests.

  1. RIFF files store their multi-byte numbers in little endian, just like the 6502. That means the first byte in the file, or the byte lowest in memory is the least significant byte.
  2. Another comment about endianness. IFF, used on older Macs and Amiga, is big endian because the 68K processor is big endian. Microsoft's RIFF is little endian because the x86 CPU is little endian.
  3. Technically the X register's upper two bits are still set to something, but it doesn't matter because when they get written to $DFFE those upper 2 bits are ignored. The byte in zero page is no longer needed.
  4. An immediate EOR takes 2 cycles. And immediate ADC also only takes 2 cycles. But to use ADC accurately you'd have to clear the carry first. And CLC takes an additional 2 cycles.
  5. The only parts not on bare metal are the KERNAL's IEC loading routines, the BASIC routine to output strings to the screen and the KERNAL's IRQ handler that scans and populates the keyboard buffer for us.
  6. Here's a fun little easter egg to see if anyone actually reads this stuff. This will probably only depress me, to find out I'm writing for an audience of 1 (or none.) If you read this, Tweet "GeoRAM Digi easter egg" to @gregnacu. Hehe.


Greg Naçu — C64OS.com
