I’ve been (very) slowly working my way through various project Euler problems as a side project. For some reason, calculating primes became a fascination for me. I eventually wrote a fast, one line python script that uses a sieve. Not long after that, I decided to revive my C64 hobby.

I’ve had a Commodore since I was 10 or 11.. Started writing assembly when I was 14 or so, and soon discovered the world of demos in the late 80’s. I’ve always wanted to do demos, and even coded a few simple ones as a teenager.

So coming back to the C64 after 25 years, refreshing my 6502 limited coding skills, I set to work. I set a challenge for myself. Could I get this C64 to calculate the prime numbers up to 1,000,000. At first glance, this seems impossible. Then on a lark, I decided to add another challenge: Could I do it in less than one minute.

Firstly, the C64 only has 65536 bytes of memory total. There are 74,000+ primes less than 1,000,000. How would I store them? Even the amount of processing seems to be too much. Having to do multi-byte math on an 8bit processor. Looping over the data hundreds of times. I wasn’t too concerned by the time. I knew if I could get everything to fit in memory, given enough time, it could calculate it.

The first challenge was storing the data. Going to the disk drive for swap is notoriously slow. First thing is first, can I eliminate some of the problem set? Why yes. I can cut the problem set in half, by only needed to look at the odd numbers. So now we’re down to 500,000 numbers. Next on the elimination list is anything divisible by 5. We already eliminated anything that ends in zero, so now we eliminate anything ending in 5. This is another 100,000 numbers. So we are down to 400,000. Still way exceeding the limits of the C64.

My break through came when I figured out that I do not have to store the actual numbers. I just need something to represent each number. In this case, I need to represent if a number is prime or not. A boolean variable. A bit. Which means I can store 8 numbers in a single bit. There’s 400,000 numbers I need to store, and at 8 numbers per byte, I need 50,000 bytes. Yes folks, I can fit the data into the C64’s memory!

To access a particular bit, I used two different numbers. One is an offset, and another is the index into the byte. Each byte contains 8 numbers, but in theory actually represents 20 numbers, but with the unneeded ones removed. So the index is always under 20.

To calculate the primes, I’m using a sieve. You can read about it here: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes

In order to save time doing loop addition in order to multiply (remember, the 6502 doesn’t have multiply or divide instructions), I decided to pre calculate some stuff. What I noticed is that there’s a pattern to the multiplication. Starting with the prime we’re working with, we actually add 2 times the prime (2 X prime) to get the next number, which will not be prime. This skips the numbers that will result in even numbers. The resulting number is really (prime X 3). If we added the (2 X prime) again, we get the result of (5 X prime). We don’t really need this, since we already eliminated all stuff time 5 in our storage. So we add the (2 X prime) again, which really means that we’re adding (4 X prime). The result is (7 X prime). Again, we add (2 X prime) to get (9 X prime). To loop back around, we go one more time, (2 X prime) to give us (11 X prime). If you continue this pattern, you’ll notice that you’re adding like this: 2X, 4X, 2X, 2X.. Here’s a couple examples:

prime = 3

3 + (2 x prime) = 9

9 + (4 x prime) = 21

21 + (2 x prime) = 27

27 + (2 x prime) = 33 (loop around to beginning)

33 + (2 x prime) = 39

39 + (4 x prime) = 51

etc...

prime = 7

7 + (2 x prime) = 21

21 + (4 x prime) = 49

49 + (2 x prime) = 63

63 + (2 x prime) = 77 (loop around top beginning)

77 + (2 x prime) = 91

91 + (4 x prime) = 119

etc...

And so on. So what I did at the beginning of each prime number, was to pre-calculate (2 x prime) and (4 x prime). Store that into a table that looks like this:

(2 x prime), (4 x prime), (2 x prime), (2 x prime)

Due to the way I’m storing the numbers, I’m actually using 3 bytes. 2 bytes for the offset, and a single byte for the index. So when I do my addition, I add the table’s index to the current value’s index, check to see if it carries, if so, increase the offset by one, then add the table’s offset. That gives me the next value I need to turn off.

I start with a prime of 3, pre-calculate my adding table described above, and then start adding. On each add, I turn off a bit. I keep adding until I equal or exceed a value of 1,000,000. In this case, that is when the offset is greater than 50,000.

Once I do that, I search the data for the next prime. In this case it’s 7. Then when it searches after 7, it see that the 9 is not prime, and continues to 11. And so on and so on until we hit the square root of 1,000,000, which is 1,000.

So thats the general algorithm that I used.

Some implementation details. First, I didn’t write the division routine. I used one that I found online, and then modified it slightly. I didn’t need the full 32 bits that it was calculating, and I also didn’t need the error check. The routine did exactly what I wanted though, it allowed me to divide by 20.

Why 20? Because thats the number of ‘numbers’ stored in a byte. This means that the quotient is my offset. The remained tell me which bit I need to turn off, but how do I go from 20 to 8 bits? With a table:

.byte $00, // 0 %11111110, // 1 $00, // 2 %11111101, // 3 $00, // 4 $00, // 5 $00, // 6 %11111011, // 7 $00, // 8 %11110111, // 9 $00, // 10 %11101111, // 11 $00, // 12 %11011111, // 13 $00, // 14 $00, // 15 $00, // 16 %10111111, // 17 $00, // 18 %01111111 // 19

The reminder from the division is the index to this table. This tells me exactly which bit I need to work with. I look at the data in the table via the offset, and AND it with the value from this table. Then store it back in the data at the offset.

To get the speed, I played the ‘follow the carry bit’ game. I looked through my code for places where I could remove a CLC(CLear Carry) or SEC (SEt Carry) before an ADC (ADd with Carry) or SBC(SuBtract with Carry). Sometimes I reordered something so I didn’t have to do one of those. Every one I removed was 2 cycles per loop iteration.

At one point in time, I had some self modifying code. I eventually got rid of it, because it wasn’t giving me enough, and I needed to the space.

Speaking of space, this whole implementation takes 1K. Thats 1024 bytes. That also includes a summing routine that adds all the prime numbers together after the sieve is complete. I needed to keep it under 1K, because of the memory layout. If it was any larger, I would have to play games moving pieces of the app around in memory, turning off the kernel and I/O pages. It was bad enough i turned off BASIC.

All in all, the sieve run in 55 seconds. Not too shabby for a 30 year old computer running at 1Mhz. I’m sure some of the demo scene people could get it faster, probably a lot faster. This happens to be my first 6502 assembly program in nearly 25 years.

And now the source. For my development environment, I use my MacBookPro with OSX, Kick Assembler, and VICE. I’ll also show my makefile, so others can build this easily. After I got it working, I transfer it to my real, stock, c64, and ran it there as a verification. See my other blog post on how I transferred it. Also, here’s a video of it running on my machine: https://www.youtube.com/watch?v=bl8DUJAPyFU

**Makefile:**

debug : disk # generate a monitor file from the symbol file awk -F' |=\\$$' '{print "al " $$3 " ." $$2}' primes.sym > primes.mon # Read the source file, and add any 'vars' to the monitor file grep "^.var.*\\$$" primes.s | awk -F' | = \\$$' '{print "al " $$3 " ." $$2} ' >> primes.mon echo "break .break" >> primes.mon echo "break .break2" >> primes.mon x64.app/Contents/MacOS/x64 -ntsc -warp -moncommands primes.mon primes.d64 disk : primes x64.app/Contents/MacOS/c1541 -format blah,00 d64 primes.d64 -attach primes.d64 -write primes.prg primes primes : primes.s java -jar kick/KickAss.jar primes.s

**Source (Kick Assmebler code):**

// Prime number finder. First version will seive the primes up to 100. // A basic loader. Generates: // 10 SYS (2064) .pc = $0801 "Basic upstart" :BasicUpstart(start) // A few vars .var bits = $0c00 .var end_bits = bits + $C350 // 50000 bytes .var curr = $19 // some random place in the zero page that seems safe .var curr_index = $1e .var curr_mask_index = $21 .var adder_index = $02 .var bit_mask_index = $41 .var bit_index = $43 .var index_adder_1 = $57 .var index_adder_2 = $59 .var index_adder_3 = $5b .var index_adder_4 = $5d .var mask_adder_1 = $5f .var mask_adder_2 = $60 .var mask_adder_3 = $61 .var mask_adder_4 = $62 .var BIT7 = $fb .var BIT3 = $fc .var calculating_location = $07b0 //.var bit_mask = $87 // 20 bytes! .var divide_amount = $14 // Stuff for divide routine .var divisor = $d8 // 2 bytes .var dividend = divisor + 2 // 4 bytes $da .var div_scratch = dividend + 4 // $de .var div_carry = div_scratch + 1 // $df .var quotient = dividend .var remainder = dividend + 2 .var display_msg_work = $e0 .var display_msg_ptr = $e3 .var index_and_mask_to_hex_work_index = $e4 .var index_and_mask_to_hex_work_hex = $e7 .var index_and_mask_to_hex_work_mask = $eb .var hex_lo = $ec .var hex_hi = $ed .var bit_mask = $87 .pc = $0810 // 2064, Enough room over the BasicStart start: // Prep some vars lda #$80 sta BIT7 lda #$08 sta BIT3 // Turn off the basic rom lda #$36 sta $01 // Back up the zero page, so I can reload basic afterwards // This is going into the IO area, but all writes go to RAM, no // matter how things area configured ldx #$ff copy_zp: lda $00,x sta $e000,x dex bne copy_zp // Black out the screen lda #$00 sta $d021 sta $d020 // clear the screen ldx #$00 lda #$20 clear_screen: dex sta $0400,X sta $0500,X sta $0600,X sta $0700,X bne clear_screen // Display the message lda #$17 // Y pha lda #$0a // X pha lda #[clearing_memory_msg_end - clearing_memory_msg] // length pha lda #<clearing_memory_msg pha lda #>clearing_memory_msg pha jsr display_msg // Clear some memory lda #$FF ldx #>end_bits outer_clear_mem: stx clear_mem + 3 // self modding code. ldy #$00 clear_mem: dey sta bits,y // Self modding code. bne clear_mem dex cpx #>bits - 1 bne outer_clear_mem // Copy 20 bytes for the bit mask. ldx #$00 copy_table: lda bit_mask_table, x sta bit_mask, x inx cpx #$14 // compare to 20 bne copy_table lda #divide_amount sta divisor lda #$00 sta divisor + 1 // We know that 1 is not prime, so lets turn that bit off now. lda #$FE sta bits // Some initialization lda #$00 sta bit_index lda #$03 sta bit_mask_index lda #>bits sta bit_index + 1 // Start with 3 as the first number to seive lda #$03 sta curr lda #$00 sta curr + 1 sta curr + 2 sta curr + 3 sta curr + 4 // Display the message lda #$17 // Y pha lda #$0a // X pha lda #[calculating_msg_end - calculating_msg] // length pha lda #<calculating_msg pha lda #>calculating_msg pha jsr display_msg next_number: // Copy the number to the accum. // Add the 'curr' to the accum num to get a 2x of the curr in // accum. Store that in adder_1, adder_3 and adder_4 lda curr sta dividend lda curr + 1 sta dividend + 1 lda curr + 2 sta dividend + 2 // Save these so that they are zero lda #$00 sta dividend + 3 jsr divide // Save the quotient to the curr_index, for later use lda quotient sta curr_index lda quotient + 1 sta curr_index + 1 // Save the remainder to the curr_mask_index, for later use lda remainder sta curr_mask_index // Now multiply curr_index and curr_mask_index by 2 to get the adders asl // remainder cmp #$14 bcc mask_adder_skip_carry_1 sbc #$14 sec // Make sure the carry flag is set for the rol. mask_adder_skip_carry_1: sta mask_adder_1 sta mask_adder_3 sta mask_adder_4 lda quotient rol sta index_adder_1 sta index_adder_3 sta index_adder_4 lda quotient + 1 rol sta index_adder_1 + 1 sta index_adder_3 + 1 sta index_adder_4 + 1 lda mask_adder_1 asl // remainder cmp #$14 bcc mask_adder_skip_carry_2 sbc #$14 sec // Make sure the carry flag is set for the rol. mask_adder_skip_carry_2: sta mask_adder_2 lda index_adder_1 rol sta index_adder_2 lda index_adder_1 + 1 rol sta index_adder_2 + 1 // Now we have our adders. Lets start adding. lda #$FF sta adder_index add_loop: // Determine which adder we're using. inc adder_index // Do the bin version first lda adder_index and #$03 // Store this, its for the index_adder tax asl // multiple by 2 tay // store it temporarily // Figure out the mask_index // No need for clc here. lda mask_adder_1, X adc bit_mask_index // Are we >= 20 cmp #$14 bcc no_mask_carry // Yes, subtract 20 from this, and carry to the bit_index // Note that carry flag is set here. sbc #$14 // Let the carry flag fall through to the adc bit_index after the no_mask_carry. no_mask_carry: sta bit_mask_index // Now figure out the bit_index tya tax // We don't want to clc here, since it may fall from the sbc above. lda index_adder_1, X adc bit_index sta bit_index lda index_adder_1 + 1, X adc bit_index + 1 sta bit_index + 1 // Are we done? We already have the byte we want in there, no need to load cmp #>[end_bits - 1] bcc r_we_done_2 // <, if its less than, skip the second test r_we_done: lda bit_index cmp #<[end_bits - 1] bcc r_we_done_2 // < Both of these mean we continue beq r_we_done_2 // = Both of these mean we continue jmp find_next_num r_we_done_2: // We now have the index we need, and an offset to our bit mask. Turn that // bit off. ldy #$00 ldx bit_mask_index lda bit_mask,X and (bit_index),y sta (bit_index),y jmp add_loop find_next_num: // Find the bit/mask index for the curr num clc lda #>bits adc curr_index + 1 sta curr_index + 1 // curr_mask_index has the mask index ldx curr_mask_index next_num_loop: inx inx cpx #$14 bcc no_mask_carry_2 // No clc needed here txa sbc #$14 tax lda curr_index adc #$00 sta curr_index lda curr_index + 1 adc #$00 sta curr_index + 1 no_mask_carry_2: lda bit_mask, X beq next_num_loop // skip 0's // Now check the bit in that location ldy #$00 ora (curr_index), Y cmp #$FF bne next_num_loop // Store Y in the remainer stx curr_mask_index stx bit_mask_index lda curr_index sta bit_index lda curr_index + 1 sta bit_index + 1 // Calc the number from the index and mask. // First sub the offset from the curr_index + 1 lda #>bits sec sbc curr_index + 1 sta curr_index + 1 lda curr_index sta curr lda curr_index + 1 sta curr + 1 lda #$00 sta curr + 2 sta curr + 3 // Multiply by 16 ldx #$00 multi_loop: asl curr rol curr + 1 rol curr + 2 rol curr + 3 inx cpx #$04 bne multi_loop // Times 4 lda #$00 sta curr_index + 2 asl curr_index rol curr_index + 1 rol curr_index + 2 asl curr_index rol curr_index + 1 rol curr_index + 2 // Add in the mask index clc lda curr_mask_index sta bit_mask_index adc curr_index sta curr_index lda #$00 adc curr_index + 1 sta curr_index + 1 lda #$00 adc curr_index + 2 sta curr_index + 2 // Now add'em clc lda curr adc curr_index sta curr lda curr + 1 adc curr_index + 1 sta curr + 1 lda curr_index + 2 adc curr + 2 sta curr + 2 lda #$00 // continue in case we carry? adc curr + 3 sta curr + 3 // Display something jsr display_curr // Have we hit the square root? lda curr +1 cmp #$03 bcc b_we_done_2 // <, if its less than, skip the second test beq b_we_done // =, need to do the second test jmp summit b_we_done: lda curr cmp #$e8 bcc b_we_done_2 // < Both of these mean we continue beq b_we_done_2 // = Both of these mean we continue jmp summit b_we_done_2: jmp next_number summit: // Display the message lda #$17 // Y pha lda #$0a // X pha lda #[summing_msg_end - summing_msg] // length pha lda #<summing_msg pha lda #>summing_msg pha jsr display_msg // Sum things together. lda #$07 // 2 and 5 are prime. sta curr lda #$00 sta curr + 1 sta curr + 2 sta curr + 3 sta curr + 4 lda #<bits sta curr_index lda #>bits sta curr_index + 1 ldx #$01 // Start on the first bit, next it'll advance to the 3rd bit next_num_loop_2: ldy #$00 inx inx cpx #$14 bcc no_mask_carry_3 // No clc needed here txa sbc #$14 tax lda curr_index adc #$00 sta curr_index bcc no_mask_carry_3 inc curr_index + 1 no_mask_carry_3: // Are we done? lda curr_index + 1 cmp #>[end_bits - 1] bcc r_we_done_4 // <, if its less than, skip the second test r_we_done_3: lda curr_index cmp #<[end_bits - 1] bcc r_we_done_4 // < Both of these mean we continue beq r_we_done_4 // = Both of these mean we continue jmp end_prg r_we_done_4: lda bit_mask, X beq next_num_loop_2 // skip 0's // Now check the bit in that location ora (curr_index), Y cmp #$FF bne next_num_loop_2 stx curr_mask_index // We now have the index and mask, convert to a hex value. txa sta index_and_mask_to_hex_work_mask lda curr_index sta index_and_mask_to_hex_work_index lda curr_index + 1 sec sbc #>bits sta index_and_mask_to_hex_work_index + 1 jsr index_and_mask_to_hex clc lda index_and_mask_to_hex_work_hex adc curr sta curr lda index_and_mask_to_hex_work_hex + 1 adc curr + 1 sta curr + 1 lda index_and_mask_to_hex_work_hex + 2 adc curr + 2 sta curr + 2 lda index_and_mask_to_hex_work_hex + 3 adc curr + 3 sta curr + 3 lda #$00 adc curr + 4 sta curr + 4 // Display something jsr display_curr ldx curr_mask_index jmp next_num_loop_2 end_prg: // copy the zp back sei // Let us see the io page. lda #$34 sta $01 // Copy the data back to the zp. ldx #$02 copy_zp_3: lda $e000,x sta $00,x inx bne copy_zp_3 // Renable the io page and basic. lda #$37 sta $01 cli break: rts display_curr: ldx #$05 ldy #$FE display_loop: dex iny iny txa pha tya pha lda curr, x jsr get_hex pla tay pla tax lda hex_hi sta calculating_location, y lda hex_lo sta calculating_location + 1, y cpx #$00 bne display_loop rts get_hex: // Take a byte in A, and stores the chars to display in x and Y tay and #$f0 lsr lsr lsr lsr tax lda print_table,x sta hex_hi tya and #$0f tax lda print_table,x sta hex_lo rts divide: ldx #$11 // over at the same time as shifting the answer in, the // operation must start AND finish with a shift of the // low cell of the dividend (which ends up holding the // quotient), so we start with 17 (11H) in X. clc div_loop: lda #$00 sta div_carry // Zero old bits of CARRY so subtraction works right. rol divisor+2 // -JMS- Move low cell of dividend left one bit, also shifting rol divisor+3 // -JMS- answer in. The 1st rotation brings in a 0, which later // gets pushed off the other end in the last rotation. dex // beq div_end // Branch to the end if finished. // rol divisor+4 // -JMS- Shift high cell of dividend left one bit, also rol div_carry // Store old high bit of dividend in CARRY. (For STZ // one line up, NMOS 6502 will need LDA #0, STA CARRY.) sec // See if divisor will fit into high 17 bits of dividend lda divisor+4 // -JMS- by subtracting and then looking at carry flag. sbc divisor // First do low byte. sta divisor+6 // Save difference low byte until we know if we need it. lda div_carry // Bit 0 of CARRY serves as 17th bit. sbc #$0 // Complete the subtraction by doing the 17th bit before bcc div_loop // determining if the divisor fit into the high 17 bits // of the dividend. If so, the carry flag remains set. lda divisor+6 // If divisor fit into dividend high 17 bits, update sta divisor+4 // -JMS- dividend high cell to what it would be after bcs div_loop oflo: lda #$FF // If overflow occurred, put FF sta divisor+2 // in remainder low byte sta divisor+3 // and high byte, sta divisor+4 // and in quotient low byte sta divisor+5 // and high byte. div_end:rts display_ret: .byte 0, 0 display_msg: pla sta display_ret pla sta display_ret + 1 pla sta display_msg_work + 1 pla sta display_msg_work pla sta display_msg_work + 2 // Length pla tax pla tay lda #$04 sta display_msg_ptr + 1 lda #$00 sta display_msg_ptr add_y: cpy #$00 beq done_y clc lda #40 adc display_msg_ptr sta display_msg_ptr lda #$00 adc display_msg_ptr + 1 sta display_msg_ptr + 1 dey jmp add_y done_y: txa clc adc display_msg_ptr sta display_msg_ptr lda #$00 adc display_msg_ptr + 1 sta display_msg_ptr + 1 ldy #$00 write_to_screen: lda (display_msg_work),Y sta (display_msg_ptr),Y iny cpy display_msg_work + 2 bne write_to_screen lda display_ret + 1 pha lda display_ret pha rts index_and_mask_to_hex: lda index_and_mask_to_hex_work_index sta index_and_mask_to_hex_work_hex lda index_and_mask_to_hex_work_index + 1 sta index_and_mask_to_hex_work_hex + 1 lda #$00 sta index_and_mask_to_hex_work_hex + 2 sta index_and_mask_to_hex_work_hex + 3 // Multiply by 16 ldx #$04 multi_loop_2: asl index_and_mask_to_hex_work_hex rol index_and_mask_to_hex_work_hex + 1 rol index_and_mask_to_hex_work_hex + 2 rol index_and_mask_to_hex_work_hex + 3 dex bne multi_loop_2 // Times 4 lda #$00 sta index_and_mask_to_hex_work_index + 2 asl index_and_mask_to_hex_work_index rol index_and_mask_to_hex_work_index + 1 rol index_and_mask_to_hex_work_index + 2 asl index_and_mask_to_hex_work_index rol index_and_mask_to_hex_work_index + 1 rol index_and_mask_to_hex_work_index + 2 // Add in the mask index clc lda index_and_mask_to_hex_work_mask adc index_and_mask_to_hex_work_index sta index_and_mask_to_hex_work_index lda #$00 adc index_and_mask_to_hex_work_index + 1 sta index_and_mask_to_hex_work_index + 1 lda #$00 adc index_and_mask_to_hex_work_index + 2 sta index_and_mask_to_hex_work_index + 2 // Now add'em clc lda index_and_mask_to_hex_work_hex adc index_and_mask_to_hex_work_index sta index_and_mask_to_hex_work_hex lda index_and_mask_to_hex_work_hex + 1 adc index_and_mask_to_hex_work_index + 1 sta index_and_mask_to_hex_work_hex + 1 lda index_and_mask_to_hex_work_hex + 2 adc index_and_mask_to_hex_work_index + 2 sta index_and_mask_to_hex_work_hex + 2 lda index_and_mask_to_hex_work_hex + 3 // continue in case we carry? adc #$00 sta index_and_mask_to_hex_work_hex + 3 rts clearing_memory_msg: .text "clearmem " clearing_memory_msg_end: calculating_msg: .text "calulating:" calculating_msg_end: summing_msg: .text "summing :" summing_msg_end: // I should see if I find space in the zero page for this. 20 bytes bit_mask_table: .byte $00, // 0 %11111110, // 1 $00, // 2 %11111101, // 3 $00, // 4 $00, // 5 $00, // 6 %11111011, // 7 $00, // 8 %11110111, // 9 $00, // 10 %11101111, // 11 $00, // 12 %11011111, // 13 $00, // 14 $00, // 15 $00, // 16 %10111111, // 17 $00, // 18 %01111111 // 19 print_table: .byte $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $01, $02, $03, $04, $05, $06, $07

I wonder what the fastest one could do this in BASIC would be - without resorting to ML subroutines and PEEK/POKE memory manipulation. Due to memory limitations, you'd have to resort to sending to disk, printer, or even modem/RS232 output. I remember writing these sorts of things as a kid, but as someone who wasn't a mathematical prodigy, more often than not my efforts were thoroughly brute-force.

ReplyDelete