Speed considerations

   This is as side result of my recent making of audio background player for Atari STE, and of course with all my long term experience with programming.

What makes code fast, or better said executing fast ?
Generally: using fast instructions, and as little times as is possible. Good algorythm and concept. From first statement results usage of Assembler - then code overhead may be minimal. Compilers can not make so efficient code as experienced human. Usage of ASM is especially welcome on slow machines as Atari ST serie.

The concrete case: audio samplerate conversion on-fly :
Hippy Dave on Atari forum posted his code for doing samplerate conversion with comment that final goal is audio CD playback on STE machines. He meant for sure not usual way: using CD ROM drive's audio output, but playback via STE DMA audio, using digital audio data from disk. Regular CD data rate is not too big, some 160 KB/sec. Parameters are: samplerate 44100 Hz, Stereo, 16-bit PCM, signed. But STE (and TT, Falcon) have different samplerates : 50066, 25033 etc.
So, framerate conversion during playback is necessary (or in case of extracting CD audio for later playback from Atari hard disk). Quality conversion needs pretty much calculation, as framerates are not dividable.
I said in that discussion that on-fly samplerate conversion is not possible, even not with 16 MHz CPU. I was wrong. It is possible, even with 8 MHz CPU (STE). But it is on limit. Uses about 80 % of CPU time, so not much left for load from disk. Concrete: over 1100 KB/sec transfer rate is required for flawless playback without breaks. UltraSatan may achieve it, but not with all SD cards. 1GB Kingston what I have can 1170 KB/sec, and it plays well. But if you move mouse intensive it will repeat part of WAV as result of too much total CPU load .
To achieve this, a lot of time was spent and many things changed in original code. + it is result of little luck: I did testings with regular MS WAV file, what is same format as CDA . Luck is in fact that byte order is reversed (Intel), what saves come cycles by reading src. data. In first time as benefit I considered signed format too, as it saves also some cycles by conversion, but I think now that it is not relevant when use precalculated table - as it can be made with unsigned-signed conversion included.

How it went:
First step is usually establishing algorythm. For quality samplerate conversion. I thinkered following method (don't know is it used by some SW) :
Making time-scale, where we have sample points of src. and sample points of dest. , which are of course on different positions. In first variant I choosed 16-bit time scale for better quality. Then , as in our case dest rate is bigger we will have src. sample points on every 65536th pos. Dest on every src/dest rate x 65536th pos.
It means that we need only one multiplication in calculation, and instead even slower division we can use fast swap command (as division is by 65536).
The dest sample value will be weighted average of 2 adjacent src. sample points. Of course closer to value of closer point, in proper ratio. I think that it is pretty good method, and not too much CPU demanding. May be not accurate with very high frequencies , but whole digital sampling with 44100 is not accurate too with high freqs.

First converting code variant:

* Conversion loop :

a1 - src, a2 - dest . In d3, d4 previous samples for L and R .
d7 - dest sample pos, d6 - dest sample pos step .

nextSamp
   move.w (a1),d2    * Next sample
   ror.w   #8,d2    * Reverse MSB - LSB order
   ror.w   #8,d3
* Find value between :
   sub.w d3,d2    * B-A
   smi     d0    * Set flag if signed
   muls    d7,d2

   swap   d2    * Div by 65536 in fact, result is 16-bit signed value needed

* but mult was only by 32768 !

   lsr.w   #7,d2   * We need 8 MSB bits, where sign bit is too
   tst.b     d0
   beq.s    act1
   bset    d1,d2
   bra.s act2
act1 bclr d1,d2

act2
   lsr.w   #8,d3
   add.b   d3,d2
   move.b   d2,(a2)+

move.w (a1)+,d3

   tst.b   stereo
   beq.s   notSt2

   move.w (a1),d2
   ror.w   #8,d2    * Reverse MSB - LSB order
   ror.w   #8,d4
   sub.w d4,d2    * B-A
   smi    d0
   muls    d7,d2
   swap   d2    * Div by 65536 in fact, result is 8-bit value needed
   lsr.w   #7,d2
   tst.b     d0
   beq.s    act3
   bset    d1,d2
   bra.s act4
act3 bclr d1,d2

act4
   lsr.w   #8,d4
   add.b   d4,d2
   move.b   d2,(a2)+

   move.w (a1)+,d4

notSt2
   add.w d6,d7   * next sample point for output
   bclr d5,d7    * test and clear bit 15
   beq.s InBetween

* If buffer end reached must reload or save :

   cmp.l   a1,a4
   bgt.s   notSrcE
   bsr    reloadSrc
   tst.l    d0
   beq   finIsh
.........

There is more code for diverse cases as 'in between case' when 2 adjacent dest samples occur without src. sample between them etc .

Here worked with complete 16-bit src sample values, for better quality (in theory)..
Conversion quality was good,. Time of conversion more than 3 x audio duration time. Far from on-fly ...

   Little later variant, with only 8-bit time base :

nextSamp
   move.w (a1),d2    * Next sample

* Find value between :
   sub.b   d3,d2    * B-A
   beq.s   ch1Same    * skip mult etc. if same
   smi     d0   * flag
   bmi.s   ch1sned
   and.w   d5,d2
   bra.s   ch1mul

ch1sned
* actually, just set bits 15-8 and reset bit 7!

* and.w #$007F,d2
* and.w   d5,d2
* or.w   #$8000,d2
or.w   d1,d2

ch1mul
   muls    d7,d2
* Got 16 bit value, but sign is in bit 32

   lsr.w   #8,d2   * We need 8 MSB bits, where sign bit must be too
   tst.b    d0
   beq.s ch1nsn   * bit 7 = 0 for sure if before mult was d2 plus
   bset d6,d2
ch1nsn
ch1same   add.b   d3,d2
   move.b   d2,(a2)+

move.w (a1)+,d3

   tst.b   stereo
   beq.s   notSt2

   move.w (a1),d2
   sub.b d4,d2    * B-A
   beq.s   ch2Same    * skip mult if same
   smi     d0   * flag
   bmi.s   ch2sned
*   and.w #$007F,d2
   and.w   d5,d2
   bra.s   ch2mul

ch2sned

*   and.w #$007F,d2
*   or.w   #$8000,d2
* and.w   d5,d2
or.w   d1,d2
ch2mul
   muls    d7,d2
   lsr.w   #8,d2
   tst.b    d0
   beq.s ch2nsn   * bit 7 = 0 for sure if before mult was d2 plus
   bset d6,d2
ch2nsn
ch2same   add.b   d4,d2
   move.b   d2,(a2)+

   move.w (a1)+,d4

notSt2
   add.b step(pc),d7   * next sample point for output
   bcc.s InBetween

* If buffer end reached must reload or save :

   cmp.l   a1,a4
   bgt.s   notSrcE
   bsr    reloadSrc
   tst.l    d0
   beq   finIsh

This worked faster some 30 %. Quality was same, I guess. In any case, by using 8-bit time base, tempo error is still less than 0.5 %. And samples are max. 8-bit in dest.
Next, logical step was usage of table with precalculated output values. Required table size is 64KB, as result of 2 x 8-bit variables: sample point and value (diff. in fact). It made most of speed gain.

First 'table' variant:

nextSamp
   move.w (a1),d2    * Next sample

* Find value between :
   sub.b   d3,d2    * B-A
   beq.s   ch1Same    * skip mult etc. if same

   move.w   d7,d1
   move.b    d2,d1

   move.b 0(a3,d1.l),d2

ch1same   add.b   d3,d2
   move.b   d2,(a2)+

   move.w (a1)+,d3

   tst.b   stereo
   beq.s   notSt2

   move.w (a1),d2
   sub.b d4,d2    * B-A
   beq.s   ch2Same    * skip mult if same

*    move.w   d7,d1   * no need again
   move.b    d2,d1

   move.b 0(a3,d1.l),d2

ch2same   add.b   d4,d2
   move.b   d2,(a2)+

   move.w (a1)+,d4

notSt2
   add.w d6,d7   * next sample point for output
* d7, d6 in step 256 - to avoid slow shift
   bcc.s InBetween

* If buffer end reached must reload or save :

   cmp.l   a1,a4
   bgt.s   notSrcE
   bsr    reloadSrc
   tst.l    d0
   beq   finIsh

It was almost good for on-fly. And code had still place for improving :
The 'slow' parts:
1. Reading src twice .
2. beq.s ch1Same takes too many cycles., while in most cases is not same .
3.   move.b 0(a3,d1.l),d2 ; add.b   d3,d2   seq can be replaced with one add. only .
4. Buffer end testings take too much CPU time, as doing them after every 2 written dest. bytes, while buffers are much bigger (64KB).
Latest was hardest to solve - main problem was that src and dest are asynchron.

The final code:

* phase 2 - last values in d3,d4
nextSamp2   * phase 2
   move.w (a1)+,d2    * Next sample
   move.w   d2,d0    * remember
   sub.b   d3,d2    * B-A
   move.w   d7,d1
   move.b    d2,d1
   add.b 0(a3,d1.l),d3
   move.b   d3,(a2)+   * output

* Other channel :
   move.w (a1)+,d2
   move.b d2,d5   * last values to d0, d5
   sub.b d4,d2    * B-A
* d1, d7 still same
   move.b    d2,d1
   add.b 0(a3,d1.l),d4
   move.b   d4,(a2)+

* core of conversion takes only 13 instructions for 2 channels !

* in phase 3 now - do, d5 used to store last values

   add.w d6,d7   * next sample point for output
   bcc InBetween3   * Phase 3

   Phases are intruduced, and end test goes only after 8-bytes of dest. written. Code for sync. buffer loadings is now pretty long, but it executes very rare. Speed is gained by decreasing instruction count in often repeating parts.   No more mono-stereo test, as it takes too much time too. Code only for stereo. For mono needs just slighty different one. But size of code is almost 3 x bigger. Still, not too much - conversion part about 1KB.

Further things affecting speed:
   The way how GEMDOS loads data from hard disks: it tends to do it as fast as possible, but must care that supply only so much data as is asked - to prevent overwriting user program's space where code, data may be. If SW asks only 123 bytes, GEMDOS will write only so much to dest. But loading from hard disks goes different - only complete sectors can be readen. So, GEMDOS uses sector buffers for cases of less data needed. Then driver loads not to dest, but in sector buffers, and GEMDOS copies from it to dest as many bytes as asked. All it is slower than direct reading from disk to dest. And we need max possible speed here.
Solution is to load always data chunks of size multiple of sector sizes. Even better, multiple of cluster sizes. Start position of data inside file must be also on such pos - multiple of cluster size. Then always, data will go direct to dest, what is fastest possible.
In practice it was so: There is 44 byte long header in regular WAV files. When I was made player with on-fly conversion, table variant 1, player loaded 44 bytes to get parameters, and after that started palyback from pos 44. With UltraSatan used, close to limit there was 2-3 case of break during 3 minute song. After correcting code in smart way - replacing header data in buffer with silence and therefore having perfect audio data align at pos 0 practically: playback was flawless on STE with UltraSatan.
It was with 48000 Hz source btw. Probably with 44100 src. it would play well even with 'bad' align. But we solved it for 48000 too, thanx to smart programming, and knowledge about how system works.

   Practical usage
Listening CD audio from CD ROM drive without audio output (is there such ?) .
   Grabbing CD audio for later listening from Atari hard disk - in proper STE format.
   Listening regular WAV files on Ataris, without previous conversion.
   Fast converting of regular WAV files into WAE format.

   Well, first 3 does not seems too attractive: using CD drive's audio output is better way, I guess. Storing 16-bit WAV files on Atari drives takes 2 x more space than in STE format. Maybe CDA grabbing is something really useful - as it can go really fast, together with conversion in proper, space saving format. But it needs fast CD ROM and fast driver SW. In case of interest, I will do necessary program variant for that purpose - with direct playback option, together with necessary speed test tools. But very likely there will be problem with byte order - what means less speed.

   Download test version    Usage: this will work correct only with CDA format audio : 44100 Hz, Stereo, 16-bit signed. Such files are with CDA, BIN or RAW extension - then no header. Or MS WAV with same params - then there is a header too. MS WAV may be 48000 Hz too. Name file TEST.WAV or TEST.RAW (later for RAW or CDA).   If both names are present in DIR, WAV will play - as long it is, then just exit. If only TEST.RAW present, then it will play, of course. This works not together with background player - deinstall ACC before run . Needs 256KB free RAM and no any background process active (full CPU time). Will start playback after some 2 sec pause - table building is slow. In later versions table will be just loaded from disk.