Introduction

Over a year ago, I contributed am mpg123 binding for MPlayer to replace the old fork of mpg123 code dubbed "mp3lib". I can say "old" since we incorporated the added optimizations from that one plus added new stuff from Taihei;-)

Generally, mpg123 performs well compared to MPlayer, but there is a performance hit when using libmpg123 from MPlayer, making it slower than the built-in mp3lib. This is mostly apparent on older systems, where possibly cache misses are a source of trouble.

Specifically, the regression has been reported on K6-III+, of which I happen to have a unit available for testing. It has been confirmed for Athlons (no XP), too. Well, I happen to have a good old Duron 800 at my disposal. That should do the trick there.

Test setup

The software baseline is a current stable debian squeeze (installed from 6.0.3 media), with MPlayer and mpg123 pulled from SVN and built enforcing binaries for a K6-III (also on the Duron), adding in 3DNowExt optimizations via custom assembly on either side.

Furthermore, the data set is, for the sake of comparability with earlier results, the MP3 album "Points of View" by "Convergence". To make sure, I talk of these files with the respective MD5 hashes:

04d953da593d8cfa43aee4d00572ca37 \
 /dev/shm/convergence_-_points_of_view/01 - Bleed.mp3
2999166bab6fe443f6a2c15c8eb5d938 \
 /dev/shm/convergence_-_points_of_view/02 - Strike the end.mp3
d324e12b3f20bea69c9c5e01252d0d96 \
 /dev/shm/convergence_-_points_of_view/03 - Listen.mp3
e0215065002ef2492a8d88c356609c21 \
 /dev/shm/convergence_-_points_of_view/04 - Six feet under.mp3
2b19190e15b5e5a805ccfacd789b01ab \
 /dev/shm/convergence_-_points_of_view/05 - Always the same.mp3
502c4d1e5793352fb25e6e90d6480415 \
 /dev/shm/convergence_-_points_of_view/06 - Breath.mp3
80bffd4e4bd6cc2f330b08e8156070ad \
 /dev/shm/convergence_-_points_of_view/07 - Vanished memories.mp3
de5ebd6e36eebf8c77642fa52f9610bd \
 /dev/shm/convergence_-_points_of_view/08 - Silent.mp3
c4c1167b5ca29c061d75f6104eec9283 \
 /dev/shm/convergence_-_points_of_view/09 - Nothing else.mp3
df624cee3f7b65be9b98e449e122ac63 \
 /dev/shm/convergence_-_points_of_view/10 - Train to leave.mp3

As you may notice: I try to eliminate the hard disk at least for the test data by moving stuff to RAM. Even the K6-III+ has no issue with sparing some of its 512 MiB of RAM (the duron setup has 768 MiB).

The command lines try to do comparable things in mpg123 and MPlayer:

mplayer -ac mp3 -quiet -ao pcm:file=/dev/null \
   /dev/shm/convergence_-_points_of_view/*.mp3
mplayer -ac mpg123 -quiet -ao pcm:file=/dev/null \
   /dev/shm/convergence_-_points_of_view/*.mp3
mpg123 -w /dev/null /dev/shm/convergence_-_points_of_view/*.mp3

I'm trying to combine timings with good old 'time' and measurements with 'perf'. The latter does not give much information on the K6-III+ with respect to cache misses, so I hope to do the grunt work on the Duron and hopefully the findings will transfer to the K6.

As an additional measure to avoid a penalty for libmpg123 loading, the libmpg123.a is linked in statically into MPlayer. The final solution will hopefully make the dynamic lib perform about as well (considering that stand-alone mpg123 does this successfully).

The build of MPlayer is configured via

./configure --extra-cflags='-march=k6-3 -mtune=k6-3 -g'

the mpg123 build via

CFLAGS="-march=k6-3 -mtune=k6-3 -g -O3" ./configure --disable-shared

That results in this set of CFLAGS: -O2 -fomit-frame-pointer -funroll-all-loops -finline-functions -ffast-math -march=k6-3 -mtune=k6-3 -g -O3 Those are largely similar to what MPlayer uses, but not identical. Thinking about this... some of those might be even damaging. But: As long as standalone mpg123 has such a clear lead over MPlayer, I doubt that those make much of a difference.

The situation

How do my K6-III binaries perform on the two systems? Well, let's record the user time to decode the test album (no statistics as long as the result is rather clear).

run	user time K6-III+ / s	user time Duron / s	ratio K6 / Duron
mpg123-debian	51.6	22.4	2.30
mpg123	50.4	22.3	2.26
mplayer-mp3lib	56.5	25.4	2.24
mplayer-mpg123	56.5	28.7	2.33

This is using run-time choice of 3DNowExt routines. It might be interesting to compare runtimes of generic C code, the difference attesting to the respective assembly routine quality. But let's stay focused on the case that is important in practice, and that is using 3DNowExt as the most performant choice on these CPUs.Let's express runtime as percent of plain mpg123 runtime:

run	user time K6-III+	user time Duron
mpg123-debian	102.4 %	100.4 %
mpg123	100.0 %	100.0 %
mplayer-mp3lib	112 %	113.9 %
mplayer-mpg123	112 %	128.7 %

Ivan noted a performance penalty of 8 % on his Athlon. Well, I can offer even more: MPlayer with mpg123 binding adds 13 % on the runtime of MPlayer with mp3lib. Cache size could explain that somewhat, as the Duron does have less of that than an Athlon, though it was not considered that much of a performance difference back then when people cared about comparing Athlons and Durons. Anyhow, good starting point to dig at the reason for the difference, eh? I do remember trying out generic decoders on the K6-III+, there the effect was even larger. If things are unclear, I can resort to that.

Is that unrolling of all loops in mpg123 default flags actually any good?

I'm testing the separately. I have the feeling that that's too much. With full unrolling I get around 22.0 s user time, with normal unrolling, too. It's not decisive without more statistics.

Glimpses with perf

Now let's hook up with Linux Performance Events. On the K6-III+, the information was rather meager, I hope for more stats on the Duron as a more modern / better supported CPU. But perhaps it's just my novice usage of perf, which I didn't know before. Oh, hm. Let's see. Ah, yes... all the goodies Ivan used in his perf invocation relies on a PMU being present. That's not the case for the K6.

altwarentester@k6-3:~/mplayer$ perf record -f \
 -e cycles:u -e instructions:u -e l1-dcache-loads:u \
 -e l1-dcache-load-misses:u -e branch-misses \
 ./mplayer -ac mp3 -quiet -ao pcm:file=/dev/null  \
 /dev/shm/convergence_-_points_of_view/*.mp3

  Error: perfcounter syscall returned with -1 (No such device)

  Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

It seems the only diagnostic I get is cycles, and that's not very special. On the Duron:

altwarentester@duron:~/mplayer$ perf record -f \
 -e cycles:u -e instructions:u -e l1-dcache-loads:u \
 -e l1-dcache-load-misses:u -e branch-misses \
 ./mplayer -ac mp3 -quiet -ao pcm:file=/dev/null  \
 /dev/shm/convergence_-_points_of_view/*.mp3
  Error: failed to mmap with 1 (Operation not permitted)

Well. That's nearly success. I have to increase the allowed locked memory count from the default 64 K.

Now let's see what we've got. Hm, recording works, but I don't get how to report all that wonderful information. Plain perf report doesn't tell more than usual. I can do that call graph thing, but also here, there is still some strangeness going on with the mpg123 symbols. I'll figure that out.

In the meantime, there is perf stat, which gives a nice overall picture:

altwarentester@duron:~/mplayer$ perf stat ./mplayer -ac mpg123 \
 -quiet -ao pcm:file=/dev/null    /dev/shm/convergence_-_points_of_view/*.mp3

   29435.786923  task-clock-msecs         #      0.994 CPUs 
            216  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6780  page-faults              #      0.000 M/sec
    23502185554  cycles                   #    798.422 M/sec
    22870306002  instructions             #      0.973 IPC  
     8562390418  cache-references         #    290.884 M/sec
       17327364  cache-misses             #      0.589 M/sec

   29.605841362  seconds time elapsed

altwarentester@duron:~/mplayer$ perf stat ./mplayer -ac mp3 \
 -quiet -ao pcm:file=/dev/null    /dev/shm/convergence_-_points_of_view/*.mp3

   26210.959074  task-clock-msecs         #      0.993 CPUs 
            207  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6290  page-faults              #      0.000 M/sec
    20923114721  cycles                   #    798.258 M/sec
    20818859901  instructions             #      0.995 IPC  
     7867755901  cache-references         #    300.170 M/sec
       10962558  cache-misses             #      0.418 M/sec

   26.387378954  seconds time elapsed

altwarentester@duron:~$ perf stat mpg123/src/mpg123 -w /dev/null \
 /dev/shm/convergence_-_points_of_view/*.mp3

   23163.487659  task-clock-msecs         #      0.996 CPUs 
            140  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            337  page-faults              #      0.000 M/sec
    18501736597  cycles                   #    798.746 M/sec
    20408582208  instructions             #      1.103 IPC  
     7871105536  cache-references         #    339.807 M/sec
       11304577  cache-misses             #      0.488 M/sec

   23.246529422  seconds time elapsed

The K6 is not that generous with details (not counted from cycles on). Regarding cache misses, that's 17 million for mplayer-mpg123, 11 million for mplayer-mp3 and a similar 11 million for plain mpg123. OK, it is sub-optimal. I know that. Now how to dig deeper? I do wonder if switching to generic C code would help since this reduces the amount of functions involved.

The situation with generic C code

Let's try that: Force MPlayer not to activate special optimizations, tell mpg123 to stick to generic decoding. Hm. I see that mp3lib generic code is actually the i386 code. Well, then, two can play that game.

altwarentester@duron:~$ perf stat mpg123/src/mpg123 --cpu i386 \
 -w /dev/null /dev/shm/convergence_-_points_of_view/*.mp3

   44296.192117  task-clock-msecs         #      0.996 CPUs 
            263  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            335  page-faults              #      0.000 M/sec
    35386827671  cycles                   #    798.868 M/sec
    34153833893  instructions             #      0.965 IPC  
    10346222151  cache-references         #    233.569 M/sec
       13440804  cache-misses             #      0.303 M/sec

   44.465953954  seconds time elapsed

altwarentester@duron:~/mplayer-generic$ perf stat ./mplayer -ac mp3 \
  -quiet -ao pcm:file=/dev/null    /dev/shm/convergence_-_points_of_view/*.mp3

   32032.688338  task-clock-msecs         #      0.994 CPUs 
            233  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6284  page-faults              #      0.000 M/sec
    25575954000  cycles                   #    798.433 M/sec
    28423479288  instructions             #      1.111 IPC  
     9061572976  cache-references         #    282.885 M/sec
       13821093  cache-misses             #      0.431 M/sec

   32.228199936  seconds time elapsed

altwarentester@duron:~/mplayer-generic$ perf stat ./mplayer -ac mpg123 \
  -quiet -ao pcm:file=/dev/null    /dev/shm/convergence_-_points_of_view/*.mp3

   49121.685292  task-clock-msecs         #      0.995 CPUs 
            309  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6781  page-faults              #      0.000 M/sec
    39219721233  cycles                   #    798.420 M/sec
    35350587000  instructions             #      0.901 IPC  
    10824302596  cache-references         #    220.357 M/sec
       21094060  cache-misses             #      0.429 M/sec

   49.370325203  seconds time elapsed

Wait a moment... This looks a lot like mpg123's C code being mashed into an inferior binary form. It's still slower from MPlayer than from stand-alone mpg123, but the comparison to mp3lib is harsh. But then, wait a bit more: Is MPlayer really playing fair? Did I hack mp3lib correctly? Those 32 seconds aren't that much more than 26 seconds. Let's record.

Hm, well... it looks like generic code, alright. Building mpg123 with -O4 and the rest of the MPlayer CFLAGS doesn't change too much: runtime stays around 44 seconds for standalone mpg123. But the comparison is still somewhat interesting: Cache is not the point here.

altwarentester@duron:~/mpg123-mpopt$ perf stat src/mpg123 --cpu i386 \
 -w /dev/null    /dev/shm/convergence_-_points_of_view/*.mp3

   44294.299679  task-clock-msecs         #      0.996 CPUs 
            272  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            325  page-faults              #      0.000 M/sec
    35383207571  cycles                   #    798.821 M/sec
    34414257475  instructions             #      0.973 IPC  
    10373238416  cache-references         #    234.189 M/sec
       17922320  cache-misses             #      0.405 M/sec

   44.451762052  seconds time elapsed

Well, that fits with the earlier comparison: Differences in cache hit or miss are not really significant, the mpg123 C code plainly sucks!

So, I got two issues at hand:

With 3DNowExt code, there is a definite penalty that MPlayer gets compared to stand-alone mpg123. This might have to do with CPU cache performance.
The non-assembly mpg123 code doesn't work out well compared to MPlayer's C code. I had a peek and it really looks like the synth function just needs (much) more time.

Great. Which one will I fix? I must admit that I am tempted to find out what works differently with the mpg123 C code in its two incarnations. Well, for starters, it could be the change I had to make to give the code thread-safety. That meant movement of quite some data structures from static variables to the handle on the heap.

Let'c check mpg123-0.68, which predates the move to the thread-safe library, with the MPlayer CFLAGS:

altwarentester@duron:~/mpg123-0.68$ perf stat src/mpg123 --cpu i386 \
 -w /dev/null    /dev/shm/convergence_-_points_of_view/*.mp3

   43475.950745  task-clock-msecs         #      0.996 CPUs 
            264  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            317  page-faults              #      0.000 M/sec
    34730969332  cycles                   #    798.855 M/sec
    34317362404  instructions             #      0.988 IPC  
    10323913086  cache-references         #    237.463 M/sec
       12847122  cache-misses             #      0.295 M/sec

   43.628917385  seconds time elapsed

Does not look like that much of an extra hit. Well, about 2 % difference to the current version there. So, it's not nothing. But in that range I lean to argue that there has to be some cost involved in making the code thread-safe, with all that data in the handle and al. Of course, my approach to libmpg123 should have room for cleanup and optimization per se, but we are not talking about 2 % here, compared to mp3lib. This is more serious than the simple transition to libmpg123.

It is difficult to test earlier versions of mpg123 since they happen not to like my test data. One should not forget that I added bug fixes over time! Well, I suspect my approach to CPU runtime optimization choice. Found a set of old MP3s to work on (``Dune'', by ... eh... ``Dune'')... and the difference is appaling:

altwarentester@duron:~/mpg123-0.59r$ perf stat ./mpg123 -t \
 /home/altwarentester/dune/*.mp3

   23312.117777  task-clock-msecs         #      0.996 CPUs 
            152  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
          14344  page-faults              #      0.001 M/sec
    18623038218  cycles                   #    798.857 M/sec
    24254735169  instructions             #      1.302 IPC  
     8597454473  cache-references         #    368.798 M/sec
        4262514  cache-misses             #      0.183 M/sec

   23.394514625  seconds time elapsed

altwarentester@duron:~/mpg123-0.68$ perf stat ./mpg123 -t \
 /home/altwarentester/dune/*.mp3

   32211.610636  task-clock-msecs         #      0.996 CPUs 
            199  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            296  page-faults              #      0.000 M/sec
    25731964846  cycles                   #    798.841 M/sec
    30810795898  instructions             #      1.197 IPC  
    10904946603  cache-references         #    338.541 M/sec
        7285056  cache-misses             #      0.226 M/sec

   32.326412050  seconds time elapsed

Frick. That is a huge hit. Hah! I even forgot to specify --cpu i386. With that, I'm at around 70 seconds. Something smells fishy.

What happened to mpg123 on the way to 0.68?

Comparing 0.59r and 0.68 furher... decoding one of the dune tracks, where the old version needs 2.5 seconds and the new one over 6 seconds. Reports:

 
altwarentester@duron:~/mpg123-0.59r$ perf report
# Samples: 15180
#
# Overhead      Command                  Shared Object  Symbol
# ........  ...........  .............................  ......
#
    45.76%       mpg123  ./mpg123                       [.] III_dequantize_sample
    20.47%       mpg123  ./mpg123                       [.] dct64
    15.01%       mpg123  ./mpg123                       [.] dct36
     9.65%       mpg123  ./mpg123                       [.] do_layer3
     3.41%       mpg123  ./mpg123                       [.] synth_1to1
     1.98%       mpg123  ./mpg123                       [.] dct12
     1.03%       mpg123  /lib/i686/cmov/libc-2.11.2.so  [.] memcpy

altwarentester@duron:~/mpg123-0.68$ perf report
# Samples: 44859
#
# Overhead      Command                  Shared Object  Symbol
# ........  ...........  .............................  ......
#
    65.42%       mpg123  ./src/mpg123                   [.] synth_1to1_i386
    15.60%       mpg123  ./src/mpg123                   [.] III_dequantize_sample
     7.63%       mpg123  ./src/mpg123                   [.] dct64_i386
     4.91%       mpg123  ./src/mpg123                   [.] dct36
     3.07%       mpg123  ./src/mpg123                   [.] do_layer3
     0.70%       mpg123  ./src/mpg123                   [.] dct12
     0.34%       mpg123  ./src/mpg123                   [.] III_get_scale_factors_1
     0.33%       mpg123  [kernel]                       [k] __copy_to_user_ll
     0.12%       mpg123  [kernel]                       [k] generic_file_aio_read

I seriously borked up the synth function for this CPU. But remember: This is code that is not run in the initially considered case, only when I enforce avoidance of the optimized decoders. But it could be symptomatic.

Oh, dear.

I listened to some output from 0.59r. It's crap, distorted something. So, perhaps that just doesn't count. weren't it for the finding that the synth function is more heavy also compared to mp3lib. Taking a look at mpg123 0.60. Perhaps that's more sane. No, it isn't. The fun fact is that it works fine on my 64 bit laptop with generic decoder... hm... would generic work? Yes, the generic decoder works also on the Duron. But it's slow, just about as current mpg123.

Fun question: Does MPLayer with the i386 mp3lib decode correctly, like current mpg123? Yes it does. Result looks good. Computing time looks better.

Last resort of mpg123 without multi-cpu code: 0.65. Will that produce proper output? Will it be faster? It produces crap. Faster. Only mpg123-0.66 finally produces proper output with i386 decoder. But is that a generic property? Shouldn't I have noticed such breakage before? I don't see it mentioned in NEWS.

There it is:

------------------------------------------------------------------------
r577 | thor | 2007-02-22 03:04:23 +0100 (Do, 22. Feb 2007) | 4 Zeilen

fixing i386 code with gcc-4.1 by using the old WRITE_SAMPLE again - the
new one has distorted sound and gcc complains about strict aliasing ...
the old WRITE_SAMPLE even seems to be faster with gcc-3.4!
Of course I didn't (yet) test on my real i386DX40 if there's an advantage.
Started mocking with cpuflags to get proper MMX and SSE detection.

------------------------------------------------------------------------

Well, there it may be. Let's switch around WRITE_SAMPLE. And we got a change from

    2566.163526  task-clock-msecs         #      0.994 CPUs 
             21  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            300  page-faults              #      0.000 M/sec
     2047880985  cycles                   #    798.032 M/sec
     2041781623  instructions             #      0.997 IPC  
      798943574  cache-references         #    311.338 M/sec
        4049486  cache-misses             #      1.578 M/sec

    2.581971582  seconds time elapsed

    6272.432955  task-clock-msecs         #      0.989 CPUs 
             51  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            300  page-faults              #      0.000 M/sec
     5006116015  cycles                   #    798.114 M/sec
     4398072978  instructions             #      0.879 IPC  
     1320889896  cache-references         #    210.587 M/sec
        4730705  cache-misses             #      0.754 M/sec

    6.343654290  seconds time elapsed

Very well. Seems like I have to learn from MPlayer about fixing up the new WRITE_SAMPLE. And yes, there is something in revision 7300; something about gcc 3.x being too clever for the double - int conversion trick. Applying the fix from MPlayer to mpg123 0.65 yields

    5213.636059  task-clock-msecs         #      0.985 CPUs 
             46  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            301  page-faults              #      0.000 M/sec
     4161133383  cycles                   #    798.125 M/sec
     4203986268  instructions             #      1.010 IPC  
     1302169739  cache-references         #    249.762 M/sec
        4555437  cache-misses             #      0.874 M/sec

    5.293784289  seconds time elapsed

(and good-sounding output, of course).

Going back to the original test album, writing to /dev/null (the stuff before wrote to a real file to check the output):

   35322.795994  task-clock-msecs         #      0.996 CPUs 
            215  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            315  page-faults              #      0.000 M/sec
    28215905365  cycles                   #    798.802 M/sec
    32624122756  instructions             #      1.156 IPC  
    10324164674  cache-references         #    292.281 M/sec
       12229060  cache-misses             #      0.346 M/sec

   35.457957266  seconds time elapsed

That's not 70 seconds. WRITE_SAMPLE is a sore spot, but that it is that influential on the Duron is a surprise. I have to check on a more modern system, with a 32 bit install. But where came the 70 seconds from anyway? The current mpg123 had only about 44 seconds. This CPU is sensitive. And don't forget: I observed the old WRITE_SAMPLE even being faster back then!

Anyhow, 35 seconds is still above the 32 seconds mp3lib achieves. There is more to the generic/i386 C code. Stay tuned.

Found MPlayer rev. 28866. Reducing the size of bandInfoStruct bring that:

   35303.691456  task-clock-msecs         #      0.996 CPUs 
            215  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            314  page-faults              #      0.000 M/sec
    28199408110  cycles                   #    798.767 M/sec
    32624433817  instructions             #      1.157 IPC  
    10330702614  cache-references         #    292.624 M/sec
       14060449  cache-misses             #      0.398 M/sec

   35.430962884  seconds time elapsed

In short: nothing. The preceeding rev. 28865 has some const stuff. That might play a role. Then there is Zuxy's rev. 23485 that changed some static vars to automatic. Generally, he worked on SSE stuff, that's not of interest right now. There is the fast_memcpy business. I should check that. Rev. 22158 does buffer alignment. Revision 18946 fixes something in the 3DNow DCT64 ... and makes we wonder what exactly is the relation of mpg123's and MPlayers plain 3DNow code. That is another matter to investigate.

For now, it seems that it is fruitful enough to pursue merging of apparent improvements in the generic/overall C code of mp3lib that we indeed still miss in mpg123.

Hacking the ``new'' WRITE_SAMPLE into current mpg123 does yield a tiny improvement over the hacked 0.65, even.

altwarentester@duron:~/mpg123$ perf stat ./src/mpg123 -w /dev/null \
 --cpu i386  /dev/shm/convergence_-_points_of_view/*.mp3

   35030.997934  task-clock-msecs         #      0.996 CPUs
            225  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            333  page-faults              #      0.000 M/sec
    27983581759  cycles                   #    798.823 M/sec
    33233941274  instructions             #      1.188 IPC
    10325844134  cache-references         #    294.763 M/sec
       11774779  cache-misses             #      0.336 M/sec
   
   35.176724921  seconds time elapsed

altwarentester@duron:~/mpg123-0.65$ perf stat ./src/mpg123 -w /dev/null \
 /dev/shm/convergence_-_points_of_view/*.mp3

   35215.369977  task-clock-msecs         #      0.995 CPUs
            226  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            314  page-faults              #      0.000 M/sec
    28126526481  cycles                   #    798.700 M/sec
    32624128394  instructions             #      1.160 IPC
    10335841084  cache-references         #    293.504 M/sec
       12821386  cache-misses             #      0.364 M/sec
   
   35.387834336  seconds time elapsed

Too bad that this doesn't fit the current framework of separated clipping and conversion too well. Seems like, at least for some CPUs (I really should wire up the i486, or even the i386...), this separation hurts. But then: We are talking about the i386 C code, which is not used by default from old Pentiums onwards. This is rather academic. The performance of the plain C code doesn't matter at all in practice, when there's nowadays a variant of Taihei's SSE code in use.

In MPlayer, this change improves the situation, but it is still 39 seconds against 32 seconds:

perf stat ./mplayer -ac mp3 -ao pcm:file=/dev/null -quiet \
 /dev/shm/convergence_-_points_of_view/*.mp3

   32049.354400  task-clock-msecs         #      0.994 CPUs 
            249  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6293  page-faults              #      0.000 M/sec
    25589163911  cycles                   #    798.430 M/sec
    28424067520  instructions             #      1.111 IPC  
     9060506927  cache-references         #    282.705 M/sec
       12323035  cache-misses             #      0.385 M/sec

   32.251952962  seconds time elapsed

perf stat ./mplayer -ac mpg123 -ao pcm:file=/dev/null -quiet \
 /dev/shm/convergence_-_points_of_view/*.mp3

   39542.041678  task-clock-msecs         #      0.994 CPUs 
            298  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6790  page-faults              #      0.000 M/sec
    31575386518  cycles                   #    798.527 M/sec
    34341985738  instructions             #      1.088 IPC  
    10817342380  cache-references         #    273.566 M/sec
       17943896  cache-misses             #      0.454 M/sec

   39.787800467  seconds time elapsed

Why does GCC hate my code?

But how does the picture of CPU usage per function look now? With mp3lib:

    36.40%      mplayer  ./mplayer                            [.] synth_1to1
    28.16%      mplayer  ./mplayer                            [.] III_dequantize_sample
    10.59%      mplayer  ./mplayer                            [.] dct64_1
     9.05%      mplayer  ./mplayer                            [.] dct36
     4.23%      mplayer  ./mplayer                            [.] T.293
     3.92%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] memcpy

With libmpg123:

    40.43%      mplayer  ./mplayer                            [.] INT123_synth_1to1_i386
    22.31%      mplayer  ./mplayer                            [.] III_dequantize_sample
     7.68%      mplayer  ./mplayer                            [.] INT123_dct36
     6.73%      mplayer  ./mplayer                            [.] INT123_dct64_i386
     6.07%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] memcpy
     3.51%      mplayer  ./mplayer                            [.] INT123_synth_ntom
     3.35%      mplayer  ./mplayer                            [.] INT123_do_layer3
     1.25%      mplayer  ./mplayer                            [.] feed_read

That's still quite some overhang for the synth function itself. And even if it's totally irrelevant in practice... I want to see how I messed up the synth function for Durons.

A quick look on the Core2Duo in 32 bit chroot:

An mpg123 binary built with '-march=pentium3' and mpg123 default optimization shows perhaps some difference regarding the WRITE_SAMPLE change. using i386 decoder. I might be convinced to give it 2 % advantage. It's harder to tell a difference when using the generic decoder, but there might be one. When building for -march=pentium, both differences are around 1 to 1.5 %. One would need some statistics for more definite answer, but the main point is: The effect, if any, is much less pronounced on the Core2Duo machine. Might be a reason for me not spotting anything wrong. The message behind this, of course, is that tuning things for the Duron should not hurt performance on newer machines (which use SSE optimization anyway, d'oh).

A root cause

of the slowness seems to be some inefficient translation of WRITE_SAMPLE resulting in floating point stack messiness. Seems like the summand for the fadd is kept on the fp stack in mp3lib, but is grabbed from main stack repeatedly for mpg123. One indicator is the appearance of the fxch instruction in mpg123's synth_1to1 function. It is totally absent in mp3lib's version.

Now the question is how I can rearrange to code to avoid this. I am working on the 0.65 source to stay closer to mp3lib.

A cause to the cause?

Yes, there might be a simple explanation: The GCC optimizer is pushed along different ways because the mpg123 code is built to separate object files, with public functions, while mp3lib pushes everything into a single object, with stuff like synth_1to1 being all-static. That eases optimzation of (floating point) stack usage.

An interesting side note is that for the mpg123 code, the floating point stack messup seems to be pushed by enabling optimization at all.

But anyway, repackaging the stuff to have the direct calls to the synth functions together with the (then static) functions might do the trick. For modern mpg123 this would mean packaging all synth functions and layer[123].c files together with optimise.c, perhaps frame.c, into one compilation unit. An indicator could be this: If there are public symbols that are not part of the API in a static libmpg123, those symbols should have been static and not public.

Of course, performance is only one possible reason to make these functions static: It is bad library design to have those symbols floating around and possibly colliding when someone uses the static library.

I tried this on the Core2Duo and, if anything, this makes things perhaps slightly worse. On the Duron the tendency to worsen the performance is also indicated. At least the more static build is about 0.5 K smaller. The effect of adapting Dan's patch for const huffman tables is bigger (both in performance and binary size). But still, this is not really big. I need to get to the core of the bad WRITE_SAMPLE translation. How does the compiler get the hint from MPlayer code? And... how about building with the Intel or open64 compilers for a change? Well, one bummer is that I have those installed for 64 bits, not 32.

On the Duron, the effect of the const tables is also quite small.

Revisit after some time...

Just make sure I still know what I am talking about, mpg123-0.65-like-mplayer r3005, my old MPlayer checkout, both with i386 C code. Perf says for MPlayer:

   32331.247743  task-clock-msecs         #      0.994 CPUs 
            371  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6301  page-faults              #      0.000 M/sec
    25814521225  cycles                   #    798.439 M/sec
    28424759003  instructions             #      1.101 IPC  
     9068719643  cache-references         #    280.494 M/sec
       13063911  cache-misses             #      0.404 M/sec

   32.523356972  seconds time elapsed

# Samples: 270005
#
# Overhead      Command                        Shared Object  Symbol
# ........  ...........  ...................................  ......
#
    35.82%      mplayer  ./mplayer                            [.] synth_1to1
    28.14%      mplayer  ./mplayer                            [.] III_dequantize_sample
    10.57%      mplayer  ./mplayer                            [.] dct64_1
     9.18%      mplayer  ./mplayer                            [.] dct36
     4.19%      mplayer  ./mplayer                            [.] T.293
     4.07%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] memcpy
     0.56%      mplayer  ./mplayer                            [.] fast_memcpy

And for mpg123:

   34845.338614  task-clock-msecs         #      0.996 CPUs
            377  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            245  page-faults              #      0.000 M/sec
    27835529165  cycles                   #    798.831 M/sec
    32659069064  instructions             #      1.173 IPC
    10169107285  cache-references         #    291.836 M/sec
        8112060  cache-misses             #      0.233 M/sec
   
   34.982723275  seconds time elapsed

# Samples: 292093
#
# Overhead      Command                        Shared Object  Symbol
# ........  ...........  ...................................  ......
#
    51.58%       mpg123  ./src/mpg123                         [.] synth_1to1
    24.78%       mpg123  ./src/mpg123                         [.] III_dequantize_sample
     8.20%       mpg123  ./src/mpg123                         [.] dct64
     7.84%       mpg123  ./src/mpg123                         [.] dct36
     3.73%       mpg123  ./src/mpg123                         [.] do_layer3
     0.52%       mpg123  [kernel]                             [k] __copy_to_user_ll
     0.31%       mpg123  ./src/mpg123                         [.] dct12
     0.25%       mpg123  ./src/mpg123                         [.] III_get_scale_factors_1
     0.17%       mpg123  /lib/i686/cmov/libc-2.11.2.so        [.] __mempcpy

It is still apparent that synth of mpg123 needs more time. 51.6% of 35 seconds, that is, 18 seconds. For MPlayer, that's 35.8% of 32.5 secods. 11.6 seconds. Let's summarize that.

function	time MPlayer / s	time mpg123 / s
synth_1to1	11.64	18.04
III_dequantize_sample	9.15	8.67
dct64	3.44	2.87
dct36	2.99	2.74
T.293 / do_layer3	1.36	1.30

That does look interesting. If only the synth function were faster, mpg123 would easily beat mp3lib. All other main functions are a tad quicker, even, than the mp3lib counterparts! The perf annotate ist not too helpful in placing the blame on certain code portions, but I know already what the main difference is. Or do I? The annotations show huge runtime hit for the faddp instruction in mpg123, where it doesn't really count in MPlayer, at the sample places.

The output of perf annotate is confusing. The association of code and assembly seems rather whacky. But first, what is that?

     :          for (j=16;j;j--,b0+=0x10,window+=0x20,samples+=step)
0.14 :       806ca36:       83 c3 40                add    $0x40,%ebx
0.64 :       806ca39:       83 c2 40                add    $0x40,%edx
0.01 :       806ca3c:       83 e9 80                sub    $0xffffff80,%ecx
0.22 :       806ca3f:       83 c6 04                add    $0x4,%esi
1.11 :       806ca42:       81 fb 00 04 00 00       cmp    $0x400,%ebx
0.00 :       806ca48:       0f 84 22 01 00 00       je     806cb70 <synth_1t
0.10 :       806ca4e:       dd d8                   fstp   %st(0)
0.83 :       806ca50:       dd d8                   fstp   %st(0)
0.12 :       806ca52:       dd d8                   fstp   %st(0)
0.02 :       806ca54:       dd d8                   fstp   %st(0)
0.47 :       806ca56:       dd d8                   fstp   %st(0)
0.09 :       806ca58:       dd d8                   fstp   %st(0)

Such floating point stack messing around is not to be found, in MPlayer. Then, the detail:

     :            sum += window[0x4] * b0[0x4];
1.18 :       82478f2:       d9 42 20                flds   0x20(%edx)
0.60 :       82478f5:       d8 49 20                fmuls  0x20(%ecx)
0.11 :       82478f8:       de c1                   faddp  %st,%st(1)
     :            sum -= window[0x5] * b0[0x5];
0.71 :       82478fa:       d9 42 28                flds   0x28(%edx)
0.46 :       82478fd:       d8 49 28                fmuls  0x28(%ecx)
0.17 :       8247900:       de c1                   faddp  %st,%st(1)

And a similar portion from mpg123:

         :            sum += window[0x4] * b0[0x4];
    0.36 :       806caca:       d9 44 24 34             flds   0x34(%esp)
    0.00 :       806cace:       d8 49 20                fmuls  0x20(%ecx)
    1.90 :       806cad1:       de c7                   faddp  %st,%st(7)
         :            sum -= window[0x5] * b0[0x5];
    0.35 :       806cad3:       d9 44 24 2c             flds   0x2c(%esp)
    0.00 :       806cad7:       d8 49 28                fmuls  0x28(%ecx)
    2.09 :       806cada:       de c7                   faddp  %st,%st(7)

The exact attribution to one instruction might be tough here, but it is apparent that very similar pieces of code perform consistently different. The location on the floating point stack is different. And, hm, why are array offsets different? This pattern is carried through: The MPlayer code has matching offsets on the load and multiplication operations, especially matching the array offsets in the C code. In mpg123, that is different.

Comparing the code for the two synth functions,

I see only one difference at all: The possible call of the equalizing function. Hm, it might be worth a try to remove that bit of code. It introduces a branch... and inserts code for a possible function call. But no, that doesn't make a difference. I'm relieved.

Now, that leaves the strange decisions of the compiler on the table. It does weird stuff with the floating point sums. Generally, the floating point code is messy. What's the magic compiler switch to change that?

I added some disassembly and the corresponding source to SVN, the public may have a peek at http://mpg123.orgis.org/cgi-bin/scm/mpg123/notes/beating_mp3lib_in_mplayer_material/.

Separating out the synth function in the MPlayer code into its own object file (not static, obviously) yields binary code very similar to what gcc produces for mpg123.

A proposal for different clipping

from Ivan didn't do much good on the Duron, actually worsened it a bit. The idea was to save one if branch by addition and bitmask comparison, but this is not guaranteed to be faster... Anyhow, this is tangential to the question why the MPlayer build works better.

Back to the assembly line

, my gaze wanders towards the differences outlined above again, and I realize that the array index strangeness might have to do with the fact that there is %esp involved in mpg123. The function stack versus a pointer in a separate register. Oh, also, I see a trigger for the bad code: Removing the call to DCT64 improves things, not entirely surprisingly. This is a piece of diff between the separated-out synth function without and with call to non-static dct64:

119,121c170,172
<   93: d9 41 20                flds   0x20(%ecx)
<   96: d8 4a 20                fmuls  0x20(%edx)
<   99: de c2                   faddp  %st,%st(2)
---
>  109: d9 44 24 28             flds   0x28(%esp)
>  10d: d8 4a 20                fmuls  0x20(%edx)
>  110: de c7                   faddp  %st,%st(7)
123,125c174,176
<   9b: d9 41 28                flds   0x28(%ecx)
<   9e: d8 4a 28                fmuls  0x28(%edx)
<   a1: de c2                   faddp  %st,%st(2)
---
>  112: d9 44 24 20             flds   0x20(%esp)
>  116: d8 4a 28                fmuls  0x28(%edx)
>  119: de c7                   faddp  %st,%st(7)

Now, as I said, a function call messing with stack/register handling is not entirely spurprising. But, still, I should wrap my head around what exactly happens here. Especially, since including static dct64 didn't seem to help mpg123. But what does it do here? Let's try.

Gah! Static! I forgot that there are static variables in the synth function (there aren't anymore for current mpg123). I simply defined the static keyword to nothing for the separated object, only thinking about the static function, not static variables. Does the situation change when I just remove the static keyword from the function?

156,158c223,225
<   b2: d9 42 20                flds   0x20(%edx)
<   b5: d8 49 20                fmuls  0x20(%ecx)
<   b8: de c1                   faddp  %st,%st(1)
---
>  144: d9 44 24 34             flds   0x34(%esp)
>  148: d8 49 20                fmuls  0x20(%ecx)
>  14b: de c7                   faddp  %st,%st(7)
160,162c227,229
<   ba: d9 42 28                flds   0x28(%edx)
<   bd: d8 49 28                fmuls  0x28(%ecx)
<   c0: de c1                   faddp  %st,%st(1)
---
>  14d: d9 44 24 2c             flds   0x2c(%esp)
>  151: d8 49 28                fmuls  0x28(%ecx)
>  154: de c7                   faddp  %st,%st(7)

OK, that's still the same, basically. Only the stack location changes. Now, how does the stuff land on the stack anyways?

An indication of victory

is that, after simply disabling the call to dct64 (and, thusly, producing rubbish) both in mp3lib and mpg123, the mpg123 binary beats MPlayer with 25 agains 28 seconds decoding time. This function call messes up the code optimization for mpg123 and somehow does not for mp3lib.

This is creepy. I hacked around with the mp3lib code for a while, making functions non-static, cleaning away distracting parts... and then this: The optional optimized synth function is disabled at runtime by my hacking, but the call was still present in the synth.

  if ( synth_func )
   {
//    printf("Calling %p, bandPtr=%p channel=%d samples=%p\n",synth_func,bandPtr,channel,samples);
    // FIXME: synth_func() may destroy EBP, don't rely on stack contents!!!
    return (*synth_func)( bandPtr,channel,samples);
   }

Now, if I remove that function call,

  if ( synth_func )
   {
//    printf("Calling %p, bandPtr=%p channel=%d samples=%p\n",synth_func,bandPtr,channel,samples);
    // FIXME: synth_func() may destroy EBP, don't rely on stack contents!!!
    /* return (*synth_func)( bandPtr,channel,samples); */
   }

I suddenly get my bad code delivered! Let's check the actual performance... Yup, with the call present (but inactive), we have the usual 32 seconds, with the call removed, it's 40 seconds.

OK... so I see now how I could fix the generic decoding to be better. Via some nasty trick I might convince gcc to produce the correct output for mpg123; or, I even capture the good assembly and have the i386 code as assembly optimization, which is a bit crazy, given that there is the i586 optimization anyway.

But first, let's verify the concept, I added this little fake function someplace else than decode_i386.c (so that the symbol exists and the function call code actually is generated for lack of link-time inlining).

int synth_1to1_fake(real *bandPtr,int channel,unsigned char *out,int *pnt)
{
	return 0;
}

Then, the bogus fake call gets added to the synth.

int moody_switch = 0;
DECODE_SCOPE int synth_1to1(real *bandPtr,int channel,unsigned char *out,int *pnt)
{
#ifndef PENTIUM_OPT
  static real buffs[2][2][0x110];
  static const int step = 2;
  static int bo = 1;
  short *samples = (short *) (out + *pnt);

  real *b0,(*buf)[0x110];
  int clip = 0; 
  int bo1;

  *pnt += 128; /* moved that here as MPlayer has it */
#endif

  if(moody_switch)
    return synth_1to1_fake(bandPtr, channel, out, pnt);

Well, the result...

   28522.600345  task-clock-msecs         #      0.996 CPUs 
            302  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            246  page-faults              #      0.000 M/sec
    22784727766  cycles                   #    798.831 M/sec
    27729388895  instructions             #      1.217 IPC  
     9218523630  cache-references         #    323.201 M/sec
        7898355  cache-misses             #      0.277 M/sec

   28.650573967  seconds time elapsed

Nifty, eh? Now I have a grip on how to hack utmost speed out of the generic C code on x86 with gcc. That is an interesting point, but it is not the main matter of this investigation. Let's look at what the binding of libmpg123 to mplayer does to hurt performance by 30%.

Pack your things and go, mp3lib, you're owned! Well, owned by plain mpg123, that is. Libmpg123 still has some way to go.

Back to the Binding

Without the hands tied behind its back, mpg123 is faster than MPlayer (using 3DNowExt optimizations). But the binding of libmpg123 to MPlayer brings great suffering. Let's investigate that.

MPlayer's mp3lib has such a profile,

# Samples: 217361
#
# Overhead      Command                        Shared Object  Symbol
# ........  ...........  ...................................  ......
#
    34.08%      mplayer  ./mplayer                            [.] III_dequantize_sample
    17.89%      mplayer  ./mplayer                            [.] synth_1to1_MMX
    16.40%      mplayer  ./mplayer                            [.] dct36_3dnowex
    11.61%      mplayer  ./mplayer                            [.] dct64_MMX_3dnowex
     5.27%      mplayer  ./mplayer                            [.] T.293
     4.85%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] memcpy

while with libmpg123 usage, it is

# Samples: 248096
#
# Overhead      Command                        Shared Object  Symbol
# ........  ...........  ...................................  ......
#
    30.14%      mplayer  ./mplayer                            [.] III_dequantize_sample
    26.41%      mplayer  ./mplayer                            [.] 0x000000009470c1
     9.41%      mplayer  ./mplayer                            [.] 0x00000000950941
     8.04%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] memcpy
     7.30%      mplayer  ./mplayer                            [.] INT123_synth_ntom
     4.62%      mplayer  ./mplayer                            [.] INT123_do_layer3
     1.69%      mplayer  ./mplayer                            [.] feed_read
     1.15%      mplayer  ./mplayer                            [.] INT123_dct64
     0.97%      mplayer  ./mplayer                            [.] INT123_synth_1to1_3dnowext

Wait. What?! Ntom? And the actual synth with 0.97%? That is fishy. The figures may be right, but the names aren't. Perhaps the static link messes things up a bit (but why should it?). Let's link to the system libmpg123.

# Samples: 225783
#
# Overhead      Command                        Shared Object  Symbol
# ........  ...........  ...................................  ......
#
    82.32%      mplayer  /usr/lib/libmpg123.so.0.25.1         [.] 0x00000000028247
     7.92%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] memcpy
     1.00%      mplayer  /usr/lib/libmpg123.so.0.25.1         [.] 0x000000000147c4
     0.65%      mplayer  [kernel]                             [k] __copy_to_user_ll
     0.53%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] 0x000000000705dd
     0.31%      mplayer  ./mplayer                            [.] avpriv_ac3_parse_header
     0.31%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] __mempcpy
     0.28%      mplayer  ./mplayer                            [.] ac3_eac3_probe
     0.23%      mplayer  ./mplayer                            [.] demux_audio_fill_buffer
     0.19%      mplayer  /usr/lib/libmpg123.so.0.25.1         [.] mpg123_decode

Whut? Get out already. What is that? Ok, fresh build, dynamic link to local libmpg123.

# Samples: 245675
#
# Overhead      Command                 Shared Object  Symbol
# ........  ...........  ............................  ......
#
    36.09%      mplayer  .libs/libmpg123.so.0.33.1     [.] 0x00000000043685
    31.83%      mplayer  .libs/libmpg123.so.0.33.1     [.] III_dequantize_sample
     8.23%      mplayer  /lib/i686/cmov/libc-2.11.2.so [.] memcpy
     4.86%      mplayer  .libs/libmpg123.so.0.33.1     [.] INT123_synth_ntom
     4.65%      mplayer  .libs/libmpg123.so.0.33.1     [.] INT123_do_layer3
     1.67%      mplayer  .libs/libmpg123.so.0.33.1     [.] feed_read
     1.25%      mplayer  .libs/libmpg123.so.0.33.1     [.] INT123_dct64
     1.06%      mplayer  .libs/libmpg123.so.0.33.1     [.] INT123_synth_1to1_3dnowext

Wtf? First: This needs more time than system mpg123, second: Those symbols can't be right.

Checking this... Heck, the NtoM synth is called! And, know what? It's because of a bug in the libmpg123 parser in feeder mode (I really should not have added this contraption): It did not seek back 4 bytes if the ahead check past the first frame does not work. This pecuilar bug is triggered because MPlayer also parses the MPEG data and manages to give exactly one frame of data at the beginning.

Now, for some more sensible numbers and names...

   27269.086270  task-clock-msecs         #      0.993 CPUs 
            322  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6858  page-faults              #      0.000 M/sec
    21765861127  cycles                   #    798.188 M/sec
    20944822745  instructions             #      0.962 IPC  
     8198643751  cache-references         #    300.657 M/sec
       16856742  cache-misses             #      0.618 M/sec

   27.465041648  seconds time elapsed

# Samples: 228283
#
# Overhead      Command                  Shared Object  Symbol
# ........  ...........  .............................  ......
#
    34.38%      mplayer  .libs/libmpg123.so.0.33.1      [.] III_dequantize_sample
    30.03%      mplayer  .libs/libmpg123.so.0.33.1      [.] 0x00000000043e7d
    10.94%      mplayer  .libs/libmpg123.so.0.33.1      [.] 0x00000000042e2d
     7.84%      mplayer  /lib/i686/cmov/libc-2.11.2.so  [.] memcpy
     4.97%      mplayer  .libs/libmpg123.so.0.33.1      [.] INT123_do_layer3
     1.23%      mplayer  .libs/libmpg123.so.0.33.1      [.] INT123_synth_1to1_3dnowext
     0.66%      mplayer  .libs/libmpg123.so.0.33.1      [.] III_get_scale_factors_1
     0.63%      mplayer  [kernel]                       [k] __copy_to_user_ll

Now that looks better, but only covers a bug I must have introduced recently (yes, revision 2935 must've been it). Those 27.5 seconds are still rather bad compared to plain mpg123:

   24365.715653  task-clock-msecs         #      0.991 CPUs 
            372  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           7600  page-faults              #      0.000 M/sec
    19462031681  cycles                   #    798.747 M/sec
    20804412868  instructions             #      1.069 IPC  
     8193661424  cache-references         #    336.278 M/sec
       13213580  cache-misses             #      0.542 M/sec

   24.576109613  seconds time elapsed

# Samples: 200099
#
# Overhead      Command                  Shared Object  Symbol
# ........  ...........  .............................  ......
#
    46.39%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] 0x000000000413ac
    38.52%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] III_dequantize_sample
     5.81%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] INT123_do_layer3
     1.45%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] INT123_synth_1to1_3dnowext
     0.78%    lt-mpg123  [kernel]                       [k] __copy_to_user_ll
     0.74%    lt-mpg123  /lib/i686/cmov/libc-2.11.2.so  [.] __mempcpy
     0.66%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] III_get_scale_factors_1

I must assume that this first block encompasses all synth functions combined (asm aynth and the DCTs).

How is the absolute comparison to mp3lib, per function? First the raw data again...

   26190.932710  task-clock-msecs         #      0.993 CPUs 
            333  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           6307  page-faults              #      0.000 M/sec
    20908670446  cycles                   #    798.317 M/sec
    20816640224  instructions             #      0.996 IPC  
     7871360378  cache-references         #    300.538 M/sec
       11813272  cache-misses             #      0.451 M/sec

   26.370606238  seconds time elapsed

# Samples: 217594
#
# Overhead      Command                        Shared Object  Symbol
# ........  ...........  ...................................  ......
#
    34.25%      mplayer  ./mplayer                            [.] III_dequantize_sample
    17.89%      mplayer  ./mplayer                            [.] synth_1to1_MMX
    16.45%      mplayer  ./mplayer                            [.] dct36_3dnowex
    11.35%      mplayer  ./mplayer                            [.] dct64_MMX_3dnowex
     5.32%      mplayer  ./mplayer                            [.] T.293
     4.83%      mplayer  /lib/i686/cmov/libc-2.11.2.so        [.] memcpy
     0.65%      mplayer  [kernel]                             [k] __copy_to_user_ll
     0.59%      mplayer  ./mplayer                            [.] synth_1to1

As a bonus, I compare also with good old mpg123 0.65 (the hacked i386 code). It is eerily close.

   28434.455457  task-clock-msecs         #      0.996 CPUs 
            298  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
            246  page-faults              #      0.000 M/sec
    22713949581  cycles                   #    798.818 M/sec
    27729289453  instructions             #      1.221 IPC  
     9204425341  cache-references         #    323.707 M/sec
        7354561  cache-misses             #      0.259 M/sec

   28.561796156  seconds time elapsed

# Samples: 234591
#
# Overhead      Command                  Shared Object  Symbol
# ........  ...........  .............................  ......
#
    40.85%       mpg123  ./src/mpg123                   [.] synth_1to1
    30.31%       mpg123  ./src/mpg123                   [.] III_dequantize_sample
    10.18%       mpg123  ./src/mpg123                   [.] dct64
     9.77%       mpg123  ./src/mpg123                   [.] dct36
     4.62%       mpg123  ./src/mpg123                   [.] do_layer3
     0.64%       mpg123  [kernel]                       [k] __copy_to_user_ll
     0.34%       mpg123  ./src/mpg123                   [.] dct12
     0.32%       mpg123  ./src/mpg123                   [.] III_get_scale_factors_1
     0.23%       mpg123  /lib/i686/cmov/libc-2.11.2.so  [.] __mempcpy

I get just that bit of time for all that assembly code? SIMD and all?

For mpg123, the individual asm functions cannot be separated for some reason, so let's combine the numbers:

	time	time	time	time
	MPlayer	MPlayer	mpg123	mpg123
function	+mp3lib / s	+libmpg123 / s	trunk / s	0.65-i386 / s
synth & DCT	12.20	11.59	11.75	17.36
III_dequantize_sample	9.03	9.44	9.46	8.66
T.293 / do_layer3	1.40	1.36	1.43	1.31
memcpy+to_user	1.31	2.33	0.37	0.25
sum of above	23.94	24.72	23.01	27.58
diff to mpg123	0.93	1.71	0	4.57
total runtime diff	1.79	2.89	0	3.98

So, I can explain $59\%$ of the difference to mpg123 from these main contributors to computing time. And it is rather clear where that comes from: memcpy. The same is even true for mp3lib, there, memcpy also plays a larger role.

A more subtle point is to be made about the assembly synths in mpg123 being a bit faster (yay, despite criticism about inefficient stack handling;-), while, again, MPlayer wins on the efficiency of GCC-generated code. They managed to befriend to optimizer somehow.

Since I like to explore to the sidess before going for the bulls eye, I am going to have a look at what exactly is different in the sample dequantization. This comparison is messy. Looking at that huge slab of code with CPU time distributed all over the place, but somewhat prominent on code portions that are rather identical between mp3lib and mpg123, I think it is not worth it no nail down the optimizer here. That half of a second can come and go.

Well, but what could be worth it is to investigate what the heck happened since mpg123 0.65 here. I am comparing to the plain C code for i386 (well, boosted a bit by the recent improvements / hack for GCC optimizer)! I see the 3DNowExt assembly shaving off 5.61 seconds from the synth stuff, but here, 0.8 seconds are lost in the dequantization again. Eh, I have to take a step back. I just glossed over half a second compared to mp3lib, I can ignore 0.8 seconds, too. Really, I have seen what GCC is capable of with the rather simple synth code. I will not start a battle about 0.8 seconds in III_dequantize_sample. I need some lifetime left. Let's be happy that in the sum, 14% of decoding performance is gained from 3DNowExt.

What spooks me here, is that the plain 3DNow code makes mpg123 even slower than the beefed-up generic code:

   29208.504107  task-clock-msecs         #      0.995 CPUs 
            389  context-switches         #      0.000 M/sec
              0  CPU-migrations           #      0.000 M/sec
           4151  page-faults              #      0.000 M/sec
    23331202619  cycles                   #    798.781 M/sec
    26197929267  instructions             #      1.123 IPC  
     9571621109  cache-references         #    327.700 M/sec
       13034794  cache-misses             #      0.446 M/sec

   29.341045891  seconds time elapsed

# Samples: 248648
#
# Overhead      Command                  Shared Object  Symbol
# ........  ...........  .............................  ......
#
    31.26%    lt-mpg123  ./libs/libmpg123.so.0.33.1     [.] III_dequantize_sample
    30.25%    lt-mpg123  ./libs/libmpg123.so.0.33.1     [.] 0x00000000041684
    26.01%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] 0x00000000042611
     4.54%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] INT123_do_layer3
     0.89%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] INT123_synth_1to1_3dnow
     0.73%    lt-mpg123  /lib/i686/cmov/libc-2.11.2.so  [.] __mempcpy
     0.64%    lt-mpg123  [kernel]                       [k] __copy_to_user_ll
     0.64%    lt-mpg123  .libs/libmpg123.so.0.33.1      [.] III_get_scale_factors_1

That makes 9.17 seconds for dequantization, 16.77 seconds for combined synth, that is a bit more and a bit less than the i386 C code. All in all, one can say that any benefit from the assembly marginal, and, depending on the mood of the compiler, not worth anything. I feel that I could go down that road, a long way down, down into the abyss, but let's avoid that. The fastest variant on this machine is what should be the fastest. Let's make MPlayer+libmpg123 live up to that.

I diagnosed some memcpy poisoning,

so let's start with switching around I/O variants to avoid memcpy. I have something prepared there. Some numbers for the differing variants (switches for packet-based input, plain demuxer reads, callback, feeder, callback with seek buffer). The memcpy number is only the actual count for memcpy() now, not this kernel-to-userspace stuff (isn't that just that I am reading from /dev/shm?).

I/O variant	packet (memcpy)	demux_read (memcpy)
feeder	27.50 s (2.14 s)	27.54 s (2.28 s)
callback	27.35 s (1.98 s)	27.58 s (2.02 s)
callback+seekbuffer	27.79 s (2.31 s)	27.52 s (2.33 s)

Big difference I see not, though some there is. The most backwards-compatible and also parsingly flexible variant is second only to the plain callback variant that cannot peek ahead in streams. I am willing to pay these 0.15 seconds for the stream parsing robustness and the fact that this API works with any release of libmpg123. Of course, the more I think of it, those 0.5% aren't nothing. Sheesh. Oh, and don't forget that the numbers have some leeway. An earlier measurement the packet feeder took 27.47 seconds. Well, rounding to 1/10th seconds would be reasonable, when jumping to final conclusions. As I am groping for any hint here, it should be fine to also look one digit further for the wobbly fine print.

Focus on the bigger picture: There is more to it than the input buffering with libmpg123. Or, hm... have a look at the call graph for the packet feeder, compared to plain mpg123 and mp3lib.

So, the rare memcpy calls in plain mpg123 are rooted in the WAV writing, not decoding. With mp3lib, the call graph doesn't give sensible information as to where the calls orginate. 99.60% from (nil). Great. What is always apparent, is the presence of those fast_memcpy() calls. They amount to 0.46% instead compared to 4.84% for the libc one. They don't occur in the call graph for libmpg123 usage from MPlayer. There, at least, is a sensible trace of the plain memcpy() calls:

     8.02%  mplayer-feeder-  /lib/i686/cmov/libc-2.11.2.so                                        [.] memcpy
                |          
                |--58.67%-- 0x10000
                |          
                |--41.15%-- decode_a_bit
                 --0.18%-- [...]

At least half of them can be accounted for. Not surprisingly, that's the feeding of data via mpg123_decode(). Half of the good two seconds spent in memcpy would already suffice to get on par with mp3lib...

But, since there is not one second difference from the feeder to the plain callback ... I guess demux_read_data() also employs simple memcpy for the same task. Hm, so... Wait! I wanted to look into that anyway, and I see confirmation in the call graph for the callback variant: There, it's mpg123_read() and decode_audio(). Yes, we come to the point where I will ponder variants to get the audio data out of libmpg123.

Well, there is mpg123_replace_reader() that should help to prevent copies, but after hacking that in, I reduced the memcpy usage, all right, but the total time didn't go down. Compare the callback variant with packets: 27.65 seconds runtime, of that 1.24 seconds memcpy(). This should be half a second faster, but it's third of a second slower!

I get that down to 27.52 s by lying about buffer sizes... that really should match what mp3lib does (directly decoding to a 4608-byte caller-provided buffer), but it doesn't. I got 9.49 s for the dequantization, 12.32 s for the synth stuff, 1.41 for do_layer and 1.24 s for memcpy. So there is no net gain at all, only more messy code.

This very much looks like a dead end here. Let's do a comparison with linking to libmpg123 statically. Interesting, that gets it down to 26.80 s. That's about 0.4 seconds more than mp3lib.

Switching back to plain normal feeder mode with packets produces 27.65 s. Activating framewise work again... that's 27.03 s. The whole deal, callback, packets, seekbuffer, framewise, is at 27.01 s. Am I being ridiculous? Well, it's close.

Well, working around memcpy is not the total breakthrough... I still have the option to use a faster memcpy in libmpg123. What about copying as 32 bit integers for a simple start? Nah, that naive approach does not improve things. So glibc memcpy is not that dumb.

Closest I can reach is 26.69 s via callback and framewise, without packets, seekbuffer and mean bitrate calculation.

On the K6-III+

I finally an claim victory for framewise decoding with a static libmpg123. It's indeed a bit faster than mp3lib. Not so on the Duron (also with -march=athlon), there, libmpg123 can only come close, but never overtake. But this is with this certain setup / tool chain. Who knows what gcc 4.6 would do?

Native binaries on Duron and K6-III+ ... and my Laptop Core2Duo at 800 MHz, 64 bit mode (on the K6 it's user time measured by the time command, the others via perf).

build Duron K6-III+ Core 2

variant1-dynamic (1.12.1) 27.74 s 57.46 s

variant1-dynamic (1.13.4) 16.72 s

variant1-dynamic (trunk) 27.81 s 57.05 s 16.87 s

variant2-dynamic 27.10 s 55.92 s 16.54 s

variant2-static 27.01 s 53.92 s 16.51 s

mp3lib 26.45 s 56.47 s 18.56 s

ffmp3 73.52 s 115.08 s 22.00 s

ffmp3float 30.74 s 118.19 s 12.26 s

variant2-float 18.76 s

So, the comparisong between mp3lib and libmpg123 does look very good on the K6-III+, not so good on the Duron. Would be interesting to have the comparison with Ivan's Athlon. On the Core2Duo there is not discussion because of the better SSE code anyway. Well, apart from the strange numbers for ffmp3float, but I'm getting to that later.

Here, the net result is that I can be 4.5% faster on the K6-III+ and down to 2.1% slower on the Duron (from 5.1%). Ivan had 8% loss (user time, 53.4 s agains 49.4 s). Oh, and the final comparison on the K6-III+ using mplayer's own benchmark mechanism:

altwarentester@k6-3:~/mplayer$ for i in mpg123 mp3; do ./mplayer-variant2-static -loop 5 \
  -ao pcm:fast:file=/dev/null -benchmark -quiet -ac mp3 \
  /dev/shm/convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3 \
  2>&1 | grep BENCH; echo "$i ^^^^^"; done
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.653s Sys:   0.051s =    6.704s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2399% Sys:  0.7601% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.659s Sys:   0.053s =    6.712s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2089% Sys:  0.7911% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.657s Sys:   0.052s =    6.708s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2291% Sys:  0.7709% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.685s Sys:   0.052s =    6.737s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2267% Sys:  0.7733% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.653s Sys:   0.050s =    6.703s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2524% Sys:  0.7476% = 100.0000%
mpg123 ^^^^^
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.600s Sys:   0.055s =    6.655s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.1722% Sys:  0.8278% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.592s Sys:   0.048s =    6.640s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2755% Sys:  0.7245% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.600s Sys:   0.051s =    6.650s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2378% Sys:  0.7622% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.604s Sys:   0.052s =    6.656s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2248% Sys:  0.7752% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.596s Sys:   0.050s =    6.646s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.2493% Sys:  0.7507% = 100.0000%
mp3 ^^^^^

Damn, that's a mean user time of 6.66 s against 6.59 s for mp3lib. How subtle is that? What is the equivalent timing of the same track 10 times using my method?

altwarentester@k6-3:~/mplayer$ for i in mpg123 mp3
  do time ./mplayer-variant2-static -loop 5 -ao pcm:fast:file=/dev/null \
  -quiet -ac mp3 \
  /dev/shm/convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3 \
  2>&1 > /dev/null; echo "$i ^^^^^"; done

real    0m34.197s
user    0m33.498s
sys     0m0.636s
mpg123 ^^^^^

real    0m34.145s
user    0m33.446s
sys     0m0.656s
mp3 ^^^^^

Oh, funny. See what I did there? One should actually switch decoders. But still, interesting to see that I was about to diagnose a performance difference of 1% between mp3lib and itself. This is the proper measurement:

altwarentester@k6-3:~/mplayer$ for i in mpg123 mp3; do ./mplayer-variant2-static\
  -loop 5 -ao pcm:fast:file=/dev/null -benchmark -quiet -ac $i \
  /dev/shm/convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3 \
  2>&1 | grep BENCH; echo "$i ^^^^^"; done
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.393s Sys:   0.067s =    6.460s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 98.9636% Sys:  1.0364% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.314s Sys:   0.061s =    6.375s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.0432% Sys:  0.9568% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.327s Sys:   0.067s =    6.394s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 98.9583% Sys:  1.0417% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.316s Sys:   0.060s =    6.376s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.0572% Sys:  0.9428% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.323s Sys:   0.064s =    6.387s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.0008% Sys:  0.9992% = 100.0000%
mpg123 ^^^^^
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.682s Sys:   0.057s =    6.739s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.1586% Sys:  0.8414% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.693s Sys:   0.057s =    6.750s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.1543% Sys:  0.8457% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.671s Sys:   0.055s =    6.727s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.1761% Sys:  0.8239% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.671s Sys:   0.058s =    6.729s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.1383% Sys:  0.8617% = 100.0000%
BENCHMARKs: VC:   0.000s VO:   0.000s A:   6.670s Sys:   0.055s =    6.725s
BENCHMARK%: VC:  0.0000% VO:  0.0000% A: 99.1802% Sys:  0.8198% = 100.0000%
mp3 ^^^^^

That's a mean of 6.33 s for libmpg123 against 6.68 s for mp3lib, a decrease in decoding time of 5.2%. Of course, we got quite some variation here and I'm too lazy to compute the standard deviation, but it is apparent that mpg123 has a consistent lead here.

What is the deal with ffmp3float on my Core2Duo?

I'm comparing with libmpg123 in floating point mode (also assembly optimized). This does not seem right. The waves look good. How can this be suddenly that fast? What kind of 64 bit assembly magic is part of that?

It is suspicious that ffmp3float is not consistent. Repeated runs show differing numbers. Most interestingly such a run:

      10817,626571 task-clock                #    0,049 CPUs utilized          
            16.540 context-switches          #    0,002 M/sec                  
                19 CPU-migrations            #    0,000 M/sec                  
             3.377 page-faults               #    0,000 M/sec                  
     8.544.770.425 cycles                    #    0,790 GHz                    
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
     8.038.105.826 instructions              #    0,94  insns per cycle        
     1.302.988.705 branches                  #  120,451 M/sec                  
        41.415.237 branch-misses             #    3,18% of all branches        

     222,249003389 seconds time elapsed

Well, how do you come from 12 seconds to 222? Easy, you actually play in realtime.

MPlayer SVN-r34365-4.6.1 (C) 2000-2011 MPlayer Team

Playing pcm:file=/dev/null.
File not found: 'pcm:file=/dev/null'
Failed to open pcm:file=/dev/null.


Playing /dev/shm/convergence_-_points_of_view/03 - Listen.mp3.

This looks rather corrupt from the beginning. The command line is always the same, the interpretation differs.

Valgrind is no help with my MPlayer build:

valgrind: m_debuginfo/readdwarf.c:2162 (copy_convert_CfiExpr_tree):
  Assertion 'srcix >= 0 && srcix < VG_(sizeXA)(srcxa)' failed.

Well, I'm not here for debugging ffmp3float. Perhaps that's a bug fixed since I did my checkout of MPlayer / ffmpeg.

Summary

Let me summarize the results. Where did I start, where did I end up? This is the approximate picture here, inside MPlayer:

run	user time K6-III+ / s	user time Duron / s	ratio K6 / Duron
run	new (old)	new (old) / s	new (old)
mplayer-mp3lib	56.5 (56.5)	26.5 (25.4)	2.13 (2.24)
mplayer-mpg123	53.9 (56.5)	27.0 (28.7)	2.00 (2.33)

As there is quite some variation and also typos involved, I want to make sure about this. Let's give the machines some time to grind down the data.

Let's automate the whole thing. I need to do builds of MPlayer with all three variants of ad_mpg123 and both libmpg123 versions, the latter static, too. To reiterate, the variants are:

Variant 0: the old state of ad_mpg123.c, currently in MPlayer svn. It uses mpg123 feeder API with packets, mempcy all around.
Variant 1: same function as the old state, just #ifdef cruft removed and the decoding loop reworked a bit. Needed for currently released mpg123 versions.
Variant 2: new libmpg123 usage with decoder buffer sharing. For safety reasons, this is only activated for upcoming mpg123 releases (older versions do not offer a way to ensure that resampling is not activated on (bogus) input data changes, possibly enlarging the needed decoding buffer).

I'll use extra CFLAGS of '-march=k6-3 -mtune=k6-3' for MPlayer, '-march=k6-3 -mtune=k6-3 -O3' for mpg123. For the latter, I have to see if the builtin default set of flags is really a good idea. Has -funroll-all-loops proven to be good? Oh, I see something. I wonder if omitting frame pointers interferes with debugging on i386 and explains the difficulties I had in the profiles? But for plain MPlayer that worked fine. Ah

To be fair, let's use the same set of CFLAGS for the programs, MPlayer does this: -O4 -march=native -mtune=native -pipe -ffast-math -fomit-frame-pointer -fno-tree-vectorize, I add -march=k6-3 -mtune=k6-3 to make sure the build works on the K6-3 (I'll change that to athlon for the Duron). Let's use the same flags for both:

export CFLAGS="-O4 -march=native -mtune=native -pipe -ffast-math -fomit-frame-pointer -fno-tree-vectorize"
cd mplayer && ./configure && make && cd ..
cd mpg123 && ./configure --with-optimization=0 && make

Well, here is a full script to collect some reliable data on the two machines:

#!/bin/sh
# You check out MPlayer and mpg123 sources before running this script.
mplayersrc=$HOME/mplayer
mpg123src=$HOME/mpg123

# number of runs for each configuration,
# to be used for statistics in postprocessing
runs=20

# also place the old and new variant of ad_mpg123 in there
ad_mpg123=$mplayersrc/libmpcodecs/ad_mpg123.c
if test -f $ad_mpg123.old && test -f $ad_mpg123.new; then
  echo "good, ad_mpg123 sources exist"
else
  echo "missing ad_mpg123 variants"
  exit 1
fi

host=`hostname`
if test $host = duron; then
  arch=athlon
elif test $host = k6-3; then
  arch=k6-3
else
  arch=native
fi

export CFLAGS="-O4 -march=$arch -mtune=$arch -pipe -ffast-math -fomit-frame-pointer -fno-tree-vectorize"

mpg123prefix=$HOME/mpg123-$arch
# prepare builds
rm -rf $mpg123prefix &&
cd $mpg123src &&
make clean &&
./configure --prefix=$mpg123prefix --with-optimization=0 --enable-static &&
make &&
make install &&
cd $mplayersrc &&
make clean &&
# prepare a first build, then hack around the libmpg123 building/linking
./configure && make &&

# add possible CPPFLAGS, like -I$mpg123prefix/include
build_ad_mpg123() {
  cc -MD -MP -Wundef -Wall -Wno-switch -Wno-parentheses \
  -Wpointer-arith -Wredundant-decls -Wstrict-prototypes -Wmissing-prototypes \
  -Wdisabled-optimization -Wno-pointer-sign -Wdeclaration-after-statement -std=gnu99 \
  -Werror-implicit-function-declaration \
  $CFLAGS "$@" \
  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE \
  -Ilibdvdread4 -I. -Iffmpeg -D_REENTRANT \
  -c -o libmpcodecs/ad_mpg123.o libmpcodecs/ad_mpg123.c
}

# only argument: the part that links mpg123
# so, -lmpg123 in the simplest case
link_mplayer() {
  # Why is there -ffast-math in the linker command?
  cc -o mplayer command.o m_property.o mixer.o mp_fifo.o \
  mplayer.o parser-mpcmd.o pnm_loader.o input/input.o \
    libao2/ao_mpegpes.o libao2/ao_null.o libao2/ao_pcm.o \
    libao2/audio_out.o libvo/aspect.o libvo/geometry.o \
    libvo/video_out.o libvo/vo_mpegpes.o libvo/vo_null.o \
    sub/spuenc.o libao2/ao_alsa.o input/appleir.o libvo/vo_fbdev.o \
    libvo/vo_fbdev2.o libvo/vo_png.o libvo/vo_md5sum.o \
    udp_sync.o libao2/ao_oss.o libvo/vo_pnm.o libvo/vo_tga.o \
    libvo/vo_v4l2.o libao2/ao_v4l2.o libvo/vo_cvidix.o \
    libvo/vosub_vidix.o vidix/vidix.o vidix/drivers.o vidix/dha.o \
    vidix/mtrr.o vidix/pci.o vidix/pci_names.o vidix/pci_dev_ids.o \
    vidix/cyberblade_vid.o vidix/mach64_vid.o vidix/mga_vid.o \
    vidix/mga_crtc2_vid.o vidix/nvidia_vid.o vidix/pm2_vid.o \
    vidix/pm3_vid.o vidix/radeon_vid.o vidix/rage128_vid.o \
    vidix/s3_vid.o vidix/sis_vid.o vidix/sis_bridge.o \
    vidix/unichrome_vid.o libvo/vo_yuv4mpeg.o asxparser.o \
    bstr.o codec-cfg.o cpudetect.o edl.o fmt-conversion.o \
    m_config.o m_option.o m_struct.o mp_msg.o mp_strings.o \
    mpcommon.o parser-cfg.o path.o playtree.o playtreeparser.o \
    subopt-helper.o libaf/af.o libaf/af_center.o libaf/af_channels.o \
    libaf/af_comp.o libaf/af_delay.o libaf/af_dummy.o libaf/af_equalizer.o \
    libaf/af_extrastereo.o libaf/af_format.o libaf/af_gate.o libaf/af_hrtf.o \
    libaf/af_karaoke.o libaf/af_pan.o libaf/af_resample.o libaf/af_scaletempo.o \
    libaf/af_sinesuppress.o libaf/af_stats.o libaf/af_sub.o libaf/af_surround.o \
    libaf/af_sweep.o libaf/af_tools.o libaf/af_volnorm.o libaf/af_volume.o \
    libaf/filter.o libaf/format.o libaf/reorder_ch.o libaf/window.o \
    libmpcodecs/ad.o libmpcodecs/ad_alaw.o libmpcodecs/ad_dk3adpcm.o \
    libmpcodecs/ad_dvdpcm.o libmpcodecs/ad_hwac3.o libmpcodecs/ad_hwmpa.o \
    libmpcodecs/ad_imaadpcm.o libmpcodecs/ad_msadpcm.o libmpcodecs/ad_pcm.o \
    libmpcodecs/dec_audio.o libmpcodecs/dec_teletext.o libmpcodecs/dec_video.o \
    libmpcodecs/img_format.o libmpcodecs/mp_image.o \
    libmpcodecs/pullup.o libmpcodecs/vd.o libmpcodecs/vd_hmblck.o \
    libmpcodecs/vd_lzo.o libmpcodecs/vd_mpegpes.o libmpcodecs/vd_mtga.o \
    libmpcodecs/vd_null.o libmpcodecs/vd_raw.o libmpcodecs/vd_sgi.o \
    libmpcodecs/vf.o libmpcodecs/vf_1bpp.o libmpcodecs/vf_2xsai.o \
    libmpcodecs/vf_blackframe.o libmpcodecs/vf_boxblur.o libmpcodecs/vf_crop.o \
    libmpcodecs/vf_cropdetect.o libmpcodecs/vf_decimate.o \
    libmpcodecs/vf_delogo.o libmpcodecs/vf_denoise3d.o libmpcodecs/vf_detc.o \
    libmpcodecs/vf_dint.o libmpcodecs/vf_divtc.o libmpcodecs/vf_down3dright.o \
    libmpcodecs/vf_dsize.o libmpcodecs/vf_dvbscale.o \
    libmpcodecs/vf_eq.o libmpcodecs/vf_eq2.o libmpcodecs/vf_expand.o \
    libmpcodecs/vf_field.o libmpcodecs/vf_fil.o libmpcodecs/vf_filmdint.o \
    libmpcodecs/vf_fixpts.o libmpcodecs/vf_flip.o libmpcodecs/vf_format.o \
    libmpcodecs/vf_framestep.o libmpcodecs/vf_gradfun.o \
    libmpcodecs/vf_halfpack.o libmpcodecs/vf_harddup.o \
    libmpcodecs/vf_hqdn3d.o libmpcodecs/vf_hue.o libmpcodecs/vf_il.o \
    libmpcodecs/vf_ilpack.o libmpcodecs/vf_ivtc.o libmpcodecs/vf_kerndeint.o \
    libmpcodecs/vf_mirror.o libmpcodecs/vf_noformat.o libmpcodecs/vf_noise.o \
    libmpcodecs/vf_ow.o libmpcodecs/vf_palette.o libmpcodecs/vf_perspective.o \
    libmpcodecs/vf_phase.o libmpcodecs/vf_pp7.o libmpcodecs/vf_pullup.o \
    libmpcodecs/vf_rectangle.o libmpcodecs/vf_remove_logo.o \
    libmpcodecs/vf_rgbtest.o libmpcodecs/vf_rotate.o libmpcodecs/vf_sab.o \
    libmpcodecs/vf_scale.o libmpcodecs/vf_smartblur.o \
    libmpcodecs/vf_softpulldown.o libmpcodecs/vf_stereo3d.o \
    libmpcodecs/vf_softskip.o libmpcodecs/vf_swapuv.o \
    libmpcodecs/vf_telecine.o libmpcodecs/vf_test.o libmpcodecs/vf_tfields.o \
    libmpcodecs/vf_tile.o libmpcodecs/vf_tinterlace.o libmpcodecs/vf_unsharp.o \
    libmpcodecs/vf_vo.o libmpcodecs/vf_yadif.o libmpcodecs/vf_yuvcsp.o \
    libmpcodecs/vf_yvu9.o libmpdemux/aac_hdr.o libmpdemux/asfheader.o \
    libmpdemux/aviheader.o libmpdemux/aviprint.o libmpdemux/demuxer.o \
    libmpdemux/demux_aac.o libmpdemux/demux_asf.o libmpdemux/demux_audio.o \
    libmpdemux/demux_avi.o libmpdemux/demux_demuxers.o \
    libmpdemux/demux_film.o libmpdemux/demux_fli.o libmpdemux/demux_lmlm4.o \
    libmpdemux/demux_mf.o libmpdemux/demux_mkv.o libmpdemux/demux_mov.o \
    libmpdemux/demux_mpg.o libmpdemux/demux_nsv.o libmpdemux/demux_pva.o \
    libmpdemux/demux_rawaudio.o libmpdemux/demux_rawvideo.o \
    libmpdemux/demux_realaud.o libmpdemux/demux_real.o libmpdemux/demux_roq.o \
    libmpdemux/demux_smjpeg.o libmpdemux/demux_ts.o libmpdemux/demux_ty.o \
    libmpdemux/demux_ty_osd.o libmpdemux/demux_viv.o libmpdemux/demux_vqf.o \
    libmpdemux/demux_y4m.o libmpdemux/ebml.o libmpdemux/extension.o \
    libmpdemux/mf.o libmpdemux/mp3_hdr.o libmpdemux/mp_taglists.o \
    libmpdemux/mpeg_hdr.o libmpdemux/mpeg_packetizer.o libmpdemux/parse_es.o \
    libmpdemux/parse_mp4.o libmpdemux/video.o libmpdemux/yuv4mpeg.o \
    libmpdemux/yuv4mpeg_ratio.o osdep/getch2.o osdep/timer-linux.o \
    stream/open.o stream/stream.o stream/stream_bd.o stream/stream_cue.o \
    stream/stream_file.o stream/stream_mf.o stream/stream_null.o \
    stream/url.o sub/eosd.o sub/find_sub.o sub/osd.o sub/spudec.o \
    sub/sub.o sub/sub_cc.o sub/subreader.o sub/vobsub.o stream/ai_alsa.o \
    stream/ai_oss.o sub/font_load.o stream/dvb_tune.o stream/stream_dvb.o \
    stream/stream_dvdnav.o libdvdnav/dvdnav.o libdvdnav/highlight.o \
    libdvdnav/navigation.o libdvdnav/read_cache.o libdvdnav/remap.o \
    libdvdnav/searching.o libdvdnav/settings.o libdvdnav/vm/decoder.o \
    libdvdnav/vm/vm.o libdvdnav/vm/vmcmd.o stream/stream_dvd.o \
    stream/stream_dvd_common.o libdvdread4/bitreader.o libdvdread4/dvd_input.o \
    libdvdread4/dvd_reader.o libdvdread4/dvd_udf.o libdvdread4/ifo_print.o \
    libdvdread4/ifo_read.o libdvdread4/md5.o libdvdread4/nav_print.o \
    libdvdread4/nav_read.o libvo/aclib.o av_helpers.o av_opts.o \
    libaf/af_lavcresample.o libmpcodecs/ad_ffmpeg.o libmpcodecs/ad_spdif.o \
    libmpcodecs/vd_ffmpeg.o libmpcodecs/vf_geq.o libmpcodecs/vf_lavc.o \
    libmpcodecs/vf_lavcdeint.o libmpcodecs/vf_pp.o libmpcodecs/vf_screenshot.o \
    libmpdemux/demux_lavf.o stream/stream_ffmpeg.o sub/av_sub.o \
    libaf/af_lavcac3enc.o libmpcodecs/vf_fspp.o libmpcodecs/vf_mcdeint.o \
    libmpcodecs/vf_qp.o libmpcodecs/vf_spp.o libmpcodecs/vf_uspp.o \
    stream/stream_ftp.o libmpcodecs/vf_bmovl.o libaf/af_export.o \
    osdep/mmap_anon.o libdvdcss/css.o libdvdcss/device.o libdvdcss/error.o \
    libdvdcss/ioctl.o libdvdcss/libdvdcss.o libmpcodecs/vd_libmpeg2.o \
    libmpeg2/alloc.o libmpeg2/cpu_accel.o libmpeg2/cpu_state.o \
    libmpeg2/decode.o libmpeg2/header.o libmpeg2/idct.o libmpeg2/motion_comp.o \
    libmpeg2/slice.o libmpeg2/idct_mmx.o libmpeg2/motion_comp_mmx.o \
    libmpcodecs/ad_mpg123.o libmpcodecs/ad_mp3lib.o mp3lib/sr1.o \
    mp3lib/decode_i586.o mp3lib/dct36_3dnow.o mp3lib/dct64_3dnow.o \
    mp3lib/dct36_k7.o mp3lib/dct64_k7.o mp3lib/dct64_mmx.o mp3lib/decode_mmx.o \
    stream/stream_rtsp.o stream/freesdp/common.o stream/freesdp/errorlist.o \
    stream/freesdp/parser.o stream/librtsp/rtsp.o stream/librtsp/rtsp_rtp.o \
    stream/librtsp/rtsp_session.o stream/stream_netstream.o \
    stream/asf_mmst_streaming.o stream/asf_streaming.o stream/cookies.o \
    stream/http.o stream/network.o stream/pnm.o stream/rtp.o \
    stream/udp.o stream/tcp.o stream/stream_rtp.o stream/stream_udp.o \
    stream/realrtsp/asmrp.o stream/realrtsp/real.o stream/realrtsp/rmff.o \
    stream/realrtsp/sdpplin.o stream/realrtsp/xbuffer.o stream/stream_pvr.o \
    libmpcodecs/ad_qtaudio.o libmpcodecs/vd_qtvideo.o libmpcodecs/ad_realaud.o \
    libmpcodecs/vd_realvid.o stream/cache2.o tremor/bitwise.o tremor/block.o \
    tremor/codebook.o tremor/floor0.o tremor/floor1.o tremor/framing.o \
    tremor/info.o tremor/mapping0.o tremor/mdct.o tremor/registry.o \
    tremor/res012.o tremor/sharedbook.o tremor/synthesis.o tremor/window.o \
    stream/stream_tv.o stream/tv.o stream/frequencies.o stream/tvi_dummy.o \
    stream/tvi_v4l.o stream/audio_in.o stream/tvi_v4l2.o sub/unrar_exec.o \
    stream/stream_vcd.o libmpcodecs/ad_libvorbis.o libmpdemux/demux_ogg.o \
    loader/wrapper.o loader/elfdll.o loader/ext.o loader/ldt_keeper.o \
    loader/module.o loader/pe_image.o loader/pe_resource.o loader/registry.o \
    loader/resource.o loader/win32.o libmpcodecs/ad_acm.o libmpcodecs/ad_dmo.o \
    libmpcodecs/ad_dshow.o libmpcodecs/ad_twin.o libmpcodecs/vd_dmo.o \
    libmpcodecs/vd_dshow.o libmpcodecs/vd_vfw.o libmpcodecs/vd_vfwex.o \
    libmpdemux/demux_avs.o loader/afl.o loader/drv.o loader/vfl.o \
    loader/dshow/DS_AudioDecoder.o loader/dshow/DS_Filter.o \
    loader/dshow/DS_VideoDecoder.o loader/dshow/allocator.o \
    loader/dshow/cmediasample.o loader/dshow/graph.o loader/dshow/guids.o \
    loader/dshow/inputpin.o loader/dshow/mediatype.o loader/dshow/outputpin.o \
    loader/dmo/DMO_AudioDecoder.o loader/dmo/DMO_VideoDecoder.o \
    loader/dmo/buffer.o loader/dmo/dmo.o loader/dmo/dmo_guids.o \
    libmpcodecs/vd_xanim.o osdep/shmem.o \
    ffmpeg/libpostproc/libpostproc.a ffmpeg/libswscale/libswscale.a \
    ffmpeg/libavfilter/libavfilter.a ffmpeg/libavformat/libavformat.a \
    ffmpeg/libavcodec/libavcodec.a ffmpeg/libavutil/libavutil.a \
    "$@" -Wl,-z,noexecstack -lm  -ffast-math \
    -lncurses -lasound -ldl -lpthread -lpthread -ldl -rdynamic
}

# Be careful with the copying!
# Debian has messed up scripting so that the x bit normally is not copied!
# That and the strange format of `time` output when called in a script, is, strange.

cp -av $ad_mpg123.old $ad_mpg123 &&
build_ad_mpg123 &&
link_mplayer -lmpg123 &&
cp -av mplayer mplayer-variant0-dynamic-sys &&
link_mplayer -Wl,-rpath=$mpg123prefix/lib -lmpg123 &&
cp -av mplayer mplayer-variant0-dynamic-trunk &&
link_mplayer $mpg123prefix/lib/libmpg123.a &&
cp -av mplayer mplayer-variant0-static-trunk &&

cp -av $ad_mpg123.new $ad_mpg123 &&
build_ad_mpg123 &&
link_mplayer -lmpg123 &&
cp -av mplayer mplayer-variant1-dynamic-sys &&
link_mplayer -Wl,-rpath=$mpg123prefix/lib -lmpg123 &&
cp -av mplayer mplayer-variant1-dynamic-trunk &&
link_mplayer $mpg123prefix/lib/libmpg123.a &&
cp -av mplayer mplayer-variant1-static-trunk &&

cp -av $ad_mpg123.new $ad_mpg123 &&
build_ad_mpg123 -I$mpg123prefix/include &&
# also variant 2 with old lib, not recommened,
# but should still work (or at least fail gracefully)
link_mplayer -lmpg123 &&
cp -av mplayer mplayer-variant2-dynamic-sys &&
link_mplayer -Wl,-rpath=$mpg123prefix/lib -lmpg123 &&
cp -av mplayer mplayer-variant2-dynamic-trunk &&
link_mplayer $mpg123prefix/lib/libmpg123.a &&
cp -av mplayer mplayer-variant2-static-trunk || exit 1

time_run() {
  time "$@"
}

mark_run() {
  echo
  echo "=== This was run $1 of $2 ==="
  echo
}

album=/dev/shm/convergence_-_points_of_view
# some runs for mpg123 to set the baseline
for d in mpg123 $mpg123prefix/bin/mpg123
do
  for n in `seq 1 20`
  do
    time_run $d -w /dev/null $album/*.mp3
    mark_run $n $d
  done
done

# now the mplayers
for d in $mplayersrc/mplayer-variant*
do
  for ac in mp3 mpg123
  do
  for n in `seq 1 20`
  do
    time_run $d -ac $ac -quiet -ao pcm:fast:file=/dev/null $album/*.mp3
    mark_run $n $d:$ac
  done
  done
done

I'll let this run on the two machines over the day...

This is the result for user times... arithmetic mean over 20 runs each, with $\sigma$ being the standard deviation (rounded just like the mean, so there can be full zeroes, one should always wiggle the last digit, though). The mplayer builds are labelled with the ad_mpg123 variant number, a ``d'' for dynamic, ``s'' for static linking to libmpg123, followed by ``s'' for system libmpg123 (a bit older, usually) or ``t'' for a checkout of mpg123 svn trunk, built with optimized flags.

	K6-III+@550		Duron@800		C2@800		Semp@800
run	ts	$\sigma/$ s	ts	$\sigma/$ s	ts	$\sigma/$ s	ts	$\sigma/$ s
mpg123-sys	51.5	0.3	22.5	0.1	15.5	0.0	14.8	0.0
mpg123-trunk	50.6	0.3	22.4	0.1	15.4	0.0	14.8	0.0
mpl0ds-mp3lib	56.0	0.4	25.5	0.1	18.6	0.1	19.3	2.0
mpl0ds-mpg123	56.0	0.3	26.1	0.1	16.5	0.1	15.6	0.1
mpl0dt-mp3lib	56.2	0.4	25.5	0.1	18.6	0.0	19.3	2.0
mpl0dt-mpg123	55.1	0.3	25.8	0.1	16.3	0.0	15.5	0.1
mpl0st-mp3lib	56.5	0.3	25.6	0.2	18.6	0.0	19.3	1.8
mpl0st-mpg123	55.0	0.2	*25.5*	0.1	16.3	0.0	15.5	0.1
mpl1ds-mp3lib	56.2	0.4	25.5	0.1	18.6	0.1	20.5	1.8
mpl1ds-mpg123	57.2	0.3	26.9	0.1	16.4	0.0	15.7	0.1
mpl1dt-mp3lib	56.2	0.3	25.5	0.1	18.6	0.0	20.3	1.9
mpl1dt-mpg123	56.2	0.3	26.8	0.1	16.3	0.0	15.7	0.1
mpl1st-mp3lib	56.6	0.3	25.6	0.2	18.6	0.0	19.6	1.8
mpl1st-mpg123	56.3	0.3	26.6	0.1	16.3	0.0	15.7	0.1
mpl2dt-mp3lib	56.4	0.4	25.5	0.1	18.3	0.0	19.2	1.9
mpl2dt-mpg123	*54.5*	0.3	26.1	0.1	*16.0*	0.0	*15.4*	0.1
mpl2st-mp3lib	56.6	0.5	25.6	0.1	18.3	0.0	19.8	1.9
mpl2st-mpg123	*54.5*	0.3	25.9	0.0	*16.0*	0.0	15.5	0.1
best mp3lib	56.0	0.4	25.5	0.1	18.3	0.0	19.2	1.9
best mpg123	54.5	0.3	25.5	0.1	16.0	0.0	15.4	0.1
best variant	mpl2dt		mpl0st		mpl2dt		mpl2dt

The systems involved are the K6-III+ at 550 MHz clock and Duron at 800 MHz, equipped with PC100 and PC133 SDRAM, respectively, and both running Debian Squeeze. Then, there is my workhorse Thinkpad X200 with a Core2Duo P8600 clocked at 800 MHz for the test, supported by some DDR3-1066 RAM (two modules). The OS here is a custom Source Mage GNU/Linux x86-64 with gcc 4.6, glibc 2.11.2 (I got my reasons;-) and kernel 3.2.1 with that ASPM fix. The last entry is a media box with a Sempron 140 unlocked to Athlon II, also running at 800 MHz for the test. RAM is also DDR3 here, but two modules of DDR3-1333. OS is ubuntu 10.04 LTS x86-64.

It's interesting that the Duron/Athlon is capable of keeping the Core2Duo in check; probably assisted by that faster RAM, though I do wonder how much effect that has since the relevant code should nicely fit into the cache. But the audio data won't, of course. And the Core2Duo has more cache. Whatever. Both machines are in 64 bit mode and are using the 64 bit SSE code in mpg123. And not to forget that this is the inverse picture for mp3lib: That is significantly slower, the spread on Sempron meaning that mp3lib can be about 33% slower than the mpg123 binding!That puts the idea about faster RAM to the test, eh?

The good

bit is that MPlayer with libmpg123 is able to beat mp3lib or at least come on par. Also, the best variant is almost always the new code using a common buffer for decoding, avoiding some memcpy and thus saving time, aggreeing to theory.

The bad

bit is that the Duron plays tricks with me. It is the one system where libmpg123 is not really able to be faster than mp3lib (when used from MPlayer, of course, stand-alone mpg123 beats it easily). And it is the one system where the initial, in theory less efficient, code for the libmpg123 binding is faster than the newest iteration.

Also, the numbers in this table are noticeably different from the initial measurements. I attribute this to the statistics, and changed CFLAGS... in any case, there is really quite some variation of individual values going on. One might wonder why (at least for the Sempron?) there are rather clear influences of built type on mp3lib performance. The mp3lib code has not been touched.

Variant 1 vs. variant 0

Well, I did change something going from 0 to 1: The decoding loop is different to support the frame decoding. That difference amounts to about a whole second runtime difference, based on around 26 seconds total time. This is about 4%. Heck, this is not limited to the Duron -- look closer! This means also almost 2% hit on the K6-III+ and also 1% on the Sempron. The Core2Duo doesn't complain at all.

Changing the decoding loop order doesn't do the trick. This must be a strange side-effect. Really, the code logic (when switching the loop around) is the same. Just the #ifdefs around are different.

Really, it doesn't make sense. The calls to libmpg123 are the same, the order of feeding vs. decoding in the loop doesn't make a difference... Hacking the variant 1 feeding decode to also return decoded data (reducing calls to mpg123_decode) worsens the situation. I'm lost here and also fed up. This is insanity. The next gcc might reversen the situation. So, for now, just use fresh MPlayer with fresh mpg123 to get variant 2 and be done with the small performance hit in that specific configuration. Or, if you lean that way, get on with debugging what the heck the compiler is doing there, and also make sure you find the reason why mp3lib tends to have different modes of operation with differing variance of runtime depending on what ad_mpg123 code is built (easy to spot in the Sempron runs).

One can still argue that there is something sub-optimal with our 3DNowExt assembly (although it works wonderfully for the K6-III+), specific to the Duron's small cache. But then, profiling does not suggest that there is excess time spent in that code. Would have to be a side effect making other code slower because of cache trashing. If that's the case, however, this effect should be smaller on a proper Athlon (Thunderbird, at least). It could be that this is the reason for the effect being half as big for the K6-III+, aleady. But the Sempron does not employ the 3DNowExt code and also has not exactly a small cache. Though, I admit that the size of impact indeed does seem to correlate with L2 cache size: We got 64K, 256K, 1M, 3M. Still, it's apples against oranges with the differing platforms (toolchain).

But if you need 3M of cache to cover the hot code / data, then I'd conclude that MPlayer is rather fat there. I guess that switching off the MPEG stream parsing there could help, as that code is doubled in libmpg123, too. I also keep noticing routines checking for AC3 stuff in MPlayer. I dunno why they are needed here. There is too much code running.

And my mysery is a bit offset by another way at looking at mp3lib: Comparing the Core2Duo with the Sempron we have plain mpg123 runtime of 15.4 vs. 14.8 s, Sempron is 0.6 s faster. From within MPlayer, using dynamic trunk libmpg123, it is rather consistently around 0.6 s faster, too. So, this all seams reasonable, once you accept that the Sempron has a better time with the x86-64 assemblly code.

With mp3lib, there is a spread from 18.3 to 18.6 s on the Core2Duo, and on the Sempron from 20.8 to 21.6; the direct comparison of runs shows 2.3 to 3 s increase in runtime. Now explain why the Sempron has a harder time with MPlayer's 64 bit SSE assembly code, a time harder by 5%; and comparing that to mpg123 on the same box, it's 33%, even. With that in mind, 4% unexplained performance hit on the Duron is not that bad, is it?

Flaky mp3lib

The final set of runs presented here has rather consistent performance of mp3lib on the Sempron, but with really large spread around the mean. Earlier runs had somewhat smaller error range but higher mean runtime. There is definetely more strangeness going on with performance in mp3lib, and perhaps by extension generally with the (my?) MPlayer binaries, than with the libmpg123 binding. 'nuff said.

Conclusion

Well, when you install a new MPlayer and new mpg123, you can expect an increase in MP3 decoding performance. There are certain platforms that defy logic, or at least bring out the utmost fuzzyness out of it. That fuzzyness is embodied somehow in the build of MPlayer, changing efficiency of parts of code all over the place without directly touching them. The build on the Duron gives, scaled relative to plain mpg123 performance, compared to K6-III+, both mp3lib and mpg123 an extra hit, but delivers a bigger blow to mpg123. The situation is on full reverse, sort of, on a Sempron.

That strangeness in mind, I hereby declare that I want to keep what's left of my sanity and also that the new variant of ad_mpg123 should go into MPlayer's trunk and is rightfully obsoleting mp3lib. If someone opposes that, that someone should first give me a consistent picture of what exactly the mp3lib performance is;-)

Addendum: A closer look at the Duron, victory after all!

Now, after recognizing that current ffmpeg indeed improved the mp3float decoder to surpass mpg123 on a Core2 CPU, with the help of SSE handcrafting and, more importantly, an approach to Huffman decoding that fits that CPU a lot better, I checked the situation again on the Duron.

There, the bare comparison between ffmpeg and mpg123, when both decode to floating point numbers, showed two things:

Head-to-head, mpg123 is faster than ffmpeg's mp3float decoder.
The current dct36 optimizations for 3DNow and 3DNowExt are actually worse than their plain C counterparts (generic or i386). The synth_1to1 for 16 bit output still rules (though I will have to investigate how well the dct64 really works). Upshoot of that is: I can make mpg123 a bit faster by disabling the optimized dct36! This improves the lead of mpg123 over ffmpeg on this machine and also offers a chance to be able to say that we truly beat good old mp3lib here.

With that latest change, the picture finally fits on the Duron. I re-run some benchmarks with updated MPlayer checkout, usage of ffmp3float is also included, but note that the comparison is actually skewed in favour of ffmpeg as it does not do the final conversion to 16 bits (it would need a bit more time with that).

I repeat the big table for the current ad_mpg123 in MPlayer (leaving only variants 1 and 2), updating the columns for the Duron with the new numbers. Note that I assume, but do not know, that the K6-III+ would react similar to the Duron. I'll have to run the updated test there, too... some time.

	K6-III+@550		Duron@800 new		C2@800		Semp@800
run	ts	$\sigma/$ s	ts	$\sigma/$ s	ts	$\sigma/$ s	ts	$\sigma/$ s
mpg123-sys	51.5	0.3	22.5	0.1	15.5	0.0	14.8	0.0
mpg123-trunk	50.6	0.3	21.3	0.1	15.4	0.0	14.8	0.0
mpl1ds-mp3lib	56.2	0.4	25.5	0.1	18.6	0.1	20.5	1.8
mpl1ds-mpg123	57.2	0.3	27.0	0.1	16.4	0.0	15.7	0.1
mpl1ds-ffmp3float			29.5	0.1
mpl1dt-mp3lib	56.2	0.3	25.5	0.1	18.6	0.0	20.3	1.9
mpl1dt-mpg123	56.2	0.3	25.6	0.1	16.3	0.0	15.7	0.1
mpl1dt-ffmp3float			29.7	0.2
mpl1st-mp3lib	56.6	0.3	25.5	0.1	18.6	0.0	19.6	1.8
mpl1st-mpg123	56.3	0.3	25.5	0.1	16.3	0.0	15.7	0.1
mpl1st-ffmp3float			29.7	0.2
mpl2dt-mp3lib	56.4	0.4	25.5	0.1	18.3	0.0	19.2	1.9
mpl2dt-mpg123	*54.5*	0.3	24.9	0.1	*16.0*	0.0	*15.4*	0.1
mpl2dt-ffmp3float			29.6	0.1
mpl2st-mp3lib	56.6	0.5	25.5	0.1	18.3	0.0	19.8	1.9
mpl2st-mpg123	*54.5*	0.3	*24.7*	0.1	*16.0*	0.0	15.5	0.1
mpl2st-ffmp3float			29.7	0.1
best mp3lib	56.0	0.4	25.5	0.1	18.3	0.0	19.2	1.9
best mpg123	54.5	0.3	24.7	0.1	16.0	0.0	15.4	0.1
best variant	mpl2dt		mpl2st		mpl2dt		mpl2dt

So, with an upcoming mpg123 release, mp3lib is clearly beaten. The situation with ffmp3float is also under control on non-SSE machines (or generally non-Core2?). Investigations are underway to get mpg123 up to speed on modern CPUs, at least on par with ffmpeg.

Also, that the hand-crafted DCT routines are overtaken by the compiler's own optimizing is actually a good thing. Of course, I'll have to check if this also applies to the K6-III+ and newer Athlons. If that is the case, we can delete old code for good. On the other hand: There might be potential for even better routines using 3DNow; question is if we care enough for that now obsoleted instruction set... or if we want to go freakish and use 3DNow and SSE at the same time for maximum register count.

About this document ...

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

The command line arguments were:
latex2html -split 0 -local_icons beating_mp3lib_in_mplayer.tex

The translation was initiated by on 2012-03-25

Footnotes

... clear): I re-measured the times for K6-III as there was a typo before.
... CPUs.: Alas, that is limited to 16 bit output, currently.
... meager: I had to build a kernel for K6 to support performance events at all, as opposed to the debian i486 kernel.
... script: Needs adaption on other systems in the MPlayer link line, of course.
... binding!: at least when also taking the error range into account
... MPlayer: I have to check if with actually current MPlayer trunk everything still works, but I won't repeat the lengthy test series.
... mpg123: to-be-released 1.14

2012-03-25

build	Duron	K6-III+	Core 2
variant1-dynamic (1.12.1)	27.74 s	57.46 s
variant1-dynamic (1.13.4)			16.72 s
variant1-dynamic (trunk)	27.81 s	57.05 s	16.87 s
variant2-dynamic	27.10 s	55.92 s	16.54 s
variant2-static	27.01 s	53.92 s	16.51 s
mp3lib	26.45 s	56.47 s	18.56 s
ffmp3	73.52 s	115.08 s	22.00 s
ffmp3float	30.74 s	118.19 s	12.26 s
variant2-float			18.76 s