Over a year ago, I contributed am mpg123 binding for MPlayer to replace the old fork of mpg123 code dubbed "mp3lib". I can say "old" since we incorporated the added optimizations from that one plus added new stuff from Taihei;-)
Generally, mpg123 performs well compared to MPlayer, but there is a performance hit when using libmpg123 from MPlayer, making it slower than the built-in mp3lib. This is mostly apparent on older systems, where possibly cache misses are a source of trouble.
Specifically, the regression has been reported on K6-III+, of which I happen to have a unit available for testing. It has been confirmed for Athlons (no XP), too. Well, I happen to have a good old Duron 800 at my disposal. That should do the trick there.
The software baseline is a current stable debian squeeze (installed from 6.0.3 media), with MPlayer and mpg123 pulled from SVN and built enforcing binaries for a K6-III (also on the Duron), adding in 3DNowExt optimizations via custom assembly on either side.
Furthermore, the data set is, for the sake of comparability with earlier results, the MP3 album "Points of View" by "Convergence". To make sure, I talk of these files with the respective MD5 hashes:
04d953da593d8cfa43aee4d00572ca37 \ /dev/shm/convergence_-_points_of_view/01 - Bleed.mp3 2999166bab6fe443f6a2c15c8eb5d938 \ /dev/shm/convergence_-_points_of_view/02 - Strike the end.mp3 d324e12b3f20bea69c9c5e01252d0d96 \ /dev/shm/convergence_-_points_of_view/03 - Listen.mp3 e0215065002ef2492a8d88c356609c21 \ /dev/shm/convergence_-_points_of_view/04 - Six feet under.mp3 2b19190e15b5e5a805ccfacd789b01ab \ /dev/shm/convergence_-_points_of_view/05 - Always the same.mp3 502c4d1e5793352fb25e6e90d6480415 \ /dev/shm/convergence_-_points_of_view/06 - Breath.mp3 80bffd4e4bd6cc2f330b08e8156070ad \ /dev/shm/convergence_-_points_of_view/07 - Vanished memories.mp3 de5ebd6e36eebf8c77642fa52f9610bd \ /dev/shm/convergence_-_points_of_view/08 - Silent.mp3 c4c1167b5ca29c061d75f6104eec9283 \ /dev/shm/convergence_-_points_of_view/09 - Nothing else.mp3 df624cee3f7b65be9b98e449e122ac63 \ /dev/shm/convergence_-_points_of_view/10 - Train to leave.mp3
As you may notice: I try to eliminate the hard disk at least for the test data by moving stuff to RAM. Even the K6-III+ has no issue with sparing some of its 512 MiB of RAM (the duron setup has 768 MiB).
The command lines try to do comparable things in mpg123 and MPlayer:
mplayer -ac mp3 -quiet -ao pcm:file=/dev/null \ /dev/shm/convergence_-_points_of_view/*.mp3 mplayer -ac mpg123 -quiet -ao pcm:file=/dev/null \ /dev/shm/convergence_-_points_of_view/*.mp3 mpg123 -w /dev/null /dev/shm/convergence_-_points_of_view/*.mp3
I'm trying to combine timings with good old 'time' and measurements with 'perf'. The latter does not give much information on the K6-III+ with respect to cache misses, so I hope to do the grunt work on the Duron and hopefully the findings will transfer to the K6.
As an additional measure to avoid a penalty for libmpg123 loading, the libmpg123.a is linked in statically into MPlayer. The final solution will hopefully make the dynamic lib perform about as well (considering that stand-alone mpg123 does this successfully).
The build of MPlayer is configured via
./configure --extra-cflags='-march=k6-3 -mtune=k6-3 -g'the mpg123 build via
CFLAGS="-march=k6-3 -mtune=k6-3 -g -O3" ./configure --disable-shared
That results in this set of CFLAGS: -O2
-fomit-frame-pointer
-funroll-all-loops
-finline-functions
-ffast-math
-march=k6-3
-mtune=k6-3
-g
-O3
Those are largely similar to what MPlayer uses, but not identical. Thinking about this... some of those might be even damaging. But: As long as standalone mpg123 has such a clear lead over MPlayer, I doubt that those make much of a difference.
How do my K6-III binaries perform on the two systems? Well, let's record the user time to decode the test album (no statistics as long as the result is rather clear).
run | user time K6-III+ / s | user time Duron / s | ratio K6 / Duron |
mpg123-debian | 51.6 | 22.4 | 2.30 |
mpg123 | 50.4 | 22.3 | 2.26 |
mplayer-mp3lib | 56.5 | 25.4 | 2.24 |
mplayer-mpg123 | 56.5 | 28.7 | 2.33 |
This is using run-time choice of 3DNowExt routines. It might be interesting to compare runtimes of generic C code, the difference attesting to the respective assembly routine quality. But let's stay focused on the case that is important in practice, and that is using 3DNowExt as the most performant choice on these CPUs.Let's express runtime as percent of plain mpg123 runtime:
run | user time K6-III+ | user time Duron |
mpg123-debian | 102.4 % | 100.4 % |
mpg123 | 100.0 % | 100.0 % |
mplayer-mp3lib | 112 % | 113.9 % |
mplayer-mpg123 | 112 % | 128.7 % |
Ivan noted a performance penalty of 8 % on his Athlon. Well, I can offer even more: MPlayer with mpg123 binding adds 13 % on the runtime of MPlayer with mp3lib. Cache size could explain that somewhat, as the Duron does have less of that than an Athlon, though it was not considered that much of a performance difference back then when people cared about comparing Athlons and Durons. Anyhow, good starting point to dig at the reason for the difference, eh? I do remember trying out generic decoders on the K6-III+, there the effect was even larger. If things are unclear, I can resort to that.
Now let's hook up with Linux Performance Events. On the K6-III+, the information was rather meager, I hope for more stats on the Duron as a more modern / better supported CPU. But perhaps it's just my novice usage of perf, which I didn't know before. Oh, hm. Let's see. Ah, yes... all the goodies Ivan used in his perf invocation relies on a PMU being present. That's not the case for the K6.
altwarentester@k6-3:~/mplayer$ perf record -f \ -e cycles:u -e instructions:u -e l1-dcache-loads:u \ -e l1-dcache-load-misses:u -e branch-misses \ ./mplayer -ac mp3 -quiet -ao pcm:file=/dev/null \ /dev/shm/convergence_-_points_of_view/*.mp3 Error: perfcounter syscall returned with -1 (No such device) Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?It seems the only diagnostic I get is cycles, and that's not very special. On the Duron:
altwarentester@duron:~/mplayer$ perf record -f \ -e cycles:u -e instructions:u -e l1-dcache-loads:u \ -e l1-dcache-load-misses:u -e branch-misses \ ./mplayer -ac mp3 -quiet -ao pcm:file=/dev/null \ /dev/shm/convergence_-_points_of_view/*.mp3 Error: failed to mmap with 1 (Operation not permitted)Well. That's nearly success. I have to increase the allowed locked memory count from the default 64 K.
Now let's see what we've got. Hm, recording works, but I don't get how to report all that wonderful information.
Plain perf report
doesn't tell more than usual. I can do that call graph thing, but also here, there is still some strangeness going on with the mpg123 symbols. I'll figure that out.
In the meantime, there is perf stat
, which gives a nice overall picture:
altwarentester@duron:~/mplayer$ perf stat ./mplayer -ac mpg123 \ -quiet -ao pcm:file=/dev/null /dev/shm/convergence_-_points_of_view/*.mp3 29435.786923 task-clock-msecs # 0.994 CPUs 216 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6780 page-faults # 0.000 M/sec 23502185554 cycles # 798.422 M/sec 22870306002 instructions # 0.973 IPC 8562390418 cache-references # 290.884 M/sec 17327364 cache-misses # 0.589 M/sec 29.605841362 seconds time elapsed
altwarentester@duron:~/mplayer$ perf stat ./mplayer -ac mp3 \ -quiet -ao pcm:file=/dev/null /dev/shm/convergence_-_points_of_view/*.mp3 26210.959074 task-clock-msecs # 0.993 CPUs 207 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6290 page-faults # 0.000 M/sec 20923114721 cycles # 798.258 M/sec 20818859901 instructions # 0.995 IPC 7867755901 cache-references # 300.170 M/sec 10962558 cache-misses # 0.418 M/sec 26.387378954 seconds time elapsed
altwarentester@duron:~$ perf stat mpg123/src/mpg123 -w /dev/null \ /dev/shm/convergence_-_points_of_view/*.mp3 23163.487659 task-clock-msecs # 0.996 CPUs 140 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 337 page-faults # 0.000 M/sec 18501736597 cycles # 798.746 M/sec 20408582208 instructions # 1.103 IPC 7871105536 cache-references # 339.807 M/sec 11304577 cache-misses # 0.488 M/sec 23.246529422 seconds time elapsed
The K6 is not that generous with details (not counted from cycles on). Regarding cache misses, that's 17 million for mplayer-mpg123, 11 million for mplayer-mp3 and a similar 11 million for plain mpg123. OK, it is sub-optimal. I know that. Now how to dig deeper? I do wonder if switching to generic C code would help since this reduces the amount of functions involved.
Let's try that: Force MPlayer not to activate special optimizations, tell mpg123 to stick to generic decoding. Hm. I see that mp3lib generic code is actually the i386 code. Well, then, two can play that game.
altwarentester@duron:~$ perf stat mpg123/src/mpg123 --cpu i386 \ -w /dev/null /dev/shm/convergence_-_points_of_view/*.mp3 44296.192117 task-clock-msecs # 0.996 CPUs 263 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 335 page-faults # 0.000 M/sec 35386827671 cycles # 798.868 M/sec 34153833893 instructions # 0.965 IPC 10346222151 cache-references # 233.569 M/sec 13440804 cache-misses # 0.303 M/sec 44.465953954 seconds time elapsed
altwarentester@duron:~/mplayer-generic$ perf stat ./mplayer -ac mp3 \ -quiet -ao pcm:file=/dev/null /dev/shm/convergence_-_points_of_view/*.mp3 32032.688338 task-clock-msecs # 0.994 CPUs 233 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6284 page-faults # 0.000 M/sec 25575954000 cycles # 798.433 M/sec 28423479288 instructions # 1.111 IPC 9061572976 cache-references # 282.885 M/sec 13821093 cache-misses # 0.431 M/sec 32.228199936 seconds time elapsed
altwarentester@duron:~/mplayer-generic$ perf stat ./mplayer -ac mpg123 \ -quiet -ao pcm:file=/dev/null /dev/shm/convergence_-_points_of_view/*.mp3 49121.685292 task-clock-msecs # 0.995 CPUs 309 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6781 page-faults # 0.000 M/sec 39219721233 cycles # 798.420 M/sec 35350587000 instructions # 0.901 IPC 10824302596 cache-references # 220.357 M/sec 21094060 cache-misses # 0.429 M/sec 49.370325203 seconds time elapsed
Wait a moment... This looks a lot like mpg123's C code being mashed into an inferior binary form. It's still slower from MPlayer than from stand-alone mpg123, but the comparison to mp3lib is harsh. But then, wait a bit more: Is MPlayer really playing fair? Did I hack mp3lib correctly? Those 32 seconds aren't that much more than 26 seconds. Let's record.
Hm, well... it looks like generic code, alright. Building mpg123 with -O4
and the rest of the MPlayer CFLAGS doesn't change too much: runtime stays around 44 seconds for standalone mpg123.
But the comparison is still somewhat interesting: Cache is not the point here.
altwarentester@duron:~/mpg123-mpopt$ perf stat src/mpg123 --cpu i386 \ -w /dev/null /dev/shm/convergence_-_points_of_view/*.mp3 44294.299679 task-clock-msecs # 0.996 CPUs 272 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 325 page-faults # 0.000 M/sec 35383207571 cycles # 798.821 M/sec 34414257475 instructions # 0.973 IPC 10373238416 cache-references # 234.189 M/sec 17922320 cache-misses # 0.405 M/sec 44.451762052 seconds time elapsedWell, that fits with the earlier comparison: Differences in cache hit or miss are not really significant, the mpg123 C code plainly sucks!
So, I got two issues at hand:
Let'c check mpg123-0.68, which predates the move to the thread-safe library, with the MPlayer CFLAGS:
altwarentester@duron:~/mpg123-0.68$ perf stat src/mpg123 --cpu i386 \ -w /dev/null /dev/shm/convergence_-_points_of_view/*.mp3 43475.950745 task-clock-msecs # 0.996 CPUs 264 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 317 page-faults # 0.000 M/sec 34730969332 cycles # 798.855 M/sec 34317362404 instructions # 0.988 IPC 10323913086 cache-references # 237.463 M/sec 12847122 cache-misses # 0.295 M/sec 43.628917385 seconds time elapsedDoes not look like that much of an extra hit. Well, about 2 % difference to the current version there. So, it's not nothing. But in that range I lean to argue that there has to be some cost involved in making the code thread-safe, with all that data in the handle and al. Of course, my approach to libmpg123 should have room for cleanup and optimization per se, but we are not talking about 2 % here, compared to mp3lib. This is more serious than the simple transition to libmpg123.
It is difficult to test earlier versions of mpg123 since they happen not to like my test data. One should not forget that I added bug fixes over time! Well, I suspect my approach to CPU runtime optimization choice. Found a set of old MP3s to work on (``Dune'', by ... eh... ``Dune'')... and the difference is appaling:
altwarentester@duron:~/mpg123-0.59r$ perf stat ./mpg123 -t \ /home/altwarentester/dune/*.mp3 23312.117777 task-clock-msecs # 0.996 CPUs 152 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 14344 page-faults # 0.001 M/sec 18623038218 cycles # 798.857 M/sec 24254735169 instructions # 1.302 IPC 8597454473 cache-references # 368.798 M/sec 4262514 cache-misses # 0.183 M/sec 23.394514625 seconds time elapsed
altwarentester@duron:~/mpg123-0.68$ perf stat ./mpg123 -t \ /home/altwarentester/dune/*.mp3 32211.610636 task-clock-msecs # 0.996 CPUs 199 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 296 page-faults # 0.000 M/sec 25731964846 cycles # 798.841 M/sec 30810795898 instructions # 1.197 IPC 10904946603 cache-references # 338.541 M/sec 7285056 cache-misses # 0.226 M/sec 32.326412050 seconds time elapsed
Frick. That is a huge hit. Hah! I even forgot to specify --cpu i386
. With that, I'm at around 70 seconds.
Something smells fishy.
Comparing 0.59r and 0.68 furher... decoding one of the dune tracks, where the old version needs 2.5 seconds and the new one over 6 seconds. Reports:
altwarentester@duron:~/mpg123-0.59r$ perf report # Samples: 15180 # # Overhead Command Shared Object Symbol # ........ ........... ............................. ...... # 45.76% mpg123 ./mpg123 [.] III_dequantize_sample 20.47% mpg123 ./mpg123 [.] dct64 15.01% mpg123 ./mpg123 [.] dct36 9.65% mpg123 ./mpg123 [.] do_layer3 3.41% mpg123 ./mpg123 [.] synth_1to1 1.98% mpg123 ./mpg123 [.] dct12 1.03% mpg123 /lib/i686/cmov/libc-2.11.2.so [.] memcpy
altwarentester@duron:~/mpg123-0.68$ perf report # Samples: 44859 # # Overhead Command Shared Object Symbol # ........ ........... ............................. ...... # 65.42% mpg123 ./src/mpg123 [.] synth_1to1_i386 15.60% mpg123 ./src/mpg123 [.] III_dequantize_sample 7.63% mpg123 ./src/mpg123 [.] dct64_i386 4.91% mpg123 ./src/mpg123 [.] dct36 3.07% mpg123 ./src/mpg123 [.] do_layer3 0.70% mpg123 ./src/mpg123 [.] dct12 0.34% mpg123 ./src/mpg123 [.] III_get_scale_factors_1 0.33% mpg123 [kernel] [k] __copy_to_user_ll 0.12% mpg123 [kernel] [k] generic_file_aio_read
I seriously borked up the synth function for this CPU. But remember: This is code that is not run in the initially considered case, only when I enforce avoidance of the optimized decoders. But it could be symptomatic.
Fun question: Does MPLayer with the i386 mp3lib decode correctly, like current mpg123? Yes it does. Result looks good. Computing time looks better.
Last resort of mpg123 without multi-cpu code: 0.65. Will that produce proper output? Will it be faster? It produces crap. Faster. Only mpg123-0.66 finally produces proper output with i386 decoder. But is that a generic property? Shouldn't I have noticed such breakage before? I don't see it mentioned in NEWS.
There it is:
------------------------------------------------------------------------ r577 | thor | 2007-02-22 03:04:23 +0100 (Do, 22. Feb 2007) | 4 Zeilen fixing i386 code with gcc-4.1 by using the old WRITE_SAMPLE again - the new one has distorted sound and gcc complains about strict aliasing ... the old WRITE_SAMPLE even seems to be faster with gcc-3.4! Of course I didn't (yet) test on my real i386DX40 if there's an advantage. Started mocking with cpuflags to get proper MMX and SSE detection. ------------------------------------------------------------------------Well, there it may be. Let's switch around
WRITE_SAMPLE
. And we got a change from
2566.163526 task-clock-msecs # 0.994 CPUs 21 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 300 page-faults # 0.000 M/sec 2047880985 cycles # 798.032 M/sec 2041781623 instructions # 0.997 IPC 798943574 cache-references # 311.338 M/sec 4049486 cache-misses # 1.578 M/sec 2.581971582 seconds time elapsedto
6272.432955 task-clock-msecs # 0.989 CPUs 51 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 300 page-faults # 0.000 M/sec 5006116015 cycles # 798.114 M/sec 4398072978 instructions # 0.879 IPC 1320889896 cache-references # 210.587 M/sec 4730705 cache-misses # 0.754 M/sec 6.343654290 seconds time elapsedVery well. Seems like I have to learn from MPlayer about fixing up the new
WRITE_SAMPLE
.
And yes, there is something in revision 7300; something about gcc 3.x being too clever for the double - int conversion trick. Applying the fix from MPlayer to mpg123 0.65 yields
5213.636059 task-clock-msecs # 0.985 CPUs 46 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 301 page-faults # 0.000 M/sec 4161133383 cycles # 798.125 M/sec 4203986268 instructions # 1.010 IPC 1302169739 cache-references # 249.762 M/sec 4555437 cache-misses # 0.874 M/sec 5.293784289 seconds time elapsed(and good-sounding output, of course).
Going back to the original test album, writing to /dev/null (the stuff before wrote to a real file to check the output):
35322.795994 task-clock-msecs # 0.996 CPUs 215 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 315 page-faults # 0.000 M/sec 28215905365 cycles # 798.802 M/sec 32624122756 instructions # 1.156 IPC 10324164674 cache-references # 292.281 M/sec 12229060 cache-misses # 0.346 M/sec 35.457957266 seconds time elapsedThat's not 70 seconds.
WRITE_SAMPLE
is a sore spot, but that it is that influential on the Duron is a surprise. I have to check on a more modern system, with a 32 bit install.
But where came the 70 seconds from anyway? The current mpg123 had only about 44 seconds. This CPU is sensitive. And don't forget: I observed the old WRITE_SAMPLE
even being faster back then!
Anyhow, 35 seconds is still above the 32 seconds mp3lib achieves. There is more to the generic/i386 C code. Stay tuned.
Found MPlayer rev. 28866. Reducing the size of bandInfoStruct bring that:
35303.691456 task-clock-msecs # 0.996 CPUs 215 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 314 page-faults # 0.000 M/sec 28199408110 cycles # 798.767 M/sec 32624433817 instructions # 1.157 IPC 10330702614 cache-references # 292.624 M/sec 14060449 cache-misses # 0.398 M/sec 35.430962884 seconds time elapsedIn short: nothing. The preceeding rev. 28865 has some
const
stuff. That might play a role. Then there is Zuxy's rev. 23485 that changed some static vars to automatic. Generally, he worked on SSE stuff, that's not of interest right now.
There is the fast_memcpy
business. I should check that. Rev. 22158 does buffer alignment.
Revision 18946 fixes something in the 3DNow DCT64 ... and makes we wonder what exactly is the relation of mpg123's and MPlayers plain 3DNow code. That is another matter to investigate.
For now, it seems that it is fruitful enough to pursue merging of apparent improvements in the generic/overall C code of mp3lib that we indeed still miss in mpg123.
Hacking the ``new'' WRITE_SAMPLE
into current mpg123 does yield a tiny improvement over the hacked 0.65, even.
altwarentester@duron:~/mpg123$ perf stat ./src/mpg123 -w /dev/null \ --cpu i386 /dev/shm/convergence_-_points_of_view/*.mp3 35030.997934 task-clock-msecs # 0.996 CPUs 225 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 333 page-faults # 0.000 M/sec 27983581759 cycles # 798.823 M/sec 33233941274 instructions # 1.188 IPC 10325844134 cache-references # 294.763 M/sec 11774779 cache-misses # 0.336 M/sec 35.176724921 seconds time elapsed
altwarentester@duron:~/mpg123-0.65$ perf stat ./src/mpg123 -w /dev/null \ /dev/shm/convergence_-_points_of_view/*.mp3 35215.369977 task-clock-msecs # 0.995 CPUs 226 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 314 page-faults # 0.000 M/sec 28126526481 cycles # 798.700 M/sec 32624128394 instructions # 1.160 IPC 10335841084 cache-references # 293.504 M/sec 12821386 cache-misses # 0.364 M/sec 35.387834336 seconds time elapsedToo bad that this doesn't fit the current framework of separated clipping and conversion too well. Seems like, at least for some CPUs (I really should wire up the i486, or even the i386...), this separation hurts. But then: We are talking about the i386 C code, which is not used by default from old Pentiums onwards. This is rather academic. The performance of the plain C code doesn't matter at all in practice, when there's nowadays a variant of Taihei's SSE code in use.
In MPlayer, this change improves the situation, but it is still 39 seconds against 32 seconds:
perf stat ./mplayer -ac mp3 -ao pcm:file=/dev/null -quiet \ /dev/shm/convergence_-_points_of_view/*.mp3 32049.354400 task-clock-msecs # 0.994 CPUs 249 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6293 page-faults # 0.000 M/sec 25589163911 cycles # 798.430 M/sec 28424067520 instructions # 1.111 IPC 9060506927 cache-references # 282.705 M/sec 12323035 cache-misses # 0.385 M/sec 32.251952962 seconds time elapsed
perf stat ./mplayer -ac mpg123 -ao pcm:file=/dev/null -quiet \ /dev/shm/convergence_-_points_of_view/*.mp3 39542.041678 task-clock-msecs # 0.994 CPUs 298 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6790 page-faults # 0.000 M/sec 31575386518 cycles # 798.527 M/sec 34341985738 instructions # 1.088 IPC 10817342380 cache-references # 273.566 M/sec 17943896 cache-misses # 0.454 M/sec 39.787800467 seconds time elapsed
But how does the picture of CPU usage per function look now? With mp3lib:
36.40% mplayer ./mplayer [.] synth_1to1 28.16% mplayer ./mplayer [.] III_dequantize_sample 10.59% mplayer ./mplayer [.] dct64_1 9.05% mplayer ./mplayer [.] dct36 4.23% mplayer ./mplayer [.] T.293 3.92% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpyWith libmpg123:
40.43% mplayer ./mplayer [.] INT123_synth_1to1_i386 22.31% mplayer ./mplayer [.] III_dequantize_sample 7.68% mplayer ./mplayer [.] INT123_dct36 6.73% mplayer ./mplayer [.] INT123_dct64_i386 6.07% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpy 3.51% mplayer ./mplayer [.] INT123_synth_ntom 3.35% mplayer ./mplayer [.] INT123_do_layer3 1.25% mplayer ./mplayer [.] feed_readThat's still quite some overhang for the synth function itself. And even if it's totally irrelevant in practice... I want to see how I messed up the synth function for Durons.
WRITE_SAMPLE
change. using i386 decoder.
I might be convinced to give it 2 % advantage.
It's harder to tell a difference when using the generic decoder, but there might be one. When building for -march=pentium
, both differences are around 1 to 1.5 %.
One would need some statistics for more definite answer, but the main point is: The effect, if any, is much less pronounced on the Core2Duo machine. Might be a reason for me not spotting anything wrong. The message behind this, of course, is that tuning things for the Duron should not hurt performance on newer machines (which use SSE optimization anyway, d'oh).
WRITE_SAMPLE
resulting in floating point stack messiness.
Seems like the summand for the fadd
is kept on the fp stack in mp3lib, but is grabbed from main stack repeatedly for mpg123. One indicator is the appearance of the fxch
instruction in mpg123's synth_1to1
function.
It is totally absent in mp3lib's version.
Now the question is how I can rearrange to code to avoid this. I am working on the 0.65 source to stay closer to mp3lib.
synth_1to1
being all-static.
That eases optimzation of (floating point) stack usage.
An interesting side note is that for the mpg123 code, the floating point stack messup seems to be pushed by enabling optimization at all.
But anyway, repackaging the stuff to have the direct calls to the synth functions together with the (then static) functions might do the trick.
For modern mpg123 this would mean packaging all synth functions and layer[123].c
files together with optimise.c
, perhaps frame.c
, into one compilation unit.
An indicator could be this: If there are public symbols that are not part of the API in a static libmpg123, those symbols should have been static and not public.
Of course, performance is only one possible reason to make these functions static: It is bad library design to have those symbols floating around and possibly colliding when someone uses the static library.
I tried this on the Core2Duo and, if anything, this makes things perhaps slightly worse. On the Duron the tendency to worsen the performance is also indicated. At least the more static build is about 0.5 K smaller. The effect of adapting Dan's patch for const huffman tables is bigger (both in performance and binary size). But still, this is not really big. I need to get to the core of the bad WRITE_SAMPLE translation. How does the compiler get the hint from MPlayer code? And... how about building with the Intel or open64 compilers for a change? Well, one bummer is that I have those installed for 64 bits, not 32.
On the Duron, the effect of the const tables is also quite small.
Just make sure I still know what I am talking about, mpg123-0.65-like-mplayer r3005, my old MPlayer checkout, both with i386 C code. Perf says for MPlayer:
32331.247743 task-clock-msecs # 0.994 CPUs 371 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6301 page-faults # 0.000 M/sec 25814521225 cycles # 798.439 M/sec 28424759003 instructions # 1.101 IPC 9068719643 cache-references # 280.494 M/sec 13063911 cache-misses # 0.404 M/sec 32.523356972 seconds time elapsed # Samples: 270005 # # Overhead Command Shared Object Symbol # ........ ........... ................................... ...... # 35.82% mplayer ./mplayer [.] synth_1to1 28.14% mplayer ./mplayer [.] III_dequantize_sample 10.57% mplayer ./mplayer [.] dct64_1 9.18% mplayer ./mplayer [.] dct36 4.19% mplayer ./mplayer [.] T.293 4.07% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpy 0.56% mplayer ./mplayer [.] fast_memcpyAnd for mpg123:
34845.338614 task-clock-msecs # 0.996 CPUs 377 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 245 page-faults # 0.000 M/sec 27835529165 cycles # 798.831 M/sec 32659069064 instructions # 1.173 IPC 10169107285 cache-references # 291.836 M/sec 8112060 cache-misses # 0.233 M/sec 34.982723275 seconds time elapsed # Samples: 292093 # # Overhead Command Shared Object Symbol # ........ ........... ................................... ...... # 51.58% mpg123 ./src/mpg123 [.] synth_1to1 24.78% mpg123 ./src/mpg123 [.] III_dequantize_sample 8.20% mpg123 ./src/mpg123 [.] dct64 7.84% mpg123 ./src/mpg123 [.] dct36 3.73% mpg123 ./src/mpg123 [.] do_layer3 0.52% mpg123 [kernel] [k] __copy_to_user_ll 0.31% mpg123 ./src/mpg123 [.] dct12 0.25% mpg123 ./src/mpg123 [.] III_get_scale_factors_1 0.17% mpg123 /lib/i686/cmov/libc-2.11.2.so [.] __mempcpyIt is still apparent that synth of mpg123 needs more time. 51.6% of 35 seconds, that is, 18 seconds. For MPlayer, that's 35.8% of 32.5 secods. 11.6 seconds. Let's summarize that.
function | time MPlayer / s | time mpg123 / s |
synth_1to1 | 11.64 | 18.04 |
III_dequantize_sample | 9.15 | 8.67 |
dct64 | 3.44 | 2.87 |
dct36 | 2.99 | 2.74 |
T.293 / do_layer3 | 1.36 | 1.30 |
That does look interesting. If only the synth function were faster, mpg123 would easily beat mp3lib. All other main functions are a tad quicker, even, than the mp3lib counterparts!
The perf annotate
ist not too helpful in placing the blame on certain code portions, but I know already what the main difference is. Or do I?
The annotations show huge runtime hit for the faddp instruction in mpg123, where it doesn't really count in MPlayer, at the sample places.
The output of perf annotate
is confusing. The association of code and assembly seems rather whacky. But first, what is that?
: for (j=16;j;j--,b0+=0x10,window+=0x20,samples+=step) 0.14 : 806ca36: 83 c3 40 add $0x40,%ebx 0.64 : 806ca39: 83 c2 40 add $0x40,%edx 0.01 : 806ca3c: 83 e9 80 sub $0xffffff80,%ecx 0.22 : 806ca3f: 83 c6 04 add $0x4,%esi 1.11 : 806ca42: 81 fb 00 04 00 00 cmp $0x400,%ebx 0.00 : 806ca48: 0f 84 22 01 00 00 je 806cb70 <synth_1t 0.10 : 806ca4e: dd d8 fstp %st(0) 0.83 : 806ca50: dd d8 fstp %st(0) 0.12 : 806ca52: dd d8 fstp %st(0) 0.02 : 806ca54: dd d8 fstp %st(0) 0.47 : 806ca56: dd d8 fstp %st(0) 0.09 : 806ca58: dd d8 fstp %st(0)Such floating point stack messing around is not to be found, in MPlayer. Then, the detail:
: sum += window[0x4] * b0[0x4]; 1.18 : 82478f2: d9 42 20 flds 0x20(%edx) 0.60 : 82478f5: d8 49 20 fmuls 0x20(%ecx) 0.11 : 82478f8: de c1 faddp %st,%st(1) : sum -= window[0x5] * b0[0x5]; 0.71 : 82478fa: d9 42 28 flds 0x28(%edx) 0.46 : 82478fd: d8 49 28 fmuls 0x28(%ecx) 0.17 : 8247900: de c1 faddp %st,%st(1)And a similar portion from mpg123:
: sum += window[0x4] * b0[0x4]; 0.36 : 806caca: d9 44 24 34 flds 0x34(%esp) 0.00 : 806cace: d8 49 20 fmuls 0x20(%ecx) 1.90 : 806cad1: de c7 faddp %st,%st(7) : sum -= window[0x5] * b0[0x5]; 0.35 : 806cad3: d9 44 24 2c flds 0x2c(%esp) 0.00 : 806cad7: d8 49 28 fmuls 0x28(%ecx) 2.09 : 806cada: de c7 faddp %st,%st(7)The exact attribution to one instruction might be tough here, but it is apparent that very similar pieces of code perform consistently different. The location on the floating point stack is different. And, hm, why are array offsets different? This pattern is carried through: The MPlayer code has matching offsets on the load and multiplication operations, especially matching the array offsets in the C code. In mpg123, that is different.
Now, that leaves the strange decisions of the compiler on the table. It does weird stuff with the floating point sums. Generally, the floating point code is messy. What's the magic compiler switch to change that?
I added some disassembly and the corresponding source to SVN, the public may have a peek at http://mpg123.orgis.org/cgi-bin/scm/mpg123/notes/beating_mp3lib_in_mplayer_material/.
Separating out the synth function in the MPlayer code into its own object file (not static, obviously) yields binary code very similar to what gcc produces for mpg123.
119,121c170,172 < 93: d9 41 20 flds 0x20(%ecx) < 96: d8 4a 20 fmuls 0x20(%edx) < 99: de c2 faddp %st,%st(2) --- > 109: d9 44 24 28 flds 0x28(%esp) > 10d: d8 4a 20 fmuls 0x20(%edx) > 110: de c7 faddp %st,%st(7) 123,125c174,176 < 9b: d9 41 28 flds 0x28(%ecx) < 9e: d8 4a 28 fmuls 0x28(%edx) < a1: de c2 faddp %st,%st(2) --- > 112: d9 44 24 20 flds 0x20(%esp) > 116: d8 4a 28 fmuls 0x28(%edx) > 119: de c7 faddp %st,%st(7)Now, as I said, a function call messing with stack/register handling is not entirely spurprising. But, still, I should wrap my head around what exactly happens here. Especially, since including static dct64 didn't seem to help mpg123. But what does it do here? Let's try.
Gah! Static! I forgot that there are static variables in the synth function (there aren't anymore for current mpg123). I simply defined the static keyword to nothing for the separated object, only thinking about the static function, not static variables. Does the situation change when I just remove the static keyword from the function?
156,158c223,225 < b2: d9 42 20 flds 0x20(%edx) < b5: d8 49 20 fmuls 0x20(%ecx) < b8: de c1 faddp %st,%st(1) --- > 144: d9 44 24 34 flds 0x34(%esp) > 148: d8 49 20 fmuls 0x20(%ecx) > 14b: de c7 faddp %st,%st(7) 160,162c227,229 < ba: d9 42 28 flds 0x28(%edx) < bd: d8 49 28 fmuls 0x28(%ecx) < c0: de c1 faddp %st,%st(1) --- > 14d: d9 44 24 2c flds 0x2c(%esp) > 151: d8 49 28 fmuls 0x28(%ecx) > 154: de c7 faddp %st,%st(7)OK, that's still the same, basically. Only the stack location changes. Now, how does the stuff land on the stack anyways?
This is creepy. I hacked around with the mp3lib code for a while, making functions non-static, cleaning away distracting parts... and then this: The optional optimized synth function is disabled at runtime by my hacking, but the call was still present in the synth.
if ( synth_func ) { // printf("Calling %p, bandPtr=%p channel=%d samples=%p\n",synth_func,bandPtr,channel,samples); // FIXME: synth_func() may destroy EBP, don't rely on stack contents!!! return (*synth_func)( bandPtr,channel,samples); }Now, if I remove that function call,
if ( synth_func ) { // printf("Calling %p, bandPtr=%p channel=%d samples=%p\n",synth_func,bandPtr,channel,samples); // FIXME: synth_func() may destroy EBP, don't rely on stack contents!!! /* return (*synth_func)( bandPtr,channel,samples); */ }I suddenly get my bad code delivered! Let's check the actual performance... Yup, with the call present (but inactive), we have the usual 32 seconds, with the call removed, it's 40 seconds.
OK... so I see now how I could fix the generic decoding to be better. Via some nasty trick I might convince gcc to produce the correct output for mpg123; or, I even capture the good assembly and have the i386 code as assembly optimization, which is a bit crazy, given that there is the i586 optimization anyway.
But first, let's verify the concept, I added this little fake function someplace else than decode_i386.c (so that the symbol exists and the function call code actually is generated for lack of link-time inlining).
int synth_1to1_fake(real *bandPtr,int channel,unsigned char *out,int *pnt) { return 0; }Then, the bogus fake call gets added to the synth.
int moody_switch = 0; DECODE_SCOPE int synth_1to1(real *bandPtr,int channel,unsigned char *out,int *pnt) { #ifndef PENTIUM_OPT static real buffs[2][2][0x110]; static const int step = 2; static int bo = 1; short *samples = (short *) (out + *pnt); real *b0,(*buf)[0x110]; int clip = 0; int bo1; *pnt += 128; /* moved that here as MPlayer has it */ #endif if(moody_switch) return synth_1to1_fake(bandPtr, channel, out, pnt);Well, the result...
28522.600345 task-clock-msecs # 0.996 CPUs 302 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 246 page-faults # 0.000 M/sec 22784727766 cycles # 798.831 M/sec 27729388895 instructions # 1.217 IPC 9218523630 cache-references # 323.201 M/sec 7898355 cache-misses # 0.277 M/sec 28.650573967 seconds time elapsed
Nifty, eh? Now I have a grip on how to hack utmost speed out of the generic C code on x86 with gcc. That is an interesting point, but it is not the main matter of this investigation. Let's look at what the binding of libmpg123 to mplayer does to hurt performance by 30%.
Pack your things and go, mp3lib, you're owned! Well, owned by plain mpg123, that is. Libmpg123 still has some way to go.
Without the hands tied behind its back, mpg123 is faster than MPlayer (using 3DNowExt optimizations). But the binding of libmpg123 to MPlayer brings great suffering. Let's investigate that.
MPlayer's mp3lib has such a profile,
# Samples: 217361 # # Overhead Command Shared Object Symbol # ........ ........... ................................... ...... # 34.08% mplayer ./mplayer [.] III_dequantize_sample 17.89% mplayer ./mplayer [.] synth_1to1_MMX 16.40% mplayer ./mplayer [.] dct36_3dnowex 11.61% mplayer ./mplayer [.] dct64_MMX_3dnowex 5.27% mplayer ./mplayer [.] T.293 4.85% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpywhile with libmpg123 usage, it is
# Samples: 248096 # # Overhead Command Shared Object Symbol # ........ ........... ................................... ...... # 30.14% mplayer ./mplayer [.] III_dequantize_sample 26.41% mplayer ./mplayer [.] 0x000000009470c1 9.41% mplayer ./mplayer [.] 0x00000000950941 8.04% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpy 7.30% mplayer ./mplayer [.] INT123_synth_ntom 4.62% mplayer ./mplayer [.] INT123_do_layer3 1.69% mplayer ./mplayer [.] feed_read 1.15% mplayer ./mplayer [.] INT123_dct64 0.97% mplayer ./mplayer [.] INT123_synth_1to1_3dnowextWait. What?! Ntom? And the actual synth with 0.97%? That is fishy. The figures may be right, but the names aren't. Perhaps the static link messes things up a bit (but why should it?). Let's link to the system libmpg123.
# Samples: 225783 # # Overhead Command Shared Object Symbol # ........ ........... ................................... ...... # 82.32% mplayer /usr/lib/libmpg123.so.0.25.1 [.] 0x00000000028247 7.92% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpy 1.00% mplayer /usr/lib/libmpg123.so.0.25.1 [.] 0x000000000147c4 0.65% mplayer [kernel] [k] __copy_to_user_ll 0.53% mplayer /lib/i686/cmov/libc-2.11.2.so [.] 0x000000000705dd 0.31% mplayer ./mplayer [.] avpriv_ac3_parse_header 0.31% mplayer /lib/i686/cmov/libc-2.11.2.so [.] __mempcpy 0.28% mplayer ./mplayer [.] ac3_eac3_probe 0.23% mplayer ./mplayer [.] demux_audio_fill_buffer 0.19% mplayer /usr/lib/libmpg123.so.0.25.1 [.] mpg123_decodeWhut? Get out already. What is that? Ok, fresh build, dynamic link to local libmpg123.
# Samples: 245675 # # Overhead Command Shared Object Symbol # ........ ........... ............................ ...... # 36.09% mplayer .libs/libmpg123.so.0.33.1 [.] 0x00000000043685 31.83% mplayer .libs/libmpg123.so.0.33.1 [.] III_dequantize_sample 8.23% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpy 4.86% mplayer .libs/libmpg123.so.0.33.1 [.] INT123_synth_ntom 4.65% mplayer .libs/libmpg123.so.0.33.1 [.] INT123_do_layer3 1.67% mplayer .libs/libmpg123.so.0.33.1 [.] feed_read 1.25% mplayer .libs/libmpg123.so.0.33.1 [.] INT123_dct64 1.06% mplayer .libs/libmpg123.so.0.33.1 [.] INT123_synth_1to1_3dnowextWtf? First: This needs more time than system mpg123, second: Those symbols can't be right.
Checking this... Heck, the NtoM synth is called! And, know what? It's because of a bug in the libmpg123 parser in feeder mode (I really should not have added this contraption): It did not seek back 4 bytes if the ahead check past the first frame does not work. This pecuilar bug is triggered because MPlayer also parses the MPEG data and manages to give exactly one frame of data at the beginning.
Now, for some more sensible numbers and names...
27269.086270 task-clock-msecs # 0.993 CPUs 322 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6858 page-faults # 0.000 M/sec 21765861127 cycles # 798.188 M/sec 20944822745 instructions # 0.962 IPC 8198643751 cache-references # 300.657 M/sec 16856742 cache-misses # 0.618 M/sec 27.465041648 seconds time elapsed # Samples: 228283 # # Overhead Command Shared Object Symbol # ........ ........... ............................. ...... # 34.38% mplayer .libs/libmpg123.so.0.33.1 [.] III_dequantize_sample 30.03% mplayer .libs/libmpg123.so.0.33.1 [.] 0x00000000043e7d 10.94% mplayer .libs/libmpg123.so.0.33.1 [.] 0x00000000042e2d 7.84% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpy 4.97% mplayer .libs/libmpg123.so.0.33.1 [.] INT123_do_layer3 1.23% mplayer .libs/libmpg123.so.0.33.1 [.] INT123_synth_1to1_3dnowext 0.66% mplayer .libs/libmpg123.so.0.33.1 [.] III_get_scale_factors_1 0.63% mplayer [kernel] [k] __copy_to_user_llNow that looks better, but only covers a bug I must have introduced recently (yes, revision 2935 must've been it). Those 27.5 seconds are still rather bad compared to plain mpg123:
24365.715653 task-clock-msecs # 0.991 CPUs 372 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 7600 page-faults # 0.000 M/sec 19462031681 cycles # 798.747 M/sec 20804412868 instructions # 1.069 IPC 8193661424 cache-references # 336.278 M/sec 13213580 cache-misses # 0.542 M/sec 24.576109613 seconds time elapsed # Samples: 200099 # # Overhead Command Shared Object Symbol # ........ ........... ............................. ...... # 46.39% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] 0x000000000413ac 38.52% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] III_dequantize_sample 5.81% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] INT123_do_layer3 1.45% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] INT123_synth_1to1_3dnowext 0.78% lt-mpg123 [kernel] [k] __copy_to_user_ll 0.74% lt-mpg123 /lib/i686/cmov/libc-2.11.2.so [.] __mempcpy 0.66% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] III_get_scale_factors_1I must assume that this first block encompasses all synth functions combined (asm aynth and the DCTs).
How is the absolute comparison to mp3lib, per function? First the raw data again...
26190.932710 task-clock-msecs # 0.993 CPUs 333 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 6307 page-faults # 0.000 M/sec 20908670446 cycles # 798.317 M/sec 20816640224 instructions # 0.996 IPC 7871360378 cache-references # 300.538 M/sec 11813272 cache-misses # 0.451 M/sec 26.370606238 seconds time elapsed # Samples: 217594 # # Overhead Command Shared Object Symbol # ........ ........... ................................... ...... # 34.25% mplayer ./mplayer [.] III_dequantize_sample 17.89% mplayer ./mplayer [.] synth_1to1_MMX 16.45% mplayer ./mplayer [.] dct36_3dnowex 11.35% mplayer ./mplayer [.] dct64_MMX_3dnowex 5.32% mplayer ./mplayer [.] T.293 4.83% mplayer /lib/i686/cmov/libc-2.11.2.so [.] memcpy 0.65% mplayer [kernel] [k] __copy_to_user_ll 0.59% mplayer ./mplayer [.] synth_1to1
As a bonus, I compare also with good old mpg123 0.65 (the hacked i386 code). It is eerily close.
28434.455457 task-clock-msecs # 0.996 CPUs 298 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 246 page-faults # 0.000 M/sec 22713949581 cycles # 798.818 M/sec 27729289453 instructions # 1.221 IPC 9204425341 cache-references # 323.707 M/sec 7354561 cache-misses # 0.259 M/sec 28.561796156 seconds time elapsed # Samples: 234591 # # Overhead Command Shared Object Symbol # ........ ........... ............................. ...... # 40.85% mpg123 ./src/mpg123 [.] synth_1to1 30.31% mpg123 ./src/mpg123 [.] III_dequantize_sample 10.18% mpg123 ./src/mpg123 [.] dct64 9.77% mpg123 ./src/mpg123 [.] dct36 4.62% mpg123 ./src/mpg123 [.] do_layer3 0.64% mpg123 [kernel] [k] __copy_to_user_ll 0.34% mpg123 ./src/mpg123 [.] dct12 0.32% mpg123 ./src/mpg123 [.] III_get_scale_factors_1 0.23% mpg123 /lib/i686/cmov/libc-2.11.2.so [.] __mempcpyI get just that bit of time for all that assembly code? SIMD and all?
For mpg123, the individual asm functions cannot be separated for some reason, so let's combine the numbers:
time | time | time | time | |
MPlayer | MPlayer | mpg123 | mpg123 | |
function | +mp3lib / s | +libmpg123 / s | trunk / s | 0.65-i386 / s |
synth & DCT | 12.20 | 11.59 | 11.75 | 17.36 |
III_dequantize_sample | 9.03 | 9.44 | 9.46 | 8.66 |
T.293 / do_layer3 | 1.40 | 1.36 | 1.43 | 1.31 |
memcpy+to_user | 1.31 | 2.33 | 0.37 | 0.25 |
sum of above | 23.94 | 24.72 | 23.01 | 27.58 |
diff to mpg123 | 0.93 | 1.71 | 0 | 4.57 |
total runtime diff | 1.79 | 2.89 | 0 | 3.98 |
A more subtle point is to be made about the assembly synths in mpg123 being a bit faster (yay, despite criticism about inefficient stack handling;-), while, again, MPlayer wins on the efficiency of GCC-generated code. They managed to befriend to optimizer somehow.
Since I like to explore to the sidess before going for the bulls eye, I am going to have a look at what exactly is different in the sample dequantization. This comparison is messy. Looking at that huge slab of code with CPU time distributed all over the place, but somewhat prominent on code portions that are rather identical between mp3lib and mpg123, I think it is not worth it no nail down the optimizer here. That half of a second can come and go.
Well, but what could be worth it is to investigate what the heck happened since mpg123 0.65 here. I am comparing to the plain C code for i386 (well, boosted a bit by the recent improvements / hack for GCC optimizer)! I see the 3DNowExt assembly shaving off 5.61 seconds from the synth stuff, but here, 0.8 seconds are lost in the dequantization again. Eh, I have to take a step back. I just glossed over half a second compared to mp3lib, I can ignore 0.8 seconds, too. Really, I have seen what GCC is capable of with the rather simple synth code. I will not start a battle about 0.8 seconds in III_dequantize_sample. I need some lifetime left. Let's be happy that in the sum, 14% of decoding performance is gained from 3DNowExt.
What spooks me here, is that the plain 3DNow code makes mpg123 even slower than the beefed-up generic code:
29208.504107 task-clock-msecs # 0.995 CPUs 389 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 4151 page-faults # 0.000 M/sec 23331202619 cycles # 798.781 M/sec 26197929267 instructions # 1.123 IPC 9571621109 cache-references # 327.700 M/sec 13034794 cache-misses # 0.446 M/sec 29.341045891 seconds time elapsed # Samples: 248648 # # Overhead Command Shared Object Symbol # ........ ........... ............................. ...... # 31.26% lt-mpg123 ./libs/libmpg123.so.0.33.1 [.] III_dequantize_sample 30.25% lt-mpg123 ./libs/libmpg123.so.0.33.1 [.] 0x00000000041684 26.01% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] 0x00000000042611 4.54% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] INT123_do_layer3 0.89% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] INT123_synth_1to1_3dnow 0.73% lt-mpg123 /lib/i686/cmov/libc-2.11.2.so [.] __mempcpy 0.64% lt-mpg123 [kernel] [k] __copy_to_user_ll 0.64% lt-mpg123 .libs/libmpg123.so.0.33.1 [.] III_get_scale_factors_1That makes 9.17 seconds for dequantization, 16.77 seconds for combined synth, that is a bit more and a bit less than the i386 C code. All in all, one can say that any benefit from the assembly marginal, and, depending on the mood of the compiler, not worth anything. I feel that I could go down that road, a long way down, down into the abyss, but let's avoid that. The fastest variant on this machine is what should be the fastest. Let's make MPlayer+libmpg123 live up to that.
I/O variant | packet (memcpy) | demux_read (memcpy) |
feeder | 27.50 s (2.14 s) | 27.54 s (2.28 s) |
callback | 27.35 s (1.98 s) | 27.58 s (2.02 s) |
callback+seekbuffer | 27.79 s (2.31 s) | 27.52 s (2.33 s) |
Focus on the bigger picture: There is more to it than the input buffering with libmpg123. Or, hm... have a look at the call graph for the packet feeder, compared to plain mpg123 and mp3lib.
So, the rare memcpy calls in plain mpg123 are rooted in the WAV writing, not decoding. With mp3lib, the call graph doesn't give sensible information as to where the calls orginate. 99.60% from (nil). Great. What is always apparent, is the presence of those fast_memcpy() calls. They amount to 0.46% instead compared to 4.84% for the libc one. They don't occur in the call graph for libmpg123 usage from MPlayer. There, at least, is a sensible trace of the plain memcpy() calls:
8.02% mplayer-feeder- /lib/i686/cmov/libc-2.11.2.so [.] memcpy | |--58.67%-- 0x10000 | |--41.15%-- decode_a_bit --0.18%-- [...]At least half of them can be accounted for. Not surprisingly, that's the feeding of data via mpg123_decode(). Half of the good two seconds spent in memcpy would already suffice to get on par with mp3lib...
But, since there is not one second difference from the feeder to the plain callback ... I guess demux_read_data() also employs simple memcpy for the same task. Hm, so... Wait! I wanted to look into that anyway, and I see confirmation in the call graph for the callback variant: There, it's mpg123_read() and decode_audio(). Yes, we come to the point where I will ponder variants to get the audio data out of libmpg123.
Well, there is mpg123_replace_reader() that should help to prevent copies, but after hacking that in, I reduced the memcpy usage, all right, but the total time didn't go down. Compare the callback variant with packets: 27.65 seconds runtime, of that 1.24 seconds memcpy(). This should be half a second faster, but it's third of a second slower!
I get that down to 27.52 s by lying about buffer sizes... that really should match what mp3lib does (directly decoding to a 4608-byte caller-provided buffer), but it doesn't. I got 9.49 s for the dequantization, 12.32 s for the synth stuff, 1.41 for do_layer and 1.24 s for memcpy. So there is no net gain at all, only more messy code.
This very much looks like a dead end here. Let's do a comparison with linking to libmpg123 statically. Interesting, that gets it down to 26.80 s. That's about 0.4 seconds more than mp3lib.
Switching back to plain normal feeder mode with packets produces 27.65 s. Activating framewise work again... that's 27.03 s. The whole deal, callback, packets, seekbuffer, framewise, is at 27.01 s. Am I being ridiculous? Well, it's close.
Well, working around memcpy is not the total breakthrough... I still have the option to use a faster memcpy in libmpg123. What about copying as 32 bit integers for a simple start? Nah, that naive approach does not improve things. So glibc memcpy is not that dumb.
Closest I can reach is 26.69 s via callback and framewise, without packets, seekbuffer and mean bitrate calculation.
-march=athlon
), there, libmpg123 can only come close, but never overtake.
But this is with this certain setup / tool chain. Who knows what gcc 4.6 would do?
Native binaries on Duron and K6-III+ ... and my Laptop Core2Duo at 800 MHz, 64 bit mode (on the K6 it's user time measured by the time
command, the others via perf).
build | Duron | K6-III+ | Core 2 |
variant1-dynamic (1.12.1) | 27.74 s | 57.46 s | |
variant1-dynamic (1.13.4) | 16.72 s | ||
variant1-dynamic (trunk) | 27.81 s | 57.05 s | 16.87 s |
variant2-dynamic | 27.10 s | 55.92 s | 16.54 s |
variant2-static | 27.01 s | 53.92 s | 16.51 s |
mp3lib | 26.45 s | 56.47 s | 18.56 s |
ffmp3 | 73.52 s | 115.08 s | 22.00 s |
ffmp3float | 30.74 s | 118.19 s | 12.26 s |
variant2-float | 18.76 s |
Here, the net result is that I can be 4.5% faster on the K6-III+ and down to 2.1% slower on the Duron (from 5.1%). Ivan had 8% loss (user time, 53.4 s agains 49.4 s). Oh, and the final comparison on the K6-III+ using mplayer's own benchmark mechanism:
altwarentester@k6-3:~/mplayer$ for i in mpg123 mp3; do ./mplayer-variant2-static -loop 5 \ -ao pcm:fast:file=/dev/null -benchmark -quiet -ac mp3 \ /dev/shm/convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3 \ 2>&1 | grep BENCH; echo "$i ^^^^^"; done BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.653s Sys: 0.051s = 6.704s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2399% Sys: 0.7601% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.659s Sys: 0.053s = 6.712s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2089% Sys: 0.7911% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.657s Sys: 0.052s = 6.708s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2291% Sys: 0.7709% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.685s Sys: 0.052s = 6.737s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2267% Sys: 0.7733% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.653s Sys: 0.050s = 6.703s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2524% Sys: 0.7476% = 100.0000% mpg123 ^^^^^ BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.600s Sys: 0.055s = 6.655s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.1722% Sys: 0.8278% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.592s Sys: 0.048s = 6.640s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2755% Sys: 0.7245% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.600s Sys: 0.051s = 6.650s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2378% Sys: 0.7622% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.604s Sys: 0.052s = 6.656s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2248% Sys: 0.7752% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.596s Sys: 0.050s = 6.646s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.2493% Sys: 0.7507% = 100.0000% mp3 ^^^^^Damn, that's a mean user time of 6.66 s against 6.59 s for mp3lib. How subtle is that? What is the equivalent timing of the same track 10 times using my method?
altwarentester@k6-3:~/mplayer$ for i in mpg123 mp3 do time ./mplayer-variant2-static -loop 5 -ao pcm:fast:file=/dev/null \ -quiet -ac mp3 \ /dev/shm/convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3 \ 2>&1 > /dev/null; echo "$i ^^^^^"; done real 0m34.197s user 0m33.498s sys 0m0.636s mpg123 ^^^^^ real 0m34.145s user 0m33.446s sys 0m0.656s mp3 ^^^^^Oh, funny. See what I did there? One should actually switch decoders. But still, interesting to see that I was about to diagnose a performance difference of 1% between mp3lib and itself. This is the proper measurement:
altwarentester@k6-3:~/mplayer$ for i in mpg123 mp3; do ./mplayer-variant2-static\ -loop 5 -ao pcm:fast:file=/dev/null -benchmark -quiet -ac $i \ /dev/shm/convergence_-_points_of_view/10\ -\ Train\ to\ leave.mp3 \ 2>&1 | grep BENCH; echo "$i ^^^^^"; done BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.393s Sys: 0.067s = 6.460s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 98.9636% Sys: 1.0364% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.314s Sys: 0.061s = 6.375s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.0432% Sys: 0.9568% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.327s Sys: 0.067s = 6.394s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 98.9583% Sys: 1.0417% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.316s Sys: 0.060s = 6.376s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.0572% Sys: 0.9428% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.323s Sys: 0.064s = 6.387s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.0008% Sys: 0.9992% = 100.0000% mpg123 ^^^^^ BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.682s Sys: 0.057s = 6.739s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.1586% Sys: 0.8414% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.693s Sys: 0.057s = 6.750s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.1543% Sys: 0.8457% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.671s Sys: 0.055s = 6.727s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.1761% Sys: 0.8239% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.671s Sys: 0.058s = 6.729s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.1383% Sys: 0.8617% = 100.0000% BENCHMARKs: VC: 0.000s VO: 0.000s A: 6.670s Sys: 0.055s = 6.725s BENCHMARK%: VC: 0.0000% VO: 0.0000% A: 99.1802% Sys: 0.8198% = 100.0000% mp3 ^^^^^That's a mean of 6.33 s for libmpg123 against 6.68 s for mp3lib, a decrease in decoding time of 5.2%. Of course, we got quite some variation here and I'm too lazy to compute the standard deviation, but it is apparent that mpg123 has a consistent lead here.
It is suspicious that ffmp3float is not consistent. Repeated runs show differing numbers. Most interestingly such a run:
10817,626571 task-clock # 0,049 CPUs utilized 16.540 context-switches # 0,002 M/sec 19 CPU-migrations # 0,000 M/sec 3.377 page-faults # 0,000 M/sec 8.544.770.425 cycles # 0,790 GHz <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 8.038.105.826 instructions # 0,94 insns per cycle 1.302.988.705 branches # 120,451 M/sec 41.415.237 branch-misses # 3,18% of all branches 222,249003389 seconds time elapsedWell, how do you come from 12 seconds to 222? Easy, you actually play in realtime.
MPlayer SVN-r34365-4.6.1 (C) 2000-2011 MPlayer Team Playing pcm:file=/dev/null. File not found: 'pcm:file=/dev/null' Failed to open pcm:file=/dev/null. Playing /dev/shm/convergence_-_points_of_view/03 - Listen.mp3.This looks rather corrupt from the beginning. The command line is always the same, the interpretation differs.
Valgrind is no help with my MPlayer build:
valgrind: m_debuginfo/readdwarf.c:2162 (copy_convert_CfiExpr_tree): Assertion 'srcix >= 0 && srcix < VG_(sizeXA)(srcxa)' failed.Well, I'm not here for debugging ffmp3float. Perhaps that's a bug fixed since I did my checkout of MPlayer / ffmpeg.
Let me summarize the results. Where did I start, where did I end up? This is the approximate picture here, inside MPlayer:
run | user time K6-III+ / s | user time Duron / s | ratio K6 / Duron |
run | new (old) | new (old) / s | new (old) |
mplayer-mp3lib | 56.5 (56.5) | 26.5 (25.4) | 2.13 (2.24) |
mplayer-mpg123 | 53.9 (56.5) | 27.0 (28.7) | 2.00 (2.33) |
Let's automate the whole thing. I need to do builds of MPlayer with all three variants of ad_mpg123
and both libmpg123 versions, the latter static, too.
To reiterate, the variants are:
I'll use extra CFLAGS of '-march=k6-3 -mtune=k6-3' for MPlayer, '-march=k6-3 -mtune=k6-3 -O3' for mpg123. For the latter, I have to see if the builtin default set of flags is really a good idea. Has -funroll-all-loops
proven to be good? Oh, I see something. I wonder if omitting frame pointers interferes with debugging on i386 and explains the difficulties I had in the profiles? But for plain MPlayer that worked fine. Ah
To be fair, let's use the same set of CFLAGS for the programs, MPlayer does this: -O4 -march=native -mtune=native -pipe -ffast-math -fomit-frame-pointer -fno-tree-vectorize
, I add -march=k6-3 -mtune=k6-3
to make sure the build works on the K6-3 (I'll change that to athlon for the Duron). Let's use the same flags for both:
export CFLAGS="-O4 -march=native -mtune=native -pipe -ffast-math -fomit-frame-pointer -fno-tree-vectorize" cd mplayer && ./configure && make && cd .. cd mpg123 && ./configure --with-optimization=0 && make
Well, here is a full script to collect some reliable data on the two machines:
#!/bin/sh # You check out MPlayer and mpg123 sources before running this script. mplayersrc=$HOME/mplayer mpg123src=$HOME/mpg123 # number of runs for each configuration, # to be used for statistics in postprocessing runs=20 # also place the old and new variant of ad_mpg123 in there ad_mpg123=$mplayersrc/libmpcodecs/ad_mpg123.c if test -f $ad_mpg123.old && test -f $ad_mpg123.new; then echo "good, ad_mpg123 sources exist" else echo "missing ad_mpg123 variants" exit 1 fi host=`hostname` if test $host = duron; then arch=athlon elif test $host = k6-3; then arch=k6-3 else arch=native fi export CFLAGS="-O4 -march=$arch -mtune=$arch -pipe -ffast-math -fomit-frame-pointer -fno-tree-vectorize" mpg123prefix=$HOME/mpg123-$arch # prepare builds rm -rf $mpg123prefix && cd $mpg123src && make clean && ./configure --prefix=$mpg123prefix --with-optimization=0 --enable-static && make && make install && cd $mplayersrc && make clean && # prepare a first build, then hack around the libmpg123 building/linking ./configure && make && # add possible CPPFLAGS, like -I$mpg123prefix/include build_ad_mpg123() { cc -MD -MP -Wundef -Wall -Wno-switch -Wno-parentheses \ -Wpointer-arith -Wredundant-decls -Wstrict-prototypes -Wmissing-prototypes \ -Wdisabled-optimization -Wno-pointer-sign -Wdeclaration-after-statement -std=gnu99 \ -Werror-implicit-function-declaration \ $CFLAGS "$@" \ -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE \ -Ilibdvdread4 -I. -Iffmpeg -D_REENTRANT \ -c -o libmpcodecs/ad_mpg123.o libmpcodecs/ad_mpg123.c } # only argument: the part that links mpg123 # so, -lmpg123 in the simplest case link_mplayer() { # Why is there -ffast-math in the linker command? cc -o mplayer command.o m_property.o mixer.o mp_fifo.o \ mplayer.o parser-mpcmd.o pnm_loader.o input/input.o \ libao2/ao_mpegpes.o libao2/ao_null.o libao2/ao_pcm.o \ libao2/audio_out.o libvo/aspect.o libvo/geometry.o \ libvo/video_out.o libvo/vo_mpegpes.o libvo/vo_null.o \ sub/spuenc.o libao2/ao_alsa.o input/appleir.o libvo/vo_fbdev.o \ libvo/vo_fbdev2.o libvo/vo_png.o libvo/vo_md5sum.o \ udp_sync.o libao2/ao_oss.o libvo/vo_pnm.o libvo/vo_tga.o \ libvo/vo_v4l2.o libao2/ao_v4l2.o libvo/vo_cvidix.o \ libvo/vosub_vidix.o vidix/vidix.o vidix/drivers.o vidix/dha.o \ vidix/mtrr.o vidix/pci.o vidix/pci_names.o vidix/pci_dev_ids.o \ vidix/cyberblade_vid.o vidix/mach64_vid.o vidix/mga_vid.o \ vidix/mga_crtc2_vid.o vidix/nvidia_vid.o vidix/pm2_vid.o \ vidix/pm3_vid.o vidix/radeon_vid.o vidix/rage128_vid.o \ vidix/s3_vid.o vidix/sis_vid.o vidix/sis_bridge.o \ vidix/unichrome_vid.o libvo/vo_yuv4mpeg.o asxparser.o \ bstr.o codec-cfg.o cpudetect.o edl.o fmt-conversion.o \ m_config.o m_option.o m_struct.o mp_msg.o mp_strings.o \ mpcommon.o parser-cfg.o path.o playtree.o playtreeparser.o \ subopt-helper.o libaf/af.o libaf/af_center.o libaf/af_channels.o \ libaf/af_comp.o libaf/af_delay.o libaf/af_dummy.o libaf/af_equalizer.o \ libaf/af_extrastereo.o libaf/af_format.o libaf/af_gate.o libaf/af_hrtf.o \ libaf/af_karaoke.o libaf/af_pan.o libaf/af_resample.o libaf/af_scaletempo.o \ libaf/af_sinesuppress.o libaf/af_stats.o libaf/af_sub.o libaf/af_surround.o \ libaf/af_sweep.o libaf/af_tools.o libaf/af_volnorm.o libaf/af_volume.o \ libaf/filter.o libaf/format.o libaf/reorder_ch.o libaf/window.o \ libmpcodecs/ad.o libmpcodecs/ad_alaw.o libmpcodecs/ad_dk3adpcm.o \ libmpcodecs/ad_dvdpcm.o libmpcodecs/ad_hwac3.o libmpcodecs/ad_hwmpa.o \ libmpcodecs/ad_imaadpcm.o libmpcodecs/ad_msadpcm.o libmpcodecs/ad_pcm.o \ libmpcodecs/dec_audio.o libmpcodecs/dec_teletext.o libmpcodecs/dec_video.o \ libmpcodecs/img_format.o libmpcodecs/mp_image.o \ libmpcodecs/pullup.o libmpcodecs/vd.o libmpcodecs/vd_hmblck.o \ libmpcodecs/vd_lzo.o libmpcodecs/vd_mpegpes.o libmpcodecs/vd_mtga.o \ libmpcodecs/vd_null.o libmpcodecs/vd_raw.o libmpcodecs/vd_sgi.o \ libmpcodecs/vf.o libmpcodecs/vf_1bpp.o libmpcodecs/vf_2xsai.o \ libmpcodecs/vf_blackframe.o libmpcodecs/vf_boxblur.o libmpcodecs/vf_crop.o \ libmpcodecs/vf_cropdetect.o libmpcodecs/vf_decimate.o \ libmpcodecs/vf_delogo.o libmpcodecs/vf_denoise3d.o libmpcodecs/vf_detc.o \ libmpcodecs/vf_dint.o libmpcodecs/vf_divtc.o libmpcodecs/vf_down3dright.o \ libmpcodecs/vf_dsize.o libmpcodecs/vf_dvbscale.o \ libmpcodecs/vf_eq.o libmpcodecs/vf_eq2.o libmpcodecs/vf_expand.o \ libmpcodecs/vf_field.o libmpcodecs/vf_fil.o libmpcodecs/vf_filmdint.o \ libmpcodecs/vf_fixpts.o libmpcodecs/vf_flip.o libmpcodecs/vf_format.o \ libmpcodecs/vf_framestep.o libmpcodecs/vf_gradfun.o \ libmpcodecs/vf_halfpack.o libmpcodecs/vf_harddup.o \ libmpcodecs/vf_hqdn3d.o libmpcodecs/vf_hue.o libmpcodecs/vf_il.o \ libmpcodecs/vf_ilpack.o libmpcodecs/vf_ivtc.o libmpcodecs/vf_kerndeint.o \ libmpcodecs/vf_mirror.o libmpcodecs/vf_noformat.o libmpcodecs/vf_noise.o \ libmpcodecs/vf_ow.o libmpcodecs/vf_palette.o libmpcodecs/vf_perspective.o \ libmpcodecs/vf_phase.o libmpcodecs/vf_pp7.o libmpcodecs/vf_pullup.o \ libmpcodecs/vf_rectangle.o libmpcodecs/vf_remove_logo.o \ libmpcodecs/vf_rgbtest.o libmpcodecs/vf_rotate.o libmpcodecs/vf_sab.o \ libmpcodecs/vf_scale.o libmpcodecs/vf_smartblur.o \ libmpcodecs/vf_softpulldown.o libmpcodecs/vf_stereo3d.o \ libmpcodecs/vf_softskip.o libmpcodecs/vf_swapuv.o \ libmpcodecs/vf_telecine.o libmpcodecs/vf_test.o libmpcodecs/vf_tfields.o \ libmpcodecs/vf_tile.o libmpcodecs/vf_tinterlace.o libmpcodecs/vf_unsharp.o \ libmpcodecs/vf_vo.o libmpcodecs/vf_yadif.o libmpcodecs/vf_yuvcsp.o \ libmpcodecs/vf_yvu9.o libmpdemux/aac_hdr.o libmpdemux/asfheader.o \ libmpdemux/aviheader.o libmpdemux/aviprint.o libmpdemux/demuxer.o \ libmpdemux/demux_aac.o libmpdemux/demux_asf.o libmpdemux/demux_audio.o \ libmpdemux/demux_avi.o libmpdemux/demux_demuxers.o \ libmpdemux/demux_film.o libmpdemux/demux_fli.o libmpdemux/demux_lmlm4.o \ libmpdemux/demux_mf.o libmpdemux/demux_mkv.o libmpdemux/demux_mov.o \ libmpdemux/demux_mpg.o libmpdemux/demux_nsv.o libmpdemux/demux_pva.o \ libmpdemux/demux_rawaudio.o libmpdemux/demux_rawvideo.o \ libmpdemux/demux_realaud.o libmpdemux/demux_real.o libmpdemux/demux_roq.o \ libmpdemux/demux_smjpeg.o libmpdemux/demux_ts.o libmpdemux/demux_ty.o \ libmpdemux/demux_ty_osd.o libmpdemux/demux_viv.o libmpdemux/demux_vqf.o \ libmpdemux/demux_y4m.o libmpdemux/ebml.o libmpdemux/extension.o \ libmpdemux/mf.o libmpdemux/mp3_hdr.o libmpdemux/mp_taglists.o \ libmpdemux/mpeg_hdr.o libmpdemux/mpeg_packetizer.o libmpdemux/parse_es.o \ libmpdemux/parse_mp4.o libmpdemux/video.o libmpdemux/yuv4mpeg.o \ libmpdemux/yuv4mpeg_ratio.o osdep/getch2.o osdep/timer-linux.o \ stream/open.o stream/stream.o stream/stream_bd.o stream/stream_cue.o \ stream/stream_file.o stream/stream_mf.o stream/stream_null.o \ stream/url.o sub/eosd.o sub/find_sub.o sub/osd.o sub/spudec.o \ sub/sub.o sub/sub_cc.o sub/subreader.o sub/vobsub.o stream/ai_alsa.o \ stream/ai_oss.o sub/font_load.o stream/dvb_tune.o stream/stream_dvb.o \ stream/stream_dvdnav.o libdvdnav/dvdnav.o libdvdnav/highlight.o \ libdvdnav/navigation.o libdvdnav/read_cache.o libdvdnav/remap.o \ libdvdnav/searching.o libdvdnav/settings.o libdvdnav/vm/decoder.o \ libdvdnav/vm/vm.o libdvdnav/vm/vmcmd.o stream/stream_dvd.o \ stream/stream_dvd_common.o libdvdread4/bitreader.o libdvdread4/dvd_input.o \ libdvdread4/dvd_reader.o libdvdread4/dvd_udf.o libdvdread4/ifo_print.o \ libdvdread4/ifo_read.o libdvdread4/md5.o libdvdread4/nav_print.o \ libdvdread4/nav_read.o libvo/aclib.o av_helpers.o av_opts.o \ libaf/af_lavcresample.o libmpcodecs/ad_ffmpeg.o libmpcodecs/ad_spdif.o \ libmpcodecs/vd_ffmpeg.o libmpcodecs/vf_geq.o libmpcodecs/vf_lavc.o \ libmpcodecs/vf_lavcdeint.o libmpcodecs/vf_pp.o libmpcodecs/vf_screenshot.o \ libmpdemux/demux_lavf.o stream/stream_ffmpeg.o sub/av_sub.o \ libaf/af_lavcac3enc.o libmpcodecs/vf_fspp.o libmpcodecs/vf_mcdeint.o \ libmpcodecs/vf_qp.o libmpcodecs/vf_spp.o libmpcodecs/vf_uspp.o \ stream/stream_ftp.o libmpcodecs/vf_bmovl.o libaf/af_export.o \ osdep/mmap_anon.o libdvdcss/css.o libdvdcss/device.o libdvdcss/error.o \ libdvdcss/ioctl.o libdvdcss/libdvdcss.o libmpcodecs/vd_libmpeg2.o \ libmpeg2/alloc.o libmpeg2/cpu_accel.o libmpeg2/cpu_state.o \ libmpeg2/decode.o libmpeg2/header.o libmpeg2/idct.o libmpeg2/motion_comp.o \ libmpeg2/slice.o libmpeg2/idct_mmx.o libmpeg2/motion_comp_mmx.o \ libmpcodecs/ad_mpg123.o libmpcodecs/ad_mp3lib.o mp3lib/sr1.o \ mp3lib/decode_i586.o mp3lib/dct36_3dnow.o mp3lib/dct64_3dnow.o \ mp3lib/dct36_k7.o mp3lib/dct64_k7.o mp3lib/dct64_mmx.o mp3lib/decode_mmx.o \ stream/stream_rtsp.o stream/freesdp/common.o stream/freesdp/errorlist.o \ stream/freesdp/parser.o stream/librtsp/rtsp.o stream/librtsp/rtsp_rtp.o \ stream/librtsp/rtsp_session.o stream/stream_netstream.o \ stream/asf_mmst_streaming.o stream/asf_streaming.o stream/cookies.o \ stream/http.o stream/network.o stream/pnm.o stream/rtp.o \ stream/udp.o stream/tcp.o stream/stream_rtp.o stream/stream_udp.o \ stream/realrtsp/asmrp.o stream/realrtsp/real.o stream/realrtsp/rmff.o \ stream/realrtsp/sdpplin.o stream/realrtsp/xbuffer.o stream/stream_pvr.o \ libmpcodecs/ad_qtaudio.o libmpcodecs/vd_qtvideo.o libmpcodecs/ad_realaud.o \ libmpcodecs/vd_realvid.o stream/cache2.o tremor/bitwise.o tremor/block.o \ tremor/codebook.o tremor/floor0.o tremor/floor1.o tremor/framing.o \ tremor/info.o tremor/mapping0.o tremor/mdct.o tremor/registry.o \ tremor/res012.o tremor/sharedbook.o tremor/synthesis.o tremor/window.o \ stream/stream_tv.o stream/tv.o stream/frequencies.o stream/tvi_dummy.o \ stream/tvi_v4l.o stream/audio_in.o stream/tvi_v4l2.o sub/unrar_exec.o \ stream/stream_vcd.o libmpcodecs/ad_libvorbis.o libmpdemux/demux_ogg.o \ loader/wrapper.o loader/elfdll.o loader/ext.o loader/ldt_keeper.o \ loader/module.o loader/pe_image.o loader/pe_resource.o loader/registry.o \ loader/resource.o loader/win32.o libmpcodecs/ad_acm.o libmpcodecs/ad_dmo.o \ libmpcodecs/ad_dshow.o libmpcodecs/ad_twin.o libmpcodecs/vd_dmo.o \ libmpcodecs/vd_dshow.o libmpcodecs/vd_vfw.o libmpcodecs/vd_vfwex.o \ libmpdemux/demux_avs.o loader/afl.o loader/drv.o loader/vfl.o \ loader/dshow/DS_AudioDecoder.o loader/dshow/DS_Filter.o \ loader/dshow/DS_VideoDecoder.o loader/dshow/allocator.o \ loader/dshow/cmediasample.o loader/dshow/graph.o loader/dshow/guids.o \ loader/dshow/inputpin.o loader/dshow/mediatype.o loader/dshow/outputpin.o \ loader/dmo/DMO_AudioDecoder.o loader/dmo/DMO_VideoDecoder.o \ loader/dmo/buffer.o loader/dmo/dmo.o loader/dmo/dmo_guids.o \ libmpcodecs/vd_xanim.o osdep/shmem.o \ ffmpeg/libpostproc/libpostproc.a ffmpeg/libswscale/libswscale.a \ ffmpeg/libavfilter/libavfilter.a ffmpeg/libavformat/libavformat.a \ ffmpeg/libavcodec/libavcodec.a ffmpeg/libavutil/libavutil.a \ "$@" -Wl,-z,noexecstack -lm -ffast-math \ -lncurses -lasound -ldl -lpthread -lpthread -ldl -rdynamic } # Be careful with the copying! # Debian has messed up scripting so that the x bit normally is not copied! # That and the strange format of `time` output when called in a script, is, strange. cp -av $ad_mpg123.old $ad_mpg123 && build_ad_mpg123 && link_mplayer -lmpg123 && cp -av mplayer mplayer-variant0-dynamic-sys && link_mplayer -Wl,-rpath=$mpg123prefix/lib -lmpg123 && cp -av mplayer mplayer-variant0-dynamic-trunk && link_mplayer $mpg123prefix/lib/libmpg123.a && cp -av mplayer mplayer-variant0-static-trunk && cp -av $ad_mpg123.new $ad_mpg123 && build_ad_mpg123 && link_mplayer -lmpg123 && cp -av mplayer mplayer-variant1-dynamic-sys && link_mplayer -Wl,-rpath=$mpg123prefix/lib -lmpg123 && cp -av mplayer mplayer-variant1-dynamic-trunk && link_mplayer $mpg123prefix/lib/libmpg123.a && cp -av mplayer mplayer-variant1-static-trunk && cp -av $ad_mpg123.new $ad_mpg123 && build_ad_mpg123 -I$mpg123prefix/include && # also variant 2 with old lib, not recommened, # but should still work (or at least fail gracefully) link_mplayer -lmpg123 && cp -av mplayer mplayer-variant2-dynamic-sys && link_mplayer -Wl,-rpath=$mpg123prefix/lib -lmpg123 && cp -av mplayer mplayer-variant2-dynamic-trunk && link_mplayer $mpg123prefix/lib/libmpg123.a && cp -av mplayer mplayer-variant2-static-trunk || exit 1 time_run() { time "$@" } mark_run() { echo echo "=== This was run $1 of $2 ===" echo } album=/dev/shm/convergence_-_points_of_view # some runs for mpg123 to set the baseline for d in mpg123 $mpg123prefix/bin/mpg123 do for n in `seq 1 20` do time_run $d -w /dev/null $album/*.mp3 mark_run $n $d done done # now the mplayers for d in $mplayersrc/mplayer-variant* do for ac in mp3 mpg123 do for n in `seq 1 20` do time_run $d -ac $ac -quiet -ao pcm:fast:file=/dev/null $album/*.mp3 mark_run $n $d:$ac done done doneI'll let this run on the two machines over the day...
This is the result for user times... arithmetic mean over 20 runs each, with being the standard deviation (rounded just like the mean, so there can be full zeroes, one should always wiggle the last digit, though). The mplayer builds are labelled with the ad_mpg123 variant number, a ``d'' for dynamic, ``s'' for static linking to libmpg123, followed by ``s'' for system libmpg123 (a bit older, usually) or ``t'' for a checkout of mpg123 svn trunk, built with optimized flags.
K6-III+@550 | Duron@800 | C2@800 | Semp@800 | |||||
run | ts | s | ts | s | ts | s | ts | s |
mpg123-sys | 51.5 | 0.3 | 22.5 | 0.1 | 15.5 | 0.0 | 14.8 | 0.0 |
mpg123-trunk | 50.6 | 0.3 | 22.4 | 0.1 | 15.4 | 0.0 | 14.8 | 0.0 |
mpl0ds-mp3lib | 56.0 | 0.4 | 25.5 | 0.1 | 18.6 | 0.1 | 19.3 | 2.0 |
mpl0ds-mpg123 | 56.0 | 0.3 | 26.1 | 0.1 | 16.5 | 0.1 | 15.6 | 0.1 |
mpl0dt-mp3lib | 56.2 | 0.4 | 25.5 | 0.1 | 18.6 | 0.0 | 19.3 | 2.0 |
mpl0dt-mpg123 | 55.1 | 0.3 | 25.8 | 0.1 | 16.3 | 0.0 | 15.5 | 0.1 |
mpl0st-mp3lib | 56.5 | 0.3 | 25.6 | 0.2 | 18.6 | 0.0 | 19.3 | 1.8 |
mpl0st-mpg123 | 55.0 | 0.2 | 25.5 | 0.1 | 16.3 | 0.0 | 15.5 | 0.1 |
mpl1ds-mp3lib | 56.2 | 0.4 | 25.5 | 0.1 | 18.6 | 0.1 | 20.5 | 1.8 |
mpl1ds-mpg123 | 57.2 | 0.3 | 26.9 | 0.1 | 16.4 | 0.0 | 15.7 | 0.1 |
mpl1dt-mp3lib | 56.2 | 0.3 | 25.5 | 0.1 | 18.6 | 0.0 | 20.3 | 1.9 |
mpl1dt-mpg123 | 56.2 | 0.3 | 26.8 | 0.1 | 16.3 | 0.0 | 15.7 | 0.1 |
mpl1st-mp3lib | 56.6 | 0.3 | 25.6 | 0.2 | 18.6 | 0.0 | 19.6 | 1.8 |
mpl1st-mpg123 | 56.3 | 0.3 | 26.6 | 0.1 | 16.3 | 0.0 | 15.7 | 0.1 |
mpl2dt-mp3lib | 56.4 | 0.4 | 25.5 | 0.1 | 18.3 | 0.0 | 19.2 | 1.9 |
mpl2dt-mpg123 | 54.5 | 0.3 | 26.1 | 0.1 | 16.0 | 0.0 | 15.4 | 0.1 |
mpl2st-mp3lib | 56.6 | 0.5 | 25.6 | 0.1 | 18.3 | 0.0 | 19.8 | 1.9 |
mpl2st-mpg123 | 54.5 | 0.3 | 25.9 | 0.0 | 16.0 | 0.0 | 15.5 | 0.1 |
best mp3lib | 56.0 | 0.4 | 25.5 | 0.1 | 18.3 | 0.0 | 19.2 | 1.9 |
best mpg123 | 54.5 | 0.3 | 25.5 | 0.1 | 16.0 | 0.0 | 15.4 | 0.1 |
best variant | mpl2dt | mpl0st | mpl2dt | mpl2dt |
It's interesting that the Duron/Athlon is capable of keeping the Core2Duo in check; probably assisted by that faster RAM, though I do wonder how much effect that has since the relevant code should nicely fit into the cache. But the audio data won't, of course. And the Core2Duo has more cache. Whatever. Both machines are in 64 bit mode and are using the 64 bit SSE code in mpg123. And not to forget that this is the inverse picture for mp3lib: That is significantly slower, the spread on Sempron meaning that mp3lib can be about 33% slower than the mpg123 binding!That puts the idea about faster RAM to the test, eh?
Also, the numbers in this table are noticeably different from the initial measurements. I attribute this to the statistics, and changed CFLAGS... in any case, there is really quite some variation of individual values going on. One might wonder why (at least for the Sempron?) there are rather clear influences of built type on mp3lib performance. The mp3lib code has not been touched.
Well, I did change something going from 0 to 1: The decoding loop is different to support the frame decoding. That difference amounts to about a whole second runtime difference, based on around 26 seconds total time. This is about 4%. Heck, this is not limited to the Duron -- look closer! This means also almost 2% hit on the K6-III+ and also 1% on the Sempron. The Core2Duo doesn't complain at all.
Changing the decoding loop order doesn't do the trick. This must be a strange side-effect. Really, the code logic (when switching the loop around) is the same. Just the #ifdefs around are different.
Really, it doesn't make sense. The calls to libmpg123 are the same, the order of feeding vs. decoding in the loop doesn't make a difference... Hacking the variant 1 feeding decode to also return decoded data (reducing calls to mpg123_decode) worsens the situation. I'm lost here and also fed up. This is insanity. The next gcc might reversen the situation. So, for now, just use fresh MPlayer with fresh mpg123 to get variant 2 and be done with the small performance hit in that specific configuration. Or, if you lean that way, get on with debugging what the heck the compiler is doing there, and also make sure you find the reason why mp3lib tends to have different modes of operation with differing variance of runtime depending on what ad_mpg123 code is built (easy to spot in the Sempron runs).
One can still argue that there is something sub-optimal with our 3DNowExt assembly (although it works wonderfully for the K6-III+), specific to the Duron's small cache. But then, profiling does not suggest that there is excess time spent in that code. Would have to be a side effect making other code slower because of cache trashing. If that's the case, however, this effect should be smaller on a proper Athlon (Thunderbird, at least). It could be that this is the reason for the effect being half as big for the K6-III+, aleady. But the Sempron does not employ the 3DNowExt code and also has not exactly a small cache. Though, I admit that the size of impact indeed does seem to correlate with L2 cache size: We got 64K, 256K, 1M, 3M. Still, it's apples against oranges with the differing platforms (toolchain).
But if you need 3M of cache to cover the hot code / data, then I'd conclude that MPlayer is rather fat there. I guess that switching off the MPEG stream parsing there could help, as that code is doubled in libmpg123, too. I also keep noticing routines checking for AC3 stuff in MPlayer. I dunno why they are needed here. There is too much code running.
And my mysery is a bit offset by another way at looking at mp3lib: Comparing the Core2Duo with the Sempron we have plain mpg123 runtime of 15.4 vs. 14.8 s, Sempron is 0.6 s faster. From within MPlayer, using dynamic trunk libmpg123, it is rather consistently around 0.6 s faster, too. So, this all seams reasonable, once you accept that the Sempron has a better time with the x86-64 assemblly code.
With mp3lib, there is a spread from 18.3 to 18.6 s on the Core2Duo, and on the Sempron from 20.8 to 21.6; the direct comparison of runs shows 2.3 to 3 s increase in runtime. Now explain why the Sempron has a harder time with MPlayer's 64 bit SSE assembly code, a time harder by 5%; and comparing that to mpg123 on the same box, it's 33%, even. With that in mind, 4% unexplained performance hit on the Duron is not that bad, is it?
The final set of runs presented here has rather consistent performance of mp3lib on the Sempron, but with really large spread around the mean. Earlier runs had somewhat smaller error range but higher mean runtime. There is definetely more strangeness going on with performance in mp3lib, and perhaps by extension generally with the (my?) MPlayer binaries, than with the libmpg123 binding. 'nuff said.
Well, when you install a new MPlayer and new mpg123, you can expect an increase in MP3 decoding performance. There are certain platforms that defy logic, or at least bring out the utmost fuzzyness out of it. That fuzzyness is embodied somehow in the build of MPlayer, changing efficiency of parts of code all over the place without directly touching them. The build on the Duron gives, scaled relative to plain mpg123 performance, compared to K6-III+, both mp3lib and mpg123 an extra hit, but delivers a bigger blow to mpg123. The situation is on full reverse, sort of, on a Sempron.
That strangeness in mind, I hereby declare that I want to keep what's left of my sanity and also that the new variant of ad_mpg123 should go into MPlayer's trunk and is rightfully obsoleting mp3lib. If someone opposes that, that someone should first give me a consistent picture of what exactly the mp3lib performance is;-)
Now, after recognizing that current ffmpeg indeed improved the mp3float decoder to surpass mpg123 on a Core2 CPU, with the help of SSE handcrafting and, more importantly, an approach to Huffman decoding that fits that CPU a lot better, I checked the situation again on the Duron.
There, the bare comparison between ffmpeg and mpg123, when both decode to floating point numbers, showed two things:
With that latest change, the picture finally fits on the Duron. I re-run some benchmarks with updated MPlayer checkout, usage of ffmp3float is also included, but note that the comparison is actually skewed in favour of ffmpeg as it does not do the final conversion to 16 bits (it would need a bit more time with that).
I repeat the big table for the current ad_mpg123 in MPlayer (leaving only variants 1 and 2), updating the columns for the Duron with the new numbers. Note that I assume, but do not know, that the K6-III+ would react similar to the Duron. I'll have to run the updated test there, too... some time.
K6-III+@550 | Duron@800 new | C2@800 | Semp@800 | |||||
run | ts | s | ts | s | ts | s | ts | s |
mpg123-sys | 51.5 | 0.3 | 22.5 | 0.1 | 15.5 | 0.0 | 14.8 | 0.0 |
mpg123-trunk | 50.6 | 0.3 | 21.3 | 0.1 | 15.4 | 0.0 | 14.8 | 0.0 |
mpl1ds-mp3lib | 56.2 | 0.4 | 25.5 | 0.1 | 18.6 | 0.1 | 20.5 | 1.8 |
mpl1ds-mpg123 | 57.2 | 0.3 | 27.0 | 0.1 | 16.4 | 0.0 | 15.7 | 0.1 |
mpl1ds-ffmp3float | 29.5 | 0.1 | ||||||
mpl1dt-mp3lib | 56.2 | 0.3 | 25.5 | 0.1 | 18.6 | 0.0 | 20.3 | 1.9 |
mpl1dt-mpg123 | 56.2 | 0.3 | 25.6 | 0.1 | 16.3 | 0.0 | 15.7 | 0.1 |
mpl1dt-ffmp3float | 29.7 | 0.2 | ||||||
mpl1st-mp3lib | 56.6 | 0.3 | 25.5 | 0.1 | 18.6 | 0.0 | 19.6 | 1.8 |
mpl1st-mpg123 | 56.3 | 0.3 | 25.5 | 0.1 | 16.3 | 0.0 | 15.7 | 0.1 |
mpl1st-ffmp3float | 29.7 | 0.2 | ||||||
mpl2dt-mp3lib | 56.4 | 0.4 | 25.5 | 0.1 | 18.3 | 0.0 | 19.2 | 1.9 |
mpl2dt-mpg123 | 54.5 | 0.3 | 24.9 | 0.1 | 16.0 | 0.0 | 15.4 | 0.1 |
mpl2dt-ffmp3float | 29.6 | 0.1 | ||||||
mpl2st-mp3lib | 56.6 | 0.5 | 25.5 | 0.1 | 18.3 | 0.0 | 19.8 | 1.9 |
mpl2st-mpg123 | 54.5 | 0.3 | 24.7 | 0.1 | 16.0 | 0.0 | 15.5 | 0.1 |
mpl2st-ffmp3float | 29.7 | 0.1 | ||||||
best mp3lib | 56.0 | 0.4 | 25.5 | 0.1 | 18.3 | 0.0 | 19.2 | 1.9 |
best mpg123 | 54.5 | 0.3 | 24.7 | 0.1 | 16.0 | 0.0 | 15.4 | 0.1 |
best variant | mpl2dt | mpl2st | mpl2dt | mpl2dt |
So, with an upcoming mpg123 release, mp3lib is clearly beaten. The situation with ffmp3float is also under control on non-SSE machines (or generally non-Core2?). Investigations are underway to get mpg123 up to speed on modern CPUs, at least on par with ffmpeg.
Also, that the hand-crafted DCT routines are overtaken by the compiler's own optimizing is actually a good thing. Of course, I'll have to check if this also applies to the K6-III+ and newer Athlons. If that is the case, we can delete old code for good. On the other hand: There might be potential for even better routines using 3DNow; question is if we care enough for that now obsoleted instruction set... or if we want to go freakish and use 3DNow and SSE at the same time for maximum register count.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -local_icons beating_mp3lib_in_mplayer.tex
The translation was initiated by on 2012-03-25