You frequently read benchmark claims for various languages. Those claims have lots of problems. First, they usually benchmark a “Hello, world!” program, not complicated business logic. Second, they usually compare optimized code in language X against poorly written code in language Y.
I decided to write a “Video Poker Trainer” program. In video poker gambling, you are dealt 5 cards. You choose what cards to hold, and then draw new cards. Based on your final hand, you receive a payout.
Some video poker games have a payout close to 100%, but the strategy is complicated. The casino advantage comes from the fact that many people make suboptimal plays. Therefore, it would be useful to have a program that helps you practice.
My program solves the video poker hand by checking all the possibilities. There are 2^5 = 32 possibilities for what you hold, and then C(47, n) possibilities for the draw, where n is the number of cards you draw. I wasn’t intending to do this as a benchmark test, but when the performance of my first version was unacceptable, I wound up rewriting it in several languages.
All of this code is running on my desktop PC.
I then tried Chrome (V8). There, it took 1.6 seconds! That’s a substantial difference compared to Firefox!
However, 1.6 seconds is still unacceptable for a game. The browser window would freeze while the calculation ran. I could sprinkle SetTimeout during the calculation, but that would make it take even longer. Also, if it’s a browser game, I can’t assume that the client is using Chrome.
I could have tried HHVM (Facebook’s optimized PHP runtime), but didn’t.
My final try was C/C++. I expected that C would clobber everything, including Chrome/V8. Again, it was the exact same algorithm, changing the PHP syntax to C/C++ syntax.
I was surprised that the C/C++ version took 1.6 seconds, the same as Chrome/V8! I was pretty shocked by that.
I probably could go back and tune the code, but to be fair, then I’d have to do it in all 3 languages. To make the comparison perfectly accurate, it should be the exact same algorithm in each language.
It wouldn’t be a proper experiment if I got the answer I expected. I thought that the C version would clobber everything, including the Chrome/V8 version. Instead, C and Chrome came out tied.
Oops, just as I finished this post, I realized I made a mistake. I was using gcc WITH THE DEFAULT COMPILE OPTIONS! I tried again with -O3 full optimization flags! Now, the C version takes 0.6 seconds, coming out ahead of V8, like it should. My confidence in C is restored. (For completeness, -O1 took 0.71 seconds, -O2 0.61 seconds, -O3 0.60 seconds.)
Another improvement, “gcc -O3 -ffast-math” took 0.57 seconds! (I don’t mind an error in the 7th decimal place, since this is a game, so -ffast-math is a reasonable compiler optimization.)
The benchmark comparison was on my desktop. Moving to my Linode (using g++ instead of gcc), the unoptimized version took 1.3 seconds per hand. The -O3 version took 0.36 seconds per hand! (Actually, because the Linode is shared, the time it takes to run depends on how much the other users are using the server. I got a range of 0.36 seconds per hand to 0.5 seconds per hand.)
My initial conclusion, as I was composing this post, was that Chrome/V8 and C had equal performance. However, I was making a big mistake! I wasn’t turning on optimization settings in gcc! That shows how a poorly conducted benchmark test can lead to false conclusions. I almost made that mistake myself! (For Chrome/V8, I was using the default settings. I don’t know anything about extra optimization settings in Chrome/V8.)
The game still isn’t finished. I just wrote the code that does the calculations. I’ll put it up here when I’m done. It looks like the only way to get acceptable performance is to shell_exec the C program that does the calculation, and then serve the result to the client. If I’m just calling an external C program, it doesn’t matter what language I use on the server, but I’m going to stick with PHP because that’s what I like the best.
It’s also educational to write the same algorithm in multiple languages. By the time I got around to the C/C++ version, I was making sure it was proper reusable code, so I could use the same function to solve multiple video poker variants. I plan for my actual program to work on 9/6 and Double Joker Wild. (Double Joker Wild and 9/6 are two video poker variants. 9/6 has a house edge of only 0.6% played perfectly, and Double Joker Wild has a house edge of 0.09% played perfectly, making a practice program useful. Adding wild cards means I need a completely different hand evaluation function, which is why I had to make that a function pointer.)
If you posted the code in all languages, maybe we could crowdsource optimizations?
Also, are you familiar with ASM.js? It could go a long way in FF. http://blog.mozilla.org/mbest/2013/06/25/asm-js-its-really-fast-backwards-compatible-and-now-in-the-release-version-of-firefox/
Are you really interested? If someone’s interested in doing that, I’ll post it. The code only sort of works. It isn’t polished.
Asm.js still isn’t as fast as a native binary, but it’d be closer.
I do plan on finishing it, but I’ll probably just have the calculation done on the server, with PHP doing a shell_exec to the C program.
It also would be interesting to try other languages, like Java, VB/C#.NET, Ruby, Python, etc. I still say that the C/C++ version would perform best.
OK, I uploaded my code. Here it is.
Just poking around at the C++ I get 250-300ms for each test (depending on targeting x86 vs. x64, etc.), or about 8 million calls to evaluate_hand_96 per second (it’s getting called about 2.6 million times). That seems reasonable to me in raw numbers (this is on a 3.6 GHz i7, Win7x64, Visual Studio 2010), coming in at around 450 clock cycles per evaluate_hand_96. I tried running it through VTune Amplifier, but I think the profiling doesn’t give helpful results because the inner loops are so small that sampling isn’t terribly useful. I tried removing the function pointers and converting some logical ops to arithmetic ones (I think I shaved off 50ms, but it could be a result of the optimization settings).
I’ve read the code enough to understand the structure, but I haven’t looked to see why the inner loop is getting so many calls. Assuming that’s really how much work needs to get done, I don’t see an obvious way to tune it up without multithreading, etc.
I haven’t looked at the other code yet (I will over the weekend), but you can contact me at the included email if you like.
1.a. When I want to run it to loop over every hand and make a full strategy table, I can do multiple processes.
2. I think that’s one advantage of C++ with full optimization. It inlines the functions.
Also, Visual Studio, when you compile for release, automatically turns on full optimization for you. With gcc, I had to remember to do that myself on the command line.
4. If you optimize the code in one language then, to make it a fair comparison, it has to be optimized in each language. I thought that the exact same algorithm in each language would be the most valid comparison.
The inner loop gets so many calls because, when you draw 5 cards, it needs to evaluate 47*46*45*44*43/5! possibilities for what you could draw. When you draw 4 cards, it needs to evaluate 47*46*45*44/4! possibilities. That’s a lot of iterations. Drawing all 5 cards is stupid sometimes, like when you have a pair, but for completeness I always check every possibility.
Also, I put the function pointers in, because I want the same code to work with multiple video poker variants. I’m also doing Double Joker Wild, which means a completely different function to evaluate the hand. Other video poker variants give a bonus for certain 4-of-a-kinds, which also would require a completely different evaluation function.
Yah, I figured that sounded about right for the combinatorics, but a quick calculation of 47 choose 5 comes up with 1.5e6, not 2.6e6. I figured multi-threading was out for purposes of comparison, and that any optimization for one language should be done for all. I just tackled first what I know best.
47_choose_5 is 1.5e6. 47_choose_4 is 1.7e5, but you draw 4 cards in 5 different ways (5 different possibilities for holding one card).
When I do the double joker wild variant, it’ll be a 54 card deck.
Also, I don’t want the code to take shortcuts, by making assumptions such as “always hold a pair”. In some video poker variants (especially with wild cards), a pair is a bad hand to hold. I could take shortcuts by making the code more intelligent, but that would limit the ease of modifying it for various video poker variants.
I wrote a Python implementation at home. Here on a Core 2 Quad 2.4GHz, I get a C++ time of about 470 ms per run, and Python about 60 s per run, over 120x slower. I don’t have profiling tools for Python, and haven’t done any testing for correctness.
My Python code is here. I kept as much of the original style as I could. I think your shuffle algorithm isn’t robust in its randomness, but I kept the same algorithm (it shouldn’t affect timing anyway).
This should give you another data point at least.
Really? 60 seconds for one hand on Python? That’s pretty awful. I’m glad I didn’t waste time learning Python, especially with the 2.0 vs 3.0 fiasco.
Testing for correctness is easy. Compare the output to the CPP version! You can set the hand manually instead of letting it deal for you.
Actually, my shuffling algorithm is the correct one. Assuming the rng is fair (it isn’t though), it will produce every possible hand with equal probability.
I did not do:

for (i is 1 to n)
    Swap i with random position between 1 and n

I did:

for (i is 1 to n)
    Swap i with random position between i and n

Mine is correct.
Re: shuffle algorithm–yes, I read your code wrong, my mistake. Using the % operator on a rand() result can add bias as well.
Re: Python, the version I tested on is ActivePython 2.7. It should work in 3+ (though in 3+ I could have used the // operator). Since I’m unclear on what typically exists in server environments, I left it at 2.7.
You may want to run my code on your platform to get more useful timing numbers.
That is correct, %n on rand is slightly wrong. If I wanted to be perfect, I should throw out the result when rand returns a value between floor(MAXINT/n)*n and MAXINT. It is a negligible error.
There are other issues with using rand(), such as the fact that it will repeat eventually. There are 52! possible ways to shuffle a deck, but rand() repeats before 52! hands are dealt. It is good enough if your goal is to deal sample practice hands.
60 seconds was convincing enough to me. 470ms per hand for the C++ version is close to the result on my PC, about 10% faster. I thought Python would do a lot better.
What I am convinced of is that nothing beats C/C++ (other than raw assembly). I’ve heard people swear up and down that other languages are faster, but I haven’t seen it in practice.
The other interesting test would be Java or .NET. I expect those would be worse than C/C++ also, but I’m not going to put the time in to try it.
If the process were I/O bound or had high overhead in malloc/free, then I’d think .NET might have a shot. I didn’t fully convert everything to “Pythonic” style (getting rid of all globals, for instance, which might optimize better), but I don’t see much more gain.
It really seems to be a compute-bound case, which is where C++ shines. Even techniques like JIT wouldn’t be expected to help, because you’re processing random values, so branch prediction can’t get better each time you run through a loop.
I’d really be shocked to see anything beat well-written C++ in performance. The advantage of the other languages is primarily development time. I can crank out a Python script to convert data, etc. much faster than C++.
I believe that’s why Rails works for many people, if your business model fits Rails precisely. I genuinely like the Ruby language, but community support seems to favor Python, which is why I adopted it as my primary scripting language (after Perl in the late 90s and Ruby in the mid-2000s).
As far as using rand() goes, yes the cycle is limited, but it’s reasonable to test as proof of concept. If I were to write my own rand() function again, I’d use a standard AES-256 approach (block ciphers and crypto hashes make excellent pseudo-random number generators) rather than the old Mersenne Twister implementation I used in a simulated annealing project a few years back.
I find it pretty amazing that Chrome’s JS seems to be the best approach after C++. Kinda stunning.
If you want to write your own RNG, a better way is to take the XOR of your custom RNG with a standard RNG. If you XOR two independent RNGs, the result should be no less random than either of them. I thought about taking the XOR of several simple linear congruential random number generators with different periods, leading to an RNG with a VERY LONG period. I was looking for 256-bit and 1024-bit random number generators, but couldn’t find anything decent.
Another advantage of writing your own RNG is in certain types of games, if you want to remember the state of the RNG.
The main reason is that, in C/C++, each arithmetic operation (usually) translates to 1 assembly instruction. For those other dynamically typed languages, the interpreter does a lot of extra operations verifying the type of each object. Some older languages like FORTRAN and COBOL might outperform the ‘modern’ languages, because they were also designed closer to assembly.
Rails is useful if you’re cranking out a toy demo of a stupid startup website idea. If you’re writing something complicated, which is my specialty, Rails is more of a handicap than a help. A lot of these ‘modern’ fancy frameworks are useless if you try to implement something that isn’t covered by the framework. At the same job, they’re using D3. It’s taking them longer to figure out how to make D3 behave the way they want and to find the correct D3 inputs than it would take to just open a canvas object and draw on it directly. Seriously, it isn’t that hard to open a canvas object, draw some axes, text, and your chart. Using a 3rd-party library is another layer of indirection that makes the website slower.
Last entry for Python. After doing some cursory profiling (using the built-in cProfile), I made things more Pythonic and made some minor performance tweaks (like using xrange instead of range, which is advised in Python 2.7 but not needed in 3.0). I reduced the runtime from 60 to 40 sec.
What’s a link to the updated version?
The profiled result is here.
I’ll take a crack at the Java implementation. It shouldn’t be too different from the C++ code, and I have the Java environment for Android tinkering. No promises on when it’ll be done, but I too would like to see the comparison.
Since everything is compute + branch bound, I expect Java to suck. I guess the question is how much.
Okay, Java is done. Here’s the code.
It fared better than I expected. Runtime is 2.5 seconds. For comparison we have:
C++: 0.56 secs
Java: 2.5 secs
Python: 42.5 secs
So Java is only 5x slower than C++. Thoughts?
5x slower? That really is pretty bad. I’ve had people swear to me Java was at worst 10%-20% slower than C/C++. They were lying and/or stupid, as I suspected.
Of course, that isn’t tuned Java, but I doubt you could tune it enough to overcome the 5x deficit.
I’m really souring on these higher-level languages now that I’ve performed this benchmark test. I always knew they were a little slower, but 5x-40x slower is not acceptable.
Some people will say that for I/O bound operations it doesn’t matter as much. Number-crunching is my specialty, so it certainly matters for the stuff I typically write.
I put your code in the ftp directory (in case you ever take it down from your site).
My comment didn’t come through quite right. I tried to italicize the “only”, as in Java is “only” 5x slower. I did look into profiling the Java code, but I was less than impressed with JVisualVM, and it didn’t prove useful.
Thank you for putting the source with the other examples. If you can, I think you should run all of the samples on the same machine to get better numbers for comparison.
I’m kind of thinking I should put together a C# implementation to compare with .Net. I’d expect it to be on par with Java (possibly slightly better — IIRC it’s better about native types than Java, but I haven’t kept up with details like that in a while).
Your score for C++ is close to mine, so I’m confident in the others. I’ve had bad experiences with Java messing up other things, so I’m not that eager to install the Java SDK on my PC.
5x worse is pretty bad, considering how many people told me Java was “the same” performance as C/C++.
I wonder how older languages would fare, like FORTRAN, COBOL, and Pascal? Those were also designed relatively close to the metal, so I think they’d do well.
Someone found a bug in your Java. It leads to a 5x slowdown.
for (int i=0; i!=5; ++i)
    int card_rank = (int)(hand[i]%13);
    int suit_rank = (int)(hand[i]/13);
Not only does that make the performance worse, it causes it to give the wrong answer. (Three Jacks will come out as a pair of Jacks.)
I should have checked your work. I guess I have to also look at the Python version.
It’s just like at work. They release code without testing it. They get annoyed when I ask “Did you even bother testing that the button you just added actually works when you click on it?”
Well, the point of submitting the code was to increase scrutiny. I didn’t bother with a test harness, or have a golden file to compare against.
I’ll take a look at the Java code when I get a chance, but it’s well off of my urgent list.
I’m going to re-run the test myself. For the C/C++ version, you can call it with a specific hand rather than shuffling, so you can check your answer. That matters because rand() is implemented differently in various languages.
The other language I’m interested in is Go, for a comparison. People say it’s as good as C/C++.
After fixing that missing closing bracket, the Java code runs in 0.97 sec. Much better than I expected. Still not parity, but less than 2x C++.
I suggest you choose a starting deck for your C++ code, and I’ll push it to Python & Java, and we can compare the results.
I should re-run the test myself. 2x slower is closer to what I expected. Someone on crazyontap said that Java was faster, which I didn’t believe.
The only other language I’m seriously interested in is Go. I’ve heard a lot of hype about it. People are also talking about Rust.
I’ll make another version where the hands are fixed rather than shuffled. That would also make it easier to verify the output.
A 2x performance slowdown isn’t worth all the “nice” things Java does for you. Actually, I found Java annoying.
Publish your code for a comparison like the others did. Otherwise, I don’t believe you.
Also, did you remember to compile the C++ with full optimization settings?
I meant to try the Java version on my PC as well, but I didn’t really want to pollute my PC with the Java SDK.
“Atrocious” code? It was pretty much a direct copy of the C++ code. I did it because apparently no one else was willing to. The Java fanbois seem loath to actually benchmark things in public, preferring to talk about theoretical gains from JIT, etc. My hope was that by providing the source for the languages, we could have language experts fix up any problems that I missed (like the looping error I made, when I was tired and really didn’t care much about anything other than finishing the effort).
My conclusions about Java were that I’d forgotten how verbose and picky it is. My conclusions about the Java fanboi clique is that they’re the kind of people I’d rather not spend time with.
I don’t believe your numbers. Provide the code, the settings for your compiler, and then allow someone else to look it over like I did.
That was exactly my point. The guy insisted Java was faster than C++. He did not post his code. I have no idea if he used proper compiler optimization settings when he tested the C++ version.
This is a potentially interesting test, writing code to solve the exact same problem in multiple languages.
I did learn one thing from this exercise, which is that PHP is MUCH worse than I expected. I don’t see why people ridicule me for not knowing that. The only way to be sure is to perform an experiment like I did. I thought that PHP compiled pretty efficiently to C.
That’s another interesting point. The people who favor certain languages (Rails, node.js) have defective personality types. There’s no point learning those languages, because then I’ll be working with the scum who prefer those languages.
I worked on 2 Java projects, both were a mess. That isn’t a sufficient sample size, but it didn’t make me motivated to pursue Java further.
I modified the Java code and did some minor optimizations:
- Fixed the bug.
- Re-used the Random generator instead of creating a new one every time.
- Waited until the end of processing to write output to the console (I assume the goal is to measure processing time, not console output time).
I did not verify the validity of the code beyond the reported bug.
Running the main() method on JDK 7 build 51 inside Eclipse, on a quad-core Core i7 with HT, I got 0.255 secs.
Did you run the other versions on your PC for a valid comparison?
I.e., if the Java version is faster on your PC, I need to know if it’s because you have a faster PC or if the code is optimized better.
Also, when I timed my versions, I included all the time (including writing to console). However, writing to console is only milliseconds.
Ran it 10 times on command line using same machine and same JDK.
Results are here
Funny… a flurry of comments for months, then someone gets the Java version faster than the C++ version, and the comments all stop…
I didn’t have the time to double-check the Java version. I doubt it’s true, but I didn’t have time to confirm it myself. I haven’t been putting much time into my blog lately.
FSK is one funny guy. I’ve been seeing his posts dissing Java across 2-3 years, and then when the Java code is faster, he goes into hiding.
I haven’t been updating my blog lately. I didn’t have time to re-verify the test. C/C++ is faster if you put on full compiler optimization settings.