Simplepie vs. Magpie: An RSS Parser Shootout (Updated!)

I parse RSS a lot. My newsbot (which automatically finds new news sources) parses around 12,000 feeds each day and heavily filters the results using Bayesian and hidden Markov filters. If I can shave a few seconds of runtime off somewhere, I'm willing to try it, because the scripts run on a 266 MHz P2 machine (which consumes nearly no power at all).

After reading all the buzz about Simplepie ("Faster than a speeding bullet") I decided to give it a try. After trying to break both pies (using invalid feeds, invalid Unicode and so on) I decided to benchmark them. I tested Magpie 0.72 and Simplepie 1.0b2.

Magpie sometimes has problems with Unicode and seems to have no documentation at all, but it parses even really fucked-up feeds and is widely used. You can feed it raw text, not only URLs. It uses the widely used Snoopy class to fetch HTTP data. It caches serialized data as a file (the filename is the MD5 hash of the feed URL).

Simplepie seems to cope with Unicode in all cases, has bloated but existing documentation and seems to parse most of the trash I gave it. It only works on a URL basis, with no chance to use proxies or Last-Modified tricks. It uses fopen(). It caches serialized data as a file (the filename is the urlencoded feed URL). It has been hyped on digg and elsewhere.
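The two cache-naming schemes described above can be illustrated with a minimal sketch. It is written in Python rather than PHP and is not taken from either library: the function names are hypothetical, and the real classes may add a directory, prefix or extension that the post does not mention.

```python
import hashlib
from urllib.parse import quote


def md5_cache_name(feed_url: str) -> str:
    """Fixed-length name: the hex MD5 digest of the feed URL (hypothetical helper)."""
    return hashlib.md5(feed_url.encode("utf-8")).hexdigest()


def urlencoded_cache_name(feed_url: str) -> str:
    """Variable-length name: the URL itself, percent-encoded (hypothetical helper)."""
    # safe="" also encodes "/" so the result contains no path separators.
    return quote(feed_url, safe="")


url = "http://del.icio.us/tag/ninja"
print(md5_cache_name(url))         # always 32 hex characters
print(urlencoded_cache_name(url))  # grows with the length of the URL
```

The practical difference is that a hashed name has a constant length, while a urlencoded name stays human-readable but can get very long for feed URLs with many query parameters.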

The Test

These are the two test files I used (exported from the newsbot):

These are the two test scripts I ran:

Results

        Magpie        Simplepie
test1   25 seconds    ??? seconds
test2   5 seconds     67 seconds

I stopped test1 for Simplepie after 5 minutes, because it consumed way too much memory and hogged my machine heavily.

Looks like Simplepie is more like “Slower than a mule on dope”. Magpie is so much faster than Simplepie that it disqualifies Simplepie. I wonder if anyone who hyped Simplepie as turbo-fast has ever tried to benchmark it against Magpie. Try it yourself, all the files you need are above. These results are uncached; I did not compare cached results because both classes use the same caching method.
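For anyone who wants to repeat this kind of uncached measurement, here is a rough sketch of a timing harness. It is Python, not the PHP test scripts used for the numbers above, and the parser is only a stand-in built on the standard library's `xml.etree.ElementTree`; `parse_titles`, `benchmark` and `SAMPLE_FEED` are made-up names for illustration.

```python
import time
import tracemalloc
import xml.etree.ElementTree as ET

# Tiny placeholder feed; a real run would load an exported test file instead.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>demo</title>
<item><title>one</title><link>http://example.com/1</link></item>
<item><title>two</title><link>http://example.com/2</link></item>
</channel></rss>"""


def parse_titles(xml_text):
    """Stand-in parser: returns the titles of all <item> elements."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]


def benchmark(parser, xml_text, runs=5):
    """Return (best wall-clock seconds, peak traced bytes) over several runs.

    Nothing is cached between runs: the raw text is parsed from scratch each time.
    """
    best = float("inf")
    peak = 0
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        parser(xml_text)
        elapsed = time.perf_counter() - start
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best = min(best, elapsed)
        peak = max(peak, run_peak)
    return best, peak


seconds, peak_bytes = benchmark(parse_titles, SAMPLE_FEED)
print(f"{seconds:.6f} seconds, {peak_bytes} bytes")
```

Taking the best of several runs reduces noise from whatever else the machine is doing; on a box that is not idle, a single run can be badly off.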

I tried to rip some code from Magpie and Simplepie, in the hope that I could combine it with my own parser to get something fast and Unicode-reliable. But after a few minutes of thought (while cooking) I started SharpDevelop and wrote some C# code to replace my PHP cronjob entirely (the one that parses, filters, accesses MySQL and so on). Now I'm left with a cronjob that has no problems with Unicode at all (and that without dirty tricks), runs for about 25 seconds and is started every 5 minutes (instead of running for 5+ minutes every 20 minutes). I should have done this months ago.

UPDATE!

The authors of simplepie reacted to my test. Let's check their claims, one by one, against what reality tells us.
> A blog posting at http://codeninja.de/[...] caused some FUD to be spread around about SimplePie's performance.
FUD ("fear, uncertainty and doubt") is a marketing strategy (see Wikipedia), so they are probably claiming that I wrote my article only to make Simplepie look bad. The only reason I wrote this article was to comment on the disproportionate claim on the Simplepie website ("Faster than a speeding bullet"). If you claim to be this fast, you have to be faster than your competitors, not slower. Or you should think about changing your over-the-top marketing claims.
> The feeds he DID test were remarkably "jacked-up"... to the point that you're not likely to come across in 99% of test cases.
The test data is a totally random sample of the RSS data you get when you visit these feeds:
http://technorati.com/tag/ninja
http://del.icio.us/tag/ninja
http://news.google.com/news?q=ninja&ie=UTF-8
http://blogmarks.net/tag/ninja
http://search.msn.de/results.aspx?FORM=MSNH&CP=1252&q=ninja
http://search.yahoo.com/search?p=ninja
http://blogs.icerocket.com/search?q=ninja
http://feedster.com/search.php?q=ninja&sort=date&ie=UTF-8&hl=&content=full&&limit=15
http://www.blogdigger.com/search?q=ninja
http://www.plazoo.com/search/ninja.htm
http://blogg.de/tag/ninja.htm
http://www.findarticles.com/p/search?qt=ninja&qf=free&tb=art
http://www.furl.net/furled.jsp?topic=ninja
http://flickr.com/photos/tags/ninja
http://blogsearch.google.com/blogsearch?hl=en&q=ninja&btnG=Search+Blogs
http://video.google.com/videosearch?q=ninja+is%3Afree&page=1&lv=0&so=0
I would say that this very much reflects the reality a parser has to face in the depths of the internet. I don't know why they claim that this data is so extremely special that you come across it in only 1% of all feeds. To me it even sounds like they are suggesting that I created this data only to make Simplepie look bad (FUD, you know?). Simply aggregate the listed URLs yourself, pick some random entries, and you will end up with a test feed like the one I generated. I repeat: this is what reality gives us. This time I even included a copy of kottke.org's feed to disprove the claim that my feed examples were too jumbled.
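The "aggregate and pick random entries" recipe could look roughly like the following. This is a Python sketch under loud assumptions, not the script that produced the test files: `build_test_feed` is a hypothetical helper, the feeds are assumed to be already downloaded as text, and only plain RSS 2.0 `<item>` elements are handled (namespaced Atom or RDF entries would be skipped).

```python
import random
import xml.etree.ElementTree as ET


def build_test_feed(feed_texts, count, seed=None):
    """Pick `count` random <item> elements from several RSS 2.0 documents
    and wrap them in one new RSS document, returned as a string."""
    items = []
    for text in feed_texts:
        items.extend(ET.fromstring(text).iter("item"))
    # Never ask for more items than were actually collected.
    picked = random.Random(seed).sample(items, min(count, len(items)))

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "random test feed"
    channel.extend(picked)
    return ET.tostring(rss, encoding="unicode")


# Two tiny placeholder feeds standing in for the downloaded search results.
feed_a = ("<rss version='2.0'><channel>"
          "<item><title>a1</title></item><item><title>a2</title></item>"
          "</channel></rss>")
feed_b = "<rss version='2.0'><channel><item><title>b1</title></item></channel></rss>"

print(build_test_feed([feed_a, feed_b], count=2, seed=1))
```

Passing a fixed `seed` makes the sample repeatable, which matters if several parsers are supposed to be compared on exactly the same input.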
> He only tested pure speed between SimplePie and MagpieRSS... nothing else (there are other important factors besides speed alone).
And those factors are? Both classes use PHP arrays to access the data (after parsing), use iconv and the like to convert the data (while parsing), use serialization for caching and so on. I'm really interested in what dark magic is lurking in the depths of Magpie or Simplepie that could heavily influence the footprint of normal usage (that is: parsing or reading the cache, then outputting the result or writing it to a database).
> I'd also like to see the shootout between more than just SimplePie and MagpieRSS. If we want to stick to PHP, we could go with SimplePie, MagpieRSS, CaRP/Grouper, and lastRSS.
I did include lastRSS this time (test file), but not CaRP, because the author of CaRP wants me to subscribe to "7 ways to turn RSS into R$$" (which I'm surely not doing) before I can download it. I'm very sure that lastRSS is the very top of the parsing speed you can achieve, because it's pretty minimalistic and featureless.

Results (again)

Anyway, I benchmarked again, and I can't reproduce the numbers they list on their website. I used the same PHP and RSS files listed above on a (100% idle) machine with PHP 5.2, and below are the results I keep getting.

         Simplepie 1.0 beta 3               Magpie 0.72                       lastRSS 0.9.1
rss1     15.539 seconds / 1,601,120 bytes   2.327 seconds / 1,037,576 bytes   1.967 seconds / 772,432 bytes
rss2     1.677 seconds / 170,512 bytes      0.439 seconds / 144,168 bytes     0.346 seconds / 123,216 bytes
kottke   810 ms / 73,120 bytes              157 ms / 76,400 bytes             88 ms / 69,720 bytes

To be honest: this was to be expected. Seeing massive speed improvements in software that has stepped up one beta notch is not very common. And this time I won't just repeat myself; I'm taking the unusual step of saying: DON'T TRUST MY NUMBERS, TRY IT YOURSELF!
> Even if SimplePie sucks in an area, it would let us know what areas to work on, so having a real, valid, FUD-less shootout would be in our best interests.
Yes. You should work on your code instead of claiming that the test data was unfair, or trying to damage my credibility by claiming that the main aim of this article was to spread FUD about Simplepie.
Created on 28.09.2006