I parse RSS a lot. My newsbot (which automatically finds new news sources) parses around 12000 feeds each day and heavily filters the results using Bayes and hidden Markov models. If I can shave a few seconds of runtime off somewhere, I'm willing to try it, because the scripts run on a 266 MHz P2 machine (which consumes nearly no power at all).
After reading all the buzz about Simplepie (“Faster than a speeding bullet”) I decided to give it a try. After trying to break both pies (using invalid feeds, invalid unicode and so on) I decided to benchmark them. I tested Magpie 0.72 and Simplepie 1.0b2.
Magpie sometimes has problems with unicode, seems to have no documentation at all, parses even really fucked-up feeds, and is widely used. You can also feed it raw text, not only URLs. It uses the widely used Snoopy class to fetch HTTP data. It caches serialized data as a file (the filename is the MD5 hash of the feed URL).
Simplepie seems to cope with unicode in all cases, has bloated but existing documentation and seems to parse most of the trash I gave it. It can only work on a URL basis, with no chance to use proxies or Last-Modified tricks. It uses fopen(). It caches serialized data as a file (the filename is the urlencoded feed URL). It has been hyped on digg and elsewhere.
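To make the two cache-naming schemes concrete, here is a minimal sketch in Python (not PHP, and not taken from either library). The function names and the example URL are made up for illustration, and `quote_plus` only approximates PHP's `urlencode()`.

```python
import hashlib
from urllib.parse import quote_plus


def md5_cache_name(feed_url: str) -> str:
    """Fixed-length name: the hex MD5 digest of the feed URL (Magpie-style)."""
    return hashlib.md5(feed_url.encode("utf-8")).hexdigest()


def urlencoded_cache_name(feed_url: str) -> str:
    """Readable name: the URL with reserved characters percent-encoded
    (Simplepie-style). The length grows with the URL."""
    return quote_plus(feed_url)


# Hypothetical feed URL, for illustration only.
example_url = "http://example.com/feed?x=1"
md5_name = md5_cache_name(example_url)
encoded_name = urlencoded_cache_name(example_url)
```

The MD5 variant always gives a 32-character name, while the urlencoded variant stays human-readable but can get very long for search-style URLs with many parameters.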
These are the two test-files i used (exported from the newsbot):
These are the two test-scripts i ran:
| Test | Magpie 0.72 | Simplepie 1.0b2 |
|-------|-------------|-----------------|
| test1 | 25 seconds | ??? seconds |
| test2 | 5 seconds | 67 seconds |
I stopped test1 for Simplepie after 5 minutes, because it consumed way too much memory and hogged my machine heavily.
Looks like Simplepie is more like “Slower than a mule on dope”. Magpie is so much faster than Simplepie that it disqualifies Simplepie. I wonder whether anyone who hyped Simplepie as turbo-fast has ever tried to benchmark it against Magpie. Try it yourself, all needed files are above. These results are uncached; I did not compare cached results because both classes use the same caching method.
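For anyone who wants to repeat this kind of uncached comparison, a wall-clock timing harness could look roughly like the sketch below. It is written in Python and is not the test script I ran. `count_items` is a made-up stand-in for "a parser", and `SAMPLE_FEED` is invented test data.

```python
import time
import xml.etree.ElementTree as ET

# Invented two-item feed, only here so the sketch is self-contained.
SAMPLE_FEED = """<rss version="2.0"><channel><title>t</title>
<item><title>a</title></item><item><title>b</title></item>
</channel></rss>"""


def count_items(xml_text):
    """Stand-in parser: parse the XML and count the <item> elements."""
    root = ET.fromstring(xml_text)
    return len(root.findall("./channel/item"))


def time_parser(parse, documents):
    """Run parse() over every document once and return (results, seconds).

    Nothing is cached between calls, so this measures pure parsing time.
    """
    start = time.perf_counter()
    results = [parse(doc) for doc in documents]
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = time_parser(count_items, [SAMPLE_FEED] * 3)
print(f"parsed {len(results)} feeds in {elapsed:.4f} seconds")
```

To compare two parsers, call `time_parser` once per parser with the same list of documents and compare the elapsed times.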
I tried to rip some code out of Magpie and Simplepie, hoping I could combine it with my own parser to get something fast and unicode-reliable. But after thinking about it for a few minutes (while cooking) I started SharpDevelop and wrote some C# code to replace my PHP cron job entirely (the one that parses, filters, accesses MySQL and so on). Now I'm left with a cron job that has no problems with unicode at all (and without any dirty tricks), runs for about 25 seconds and is started every 5 minutes (instead of running for more than 5 minutes every 20 minutes). I should have done this months ago.
“A blog posting at http://codeninja.de/[...] caused some FUD to be spread around about SimplePie's performance.”

FUD ("Fear, uncertainty, doubt") is a marketing strategy (see Wikipedia), so they are probably claiming that I wrote my article only to make Simplepie look bad. The only reason for me to write this article was to comment on the disproportionate claims on the Simplepie website ("Faster than a speeding bullet"). If you claim to be this fast, you have to be faster than any competitor, not slower. Or you should think about changing your over-the-top marketing arguments.
“The feeds he DID test were remarkably "jacked-up"... to the point that you're not likely to come across in 99% of test cases.”

The test data is a totally random cut from the RSS data you get when you visit these feeds:
http://technorati.com/tag/ninja
http://del.icio.us/tag/ninja
http://news.google.com/news?q=ninja&ie=UTF-8
http://blogmarks.net/tag/ninja
http://search.msn.de/results.aspx?FORM=MSNH&CP=1252&q=ninja
http://search.yahoo.com/search?p=ninja
http://blogs.icerocket.com/search?q=ninja
http://feedster.com/search.php?q=ninja&sort=date&ie=UTF-8&hl=&content=full&&limit=15
http://www.blogdigger.com/search?q=ninja
http://www.plazoo.com/search/ninja.htm
http://blogg.de/tag/ninja.htm
http://www.findarticles.com/p/search?qt=ninja&qf=free&tb=art
http://www.furl.net/furled.jsp?topic=ninja
http://flickr.com/photos/tags/ninja
http://blogsearch.google.com/blogsearch?hl=en&q=ninja&btnG=Search+Blogs
http://video.google.com/videosearch?q=ninja+is%3Afree&page=1&lv=0&so=0

I would say that this very much reflects the reality a parser has to face in the depths of the internet. I don't know why they try to claim that this data is so extremely special that you come across it in only 1% of all feeds. To me it even sounds like they are suggesting I created this data only to make Simplepie look bad (FUD, you know?). Simply aggregate the listed URLs yourself, pick some random entries, and you will end up with a test feed like the one I generated. I repeat: this is what reality gives us. This time I even included a copy of kottke.org's feed (copy) to disprove the claim that my feed examples were too jumbled.
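The "aggregate, then pick random entries" recipe can be sketched as follows. This is Python, not the export code of my newsbot. It assumes the feed texts are already downloaded and are well-formed RSS 2.0, and `sample_items` and `build_test_feed` are names invented for this example.

```python
import random
import xml.etree.ElementTree as ET


def sample_items(feed_xml_texts, per_feed, seed=None):
    """Pick up to per_feed random <item> elements from each RSS 2.0 text."""
    rng = random.Random(seed)
    picked = []
    for text in feed_xml_texts:
        items = ET.fromstring(text).findall("./channel/item")
        picked.extend(rng.sample(items, min(per_feed, len(items))))
    rng.shuffle(picked)  # mix entries from different sources together
    return picked


def build_test_feed(items, title="mixed test feed"):
    """Wrap the picked items in a fresh <rss><channel> and serialize it."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")
```

A strict XML parser like this one rejects malformed input outright, so the sketch only covers the well-formed part of what those URLs return. The broken rest is exactly what a feed parser has to survive in practice.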
“He only tested pure speed between SimplePie and MagpieRSS... nothing else (there are other important factors besides speed alone).”

And the factors are? Both classes use PHP arrays to access the data (after parsing), iconv and the like to convert the data (while parsing), serialization for caching, and so on. I'm really interested in what dark magic is lurking in the depths of Magpie or Simplepie that can heavily influence the footprint of normal usage (that is: parsing or reading the cache, then outputting the result or writing it to a database).
“I'd also like to see the shootout between more than just SimplePie and MagpieRSS. If we want to stick to PHP, we could go with SimplePie, MagpieRSS, CaRP/Grouper, and lastRSS.”

I did include lastRSS this time (testfile), but not CaRP, because the author of CaRP wants me to subscribe to "7 ways to turn RSS into R$$" (which I'm surely not doing) before I can download it. I'm very sure that lastRSS is about the top parsing speed you can achieve, because it's pretty minimalistic and featureless.
“Even if SimplePie sucks in an area, it would let us know what areas to work on, so having a real, valid, FUD-less shootout would be in our best interests.”

Yes. You should work on your code instead of claiming that the test data was unfair, or trying to damage my credibility by claiming the main aim of this article was to spread FUD about Simplepie.