Okay, Give Hpricot 0.2 a Go
This time I’m giving a balloon out which can be used for quick testing.
http://balloon.hobix.com/hpricot
Or, if you want to install Hpricot 0.2:
gem install hpricot --source code.whytheluckystiff.net
So the Hpricot parser is basically complete. There’s still lots of fiddling ahead: it doesn’t handle JavaScript whatsoever and it’s not yet as flexible as HTree. However, it does fix a lot of HTML that RubyfulSoup and the htmltools won’t.
Here’s a benchmark parsing the Boing Boing home page fifty times. It’s a good page to test because it’s big and there are some bogus end tags, old-style tables and break tags.
                    user     system      total        real
hpricot:       10.515625   0.000000  10.515625 ( 10.610571)
scrapi:        32.546875   0.093750  32.640625 ( 32.923535)
htree:         56.609375   0.023438  56.632812 ( 57.096530)
rubyfulsoup:   29.289062   0.046875  29.335938 ( 29.586510)
mechanize:(*) 148.132812   1.101562 149.234375 (150.621922)
htmltok:(*)    19.632812   0.007812  19.640625 ( 19.795446)
(*) These libs are a bit more primitive, focusing only on reading documents; no calls are given for modifying documents.
The mechanize benchmark parses and converts to a REXML document, since mechanize itself only gives you links, form elements, nothing complex. So this may be unfair.
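For anyone who wants to run a comparison of their own, here is a minimal sketch of a harness built on Ruby's standard Benchmark module, which prints the same user/system/total/real columns as the table above. The two "parsers" are hypothetical stand-ins (plain string operations) and the HTML is a tiny made-up document, not the Boing Boing page; swap in real library calls and a saved page to get meaningful numbers.

```ruby
require 'benchmark'

# Stand-in document (an assumption); the benchmark above used a real home page.
html = "<html><body>" + ("<p class='posted'>hi</p>" * 60) + "</body></html>"

# Hypothetical "parsers": each is a callable that takes an HTML string.
# Replace these lambdas with calls into the libraries being compared.
parsers = {
  'scan-tags' => lambda { |doc| doc.scan(/<[^>]+>/) },
  'split'     => lambda { |doc| doc.split('<') }
}

runs = 50 # parse the page fifty times, as in the post

# Benchmark.bm prints one labelled row per report and returns the timings.
results = Benchmark.bm(12) do |x|
  parsers.each do |name, parser|
    x.report("#{name}:") { runs.times { parser.call(html) } }
  end
end
```

Each entry in `results` is a `Benchmark::Tms` object, so the `real` column can also be read programmatically with `results.first.real`.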
I didn’t include scrapi because, although it parses the page, it fails some of my other tests. For example, when using a selector to find all p.posted elements, I get back only one element with scrapi, when the others all report back sixty elements. So, I’ll post a benchmark when I understand what I’m doing wrong.
Update: Thanks to assaf, I got scrapi working with libtidy and reporting back the right answers. Thankya!
Update #2: An htmltokenizer benchmark.
Dan W
Excellent work. Who’s going to be the first to make a Ruby version of Pornolize then?
anon
Sounds great, I’ll give it a try on http://news.bbc.co.uk – RubyfulSoup handled that so slowly last time I tried.
netghost
I don’t know what I’ll use it for… but I have a deep desire to play with this. Much love for the JQuery style expressions! They made me want to use JQuery… but I couldn’t figure out what I’d use it for.
FlashHater
W00t! Balloon is truly useful!
serg
So, it works if I pass in an IO object to Hpricot.parse (from either a file or a url, like the balloon). It doesn’t work if I pass in a string (i.e., the name of the file to parse, like the example.)
assaf
For scrapi, you’ll have to use Tidy for now. The non-tidy parser doesn’t deal with bad HTML, which is why I’m looking for an alternative that can clean HTML well and fast.
With today’s code drop you can do something like:
why
serg: Ohhh, you’re right. That example was totally wrong! Hpricot.parse takes an HTML string or an IO object containing HTML.
assaf: Hurray, that works.
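To make the string-versus-IO point concrete, here is a small sketch of how a function can accept either kind of input. `html_source` is a hypothetical helper written for this illustration, not part of Hpricot, and it may not match how Hpricot.parse works internally. It also shows the pitfall serg ran into: a file name is just a string, so it is treated as the markup itself rather than as a path to open.

```ruby
require 'stringio'

# Hypothetical helper (not Hpricot's code): return the HTML text whether
# the caller hands over a String or an IO-like object that responds to #read.
def html_source(input)
  input.respond_to?(:read) ? input.read : input.to_s
end

from_string = html_source("<p>hello</p>")
from_io     = html_source(StringIO.new("<p>hello</p>"))

# A file name is only a string, so it comes back unchanged;
# nothing here opens the file on disk.
from_name = html_source("index.html")

puts from_string
puts from_io
puts from_name
```

To parse a file under this convention, the caller opens it first and passes the resulting IO object.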
josh
Just to ask a really dumb question, but if I have an Element, how can I get the text found within the tag?
Thanks for the library and sorry for the question.
why
For now, you’ll need to loop through the children of the Element. Some of those will be Text objects which have a content property containing the string.
The next version will have an innerHTML property on every element.
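The loop described above might look roughly like the following sketch. `TextNode` and `ElemNode` are stand-in structs invented for this example so that it runs without the gem; they only mimic the `children` and `content` properties mentioned in the comment and are not Hpricot's real classes. The helper also descends into nested elements, which goes one step beyond a plain loop over direct children.

```ruby
# Stand-in node types for illustration only (assumptions, not Hpricot classes).
TextNode = Struct.new(:content)
ElemNode = Struct.new(:name, :children)

# Collect the text inside an element by walking its children:
# text nodes contribute their content, nested elements are walked recursively.
def text_of(node)
  node.children.map do |child|
    child.is_a?(TextNode) ? child.content : text_of(child)
  end.join
end

# A tiny hand-built tree standing in for <p>Hello, <b>world</b>!</p>
para = ElemNode.new('p', [
  TextNode.new('Hello, '),
  ElemNode.new('b', [TextNode.new('world')]),
  TextNode.new('!')
])

puts text_of(para)
```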
Jerome
Is there a reason there was no comparison with HTMLTokenizer? Or is that because it’s not even in the same ballpark as the rest?
Hank
why
Jerome: Okay. Htmltokenizer is pretty quick, but read-only. But I’m really glad you mentioned this one, because I could offer access to the Hpricot tokenizer, which would speed things up by literally an order of magnitude.
In fact, you can already get access to this by using Hpricot.scan.
Which gives you back:
Basically: (1) a symbol describing the element type, (2) the tag name or text content, (3) an attributes hash, and (4) the raw string which formed this token.
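As a rough illustration of that four-part shape, the sketch below works on a few hand-written tokens. The token values and the type symbols (`:stag`, `:text`, `:etag`) are assumptions made up for the example, not captured scanner output, and no Hpricot call is involved.

```ruby
# Made-up tokens in the four-part shape described above:
# [type symbol, tag name or text, attributes hash (or nil), raw source string]
tokens = [
  [:stag, 'p',     { 'class' => 'posted' }, "<p class='posted'>"],
  [:text, 'Hello', nil,                     'Hello'],
  [:etag, 'p',     nil,                     '</p>']
]

# Array destructuring in the block parameters names the four parts.
start_tags = []
tokens.each do |type, name_or_text, attrs, _raw|
  start_tags << name_or_text if type == :stag
  puts "#{type}: #{name_or_text} #{attrs.inspect}"
end

# Joining the raw strings gives back the original markup.
source = tokens.map { |_type, _name, _attrs, raw| raw }.join
puts source
```

Because every token carries its raw string, a read-only consumer can filter or count tokens without ever building a document tree.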
The scanning stage is easy. It’s the figuring out the layout of the document and coercing wellformedness that’s the spiny one.
why
anon: news.bbc.co.uk was broken (for me) in Hpricot 0.2, but it’s working in trunk. So is McSweeney’s (awful HTML.) More, more, any more really really bad HTML sites I can use?
thomas
Found a “bug”. The Scanner fails when it encounters
<!---->
msg = “negative string size (or size too big) (ArgumentError)”
thomas
hey Preview shows something else than the actual Comment! Well anyways, the scanner fails when it encounters an empty HTML Comment. See if this is right
<!---->
why
thomas: That little oddity is fixed in trunk. McSweeney’s has one of those suckers.
thomas
Tried trunk, didn't work.
svn co https://code.whytheluckystiff.net/svn/hpricot/trunk hpricot
cd hpricot
rake install
"Successfully installed hpricot, version 0.2"
require 'rubygems'
require_gem 'hpricot', ">=0.2"
doc = Hpricot.parse("<!---->")
Fails .. missing something?
thomas
sorry these comments are killing me, I should RTFM
why
Oh, do:
You’ll need Ragel installed to build the new scanner.
thomas
Thanks, that did it.
Not sure if that is of any use to you. But I needed it: http://rafb.net/paste/results/bVlGWd11.html
doc.get_elements_by_tag_name('h3').each { |tag| puts tag.inner_text }
luke redpath
Great little library – love it. I’ve written a small extension for Test::Unit that lets you test your Rails views using hpricot instead of the clunky assert_tag function.
Hpricot Test Extension for Rails
trans
Clearly Hpricot is for HTML, but how might it fare with strict XML?
need for speed
In the benchmark I miss a comparison with ruby-libxml!
jm
so does this not work on windows?
probablyCorey
If you can parse this site then Hpricot is the magical!
why
probablyCorey: Oh, wow, that is hideous. Three nested HTML pages. Hpricot does it, but I really don’t know what’s correct in this case.
not
Any chance of making a “pure” ruby version?
why
Binaries will be out in 0.5. Watch the map.
ryan
I found a borken page for you. At least it broke Hpricot 0.3.
Broken
`build_node': [bug] unknown structure: [:xmlprocins, "@include(\"ocregister/includes/global/login_table.php\");", nil, nil] (Exception)
mae
will hpricot work in ruby 1.8.2?