RedHanded

Okay, Give Hpricot 0.2 a Go

by why in inspect

This time I’m giving a balloon out which can be used for quick testing.

http://balloon.hobix.com/hpricot

Or, if you want to install Hpricot 0.2:

gem install hpricot --source code.whytheluckystiff.net

So the Hpricot parser is basically complete. There’s still lots of fiddling ahead: it doesn’t handle JavaScript whatsoever and it’s not yet as flexible as HTree. However, it does fix a lot of HTML that RubyfulSoup and the htmltools won’t.

Here’s a benchmark parsing the Boing Boing home page fifty times. It’s a good page to test because it’s big and it has some bogus end tags, old-style tables and break tags.

                     user     system      total        real
 hpricot:       10.515625   0.000000  10.515625 ( 10.610571)
 scrapi:        32.546875   0.093750  32.640625 ( 32.923535)
 htree:         56.609375   0.023438  56.632812 ( 57.096530)
 rubyfulsoup:   29.289062   0.046875  29.335938 ( 29.586510)
 mechanize:(*) 148.132812   1.101562 149.234375 (150.621922)
 htmltok:(*)    19.632812   0.007812  19.640625 ( 19.795446)

(*) These libs are a bit more primitive: they focus only on reading documents and give no calls for modifying them.

The mechanize benchmark parses and converts to a REXML document, since mechanize itself only gives you links, form elements, nothing complex. So this may be unfair.

I didn’t include scrapi because, although it parses the page, it fails some of my other tests. For example, when using a selector to find all p.posted elements, I get back only one element with scrapi, when the others all report back sixty elements. So, I’ll post a benchmark when I understand what I’m doing wrong.

Update: Thanks to assaf, I got scrapi working with libtidy and reporting back the right answers. Thankya!

Update #2: Added an htmltokenizer benchmark.
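For anyone wanting to run a comparison like the one above, here is a rough sketch of the shape of such a benchmark using Ruby's standard Benchmark module. The sample markup and the two "parsers" are stand-ins invented for illustration, not the libraries from the table; swap in real calls such as Hpricot.parse where those gems are installed.

```ruby
require 'benchmark'

# Hypothetical stand-in for a saved copy of the page under test.
html = "<html><body>" + ("<p class='posted'>post</p><br>" * 60) + "</body></html>"

# Hypothetical parsers: each is just a callable that takes an HTML string.
# Replace these lambdas with real library calls to compare actual parsers.
parsers = {
  "scan-tags" => lambda { |doc| doc.scan(/<[^>]+>/) },
  "split-lt"  => lambda { |doc| doc.split("<") }
}

runs = 50 # the post parses the page fifty times

results = {}
Benchmark.bm(14) do |bm|
  parsers.each do |name, parser|
    # report prints user/system/total/real and returns the timing object
    results[name] = bm.report("#{name}:") { runs.times { parser.call(html) } }
  end
end
```

The printed columns follow the same user, system, total, real layout as the table in the post.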

said on 05 Jul 2006 at 13:21

Excellent work. Who’s going to be the first to make a Ruby version of Pornolize then?

said on 05 Jul 2006 at 13:38

Sounds great, I’ll give it a try on http://news.bbc.co.uk – RubyfulSoup handled that so slowly last time I tried.

said on 05 Jul 2006 at 13:46

I don’t know what I’ll use it for… but I have a deep desire to play with this. Much love for the jQuery style expressions! They made me want to use jQuery… but I couldn’t figure out what I’d use it for.

said on 05 Jul 2006 at 14:56

W00t! Balloon is truly useful!

said on 05 Jul 2006 at 17:25

So, it works if I pass in an IO object to Hpricot.parse (from either a file or a url, like the balloon). It doesn’t work if I pass in a string (i.e., the name of the file to parse, like the example.)

said on 05 Jul 2006 at 17:26

For scrapi, you’ll have to use Tidy for now. The non-tidy parser doesn’t deal with bad HTML, which is why I’m looking for an alternative that can clean HTML well and fast.

With today’s code drop you can do something like:
# Set it to use Tidy.
Scraper::Base.tidy_options({})

# Define a scraper.
boing_boing = Scraper.define do
  array :posts
  process "p.posted", :posts=>:node
  result :posts
end

# Scrape away!
puts boing_boing.scrape(html).size
said on 05 Jul 2006 at 18:44

seg: Ohhh, you’re right. That example was totally wrong! Hpricot.parse takes an HTML string or an IO object containing HTML.
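To make the string-versus-IO distinction concrete, here is a small sketch that does not call Hpricot at all. The html_text helper is hypothetical; it only shows how a function accepting either kind of input might get at the markup, and why a bare filename string ends up treated as markup instead of being opened.

```ruby
require 'stringio'

# Hypothetical helper (not part of Hpricot): return the HTML text whether
# the caller hands over a String of markup or an IO-like object.
def html_text(input)
  input.respond_to?(:read) ? input.read : input.to_s
end

from_string = html_text("<p>hi</p>")
from_io     = html_text(StringIO.new("<p>hi</p>"))

# A filename passed as a plain string is never opened; it is taken as markup.
from_name   = html_text("index.html")
```

So to parse a file on disk, pass open("index.html") or the file's contents, not the name.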

assaf: Hurray, that works.

said on 05 Jul 2006 at 19:28

Just to ask a really dumb question, but if I have an Element, how can I get the text found within the tag?

Thanks for the library and sorry for the question.

said on 05 Jul 2006 at 20:17

For now, you’ll need to loop through the children of the Element. Some of those will be Text objects which have a content property containing the string.

The next version will have an innerHTML property on every element.
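A rough sketch of that loop, for the impatient. TextNode and ElemNode below are stand-in structs made up for this example so it runs without Hpricot; the only things assumed from the description above are that elements have children and that text nodes carry a content string. Real Hpricot class names may differ.

```ruby
# Stand-ins for illustration only; not Hpricot's actual classes.
TextNode = Struct.new(:content)
ElemNode = Struct.new(:name, :children)

# Walk an element's children, gathering the content of every text node.
def text_of(node)
  return node.content if node.is_a?(TextNode)
  (node.children || []).map { |child| text_of(child) }.join
end

para = ElemNode.new("p", [
  TextNode.new("Give "),
  ElemNode.new("b", [TextNode.new("Hpricot")]),
  TextNode.new(" a go")
])

text = text_of(para)
puts text
```

The recursion is there because text can sit inside nested elements, not only directly under the element you start from.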

said on 06 Jul 2006 at 01:24

Is there a reason there was no comparison with HTMLTokenizer? Or is that because it’s not even in the same ballpark as the rest?

said on 06 Jul 2006 at 02:03
Okay, for anyone else searching for some straight example code from the Hpricot posts, here’s some that works:
wget http://redhanded.hobix.com/index.html

require 'rubygems'
require_gem 'hpricot'
require 'open-uri'
doc = Hpricot.parse(open("index.html"))
(doc/:p/:a).each do |link|
  p link.attributes
end

said on 06 Jul 2006 at 09:10

Jerome: Okay. Htmltokenizer is pretty quick, but read-only. But I’m really glad you mentioned this one, because I could offer access to the Hpricot tokenizer, which would speed things up by literally an order of magnitude.

In fact, you can already get access to this by using Hpricot.scan.

 doc = Hpricot.scan(open("index.html")) do |token|
   p token
 end

Which gives you back:

 [:doctype, "html", {"system_id"=>"\"DTD/xhtml1-transitional.dtd\"", "publid_id"=>"PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "}, "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"DTD/xhtml1-transitional.dtd\">"]
 [:text, "\n", nil, "\n"]
 [:stag, "html", {"xml:lang"=>"en", "lang"=>"en", "xmlns"=>"http://www.w3.org/1999/xhtml"}, "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">"]
 [:text, "\n", nil, "\n"]
 [:stag, "head", nil, "<head>"]
 [:text, "\n", nil, "\n"]
 [:emptytag, "meta", {"content"=>"text/html; charset=utf-8", "http-equiv"=>"Content-Type"}, "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"]
 [:text, "\n", nil, "\n"]
 [:stag, "title", nil, "<title>"]
 [:text, "RedHanded &raquo; sneaking Ruby through the system", nil, "RedHanded &raquo; sneaking Ruby through the system"]
 [:etag, "title", nil, "</title>"]
 [:text, "\n", nil, "\n"]
 [:emptytag, "link", {"href"=>"http://redhanded.hobix.com/index.xml", "title"=>"RSS", "rel"=>"alternate", "type"=>"application/rss+xml"}, "<link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://redhanded.hobix.com/index.xml\" />"]
 [:text, "\n", nil, "\n"]
 ...

Basically: (1) a symbol describing the element type, (2) the tag name or text content, (3) an attributes hash, and (4) the raw string which formed this token.

The scanning stage is easy. It’s the figuring out the layout of the document and coercing wellformedness that’s the spiny one.
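As a sketch of what can be done with tokens in that four-part shape, the snippet below works on a hard-coded array copied from the output above, so it runs without Hpricot installed. It assumes only the layout described: type symbol, name or text, attributes hash (or nil), raw string.

```ruby
# A few tokens copied from the scanner output shown above.
tokens = [
  [:stag, "head", nil, "<head>"],
  [:text, "\n", nil, "\n"],
  [:emptytag, "meta", {"content"=>"text/html; charset=utf-8", "http-equiv"=>"Content-Type"}, "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"],
  [:stag, "title", nil, "<title>"],
  [:text, "RedHanded &raquo; sneaking Ruby through the system", nil, "RedHanded &raquo; sneaking Ruby through the system"],
  [:etag, "title", nil, "</title>"],
  [:emptytag, "link", {"href"=>"http://redhanded.hobix.com/index.xml", "title"=>"RSS", "rel"=>"alternate", "type"=>"application/rss+xml"}, "<link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://redhanded.hobix.com/index.xml\" />"]
]

tag_names = []
hrefs = []

tokens.each do |type, name, attrs, _raw|
  case type
  when :stag, :emptytag
    tag_names << name
    # attrs is nil when the tag carries no attributes
    hrefs << attrs["href"] if attrs && attrs["href"]
  end
end

# Joining the raw strings gives back the scanned source text.
source = tokens.map { |token| token.last }.join

p tag_names
p hrefs
```

Because every token keeps its raw string, a read-only pass like this can pick out what it needs and still reproduce the input untouched.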

said on 06 Jul 2006 at 12:13

anon: news.bbc.co.uk was broken (for me) in Hpricot 0.2, but it’s working in trunk. So is McSweeney’s (awful HTML.) More, more, any more really really bad HTML sites I can use?

said on 06 Jul 2006 at 12:15

Found a “bug”. The Scanner fails when it encounters &lt;!----&gt;

msg = “negative string size (or size too big) (ArgumentError)”

said on 06 Jul 2006 at 12:18

hey Preview shows something else than the actual Comment! Well anyways, the scanner fails when it encounters an empty HTML Comment. See if this is right <!---->

said on 06 Jul 2006 at 12:31

thomas: That little oddity is fixed in trunk. McSweeney’s has one of those suckers.

said on 06 Jul 2006 at 12:44

Tried trunk, didn’t work.

 svn co https://code.whytheluckystiff.net/svn/hpricot/trunk hpricot
 cd hpricot
 rake install

“Successfully installed hpricot, version 0.2”

 require 'rubygems'
 require_gem 'hpricot', ">=0.2"

 doc = Hpricot.parse("<!>")

Fails .. missing something?

said on 06 Jul 2006 at 12:45

sorry these comments are killing me, I should RTFM

said on 06 Jul 2006 at 13:05

Oh, do:

 cd hpricot
 rake ragel
 rake install

You’ll need Ragel installed to build the new scanner.

said on 06 Jul 2006 at 13:16

Thanks, that did it.

Not sure if that is of any use to you. But I needed it: http://rafb.net/paste/results/bVlGWd11.html

doc.get_elements_by_tag_name('h3').each { |tag| puts tag.inner_text }

said on 07 Jul 2006 at 06:27

Great little library – love it. I’ve written a small extension for Test::Unit that lets you test your Rails views using hpricot instead of the clunky assert_tag function.

Hpricot Test Extension for Rails

said on 08 Jul 2006 at 12:41

Clearly Hpricot is for HTML, but how might it fare with strict XML?

said on 08 Jul 2006 at 14:33

In the benchmark I miss a comparison with ruby-libxml!

said on 10 Jul 2006 at 09:08

so does this not work on windows?

said on 14 Jul 2006 at 08:48

If you can parse this site then Hpricot is the magical!

said on 18 Jul 2006 at 11:58

probablyCorey: Oh, wow, that is hideous. Three nested HTML pages. Hpricot does it, but I really don’t know what’s correct in this case.

said on 18 Jul 2006 at 18:47

Any chance of making a “pure” ruby version?

said on 19 Jul 2006 at 11:25

Binaries will be out in 0.5. Watch the map.

said on 21 Jul 2006 at 15:03

I found a borken page for you. At least it broke Hpricot 0.3.

Broken

`build_node': [bug] unknown structure: [:xmlprocins, "@include(\"ocregister/includes/global/login_table.php\");", nil, nil] (Exception)
said on 25 Jul 2006 at 01:59

will hpricot work in ruby 1.8.2?
