
RedHanded

Hpricot the OhFourth

by why in inspect

Well, here’s a new release of Hpricot: 0.4. This I didn’t expect. Thank the persistent fellows who kept hitting my inbox. They’re cited in the CHANGELOG for sending me all manner of palsied HTML with quoting all askew and tilted.

In fact, did you know that you can give Hpricot a plain text file, say one that has just a few HTML snips in it, and you can alter those snips and then output the page again and it just works like that? This child’s only five weeks old still, so there are still encoding and entity and namespace matters to see to.

To install: gem install hpricot. Win32 and source gems.

What Does One Do With Hpricot?

  • To learn about using Hpricot, try AnHpricotShowcase, which gives a bunch of common examples.
  • Christoffer Sawicki’s Feedalizer uses Hpricot to turn plain HTML pages into feeds. See?
  • WWW::Mechanize trunk now uses Hpricot for its automated browsing.
  • And Zed’s RFuzz site has some Hpricot sample code, if you’re ready to ditch Net::HTTP for some superior socketry. (See also: RFuzz::Browser!)
said on 11 Aug 2006 at 15:13

Very cool, _why. I’m using Hpricot to scrape dirty, dirty insurance company websites for data extraction. It handles them faster and better than other solutions I’ve tried.

said on 11 Aug 2006 at 16:01

Digging with Hpricot is just plain fun.

Thank you, _why.

said on 11 Aug 2006 at 18:15

I knew something was up with Hpricot. I had just set my wallpaper to the Hpricot logo!

I use Hpricot to scrape the Second Life status page during downtimes, and use mpd.rb to activate my music when it comes up/goes down.

said on 11 Aug 2006 at 20:03

Hrm, I really should start using Hpricot.

I was going over the HpricotBasics page and I noticed something that seems incorrect. It says you can re-search an Hpricot::Elements, which works, but the example given returns another Hpricot::Elements as the result of the sub-search. When I try it, I get a simple array back.

said on 11 Aug 2006 at 20:37

Okay, [43], thanks kballard.

said on 11 Aug 2006 at 23:31

Nice.

I wonder if we’re overly enamored with the xpath syntax. I normally hate method_missing, but perhaps it has its place here. Instead of:
doc/:html/:body/:p/:img
How about?
doc.html.body.p.img
Too CGI ? Other problems?
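One way to picture the proposal, and one of the "other problems" it might run into, is the plain-Ruby sketch below. `Node` and `NodeSet` are hypothetical stand-ins invented for this illustration, not Hpricot classes; the sketch only assumes that a result set forwards unknown method names to a child-tag lookup.

```ruby
# Hypothetical element type for illustration; not part of Hpricot.
class Node
  attr_reader :name, :children

  def initialize(name, *children)
    @name = name
    @children = children
  end
end

# Hypothetical result set that treats an unknown method name as a child tag name.
class NodeSet
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # doc.html is read as "children of every node in the set named html".
  def method_missing(tag, *args)
    return super unless args.empty?
    matches = @nodes.flat_map do |node|
      node.children.select { |child| child.name == tag.to_s }
    end
    NodeSet.new(matches)
  end
end

img     = Node.new("img")
map_tag = Node.new("map")
tree    = Node.new("html", Node.new("body", Node.new("p", img), map_tag))
doc     = NodeSet.new([Node.new("document", tree)])

# Same path as doc/:html/:body/:p/:img, spelled with method calls.
found = doc.html.body.p.img.to_a

# Name clash: Enumerable#map is a real method, so the <map> tag is never looked up.
clash = doc.html.body.map
```

In this sketch `found` holds the single `img` node, while `clash` is an Enumerator rather than a set of `<map>` elements. Any tag whose name matches an existing public method (`map`, `select`, and so on) would need an escape hatch, which the `/` syntax avoids.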
said on 12 Aug 2006 at 08:44

Sounds cool.

I’m on MacOS X 10 .7. I tried to install, told me it couldn’t find ragel and rlcodegen. OK, I’ll start with ragel. I tried using DarwinPorts, that dies trying to ./configure bison … Ok so I just load the archive from here: http://www.cs.queensu.ca/home/thurston/ragel/ragel-5.11.tar.gz … it tells me I need bison, flex, gperf … I’ve already got bison and flex installed … so I get an archive of gperf here: http://ftp.wayne.edu/pub/gnu/gperf/gperf-3.0.2.tar.gz … but I get this error: “checking whether the C compiler works… configure: error: cannot run C compiled programs. If you meant to cross compile, use `--host'.”

Has anybody else had this problem?

Thanks

said on 12 Aug 2006 at 10:00

stephenb: How were you trying to install? Using the gem or from trunk?

For those who want to install from source: hpricot-0.4.tgz.

said on 12 Aug 2006 at 13:02

From subversion:

svn co https://code.whytheluckystiff.net/svn/hpricot/trunk hpricot
=> ... Checked out revision 43.
rake
=> sh: line 1: rlcodegen: command not found
   sh: line 1: ragel: command not found

That’s when I started looking for ragel.

said on 12 Aug 2006 at 15:41

There appear to be some ports of gperf for Darwin (google). Also, using the FreeBSD port of ragel might work (maybe, possibly, probably not), since Darwin is built on Mach3/FreeBSD.

said on 13 Aug 2006 at 01:56

I’ve completely ported my REXML code to Hpricot and couldn’t be happier with the speed improvements :)

But now I’m facing a very embarrassing issue (as in, I should have thought of it before starting): Hpricot doesn’t substitute HTML entities :/
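A possible stopgap, sketched here with Ruby's standard `cgi` library rather than anything in Hpricot, is to decode entities yourself after pulling the text out. The `raw` string is made-up sample data standing in for text returned by the parser.

```ruby
require 'cgi'

# Made-up sample: text as it might come back with entities left untouched.
raw = "Fish &amp; Chips &lt;tasty&gt; &#169; 2006"

# CGI.unescapeHTML decodes &amp; &lt; &gt; &quot; &#39; and numeric references.
decoded = CGI.unescapeHTML(raw)

# Other named entities are outside its table and pass through unchanged.
untouched = CGI.unescapeHTML("caf&eacute;")
```

This covers only the basic entities and numeric references; pages that use the wider set of named entities (`&eacute;`, `&nbsp;`, and so on) would need a fuller lookup table.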

said on 15 Aug 2006 at 10:33

why,

In elements.rb

204         filter ":last-child" do |i|
205           self == parent.containers.first
206         end

Line 205 should be

self == parent.containers.last
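The difference between the two predicates can be seen in a small plain-Ruby sketch. `Elem` here is a hypothetical Struct invented for the illustration, with `containers` standing in for the list of child elements; it is not Hpricot's own class.

```ruby
# Hypothetical minimal element; `containers` plays the role of the element children.
Elem = Struct.new(:name, :parent, :containers)

parent = Elem.new("ul", nil, [])
items  = %w[one two three].map { |name| Elem.new(name, parent, []) }
parent.containers.concat(items)

# Identity checks (equal?) avoid comparing the circular parent/child structure.
first_child = ->(e) { e.equal?(e.parent.containers.first) }
last_child  = ->(e) { e.equal?(e.parent.containers.last) }

firsts = items.select(&first_child).map(&:name)
lasts  = items.select(&last_child).map(&:name)
```

With the `first` comparison the filter picks "one"; with `last` it picks "three", which is what a `:last-child` selector is meant to match.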

said on 18 Aug 2006 at 01:44

Wow. Just wow. It’s awesome to be able to parse a fragment or snippet into an object model.

I want my validator to be strict, but I want my parser to be loose. This works exactly how I’ve wanted something processing markup to work.

For those of us living in the real world where we can’t necessarily demand that other parties fix up their output or provide nice little parseable feeds.

Thanks for this gem, the mashup possibilities will be interesting :D

said on 18 Aug 2006 at 07:55

why: Shouldn’t the wrap() method be called wrap!() instead? I searched for the result for a while until I discovered it directly modifies the document…

Tell me if I am wrong.

cheers
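For readers unfamiliar with the convention being appealed to: in Ruby a trailing `!` usually marks the variant of a method that modifies its receiver. The sketch below shows the pairing on a hypothetical `Fragment` class made up for this note; it says nothing about how Hpricot's own wrap is implemented.

```ruby
# Hypothetical class, only to illustrate the naming convention.
class Fragment
  attr_reader :html

  def initialize(html)
    @html = html
  end

  # Non-destructive: returns a new Fragment and leaves the receiver alone.
  def wrap(tag)
    Fragment.new("<#{tag}>#{html}</#{tag}>")
  end

  # Destructive: rewrites the receiver in place and returns it.
  def wrap!(tag)
    @html = "<#{tag}>#{html}</#{tag}>"
    self
  end
end

frag    = Fragment.new("hello")
wrapped = frag.wrap("b")   # frag.html is still "hello"
frag.wrap!("i")            # frag.html is now "<i>hello</i>"
```

The convention is a courtesy rather than a rule, so a library may reasonably pick either name as long as the mutation is documented.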

said on 18 Aug 2006 at 12:00
newb question… what if I want to grab all the tags between two other tags? But they are not hierarchical. E.g.
  <h1>stuff</h1>
  <h1>diff stuff</h1>
  <p>something</p>
  <div id="a">hmmm</div>
and I want everything between the first h1 and the div with id ‘a’?
said on 18 Aug 2006 at 14:48

skwasha: I just started playing with Hpricot, so there’s probably a better way to do it, but something like this would work:


require('hpricot')
HTML = %{
  <h1>stuff</h1>
  <h1>diff stuff</h1>
  <p>something</p>
  <div id="a">hmmm</div>
}
doc   = Hpricot(HTML)
found = []
front = true
doc.traverse_all_element { |node|
  if (node.name == 'h1' and front)
    front = false
    next
  elsif (node.name == 'div' and
         node.attributes['id'] == 'a')
    next
  else
    found << node
  end
}
# do something useful here...
found.collect! { |item| item.to_html + "\n" }
puts(found)
# >> <h1>diff stuff</h1>
# >> <p>something</p>
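
The same "everything between two markers" idea can also be sketched on a flat list of siblings in plain Ruby, stopping as soon as the closing marker is reached. `Sib` is a hypothetical Struct standing in for parsed elements; how you obtain the sibling list from Hpricot is left out here.

```ruby
# Hypothetical stand-in for a parsed element: tag name, id attribute, text.
Sib = Struct.new(:name, :id, :text)

siblings = [
  Sib.new("h1",  nil, "stuff"),
  Sib.new("h1",  nil, "diff stuff"),
  Sib.new("p",   nil, "something"),
  Sib.new("div", "a", "hmmm"),
]

# Position of the opening marker: the first h1.
start = siblings.index { |node| node.name == "h1" }

# Everything after the opening marker, up to (not including) div#a.
between =
  if start
    siblings.drop(start + 1)
            .take_while { |node| !(node.name == "div" && node.id == "a") }
  else
    []
  end

texts = between.map(&:text)
```

Here `texts` comes out as the second h1 and the paragraph; nothing after the div would be collected even if more siblings followed it.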
said on 19 Aug 2006 at 16:10

_why, thanks for sharing Hpricot.

One request: can you handle non XHTML compliant tags like <image>, <hr> and <br>?

Currently:


d = Hpricot("<image src='xyz'>abc<p>a simple line</p>")
d.innerHTML

returns

d.innerHTML
=> "<image src=\"xyz\">abc<p>a simple line</p></image>" 

Thanks!

said on 19 Aug 2006 at 16:27

oops, my bad. I guess I just don’t know my HTML from <image> and <img>!

It’s really cool you do support these weird tags!

said on 20 Aug 2006 at 16:23

Where was Hpricot 5 months ago when I needed to parse 10,000 inconsistent HTML pages for content and stuff them into a database? I had Regex coming out my nose. This would have been simpler and cleaner.

said on 23 Aug 2006 at 21:26
why, you rock or “How I made an HTML scrubber in under 25 lines”.

module Hpricot
  class Elements
    def strip
      each { |x| x.strip }
    end

    def strip_attributes(safe=[])
      each { |x| x.strip_attributes(safe) }
    end
  end

  class Elem
    def strip
      parent.replace_child self, Hpricot.make(inner_html)
    end

    def strip_attributes(safe=[])
      attributes.each {|atr|
          remove_attribute(atr[0]) unless
            safe.include?(atr[0])
      } unless attributes.nil?
    end
  end
end

# remove all anchors leaving behind the text inside.
(doc/:a).strip 
# strip all attributes except for src from all images
(doc/:img).strip_attributes(['src']) 
said on 27 Aug 2006 at 23:21

Thanks for Hpricot, why_. I’ve been using it for a number of things and it’s been a lifesaver. However, I came across a parser bug which occurs when you have a large HTML table with unclosed td tags—I get a stack overflow.

Doesn’t look like the Trac accepts tickets, so I’m filing it here. ;-)

Hpricot(open('http://survey.netcraft.com/Reports/0512/').read)
