

Christoffer's Hpricot Goodies

by why in inspect

So, in what ways have you guys extended Hpricot? I really enjoy this collection of accessories to Hpricot by Christoffer Sawicki, who also wrote the Hpricot-based HTML-to-feed library called Feedalizer.

He has one script that does gsub! on all text nodes in the document. Another script is for generating tables of contents from the headers on an HTML page. I imagine that would go great with Markdown and Textile. (See also: del.icio.us/tag/hpricot.)

said on 04 Jan 2007 at 17:28

Yeah thanks goes to us for letting him share it :))

But they are nice little trinkets of code, they are. Props to Qerub, aka Christoffer!

said on 04 Jan 2007 at 19:39

I have an HTML Scrubber based on Hpricot, but I’m currently working on redoing it so that scrub is part of Hpricot instead of being a separate class.

Thanks for making it so easy.

said on 05 Jan 2007 at 02:54

Does Hpricot support OpenSearch types of XML? I am having a difficult time parsing the following:


Also, how do I deal with

http://link.com

The above is valid XML, but I can’t seem to figure it out.

Any help?

    said on 05 Jan 2007 at 03:04

    The tags didn’t show up. Let’s try again... Nope, it seems like I can’t provide an example here. Anyway, OpenSearch tags have a “:” in their tag name, like “opensearch:title”.

    Also, how do I deal with single XML tags like “br /” or “link /”?

    Thanks for your help..

    said on 05 Jan 2007 at 07:47

    Nice one Qerub, Hpricot Goodies is really useful. Thanks! ;)

    said on 05 Jan 2007 at 11:36

    Heh. Thanks for the publicity, but more important: thanks again for Hpricot!

    Yes, HTML Outliner is being used to generate tables of contents for articles. I should probably bundle some code that takes the HTMLOutliner#outline tree and returns a nested <ul> that is ready to be used.
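
A helper like that could look roughly like the sketch below. The shape of the tree returned by HTMLOutliner#outline isn't shown in this thread, so the sketch assumes a made-up one: an array of `[title, children]` pairs, where `children` is another array of the same shape. The `outline_to_ul` name is hypothetical too.

```ruby
require 'erb'

# Assumed (hypothetical) tree shape: [[title, children], ...],
# where children is another array of the same shape.
# This is NOT the documented output of HTMLOutliner#outline.
def outline_to_ul(tree)
  # An empty level produces no markup, so leaf items get no empty <ul>.
  return "" if tree.empty?

  items = tree.map do |title, children|
    # Escape the title, then recurse to nest the child list inside the <li>.
    "<li>#{ERB::Util.html_escape(title)}#{outline_to_ul(children)}</li>"
  end
  "<ul>#{items.join}</ul>"
end

tree = [
  ["Intro", []],
  ["Usage", [["Install", []], ["Run & test", []]]]
]

html = outline_to_ul(tree)
```

Nesting each child list inside its parent's `<li>` keeps the markup valid, and the recursion follows the outline depth without any special casing per header level.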

    said on 05 Jan 2007 at 11:46

    UnderpantsGnome: That would be a great addition to the main lib. I actually really like the strip methods you’ve made. What other plans do you have?

    Andrew: Send me some XML. Hpricot doesn’t have problems parsing namespaces. However, its XPath syntax doesn’t support namespaces, since it’s a hybrid of CSS and XPath.

    said on 09 Jan 2007 at 09:49

    why: I was/am basically just moving the block from HtmlScrubber, less the config, into my Hpricot additions. Mostly because then I could call it Hpricot::Scrub and that made me laugh.

    Then you could do:

    doc = Hpricot(open('http://slashdot.org/').read)

    I haven’t had any new needs for this, though I was considering making the config more like Perl’s HTML::Scrubber, where you can specify global attributes to allow/deny but also specify attributes to allow/deny at the tag level.

    Other than that I’m open to suggestions.

    Thanks again for making this so easy to accomplish, I so didn’t want to rewrite HTML::Scrubber from scratch.

    said on 11 Jan 2007 at 23:21

    why: I was playing around with Hpricot Scrub and it seems to have gotten unhappy since 0.4.86 (the last working version). I also have an image sneaking through that I don’t think should be. I have the current changes with a “test” that shows the failure on recent gems and the stray image on <= 0.4.86.

    You can grab it here if you’d like to take a look: hpricot_scrub.zip

    As usual, feedback welcome.

    said on 14 Jan 2007 at 22:47

    Well, it looks like scrub’s use of traverse_all_element is the source of the problem. If you remove stuff while it’s traversing, things end up getting skipped.

     >> a = [:cadillac, :driver, :teacup]
     >> a.each { |x| a.delete(x) }
     => [:driver]

    You know what I’m saying?
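
The same skipping can be reproduced, and avoided, with plain arrays. This is only a minimal sketch and assumes nothing about Hpricot itself; the helper names `delete_while_iterating` and `delete_from_snapshot` are made up for illustration.

```ruby
# Deleting from an array while `each` walks it shifts the later
# elements left, so the element right after each deleted one is
# never visited.
def delete_while_iterating(items)
  visited = []
  items.each do |x|
    visited << x
    items.delete(x)
  end
  visited
end

# Walking a copy (`dup`) keeps the iteration order fixed while the
# original array is emptied.
def delete_from_snapshot(items)
  visited = []
  items.dup.each do |x|
    visited << x
    items.delete(x)
  end
  visited
end

a = [:cadillac, :driver, :teacup]
skipped_run = delete_while_iterating(a)
# a still holds :driver, which was never visited

b = [:cadillac, :driver, :teacup]
full_run = delete_from_snapshot(b)
# b is now empty and all three elements were visited
```

Any approach that iterates over something other than the list being modified should behave the same way; `dup` is just the most direct way to show it.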

    said on 16 Jan 2007 at 17:23

    Oh, duh… this works better, unless you see a potential issue with this that I’m missing.

    children.reverse.each {|e|
      e.strip unless e.class == Hpricot::Text ||
