Scrapers with surfraw

Searching is more than googling

surfraw is a tool for building great "scrapers". A scraper is a tool that extracts content from the web automatically. That gets tricky with modern web apps that embed content dynamically, but what follows is "just" searching. The art of searching (see Fravia's presentation at 22C3) isn't widely known. I think people in IT especially should train that ability, or be taught it, because the job always requires searching for specific information.
Every day there's something you need to google, bing, yahoo, or use $search_engine for. The thing is: these search engines often do not work as you might expect. They're all commercial. They're all censored. And they all put the results that pay ahead of those that don't, regardless of content or context.

De-Crapification

The web is full of crap. To use it efficiently you need to filter.
These days it's like an abstract gold mine; like a digitized Wild West from the good old western movies. The better you are at digging, the richer you'll get. You can find everything! Information gathering can even be used to support social engineering, but that's not what this article is about. It's just about finding stuff in no time. With surfraw.

Let's stay on the command line. I don't know which shell you use, but I'm on zsh. If you're into bash, it should work the same way:

  #!/bin/zsh
  # Dump the rendered text of the URL given as first argument to stdout.
  lynx -dump "$1"

Lynx has an interesting feature: it can dump the rendered text of a page to stdout. Save that snippet as an executable at /usr/bin/dlynx, or anywhere else in your PATH.

  surfraw -browser="dlynx" google surfraw | head -n 50

Now run that and you'll know what surfraw is about :).

  2. [19]Surfraw - Wikipedia, the free encyclopedia
       According to its creator Julian Assange: "Surfraw provides a fast
       unix command line interface to a variety of popular WWW search
       engines and other artifacts ...

Not too bad!
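
The bracketed numbers in the dump, such as [19], point into a numbered "References" list that lynx appends at the end of its output. The following is a minimal sketch for turning that list back into plain URLs, assuming the usual lynx -dump layout. extract_links is a name made up for this example; it is not part of surfraw or lynx.

```shell
# extract_links: read `lynx -dump` output on stdin and print the URLs
# from the numbered "References" list at the end of the dump.
extract_links() {
  awk '
    /^References/ { in_refs = 1; next }        # the list starts after this heading
    in_refs && $1 ~ /^[0-9]+\.$/ { print $2 }  # entries look like "  19. http://..."
  '
}
```

Leave out the head when you use it, because the References list sits at the very end of the dump: surfraw -browser="dlynx" google surfraw | extract_links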

Surfraw supports a great number of indexes: from NASA technical reports to TPB, from Springer to OpenSearch. It's a very powerful tool. You can of course pass a different "-browser" option. Dillo is very neat and works very well on X11. Or, on OS X, use "open" to launch your default browser, which hopefully has an ad blocker and script controls.
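
If you use the same setup on several machines, the browser choice can be scripted. This is only a sketch: pick_browser is a made-up helper, and it assumes that Dillo is installed wherever you are not on OS X.

```shell
# pick_browser: print a browser command for the given kernel name,
# as reported by `uname -s`.
pick_browser() {
  case "$1" in
    Darwin) echo "open" ;;   # OS X: hand the URL to the default browser
    *)      echo "dillo" ;;  # elsewhere: assumes Dillo is installed (X11)
  esac
}
```

Use it like this: surfraw -browser="$(pick_browser "$(uname -s)")" google surfraw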

  surfraw -browser="dlynx" google -country=DE surfraw | head -n 50

That keeps the results German.

Use them all!

There's a large number of elvi, as surfraw calls its search plugins. Take this one for example:

  surfraw -browser="dlynx" freedb "pornophonique"

Take a look at the -help:

  Local options:
    -artists                      Search artists
                                  Environment: SURFRAW_cddb_artists
    -albums                       Search albums
                                  Environment: SURFRAW_cddb_albums
    -songs                        Search songs
                                  Environment: SURFRAW_cddb_songs
    -rest                         Search the rest of the data
                                  Environment: SURFRAW_cddb_rest
    -all                          Search all fields
                                  Environment: SURFRAW_cddb_all
                                  Default: search artists and albums
    -id                           Search by CDDB ID.
    -bycat                        Sort results by category
    -cat=CATEGORY                 Category to search, repeat as needed
                                  Options:
                                      all
                                      blues
                                      classical
                                      country
                                      data
                                      folk
                                      jazz
                                      misc
                                      newage
                                      reggae
                                      rock
                                      soundtrack
                                  Default: all
    -page=PAGENUM                 Start at page PAGENUM
                                  Default: 1
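
The local options go after the elvi name and can be combined. Here is a small sketch that searches song titles within a single category, using only the -songs and -cat options from the help text above. freedb_search and SURFRAW_BIN are names invented for this example; SURFRAW_BIN is not a surfraw setting, it just lets you swap in another command for a dry run.

```shell
# freedb_search CATEGORY TERMS...: search freedb song titles within one
# category and dump the result page as text through the dlynx wrapper.
# SURFRAW_BIN is a hypothetical override, e.g. SURFRAW_BIN=echo for a dry run.
freedb_search() {
  category=$1
  shift
  "${SURFRAW_BIN:-surfraw}" -browser=dlynx freedb -songs -cat="$category" "$@"
}
```

For example: freedb_search rock "i want to be a machine" | head -n 50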

So now I'm playing "pornophonique & procacci - i want to be a machine". I think you get what this is about and will find some inspiration of your own, without having to cope with all the strange web interfaces, ads and stuff.

Good luck finding stuff,
wishi
