Scraper

A scraper is a program that extracts data from a particular type of website, usually inserting this data into a local database.

What do scrapers do in SnapMap?

In SnapMap, a scraper loads a given webpage, detects whether it represents a photo, and if so creates a new photo record. It then adds tags, depending on what could be determined about the photo from the webpage. The most important of these tags are:

  • src (tag), a URL to the actual image file, ideally at a size suitable for a typical webpage (around 500x500 pixels);
  • webpage (tag), a URL to the source webpage that has just been consumed by the scraper.

Other useful tags may also be imported at this stage, such as archive, title, description, author, date, width, height and thumbnail.

In SnapMap’s UIs for adding new images, the scraper to run is chosen by regular expression: a database table relates each scraper to a regular expression describing the pattern of URL that it can process.
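The lookup described above can be sketched as follows. This is only an illustration: the $scraperPatterns array stands in for the database table, and the scraper names, the two patterns and the pickScraper() function are all hypothetical, not SnapMap's actual code.

```php
<?php
// Hypothetical stand-in for the database table relating scrapers to URL patterns.
// The names and patterns are made up for illustration.
$scraperPatterns = array(
	'flickr'    => '#^https?://(www\.)?flickr\.com/photos/[^/]+/\d+#i',
	'wikipedia' => '#^https?://([a-z-]+\.wikipedia|commons\.wikimedia)\.org/wiki/(File|Image):#i',
);

// Return the name of the first scraper whose pattern matches $url, or null if none does.
function pickScraper($url, array $patterns)
{
	foreach ($patterns as $name => $pattern) {
		if (preg_match($pattern, $url) === 1) {
			return $name;
		}
	}
	return null;
}

echo pickScraper('https://www.flickr.com/photos/someone/12345/', $scraperPatterns), "\n";
```

Returning null when nothing matches lets the calling code tell the user that no scraper exists yet for that website.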

List of SnapMap scrapers

SnapMap scrapers have been written for:

  • English Heritage Viewfinder
  • Flickr
  • From Old Books
  • Google Picasa
  • Images of England
  • Leodis
  • Library of Congress
  • National Gallery (London)
  • Wikipedia (including Wikimedia)

This means that images hosted at the above websites are particularly easy to add to SnapMap.

Writing your own scrapers (advanced)

Most websites that host large numbers of images produce HTML that can be parsed fairly easily for the photo file locations and photo metadata. Such websites are ideal candidates to have scrapers written for them, and the scrapers are usually quick to write. The difficulties come from websites that produce HTML that is inconsistent or otherwise difficult to parse. The key is to try to get into the mind of the person who wrote the script that generated the HTML...
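As a rough illustration of this kind of parsing, the sketch below pulls a page title and the first JPEG or PNG image URL out of raw HTML with regular expressions. The function name extractPhotoFields() and the sample markup are assumptions made for this example; a real website will need patterns tuned to its own HTML.

```php
<?php
// Pull a title and an image URL out of raw HTML using regular expressions.
// extractPhotoFields() is a hypothetical helper, not part of SnapMap.
function extractPhotoFields($html)
{
	$fields = array();

	// <title>...</title>, tolerant of attributes, letter case and line breaks
	if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m)) {
		$fields['title'] = trim(html_entity_decode($m[1], ENT_QUOTES, 'UTF-8'));
	}

	// First <img> whose src ends in .jpg, .jpeg or .png, in single or double quotes
	if (preg_match('#<img\b[^>]*\bsrc\s*=\s*(["\'])([^"\'>]+?\.(?:jpe?g|png))\1#i', $html, $m)) {
		$fields['src'] = html_entity_decode($m[2], ENT_QUOTES, 'UTF-8');
	}

	return $fields;
}

// Made-up sample page: a site logo followed by the actual photo
$html = '<html><head><title>Old Mill &amp; Bridge</title></head>'
      . '<body><img src="/icons/logo.gif">'
      . '<img class="photo" src="http://example.org/photos/mill.jpg"></body></html>';
print_r(extractPhotoFields($html));
```

A field that cannot be found is simply left out of the result, so the caller can decide whether the page represents a photo at all.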

In SnapMap, a scraper is a PHP script having a scrape() function that returns a $properties array. Here’s the idea:

function scrape ($source_url)
{
	// Load the contents of $source_url into $source using this function that is already defined
	// - for testing, write your own getRemoteFile() using PHP's curl library or other techniques
	$source = getRemoteFile ($source_url);
	
	// initialize $properties array
	$properties = array();
	$properties['_extra_tags'] = array();

	// Your code here extracts data from the returned HTML or XML
	// - for XML streams, use SimpleXML
	// - for HTML use strpos() and regular expressions (SimpleXML is risky due to general HTML cruft)

	// mandatory fields
	$properties['webpage'] = $source_url;
	$properties['src'] = null;	// TODO: direct link to medium-sized image file

	// nice-to-have fields
	$properties['archive'] = 'Name of the archive where this photo is from';
	$properties['width'] = null;	// TODO: insert your code here
	$properties['height'] = null;	// TODO: insert your code here
	$properties['thumbnail'] = null;	// TODO: direct link to thumbnail image file
	$properties['date'] = null;	// TODO: insert your string here
	$properties['datetime'] = null;	// TODO: insert your string here (use this if you have the time too)
	$properties['title'] = null;	// TODO: insert your string here
	$properties['description'] = null;	// TODO: insert your string here
	$properties['author'] = null;	// TODO: insert your string here

	// location
	$properties['geo:lat_long'] = null;	// TODO: insert your latlong here
	$properties['ge:heading'] = null;	// TODO: insert your heading here

	// other tags (replace these placeholder strings with real keyless tags)
	$properties['_extra_tags'][] = 'keyless tag #1';
	$properties['_extra_tags'][] = 'keyless tag #2';
	$properties['_extra_tags'][] = 'keyless tag #3';
	$properties['_extra_tags'][] = 'keyless tag #4';

	// return the populated array
	return ($properties);
}
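For testing a scraper outside SnapMap, a stand-in getRemoteFile() along the following lines may be enough. It is a sketch only: SnapMap's real function may behave differently, and the local-file shortcut, the timeout value and the false-on-failure return are assumptions made here.

```php
<?php
// Test-only stand-in for SnapMap's getRemoteFile(). Assumption: returns the
// response body as a string, or false on failure.
function getRemoteFile($url)
{
	// Read local fixture files directly, so scrapers can be tested offline
	if (!preg_match('#^https?://#i', $url)) {
		return file_get_contents($url);
	}

	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);	// return the body instead of printing it
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);	// follow HTTP redirects
	curl_setopt($ch, CURLOPT_TIMEOUT, 20);	// give up after 20 seconds
	return curl_exec($ch);
}
```

Saving a few representative pages from the target website to disk and passing their file paths to scrape() makes it possible to re-run the scraper repeatedly without hitting the website each time.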

Registering your new scraper with SnapMap

To add your new scraper to the standard scrapers, so it can be used automatically by anybody, email the script to Laurence, along with a suggested regular expression to detect the URLs that it can process.
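Before sending a suggested regular expression, it can help to check it against a few URLs that should and should not match. The sketch below shows one way to do this; the pattern, the example.org URLs and the checkPattern() function are all hypothetical.

```php
<?php
// Hypothetical pattern for a website whose photo pages look like /photo/<number>
$pattern = '#^https?://photos\.example\.org/photo/\d+/?$#i';

// Report, for each sample URL, whether the pattern accepts it
function checkPattern($pattern, array $urls)
{
	$results = array();
	foreach ($urls as $url) {
		$results[$url] = (preg_match($pattern, $url) === 1);
	}
	return $results;
}

$samples = array(
	'http://photos.example.org/photo/1234',	// a photo page: should match
	'http://photos.example.org/about',	// not a photo page: should not match
);
print_r(checkPattern($pattern, $samples));
```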
