Creating a Screen Scraper

Dec 06, 2009 Author: vvaswani

A screen scraper program accesses a web page and picks through the HTML for interesting or useful data. Here's a very simple one that extracts all hyperlinks from a page and then categorizes them. This scraper relies on several regular expressions, so let's take it one step at a time. First, let's check that the input (in $_REQUEST["page"]) is actually an http:// or https:// URL and not someone trying to monkey around with files on the local system:

<?php
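// Fall back to an empty string when no "page" parameter was sent.
$page = isset($_REQUEST["page"]) ? $_REQUEST["page"] : "";
if (!preg_match('|^https?://|', $page)) {
    // Escape the value before echoing it back, since it is user input.
    print "URL " . htmlspecialchars($page) . " invalid or unsupported.";
    exit;
}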
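The prefix check above only looks at the first few characters of the input. As a slightly stricter alternative, here is a minimal sketch built on PHP's filter_var() and parse_url(). The helper name is_http_url() is hypothetical and not part of the original script, and the check is purely syntactic: it says nothing about whether the host is safe to contact.

```php
<?php
// Hypothetical helper (not in the original script): returns true only for
// well-formed absolute URLs whose scheme is http or https.
function is_http_url($url)
{
    // filter_var() returns false for strings that are not well-formed URLs.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // parse_url() extracts the scheme so it can be compared directly.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}
```

With this helper, the test in the script could read if (!is_http_url($page)) { ... } in place of the preg_match() call.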

Let's say that this checks out, so now it's time to fetch the page and extract the hyperlinks from its anchor tags. Notice that we're using the simple file_get_contents() function instead of cURL; this assumes that allow_url_fopen is enabled in php.ini and that we don't need any of cURL's fancy features such as HTTP authentication or cookie management.

$data = file_get_contents($page);
if ($data === false) exit("Could not fetch the page.");
preg_match_all('|<a[^>]*href="([^"]+)"|i', $data, $matches);
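The regular expression only catches href attributes written with double quotes. For comparison, here is a rough sketch of the same extraction done with PHP's DOM extension, which also copes with single-quoted or unquoted attributes. This swaps in an HTML parser for the article's regex approach; the function name extract_hrefs() is an assumption made for illustration, and the sketch assumes the dom extension is available.

```php
<?php
// Hypothetical helper: collect the href value of every <a> element using
// an HTML parser instead of a regular expression.
function extract_hrefs($html)
{
    if ($html === '') {
        return array();  // loadHTML() does not accept an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors such as <a name="top"> that carry no href.
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}
```

The returned array plays the same role as $matches[1] in the script, so the categorization loop could iterate over extract_hrefs($data) instead.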

Now all of the hyperlinks are in $matches[1], the array of values captured by the parenthesized group (remember that $matches[0] contains the full text matched by the whole pattern). Let's initialize some arrays that we'll use to store and categorize the hyperlinks:

$all_links = array();
$js_links = array();
$full_links = array();
$local_links = array();

It's time to run through all of the links and do the real work of categorization. First, we make sure that we haven't already seen this link. If it's a new one, we'll use several regular expressions to determine what kind of link it is:

foreach ($matches[1] as $link) {
    if (isset($all_links[$link])) {
        continue;
    }
    $all_links[$link] = true;

    if (preg_match('/^javascript:/i', $link)) {
        $js_links[] = $link;
    } elseif (preg_match('/^https{0,1}:/i', $link)) {
        $full_links[] = $link;
    } else {
        $local_links[] = $link;
    }
}
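To try the categorization rules without fetching a page, the loop can be wrapped in a function and fed a hand-written list. This is only a sketch: the function name categorize_links() and the array keys 'js', 'full' and 'local' are made up for this example, and the rules are the same three-way split used above.

```php
<?php
// Hypothetical wrapper around the loop above: de-duplicates the links and
// sorts each one into a 'js', 'full' or 'local' bucket.
function categorize_links(array $links)
{
    $result = array('js' => array(), 'full' => array(), 'local' => array());
    $seen = array();
    foreach ($links as $link) {
        // isset() avoids an undefined-index notice for links not seen yet.
        if (isset($seen[$link])) {
            continue;
        }
        $seen[$link] = true;

        if (preg_match('/^javascript:/i', $link)) {
            $result['js'][] = $link;
        } elseif (preg_match('/^https{0,1}:/i', $link)) {
            $result['full'][] = $link;
        } else {
            // Everything else, including relative paths, counts as local.
            $result['local'][] = $link;
        }
    }
    return $result;
}
```

Note that schemes such as mailto: end up in the local bucket under these rules, exactly as they do in the loop above.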

Now it's time to print the results of the analysis:

print '<table border="0">';
print "<tr><td>number of links:</td><td>";
print strval(count($matches[1])) . "</td></tr>";
print "<tr><td>unique links:</td><td>";
print strval(count($all_links)) . "</td></tr>";
print "<tr><td>local links:</td><td>";
print strval(count($local_links)) . "</td></tr>";
print "<tr><td>full links:</td><td>";
print strval(count($full_links)) . "</td></tr>";
print "<tr><td>javascript junk:</td><td>";
print strval(count($js_links)) . "</td></tr>";
print '</table>';
?>
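The print statements above repeat the same row markup five times. One possible way to tidy that up is to build the table from an array of label and count pairs, as in this sketch. The helper render_counts() is hypothetical, and it returns the HTML as a string rather than printing it.

```php
<?php
// Hypothetical helper: turn an array of label => count pairs into the same
// two-column table that the script prints row by row.
function render_counts(array $counts)
{
    $html = '<table border="0">';
    foreach ($counts as $label => $count) {
        // Escape the label and force the count to an integer before output.
        $html .= '<tr><td>' . htmlspecialchars((string) $label) . ':</td><td>'
               . (int) $count . '</td></tr>';
    }
    return $html . '</table>';
}
```

In the script this could be called as print render_counts(array("number of links" => count($matches[1]), "unique links" => count($all_links))); with the remaining rows added in the same way.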

tags: php web scraper
