Creating a Screen Scraper
A screen scraper program accesses a web page and picks through the HTML for interesting or useful data. Here's a very simple one that extracts all hyperlinks from a page and then categorizes them. This scraper includes a lot of regular expressions, so let's take it one step at a time. First, let's check that the input (in $_REQUEST["page"]) is actually an http:// or https:// URL and not someone trying to monkey around with files on the local system:
<?php
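$page = isset($_REQUEST["page"]) ? $_REQUEST["page"] : "";
if (!preg_match('|^https{0,1}://|', $page)) {
    // Escape the value before echoing it so it can't inject markup.
    print "URL " . htmlspecialchars($page) . " invalid or unsupported.";
    exit;
}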
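To see what the check accepts and rejects, here's a minimal standalone sketch. The is_web_url() function name and the sample inputs are invented for illustration and are not part of the scraper; the pattern uses s? in place of s{0,1}, which means the same thing.

```php
<?php
// Hypothetical helper (not in the original script): returns true only
// when the string starts with http:// or https://.
function is_web_url($page) {
    return is_string($page) && preg_match('|^https?://|', $page) === 1;
}

var_dump(is_web_url('http://example.com/'));   // bool(true)
var_dump(is_web_url('https://example.com/a')); // bool(true)
var_dump(is_web_url('/etc/passwd'));           // bool(false)
var_dump(is_web_url('file:///etc/passwd'));    // bool(false)
```

Because the pattern is anchored with ^, anything that doesn't begin with one of the two schemes is rejected, including local paths and file:// URLs.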
Let's say that this checks out, so now it's time to get the data and extract all of the hyperlinks in the anchor tags. Notice that we're using the simple file_get_contents() function instead of cURL; this assumes that PHP's allow_url_fopen setting is enabled and that we don't need any of cURL's fancy features such as HTTP authentication or cookie management.
$data = file_get_contents($page);
preg_match_all('|<a[^>]*href="([^"]+)"|i', $data, $matches);
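Now all of the hyperlinks are in $matches[1] (remember that $matches[0] contains the full text matched by the entire pattern, while $matches[1] contains only what the first set of parentheses captured). Let's initialize some arrays that we'll use to store and categorize the hyperlinks: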
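If the difference between $matches[0] and $matches[1] isn't clear, this small sketch may help. The $html string is sample markup invented for illustration; the pattern is assumed to be the anchor-tag expression described above.

```php
<?php
// Sample markup made up for this demonstration.
$html = '<p><a href="/about">About</a> '
      . '<A class="x" HREF="http://example.com/">Example</A></p>';

// Find every anchor tag with a double-quoted href, ignoring case.
preg_match_all('|<a[^>]*href="([^"]+)"|i', $html, $matches);

print_r($matches[0]); // the text matched by the whole pattern
print_r($matches[1]); // only the captured href values
```

Here $matches[0] holds the tag fragments (such as <a href="/about") and $matches[1] holds just /about and http://example.com/. Note that the pattern only sees href values in double quotes.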
$all_links = array();
$js_links = array();
$full_links = array();
$local_links = array();
It's time to run through all of the links and do the real work of categorization. First, we make sure that we haven't already seen this link. If it's a new one, we'll use several regular expressions to determine what kind of link it is:
foreach ($matches[1] as $link) {
    // Skip links we've already seen; isset() avoids an undefined-index notice.
    if (isset($all_links[$link])) {
        continue;
    }
    $all_links[$link] = true;
    if (preg_match('/^javascript:/', $link)) {
        $js_links[] = $link;
    } elseif (preg_match('/^https{0,1}:/i', $link)) {
        $full_links[] = $link;
    } else {
        $local_links[] = $link;
    }
}
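To try the categorization on its own, here's a sketch that wraps the same logic in a function and runs it on a hand-made list. The categorize_links() name, the array keys, and the sample links are assumptions made for this example; the scraper itself uses the separate arrays shown above.

```php
<?php
// Hypothetical wrapper around the loop's logic, for experimenting
// without fetching a page.
function categorize_links(array $links) {
    $seen = array();
    $result = array('js' => array(), 'full' => array(), 'local' => array());
    foreach ($links as $link) {
        if (isset($seen[$link])) {
            continue; // duplicate, already categorized
        }
        $seen[$link] = true;
        if (preg_match('/^javascript:/', $link)) {
            $result['js'][] = $link;
        } elseif (preg_match('/^https{0,1}:/i', $link)) {
            $result['full'][] = $link;
        } else {
            $result['local'][] = $link;
        }
    }
    return $result;
}

// Invented sample data: one duplicate, one script link, two full URLs.
$sample = array('/a', 'http://example.com/', '/a',
                'javascript:void(0)', 'HTTPS://example.org/');
$result = categorize_links($sample);
print_r($result);
```

The duplicate /a is counted once, and HTTPS://example.org/ lands in the full list because that test is case-insensitive. The javascript: test is case-sensitive, as in the loop above, so a link written as JavaScript:... would be counted as local.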
Now it's time to print the results of the analysis:
print '<table border="0">';
print "<tr><td>number of links:</td><td>";
print strval(count($matches[1])) . "</td></tr>";
print "<tr><td>unique links:</td><td>";
print strval(count($all_links)) . "</td></tr>";
print "<tr><td>local links:</td><td>";
print strval(count($local_links)) . "</td></tr>";
print "<tr><td>full links:</td><td>";
print strval(count($full_links)) . "</td></tr>";
print "<tr><td>javascript junk:</td><td>";
print strval(count($js_links)) . "</td></tr>";
print '</table>';
?>



