class_http.php

Author: Troy Wolf (troy@troywolf.com)
Modified Date: 2006-03-19
Download: class_http.zip
View class source: class_http.php source

Updated 3/19/2006 to fix problems posting variables and to resolve an issue that prevented access to some SSL content.

Updated 3/6/2006 to support making WebDAV requests to Exchange Server.
Microsoft Exchange PHP WebDAV Examples

class_http.php is a "screen-scraping" utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. Caching makes you a good neighbor!

The class supports GET, POST, and even WebDAV verbs such as SEARCH, PROPFIND, PROPPATCH, and others.
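A WebDAV request differs from a normal fetch mainly in the verb. Below is a minimal sketch; it assumes the verb is passed as the final argument to fetch() after the optional credentials, and mail.example.com and the mailbox path are placeholders. The exact signature may differ, so check the class source and the Exchange WebDAV examples for the real API.

$h = new http();
// A bodyless PROPFIND asks the server for the properties of the folder.
// Passing "PROPFIND" as a sixth argument is an assumption about fetch()'s
// signature; verify against the class source.
if (!$h->fetch("http://mail.example.com/exchange/jdoe/inbox/",
               0, null, "jdoe", "secret", "PROPFIND")) {
  echo "<h2>There is a problem with the http request!</h2>";
  echo $h->log;
  exit();
}
echo "<pre>".htmlentities($h->body)."</pre>"; // raw WebDAV XML response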

The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.

The class cloaks itself by presenting the User-Agent of the user making the request to your script. It also sends your script's URL as the Referer, since in essence, your script is the referrer. This means you should be able to screen-scrape sites that normally block screen-scrapers. This class is not meant to help you break any company's usage policies. Be a good neighbor, and always use caching when you can.

Need to access protected content? The class can do basic authentication. However, a lot of sites that require login do not use basic authentication. No problem! You can post your credentials to a login script, retain the cookie, then access the protected content.
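Here is a rough sketch of that login-then-fetch pattern. The postvars array, the "POST" verb argument, and the cookie property are assumptions made for illustration; check the class source for the real property and parameter names.

$h = new http();

// Post the login form. postvars and the "POST" verb argument are
// assumed names; adapt them to the actual class API.
$h->postvars = array("username" => "jdoe", "password" => "secret");
if (!$h->fetch("http://someprivatesite.net/login.php", 0, null, "", "", "POST")) {
  echo $h->log;
  exit();
}

// Pull the session cookie out of the response headers.
if (preg_match('/^Set-Cookie: ([^;]+)/mi', $h->header, $m)) {
  $h->cookie = $m[1]; // assumed property for sending a Cookie header
}

// Now fetch the protected content with the session cookie in place.
$h->fetch("http://someprivatesite.net/members/report.php");
echo $h->body;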

Solutions from Troy built on top of class_http

proxy.php
Exchange WebDAV
RSS Consumption
ServerBeach DNS API

Buy Troy a Latte

You wouldn't want me walking around decaffeinated, would you? You can add any amount of money to my Latte Land card.

To use the class in your scripts, you first need to include the class file. Modify the path to fit your needs.

require_once(dirname(__FILE__).'/class_http.php');

Instantiate a new http object. You can create one object and use it over and over again throughout your script, or you can create multiple objects as needed.

$h = new http();

The caching feature requires a directory on your webserver to save the cache files. If you prefer, you can hard-code this in the class itself by modifying the 'dir' property in the http() function (the class constructor). The class will default to storing the cache files in the current directory, but for security, you should store them in a non-web-accessible directory. You can set this property per object using the code below. You must end the value with a "/". If you do not plan to use caching, don't worry about this property.

$h->dir = "/home/foo/bar/";

Example to screen-scrape the Google home page without caching.

if (!$h->fetch("http://www.google.com")) {
  echo
"<h2>There is a problem with the http request!</h2>";
  echo
$h->log;
  exit();
}

Once you have executed the fetch() method, three properties are available: the HTTP status, the HTTP headers, and the body. Usually, you will only be interested in the body content.

echo "Status: ".$h->status;

echo
"<pre>".$h->header."</pre>";

echo
$h->body;

Here is an example to screen-scrape the MSFT stock page at moneycentral.com WITH caching. You can pass in a TTL which is a Time-To-Live in seconds that you want the cached data to be considered "good". For example, if you set the ttl to 600, it means that before going to the source site for the data, the local cache will be checked. If the cache file exists, and is not more than 10 minutes old, the class will use the cache. Otherwise, the source site will be scraped, and the local cache file will be updated. This makes subsequent hits to your page faster and makes you a better neighbor to the external site.

$url = "http://moneycentral.msn.com/detail/stock_quote?Symbol=MSFT";
if (!
$h->fetch($url, 600)) {
  echo
"<h2>There is a problem with the http request!</h2>";
  echo
$h->log;
  exit();
}

There is a special ttl value of "daily". This tells the class to consider the cached data "good" as long as it was scraped today. Otherwise, go get a fresh copy of content from the source site and update the local cache.

if (!$h->fetch($url, "daily")) {
  echo
"<h2>There is a problem with the http request!</h2>";
  echo
$h->log;
  exit();
}

Optionally, you can pass in a name that will be used to name the cache file. This is useful if you want to be able to know which cache files are which. If you do not pass a name, it will default to an MD5 hash of the url.

if (!$h->fetch($url, 600, "MSFT_Info")) {
  echo "<h2>There is a problem with the http request!</h2>";
  echo $h->log;
  exit();
}

The class comes with 2 static methods you can use to extract data out of HTML tables.

  1. table_into_array() will rip a single table into an array.
  2. table_into_xml() will internally call table_into_array() then create an XML document from the array. I thought this would be cool, but in practice, I've never used this method since the array is so easy to work with.

This example builds on the previous example to extract the MSFT stats out of http://moneycentral.msn.com/detail/stock_quote?Symbol=MSFT. Read the comments in the class file to learn how to use this static method.

$msft_stats = http::table_into_array($h->body, "Avg Daily Volume", 1, null);

/* Print out the array so you can see the stats data. */
echo "<pre>";
print_r($msft_stats);
echo "</pre>";

The class can do basic authentication to scrape protected content. Note that most sites that require login do not use basic authentication. Pass your username and password in like this:

$url = "http://someprivatesite.net";
$h->fetch($url, 0, null, "MyUserName","MyPassword");

If you need to access content on a port other than 80 (or 443 for https), just put the port in the URL in the standard way:

$h->fetch("http://somedomain.org:8088");

The class includes a companion script named image_cache.php that can be used as the src attribute within an image element. Why not just link directly to a neighbor's images? If your site has a lot of traffic, that's a lot of hits to your neighbor's site. So why not just copy their image to your own server? That's fine for images that do not change, but some sites create dynamic images such as stock charts that are generated new every minute. image_cache.php in conjunction with class_http.php makes it easy to directly link to third-party images and cache the image data for whatever TTL makes sense for your application. View the source for image_cache.php.

In this example, we will cache the chart image found at http://moneycentral.msn.com/investor/charts/chartdl.asp?FC=1&Symbol=MSFT&CA=1&CB=1&CC=1&CD=1&CP=0&PT=5. You have to look at the page's source code to find the URL of the image itself. Then you URL-encode that image URL and pass it as a parameter to image_cache.php in your image's src attribute. The embedded URL is very long because it was long to start with, and URL encoding makes it longer still. Here, ttl=60 means the image is cached for 1 minute before hitting the source site again.

<img src="image_cache.php?ttl=60&url=http%3A%2F%2Fdata.moneycentral.msn.com%2Fscripts%2Fchrtsrv.dll%3FSymbol
%3DMSFT%26C1%3D0%26C2%3D1%26C9%3D2%26CA%3D1%26CB%3D1%26CC%3D1%26CD%3D1%26CF%3D0%26EFR%3D236%26EFG%3D246%26EFB
%3D254%26E1%3D0"
width="448" height="300" alt="Chart Graphic" />
Tip: Use PHP's urlencode() function to encode your embedded URLs.
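For example, you can let PHP do the encoding when you build the img element, so the embedded URL is always escaped correctly. The shortened chart URL below is just for readability:

<?php
// Encode the raw image URL and embed it in the src attribute.
$chart_url = "http://data.moneycentral.msn.com/scripts/chrtsrv.dll?Symbol=MSFT&C1=0&C2=1";
echo '<img src="image_cache.php?ttl=60&url='.urlencode($chart_url)
    .'" width="448" height="300" alt="Chart Graphic" />';
?>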

Finally, anytime you have problems, be sure to look at the 'log' property which will give you specific information related to problems with your http requests or problems with caching.

/*
The log property contains a log of the object's events. Very useful for
testing and debugging. If there are problems, the log will tell you what
is wrong. For example, if the specified cache dir does not have write
privileges, the log will tell you it could not open the cache file. If a
socket to the remote server could not be opened, the log will tell you that.
*/
echo "<h1>Log</h1>";
echo $h->log;


About the author

Troy Wolf operates ShinySolutions Webhosting, and is the author of SnippetEdit, a PHP application providing browser-based website editing that even non-technical people can use: website editing as easy as it gets. Troy has been a professional Internet and database application developer for over 12 years. He has many years of experience with ASP, VBScript, PHP, JavaScript, DHTML, CSS, SQL, and XML on Windows and Linux platforms. Check out Troy's Code Library.