Web scraping in a smart way, making a “Today in History” object in PHP

Thousands of services on the web present interesting and educational information that you can integrate into your own web page, or turn into a widget that others can use seamlessly on their own content delivery platforms. In this article I am going to show you how to make a Today-in-History widget using the data provided by Scopesys. You can use this code to build a widget, a trivia app, or whatever you like. But before writing your own scrapers for any service, please, please, please check the copyright of its content carefully. You shouldn’t violate copyright in any way.

In this widget, we will extract the following content from the page provided by Scopesys and display it in separate categories.
1. Today in history
2. Who was born today
3. Who died today
4. Where it is a holiday today
5. Religious observances of today
6. Religious history of today

Let’s go 😀

<?php
// todayinhistory.php
error_reporting(0);

// Markers that delimit each section in the source of the Scopesys page
define("MARKER_START", "<H3>On this day...</h3>");
define("MARKER_END", "<BR><BR><HR><h3>Holidays</h3>");
define("BIRTHDAY_START", "</font></center></center>");
define("BIRTHDAY_END", "<HR> <br><H3>Deaths which occurred on ".date("F d").":</H3>");
define("DEATH_START", "<HR> <br><H3>Deaths which occurred on ".date("F d").":</H3>");
define("DEATH_END", "<HR><IMG align=left SRC=\"http://www.scopesys.com/flag.gif\">");
define("HOLIDAYS_START", '<i>Note: Some Holidays are only applicable on a given <b>"day of the week"</b></i><br> <br>');
define("HOLIDAYS_END", "<HR> <H3>Religious Observances</H3>");
define("RELIGIOUS_START", "<HR> <H3>Religious Observances</H3>");
define("RELIGIOUS_END", "<HR> <H3>Religious History </h3>");
define("RELHISTORY_START", "<HR> <H3>Religious History </h3>");
define("RELHISTORY_END", "<BR><BR><font color=red>");

echo "<h2>Today is ".date("F d, Y")."</h2>";

// Download the whole page once; every section is cut out of this string
$data = file_get_contents("http://www.scopesys.com/today");

// Each section is printed only when its query parameter is set to 1
if ($_GET['history'] == '1') {
    echo "<br/><h2 style='color: green' >Today in history</h2>";
    $end = strpos($data, MARKER_END) - 15;
    $start = strpos($data, MARKER_START) + strlen(MARKER_START);
    echo substr($data, $start, $end - $start);
}

if ($_GET['born'] == '1') {
    echo "<br/><h2 style='color: green' >Who's born today</h2>";
    $end = strpos($data, BIRTHDAY_END);
    $start = strpos($data, BIRTHDAY_START) + strlen(BIRTHDAY_START);
    echo substr($data, $start, $end - $start);
}

if ($_GET['died'] == '1') {
    echo "<br/><h2 style='color: green' >Who died today</h2>";
    $end = strpos($data, DEATH_END);
    $start = strpos($data, DEATH_START) + strlen(DEATH_START);
    echo substr($data, $start, $end - $start);
}

if ($_GET['holiday'] == '1') {
    echo "<br/><h2 style='color: green' >Where is holiday today</h2>";
    $end = strpos($data, HOLIDAYS_END);
    $start = strpos($data, HOLIDAYS_START) + strlen(HOLIDAYS_START);
    echo substr($data, $start, $end - $start);
}

if ($_GET['religious'] == '1') {
    echo "<br/><h2 style='color: green' >Religious observance</h2>";
    $end = strpos($data, RELIGIOUS_END);
    $start = strpos($data, RELIGIOUS_START) + strlen(RELIGIOUS_START);
    echo substr($data, $start, $end - $start);
}

if ($_GET['relhistory'] == '1') {
    echo "<br/><h2 style='color: green' >Religious history</h2>";
    $end = strpos($data, RELHISTORY_END);
    $start = strpos($data, RELHISTORY_START) + strlen(RELHISTORY_START);
    echo substr($data, $start, $end - $start);
}
?>
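Every section in the script repeats the same strpos/substr pattern, and strpos() returns false when a marker is not found, which silently produces wrong or empty output. Here is a minimal sketch of how that pattern could be wrapped in one function. The name extract_between and the sample string are my own inventions for illustration, not part of the original script or of the Scopesys page.

```php
<?php
// Hypothetical helper (not in the original script): returns the text found
// between two markers, or null when either marker is missing.
function extract_between(string $html, string $startMarker, string $endMarker): ?string
{
    $start = strpos($html, $startMarker);
    if ($start === false) {
        return null; // start marker not present in the page
    }
    $start += strlen($startMarker);

    // Search for the end marker only after the start marker
    $end = strpos($html, $endMarker, $start);
    if ($end === false) {
        return null; // end marker not present after the start marker
    }

    return substr($html, $start, $end - $start);
}

// Made-up sample input standing in for the downloaded page
$sample  = "<H3>On this day...</h3>An example entry<HR><h3>Holidays</h3>";
$section = extract_between($sample, "<H3>On this day...</h3>", "<HR><h3>Holidays</h3>");

echo $section ?? "Section not found; the page markup may have changed.";
```

With something like this, each if block could shrink to a single call, and a changed page layout would show a clear message instead of nothing.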

Now if you want to find out who was born today, point your browser to todayinhistory.php?born=1. Mashup, mashup, mashup: that is what many successful web apps are doing these days. And sometimes this is how data collection is done behind the scenes 🙂

Writing this code was really as enjoyable as getting a root canal done with a rusty drill (I forget where I read such a nice quote), heh heh. But I am sure you will enjoy it more than that 😉 Happy scraping.

12 thoughts on “Web scraping in a smart way, making a “Today in History” object in PHP”

But wasn’t RSS invented to avoid doing this?

@Piccolo

Yeah, I never found an RSS channel for Today-in-so-many-categories. In these cases, screen scraping saves the day 🙂

Hint (challenge): with tidy, SimpleXML and some XPath, you can have a script of 10-15 lines 🙂

In line with aurelian’s comments, using DOM/XPath could really save you some hassle. With XPath you can query on the attributes of given tags, etc., which makes your script a bit less dependent on the page content (you can’t get zero dependence, but it minimizes it). You’d also find your script a bit shorter.

Another suggestion would be to cache the results using Zend_Cache or PEAR’s Cache_Lite.
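The DOM/XPath idea can be sketched roughly like this. It assumes each section starts with an h3 heading and runs until the next h3. I have not verified that against the real Scopesys markup, and section_after_heading and the sample HTML are hypothetical names and data made up for the example.

```php
<?php
// Hypothetical helper: collects the text that follows the first <h3> whose
// text contains $headingText, stopping at the next <h3>.
function section_after_heading(string $html, string $headingText): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // The HTML parser lowercases tag names, so <H3> is matched by //h3
    foreach ($xpath->query('//h3') as $heading) {
        if (strpos($heading->textContent, $headingText) === false) {
            continue;
        }
        $parts = [];
        for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeName === 'h3') {
                break; // the next section begins here
            }
            $text = trim($node->textContent);
            if ($text !== '') {
                $parts[] = $text;
            }
        }
        return implode("\n", $parts);
    }
    return null; // no matching heading found
}

// Made-up sample page, only for demonstration
$sampleHtml = '<html><body><H3>On this day...</H3><p>First entry</p>'
            . '<p>Second entry</p><h3>Holidays</h3><p>Other</p></body></html>';

echo section_after_heading($sampleHtml, 'On this day') ?? 'Section not found';
```

Matching on the heading text instead of on long literal marker strings means small changes in whitespace or tag case no longer break the extraction.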

@Tony, Aurelian

True. And sometimes DOM is more efficient than scraping like this. But did you ever try to make a scraper which parses all the linked in pages after simulating a POST? Or some of the popular job sites? Then you will understand the real pain. In those cases, this approach works best.

And yeah, I forgot to mention caching. We should cache the page to avoid hammering that service with requests, DoS-style. Thanks
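As a rough sketch of the caching idea, here is a plain file-based cache rather than Zend_Cache or Cache_Lite, so it needs no extra library. The function name cached_fetch, the cache file path and the one-hour lifetime are all assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: returns the cached copy when it is younger than
// $ttlSeconds, otherwise calls $fetch, stores the result and returns it.
function cached_fetch(string $cacheFile, int $ttlSeconds, callable $fetch): string
{
    clearstatcache(true, $cacheFile);
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached; // cache hit, the remote site is not contacted
        }
    }

    $fresh = $fetch();
    if (!is_string($fresh)) {
        throw new RuntimeException('Fetching fresh data failed');
    }

    file_put_contents($cacheFile, $fresh, LOCK_EX);
    return $fresh;
}

// Usage sketch (the path and the one-hour lifetime are assumptions):
// $data = cached_fetch(sys_get_temp_dir() . '/todayinhistory.html', 3600, function () {
//     return file_get_contents("http://www.scopesys.com/today");
// });
```

Since the page only changes once a day, even a lifetime of several hours would keep the widget current while sending very few requests to the service.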

Thanks.

Nice code snippet

I don’t know how I missed reading this wonderful article!
It really opened my mind to thinking this way,

thanks Hasin bhai 🙂


Hi there Hasin. Thanks for posting the script, as I’ve been searching for an example like it to scrape a specific element within a page. Given that it was written two years ago, when scraping with this app all I retrieve is the date, with nothing else echoing out. I think Scopesys have changed their source code; apart from that, the code’s logic seems fine.
