June 20, 2009 / June 21, 2009

collecting data from streaming APIs in twitter

twitter’s streaming API is still in beta and is a good source for collecting public tweets. unfortunately, not all of its methods are instantly usable by third parties (you need to provide written statements and so on). for testing, though, three of these streaming methods are usable by anyone at this moment: spritzer, track and follow. spritzer streams a tiny part of the public tweets to the collecting process. in this blog post i’ll show you how to collect data from the spritzer API.

as it is streaming data, twitter keeps the HTTP connection “alive” indefinitely (using Keep-Alive, until any hiccup). so when you write code, you must take care of that. i would also suggest making separate processes: one for collecting the data and writing it (or sending it to a queue to be written), and another for analyzing it. and of course, to minimize bandwidth consumption, use the json format. json data is also easier to parse than XML here, as every tweet is separated from the next by a newline (“\n”) character 🙂 so you can read the data line by line, decode each line using json_decode() and do whatever you want.
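
a minimal sketch of the line-by-line idea, assuming each non-empty line holds exactly one JSON object. the helper name decodeTweetLine() and the sample line are made up for illustration (they are not from twitter’s documentation), and blank lines are skipped on the assumption that the stream may send them between tweets.

```php
<?php
// Hypothetical helper (not part of any twitter library): decode one line
// of the stream. Returns an associative array, or null when the line is
// blank or is not a JSON object/array.
function decodeTweetLine($line)
{
    $line = trim($line);
    if ($line === '') {
        return null; // blank line, nothing to decode
    }
    $tweet = json_decode($line, true);
    return is_array($tweet) ? $tweet : null;
}

// Made-up sample line, only to show the shape of the call.
$sample = '{"text":"hello world","user":{"screen_name":"someone"}}' . "\n";
$tweet = decodeTweetLine($sample);
if ($tweet !== null) {
    echo $tweet['user']['screen_name'] . ': ' . $tweet['text'] . "\n";
}
```

returning null for anything that does not decode keeps the reading loop simple: it can just skip that line and move on to the next one.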

here is how you can create the collector process in php

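<?php
// datacollector.php
// reads the spritzer stream line by line and appends each line to an hourly file
$fp = fopen("http://username:password@stream.twitter.com/spritzer.json","r");
if ($fp === false)
{
    die("could not open the stream\n");
}
$newTime = null; // hour stamp (YmdH) of the file that is currently open
$fp2 = null;     // handle of the current hourly output file
while (($data = fgets($fp)) !== false)
{
    $time = date("YmdH");
    if ($newTime !== $time)
    {
        // first line, or the hour has changed: switch to a new file
        if ($fp2)
        {
            fclose($fp2);
        }
        $fp2 = fopen("{$time}.txt","a");
        $newTime = $time;
    }
    fputs($fp2,$data);
}
?>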

this script writes the data collected from the spritzer streaming API into hourly files (with names like <YmdH>.txt). so in the directory where you are running this script you will see hourly data files, like 2009062020.txt. there is a special advantage to collecting this way: as the file for the current hour is still open for writing (so treat it as off-limits), you only process files for previous hours. it makes analyzing the data more hassle free 🙂
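
a rough sketch of how a separate analyzer process could pick only the finished files, assuming the <YmdH>.txt naming described above. the function name finishedHourlyFiles() is hypothetical, and the code only lists the files; what you do with each one is up to you.

```php
<?php
// Hypothetical helper: return the hourly files in $dir that are safe to
// process, i.e. every <YmdH>.txt whose hour is earlier than $currentHour.
function finishedHourlyFiles($dir, $currentHour)
{
    $finished = array();
    $paths = glob($dir . '/*.txt');
    if ($paths === false) {
        return $finished; // glob() failed, report nothing as ready
    }
    foreach ($paths as $path) {
        $name = basename($path, '.txt');
        // keep only names that look like YmdH (10 digits) and are older than
        // the current hour; fixed-width digit strings compare correctly as text
        if (preg_match('/^\d{10}$/', $name) && strcmp($name, $currentHour) < 0) {
            $finished[] = $path;
        }
    }
    sort($finished);
    return $finished;
}

foreach (finishedHourlyFiles('.', date('YmdH')) as $path) {
    echo "ready to analyze: $path\n";
}
```

comparing against date('YmdH') means the file the collector is still appending to is never returned.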

now run this script in background via the following command in your terminal

php datacollector.php &

the reason for appending an “&” at the end of the command is to start this process in the background, so that you don’t have to wait for the script to end to get your shell back. as it is streaming data, the script will run indefinitely. and it will consume very little bandwidth 🙂 you can check for yourself.
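
since the connection only stays open until a hiccup, a long-running collector usually needs to reconnect at some point. here is a minimal sketch of just the waiting part, assuming you wrap the fopen()/fgets() code in an outer loop of your own. reconnectDelay() and its default numbers are made up for illustration; they are not values from twitter’s documentation.

```php
<?php
// Hypothetical helper: seconds to wait before reconnect attempt number
// $attempt (1, 2, 3, ...). The wait doubles each time and is capped at
// $maxSeconds so a long outage does not push the delay up forever.
function reconnectDelay($attempt, $baseSeconds = 1, $maxSeconds = 240)
{
    $delay = $baseSeconds * pow(2, max(0, $attempt - 1));
    return (int) min($delay, $maxSeconds);
}

// Shows the waits an outer loop would use between failed connections.
for ($attempt = 1; $attempt <= 5; $attempt++) {
    echo "attempt $attempt: wait " . reconnectDelay($attempt) . "s\n";
}
```

in the outer loop you would call sleep(reconnectDelay($attempt)) after each failed connection and reset $attempt once data is flowing again.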

so i hope this helps developers who are looking for a way to collect data from twitter’s streaming API via PHP. if you want to track specific keywords, use the “track” API instead :). and if you want to follow particular people, use “follow”. check out twitter’s documentation of the streaming API for more 🙂

21 thoughts on “collecting data from streaming APIs in twitter”

Lenin says:

June 20, 2009 at 11:49 pm

Another post on my interest 🙂

Thanks millions!

ranacse05 says:

June 20, 2009 at 11:56 pm

nice, but i was looking for how to collect a group of followers. waiting for that post.

Please make it fast.

Ishtiaque Ahmed says:

June 21, 2009 at 1:01 am

short and sweet. just too good. can’t wait until they make the streaming live. thanks for the post.

hasin says:

June 21, 2009 at 1:04 am

@Ishtiaque – the spritzer, follow and track APIs are open for all 🙂 others are not available for everyone.

The tremendous “Firehose” API is available to friendfeed, fyi 🙂 that’s why they got all your tweets

Ishtiaque Ahmed says:

June 21, 2009 at 1:21 am

thanks for the info Hasin bhai. That’s really cool. One more question: aren’t they going to use oauth? I thought twitter had set it as the standard way to access their APIs.

hasin says:

June 21, 2009 at 1:25 am

@Ishtiaque yes of course they use oAuth. but i’ve shown the shortcut way with HTTP BASIC AUTH, because this process will run in the background as a shell process and you must use your own account for that (which means you know the username/password). and I don’t know how to use oAuth tokens from the CLI (I’m not even sure such a thing exists) 🙂

But yes, you can do it with an oAuth token. Check my previous blog post to get an idea of how to implement oAuth using PHP with twitter 🙂

adbox says:

June 21, 2009 at 2:56 am

there is something similar available called tweetstream (http://www.tweetstream.us) but I’m not entirely sure how it fits in.

Pingback: Investigative journalism on 22 June 09 « The Centre for Investigative Journalism News Blog
Hardeep Khehra says:

June 27, 2009 at 4:27 am

Awesome!

Can you provide an example in case the data stream is interrupted and you have to re-connect?

Also, how would you do the “track” stream using parameters? Does fopen allow for sending the parameters?

Or do I need to use cURL for “track”?

Thanks

arunachalam says:

June 29, 2009 at 12:10 pm

Superb….post…

David says:

July 9, 2009 at 6:17 pm

This runs great, and I really appreciate the post. I’ve been having problems with it this week though: it keeps dropping off. Does anybody have a solution to auto-restart it when it does?

Pingback: Phirehose track filtering works « Tweetalysis Blog
Pingback: The Baydin Blog » VirtualBox Image for Enron Email and Twitter Data Analysis
Matt Apperson says:

January 2, 2010 at 1:02 am

To those asking how to make it re-start should the twitter API fail… use this…
while('1' == '1') {
    $fp = fopen("http://username:password@stream.twitter.com/spritzer.json","r");
    while($data = fgets($fp)) {
        $time = date("YmdH");
        if ($newTime!=$time) {
            @fclose($fp2);
            $fp2 = fopen("{$time}.txt","a");
        }
        fputs($fp2,$data);
        $newTime = $time;
    }
    sleep(1000);
}

It is a little hackish… but it’s simple, and it works.
Yes, the sleep is needed or else you could get a temp IP block from twitter…

Pingback: Twitter Streaming API with Python « maSnun.com
Langdon says:

January 7, 2010 at 9:34 pm

Thanks for posting this… to give some back — those wondering how to use track and follow, see below:

$curl = curl_init();

curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, 'track=#NowPlaying');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, 'http://stream.twitter.com/1/statuses/filter.json');
curl_setopt($curl, CURLOPT_USERPWD, $_CONFIG['twitter']['username'] . ':' . $_CONFIG['twitter']['password']);
curl_setopt($curl, CURLOPT_WRITEFUNCTION, 'progress');
curl_exec($curl);
curl_close($curl);

function progress($curl, $str)
{
    print "$str\n\n";
    return strlen($str);
}

Fenn says:

January 22, 2010 at 7:58 am

There are a few libraries around now that make consuming the stream a lot easier – You can find them on the Twitter API wiki libraries page:

http://apiwiki.twitter.com/Libraries

Including my own library, Phirehose (PHP): http://code.google.com/p/phirehose/ which handles all the new filter methods (ie: track, follow and geo-located tweets) plus all the messiness of auth, reconnecting with TCP failure backoff, etc, etc.

Enjoy your streaming!

Fenn.

Pingback: The Baydin Blog | Email, Startups, and Search » VirtualBox Image for Enron Email and Twitter Data Analysis
Christian says:

August 11, 2010 at 4:09 pm

Hi, does anybody know:

How many people use a specific hashtag (#twitter, #soccer…)?

Christian

Nagy says:

May 31, 2011 at 5:36 pm

Can anybody tell me how to adapt the above code of datacollector.php to work with the new twitter api? Please help. Sorry if the question is simple, but I am very new to this. Thanks in advance
Nagy

Diego says:

May 2, 2012 at 3:21 pm

this is not working at all
