twitter’s streaming API is still in beta, but it’s a good source for collecting public tweets. unfortunately, not all of its methods are instantly usable by third parties (you need to provide written statements and so on). for testing, though, three of these streaming APIs are currently open to anyone: spritzer, track and follow. spritzer streams a tiny sample of public tweets to the collecting process. in this blog post i’ll show you how to collect data from the spritzer API.
as it is stream data, twitter keeps the HTTP connection “alive” indefinitely (until any hiccup, via Keep-Alive) – so your code must take care of that. i would also suggest making separate processes for collecting the data and writing it (or sending it to a queue to be written), and for analyzing it. and of course, use the json format to minimize bandwidth consumption. json is also easier to parse than XML here, as twitter separates every tweet with a newline (“\n”) character 🙂 – so you can read the data line by line, decode each line with json_decode() and do whatever you want.
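to make the line-by-line idea concrete, here is a minimal sketch. the helper name parseTweetLine() is mine, not twitter’s, and the “text”/“user” field names are assumptions about the tweet payload of that era:

```php
<?php
// Each line from the stream is one JSON-encoded tweet; decode it
// into an associative array and skip anything that isn't valid JSON.
function parseTweetLine($line)
{
    $tweet = json_decode(trim($line), true); // true => assoc array
    if ($tweet === null) {
        return null; // blank or malformed line
    }
    return $tweet;
}

// Example with a minimal fake tweet line:
$line = '{"text":"hello world","user":{"screen_name":"example"}}' . "\n";
$tweet = parseTweetLine($line);
echo $tweet['user']['screen_name'] . ': ' . $tweet['text']; // example: hello world
```

you would call parseTweetLine() on every fgets() result from the stream (or from a saved data file) and ignore the nulls.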
here is how you can create the collector process in php
<?php
// datacollector.php
$fp = fopen("http://username:password@stream.twitter.com/spritzer.json", "r");
$newTime = null;
$fp2 = null;
while ($data = fgets($fp)) {
    $time = date("YmdH");
    if ($newTime != $time) {
        if ($fp2) fclose($fp2); // close the previous hour's file
        $fp2 = fopen("{$time}.txt", "a");
    }
    fputs($fp2, $data);
    $newTime = $time;
}
?>
this script will write the data collected from the spritzer streaming API into hourly files (with names like <YmdH>.txt). so in the directory where you are running this script you will see hourly data files, like 2009062020.txt. there is a special advantage to collecting this way – as the current file remains open for writing (hence LOCKED), you will process files only for previous hours. it makes analyzing the data more hassle free 🙂
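a tiny helper for the analyzer process could look like this – the function name previousHourFile() is hypothetical, but the filename format matches the collector above:

```php
<?php
// Hypothetical helper for the analyzer process: compute the file name
// for the hour BEFORE the given timestamp, since the current hour's
// file is still open (and locked) by the collector.
function previousHourFile($timestamp)
{
    return date("YmdH", $timestamp - 3600) . ".txt";
}

// For example, at 2009-06-20 20:15 the analyzer would safely read:
echo previousHourFile(mktime(20, 15, 0, 6, 20, 2009)); // 2009062019.txt
```

running the analyzer from cron once an hour and feeding it previousHourFile(time()) keeps the two processes from ever touching the same file.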
now run this script in the background via the following command in your terminal
php datacollector.php &
the reason for appending an “&” at the end of the command is to start the process in the background, so you don’t have to wait for the script to end to get your shell back. as it is streaming data, the script will run indefinitely. and it consumes very minimal bandwidth 🙂 you can check yourself.
so i hope this will help those developers who are looking for a solution to collect data from twitter’s streaming API via PHP. if you want to track any specific keywords, use the “track” API instead 🙂. and if you want to follow some particular person, use “follow”. check out twitter’s documentation of the streaming API for more 🙂
Another post on my interest 🙂
Thanks millions!
nice, but i was looking for how to collect a group of followers. waiting for that post –
please make it fast.
short and sweet. just too good. can’t wait until they make the streaming live. thanks for the post.
@Ishtiaque – the spritzer, follow and track APIs are open for all 🙂 others are not available for everyone.
The tremendous “Firehose” API is available to friendfeed fyi 🙂 that’s why they got all your tweets
thanks for the info Hasin bhai. That’s really cool. One more question: aren’t they going to use oAuth? I thought twitter had set it as the standard way to access their APIs.
@Ishtiaque yes, of course they use oAuth, but i’ve shown the shortcut way with HTTP basic auth. because this process will run in the background as a shell process, you must use your own account (which means you know the username/password) – and i don’t know (and am not even sure anything like it exists) how to use oAuth tokens from the CLI 🙂
But yes, you can do it with an oAuth token. Check my previous blog post to get an idea of how to implement oAuth with PHP on twitter 🙂
there is something similar available called tweetstream (http://www.tweetstream.us) but I’m not entirely sure how it plays in.
Awesome!
Can you provide an example in case the data stream is interrupted and you have to re-connect?
Also, how would you do the “track” stream using parameters? Does fopen allow for sending of the parameters?
Or do I need to use cURL for “track”.
Thanks
Superb….post…
This runs great, and I really appreciate the post. I’ve been having problems with it this week though; it keeps dropping off. Anybody have a solution to auto-restart it when it does?
To those asking how to make it restart should the twitter API fail… use this…
$newTime = null;
$fp2 = null;
while (true) {
    $fp = fopen("http://username:password@stream.twitter.com/spritzer.json", "r");
    while ($data = fgets($fp)) {
        $time = date("YmdH");
        if ($newTime != $time) {
            if ($fp2) fclose($fp2);
            $fp2 = fopen("{$time}.txt", "a");
        }
        fputs($fp2, $data);
        $newTime = $time;
    }
    fclose($fp);
    sleep(1000);
}
It is a little hackish… but it’s simple, and it works.
Yes, the sleep is needed or else you could get a temporary IP block from twitter…
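instead of the fixed sleep above, a common refinement is to back off exponentially after each failed connect. this is only a sketch: the helper name nextBackoff(), the 10-second start and the 240-second cap are my assumptions, not official twitter values:

```php
<?php
// Hypothetical backoff helper: double the wait after each failure,
// up to an assumed ceiling of 240 seconds.
function nextBackoff($current)
{
    $max = 240;                      // assumed cap in seconds
    return min($current * 2, $max);  // double the delay, but never exceed the cap
}

$delay = 10; // assumed starting delay in seconds
// The retry loop would then look roughly like:
// while (true) {
//     $fp = @fopen("http://username:password@stream.twitter.com/spritzer.json", "r");
//     if ($fp) { /* ... read lines as in the post ... */ $delay = 10; }
//     sleep($delay);
//     $delay = nextBackoff($delay);
// }
echo nextBackoff(10); // 20
```

resetting $delay to its starting value after a successful connection keeps reconnects fast when the hiccup was only momentary.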
Thanks for posting this… to give some back — those wondering how to use track and follow, see below:
$curl = curl_init();
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, 'track=#NowPlaying');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, 'http://stream.twitter.com/1/statuses/filter.json');
curl_setopt($curl, CURLOPT_USERPWD, $_CONFIG['twitter']['username'] . ':' . $_CONFIG['twitter']['password']);
curl_setopt($curl, CURLOPT_WRITEFUNCTION, 'progress');
curl_exec($curl);
curl_close($curl);

function progress($curl, $str)
{
    print "$str\n\n";
    return strlen($str);
}
There are a few libraries around now that make consuming the stream a lot easier – You can find them on the Twitter API wiki libraries page:
http://apiwiki.twitter.com/Libraries
Including my own library, Phirehose (PHP): http://code.google.com/p/phirehose/ which handles all the new filter methods (i.e. track, follow and geo-located tweets) plus all the messiness of auth, reconnecting with TCP failure backoff, etc., etc.
Enjoy your streaming!
Fenn.
Hi, somebody knows :
How many people use a specific hashtag(#twitter,#soccer….)?
Christian
Can anybody tell me how to adapt the above datacollector.php code to work with the new twitter API? Please help – sorry if the question is simple, but I am very new to this. Thanks in advance
Nagy
this is not working at all