Twitter’s streaming API is still in beta and is a good source for collecting public tweets. Unfortunately, not all of its methods are open to third parties right away (you need to provide written statements and so on). For testing, though, three of them are usable by anyone at this moment: spritzer, track and follow. Spritzer streams a small sample of public tweets to the collecting processes. In this blog post I’ll show you how to collect data from the spritzer API.
Because it is a stream, Twitter keeps the HTTP connection alive indefinitely (using Keep-Alive, until there is a hiccup), so your code has to be ready for a connection that never finishes on its own and can drop at any time. I would also suggest using separate processes: one for collecting the data and writing it (or sending it to a queue to be written), and another for analyzing it. And of course, to minimize bandwidth consumption, use the JSON format. JSON is also easier to handle than XML here, because Twitter separates every tweet with a newline (“\n”) character, so you can read the data line by line, decode each line using json_decode() and do whatever you want.
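The line-by-line decoding described above can be sketched roughly like this. It is only a minimal illustration: the function name decode_tweet_line() and the sample line are made up for the example, and a real tweet carries many more fields than the two printed here.

```php
<?php
// Decode one line of the stream into an array, or return null if the line
// is blank or not valid JSON (for example a line cut off by a dropped connection).
function decode_tweet_line($line)
{
    $line = trim($line);
    if ($line === '') {
        return null;
    }
    $tweet = json_decode($line, true);
    return is_array($tweet) ? $tweet : null;
}

// Made-up sample lines standing in for what fgets() would return from the stream.
$sample = array(
    '{"id":1,"text":"hello world","user":{"screen_name":"alice"}}' . "\n",
    "\n",
);
foreach ($sample as $line) {
    $tweet = decode_tweet_line($line);
    if ($tweet !== null) {
        echo $tweet['user']['screen_name'], ': ', $tweet['text'], "\n";
    }
}
```

Returning null for anything that does not decode keeps the analyzing process from dying on a half-written line.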
Here is how you can create the collector process in PHP:
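<?php
// datacollector.php
// open the spritzer stream using HTTP basic auth (put in your own username and password)
$fp = fopen("http://username:password@stream.twitter.com/spritzer.json", "r");
if ($fp === false) {
    die("could not connect to the streaming API\n");
}

$fp2 = null;      // handle of the hourly file currently being written
$newTime = null;  // hour stamp (YmdH) of the file that is currently open

// every line of the stream is one JSON-encoded tweet
while (($data = fgets($fp)) !== false)
{
    $time = date("YmdH");
    if ($newTime != $time)
    {
        // the hour has changed: close the old file and start a new one
        if ($fp2) fclose($fp2);
        $fp2 = fopen("{$time}.txt", "a");
    }
    fputs($fp2, $data);
    $newTime = $time;
}
?>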
This script writes the data collected from the spritzer streaming API into hourly files (with names like <YmdH>.txt), so in the directory where you are running the script you will see files like 2009062020.txt. There is a special advantage to collecting this way: the file for the current hour is still open and being appended to, so you only ever process files from previous hours. (fopen() does not actually lock the file, so leaving the current one alone is a convention you have to follow yourself.) It makes analyzing the data more hassle free.
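For the analyzing process, picking only the finished hourly files could look something like the sketch below. The function name completed_hour_files() is made up for this example, and it assumes the data files sit in one directory and follow the <YmdH>.txt naming used by the collector.

```php
<?php
// Return the hourly data files in $dir that are safe to analyze, i.e. every
// <YmdH>.txt file except the one for the current hour, oldest first.
// $now is an optional Unix timestamp, mainly useful for testing.
function completed_hour_files($dir, $now = null)
{
    $current = date("YmdH", $now === null ? time() : $now) . ".txt";
    $files = array();
    // glob() may return false on error, so fall back to an empty list
    foreach (glob($dir . "/*.txt") ?: array() as $path) {
        $name = basename($path);
        // keep only names that look like 2009062020.txt
        if (preg_match('/^\d{10}\.txt$/', $name) && $name !== $current) {
            $files[] = $path;
        }
    }
    sort($files); // YmdH names sort chronologically
    return $files;
}

foreach (completed_hour_files(".") as $path) {
    echo "ready for analysis: {$path}\n";
}
```

Because the collector and the analyzer both use date("YmdH"), they should run with the same timezone setting.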
Now run this script in the background with the following command in your terminal:
php datacollector.php &
The reason for appending an “&” at the end of the command is to start the process in the background, so that you don’t have to wait for the script to end to get your shell back. As it is streaming data, the script will keep running indefinitely (until the connection drops). It also consumes very little bandwidth; you can check for yourself.
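Since the connection can drop, one way to keep the collector going is to wrap it in a retry loop that waits a bit longer after each drop. This is only a sketch: the function names reconnect_delay() and run_with_reconnect() and the delay values are my own choices for illustration, not something Twitter prescribes.

```php
<?php
// Seconds to wait before reconnect attempt number $attempt (1, 2, 3, ...):
// doubles each time and is capped at $max. The numbers are illustrative only.
function reconnect_delay($attempt, $max = 240)
{
    return (int) min($max, pow(2, $attempt));
}

// Run $collect (one connect-and-read pass that returns when the stream ends)
// $maxRuns times, sleeping a little longer between consecutive runs.
// $sleep is replaceable so the loop can be tried out without really waiting.
function run_with_reconnect($collect, $maxRuns, $sleep = 'sleep')
{
    for ($attempt = 1; $attempt <= $maxRuns; $attempt++) {
        call_user_func($collect);
        if ($attempt < $maxRuns) {
            call_user_func($sleep, reconnect_delay($attempt));
        }
    }
}

// Usage (not run here): put the body of datacollector.php into a function
// named collect() that returns instead of calling die(), then call:
// run_with_reconnect('collect', PHP_INT_MAX);
```

The delay never resets in this simple version, so after a few drops every reconnect waits the full 240 seconds.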
So I hope this helps developers who are looking for a way to collect data from Twitter’s streaming API via PHP. If you want to track specific keywords, use the “track” API instead, and if you want to follow particular people, use “follow”. Check out Twitter’s documentation of the streaming API for more.
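fopen() can send parameters too if you hand it a stream context, so the same collector approach should work for “track”. The sketch below rests on assumptions you need to check against Twitter’s documentation: that the method lives at track.json next to spritzer.json, and that it takes the keywords as a comma-separated POST field named track. The helper name track_request_options() is made up for the example.

```php
<?php
// Build the HTTP options for a POST request carrying the keywords.
// ASSUMPTION: the track method reads a comma-separated POST field named
// "track"; confirm the field name against Twitter's documentation.
function track_request_options($keywords)
{
    return array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query(array('track' => implode(',', $keywords))),
        ),
    );
}

$options = track_request_options(array('php', 'mysql'));
echo $options['http']['content'], "\n"; // the encoded POST body

// To actually connect (URL assumed by analogy with spritzer.json):
// $context = stream_context_create($options);
// $fp = fopen("http://username:password@stream.twitter.com/track.json", "r", false, $context);
```

Once $fp is open, the rest of the collector loop stays the same.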

12 responses so far ↓
Lenin // June 20, 2009 at 11:49 pm |
Another post on my interest
Thanks millions!
ranacse05 // June 20, 2009 at 11:56 pm |
nice , but i was looking for how to collect a group of followers , waiting for that post ,
Please make it fast .
Ishtiaque Ahmed // June 21, 2009 at 1:01 am |
short and sweet. just too good. can’t wait until they make the streaming live. thanks for the post.
hasin // June 21, 2009 at 1:04 am |
@Ishtiaque: the spritzer, follow and track APIs are open for all;
the others are not available to everyone.
The tremendous “Firehose” API is available to friendfeed, fyi,
that’s why they got all your tweets
Ishtiaque Ahmed // June 21, 2009 at 1:21 am |
thanks for the info Hasin bhai. That’s really cool. One more question, aren’t they going to use OAuth? I thought twitter has set it as the standard way to access their APIs.
hasin // June 21, 2009 at 1:25 am |
@Ishtiaque yes, of course they use OAuth, but I’ve shown the shortcut way with HTTP basic auth, because this process runs in the background as a shell process and you must use your own account for it (which means you know the username/password). I don’t know how to use OAuth tokens from the CLI (and I’m not sure such a thing even exists).
But yes, you can do it with an OAuth token. Check my previous blog post to get an idea of how to implement OAuth using PHP with twitter
adbox // June 21, 2009 at 2:56 am |
there is something similar available called tweetstream (http://www.tweetstream.us) but Im not entirely sure how it plays in.
Investigative journalism on 22 June 09 « The Centre for Investigative Journalism News Blog // June 22, 2009 at 1:39 pm |
[...] collecting data from streaming APIs in twitter « The Storyteller [...]
Hardeep Khehra // June 27, 2009 at 4:27 am |
Awesome!
Can you provide an example in case the data stream is interrupted and you have to re-connect?
Also, how would you do the “track” stream using parameters? Does fopen allow for sending of the parameters?
Or do I need to use cURL for “track”?
Thanks
arunachalam // June 29, 2009 at 12:10 pm |
Superb….post…
David // July 9, 2009 at 6:17 pm |
This runs great, and really appreciate the post. Been having problems with it this week though, keeps dropping off, anybody have a solution to auto restart it when it does?
Phirehose track filtering works « Tweetalysis Blog // November 4, 2009 at 8:14 am |
[...] is the easiest way but that took me a long time to implement. Currently I’m trying to adapt this slightly out of date code so that new files are generated for each hour. My code isn’t throwing any errors but it [...]