Building a Twitter Filter With Sinatra, Redis, and TweetStream

08 Nov 2009

It's been way too long since my last programming focused blog post, so let's try to rectify this situation:

Background

A couple months ago, Twitter made available their Streaming API. This provides developers with a very efficient way to tap into the public Twitter stream. All you need to do is open and maintain a single HTTP connection, passing in a few filter parameters. Twitter then keeps streaming matching tweets to you. You have the option of either sampling the entire public stream, or passing in a list of keywords and / or user ids to track. In this post I will focus on the latter, but the basic usage remains the same.

My interest was piqued when I came across the excellent TweetStream library, which makes it trivially easy to write a Ruby client application for the Twitter Streaming API. I decided to take this opportunity to play with some other technologies and write a simple web app that displays a subset of tweets, along the lines of cursebird or twistori.

The app I came up with is Twatcher, so go check it out to get an idea of what I'm talking about. It (admittedly very crudely) identifies funny tweets by looking for tweets that contain the word "lol". It then renders matching tweets using a simple UI not unlike that of twitter.com itself, and visually highlights the word "lol" in each tweet for emphasis. Perhaps most importantly, the app uses AJAX to periodically (currently every 10 seconds) pull in new tweets.

Architecture

In the remainder of this post, I will describe the architecture of Twatcher, along with the rationale behind it. I will also share some code snippets that should allow you to follow along and build your own Twitter filter app.

Given that the connection to the Twitter Streaming API has to happen in its own application (let's call it our filter app), outside of the actual web app, we need a way for it to pass tweets to the web app. There are a couple of options here. Obviously we could store the tweets in a database like MySQL, and have the web app read them from there. But given the small schema (only tweets; we don't care about users or any other relational data) and the ephemeral nature of the Twatcher app (at any given time we really only care about the N most recent tweets), this seems like overkill and leads to an unnecessarily write-heavy app. Instead, one of the various in-memory key/value stores seems like a much better fit. I first thought of memcached, but while it would be entirely possible to build this type of app on top of memcached, it's not ideal. When a new tweet comes in from the Streaming API, we need to append it to our in-memory list of tweets. Memcached is a very low-level data store and only supports string values, so we would have to implement lists by serializing them as YAML, JSON, or binary Ruby objects. Either way would mean that instead of writing just the latest tweet, we would always have to re-write the entire list of tweets. Similarly, on the web app side, we would always have to read the entire list, even though we may only be interested in the 5 most recent tweets (say in our AJAX action). Combined, this would lead to a fair amount of overhead on both the networking as well as Ruby processing side.

Luckily, there's another data store that is perfect for this type of app: Redis. On the most basic level it can be thought of as a key/value store like memcached, so it can act pretty much like a drop-in replacement for this. But it also has first class support for basic data structures, such as sets and lists. This means that instead of reading and writing entire lists of tweets, we can append a single tweet at a time, and we can efficiently retrieve the exact number of tweets that we need on the web app side (i.e. 20 for a full page view, and 5 for an AJAX request). Redis is stable, highly performant, and has a solid, extremely easy to use Ruby library. It also supports basic persistence, although we won't need this for our app.

With the filter app and data store out of the way, that leaves the actual web app. Our requirements are very humble: We will only have two actions (one for the full page of tweets and one for AJAX updates), and perhaps a few more trivial actions in the future for things like help pages. Since we're not using a relational database, we don't need any sort of ORM layer. While we could use Ruby on Rails, this would mean shooting sparrows with cannons. For our purposes the Sinatra micro-framework seems like a much better fit.

I'm a big fan of HAML, so we'll use this for our views. Of course there's nothing HAML specific about our app, so you're welcome to use ERB or your template language of choice instead.

We'll use jQuery as our Javascript library, mainly for AJAX requests and basic visual effects (so we can smoothly slide new tweets into the existing page). Once more, our needs are simple, and I'm sure any of the popular Javascript frameworks would be more than up to the challenge. But my personal preference is jQuery.

I won't go much into the deployment side of things, but twatcher.com relies on the usual suspects: Nginx (Apache would work fine as well), Passenger (you're welcome to use Mongrel, Thin, etc.), Capistrano, and God (to start and monitor Redis and our filter app, though I may end up giving Bluepill a try). All of this runs very smoothly on a 256MB VPS slice on Webbynode (and I'm sure just as well on Slicehost or Linode). If necessary, we could easily scale up this app by bringing up additional Sinatra slices and adding HAProxy to the mix (or perhaps even just relying on DNS round robin).

Code

Now that the architectural overview is done, let's take a look at some of the code. This isn't the complete code base that I'm using on the site, but it's a fully functional subset and hopefully enough to demonstrate the overall approach and get you started. Alternatively, you can grab the code from the twatcher-lite Github repository. I will eventually make the complete project (which includes configuration options, RSpec specs, etc.) available on Github as well.

But first a couple of prerequisites: We need to install a bunch of gems. For a production app, I would typically unpack these into the vendor directory, but for now let's just install them system-wide:

sudo gem install tweetstream yajl-ruby ezmobius-redis json haml rack sinatra shotgun

You also need to install and start Redis. It's easy enough, but beyond the scope of this blog post. Simply follow these instructions (I highly recommend the entire Redis article series by the way), but make sure to use the latest Redis release from the official website (currently 1.02) rather than the 1.0 version mentioned in the article.

Filter App

twitter_filter.rb:

This is the standalone filter app that mainly relies on the TweetStream library to retrieve tweets and then pushes them to Redis. In our final app we would want to use the Daemons library to run this app as a proper daemon, but for now you should be able to simply run it directly from the command line. Note that it relies on two additional files below. Simply place all of these into the same folder.

Make sure to set USERNAME and PASSWORD to your actual Twitter credentials. A word of caution: Apparently Twitter only allows a single Streaming API connection for standard accounts, and they will disconnect or blacklist you if you attempt to start multiple connections. I'm using a dedicated Twitter account for production, and my regular Twitter account during development. The actual version of this file that I'm using reads the (environment specific) credentials from a YAML file, but I didn't want to distract from the core functionality for the purpose of this tutorial.

require 'tweetstream'
require File.join(File.dirname(__FILE__), 'tweet_store')

USERNAME = "my_username"  # Replace with your Twitter user
PASSWORD = "my_password"  # and your Twitter password
STORE = TweetStore.new

TweetStream::Client.new(USERNAME, PASSWORD).track('lol') do |status|
  # Ignore replies. Probably not relevant in your own filter app, but we want
  # to filter out funny tweets that stand on their own, not responses.
  if status.text !~ /^@\w+/
    # Yes, we could just store the Status object as-is, since it's actually just a
    # subclass of Hash. But Twitter results include lots of fields that we don't
    # care about, so let's keep it simple and efficient for the web app.
    STORE.push(
      'id' => status[:id],
      'text' => status.text,
      'username' => status.user.screen_name,
      'userid' => status.user[:id],
      'name' => status.user.name,
      'profile_image_url' => status.user.profile_image_url,
      'received_at' => Time.new.to_i
    )
  end
end

tweet_store.rb:

This is a thin abstraction layer on top of Redis that encapsulates both pushing and retrieving tweets. This allows us to keep Redis specific persistence code out of the filter and web apps and also comes in handy for testing (which I'm not getting into in this post), as we can easily swap it out for a mock implementation.

Note how we're using the push_head operation to push a single tweet to Redis, and list_range to retrieve the N most recent tweets.

require 'json'
require 'redis'
require File.join(File.dirname(__FILE__), 'tweet')

class TweetStore
  
  REDIS_KEY = 'tweets'
  NUM_TWEETS = 20
  TRIM_THRESHOLD = 100
  
  def initialize
    @db = Redis.new
    @trim_count = 0
  end
  
  # Retrieves the specified number of tweets, but only if they are more recent
  # than the specified timestamp.
  def tweets(limit=15, since=0)
    @db.list_range(REDIS_KEY, 0, limit - 1).collect {|t|
      Tweet.new(JSON.parse(t))
    }.reject {|t| t.received_at <= since}  # In 1.8.7, should use drop_while instead
  end
  
  def push(data)
    @db.push_head(REDIS_KEY, data.to_json)
  
    @trim_count += 1
    if (@trim_count > 100)
      # Periodically trim the list so it doesn't grow too large.
      @db.list_trim(REDIS_KEY, 0, NUM_TWEETS)
      @trim_count = 0
    end
  end
  
end

tweet.rb:

The Tweet class wraps an individual tweet's data hash and allows us to access the data using method call syntax (tweet.username) rather than hash element references (tweet['username']). It also contains some tweet related functionality, such as generating Twitter user links, highlighting the word "lol", and making URLs clickable.

class Tweet
  
  def initialize(data)
    @data = data
  end

  def user_link
    "http://twitter.com/#{username}"
  end

  # Makes links clickable, highlights LOL, etc.
  def filtered_text
    filter_lol(filter_urls(text))
  end

  private
  
  # So we can call tweet.text instead of tweet['text']
  def method_missing(name)
    @data[name.to_s]
  end
  
  def filter_lol(text)
    # Note that we're using a list of characters rather than just \b to avoid
    # replacing LOL inside a URL.
    text.gsub(/^(.*[\s\.\,\;])?(lol)(\b)/i, '\1<span class="lol">\2</span>\3')
  end
  
  def filter_urls(text)
    # The regex could probably still be improved, but this seems to do the
    # trick for most cases.
    text.gsub(/(https?:\/\/\w+(\.\w+)+(\/[\w\+\-\,\%]+)*(\?[\w\[\]]+(=\w*)?(&\w+(=\w*)?)*)?(#\w+)?)/i, '<a href="\1">\1</a>')
  end
  
end

Web App

twatcher.rb:

This is the actual Sinatra web app. This is the entire app (minus the views), so perhaps now you can see why we're using Sinatra instead of a full-blown Rails app. The views follow below.

Note that our two actions both return tweets. The main difference is that the /latest action (which is used by AJAX requests) only returns up to 5 tweets, and only if they're newer than the specified date. It also omits the layout and specifies a special CSS class named latest. This allows us to initially hide the new tweets and then make them visible using a nice Javascript slide effect.

require 'sinatra'
require 'haml'
require File.join(File.dirname(__FILE__), 'tweet_store')

STORE = TweetStore.new

get '/' do
  @tweets = STORE.tweets
  haml :index
end

get '/latest' do
  # We're using a Javascript variable to keep track of the time the latest
  # tweet was received, so we can request only newer tweets here. Might want
  # to consider using Last-Modified HTTP header as a slightly cleaner
  # solution (but requires more jQuery code).
  @tweets = STORE.tweets(5, (params[:since] || 0).to_i)
  @tweet_class = 'latest'  # So we can hide and animate
  haml :tweets, :layout => false
end

views/layout.haml:

A pretty simple layout.

!!! Strict
%html{:xmlns=> "http://www.w3.org/1999/xhtml", 'xml:lang' => "en", :lang => "en"}
  %head
    %meta{'http-equiv' => "Content-Type", 'content' => "text/html; charset=utf-8"}
    %title twatcher
    %link{:rel => 'stylesheet', :href => '/stylesheets/style.css', :type => 'text/css'}
    %script{:type => 'text/javascript', :src => 'http://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js'}
  %body
    #container
      #content
        = yield

views/index.haml:

The actual HTML content is pretty minimal: A heading and a list of tweets, which is included from a separate file (below) so we can reuse it for the AJAX action. We're also inlining some jQuery code to refresh the tweets every 10 seconds. We insert the new tweets at the beginning, but remember that we're using a CSS class to initially hide them. We then call slideDown to make them visible using a nice slide effect. We also trim the list of tweets at 50 to prevent the page from getting too long.

:javascript
  function refreshTweets() {
    $.get('/latest', {since: window.latestTweet}, function(data) {
      $('.tweets').prepend(data);
      $('.latest').slideDown('slow');
      $('.tweets li:gt(50)').remove();
      
      setTimeout(refreshTweets, 10000);
    });
  }
  $(function() {
    setTimeout(refreshTweets, 10000);
  });
  
%h1 Recent LOL Tweets
%ul.tweets
  = haml :tweets, :layout => false

views/tweets.haml:

Simply renders a list item for each tweet, with some basic CSS for styling purposes. I wouldn't normally hardcode height and width for an img tag (and instead let CSS handle this), but for the purpose of this tutorial I wanted the page to render decently without a style sheet, and the Twitter profile pictures can be pretty large, making it look weird.

We also emit some simple Javascript to record the timestamp of the most recent tweet. We pass this into our AJAX request in index.haml above. An alternative solution (perhaps slightly cleaner from an HTTP perspective) would be to use the Last-Modified HTTP header instead of a Javascript variable, but this would mean messing with date parsing (never fun...) and also result in slightly more complex jQuery code, so I opted for the simpler solution.

- @tweets.each do |tweet|
  %li.tweet{:class => @tweet_class}
    %span.avatar
      %a{:href => tweet.user_link}
        %img{:src => tweet.profile_image_url, :alt => tweet.username, :height => 48, :width => 48}
    %span.main
      %span.text= tweet.filtered_text
      %span.meta== &#8212; #{tweet.name} (<a href="#{tweet.user_link}">@#{tweet.username}</a>)

- if !@tweets.empty?
  :javascript
    window.latestTweet = #{@tweets[0].received_at};

public/stylesheets/style.css:

The stylesheet is pretty basic, but since this blog post is already way too long, I'm not going to reproduce it here. Simply grab the live one instead.

Putting it all together

You should now have a bunch of Ruby files in the same folder, and three HAML files in a views subdirectory. Make sure you have started Redis according to the instructions above. Then open two shells:

In your first shell, start the filter app:

ruby twitter_filter.rb

The app should start and continue running until you hit CTRL+C.

In your second shell, start the web app. You could simply start it using:

ruby twatcher.rb

However, assuming you've installed the Shotgun gem according to the instructions above, I recommend using the following command instead:

shotgun twatcher.rb

This will cause Sinatra to automatically reload modified files during development, similar to the default behavior in Rails.

If everything started successfully, you should now be able to bring up the site in your browser. If you've started the web app by itself, hit port 4567. With Shotgun, hit port 9393 instead.

Conclusion

I hope you can appreciate how little code it took us to implement a complete Twitter filter web application, complete with AJAX updates. I count around 150 lines of code, and this includes plenty of comments and whitespace (granted, including the stylesheet it would be closer to 300 lines).

I also hope I've managed to pique your curiosity about Redis, Sinatra, and the TweetStream library. Many of us (myself included) tend to stick with the tools we're familiar with, such as Rails and MySQL. But often, surprisingly elegant solutions emerge when using better-suited (and often simpler) tools.

Personally, I am excited about adding Redis and Sinatra to my standard toolset. I am also curious about what other types of applications might be able to get away with simple, ephemeral solutions like this. Definitely something worth exploring...

blog comments powered by Disqus