Rails And JSON Containing Unicode Characters

As I mentioned in a previous blog post, Rails 2.1 natively supports incoming JSON requests. Unfortunately, it still struggles with JSON data containing non-ASCII characters.

According to the JSON spec, JSON fully supports UTF-8 encoded text, so with a few exceptions it should not be necessary to escape non-ASCII characters with \u Unicode escape sequences. However, many JSON libraries appear to escape all non-ASCII text in this fashion. This in itself should not be a problem, but ActiveSupport::JSON currently does not properly parse JSON containing \u escapes: the decoded strings contain literal \u escape sequences rather than the desired UTF-8 encoded characters. This is especially confusing since ActiveSupport::JSON itself encodes all non-ASCII characters as \u escapes, so one might expect the reverse transformation to yield the original data. The behavior is likely explained by an odd implementation choice for its decoder: rather than using the json (or json-pure) library, it converts the JSON data to YAML and then uses the YAML library to decode the data into Ruby objects.
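To make the expected behavior concrete, here is a minimal sketch that uses the standalone json library (not ActiveSupport) and assumes Ruby 1.9 or later, where strings carry an encoding. The variable names are only illustrative. A conforming parser should produce the same Ruby string whether the character arrives escaped or as raw UTF-8:

```ruby
require 'json'

# Single quotes keep the backslash sequence literal, so this is the
# JSON text exactly as an escaping client would send it.
escaped = '{"name":"G\u00fcnter"}'

# The same document with the character sent as raw UTF-8.
raw = '{"name":"Günter"}'

decoded_escaped = JSON.parse(escaped)
decoded_raw     = JSON.parse(raw)

# Both inputs should decode to the same hash.
puts decoded_escaped == decoded_raw
```

This equivalence is what the stock ActiveSupport decoder fails to provide for the escaped form.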

Monkey-patching to the rescue! I decided to replace ActiveSupport::JSON::decode with an implementation that uses the json library. The easiest way is to stick the following code into a file named something like activesupport_json_unicode_patch.rb inside the config/initializers/ directory, where Rails will automatically pick it up.

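require 'json'

# Override ActiveSupport's YAML-based decoder with the json library's parser,
# which turns \u escape sequences into proper UTF-8 characters.
module ActiveSupport
  module JSON
    def self.decode(json)
      ::JSON.parse(json)
    end
  end
end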
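One caveat with the patch as written: malformed input now surfaces as the json library's JSON::ParserError. If calling code rescues ActiveSupport::JSON::ParseError (which, as far as I can tell, is what the stock decoder raises), a variant along these lines may be safer. Treat it as a sketch: the ParseError definition below is a stand-in so the snippet runs outside Rails, and it rests on an assumption about ActiveSupport rather than something checked against every version.

```ruby
require 'json'

module ActiveSupport
  module JSON
    # Stand-in for the error class ActiveSupport is assumed to define.
    # Inside Rails the constant already exists and this is skipped.
    unless const_defined?(:ParseError)
      class ParseError < StandardError; end
    end

    def self.decode(json)
      ::JSON.parse(json)
    rescue ::JSON::ParserError => e
      # Translate the json library's error into the ActiveSupport one,
      # so existing rescue clauses keep working.
      raise ParseError, "Invalid JSON string: #{e.message}"
    end
  end
end
```

With this variant, ActiveSupport::JSON.decode('{"foo":') raises ActiveSupport::JSON::ParseError instead of leaking the json library's own exception.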

You can verify the fix by adding a test case (I added a file named activesupport_json_test.rb to the test/unit/ directory):

require File.dirname(__FILE__) + '/../test_helper'
 
class ActiveSupportJsonTest < Test::Unit::TestCase
 
  def test_json_decoding
    # Single quotes keep the \u sequences literal, just as they would arrive over the wire.
    unicode_escaped_json = '{"foo":"G\u00fcnter","bar":"El\u00e8ne"}'
    hash = ActiveSupport::JSON.decode(unicode_escaped_json)
    assert_equal({'foo' => 'Günter', 'bar' => 'Elène'}, hash)
  end
 
end

This test should fail without the patch and pass after adding it.

In addition to fixing the JSON / Unicode problem, this patch should also provide a nice speed boost, as we’re replacing the somewhat roundabout YAML-based JSON decode method with a direct parser (particularly if you’re using the native json implementation rather than json-pure).

3 Responses to “Rails And JSON Containing Unicode Characters”

  1. glenn Says:

    Nicely done. I get the feeling there is still a lot most of us (myself especially) have to learn about doing i18n properly in Rails 2.1.

  2. DigitalHobbit Says:

    I’m definitely still learning how to do i18n with Ruby / Rails as I go along.

    As a recovering Java developer, I have to admit that Java absolutely nailed i18n, pretty much right from the beginning. Strings are always proper Unicode strings, as opposed to Ruby’s (at least pre-1.9) fancy byte arrays. There’s built-in support to decode text encoded in all the common encodings into Java Strings, solid support for resource bundles and locale awareness, and more.

    Oh well, I’m still having more fun with Ruby… :)
    (and it looks like both the Rails community and Ruby itself are committed to improving the current i18n situation)

  3. Darren Says:

    I had a similar problem with the flickraw library.
    Internally it used YAML to parse JSON which completely failed on Unicode.

    In my case I modified the library to use JSON.
    Will email the author once I’m happy it’s not created any new problems.
