Fix accented filenames on Linux with convmv

I’m working on a WordPress-powered site covering politics in Costa Rica. Today, an editor called me up asking why some images weren’t showing up in his browser, Safari 6.1.

The issue turned out to be caused by Unicode equivalence conflicts.

OS X normalises accented characters using Unicode NFD (Normalization Form Canonical Decomposition), whereas Linux and most other operating systems use NFC (Canonical Composition).

The editor had created files that contained an accented í on his Mac, then uploaded those to the WordPress media library. The file was saved to the filesystem on the Linux server as NFD and referenced in the WordPress database also using the original NFD filename. This caused two problems.

The first was that Safari was normalising the URL to NFC, requesting 140201LuisGuillermoSol%C3%ADs.jpg instead of LuisGuillermoSoli%CC%81s.jpg.

The second was that I couldn’t rename the file on the server filesystem. Any attempt at using mv to rename the offending files resulted in cannot stat errors.
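The mismatch behind both problems can be reproduced in a few lines. This is a minimal sketch using Python's standard library; the shortened file name is only an illustration. It shows that the NFC and NFD spellings of the same visible name are different strings and percent-encode to different URLs.

```python
import unicodedata
from urllib.parse import quote

# "Solís.jpg", written with an escape so the encoding of this source
# file cannot affect the demonstration.
name = "Sol\u00eds.jpg"

nfc = unicodedata.normalize("NFC", name)  # í as one code point, U+00ED
nfd = unicodedata.normalize("NFD", name)  # i followed by combining acute, U+0301

print(quote(nfc))          # Sol%C3%ADs.jpg
print(quote(nfd))          # Soli%CC%81s.jpg
print(nfc == nfd)          # False: the two forms are different strings
print(len(nfc), len(nfd))  # 9 10
```

A filesystem that stores names as raw bytes, as Linux filesystems typically do, treats these as two unrelated file names.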

The solution had several parts. I first installed convmv (apt-get install convmv) and used the following command to rename the files in bulk.

convmv --nfc --nosmart --notest -r -i -f UTF-8 -t UTF-8 /path/to/files

After doing that, I was able to rename the files using mv.
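For readers without convmv, the renaming step can be approximated in Python. This is a rough sketch and not a replacement for convmv: the function name normalize_tree and its dry_run flag are made up for this example, and it assumes a Linux filesystem that stores file names as bytes without normalising them.

```python
import os
import unicodedata


def normalize_tree(root, form="NFC", dry_run=True):
    """Rename entries under root whose names are not in the given form.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    nothing is changed on disk.
    """
    renames = []
    # Walk bottom-up so a directory is renamed only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            fixed = unicodedata.normalize(form, name)
            if fixed == name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, fixed)
            if os.path.exists(new):
                continue  # never overwrite an existing entry
            renames.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return renames
```

Calling normalize_tree("/path/to/files") first with the default dry_run=True lists what would change, which mirrors running convmv without --notest.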

Then I manually edited the entries in the WordPress database to remove the accents. Panic over.

To avoid the issue in the long term, I added the following filter to the theme’s functions.php:

add_filter('sanitize_file_name', function($filename) { return remove_accents($filename); }, 10);

remove_accents is a WordPress function.
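To show the idea outside WordPress, here is a small Python approximation. It is not how remove_accents is implemented (WordPress uses its own character tables and handles letters that have no decomposition, which this sketch does not); strip_accents is a hypothetical name.

```python
import unicodedata


def strip_accents(filename):
    """Drop combining marks so accented letters become their base letters."""
    # Decompose first, so a precomposed í becomes i + combining acute.
    decomposed = unicodedata.normalize("NFD", filename)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Sol\u00eds.jpg"))   # Solis.jpg (NFC input)
print(strip_accents("Soli\u0301s.jpg"))  # Solis.jpg (NFD input)
```

Because the input is decomposed before the marks are removed, both normalization forms end up as the same plain ASCII name.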

Automatic routing

Most of the time, standards make life easier. I was writing the routing configuration for a new web app. My configuration looked something like the following. :d is a placeholder for an identifier.

{
  "/files/": {
    "controller": "files",
    "methods": ["GET", "POST", "PUT"]
  },
  "/files/:d": {
    "controller": "sources",
    "methods": ["GET", "DELETE", "PUT"]
  },
  "/sources/:d/files/": {
    "controller": "sourceFiles",
    "methods": ["GET", "POST", "PUT", "DELETE"]
  }
}

And so on.

Is listing the methods really necessary? My application uses URLs to represent resources and groups: file resources, and groups or collections of file resources.

By ending a path with a trailing slash, we make it clear that it represents a collection. Without the trailing slash, it’s ambiguous whether the URL refers to a resource or a resource group.

The original paper on representational state transfer didn’t mention this. I think it’s a useful convention.

From there on it’s just a small step to put some heuristics in our router so that implied methods are wired up.

Routes should generally implement GET, HEAD and OPTIONS. So let’s start our routing logic with that. All routes should implement PUT and DELETE, so that’s easy too.

Note that implementing doesn’t necessarily mean that the method is allowed. For example, if your app doesn’t allow entire collections to be deleted, you could always reply to a DELETE on a collection like /files/ with 405 Method Not Allowed.
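The difference between implemented and allowed can be sketched like this. The function name respond and the use of 501 for methods the router never wired up are assumptions made for the example; the 405 reply carries an Allow header listing the methods that are permitted.

```python
def respond(method, implemented, allowed):
    """Return (status, headers) for one request method on one route."""
    if method not in implemented:
        # The router never wired this method up at all.
        return 501, {}
    if method not in allowed:
        # Wired up, but the application refuses it for this route.
        return 405, {"Allow": ", ".join(sorted(allowed))}
    return 200, {}


implemented = {"GET", "HEAD", "OPTIONS", "POST", "PUT", "DELETE"}
allowed = {"GET", "HEAD", "OPTIONS", "POST"}  # no bulk PUT or DELETE

print(respond("DELETE", implemented, allowed))
# (405, {'Allow': 'GET, HEAD, OPTIONS, POST'})
```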

Routes with a trailing slash are collections, so those should support POST, which will be used to create a new resource and add it to the collection in a non-idempotent way. That last bit means that the action can’t be repeated without side-effects - that is, the end-state of the /files/ collection changes with every request because a new resource is created and added each time.

PUT on the collection would replace the entire thing. Again, although the router would implement the method, you could disallow it.

A URL without a trailing slash represents a single resource. Does it make sense to POST to that? Probably not, so our router would disallow that automatically. A PUT, however, would be implemented, as it’s used to update the resource from the client. You could respond with 401 Unauthorized if you require authentication for users to change things.
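Put together, the heuristics fit in a few lines. This is only a sketch of the rules described above; implied_methods and build_routes are hypothetical names, and the route table shape is borrowed loosely from the configuration shown earlier.

```python
# Methods every route gets, whether it is a collection or a single resource.
BASE_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "PUT", "DELETE"})


def implied_methods(path):
    """Collections (trailing slash) also get POST; single resources do not."""
    if path.endswith("/"):
        return BASE_METHODS | {"POST"}
    return BASE_METHODS


def build_routes(controllers):
    """Expand a {path: controller} mapping into a full route table."""
    return {
        path: {"controller": controller,
               "methods": sorted(implied_methods(path))}
        for path, controller in controllers.items()
    }


routes = build_routes({"/files/": "files", "/files/:d": "sources"})
print(routes["/files/"]["methods"])
# ['DELETE', 'GET', 'HEAD', 'OPTIONS', 'POST', 'PUT']
print(routes["/files/:d"]["methods"])
# ['DELETE', 'GET', 'HEAD', 'OPTIONS', 'PUT']
```

With this in place the configuration only has to map paths to controllers; the method lists fall out of the trailing slash.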

So there you have it. Routing heuristics for the peoples.

Automating choropleth mapmaking in the newsroom

Last weekend I launched Tuanis, a web application for easily producing choropleth maps using Google Spreadsheets and a bit of option toggling.

Here’s one example showing the tons of household and commercial waste produced per month in each canton.

The project started like this. For the past few months I’ve been working as a web developer at La Nación in Costa Rica. A few weeks ago, I made an SVG choropleth map for recycling statistics in Costa Rica. I wrote a Makefile that converted shapefiles to GeoJSON using GDAL’s ogr2ogr, then to TopoJSON.

On the browser side, I used d3 to generate the SVG from the built TopoJSON file on every client load and munge in the statistics from a CSV file that my journalist colleague Hassel Fallas produced. Obviously, this process is extremely inefficient. Never mind the fact that I kept thinking it could all be so easily automated.

While figuring out the automation, I came across and realised that not only was this a better way to produce an SVG, but that I could automate the entire process so that journalists can generate and embed maps themselves.

My new Makefile runs against the shapefiles and produces an optimized SVG, weighing in at about 70 KB. The app’s single HTML page loads the SVG using an <object> tag, which is then manipulated using JavaScript to change the fill colors when statistical data is loaded.

How is statistical data loaded? That’s the best bit. All you have to do is create a Google Spreadsheet and publish it to the Web. Paste the document URL into Tuanis and hit the load button. When it loads, select the canton ID and statistic fields from the dropdowns.
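The column selection step can be illustrated with plain CSV. This is a sketch in Python rather than the app's own JavaScript, it does not show how Tuanis actually reads the spreadsheet, and the column names canton_id and waste_tons are invented for the example.

```python
import csv
import io


def read_statistic(csv_text, id_field, value_field):
    """Map each region ID to a numeric value, skipping unusable rows."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(value_field) or "").strip()
        if not raw:
            continue  # empty cell
        try:
            result[row[id_field].strip()] = float(raw)
        except ValueError:
            continue  # non-numeric cell such as "n/a"
    return result


sample = "canton_id,waste_tons\n101,1250.5\n102,\n103,n/a\n104,87\n"
print(read_statistic(sample, "canton_id", "waste_tons"))
# {'101': 1250.5, '104': 87.0}
```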

We use the HTML5 History API to change the URL as you manipulate the controls, so every time you change a setting or load a spreadsheet the URL changes too and anyone you send it to will see the exact same map.
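The underlying idea is that the whole map state round-trips through the URL. A minimal sketch in Python, with made-up parameter names (sheet, id, value, colors) and a placeholder address; the app itself does this in the browser.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def state_to_url(base, state):
    """Serialise the current settings into a shareable URL."""
    return base + "?" + urlencode(sorted(state.items()))


def url_to_state(url):
    """Recover the settings from a URL someone was sent."""
    return dict(parse_qsl(urlsplit(url).query))


state = {
    "sheet": "https://example.com/pub?key=abc",  # placeholder spreadsheet URL
    "id": "canton_id",
    "value": "waste_tons",
    "colors": "YlGn",
}
url = state_to_url("https://example.com/map", state)
print(url_to_state(url) == state)  # True
```

As long as every control writes to the state and the state is always serialised, any URL copied from the address bar reproduces the same map.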

As an added bonus, these permanent URLs enable the map to be easily embedded in any HTML page. Just grab the embed code from Tuanis and paste it into your application.

For the nerdier, the color sets come from a system called ColorBrewer, developed by Cynthia Brewer at Pennsylvania State University. In turn, we’re using Chroma.js to provide the ColorBrewer color sets and manipulate the scales.
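The core of a choropleth is assigning each value to a color class. Below is a sketch of the simplest scheme, equal intervals, written in Python for illustration; it is not the scale logic the library uses. The hex values are intended to match ColorBrewer's 5-class YlGn scheme and should be checked against ColorBrewer before being relied on.

```python
# Assumed to be ColorBrewer's 5-class YlGn palette, light to dark.
YLGN_5 = ["#ffffcc", "#c2e699", "#78c679", "#31a354", "#006837"]


def equal_interval_colors(values, palette):
    """Assign each key a palette color by splitting the range evenly."""
    low, high = min(values.values()), max(values.values())
    span = high - low
    classes = len(palette)
    colors = {}
    for key, value in values.items():
        if span == 0:
            index = 0  # all values identical: use the first class
        else:
            # Clamp so the maximum falls in the last class, not past it.
            index = min(int((value - low) / span * classes), classes - 1)
        colors[key] = palette[index]
    return colors


print(equal_interval_colors({"101": 0, "102": 50, "103": 100}, YLGN_5))
# {'101': '#ffffcc', '102': '#78c679', '103': '#006837'}
```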

Oh, and the name “tuanis” means “awesome” in Central American slang.

Community detection with the offshore leaks data

Someone on the Internet ran an algorithm using iGraph that automatically assigns nodes to clusters based on whether they seem to be in a real-life community with each other or not.

What does this mean? A good way to understand this is using your friends on Facebook. As you go through life you meet people in groups, and these groups form separate clusters or communities. For example, your friends from high school form one cluster, your friends from college another cluster. When you start a new job all the new colleagues you add on Facebook become a new cluster.

One of the ways to visualize this is by clumping together nodes that are part of the same cluster and assigning them the same color. That’s what this person has done.
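To give a feel for how nodes end up in clusters, here is a toy version of one family of methods, label propagation. It is not necessarily the algorithm that was run on the leaks data, and it is written in plain Python rather than with iGraph. Real implementations visit nodes in random order and break ties randomly; this sketch uses a fixed order and a fixed tie-break so the result is repeatable, and the outcome depends on those choices.

```python
from collections import Counter


def label_propagation(edges, max_rounds=100):
    """Give each node the label most common among its neighbours."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            counts = Counter(labels[n] for n in neighbours[node])
            best = max(counts.values())
            # Deterministic tie-break: largest label among the most common.
            choice = max(l for l, c in counts.items() if c == best)
            if labels[node] != choice:
                labels[node] = choice
                changed = True
        if not changed:
            break
    return labels


# Two triangles joined by a single edge between nodes 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
print(label_propagation(edges))
# {0: 2, 1: 2, 2: 2, 3: 5, 4: 5, 5: 5}
```

Nodes sharing a label form one cluster, which is what would then be drawn in a single color.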

For example, the majority of the addresses in Mexico are all in San Pedro Garza Garcia, the wealthy suburb of Monterrey where Dionisio Garza Medina lives. Those addresses and the Grupo ALFA subsidiaries in the database, along with Garza Medina himself, are part of the same cluster.

If there are other clusters in Mexico then by visualizing them in this way we’d be able to see fairly quickly where they intersect, allowing you to see the big picture.

It gets even more interesting when you add time series. Because we have dates for the roles, we can build a progressive series of visualizations that show how the communities wax, wane, merge and divide over time.

For example, if the progression of visualizations shows a cluster with a strong presence in Switzerland exploding in size at some point in time, we can dive into the data and see that it happened when person X became the director of one of the Swiss entities. That could indicate that X was the go-between responsible for dealing with Portcullis TrustNet and that he or she hooked up all the other companies.

Sync your hosts file to the Android emulator with aemu

The Android emulator has become a lot more usable since the introduction of HAXM, but there are still a few annoyances, like the fact that the host OS’s ‘hosts’ file isn’t used during name resolution.

Editing the hosts file on the emulator itself is very difficult, time-consuming and error-prone. I really don’t want to have to do that every time I need to use the emulator to test a local host.

So I’ve written a bunch of tools for the emulator called aemu and one of the packaged tools is called aemu-hosts. It allows you to launch a virtual device and sync your hosts file to it in one command.

Also included is aemu-sms, which I’m using as a hacky way to ‘paste’ things into the running virtual device, as there’s no normal copy/paste functionality.

Resource groups in REST

Quoted from Programmers.

A URL that doesn’t end in a slash names a resource, while one that does is a resource group.

A GET of a URL with a slash on the end is supposed to list the resources available.

GET /* List all the dogs resources */

A PUT on a URL with a slash is supposed to replace all the resources.

PUT /* Replace all the dogs resources */

A DELETE on a URL with a slash is supposed to delete all the resources.

DELETE /* Deletes all the dogs resources */

A POST on a URL with a slash is supposed to create a new resource that can then be subsequently accessed.

POST /* Creates a new dogs resource (notice singular) */

To be conformant the new resource should be in this directory and the server should respond with 201 Created and a Location: /dogs/1 header field (or just 202 Accepted if the resource can’t be created before the response and you want to show a progress page).
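The collection behaviour described above can be sketched as an in-memory handler. This is an illustration only: the Collection class is hypothetical, it covers just the URL with the trailing slash, and the status codes chosen for GET, PUT and DELETE (200 and 204) are assumptions rather than part of the quoted answer.

```python
class Collection:
    """Handles requests to a resource group URL such as /dogs/."""

    def __init__(self, base):
        self.base = base
        self.items = {}
        self.next_id = 1

    def handle(self, method, body=None):
        if method == "GET":
            return 200, {}, list(self.items)  # list member URLs
        if method == "POST":
            url = f"{self.base}{self.next_id}"
            self.next_id += 1
            self.items[url] = body
            return 201, {"Location": url}, body  # new member created
        if method == "PUT":
            # Replace the whole group with the supplied list of members.
            self.items = {f"{self.base}{i}": item
                          for i, item in enumerate(body, 1)}
            self.next_id = len(self.items) + 1
            return 200, {}, list(self.items)
        if method == "DELETE":
            self.items.clear()
            return 204, {}, None
        return 405, {"Allow": "DELETE, GET, POST, PUT"}, None


dogs = Collection("/dogs/")
print(dogs.handle("POST", {"name": "Rex"}))
# (201, {'Location': '/dogs/1'}, {'name': 'Rex'})
print(dogs.handle("GET"))
# (200, {}, ['/dogs/1'])
```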

Get the Android emulator running mega fast

This works on Mac, Windows and Linux.

Run android in the terminal to open the SDK application. Make sure Intel x86 Emulator Accelerator (HAXM) underneath Extras is installed. If not, install it. Then go to Tools → Manage AVDs….

Make sure the CPU/ABI field of your virtual device is set to Intel Atom (x86). You can also tick Use Host GPU in the settings for even more acceleration. Select the device from the list and click Start….

If you see HAX is not working in the log output then you need to install the Intel Hardware Acceleration Execution Manager.

Once you’ve done that, try starting the virtual device again. You should now see HAX is working and emulator runs in fast virt mode in the log output.

A bunch of things Meteor does right at first glance

Here’s what I think after reading the docs, with most bits liberally paraphrased from there.

Not just the same language, but the same API on both client and server

Remember the days when you thought you were so awesome because you were using JavaScript on both the client and the server? Hah!

Meteor wraps things up nicely for you so you get the same API and don’t waste time, for example, choosing separate routers for the client and the server. Meteor Router provides the same API everywhere.

Granted, we’re not all the way there yet. Templating doesn’t yet work on the server side, for example. But at least the goal is there:

A future version of Meteor will also send HTML to web browsers on initial page load. The Meteor templating system was designed specifically to support this use case.

Automatic resource bundling

Meteor gathers all JavaScript files in your tree with the exception of the server and public subdirectories for the client. It minifies this bundle and serves it to each new client.

CSS files are gathered together as well: the client will get a bundle with all the CSS in your tree (again, excluding the server and public subdirectories).

Templates are converted into JavaScript functions, available under the Template namespace in your code.

Eventual consistency

Every Meteor client includes an in-memory database cache. The server publishes sets of JSON documents, and the client subscribes to those sets. As documents in a set change, the server patches each client’s cache.

When a client issues a write to the server, it also updates its local cache immediately, without waiting for the server’s response. This means the screen will redraw right away. If the server accepts the update then the client got a head start on the change and didn’t have to wait for the round trip to update its own screen. If the server rejects the change, Meteor patches up the client’s cache with the server’s result.

Yes, patches. It sends a fragment of the dataset containing the correct data rather than the whole thing.
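The write path can be modelled in a few lines. This is a conceptual sketch in Python, not Meteor's API: the Server and Client classes, the rule that negative values are rejected, and the synchronous call standing in for the network round trip are all assumptions made for the example.

```python
class Server:
    def __init__(self, docs):
        self.docs = dict(docs)

    def write(self, key, value):
        """Accept non-negative values; reply with the authoritative fragment."""
        if value >= 0:
            self.docs[key] = value
        return {key: self.docs[key]}  # only the affected field, not everything


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = dict(server.docs)

    def write(self, key, value):
        self.cache[key] = value        # optimistic local update
        shown = dict(self.cache)       # what the screen would redraw with
        patch = self.server.write(key, value)
        self.cache.update(patch)       # server's answer wins
        return shown


client = Client(Server({"votes": 1}))
print(client.write("votes", 5), client.cache)   # {'votes': 5} {'votes': 5}
print(client.write("votes", -1), client.cache)  # {'votes': -1} {'votes': 5}
```

In the second call the screen briefly shows the rejected value before the patch restores the server's version.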

Reactive programming

When there’s a change to data used by a template rendering function, it’s re-run. DOM nodes are then updated in-place, no matter where they were inserted on the page. It’s completely automatic.

Meteor normally batches up any needed updates and executes them only when your code isn’t running. That way, you can be sure that the DOM won’t change out from underneath you.
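The batching idea can be shown with a toy model. This is not Meteor's implementation: the Scheduler and ReactiveVar classes are hypothetical, and dependencies are registered by hand here, whereas Meteor tracks them automatically.

```python
class Scheduler:
    """Collects invalidated computations and reruns them on flush."""

    def __init__(self):
        self.pending = []

    def invalidate(self, computation):
        if computation not in self.pending:  # rerun each at most once
            self.pending.append(computation)

    def flush(self):
        pending, self.pending = self.pending, []
        for computation in pending:
            computation()


class ReactiveVar:
    def __init__(self, value, scheduler):
        self.value = value
        self.scheduler = scheduler
        self.dependents = []

    def get(self):
        return self.value

    def depend(self, computation):
        self.dependents.append(computation)

    def set(self, value):
        if value == self.value:
            return
        self.value = value
        for computation in self.dependents:
            self.scheduler.invalidate(computation)


scheduler = Scheduler()
name = ReactiveVar("", scheduler)
renders = []
render = lambda: renders.append(name.get())  # stands in for a template
name.depend(render)

name.set("a")
name.set("b")
print(renders)   # []: nothing reruns while our code is still running
scheduler.flush()
print(renders)   # ['b']: one rerun, with the latest value
```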