Eric Rochester

Omeka, Neatline, Mac, development, oh my!

This is cross-posted at The Scholars’ Lab Blog.

At the Scholars’ Lab, we’re big advocates of Open Source. All of our projects are freely and openly available on GitHub, and we’re always more than happy to accept pull requests. We’d like to empower everyone to contribute to our projects as much as they’re able to and comfortable with.

Unfortunately, one of our flagship projects, Neatline, isn’t easy to contribute to. There are a number of reasons for this, but one is that the development environment is not trivial to get set up. In order to address this and make it easier for others to contribute, we’ve developed an [Ansible][ansible] playbook that takes a not-quite-stock Mac and sets up an instance of Omeka with the Neatline plugin available, as well as all the tools necessary for working on Neatline.

[Ansible][ansible] is a system for setting up and configuring systems. It’s often used to set up multiple servers, for instance a database server and a static web server, both working with a dynamic web application deployed on

Mastering Clojure Data Analysis

After a few delays, I’m pleased to announce the release of Mastering Clojure Data Analysis.

This book is a case study of ten different data analysis topics using Clojure. It applies a number of Clojure and Java libraries to an assortment of data analysis and machine learning techniques.

To give you a taste of what it covers, here is the table of contents:

  1. Network Analysis – The Six Degrees of Kevin Bacon
  2. GIS Analysis – Mapping Climate Change
  3. Topic Modeling – Changing Concerns in the State of the Union Addresses
  4. Classifying UFO Sightings
  5. Benford’s Law – Detecting Natural Progressions of Numbers
  6. Sentiment Analysis – Categorizing Hotel Reviews
  7. Null Hypothesis Tests – Analyzing Crime Data
  8. A/B Testing – Statistical Experiments for the Web

Software Development for the MA Humanities Student

This was originally posted on the Scholars’ Lab blog. I’ve cross-posted it here.

This is not a transcript of a brief panel talk I gave for the UVa Graduate English Student Association Career Panel. It’s based on what I hope to say, but I’m actually writing this before the event so it (and its links) can be available beforehand.

About me

I’ve been interested in two things for about as long as I can remember: computers and literature. These intersected a little in science fiction and fantasy, but largely, the two obsessions remained strangely separate. I’d spent a lot of time reading, both “literature” and “trash”; but I’d also enjoyed playing computer games and trying to program my own.

It wasn’t until halfway through my PhD program at

A Simple Dataflow System

This is a recipe that I wrote for the Clojure Data Analysis Cookbook. However, it didn’t make it into the final book, so I’m sharing it with you today.

When working with data, it’s often useful to have a computer language or DSL that allows us to express how data flows through our program. The computer can then decide how best to execute that flow, whether it should be spread across multiple cores or even multiple machines.

This style of programming is called dataflow programming. There are a couple of different ways of looking at dataflow programming. One way describes it as being like a spreadsheet. We declare relationships between cells, and a change in one cell percolates through the graph.
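To make the spreadsheet analogy concrete, here is a minimal sketch. It is not the recipe’s own implementation; it assumes plain Clojure atoms standing in for cells and `add-watch` standing in for the declared relationship. The names `price`, `quantity`, `total`, and `recompute-total!` are hypothetical, chosen only for illustration.

```clojure
;; Assumption: atoms play the role of spreadsheet cells.
(def price (atom 10))
(def quantity (atom 3))
(def total (atom nil)) ; derived cell, filled in below

(defn recompute-total!
  "Recalculate the derived cell from the current input cells.
  Accepts (and ignores) the four arguments a watch function receives."
  [& _]
  (reset! total (* @price @quantity)))

;; Declare the relationship: total depends on price and quantity.
(add-watch price ::total recompute-total!)
(add-watch quantity ::total recompute-total!)
(recompute-total!) ; compute the initial value

;; A change in one cell percolates to the dependent cell.
(reset! quantity 5)
@total ; => 50
```

A real dataflow system would also decide where and when each recalculation runs. This sketch simply recomputes on the thread that made the change.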

Parallel IO with mmap

This is a recipe that I wrote for the Clojure Data Analysis Cookbook. However, it didn’t make it into the final book, so I’m sharing it with you today.

Parallelizing the pure parts of processes, the parts that don’t involve side effects like reading input files, is relatively easy. But once disk IO enters the picture, things become more complicated. There’s a reason for that: when reading from a disk, all threads are inherently contending for one resource, the disk.

There are ways to mitigate this, but ultimately it comes down to working with the disk and the processing requirements to get the best performance we can from a single, sequential process. So to be completely honest, the title of this post is a little misleading. We can’t parallelize IO on a single, shared resource. But there

Downloading Data in Parallel

This is a recipe that I wrote for the Clojure Data Analysis Cookbook. However, it didn’t make it into the final book, so I’m sharing it with you today. If you like this, check out the book.

Sometimes when getting resources, we have to download them from many URLs. Doing that sequentially for one or two sources is fine, but if there are too many, we will really want to make better use of our Internet connection by downloading several at once.

This recipe does that. It chunks a sequence of URLs and downloads each block in parallel. It uses the http.async.client library to perform the download asynchronously, and we’ll simply manage how we trigger those jobs.
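The chunk-then-download idea can be sketched with core Clojure alone. This is an illustration under stated assumptions, not the recipe itself: it uses `future` instead of http.async.client, and the function name `fetch-in-blocks` and its `fetch` parameter are hypothetical.

```clojure
(defn fetch-in-blocks
  "Split urls into blocks of block-size and fetch each block in parallel,
  waiting for one block to finish before starting the next.
  fetch is any one-argument function (an assumption of this sketch),
  so the download mechanism can be swapped out."
  [fetch block-size urls]
  (doall
    (mapcat (fn [block]
              ;; Start one future per URL in the block...
              (let [jobs (mapv #(future (fetch %)) block)]
                ;; ...then wait for every job in the block to finish.
                (mapv deref jobs)))
            (partition-all block-size urls))))
```

Passing `slurp` as `fetch` would download each URL’s body as a string, for example `(fetch-in-blocks slurp 4 urls)`. Results come back in the same order as the input URLs. Because futures run on a background thread pool, a standalone script may need `(shutdown-agents)` before it can exit promptly.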

Aggregating Semantic Web Data

This is a recipe that I wrote for the Clojure Data Analysis Cookbook. However, it didn’t make it into the final book, so I’m sharing it with you today. If you like this, check out the book.

One of the benefits of linked data is that it is linked. Data in one place points to data in another place, and the two integrate easily. However, although the links are explicit, we still have to bring the data together manually. Let’s see how to do that with Clojure.

Getting ready

We’ll first need to list the dependencies that we’ll need in our Leiningen project.clj file.

Clojure Data Analysis Cookbook

I’m pleased to announce the release of the Clojure Data Analysis Cookbook, written by me and published by Packt Publishing.

This book has practical recipes for every stage of the data analysis process:

  • acquiring data,
  • cleaning it,
  • analyzing it,
  • displaying and graphing it, and
  • publishing it on the web.

There’s also a chapter on statistics and one on machine learning, as well as the obligatory (for Clojure, anyway) chapters on parallelism and concurrency.

From the book’s blurb: