Open Refine Tutorial
What is Open Refine?
Open Refine is a free and open source tool that cleans, transforms, and reconciles data. In this digital age, more data is available online for collaboration and analysis. Unfortunately, it is often not in a form that easily facilitates computation. Rather than cleaning data manually in Excel or Open Office, Open Refine allows users to perform operations across any size data set with a few commands. It can perform the simple deletion of replications to the more complicated action of reconciling data with online databases.
How does it work?
Open Refine is downloaded from the internet but works as a desktop application. It provides a browser interface in which to work but does not require data to be uploaded to a server. Once installed, the user can import local data sets in various formats, online data sets, or data sets provided by Google. Open Refine records every edit and transformation made to each cell. The user can export the final product in various formats, as well as download the revision history.
This tutorial will use a messy data set on eighteenth-century midwifery at George Washington’s Mount Vernon. It is not complete, but the data will demonstrate the importance of clean data in building one’s own data sets and in cleaning data sets made by others.
Installation and Data Importation
Download Open Refine. After installing, the tool will open in a browser. You can create, open, or import a project. Using the tool for the first time, you will create a new project. Click Create Project, then select the mid_practice.csv file from your local computer and import it.
The data should appear in the data preview mode. Below the data, you can select data viewing and format options. For this tutorial, every option should be correct. It is a data set with commas separating the values. You can name the project whatever you like then click Create Project.
First, you should know that Open Refine uses Google Refine Expression Language, or GREL. It offers a command line interface in addition to simple clicks of the mouse for data cleaning. This tutorial will examine the basics of both methods. It is helpful to explore the data and examine its variables: values, rows, cells, and records. After exploring the data, you can begin to clean it.
Extra Spaces- Open Refine allows you to easily find and clean cell values with extra spaces. You simply click the column name, then Edit cells, Transform cells, and Common Transforms. You can choose from a variety of commands, but for this example we will trim leading and tailing whitespaces for each column.
Editing one cell at a time- You can change text within one cell. For example, if you want to change some of the cells in the “Payment” column, simply click Edit in the individual cell and type in the new value. You can also apply this change to all matching cells. What happens if you have a column with a lot of different cell values and variations?
Faceting- By faceting, you can filter and count the number of differing cells. This function is similar to the filter function in R. These facets appear on the left side of the data with the count of each facet. You can change this value and the count will change appropriately. For example, if we apply a text facet to the “Midwife” column, Open Refine will put every value variation into its own group and provide the count of each. Click on the column name, then Facet, then Text Facet. If we know that “Mrs. D Parker” is the same as “Darcus Parker”, then we can easily standardize this value. Faceting also works with numerical values.
Clustering- The facet feature also provides clusters of the different facets. If two or more values are grouped together, you can merge them with the new value in one click. Click Cluster in the Facet box and a new window will appear. You can choose which method to cluster by, but the default is key collision and fingerprinting. Ngram also works well. You choose which groups to merge, define the new cell value, and click Merge Selected and Recluster.
Another way to clean the data is through transformations. This key feature in Open Refine uses the command line interface. You can click in any column on the Edit cell button then Transform cell to open the command window.
Value.replace- This function transforms any cell value into another value. For example, some name variations did not appear as clusters when faceted. If we want to replace “Mrs. Brasenton” with the standardized “Mrs. Brasington” in the Midwife column, we can use:
value.replace(“Mrs. Brasington”, “Mrs. Brasenton”)
The value you want to replace comes first, followed by the string you want to replace it with.
Value.toDate- Let’s say we want to reformat the “Date paid” column from “mm/dd/yyyy” to “yyyy-mm-dd”. The command would look like this:
This function takes the date format entered and transforms it into the specified date string.
Value.split(” “).reverse().join(“, “)- This function incorporates three commands: split, reverse, and join. It splits strings, then puts them back together in the reverse. If we want to put the Midwife names in as “Last Name, First Name”, then we would use this function.
value.split(” “) Splits the string by its separator
value.reverse() Puts the last name first
join(“, “) Joins the reversed strings back into one column with a comma and a space
Split multi-valued cells- For each record, you can split column cells with multiple values without creating new columns. Open Refine adds rows underneath the original record. You can define what to split by, such as “|” or “,”. If we want to split Betty’s twins in 1778 into two rows, then we can choose Edit Cell then Split multi-valued cells for that column. The program prompts us to define what operator separates the values, “,” in this case. Open Refine then adds the different value as a new row in the Betty record number. Switching from Row view to Record view at the top of the data set shows the multiple rows for that one entry.
This feature allows you to match text to database keys and add new columns, thereby adding to your data. OpenRefine is currently trying to develop a reconciliation service for Historical Newspapers. Data Reconciliation not only enhances your own data, but promotes digital collaboration and sharing of data across scholars and institutions.
Geocoding- Open Refine can geocode street addresses in order to produce maps from data. You can Add Column by Fetching URLs, which allows you to extract out the latitude and longitude coordinates. Open Refine uses the Google Geocoding API. This is also similar to the geocode function in R.
Application to History
OpenRefine is important to historians both using and building datasets on the web, because it easily explores, cleans, transforms, and reconciles data on any scale. Its design is conducive to collaboration on large datasets, just like DAT. Open Refine also speaks to those historians who want to make their research reproducible. The tool has an undo/redo section in which the user can track every change he or she has made to the data. You can then extract the dataset out at each step. The original data is not altered, and you can always go back to it. This feature makes your data cleaning reproducible on many levels. If you have a similar data set and need to perform the same commands to it as a previous set, then you simply apply those changes to the new data set. Open Refine also lets users export an entire project as a .tar.gz file. This means that another scholar can then import it into their own local Open Refine. The exported file contains both the cleaned data and the revision history. The person receiving the project can easily see the previous changes, undo them and make their own, and extract out a whole new data set.