If you’re here because you just want the tool, skip down to the Solution section below.
Background
To grow as a machine learning engineer, I recently took on a volunteer project with Omdena, an international collaborative platform that brings people together from all over the world to work on a specific AI for Good project over 8 weeks. The project I took on is analyzing tweets from Chicago that were posted specifically from areas known to be high in crime and gang behavior. There are many approaches to this problem, and our group has been trying different ones in parallel, but the one that stood out to me is to train a model that classifies tweets as threatening or non-threatening, so that the threatening ones can be routed to intervention specialists who then decide what action to take. This is a supervised learning task, so we needed labeled data, but we were only provided the raw tweets. In a non-hierarchical collaborative environment like Omdena, you simply have to look around for what task needs to be done, and then do it, rather than waiting to be assigned something. Since labeling is not sexy, no one else volunteered to do it, so I took the lead.
Research
We needed a way for many people to be able to quickly label tweets online. Searching the web, we found LightTag, which is a product designed for exactly this. It looked really great and is actually more powerful than what we needed. But when I evaluated it, I found that it took between 5 and 10 seconds to load the next tweet after clicking submit, which was simply not acceptable for us. It would have slowed our labeling down to a crawl. Also, it is a paid product once you exceed the comically low number of free labels, so we would have had to petition them for a free license as a non-profit. We needed a simpler solution that does everything we need, and nothing else. So, I turned to a trusted old friend: Google Spreadsheets.
Solution
I made a custom Google Spreadsheet, and I’ve made the template publicly available here. It features a scoreboard, so labelers get credit for their contribution, and a mechanism to have at least two people label each tweet to ensure the quality of labels. If you are the labeling task organizer, start by going to File -> Make a copy and save the template to your Google Drive. Next, fill in the data in the sheets named Page 1 through 28. Each page supports up to 2,000 tweets. If you are a labeler, put your name on the scoreboard sheet, and then choose a data page to start on. Then, claim one of the three label columns as yours by putting your name in cell A2, B2, or C2 (it must be exactly the same string as in the scoreboard). Hide the other two label columns so that other people’s labels don’t influence yours: right-click on the column header and then click “Hide column.” Then start labeling each row. As a group, you must decide on the labeling convention. For example, you can use 0 and 1 for whether a tweet matches your criteria, 1 to N for one of N categories, or string labels. The choice is yours; just be consistent.
To ensure the quality of our labels, we decided we needed at least two labels on every tweet, and if they are not the same, a third label is required to break the tie. Row color-coding makes it easy to see which rows are finished. If the row has been labeled once, it is colored green. If the row has been labeled twice and the two labels do not agree, it is colored red, which means a third label is needed to break the tie. If the row has three labels or two agreeing labels, it is colored blue, which means that the row is done. The scoreboard page also shows, for each page, a count of how many tweets are labeled once, labeled twice with conflicting labels, and finished.
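To make the color rules concrete, here is a minimal Python sketch of the same logic. It is not how the spreadsheet itself implements the coloring (that is done inside the sheet); the function names `row_status` and `page_summary` and the status strings are made up for illustration, and it assumes each row is given as its three label cells, with empty cells as None or an empty string.

```python
from collections import Counter


def row_status(labels):
    """Classify one row from its three label cells.

    Empty cells may be None or blank strings. Labels are compared as
    stripped strings, so 1 and "1" count as the same label (an assumption).
    """
    filled = [str(x).strip() for x in labels
              if x is not None and str(x).strip() != ""]
    if not filled:
        return "unlabeled"
    if len(filled) == 1:
        return "labeled_once"  # green in the sheet
    if len(filled) == 2 and filled[0] != filled[1]:
        return "conflict"      # red: a third label is needed
    return "done"              # blue: two agreeing labels, or three labels


def page_summary(rows):
    """Count rows per status, similar to the per-page scoreboard counts."""
    return Counter(row_status(row) for row in rows)


rows = [
    [1, None, None],   # labeled once
    [1, 0, ""],        # two conflicting labels
    [1, 1, None],      # two agreeing labels
    [0, 1, 1],         # three labels
    [None, None, None],
]
summary = page_summary(rows)
print(summary["labeled_once"], summary["conflict"], summary["done"])
```

Note that a label of 0 counts as a filled cell here, which matters if you use the 0 and 1 convention.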
To motivate your labelers, you can choose whatever form of reward you like. Never underestimate the power of social recognition. I found that simply announcing the top three labelers during our weekly status update calls was enough to motivate people to spend hours on this otherwise tedious task.
Once finished, you can export the data to xlsx format (File -> Download -> Microsoft Excel (.xlsx)), and then you can import it using this script. Alternatively, you can pull data directly from Google Spreadsheets using this script.
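Whichever way you load the data, you still need to turn the two or three labels on each row into one final label. The following is a rough sketch of that step, not the linked scripts themselves. The function name `final_label` is hypothetical, and it assumes each row arrives as its three label cells with empty cells as None or an empty string. It also assumes that a row with three different labels (possible when you use more than two categories) should be left unresolved for manual review.

```python
from collections import Counter


def final_label(labels):
    """Return the agreed label for a row, or None if there is no agreement.

    Labels are normalized to stripped strings before comparison, so the
    returned label is always a string (an assumption of this sketch).
    """
    filled = [str(x).strip() for x in labels
              if x is not None and str(x).strip() != ""]
    if len(filled) < 2:
        return None  # not enough labels yet
    label, count = Counter(filled).most_common(1)[0]
    # A label is final only when at least two labelers chose it.
    return label if count >= 2 else None


rows = [
    ["1", "1", ""],     # two agreeing labels
    ["0", "1", "1"],    # tie broken by the third label
    ["0", "1", None],   # conflict, still waiting for a tie-breaker
    ["a", "b", "c"],    # three-way disagreement
]
resolved = [final_label(row) for row in rows]
print(resolved)
```

Rows that come back as None can be filtered out before training, or sent back for another round of labeling.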
I hope this helps someone out there save time in labeling text data, so you can spend more time doing interesting things such as modeling. If you use this, please leave a comment to tell me what you’re using it for or email me at [email protected]. I would love to hear from you.