Leveraging Machine-learning and Crowdsourcing to Process Text Messages in the World’s Less-resourced Languages
Seminar: Research Exchange | October 20 | 12-1 p.m. | Sutardja Dai Hall, Banatao Auditorium, 3rd floor
Robert Munro, Graduate Fellow in Linguistics, Stanford University
Live broadcast at mms://media.citris.berkeley.edu/webcast; Questions can be sent via Yahoo IM to username: citrisevents. The schedule for the fall Research Exchange is at http://www.citris-uc.org/events/RE-fall2010.
Text-messaging has quickly become the dominant form of remote communication in much of the world, surpassing email, phone calls and even grid electricity. This has led social development and crisis response organizations to leverage mobile technologies to support health, banking, access to market information, literacy and emergency response. The need to process this information, either manually or automatically, is growing, as is the number of languages that people are texting in. Right now, you could find speakers of about 5000 languages at the other end of your phone, but for the majority of these languages the world’s entire electronic resources consist of only a handful of academic papers. In this presentation I will talk about two recent projects that required processing large volumes of communications in less-resourced languages, using crowdsourcing and machine-learning technologies.
In the wake of the January 12 earthquake in Haiti, a small group of us quickly established a text-message-based emergency response and reporting service. With messages in Haitian Kreyol, I worked with more than 1000 crowdsourced volunteers, coordinating the translation, geolocation and categorization of tens of thousands of incoming emergency messages that were then streamed back to the predominantly English-speaking emergency responders. I will talk about the benefits of collaborative crowdsourcing as a means to apply ‘local’ knowledge from anywhere in the world.
In the second project, we are partnering with a medical clinic in Malawi that uses text-messaging in the Chichewa language to communicate with remote health workers, supporting a patient population of about 250,000. Here, we are using machine-learning and natural language processing methods that learn to categorize incoming text messages according to topic (patient-related vs. administrative, specific diseases, etc.). The goal is to aid the clinic’s workflow practices and help identify potential outbreaks early on. I will talk about the necessity of modeling spelling and suffixing/prefixing variation, and how machine-learning methods can automatically adapt to these variations using language-independent methods, indicating a wide deployment potential.
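As a rough illustration of how a language-independent classifier can tolerate spelling and affix variation, the sketch below represents each message by its character n-grams and assigns it to the closest category. This is not the method presented in the talk: the nearest-centroid classifier, the function names (`char_ngrams`, `train`, `classify`) and the tiny English training set are all assumptions invented for the example.

```python
from collections import Counter
import math


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams per word, with < and > marking word boundaries."""
    grams = Counter()
    for word in text.lower().split():
        padded = f"<{word}>"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams[padded[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_messages):
    """Sum the n-gram counts of all messages sharing a label (one centroid each)."""
    centroids = {}
    for text, label in labeled_messages:
        centroids.setdefault(label, Counter()).update(char_ngrams(text))
    return centroids


def classify(text, centroids):
    """Return the label whose centroid is most similar to the message."""
    grams = char_ngrams(text)
    return max(centroids, key=lambda label: cosine(grams, centroids[label]))


# Hypothetical toy data in English; real messages would be in Chichewa.
TRAIN = [
    ("patient has fever and coughing", "patient"),
    ("child patient vomiting since morning", "patient"),
    ("need more forms and airtime", "admin"),
    ("meeting moved to friday at clinic", "admin"),
]

centroids = train(TRAIN)
# A misspelled word and unseen suffixes still share many n-grams with training data.
print(classify("pacient with feverish cough", centroids))
```

Because the features are substrings rather than whole words, "pacient" and "feverish" still overlap heavily with "patient" and "fever", and nothing in the code depends on a dictionary or tokenizer for a particular language.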
Available Now: Munro’s talk at CITRIS