Harnessing Machine Learning to Power Data Platform
Duke students sponsored by the Energy Access Project and working under the Duke Data+ program partnered with Power for All, a leading energy access advocate and information outlet, to explore how machine learning and natural language processing methods can be applied to improve Power for All’s Platform for Energy Access Knowledge (PEAK).
PEAK is an interactive information exchange platform designed to help aggregate and repackage the best research and information into compelling data-driven stories to bring energy access to all. The team developed tools to curate, organize, and streamline large bodies of data into digestible and sharable knowledge that will improve the platform and better enable it to meet its goals of educating policymakers and bringing data to researchers. While organizations have produced a wealth of energy access data, much of it, along with the documents it is embedded in, is difficult to access, download, or put into appropriate context. This can also mean wasted effort for researchers: for instance, a 2014 World Bank review found that nearly one-third of the Bank’s policy papers had never been downloaded.
This project explores how machine learning and natural language processing tools can improve Power for All’s Platform for Energy Access Knowledge (PEAK), which automatically curates, organizes, and streamlines large, growing bodies of data into digestible and sharable knowledge to better inform policymakers and researchers alike. The team developed three techniques to expand the analytical abilities of the PEAK database: automatic identification of documents related to energy access, automated extraction of tabular data from PDFs into machine-readable tables, and natural language processing tools that build a dictionary and identify document keywords from that dictionary. Some of these tools may prove to have wide utility outside the energy access space as well.
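To make the third technique more concrete, the following is a minimal sketch of dictionary-based keyword identification. It is an illustration under stated assumptions, not the team's implementation: the function name `find_keywords`, the hand-written term list, and the sample sentence are all hypothetical, and the sketch covers only the matching step, not how the dictionary itself was built.

```python
import re
from collections import Counter


def find_keywords(text, dictionary):
    """Count case-insensitive, whole-phrase matches of each dictionary term in text."""
    counts = Counter()
    for term in dictionary:
        # \b anchors the match at word boundaries, so a term such as "solar"
        # would not match inside "solarium"; re.escape guards special characters.
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[term] = hits
    # Most frequent terms first; ties keep dictionary order.
    return counts.most_common()


# Hypothetical dictionary and document text, for illustration only.
ENERGY_TERMS = ["mini-grid", "solar home system", "electrification", "kerosene"]
sample = (
    "Mini-grid operators and solar home system vendors drive rural "
    "electrification. Mini-grid tariffs vary by region."
)

keywords = find_keywords(sample, ENERGY_TERMS)
for term, count in keywords:
    print(f"{term}: {count}")
```

Terms that never appear in the text are left out of the result, so the output doubles as a short keyword list for the document.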
The success of this project demonstrates the power of leveraging open-source technologies with novel algorithms to solve difficult data problems. Automated document processing is immensely useful for handling the overwhelming amount of data available today, and algorithms developed by the team provide a way to extract this previously untapped data.
We achieved an overall accuracy of 67%, compared with 35% for an existing tool.
We were able to auto-detect 81% of tables and to extract 82% of those detected, so roughly two-thirds of all tables (0.81 × 0.82 ≈ 0.66) made it through both steps.
Of the 12,860 documents in the training set, only 1,046, or 8%, were labeled as relevant. Despite this class imbalance, our classifier consistently achieved 98% accuracy, 95% recall, and 95% precision on the test set.
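Because only about 8% of documents are relevant, accuracy by itself says little: a classifier that labels every document as not relevant would already be right about 92% of the time while finding nothing. The sketch below illustrates this. It is not the team's evaluation code: the function name `classification_metrics` is introduced here for illustration, the baseline uses the training-set class balance reported above, and the second set of confusion-matrix counts is hypothetical, not the project's actual test results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted or nothing is relevant.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Class balance reported for the training set.
TOTAL_DOCS = 12860
RELEVANT_DOCS = 1046

# Baseline: label every document "not relevant".
baseline = classification_metrics(
    tp=0, fp=0, fn=RELEVANT_DOCS, tn=TOTAL_DOCS - RELEVANT_DOCS
)

# Hypothetical counts for a 1,000-document test set with 8% relevant documents.
# These are illustrative only and are not the project's results.
example = classification_metrics(tp=76, fp=4, fn=4, tn=916)

for name, metrics in (("baseline", baseline), ("example", example)):
    print(
        f"{name}: accuracy={metrics['accuracy']:.3f} "
        f"precision={metrics['precision']:.3f} recall={metrics['recall']:.3f}"
    )
```

In the baseline, accuracy is high but recall is zero, which is why precision and recall are the more informative figures when the relevant class is rare.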