Harnessing Machine Learning to Power Data Platform
Duke students sponsored by the Energy Access Project and working under the Duke Data+ program partnered with Power for All, a leading energy access advocate and information outlet, to explore how machine learning and natural language processing methods can be applied to improve Power for All’s Platform for Energy Access Knowledge (PEAK).
PEAK automatically curates, organizes, and streamlines large, growing bodies of data into digestible and sharable knowledge to better inform policymakers and researchers alike. The team developed three techniques to expand the analytical abilities of the PEAK database: automatic identification of documents related to energy access, automated extraction of tabular data from PDFs into machine-readable tables, and natural language processing tools that build a dictionary and identify document keywords from that dictionary. Some of these tools may also prove useful outside the energy access space.
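To make the third technique concrete, here is a minimal sketch of dictionary-based keyword identification. It is an illustration only: the summary does not describe how the team built or matched its dictionary, so the function name `find_keywords`, the sample dictionary, and the simple count-based ranking are all assumptions.

```python
import re
from collections import Counter


def _normalize(text):
    """Lowercase the text and reduce it to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def find_keywords(text, dictionary, top_n=5):
    """Return up to top_n dictionary terms found in text, most frequent first.

    Hypothetical helper: matching is exact (after normalization) and ranking
    is by raw occurrence count, which is an assumption made for illustration.
    """
    normalized_text = _normalize(text)
    counts = Counter()
    for term in dictionary:
        normalized_term = _normalize(term)
        if not normalized_term:
            continue
        # Word boundaries keep "grid" from matching inside a longer word.
        pattern = r"\b" + re.escape(normalized_term) + r"\b"
        hits = len(re.findall(pattern, normalized_text))
        if hits:
            counts[term] = hits
    return [term for term, _ in counts.most_common(top_n)]


# Illustrative inputs, not taken from PEAK.
sample_dictionary = ["mini-grid", "solar", "energy access", "tariff"]
sample_text = "Mini-grid developers say solar mini-grid projects expand energy access."
keywords = find_keywords(sample_text, sample_dictionary)
```

Normalizing both the document and the dictionary terms the same way lets multi-word and hyphenated terms such as "mini-grid" match regardless of case or punctuation. A production system would likely add stemming or weighting, which this sketch leaves out.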
The success of this project demonstrates the power of combining open-source technologies with novel algorithms to solve difficult data problems. Automated document processing is immensely useful for handling the overwhelming amount of data available today, and the algorithms developed by the team provide a way to extract data that was previously locked inside documents.
The table-extraction tool automatically detected 81% of tables and successfully extracted 82% of the tables it detected.
Of the 12,860 documents in the training set, only 1,046, or 8%, were labeled as relevant. Despite this class imbalance, our classifier consistently achieved an accuracy of 98%, a recall of 95%, and a precision of 95% on the test set.
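A short worked calculation shows why accuracy alone would be a weak yardstick on data this imbalanced. The sketch below uses only the two counts quoted above to score a hypothetical baseline that labels every document as irrelevant. The function name and the baseline itself are illustrative assumptions, not the team's code or results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    Precision and recall are reported as 0.0 when their denominators are zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Counts quoted in the text for the training set.
TOTAL_DOCS = 12_860
RELEVANT_DOCS = 1_046

# Hypothetical baseline: predict "irrelevant" for every document, so every
# relevant document becomes a false negative and nothing is ever flagged.
baseline_accuracy, baseline_precision, baseline_recall = classification_metrics(
    tp=0,
    fp=0,
    fn=RELEVANT_DOCS,
    tn=TOTAL_DOCS - RELEVANT_DOCS,
)

print(f"baseline accuracy:  {baseline_accuracy:.1%}")
print(f"baseline precision: {baseline_precision:.1%}")
print(f"baseline recall:    {baseline_recall:.1%}")
```

On data with this class balance, a classifier that never flags a document would already score about 92% accuracy while finding nothing. That is why the 95% recall and 95% precision figures say more about the classifier than the 98% accuracy does, assuming the test set has a similar share of relevant documents.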