Harnessing Machine Learning to Power Data Platform

Building the intelligence behind Power for All's PEAK tool

Duke students sponsored by the Energy Access Project and working under the Duke Data+ program partnered with Power for All, a leading energy access advocate and information outlet, to explore how machine learning and natural language processing methods can be applied to improve Power for All’s Platform for Energy Access Knowledge (PEAK).

PEAK is an interactive information exchange platform designed to help aggregate and repackage the best research and information into compelling data-driven stories to bring energy access to all. The team developed tools to curate, organize, and streamline large bodies of data into digestible and sharable knowledge that will improve the platform and better enable it to meet its goals of educating policymakers and bringing data to researchers. While organizations have produced a wealth of energy access data, much of it, along with the documents it is embedded in, is difficult to access, download, or put into appropriate context. This can lead to wasted research effort: for instance, a 2014 World Bank review found that nearly one-third of the Bank’s policy papers had never been downloaded.

To expand the analytical abilities of the PEAK database, the team developed three techniques: automatic identification of documents related to energy access, automated extraction of tabular data from PDFs into machine-readable tables, and natural language processing tools that build a dictionary and identify document keywords from that dictionary. Some of these tools may prove useful outside the energy access space as well.
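To make the third technique concrete, here is a minimal sketch of dictionary-based keyword identification. It is an illustration only, not the team's implementation: the term list `ENERGY_TERMS`, the function `find_keywords`, and the sample text are all hypothetical, and the matching rule (case-insensitive whole phrases with an optional plural "s") is an assumption chosen for simplicity.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; the actual PEAK dictionary is not given in the source.
ENERGY_TERMS = ["mini-grid", "solar home system", "electrification", "off-grid"]


def find_keywords(text, terms):
    """Count case-insensitive whole-phrase matches of each dictionary term."""
    counts = Counter()
    lowered = text.lower()
    for term in terms:
        # The term must not be flanked by word characters; a trailing
        # plural "s" is tolerated (a simplifying assumption).
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"s?(?!\w)"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[term] = hits
    return counts


sample = (
    "Off-grid solar home systems and mini-grid projects drive rural "
    "electrification. Mini-grid costs are falling."
)
keywords = find_keywords(sample, ENERGY_TERMS)

# Rank the document's keywords by how often they occur.
for term, hits in keywords.most_common():
    print(term, hits)
```

A production tool would likely add stemming or lemmatization and handle spelling variants such as "minigrid", but the core idea of matching document text against a curated term list is the same.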

The success of this project demonstrates the power of leveraging open-source technologies with novel algorithms to solve difficult data problems. Automated document processing is immensely useful for handling the overwhelming amount of data available today, and algorithms developed by the team provide a way to extract this previously untapped data.

Duke Faculty: Rob Fetter, Jonathan Phillips

Duke University Students: Brooke Erickson, Alejandro Ortega, Jade Wu

Non-Duke Faculty/Staff: Rebekah Shirley, Scott Barnard, Wayne de Jager

We achieved an overall accuracy level of 67% as compared to 35% with an existing tool.

We were able to auto-detect 81% of tables and extract 82% of detected tables.
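Because the 82% extraction rate applies only to tables that were detected, the two figures can be combined into an end-to-end rate. The short calculation below is a sketch under that reading; treating detection and extraction as two sequential stages is an interpretation of the sentence above, not a figure reported by the team.

```python
# Figures taken from the text above.
detected_share = 0.81   # share of all tables that were auto-detected
extracted_share = 0.82  # share of detected tables that were extracted

# Share of all tables that are both detected and extracted,
# assuming the two stages run in sequence.
end_to_end = detected_share * extracted_share
print(f"{end_to_end:.1%}")
```

Under this assumption, roughly two out of every three tables in the source PDFs would end up as machine-readable tables.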

Of the 12,860 documents in the training set, only 1,046 documents, or 8%, were labeled as relevant. However, our classifier consistently achieved an accuracy of 98%, recall of 95% and precision of 95% on the test set.
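The class imbalance explains why precision and recall are reported alongside accuracy. The sketch below computes the three metrics from confusion-matrix counts and shows that a trivial classifier labeling every document "not relevant" would already score about 92% accuracy on data with the training set's proportions, while finding no relevant documents at all. The function name and the second set of counts are hypothetical, chosen only to illustrate 95% precision and recall; the team's actual test-set counts are not given in the source.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Class counts from the text above.
total_docs = 12_860
relevant_docs = 1_046

# Baseline: label every document "not relevant".
baseline = classification_metrics(
    tp=0, fp=0, fn=relevant_docs, tn=total_docs - relevant_docs
)
print({name: round(value, 3) for name, value in baseline.items()})

# Hypothetical counts for a 1,000-document test set with 80 relevant documents.
example = classification_metrics(tp=76, fp=4, fn=4, tn=916)
print({name: round(value, 3) for name, value in example.items()})
```

The baseline's high accuracy and zero recall show why accuracy alone says little when only about 8% of documents are relevant.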

Over the summer, as part of Duke University’s Data+ program, Duke student teams deployed cutting-edge data analysis techniques to aid the search for solutions to the global challenge of energy access. Guided by Duke faculty, students learn how to marshal, analyze, and visualize data while gaining broad exposure to the modern world of data science. Both teams’ research efforts contribute to the goals of Duke’s Energy Access Project, a new research and policy effort that aims to address the challenges of increasing access to modern energy solutions for underserved populations around the world.
