International development workers have a wide range of skills and interests. We believe they should not also have to become experts in ETL and data science to reap the benefits of the universe of available aid datasets. IATI datasets in particular contain a great deal of valuable program information, and our aim is to present and characterize the data in ways that facilitate its use in real-world programs. With this goal in mind, we've structured our implementation for clarity and accessibility for the end user, without duplicating the work of other available tools in the sector.
Provides scalable and flexible storage for unstructured data
Graph-specific data store for easy analysis and display of the organization network. Used during EDA but not in the final implementation.
For data cleansing and implementation of data quality scoring algorithm
Powers nearest-neighbor analysis and text similarity elements of quality scoring process
Virtual machines for quick prototyping and production hosting
Lightweight web framework to support API access to back-end data
Front-end framework for agile site implementation
Enables flexible and customized interactive visualizations, especially on quality dashboard
Our first step is to compile copies of all the raw XML files generated by all organizations reporting data about their programs to the IATI standard. These datasets are not hosted in one central location, but rather individually by each organization. Using pointers available from the IATI website, we crawl the list of reporting organizations' sites, download their XML datasets, and save the raw files to S3 buckets. IATI data is available at the organization level, activity (project) level, and metadata level. Since we aren't sure what will be most useful at this stage, we download everything.
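The harvesting step can be sketched as a mapping from registry pointers to S3 object keys. The listing structure, field names, and bucket layout below are illustrative assumptions, not the exact IATI Registry response format or our production naming scheme:

```python
def build_download_plan(registry_listing, bucket_prefix="raw-xml"):
    """Map each organization's dataset pointers to (download URL, S3 key) pairs.

    `registry_listing` is a hypothetical simplification of the registry's
    per-organization dataset pointers.
    """
    plan = []
    for org in registry_listing:
        for dataset in org["datasets"]:
            # Keep organization-, activity-, and metadata-level files alike,
            # since we download everything at this stage.
            key = "{}/{}/{}.xml".format(bucket_prefix, org["id"], dataset["name"])
            plan.append((dataset["url"], key))
    return plan

listing = [
    {"id": "org-a", "datasets": [
        {"name": "activities", "url": "http://example.org/a/activities.xml"},
        {"name": "org-file", "url": "http://example.org/a/org.xml"},
    ]},
]
print(build_download_plan(listing))
```

In production each planned pair would be fetched and written to its bucket key; keeping the plan-building step pure makes it easy to inspect before kicking off a long crawl.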
To make the raw data easier to work with, we convert the raw XML into JSON format and load the documents into MongoDB collections - one for organizations and one for activities. NoSQL data storage works especially well for this content, because it can support the significant variations in structure that are present in the IATI data. As part of this process, we identify any records that fail basic validations, such as producing invalid JSON or containing no data at all. We also add indices and apply formatting to a few fields to make future queries easier.
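A minimal sketch of the conversion-and-validation step, assuming a simple recursive element-to-dict mapping (the actual field handling and validation rules are more involved than shown here):

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an XML element into a JSON-friendly dict."""
    node = dict(elem.attrib)
    children = list(elem)
    if children:
        for child in children:
            # Repeated tags (common in IATI) collect into lists.
            node.setdefault(child.tag, []).append(element_to_dict(child))
    elif elem.text and elem.text.strip():
        node["text"] = elem.text.strip()
    return node

def convert_record(raw_xml):
    """Return a JSON-ready document, or None for records failing basic checks."""
    try:
        root = ET.fromstring(raw_xml)
    except ET.ParseError:
        return None  # malformed XML: flag for the invalid-records list
    doc = element_to_dict(root)
    if not doc:
        return None  # record contains no attributes, children, or text at all
    try:
        json.dumps(doc)  # confirm the document serializes as valid JSON
    except (TypeError, ValueError):
        return None
    return doc

sample = "<iati-activity><title>Water project</title></iati-activity>"
print(convert_record(sample))
```

Valid documents would then be inserted into the appropriate MongoDB collection, while the `None` results feed the list of records failing validation.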
Our initial EDA involves the calculation of simple summary statistics about activities. At this stage, we also know we're going to need a way of calculating similarity between organizations. With that in mind, we extract the text of each activity and build a simple bag-of-words model that enables us to calculate distance between documents using cosine similarity. In addition, we make a significant time investment in generating clean and deduplicated organization identifiers, since these will be critical for building the organization graph and creating a clean user experience in our final product.
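The bag-of-words similarity calculation reduces to counting tokens and taking the cosine of the resulting count vectors. A self-contained sketch (our real pipeline tokenizes and normalizes more carefully than a bare `split()`):

```python
import math
from collections import Counter

def bag_of_words(text):
    """Naive tokenization into a term-count vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

d1 = bag_of_words("clean water supply project")
d2 = bag_of_words("water supply for rural schools")
print(round(cosine_similarity(d1, d2), 3))  # shares 2 of the terms
```

Because cosine similarity depends only on vector direction, long and short activity descriptions remain comparable, which matters given how much document length varies across IATI reporters.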
Now that we have a more detailed understanding of the contents of our dataset, we can generate both data quality and organization graph features. Quality features focus on the completeness of the data, the size balance between the different types of data an organization reports, and the extent to which the digits included in the data adhere to Benford's Law. For the graph, we create edges where one organization's reported data mentions another, using the cleaned organization identifiers created during our EDA. Given the unique structure of graph data, we initially store our graph results in Neo4J instead of the MongoDB collections that contain the rest of the data. Later, we store graph data in memory for improved site responsiveness.
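The Benford's Law feature can be illustrated by comparing observed leading-digit frequencies against the expected distribution P(d) = log10(1 + 1/d). This is a simplified sketch; the deviation measure and digit extraction in our actual feature set may differ:

```python
import math
from collections import Counter

def benford_deviation(numbers):
    """Mean absolute deviation of observed leading-digit frequencies
    from the Benford's Law distribution P(d) = log10(1 + 1/d).
    Returns None when there are no usable digits."""
    digits = [int(str(abs(n)).lstrip("0.")[0]) for n in numbers if n]
    if not digits:
        return None
    counts = Counter(digits)
    total = len(digits)
    deviation = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts.get(d, 0) / total
        deviation += abs(observed - expected)
    return deviation / 9

# Financial figures drawn from real activity suggest a low deviation;
# suspiciously uniform or repeated digits push the deviation up.
print(benford_deviation([1200, 4500, 130, 2750, 980, 1100]))
```

A high deviation on an organization's reported financial figures flags data that may be estimated, rounded, or fabricated rather than drawn from real transactions.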
With features in place, we can now build models for our data quality. Scores between 0 and 100 are first calculated for completeness, compliance, and utility. These three subscores are then combined by calculating the Euclidean distance from a perfect overall score of 100. We've chosen this method of subscore aggregation because a similar approach is used by the popular nonprofit rating site Charity Navigator; aid professionals' familiarity with frameworks like this maximizes our product's usability among our target audience. The final 0-100 score is converted to a letter grade using the same cutoffs typically used in academia, again for ease of use. Read more about our scoring algorithm.
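The aggregation step can be sketched as follows. The text specifies the Euclidean-distance approach; the rescaling constant and the 90/80/70/60 grade cutoffs below are assumptions made to produce a runnable example:

```python
import math

def overall_score(completeness, compliance, utility):
    """Combine three 0-100 subscores via Euclidean distance from the
    perfect point (100, 100, 100), rescaled so (100, 100, 100) -> 100
    and (0, 0, 0) -> 0. The rescaling is an assumption; only the
    distance approach comes from our methodology description."""
    distance = math.sqrt(sum((100 - s) ** 2
                             for s in (completeness, compliance, utility)))
    return 100 * (1 - distance / (100 * math.sqrt(3)))

def letter_grade(score):
    """Academic-style cutoffs (assumed to be 90/80/70/60)."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return grade
    return "F"

score = overall_score(95, 90, 85)
print(round(score, 1), letter_grade(score))
```

One property of the distance-based combination is that a single very weak subscore drags the overall grade down more than a simple average would, which rewards organizations for being consistently good across all three dimensions.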
As we add new features and debug old ones, we consistently revisit our scoring algorithm to ensure that it stays in balance. For example, we need to reconsider any change that adds significant skewness to the score distribution. During the optimization stage, we also bring together the quality scoring algorithm and the relationships captured in the graph database. When an organization has missing or incomplete data, we examine its neighbors in the graph. If a neighbor meets similarity thresholds, we attempt to impute the missing data where it exists. This approach is similar to a weighted-vote relational neighbor classifier.
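The neighbor-based imputation can be sketched as a similarity-weighted vote. The data shapes (`data`, `graph`, `similarity`) and the threshold value are illustrative assumptions, not our exact internal structures:

```python
def impute_from_neighbors(org_id, field, data, graph, similarity, threshold=0.5):
    """Weighted-vote imputation sketch: when `org_id` is missing `field`,
    take a similarity-weighted vote among graph neighbors that have it.

    Assumed shapes: `data` maps org -> record dict, `graph` maps
    org -> neighbor list, `similarity` maps (org, neighbor) -> 0-1 score.
    """
    votes = {}
    for neighbor in graph.get(org_id, []):
        weight = similarity.get((org_id, neighbor), 0.0)
        value = data.get(neighbor, {}).get(field)
        if weight >= threshold and value is not None:
            votes[value] = votes.get(value, 0.0) + weight
    if not votes:
        return None  # no sufficiently similar neighbor reports the field
    return max(votes, key=votes.get)  # value with the largest weighted vote

data = {"org-a": {}, "org-b": {"sector": "health"}, "org-c": {"sector": "water"}}
graph = {"org-a": ["org-b", "org-c"]}
sim = {("org-a", "org-b"): 0.9, ("org-a", "org-c"): 0.6}
print(impute_from_neighbors("org-a", "sector", data, graph, sim))  # "health" wins the vote
```

Thresholding before voting keeps weakly connected neighbors from injecting noise: if no neighbor is similar enough, we leave the field missing rather than guess.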
Once all of our underlying models are created, we build a flexible front-end to enable users to interact with them directly. It's critical that our tools are intuitive and easy to use for busy aid practitioners who likely don't have a sophisticated data background or the time to grapple with complex user interfaces. With this in mind, we strive to make the initial presentation of data simple, with the option to explore more deeply via modal popups and tooltips should users want more detail. Throughout this iterative development process, we rely on feedback from real aid professionals. Mark Maxmeister, Chief Innovation Officer at Keystone Accountability, graciously volunteers to serve as our primary UX reviewer and provides valuable input on how to make AidSight as practically useful as possible.
Getting the measures of data quality correct was a major data science challenge for us in this project. While there are a small handful of other IATI-driven sites that briefly mention data quality in passing, it's always an afterthought. As a result, there is no set of precedents or best practices that we can rely on. Rather, we've made our design choices here based on domain expertise and practitioner feedback.
Our approach to operationalizing data quality was to pick a set of starting features, build a simple baseline model to generate distributions, then tune the model iteratively. As we added new features, the underlying distribution would shift, sometimes dramatically, forcing us to "rebalance" our system of calculating scores. Implementing the score creation pipeline as a series of discrete steps packaged in a Jupyter notebook enables rapid iteration and would support the inclusion of additional quality features and metrics in the future.
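One simple form such a rebalancing step can take is a linear rescale of the raw scores back onto the full 0-100 range after a new feature shifts the distribution. This min-max version is only an illustration of the idea; the actual notebook steps may rebalance differently:

```python
def rebalance(raw_scores):
    """Linearly rescale raw scores so they span 0-100 again after a
    new feature shifts the distribution (a simple min-max rebalance;
    an assumption, not necessarily the notebook's exact method)."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        return [100.0 for _ in raw_scores]  # degenerate case: all scores equal
    return [100 * (s - lo) / (hi - lo) for s in raw_scores]

# A distribution compressed into the 40-80 band stretches back to 0-100.
print(rebalance([42, 58, 61, 77]))
```

Because each pipeline step is discrete, a rebalancing function like this can be re-run in isolation whenever a feature change shifts the distribution, without recomputing the upstream features.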
We welcome additional ideas and feedback on our methodology for creating data quality scores and grades. Rather than duplicating the detailed walkthrough of the process in full here, we invite you to explore the score calculation specifics on Github.
This first iteration of AidSight represents a minimum viable product aimed at understanding the validity of the concept, the potential demand among our presumed user base, and the challenges associated with practical uses of aid sector data.
While we've had success in tackling these primary goals, we've prioritized our ability to do efficient EDA and pivot our implementation choices, particularly in the first stages of the data pipeline. As a result, there are plenty of opportunities for additional improvements and updates that would continue to grow the impact that AidSight has in the community. Specific enhancements we are investigating include: