
Google Summer of Code 2022 with RADAR-Base: RADAR-Pipeline

This is a summary of the work I did for the RADAR-Pipeline project under the RADAR-Base organization as part of Google Summer of Code 2022.

1. Table of Contents

2. Project Details

3. Abstract

This project enhances the RADAR-Pipeline by adding support for ingesting data, computing features from that data, and exporting the computed features to a file. A user or researcher can configure all of these steps with YAML in a single config file.
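To make the single-config idea concrete, here is a minimal sketch of how a parsed config might be checked before the pipeline runs. The section names (`input`, `features`, `output`) and every key inside them are assumptions for illustration, not the pipeline's documented schema. The dict stands in for what a YAML parser would return.

```python
from __future__ import annotations

# Hypothetical config, written as the dict a YAML parser would produce.
# All key names here are illustrative assumptions.
EXAMPLE_CONFIG = {
    "input": {"source_path": "mockdata/", "data_format": "csv"},
    "features": [{"name": "mean_value"}],
    "output": {"target_path": "output/", "data_format": "csv"},
}

REQUIRED_SECTIONS = ("input", "features", "output")


def missing_sections(config: dict) -> list[str]:
    """Return the required top-level sections absent from the config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


def validate_config(config: dict) -> dict:
    """Raise a readable error early instead of failing mid-pipeline."""
    missing = missing_sections(config)
    if missing:
        raise ValueError(f"Config is missing sections: {', '.join(missing)}")
    return config
```

Validating up front means a typo in the config is reported before any data is read, rather than after a long ingestion step.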

Data is ingested with Apache Spark through pyspark. The ingested data is represented internally as a DataFrame, which makes it easier to operate on columns when extracting features. The features extracted from this DataFrame are then exported to a file.

4. RADAR-Pipeline

RADAR-Pipeline is an open-source Python package that helps researchers and users working with RADAR data to ingest, analyze, visualize, and export their data, all from a single place. The package is designed to be flexible and extensible. The pipeline aims to:

5. Project Goals

The following are the goals that were set for the project at the beginning of the program. ✅ represents that the goal has been achieved. 🚧 represents that the goal is in progress and 📝 represents that the goal is to be done in the future.

Apart from these goals that were set at the beginning of the project, we also added the following features to the project along the way:

6. Merged Pull Requests

Before official coding period

Before the official coding period, when I was setting up the codebase and installing the dependencies, I had to make a few changes to the codebase. These changes were made in the following pull requests:

1. Change yaml to pyYaml and add pysftp in requirements

This was a minor pull request fixing some of the dependency errors that I was facing while setting up the codebase. It also added the pysftp dependency which was required for reading data from an SFTP server and changed the yaml dependency to pyYaml as the former was deprecated.

2. Some minor corrections and typo fixes

This pull request fixed some typos, refactored parts of the codebase, and improved error handling for some exceptional cases.

During official coding period

The following pull requests were made during the coding period:

1. Add mock data input through pySpark

This pull request added the ability to read mock data, which can be added to the repository as a submodule. The data, which is in the form of gzipped CSV files, is ingested using Spark through pyspark. This pull request also added the ability to read a schema file and use it to read the data much more efficiently. In addition, error handling and logging were improved, the code was refactored to make it more readable, and the instructions for running the pipeline were documented. Code formatting and linting were also applied to keep the codebase consistent.
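Supplying a schema spares Spark the extra pass over the data that type inference would otherwise require. The sketch below assumes the schema file has already been parsed into a name-to-type mapping. The column names, the `to_ddl` and `read_with_schema` helpers, and the path are hypothetical and not taken from the pull request. It relies on `spark.read.csv` accepting a DDL-formatted schema string and reading `.csv.gz` files transparently.

```python
def to_ddl(fields: dict) -> str:
    """Turn {"column": "TYPE"} into a Spark DDL schema string."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields.items())


def read_with_schema(path: str, fields: dict):
    """Read the gzipped CSV files under `path` using an explicit schema."""
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-read-sketch").getOrCreate()
    return spark.read.csv(path, header=True, schema=to_ddl(fields))


# Hypothetical columns for an accelerometer-style data topic.
ACCELERATION_FIELDS = {"user_id": "STRING", "recorded_at": "DOUBLE", "x": "DOUBLE"}
```

A call such as `read_with_schema("mockdata/acceleration", ACCELERATION_FIELDS)` would then return a DataFrame whose columns already carry the declared types.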

2. Ingestion improvements

This pull request changed how data is read. Previously, each filename was passed to the Spark file reader one at a time; now the directory name is passed so that the data is read in a single pass. This improved the speed of data ingestion considerably. In addition, the pipeline now checks for the mock data before running the mock pipeline and downloads it if it is not available, which avoids errors while reading. The codebase was also refactored to remove some datatype inconsistencies and improve readability.
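The difference between the two reading strategies can be sketched as follows. This is an illustration under assumptions rather than the code from the pull request: the function names are hypothetical, `spark` is a `SparkSession` supplied by the caller, and the files are assumed to share the same header.

```python
from functools import reduce
from pathlib import Path


def list_csv_files(directory: str) -> list:
    """Sorted paths of the gzipped CSV files directly inside `directory`."""
    return sorted(str(p) for p in Path(directory).glob("*.csv.gz"))


def read_file_by_file(spark, directory: str):
    """Earlier approach: one Spark read per file, then union the results.

    Fails on an empty directory, because there is nothing to reduce.
    """
    frames = [spark.read.csv(f, header=True) for f in list_csv_files(directory)]
    return reduce(lambda left, right: left.unionByName(right), frames)


def read_directory(spark, directory: str):
    """Later approach: hand Spark the directory and let it plan a single read."""
    return spark.read.csv(directory, header=True)
```

With the directory form, Spark lists the files itself and plans one job over all of them, instead of building and merging a separate DataFrame for every file.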

3. Improve features codebase

This pull request was made in the mock features repository. It removed the dependence of the mock features on the pipeline repository by removing the pipeline as a submodule and adding it as a dependency in the requirements.txt file instead. This pull request also fixed some earlier calculations to improve the feature processing operations.

4. Add Feature and Data Export Modules

This pull request added support for processing features and exporting the computed features from the pipeline to a file. The codebase went through a major refactor in this pull request because of some folder and file reorganization. The code and the documentation were also updated to reflect the changes made so far.
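The feature and export stages can be pictured roughly as follows. This is a simplified sketch, not the pipeline's actual classes: `Feature`, `MeanValue`, and `export_rows` are hypothetical names, and plain lists of dicts stand in for the Spark DataFrame so that the example runs with the standard library alone.

```python
import csv
from abc import ABC, abstractmethod
from collections import defaultdict


class Feature(ABC):
    """Hypothetical base class: one feature computed from ingested rows."""

    name: str

    @abstractmethod
    def calculate(self, rows: list) -> list:
        ...


class MeanValue(Feature):
    """Mean of the `value` column for each `user_id`."""

    name = "mean_value"

    def calculate(self, rows: list) -> list:
        grouped = defaultdict(list)
        for row in rows:
            grouped[row["user_id"]].append(float(row["value"]))
        return [
            {"user_id": user, self.name: sum(vals) / len(vals)}
            for user, vals in sorted(grouped.items())
        ]


def export_rows(rows: list, target_path: str) -> None:
    """Write computed feature rows to a CSV file with a header line."""
    with open(target_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Keeping each feature behind a small common interface is one way to let researchers add their own features without touching the ingestion or export code.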

7. Achievements

The major achievements of this project are:

8. Open Issues

Through the discussions with the mentors during the course of the project, we identified many new areas of improvement for the pipeline. There are also a few bugs that need to be addressed in the pipeline.

The following are the major issues that were identified, remain open, and need further work:

9. Future Work

The project still has multiple open issues, along with enhancements / follow-ups to the work I’ve done. In addition, a couple of issues that I initially planned to work on were pushed back to accommodate other higher-priority issues. Of these, fixing the bug in the schema reading module, adding Docker support for the pipeline, and writing tests for the pipeline are the most important. The other issues also matter and need to be addressed in the future. I will be working on these issues in the coming weeks and would love to see researchers actually use the pipeline to supplement their research work.

10. Acknowledgements

I would like to sincerely thank my mentors, the RADAR-Base organization, and the Google Summer of Code Program for giving me a fun and enriching experience this summer. I am extremely thankful for the opportunity to participate in this program and enhance my programming skills while improving the RADAR-Base project.

I am immensely grateful to Amos Folarin, Heet Sankesara and Shaoxiaong Sun for their constant guidance, support and engagement in discussion to clarify my queries.

The RADAR-Base organization has always been welcoming and helpful since I started interacting with them. Their efforts and dedication towards improving people’s quality of life by leveraging clinical value from wearable sensor data and smartphones provided a very conducive environment to contribute. I am grateful and elated to be a part of it.