Your goal for the UCL Data Science Student Challenge is to hack & develop an innovative data science solution to improve the lives of Londoners while demonstrating the use of Azure Machine Learning, Microsoft’s data science tool.

The presentation from Jacob Spoelstra this morning, including an Azure Machine Learning walkthrough, is available online here.

This Azure Machine Learning tutorial will get you started using the PEACH dataset provided for this hackathon.

You can activate your Azure passes by going to this website – www.microsoftazurepass.com

Open Data

You can use any public datasets for your solution, including:

Special datasets for this weekend

We have made two large datasets available specially for this weekend, so explore these to help you create an amazing solution. Each dataset is stored in Microsoft Azure (in blob storage) and can be accessed via a unique URL.

Project PEACH Datasets

These datasets are provided by Project PEACH (https://code-4-health.org/peach), a large-scale, open-source, community-driven data science project by the Computer Science Department at University College London. They comprise several related datasets covering users’ lifestyle, shopping habits, and social media activity, and they provide a rich basis for exploring Azure Machine Learning experiments and building compelling data science applications.

You can download the entire sample dataset (58MB) for initial development of your solution at – http://uclhack.blob.core.windows.net/peach/PEACHData.zip
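
For a first look at the sample archive outside Azure Machine Learning Studio, a minimal Python sketch along the following lines may help. It uses only the standard library. It assumes the archive contains UTF-8 CSV files, which this page does not confirm, and the helper names (`fetch_zip`, `list_csv_members`, `preview_member`, `main`) are illustrative rather than part of any provided tooling.

```python
import csv
import io
import itertools
import urllib.request
import zipfile

# URL of the sample archive as given on this page.
SAMPLE_URL = "http://uclhack.blob.core.windows.net/peach/PEACHData.zip"


def fetch_zip(url=SAMPLE_URL):
    """Download the archive into memory and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def list_csv_members(zip_bytes):
    """Return the names of all .csv files inside the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [n for n in archive.namelist() if n.lower().endswith(".csv")]


def preview_member(zip_bytes, member, rows=5):
    """Return the first `rows` lines of one CSV member as lists of strings.

    Assumption: the member is UTF-8 encoded, comma-separated text.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            return list(itertools.islice(csv.reader(text), rows))


def main():
    """Download the sample archive and print a short preview of each CSV."""
    data = fetch_zip()
    for name in list_csv_members(data):
        print(name)
        for row in preview_member(data, name):
            print("  ", row)
```

Calling `main()` downloads the archive once and prints the first few rows of each CSV it finds, which should be enough to see the column names before building an experiment.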

The full datasets include the complete data you should use for training your machine learning solution. These can be up to 3GB in size. You should use these URLs to load data into Azure Machine Learning. We do not recommend that you download these to your local machine over WiFi, but you should use them directly in Azure Machine Learning Studio, using the URL provided.

Atmospheric Factors dataset

This dataset records users’ responses (positive or negative) to different atmospheric factors, such as UV radiation, pollen count and air quality.

Food diary

This dataset contains users’ food diary entries, including what they have eaten, and any potential allergens in the food.

Internet of Things wearables data

This dataset includes measurements of users’ body temperature, elevated heart rate and sleep quality.

The full datasets are provided as one file per 3-month period (each file 50MB-500MB). We do not recommend that you download these to your local machine over WiFi, but you should use them directly in Azure Machine Learning Studio, using the URL provided.

Retail shopping data

This data includes what users purchased from different shops, including supermarkets, pharmacies, athletic apparel and outdoor gear. This could be tracked via store loyalty programs or OCR receipt scanning applications.

The full retail shopping datasets are provided as one file per 3-month period (each file 50MB-500MB). We do not recommend that you download these to your local machine over WiFi, but you should use them directly in Azure Machine Learning Studio, using the URL provided.

Social media data

This dataset contains users’ tweets.

User profile data

This data contains the profile of every user in the PEACH dataset, including demographic, health and lifestyle data. It does not change over time.
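
Because the profile data is static, one plausible way to use it is to index it once and attach it to the time-stamped records in the other datasets. The sketch below illustrates that idea in plain Python. The column name `user_id` is hypothetical, as the actual key column in the PEACH files is not stated here, and the function names are illustrative.

```python
import csv
import io


def index_profiles(profile_rows, key="user_id"):
    """Build a lookup from user identifier to that user's profile row.

    `key` is a hypothetical column name; replace it with the real one.
    """
    return {row[key]: row for row in profile_rows}


def attach_profile(records, profiles, key="user_id"):
    """Yield each record merged with the matching static profile.

    Records whose user has no profile are skipped. Where both rows share
    a column name, the value from the record takes precedence.
    """
    for record in records:
        profile = profiles.get(record[key])
        if profile is None:
            continue
        merged = dict(profile)
        merged.update(record)
        yield merged
```

Both functions accept any iterable of dictionaries, so rows from `csv.DictReader` can be passed in directly. Inside Azure Machine Learning Studio, the equivalent step would be a join between the two datasets on the shared key.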

Nuffield Health Datasets

Anonymised datasets around the lifestyle, profile and health of many users are available to you. These rich datasets provide exciting opportunities for you to build data science solutions around health and well-being. If you would like access to these datasets then please email kenji.takeda AT microsoft.com.