[PYTHON] I took Apple Watch data into Google Colaboratory and analyzed it

Introduction

**I want to visualize the accumulated Apple Watch data on Colab with pandas and matplotlib**

I want my daughter to say "Daddy, you're amazing!" Hoping to earn her respect, I gave it my best shot.

The conclusion first: the work itself is not difficult, but the data is heavy. Thanks to a year of continuous recording, the exported XML is **643 MB**! In addition, some of the data does not match the values shown on the iPhone / Apple Watch for some reason. My active energy for one day came out as 8972 kcal. What? I spent that day sitting in a chair and coding... By the way, that is apparently about as many calories as a full-distance triathlon burns (swim 3.8 km + bike 180 km + run 42.195 km).

Whether it is an export error or a mistake in my pandas operations, the mystery only deepens...

Other than that, the data seemed to be usable, so I decided to take out the step count data and try various things.

Export Apple Watch data

In any case, nothing can start until the data is exported. Apple Watch data is stored on the paired iPhone. Data from other devices linked to the "Health" app and logs from third-party apps are merged into the same data set as well.

Now, let's export from the "Health" app on the iPhone. For now, there seems to be no way other than exporting from this app.

Export from the "Health" app

  1. Launch "Health" on the iPhone
  2. Tap your profile icon in the upper right
  3. Select "Export All Health Data"
  4. Follow the instructions to choose an output method and export.


The export took about 10 minutes on the iPhone. The file is zipped, and you can choose a destination such as iCloud or email. In my case the file was fairly large, so I transferred it directly to my MacBook with AirDrop.

Unzipping the archive gives you export.xml. This is the Apple Watch log data. XML is awkward to work with, so I will convert it to CSV.

Convert data to CSV

A handy conversion script, ConvertAppleHealthXMLtoCSV.py, is published on GitHub Gist, so download it: https://gist.github.com/xiantail/12784626d1c82411e0b986f71d1171ee#file-convertapplehealthxmltocsv-py

Some fixes

ConvertAppleHealthXMLtoCSV.py



# Below: comment out lines 33-39
# Some records have no value for the key 'value', so leaving this as is causes an error
            try:
                float(att_values['value'])
            except ValueError:
                #att_values['value_c'] = att_values['value']
                #att_values['value'] = 0
                continue

# Below: line 56
# Change the path to match your environment
if __name__ == '__main__':
    convert_xml_to_csv('export.xml')

A CSV with a timestamped file name is generated, for example export20191021214259.csv. (For brevity, I will call it export.csv below.)
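
Because the export is several hundred megabytes, a streaming parse is one way to keep memory use low during conversion. The following is a minimal sketch and not the Gist's actual code: it assumes that each measurement is a `<Record>` element whose attributes match the CSV column names used in this article, and `records_to_csv` is a hypothetical helper name.

```python
import csv
import xml.etree.ElementTree as ET

# Assumed attribute names of a <Record> element (same as the CSV columns).
FIELDS = ["type", "sourceName", "sourceVersion", "unit",
          "creationDate", "startDate", "endDate", "value", "device"]


def records_to_csv(xml_source, csv_target):
    """Stream <Record> elements from xml_source into csv_target.

    xml_source: file name or binary file object containing the XML.
    csv_target: text file object opened for writing.
    Returns the number of rows written.
    """
    writer = csv.DictWriter(csv_target, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    # iterparse yields each element as soon as its end tag has been read,
    # so the whole document never has to be parsed in one go.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Record":
            # Missing attributes (for example no 'value') become empty cells.
            writer.writerow({key: elem.attrib.get(key, "") for key in FIELDS})
            count += 1
            # Drop the element's attributes and children once written.
            elem.clear()
    return count
```

To use it on a real export, open export.xml in binary mode and the output file in text mode with newline="" and pass both to the function. Rows without a value are kept as empty cells here, so they can still be filtered later in pandas.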

Google Colab

The official name is "Google Colaboratory". It is a service provided by Google; in a word, it is a cloud version of Jupyter Notebook. I will bring the data here and work on it.

There seem to be several ways to get data into Colab, but I did the following:

  1. Upload export.csv to Google Drive
  2. Load Google Drive files from Google Colab

I will omit uploading to Google Drive.

Read files from Google Drive

Load the data uploaded to Google Drive from Colab.

from google.colab import drive
drive.mount('/content/drive')

When you run this, you are prompted to open a link and enter an authorization code. Follow the link, copy the authorization code for your Google account, and enter it. The Google Drive files should then appear under Sidebar > Drive.

As an aside, Google Colab is in English only, but it is a thoughtful service that scratches exactly where it itches. If you are not sure how to do something but suspect a feature for it exists, you can search the code snippets.


Just type "drive" and it shows how to connect to Google Drive. Clicking the "Insert" button pastes the code into the notebook, which is convenient.

Preprocessing

First, import the usual modules.

import pandas as pd
import matplotlib.pyplot as plt

Then load export.csv. To get the path, right-click the file under Sidebar > Drive, which you just mounted. A "Copy path" option appears, so it is convenient to copy and paste from there. If the file is so large that loading sometimes fails, try the low_memory=False option.


df = pd.read_csv('/content/drive/My Drive/ColabNotebooks/export20191021214259.csv', low_memory=False)
df.head(3)
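
If memory becomes a problem, one option is to load only the columns that are used later. This is a minimal sketch that assumes the column names shown in this article; `load_health_csv` is a hypothetical helper, and the sample row is made up.

```python
import io

import pandas as pd

# Columns used later in this article; adjust to your own export.
USE_COLUMNS = ["type", "sourceName", "unit", "creationDate",
               "startDate", "endDate", "value"]


def load_health_csv(path_or_buffer):
    """Read only the needed columns and keep every column as a string."""
    return pd.read_csv(path_or_buffer, usecols=USE_COLUMNS, dtype=str)


# Made-up sample standing in for the real export.csv.
sample = io.StringIO(
    "type,sourceName,sourceVersion,unit,creationDate,startDate,endDate,value,device\n"
    "HKQuantityTypeIdentifierStepCount,Watch,6.0,count,"
    "2019-09-25 10:00:00 +0900,2019-09-25 09:50:00 +0900,"
    "2019-09-25 10:00:00 +0900,120,dev\n"
)
df_sample = load_health_csv(sample)
print(df_sample.columns.tolist())
```

Reading everything as strings postpones type conversion to the cleaning step, where non-numeric values can be handled explicitly.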


The format of the Apple Health data is as follows.

- type: Data category within Apple Health
- sourceName: Data source (in this case, data acquired from the linked app "My Water")
- sourceVersion: Version number of the data source
- unit: Unit
- creationDate: Date and time the data was created
- startDate: Start date and time of the measurement
- endDate: End date and time of the measurement
- value: Value (**this is what I want**)
- device: Acquisition device (sensor device)

Data organization

The data is heavy and hard to read, so I'll tidy up the DataFrame a bit.


# device contains only NaN or Apple Watch, and sourceVersion is not needed, so drop both
df_apple = df.drop(["sourceVersion","device"], axis=1)
df_apple = df_apple.loc[:,['type','sourceName','value','unit', 'creationDate', 'startDate', 'endDate']]

# Set creationDate as the index; it is just a string, so convert it to datetime
df_apple = df_apple.set_index('creationDate')
df_apple.index = pd.to_datetime(df_apple.index, utc=True).tz_convert('Asia/Tokyo')

# Clean up value: drop NaN rows, drop rows whose value starts with a non-digit, convert to float
# (a boolean mask is used so that other rows sharing the same creationDate are not dropped too)
df_apple = df_apple.dropna(subset=['value'])
df_apple = df_apple[~df_apple['value'].str.match('[^0-9]')]
df_apple['value'] = df_apple['value'].astype(float)

# The type names are long, so strip the common prefix
df_apple['type'] = df_apple['type'].str.replace('HKQuantityTypeIdentifier','')

# To sort and analyze by month, day of the week, etc. later, add year, month, day, hour, and weekday to the index
df_apple = df_apple.set_index([df_apple.index.year, df_apple.index.month, df_apple.index.day, df_apple.index.hour, df_apple.index.weekday, df_apple.index])
df_apple.index.names = ['year', 'month', 'day', 'hour', 'weekday', 'date']

df_apple.head()


That is much tidier.
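
The filter above treats any value that begins with a non-digit as non-numeric, which would also discard a value such as "-0.5". Whether such values occur in a given export is not confirmed here. As an alternative, the following minimal sketch uses pd.to_numeric on made-up data.

```python
import pandas as pd

# Made-up rows: a number, a category label, a negative number, a missing value.
raw = pd.DataFrame({
    "type": ["StepCount", "SleepAnalysis", "BodyMass", "StepCount"],
    "value": ["120", "HKCategoryValueSleepAnalysisAsleep", "-0.5", None],
})

# errors="coerce" turns anything that is not a number into NaN,
# so non-numeric strings and missing values can be dropped in one step.
raw["value"] = pd.to_numeric(raw["value"], errors="coerce")
clean = raw.dropna(subset=["value"])
print(clean)
```

This keeps negative and decimal values and yields a float column without a separate astype call.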

df_apple.info()


I removed NaN and so on, but there are still more than 1.53 million rows. The date is now the index and value is a float. Even after converting from XML and tidying up, the data still takes 87 MB or more... It's heavy.
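
Columns such as type and sourceName repeat a small set of strings, so converting them to the pandas category dtype is one way to shrink the DataFrame. This is a minimal sketch on made-up data; the actual savings on a real export depend on its contents.

```python
import pandas as pd

# Made-up data with heavily repeated strings, like type and sourceName.
n = 100_000
df_demo = pd.DataFrame({
    "type": ["StepCount", "HeartRate"] * (n // 2),
    "sourceName": ["Watch"] * n,
    "value": [1.0] * n,
})
before = df_demo.memory_usage(deep=True).sum()

# A category column stores each distinct string once plus a small integer code per row.
for col in ["type", "sourceName"]:
    df_demo[col] = df_demo[col].astype("category")
after = df_demo.memory_usage(deep=True).sum()

print(f"{before:,} bytes -> {after:,} bytes")
```

It is probably simplest to apply such a conversion after the string cleanup, once the set of distinct values is final.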

Take a look at the breakdown

Let's see what the breakdown of the data is.

print(df_apple['type'].drop_duplicates().to_string(index=False, header=False))
print(df_apple['sourceName'].drop_duplicates().to_string(index=False, header=False))


Basically, we narrow down the data by type.

In some cases, one type is recorded by multiple apps and devices, so if you want to look at them individually, it is better to narrow the data down further using sourceName. (Example: you count steps on both your Apple Watch and iPhone, but want to use only the data measured by the Apple Watch.)
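
The breakdown and the narrowing described above can be sketched as follows. The data and the source names "Watch" and "iPhone" are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "type": ["StepCount", "StepCount", "StepCount", "HeartRate"],
    "sourceName": ["Watch", "iPhone", "Watch", "Watch"],
    "value": [100.0, 90.0, 50.0, 72.0],
})

# Number of records for every (type, sourceName) pair.
counts = df_demo.groupby(["type", "sourceName"]).size()
print(counts)

# Keep only the step counts recorded by a single source.
watch_steps = df_demo[(df_demo["type"] == "StepCount")
                      & (df_demo["sourceName"] == "Watch")]
print(watch_steps)
```

Looking at the counts first makes it easier to notice when the same type is being recorded by more than one source.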

Analysis

Up to this point, we have been able to organize the data and check the contents. Let's actually look at the contents of the data.

Number of steps

First, create a DataFrame that contains only the step count data.

# Step count: StepCount
# Some steps were recorded twice through one of the apps, so exclude that source
df_step = df_apple[(df_apple['type'] == 'StepCount') & ~(df_apple['sourceName'] == 'Healthcare')]

# For unknown reasons 2018 has many errors, so focus on 2019 only
# (year is an integer index level, so compare it with the number 2019)
df_step = df_step.query("year == 2019")

Top 10 steps per day

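# Total number of steps per day
# (sum(level=...) was removed in pandas 2.0, so group by the index levels instead)
daily_step = df_step.groupby(level=['year', 'month', 'day'])[['value']].sum().sort_values('value', ascending=False)
print('Number of steps per day')
daily_step.head(10)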


I walked a lot on 9/25, the day in first place. It matches the iPhone data, so it does not seem to be an error. When I looked it up, that was the day I took paid leave, went to the teamLab exhibition in Odaiba, and walked around. I see ♪

By day of the week

I tried to find out what day of the week I was walking the most.


# Total number of steps for each day of the week
# 0 = Monday ... 6 = Sunday
# (group by the index level; sum(level=...) no longer exists in pandas 2.x)
weekly_step = df_step.groupby(level='weekday')[['value']].sum().sort_index()
weekly_step
plt.style.use('ggplot')  # set the style before creating the figure
plt.figure(figsize=(10,6))

plt.title("weekly steps")
label = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# plt.xlabel("week")
# plt.ylabel("steps")
# plt.ylim(450000, 600000)

plt.bar(weekly_step.index, weekly_step['value'], tick_label=label, label="steps", align="center")
plt.show()


Hmm. I don't commute on weekends, so Saturdays and Sundays are lower, although I often go out with my family on Sundays. On Wednesdays I run a little on the treadmill at the gym on my way home from work, which may be why Wednesday is the highest. I will try to move a little more on Saturdays.

By month

Let's also look at the monthly step counts.

# Total number of steps per month
# (group by the index level; sum(level=...) no longer exists in pandas 2.x)
monthly_step = df_step.groupby(level='month')[['value']].sum().sort_index()
monthly_step
plt.style.use('ggplot')  # set the style before creating the figure
plt.figure(figsize=(15,6))

plt.title("monthly steps")
# The data covers January through October
label = list(range(1, 11))
# plt.xlabel("month")
# plt.ylabel("steps")

plt.bar(monthly_step.index, monthly_step['value'], tick_label=label, label="steps", align="center")
plt.show()


I exported the data in mid-October, so October is low. In January I did not exercise because of the flu and lazing around over the New Year... August and September were hot this year, so I did not go out much. Next year I will find ways to exercise indoors.

Combined with the steps by day of the week, it is interesting to see my own behavior patterns.

Calories burned

Finally, I would like to touch on calories burned.

- Basal metabolism: BasalEnergyBurned. Calories burned just to stay alive, without doing anything
- Active metabolism: ActiveEnergyBurned. Calories burned by exercise

As I mentioned at the beginning, the values are absurd. Probably for that reason, the totals do not match those shown on the iPhone and Apple Watch. Did something go wrong when converting XML to CSV? Or maybe the Apple Health specification changed during the year of logging.

For what it's worth, I calculated active metabolism and listed the top 10 days.


Obviously wrong... lol. Even though I was just sitting in a chair coding, 8972 kcal!! (Ironman triathlon level.) Even US special forces might not burn that many calories (´ ゚ д ゚ `)
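
The cause of the inflated value is not identified in this article. One way to investigate would be to compute the daily totals both overall and per source, so that overlapping records from several sources become visible. The following is a minimal sketch on made-up data; "SomeApp" is a hypothetical second source, and double counting is only one possible explanation.

```python
import pandas as pd

# Made-up records: on 2019-09-25 a second source reports the same 300 kcal again.
records = pd.DataFrame({
    "type": ["ActiveEnergyBurned"] * 4,
    "sourceName": ["Watch", "Watch", "SomeApp", "Watch"],
    "unit": ["kcal"] * 4,
    "creationDate": pd.to_datetime([
        "2019-09-25 09:00", "2019-09-25 18:00",
        "2019-09-25 18:00", "2019-09-26 09:00",
    ]),
    "value": [200.0, 300.0, 300.0, 250.0],
})

active = records[records["type"] == "ActiveEnergyBurned"]
day = active["creationDate"].dt.date.rename("day")

# Daily totals over all sources: what a naive sum would report.
daily_total = active.groupby(day)["value"].sum().sort_values(ascending=False)

# The same totals split by source, to see whether several sources overlap.
daily_by_source = (
    active.groupby([day, "sourceName"])["value"].sum().unstack(fill_value=0.0)
)

print(daily_total.head(10))
print(daily_by_source)
```

If one day shows sizable totals from more than one source, or if the unit column is not the same for all rows, that would be a place to look before trusting the sum.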

In conclusion

I noticed later that with the November iOS update, the "Health" app got much better. The data is now displayed in an easy-to-understand way within the app itself.

It would be interesting to analyze not only Apple Watch activity data but also data from a weight scale, sleep time, strength-training logs, and so on. Thank you for reading this far, and I wish you a healthy programmer life.
