This article is loosely based on a time series challenge I ran into with customer data. I have fabricated 3 data files that represent the same challenge, and we will go through the process of exploring that data. The primary challenge in this data set is that it comes from a sleep study and the researchers left the date portion off the time stamp. This means that at midnight the clock wraps around and the data plots back at the beginning of the x-axis. The second challenge is lining up the data to see if there is anything interesting about the time. So yes, you can simply plot using the index that Python generates, but I’m also interested in the actual time itself, as this is a study involving humans.
We are going to use a series of packages. Make sure to install these into your environment.
import numpy as np
import pandas as pd
import plotly
# Online mode only; in plotly 4 and later this module lives in the separate chart_studio package
import plotly.plotly as py
import plotly.graph_objs as go
from datetime import datetime
NumPy is a numerical computing library. Pandas is our data frame manipulator. Plotly is the best charting library out there, and datetime lets us build the dates we need.
Load the Data
First download the data from my Azure Storage: Subject1 Data, Subject2 Data, Subject3 Data. A very quick note about the data: it contains absolutely no real data. It is purely fabricated to demonstrate a few challenges. The Temp field is simply a Gaussian distribution centered at 10.
Below is the code to load the data. There isn’t much special about this.
#Prep Notebook for offline plotting
plotly.offline.init_notebook_mode()

#Build paths
basePath = "C:\\data\\Fabricated\\"
subj1_Path = "subject1.csv"
subj2_Path = "subject2.csv"
subj3_Path = "subject3.csv"

#Load Data
subj1Data = pd.read_csv(basePath + subj1_Path, index_col = 0)
subj2Data = pd.read_csv(basePath + subj2_Path, index_col = 0)
subj3Data = pd.read_csv(basePath + subj3_Path, index_col = 0)
Helper Manipulation Functions
I’m not a fan of repeating the same code a ton of times, so it’s best to knock out all of the manipulation code in one place. Here we have the 2 functions that are used.
def Day7 (x):
    # Times before noon belong to the following morning, so move them to May 7
    if x.hour < 12:
        return datetime(year = 1975, month = 5, day = 7, hour = x.hour, minute = x.minute, second = x.second)
    return x

def CleanData (data):
    data['Time'] = data['Time'].map(lambda x : pd.to_datetime(x, format="%I:%M:%S %p").time()) \
        .map(lambda x: datetime(year = 1975, month = 5, day = 6, hour = x.hour, minute = x.minute, second = x.second)) \
        .map(lambda x: Day7(x))
    # Use the time as the index and drop the now-redundant column
    # (the positional form drop('Time', 1) is rejected by pandas 2.0 and later)
    return data.set_index(data['Time']).drop(columns='Time')
Let’s start with the function “CleanData”. Basically I’m a huge fan of higher-order functions (F# background). .map allows us to clean data in a way that is more easily read. Also notice that I use the “\” to continue a line. This allows me to achieve a style that is more in line with what we do in R and F#: each higher-order function goes on its own line, which makes the manipulation code easier to read.
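As a side note on style, the same one-step-per-line layout can also be written by wrapping the whole chain in parentheses instead of ending each line with a backslash. Here is a minimal sketch, assuming a small made-up Series (the values and the names raw and cleaned are purely illustrative and are not part of the study data):

```python
import pandas as pd

# Made-up sample values, only for illustration.
raw = pd.Series([" 9.8", "10.1 ", " 10.4"])

# Parentheses let one expression span several lines, no "\" needed.
cleaned = (
    raw
    .map(lambda s: s.strip())          # remove surrounding whitespace
    .map(lambda s: float(s))           # convert text to a number
    .map(lambda x: round(x - 10, 1))   # deviation from a nominal value of 10
)

print(cleaned.tolist())
```

Either form works; the parenthesized version just avoids the problem of a stray space after a backslash breaking the line continuation.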
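The real challenge in this data set is handling the time. The first map function parses the string and keeps just the time of day. Notice we use %I instead of %H. %I reads the hour on a 12-hour clock, and under Python’s strptime rules %p only affects the hour when it is paired with %I. With %H the AM/PM marker is ignored, so an 11 PM reading would be read as 11 in the morning. The parsed result comes out in 24-hour form.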
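To see the difference between %I and %H, here is a small sketch using the standard library’s datetime.strptime, which uses the same format codes. The sample time stamp is made up, and the AM/PM text assumes an English locale:

```python
from datetime import datetime

stamp = "11:30:15 PM"  # made-up sample value

# %I reads the hour on a 12-hour clock, so %p can apply the AM/PM marker.
with_I = datetime.strptime(stamp, "%I:%M:%S %p").time()

# With %H the hour is taken as written and the AM/PM marker does not change it.
with_H = datetime.strptime(stamp, "%H:%M:%S %p").time()

print(with_I)  # 23:30:15
print(with_H)  # 11:30:15
```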
The second map function takes that time and turns it into a new datetime that includes a fabricated year, month and day. The output is in 24-hour format. This matters because, remember, these studies straddle the boundary between two days. It is the first step toward fabricating dates that keep our data in order.
The third map function just applies the Day7 function. The reason we need Day7 is that a Python lambda is limited to a single expression, lame, but oh well. The idea is that folks go to sleep at night (hours 19-24 or so). So what we need to do is take any data from before noon and put it 1 day ahead of the data from after noon. Of course there are some assumptions there, such as folks waking up before noon and going to sleep after noon. I suppose as long as you aren’t dealing with teenagers you should be ok.
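The before-noon rule can be checked in isolation with a few time stamps around midnight. This is only a sketch: attach_date is a hypothetical helper name, the sample times are made up, and the placeholder date is the same May 6, 1975 used in the article.

```python
from datetime import datetime, time, timedelta

EVENING = datetime(1975, 5, 6)  # fabricated placeholder date

def attach_date(t: time) -> datetime:
    """Put a time of day on the evening date, or on the next day if before noon."""
    day = EVENING + timedelta(days=1) if t.hour < 12 else EVENING
    return datetime.combine(day.date(), t)

# Made-up readings that cross midnight.
samples = [time(23, 59, 58), time(23, 59, 59), time(0, 0, 0), time(0, 0, 1)]
stamped = [attach_date(t) for t in samples]

for s in stamped:
    print(s)

# The readings stay in chronological order across midnight.
print(stamped == sorted(stamped))
```

Without the rule, the two readings after midnight would sort ahead of the two before it, which is exactly the wraparound that shows up in the plot.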
Finally we make the time the index and drop the Time column altogether. We do this because we want to take advantage of pandas’ windowing and computational tools on series. If your time is your index, pandas will manage the conversion between date, number and back again for you and everything is happy. If you don’t set the date as your index, you get an exception that amounts to “I don’t know how to do that”.
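To give a feel for what a datetime index buys you, here is a short sketch on a fabricated frame. The frame demo, its one-reading-per-second spacing, and the random values are assumptions made for the example, not the study files.

```python
import numpy as np
import pandas as pd

# Fabricated example: one reading per second for two hours around midnight.
idx = pd.date_range("1975-05-06 23:00:00", periods=7200,
                    freq=pd.Timedelta(seconds=1))
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Temp": rng.normal(10, 1, len(idx))}, index=idx)

# Slice by time stamp labels; both endpoints are included.
around_midnight = demo.loc["1975-05-06 23:59:00":"1975-05-07 00:00:59"]

# Downsample to one mean value per minute.
per_minute = demo.resample("1min").mean()

print(len(around_midnight), len(per_minute))
```

Both operations rely on the index holding real time stamps; with a plain integer index you would have to work out the row positions yourself.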
Clean and Window
Turns out that the data is really noisy, so I wanted to plot a few different window strategies to see what might be more optimal while maintaining core data shape.
#Format Raw Data for Render
subj1Data = CleanData(subj1Data)
subj2Data = CleanData(subj2Data)
subj3Data = CleanData(subj3Data)

#Windows Rocks
window = 1000
subj1Data_W = subj1Data.rolling(window=window, center = True).mean().dropna()
subj2Data_W = subj2Data.rolling(window=window, center = True).mean().dropna()
subj3Data_W = subj3Data.rolling(window=window, center = True).mean().dropna()
Basically we are taking a rolling mean with a window of 1,000 samples, which works out to 1,000 seconds if the data is recorded once per second. I keep the window centered because we will be plotting this on top of the raw data. We drop NaNs because a centered window has no complete set of neighbors near either end of the series, so those rows come out empty and the data set shrinks. It’s also just good practice to drop nulls.
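Because the time is the index, pandas can also size the window by elapsed time instead of by row count. The sketch below compares the two on a fabricated one-reading-per-second series, an assumption made so that rows and seconds line up. I left center=True out here because support for combining it with a time-based window depends on your pandas version.

```python
import numpy as np
import pandas as pd

# Fabricated 1 Hz series so that row counts and seconds line up.
idx = pd.date_range("1975-05-06 23:00:00", periods=60,
                    freq=pd.Timedelta(seconds=1))
temp = pd.Series(np.arange(60, dtype=float), index=idx)

by_rows = temp.rolling(window=5).mean()  # the last 5 rows
by_time = temp.rolling("5s").mean()      # the last 5 seconds; needs a datetime index

# The row-based window is NaN until it holds 5 rows. The time-based window
# (min_periods defaults to 1) averages whatever falls inside it.
print(by_rows.isna().sum(), by_time.isna().sum())

# Once both windows are full they agree on evenly spaced 1 Hz data.
print(np.allclose(by_rows.iloc[4:], by_time.iloc[4:]))
```

If your readings are not evenly spaced, the two approaches will differ, and the time-based window is usually the one that matches what you mean by “1,000 seconds”.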
Prep for Charting
Alright! Let’s get this stuff ready for rendering…
def CreateRawTrace (data, c):
    return go.Scatter(
        x = data.index,
        y = data['Temp'],
        line = dict(
            width = 1,
            color = c,
            dash = 'dash')
    )

def CreateWindTrace (data, c):
    return go.Scatter(
        x = data.index,
        y = data['Temp'],
        line = dict(
            color = c,
            width = 2)
    )

subj1_Trace_R = CreateRawTrace(subj1Data, 'rgb(205, 12, 24)')
subj2_Trace_R = CreateRawTrace(subj2Data, 'rgb(12, 205, 24)')
subj3_Trace_R = CreateRawTrace(subj3Data, 'rgb(12, 24, 205)')

subj1_Trace_W = CreateWindTrace(subj1Data_W, 'rgb(205, 12, 24)')
subj2_Trace_W = CreateWindTrace(subj2Data_W, 'rgb(12, 205, 24)')
subj3_Trace_W = CreateWindTrace(subj3Data_W, 'rgb(12, 24, 205)')

data = [subj1_Trace_R, subj2_Trace_R, subj3_Trace_R,
        subj1_Trace_W, subj2_Trace_W, subj3_Trace_W]

plotly.offline.plot(data, filename='subjects')
You can also swap plotly to work in online mode and upload to plotly if you want to push to a blog or whatever.
So we covered quite a bit here: lambda functions, more plotting, how to deal with time series in Python, and a bit on pandas. Making some good progress.