Pandas: parallel plots using groupby - python

I was wondering if anyone could help me with parallel coordinate plotting.
First, this is what the data looks like:
It's data manipulated from https://data.cityofnewyork.us/Transportation/2016-Yellow-Taxi-Trip-Data/k67s-dv2t
I'm trying to normalise some features and use them to compute the mean trip distance, passenger count and payment amount for each day of the week.
from pandas.tools.plotting import parallel_coordinates

features = ['trip_distance', 'passenger_count', 'payment_amount']

# normalise the features to the 0-1 range
for feature in features:
    df[feature] = (df[feature] - df[feature].min()) / (df[feature].max() - df[feature].min())

# parse the pickup timestamp
pickup_time = pd.to_datetime(df['pickup_datetime'], format='%d/%m/%y %H:%M')

# fill the dayofweek column with 0-6, where 0 is Monday and 6 is Sunday
df['dayofweek'] = pickup_time.dt.weekday

mean_trip = df.groupby('dayofweek').trip_distance.mean()
mean_passenger = df.groupby('dayofweek').passenger_count.mean()
mean_payment = df.groupby('dayofweek').payment_amount.mean()

#parallel_coordinates('notsurewattoput')
If I print mean_trip, it shows the mean for each day of the week, but I'm not sure how I would use this to draw a parallel coordinate plot with all three means on the same plot.
Does anyone know how to implement this?

I think you can replace the three separate mean aggregations, which return three Series:
mean_trip = df.groupby('dayofweek').trip_distance.mean()
mean_passenger = df.groupby('dayofweek').passenger_count.mean()
mean_payment = df.groupby('dayofweek').payment_amount.mean()
with a single groupby that returns one DataFrame:
from pandas.tools.plotting import parallel_coordinates
cols = ['trip_distance','passenger_count','payment_amount']
df1 = df.groupby('dayofweek', as_index=False)[cols].mean()
#https://stackoverflow.com/a/45082022
parallel_coordinates(df1, class_column='dayofweek', cols=cols)
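On pandas 0.20 and later the plotting helpers moved, so the same call works with the modern import path; a minimal sketch (the matplotlib import is an assumption, not part of the original answer):
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates  # pandas.tools.plotting on very old versions

parallel_coordinates(df1, class_column='dayofweek', cols=cols)
plt.show()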

Related

Compute and plot monthly mean SST anomalies with xarray MultiIndex (pangeo tutorial gallery)

I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of the xarray tutorial.
To reproduce, you'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches the one shown at the bottom of the tutorial. So far so good, but I'd like to compute and plot the ONI as well. Warm and cold phases of the Oceanic Niño Index (ONI) are defined by five consecutive 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above the +0.5°C threshold (warm) or below the -0.5°C threshold (cold).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni with positive red, negative blue. Something like this:
for example with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead oni.sst.plot() gives me:
Resetting the index with enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you use drop_dims('month'), the sst data goes away entirely.
I also tried converting to a pandas DataFrame with oni.to_dataframe(), but you end up with 5040 rows, which is 12 months x the 420 month-years I subsetted for. According to the docs, "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)", so I guess that makes sense, but it isn't useful. Even if you reset_index on oni before converting to a DataFrame you get the same 5040 rows. Q2. Since the dataframe must be repeating itself I can probably figure out where, but is there a cleaner way to do this, with each date not repeated for all 12 months?
Your code results in a DataArray with the dimensions time and month due to the re-chunking; this is why you end up with such a plot.
There is a trick (found here) to calculate the anomalies. Besides this, I would select 1986-2015 as the reference period (see the NOAA definition of the ONI index).
Combining both, I ended up with this short piece of code (without the bar plots):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2016-01-01')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()
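For the red/blue bar chart the question asked about, here is a minimal sketch (not part of the original answer), assuming oni is the 1-D, time-indexed DataArray computed above:
# Sketch only: convert to a pandas Series and colour the bars by sign.
oni_series = oni.to_series().dropna()
colors = ['r' if v > 0 else 'b' for v in oni_series.values]
plt.bar(oni_series.index, oni_series.values, width=20, color=colors)
plt.axhline(0.5, color='k', linewidth=0.5, linestyle='--')   # warm-phase threshold
plt.axhline(-0.5, color='k', linewidth=0.5, linestyle='--')  # cold-phase threshold
plt.title('ONI (3-month running mean of Nino 3.4 SST anomalies)')
plt.show()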

Time series dataframe. Plot two distinct time ranges with a relative starting point

I have a dataframe with a time series (daily prices of a single stock). I want to take two distinct time ranges and overlay them on one plot with a relative starting point of 0 instead of a date.
In the example below, if I plot 1962 and 2018, it uses the date as the x axis instead of a relative starting point.
SPY = pd.read_csv('GSPC.csv', parse_dates=['dDate'], index_col='dDate')
SPY1962 = SPY['1962']
SPY2018 = SPY['2018']
firstprice62 = SPY1962['nAdjClose'].iloc[0]
firstprice18 = SPY2018['nAdjClose'].iloc[0]
normal62 = SPY1962['nAdjClose'].div(firstprice62).mul(100)
normal18 = SPY2018['nAdjClose'].div(firstprice18).mul(100)
A picture of what I'm trying to accomplish
Figured it out.
import matplotlib.pyplot as plt

normal18 = normal18.reset_index()
normal62 = normal62.reset_index()
normal62['nAdjClose'].plot()
normal18['nAdjClose'].plot()
plt.show()
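A small variation on the same idea (a sketch, not part of the original answer), starting from the normalised Series defined in the question: reset_index(drop=True) keeps them as Series on a plain 0-based index, so the two years can be overlaid and labelled directly:
# Assumes normal62/normal18 are still the normalised Series from the question.
import matplotlib.pyplot as plt

ax = normal62.reset_index(drop=True).plot(label='1962')
normal18.reset_index(drop=True).plot(ax=ax, label='2018')
ax.set_xlabel('Trading day (relative)')
ax.set_ylabel('Indexed price (first day = 100)')
ax.legend()
plt.show()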

Python pandas grouping for correlation analysis

Assume two dataframes, each with a datetime index, and each with one column of unnamed data. The dataframes are of different lengths and the datetime indexes may or may not overlap.
df1 is length 20. df2 is length 400. The data column consists of random floats.
I want to iterate through df2 taking 20 units per iteration, with each iteration incrementing the start vector by one unit - and similarly the end vector by one unit. On each iteration I want to calculate the correlation between the 20 units of df1 and the 20 units I've selected for this iteration of df2. This correlation coefficient and other statistics will then be recorded.
Once the loop is complete I want to plot df1 with the 20-unit vector of df2 that satisfies my statistical search - thus needing to keep up with some level of indexing to reacquire the vector once analysis has been completed.
Any thoughts?
Without knowing more specifics of the question, such as why you are doing this or whether the dates matter, this will do what you asked. I'm happy to update based on your feedback.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
df1 = pd.DataFrame({'a': [random.randint(0, 20) for x in range(20)]},
                   index=pd.date_range(start='2013-01-01', periods=20, freq='D'))
df2 = pd.DataFrame({'b': [random.randint(0, 20) for x in range(400)]},
                   index=pd.date_range(start='2013-01-10', periods=400, freq='D'))

corr = pd.DataFrame()
for i in range(0, len(df2) - 20 + 1):
    t0 = df1.reset_index()['a']               # grab the 20 numbers from df1
    t1 = df2.iloc[i:i+20].reset_index()['b']  # grab 20 days of df2, moving the window by one each time
    t2 = df2.iloc[i:i+20].index[0]            # use the first day of this df2 window as the index
    # calculate the correlation and append it to the result
    # (DataFrame.append was removed in pandas 2.0; use pd.concat there)
    corr = corr.append(pd.DataFrame({'corr': t0.corr(t1)}, index=[t2]))

# plot it and save the graph (save before show, otherwise the saved file is blank)
corr.plot()
plt.title("Correlation Graph")
plt.ylabel("(%)")
plt.grid(True)
plt.savefig('corr.png')
plt.show()
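To cover the second part of the question (re-acquiring the best-matching window after the scan), here is a follow-up sketch that is not part of the original answer; it assumes the corr DataFrame built above:
# Find the start date of the 20-day window with the highest correlation,
# slice it back out of df2, and overlay it with df1 on a relative x axis.
best_start = corr['corr'].idxmax()
best_window = df2.loc[best_start:].iloc[:20]

fig, ax = plt.subplots()
ax.plot(range(20), df1['a'].values, label='df1')
ax.plot(range(20), best_window['b'].values, label='best df2 window')
ax.legend()
plt.show()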

Scatter plot for non numeric data

I am learning to use matplotlib with pandas and I am having a little trouble with it. There is a dataframe with districts as the row labels and coffee shops as the column labels, and the cell values are the start dates of the coffee shops in the respective districts:
        starbucks     cafe-cool     barista      ...  (60 shops)
dist1   2008-09-18    2010-05-04    2007-02-21   ...
dist2   2007-06-12    2011-02-17
dist3   ...
...
(100 districts)
I want to plot a scatter plot with the time series on the x axis and the coffee shops on the y axis. Since I couldn't figure out a direct one-line way to plot this, I extracted the coffee shops as one list and the dates as another list.
shops = list(df.columns.values)
dt = pd.DataFrame(df.ix['dist1'])
dates = dt.set_index('dist1')
First I tried plt.plot(dates, shops) and got ZeroDivisionError: integer division or modulo by zero; I could not figure out the reason for it. I saw in some posts that the data should be numeric, so I used the ytick function.
y = [1, 2, 3, 4, 5, 6,...60]
plt.plot(dates, y) still threw the same ZeroDivisionError. If I could get past this, maybe I would be able to plot using the tick function. Source:
http://matplotlib.org/examples/ticks_and_spines/ticklabels_demo_rotation.html
I am trying to plot the graph for only the first row (dist1). For that I fetched the first row as a dataframe, df1 = df.ix[1], and then used the following:
for badges, dates in df.iteritems():
    date = dates
    ax.plot_date(date, yval)
    # Record the number and label of the coffee shop
    label_ticks.append(yval)
    label_list.append(badges)
    yval += 1
I got an error at the line ax.plot_date(date, yval) saying x and y should have the same first dimension. Since I am plotting one coffee shop at a time for dist1, shouldn't the length always be one for both x and y? PS: date is a datetime.date object.
To achieve this you need to convert the dates to datetimes, see here for an example. As mentioned, you also need to convert the coffee shops into some numbering system and then change the tick labels accordingly. Here is an attempt:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
from datetime import datetime
def get_datetime(string):
    """Converts a string like '2008-05-04' to a datetime."""
    return datetime.strptime(string, "%Y-%m-%d")

# Generate a small example dataframe (dist2 has no barista, so use None there)
df = pd.DataFrame(dict(
        starbucks=["2008-09-18", "2007-06-12"],
        cafe_cool=["2010-05-04", "2011-02-17"],
        barista=["2007-02-21", None]),
    index=["dist1", "dist2"])

ax = plt.subplot(111)
label_list = []
label_ticks = []
yval = 1  # numbering system for the y axis

# Iterate through coffee shops (df.iteritems() on older pandas)
for coffee_shop, dates in df.items():
    # Convert the non-missing strings into a datetime list
    datetimes = [get_datetime(date) for date in dates if isinstance(date, str)]
    # Create a list [yval, yval, ...] of the same length to plot against
    yval_list = np.zeros(len(datetimes)) + yval
    ax.plot_date(datetimes, yval_list)
    # Record the number and label of the coffee shop
    label_ticks.append(yval)
    label_list.append(coffee_shop)
    yval += 1  # change the number so they don't all sit at the same y position

# Now set the yticks appropriately
ax.set_yticks(label_ticks)
ax.set_yticklabels(label_list)
# Set the limits so we can see everything
ax.set_ylim(ax.get_ylim()[0] - 1,
            ax.get_ylim()[1] + 1)
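Optionally (not part of the original answer), rotate the date labels so they do not overlap before rendering the figure:
plt.gcf().autofmt_xdate()  # rotate/space the x tick labels
plt.show()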

Plotting a discrete variable over time (scarf plot)

I have time series data from a repeated-measures eyetracking experiment.
The dataset consists of a number of respondents, and for each respondent there are 48 trials.
The data set has a variable ('saccade'), which gives the transitions between eye fixations, and a variable ('time'), which ranges from 0 to 1 within each trial. The transitions are classified into three categories ('ver', 'hor' and 'diag').
Here is a script that will create a small example data set in python (one participant and two trials):
import numpy as np
import pandas as pd
saccade1 = np.array(['diag','hor','ver','hor','diag','ver','hor','diag','diag',
                     'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','hor','hor','diag',
                     'diag','ver','ver','ver','ver'])
time1 = np.array(range(len(saccade1)))/float(len(saccade1)-1)
trial1 = [1]*len(time1)
saccade2 = np.array(['diag','ver','hor','diag','diag','diag','hor','ver','hor',
                     'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','diag',
                     'diag','hor','hor','diag','diag','ver','ver','ver','ver','hor','diag','diag'])
time2 = np.array(range(len(saccade2)))/float(len(saccade2)-1)
trial2 = [2]*len(time2)
saccade = np.append(saccade1,saccade2)
time = np.append(time1,time2)
trial = np.append(trial1,trial2)
subject = [1]*len(time)
df = pd.DataFrame(index=range(len(subject)))
df['subject'] = subject
df['saccade'] = saccade
df['trial'] = trial
df['time'] = time
Alternatively, I have made a csv file with the same data, which can be downloaded here.
I would like to be able to make a so-called scarf plot to visualize the sequence of transitions over time, but I have no clue how to make these plots.
I would like plots (for each participant separately) where time is on the x-axis and trial is on the y-axis. For each trial I would like the transitions represented as colored "stacked" bars.
The only example I have of these kinds of plots are in the book "Eye Tracking - A comprehensive guide to methods and measures" (fig. 6.8b) link
Can anyone help me with this?
(I can work with either Python or R - preferably Python.)
Here is a solution in R using ggplot2. You need to recode time into a time2 variable that indicates the elapsed time of each saccade instead of the cumulative time.
library(ggplot2)
dataset <- read.csv("~/Downloads/example_data_for_scarf.csv")
dataset$trial <- factor(dataset$trial)
dataset$saccade <- factor(dataset$saccade)
dataset$time2 <- c(0, diff(dataset$time))
dataset$time2[dataset$time == 0] <- 0
ggplot(dataset, aes(x = trial, y = time2, fill = saccade)) +
  geom_bar(stat = "identity") +
  coord_flip()
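Since the question prefers Python, here is a rough matplotlib equivalent of the same idea (a sketch, not from the original thread), assuming the df built in the question; each trial is one horizontal row and each saccade is a segment coloured by its category:
import matplotlib.pyplot as plt

colors = {'ver': 'tab:blue', 'hor': 'tab:orange', 'diag': 'tab:green'}
fig, ax = plt.subplots()
for row, (trial, grp) in enumerate(df.groupby('trial')):
    grp = grp.sort_values('time')
    step = grp['time'].diff().median()           # nominal spacing between saccades
    segments = [(t, step) for t in grp['time']]  # (start, width) per saccade
    ax.broken_barh(segments, (row, 0.8),
                   facecolors=[colors[s] for s in grp['saccade']])
ax.set_yticks([r + 0.4 for r in range(df['trial'].nunique())])
ax.set_yticklabels(sorted(df['trial'].unique()))
ax.set_xlabel('time')
ax.set_ylabel('trial')
plt.show()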
