make boxplot with columns from 2 dataframes [python seaborn] - python

I want to make a boxplot like this:
My data looks like this. It's separated into a control and an intervention dataframe.
control_df
   # of Days
0         10
1         12
2         30

intervention_df
   # of Days
0          2
1          1
2          2
Unfortunately, I'm not able to easily pass it to sns.boxplot. Any advice on how to format it for plotting would be appreciated.
MVE below:
import pandas as pd
# This is how my actual data is coming in
data_control = {'# of Days':[10,12,30]}
data_intervention = {'# of Days':[2,1,2]}
control_df = pd.DataFrame(data_control)
intervention_df = pd.DataFrame(data_intervention)
# This is me manually making it better for a boxplot
boxplot_data = {'Type': ['control', 'control', 'control', 'intervention', 'intervention', 'intervention'],
                '# of Days': [10, 12, 30, 2, 1, 2]}
import seaborn as sns
sns.boxplot(x='Type',y='# of Days',data=boxplot_data)

IIUC, you can use pd.concat with keys parameter and droplevel:
sns.boxplot(data=pd.concat([control_df, intervention_df],
                           axis=1,
                           keys=['Control', 'Intervention']).droplevel(1, axis=1))
Output:

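If you prefer the long format that the question builds by hand, a minimal sketch (reusing the two small frames from the MVE) is to tag each frame with its group and stack them, which gives sns.boxplot its x and y columns directly:
import pandas as pd
import seaborn as sns
control_df = pd.DataFrame({'# of Days': [10, 12, 30]})
intervention_df = pd.DataFrame({'# of Days': [2, 1, 2]})
# label each frame, then stack them into one long DataFrame
long_df = pd.concat([control_df.assign(Type='control'),
                     intervention_df.assign(Type='intervention')],
                    ignore_index=True)
sns.boxplot(x='Type', y='# of Days', data=long_df)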
Related

How to extract only year(YYYY) from a CSV column with data like YYYY-YY

I am new to Python/Bokeh/Pandas.
I am able to plot a line graph in pandas/bokeh using the parse_dates option.
However, I have come across a dataset (.csv) where the column looks like the one below.
My code, shown below, gives a blank graph when the column 'Year/Ports' is in YYYY-YY form, e.g. 1952-53, 1953-54, 1954-55, etc.
Do I have to extract only the YYYY and plot that? That works, but I am sure that is not how the data is meant to be visualized.
If I extract only the YYYY using CSV or Notepad++ tools, there is no issue: the dates are read perfectly and I get a good, meaningful line graph.
#Total Cargo Handled at Mormugao Port from 1950-51 to 2019-20
import pandas as pd
from bokeh.plotting import figure,show
from bokeh.io import output_file
#read the CSV file shared by GOI
df = pd.read_csv("Cargo_Data_full.csv",parse_dates=["Year/Ports"])
# selecting rows based on condition
output_file("Cargo tracker.html")
f = figure(height=200,sizing_mode = 'scale_width',x_axis_type = 'datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label="Year/Ports"
f.yaxis.axis_label="Cargo handled"
f.line(df['Year/Ports'],df['OTHERS'])
show(f)
You can't use parse_dates in this case, since the format is not a valid datetime. You can use pandas string slicing to only keep the YYYY part.
df = pd.DataFrame({'Year/Ports':['1952-53', '1953-54', '1954-55'], 'val':[1,2,3]})
df['Year/Ports'] = df['Year/Ports'].str[:4]
print(df)
Year/Ports val
0 1952 1
1 1953 2
2 1954 3
From there you can turn it into a datetime if that makes sense for you.
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'])
print(df)
Year/Ports val
0 1952-01-01 1
1 1953-01-01 2
2 1954-01-01 3
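For completeness, here is a sketch (untested) of how the slicing could slot into the Bokeh script from the question; the file name and column names are copied from the question, so treat them as assumptions:
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.io import output_file
# read without parse_dates, then keep only the YYYY part and convert it
df = pd.read_csv("Cargo_Data_full.csv")
df['Year/Ports'] = pd.to_datetime(df['Year/Ports'].str[:4])
output_file("Cargo tracker.html")
f = figure(height=200, sizing_mode='scale_width', x_axis_type='datetime')
f.title.text = "Cargo Tracker"
f.xaxis.axis_label = "Year/Ports"
f.yaxis.axis_label = "Cargo handled"
f.line(df['Year/Ports'], df['OTHERS'])
show(f)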

group dataframe based on columns

I am new to data science; your help is appreciated. My question is about grouping a dataframe based on columns so that a bar chart can be plotted for each subject's status.
My csv file is something like this:
Name,Maths,Science,English,sports
S1,Pass,Fail,Pass,Pass
S2,Pass,Pass,NA,Pass
S3,Pass,Fail,Pass,Pass
S4,Pass,Pass,Pass,NA
S5,Pass,Fail,Pass,NA
Expected output:
Subject,Status,Count
Maths,Pass,5
Science,Pass,2
Science,Fail,3
English,Pass,4
English,NA,1
Sports,Pass,3
Sports,NA,2
You can do this with pandas. It's not exactly the same output format as in the question, but it definitely carries the same information:
import pandas as pd
# reading csv
df = pd.read_csv("input.csv")
# turning columns into rows
melt_df = pd.melt(df, id_vars=['Name'], value_vars=['Maths', 'Science', "English", "sports"], var_name="Subject", value_name="Status")
# filling NaN values, otherwise the below groupby will ignore them.
melt_df = melt_df.fillna("Unknown")
# counting per group of subject and status.
result_df = melt_df.groupby(["Subject", "Status"]).size().reset_index(name="Count")
Then you get the following result:
Subject Status Count
0 English Pass 4
1 English Unknown 1
2 Maths Pass 5
3 Science Fail 3
4 Science Pass 2
5 sports Pass 3
6 sports Unknown 2
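If the goal is the bar chart mentioned in the question, one possible follow-up (not part of the original answer) is to pivot the counts and let pandas draw a grouped bar chart:
import matplotlib.pyplot as plt
# one group of bars per subject, one bar per status
plot_df = result_df.pivot(index="Subject", columns="Status", values="Count").fillna(0)
plot_df.plot(kind="bar")
plt.show()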
PS: Going forward, always include the code you've tried so far.
To match your output exactly, this is what you could do:
import pandas as pd
df = pd.read_csv('c:/temp/data.csv')  # Or wherever your csv file is
subjects = ['Maths', 'Science', 'English', 'sports']  # Or you could get that as df.columns and drop 'Name'
grouped_rows = []
for eachsub in subjects:
    rows = df.groupby(eachsub)['Name'].count()
    idx = list(rows.index)
    if 'Pass' in idx:
        grouped_rows.append([eachsub, 'Pass', rows['Pass']])
    if 'Fail' in idx:
        grouped_rows.append([eachsub, 'Fail', rows['Fail']])
new_df = pd.DataFrame(grouped_rows, columns=['Subject', 'Grade', 'Count'])
print(new_df)
I would suggest, though, avoiding the for loop. My approach would be just these two lines:
subjects = ['Maths', 'Science', 'English', 'sports']
grouped_rows = df.groupby(subjects)['Name'].count()  # groups by the combination of the subject columns
Depending on your application, you may already have the data you need in grouped_rows.

Is it possible to apply seasonal_decompose() after a groupby() where date is not the index of the data frame?

So, for a forecasting project, I have a really long Dataframe of multiple time series of the following type (it has a numerical index):
date        time_series_id  value
2015-08-01  0               0
2015-08-02  0               1
2015-08-03  0               2
2015-08-04  0               3
2015-08-01  1               2
2015-08-02  1               3
2015-08-03  1               4
2015-08-04  1               5
My objective is to add 3 new columns to this dataset, for each individual time series (each id), corresponding to trend, seasonal, and resid.
Because of the characteristics of the dataset, the series tend to have NaNs at the start and end of their date ranges.
What I was trying to do was the following:
from statsmodels.tsa.seasonal import seasonal_decompose
df.assign(trend=lambda x: x.groupby("time_series_id")["value"].transform(
    lambda s: s.mask(~s.isna(), other=seasonal_decompose(
        s[~s.isna()], model='additive', extrapolate_trend='freq').trend)))
The expected output (the trend values are not the actual values) should be:
date        time_series_id  value  trend
2015-08-01  0               0      1
2015-08-02  0               1      1
2015-08-03  0               2      1
2015-08-04  0               3      1
2015-08-01  1               2      1
2015-08-02  1               3      1
2015-08-03  1               4      1
2015-08-04  1               5      1
But I get the following error message:
AttributeError: 'Int64Index' object has no attribute 'inferred_freq'
In a previous iteration of my code, this worked for my individual time series data frames, since I had set the date column as the index of the data frame instead of keeping it as an additional column, so the "x" that the lambda function takes already has a datetime index appropriate for the seasonal_decompose function.
df.assign(
    trend=lambda x: x["value"].mask(~x["value"].isna(),
        other=seasonal_decompose(x["value"][~x["value"].isna()],
                                 model='additive', extrapolate_trend='freq').trend))
My questions are: first, is it possible to achieve this using groupby, or are other approaches possible? Second, is there a way to handle this that doesn't eat much memory? The original dataset I'm working on has approximately 1MM rows, so any help is really welcome :).
Did one of the already posted solutions work? If so, or if you found a different solution, please share. I tried each without success, but I'm new to Python, so I'm probably missing something.
Here is what I came up with, using a for loop. For my dataset it took 8 minutes to decompose 20 million rows consisting of 6,000 different subsets. This works but I wish it were faster.
Date Time            Segment ID  Travel Time(Minutes)
2021-11-09 07:15:00  1           30
2021-11-09 07:30:00  1           18
2021-11-09 07:15:00  2           30
2021-11-09 07:30:00  2           17
import pandas as pd
import statsmodels.api as sm

# 'frame' is the raw data shown in the table above
segments = set(frame['Segment ID'])
data = pd.DataFrame([])
for s in segments:
    df = frame[frame['Segment ID'] == s].set_index('Date Time').resample('H').mean()
    comp = sm.tsa.seasonal_decompose(x=df['Travel Time(Minutes)'], period=24*7, two_sided=False)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    # optional columns with some statistics to find outliers and trend changes
    df['resid zscore'] = (df['resid'] - df['resid'].mean()).div(df['resid'].std())
    df['trend pct_change'] = df.trend.pct_change()
    df['trend pct_change zscore'] = (df['trend pct_change'] - df['trend pct_change'].mean()).div(df['trend pct_change'].std())
    data = data.append(df.dropna())  # note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
Where you have lambda x: x.groupby(..., you don't have anything to group; you are telling it to group a row (I believe). You can try a setup like this instead, perhaps.
Here you define a function to act on the group you are sending via the apply() method. Then you should be able to use your original code.
I have not tested this, but I use this setup quite often to work on groups.
def trend_function(x):
    # do your lambda function here, as you are sending each grouping
    x = x.assign(
        trend=lambda x: x["value"].mask(~x["value"].isna(),
            other=seasonal_decompose(x["value"][~x["value"].isna()],
                                     model='additive', extrapolate_trend='freq').trend))
    return x

dfnew = df.groupby('time_series_id').apply(trend_function)
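A more complete sketch along those lines (my own untested assumption; period=2 is just a placeholder because the toy data has only four points per series) sets the date as the index inside each group, so seasonal_decompose sees a DatetimeIndex instead of the Int64Index from the error message:
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
df = pd.DataFrame({
    'date': pd.to_datetime(['2015-08-01', '2015-08-02', '2015-08-03', '2015-08-04'] * 2),
    'time_series_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'value': [0, 1, 2, 3, 2, 3, 4, 5],
})
def add_trend(g):
    g = g.set_index('date')
    s = g['value']
    # decompose only the non-missing part, then align it back by date
    g['trend'] = seasonal_decompose(s[s.notna()], model='additive',
                                    extrapolate_trend='freq', period=2).trend
    return g.reset_index()
dfnew = df.groupby('time_series_id', group_keys=False).apply(add_trend)
print(dfnew)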
Use extrapolate_trend='freq' as a parameter. You add the trend, seasonal, and residual to a dictionary and plot the dictionary.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
import statsmodels.api as sm
date=['2015-08-01','2015-08-02','2015-08-03','2015-08-04','2015-08-01','2015-08-02','2015-08-03','2015-08-04']
time_series_id=[0,0,0,0,1,1,1,1]
value=[0,1,2,3,2,3,4,5]
df=pd.DataFrame({'date':date,'time_series_id':time_series_id,'value':value})
df['date']=pd.to_datetime(df['date'])
df=df.set_index('date')
print(df)
index_day = df.index.day
value_by_day = df.groupby(index_day)['value'].mean()
fig,ax = plt.subplots(figsize=(12,4))
value_by_day.plot(ax=ax)
plt.title('value by day')
plt.show()
df[['value']].boxplot()
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].hist(ax=ax, bins=5)
plt.show()
fig,ax = plt.subplots(figsize=(12,4))
df[['value']].plot(kind='density', ax=ax)
plt.show()
plt.clf()
fig,ax = plt.subplots(figsize=(12,4))
plt.style.use('seaborn-pastel')
fig = tsaplots.plot_acf(df['value'], lags=4,ax=ax)
plt.show()
decomposition=sm.tsa.seasonal_decompose(x=df['value'],model='additive', extrapolate_trend='freq', period=1)
decomposition.plot()
plt.show()
decomposition_trend=decomposition.trend
ax= decomposition_trend.plot(figsize=(14,2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')
plt.show()
I changed the first piece of code according to my scenario.
Here's my code and attached output
data = pd.DataFrame([])
segments = set(subset['Planning_Material'])
for s in segments:
    df = subset[subset['Planning_Material'] == s].set_index('Cal_year_month').resample('M').sum()
    comp = sm.tsa.seasonal_decompose(df)
    df = df.join(comp.trend).join(comp.seasonal).join(comp.resid)
    df['Planning_Material'] = s
    data = pd.concat([data, df])
data = data.reset_index()
data = data[['Planning_Material', 'Cal_year_month', 'Historical_demand', 'trend', 'seasonal', 'resid']]
data

Pandas, read in file without a separator between columns

I want to read in a file that looks like this:
1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09
So, each column is denoted by a 15 character sequence, but there's no official separator. Does pandas have a way of doing this?
Yes! It's called pd.read_fwf:
from io import StringIO
import pandas as pd
txt = """ 1.49998061E-01 2.49996769E-01 3.99994830E-01 5.99992245E-01 9.99987075E-01
1.49998061E+00 2.49996769E+00 5.99992245E+00 9.99987075E+00 1.99997415E+01
4.99993537E+01 9.99987075E+01 .00000000E+00-2.70636350E+03-6.37027451E+03
-1.97521328E+04-4.64928272E+04-1.09435407E+05-3.39323088E+05-7.98702345E+05
-1.87999269E+06-5.82921376E+06-1.37207895E+07-2.26385807E+07-4.25429547E+07
-7.60167523E+07-1.25422049E+08-2.35690283E+08-3.88862033E+08-7.30701955E+08
-1.30546599E+09-2.15348023E+09-4.04455001E+09-4.54896210E+09-5.32533888E+09"""
pd.read_fwf(StringIO(txt), widths=[15] * 5, header=None)
0 1 2 3 4
0 1.499981e-01 2.499968e-01 3.999948e-01 5.999922e-01 9.999871e-01
1 1.499981e+00 2.499968e+00 5.999922e+00 9.999871e+00 1.999974e+01
2 4.999935e+01 9.999871e+01 0.000000e+00 -2.706363e+03 -6.370275e+03
3 -1.975213e+04 -4.649283e+04 -1.094354e+05 -3.393231e+05 -7.987023e+05
4 -1.879993e+06 -5.829214e+06 -1.372079e+07 -2.263858e+07 -4.254295e+07
5 -7.601675e+07 -1.254220e+08 -2.356903e+08 -3.888620e+08 -7.307020e+08
6 -1.305466e+09 -2.153480e+09 -4.044550e+09 -4.548962e+09 -5.325339e+09
Let's look at using pd.read_fwf:
df = pd.read_fwf(csv_file, widths=[15]*5, header=None)
You can also do it like this, for example with housing.data:
dataset = pd.read_csv('c:/1/housing.data', engine='python', sep=r'\s+', header=None)

Big data visualization for multiple sampled data points from a large log

I have a log file which I need to plot in Python with different data points as a multi-line plot, with a line for each unique point. The problem is that in some samples some points are missing, and new points are added in other samples. Shown below is an example, with each line denoting a sample of n points, where n is variable:
2015-06-20 16:42:48,135 current stats=[ ('keypassed', 13), ('toy', 2), ('ball', 2),('mouse', 1) ...]
2015-06-21 16:42:48,135 current stats=[ ('keypassed', 20, ('toy', 5), ('ball', 7), ('cod', 1), ('fish', 1) ... ]
In the first sample above, 'mouse' is present but absent in the second line, while new data points such as 'cod' and 'fish' are added in each sample.
How can this be done in Python in the quickest and cleanest way? Are there any existing Python utilities which can help plot this timed log file? Also, being a log file, the samples number in the thousands, so the visualization should be able to display them properly.
I'm interested in applying multivariate hexagonal binning to this, with a different colored hexagon for each unique column ('ball', 'mouse', etc.). scikit offers hexagonal binning, but I can't figure out how to render different colors for each hexagon based on the unique data point. Any other visualization technique would also help here.
Getting the data into pandas:
import pandas as pd

df = pd.DataFrame(columns=['timestamp', 'name', 'value'])
with open(logfilepath) as f:
    for line in f.readlines():
        timestamp = line.split(',')[0]
        # the data part of each line can be evaluated directly as a Python list
        data = eval(line.split('=')[1])
        # convert the input data from wide format to long format
        for name, value in data:
            df = df.append({'timestamp': timestamp, 'name': name, 'value': value},
                           ignore_index=True)

# convert from long format back to wide format, and fill null values with 0
df2 = df.pivot_table(index='timestamp', columns='name')
df2 = df2.fillna(0)
df2
Out[142]:
                    value
name                 ball cod fish keypassed mouse toy
timestamp
2015-06-20 16:42:48     2   0    0        13     1   2
2015-06-21 16:42:48     7   1    1        20     0   5
Plot the data:
import matplotlib.pylab as plt
df2.value.plot()
plt.show()
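A small variant (assuming the same log format; 'app.log' is a placeholder path) that avoids eval on untrusted log content and the now-removed DataFrame.append by building the wide table directly:
import ast
import pandas as pd
import matplotlib.pyplot as plt
rows = []
with open('app.log') as f:
    for line in f:
        timestamp = line.split(',')[0]
        # literal_eval parses the list of tuples without executing arbitrary code
        data = ast.literal_eval(line.split('=')[1])
        rows.append({'timestamp': timestamp, **dict(data)})
df2 = pd.DataFrame(rows).set_index('timestamp').fillna(0)
df2.plot()
plt.show()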
