Boxplot Pandas data - python

The DataFrame is as follows:
             ID1           ID2
0   00:00:01.002  00:00:01.002
1   00:00:01.001  00:00:01.006
2   00:00:01.004  00:00:01.011
3   00:00:00.998  00:00:01.012
4            NaT  00:00:01.000
...
20           NaT  00:00:00.998
What I am trying to do is create a boxplot for each ID. There may or may not be multiple IDs depending on the dataset I provide; for now I am trying to solve this for 2 datasets. If possible I would like one solution that puts all the data on the same boxplot, and another that displays the data on its own boxplot per ID.
I am very new to pandas (trying to learn it...) and am just getting frustrated at how long this is taking to figure out... Here is my code:
deltaT = pd.DataFrame()  # blank DataFrame to collect one column of deltas per ID
for x in range(0, len(totIDs)):
    ID = IDList[x]
    df = pd.DataFrame(data[ID]).T
    deltaT[ID] = pd.to_datetime(df[TIME_COL]).diff()
deltaT.boxplot()
Pretty simple, but I just can't seem to get it to do what I want in plotting a boxplot for each ID. I should note that data is given to me by a homegrown file reader that takes several complex files and sorts them into the data dictionary, which is indexed by IDs.
I am running pandas version 0.14.0 and Python version 2.7.7.

I am not sure how this works in version 0.14.0, because the latest is 0.19.2 - I recommend upgrading if possible:
import numpy as np
import pandas as pd

# sample data
np.random.seed(180)
dates = pd.date_range('2017-01-01 10:11:20', periods=10, freq='T')
cols = ['ID1', 'ID2']
df = pd.DataFrame(np.random.choice(dates, size=(10, 2)), columns=cols)
print (df)
                  ID1                 ID2
0 2017-01-01 10:12:20 2017-01-01 10:17:20
1 2017-01-01 10:16:20 2017-01-01 10:20:20
2 2017-01-01 10:18:20 2017-01-01 10:17:20
3 2017-01-01 10:12:20 2017-01-01 10:16:20
4 2017-01-01 10:14:20 2017-01-01 10:18:20
5 2017-01-01 10:18:20 2017-01-01 10:19:20
6 2017-01-01 10:17:20 2017-01-01 10:12:20
7 2017-01-01 10:13:20 2017-01-01 10:17:20
8 2017-01-01 10:16:20 2017-01-01 10:11:20
9 2017-01-01 10:13:20 2017-01-01 10:19:20
Call DataFrame.diff and then convert the timedeltas to seconds with Series.dt.total_seconds:
df = df.diff().apply(lambda x: x.dt.total_seconds())
print(df)
     ID1    ID2
0    NaN    NaN
1  240.0  180.0
2  120.0 -180.0
3 -360.0  -60.0
4  120.0  120.0
5  240.0   60.0
6  -60.0 -420.0
7 -240.0  300.0
8  180.0 -360.0
9 -180.0  480.0
Last, use DataFrame.plot.box:
df.plot.box()
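The question also asked for a variant with each ID on its own boxplot; a minimal sketch (assuming the same df of per-ID deltas in seconds as above) passes subplots through to the plotting backend - the layout and figsize values are just illustrative choices:
# one subplot per column (ID)
df.plot.box(subplots=True, layout=(1, 2), figsize=(8, 4))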
You can also check the docs.

Related

Is it possible to use Pandas Overlap in a Dataframe?

Python 3.7,
pandas 0.25
I have a pandas dataframe with columns for startdate and enddate. I am looking for ranges that overlap the range of my variable(s). Without being verbose and composing a series of greater-than/less-than statements with ands/ors to filter out the rows I need, I would like to use some sort of interval "overlap". It appears pandas has this functionality:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Interval.overlaps.html
The following test works:
range1 = pd.Interval(pd.Timestamp('2017-01-01 00:00:00'),pd.Timestamp('2018-01-01 00:00:00'),closed='both')
range2 = pd.Interval(pd.Timestamp('2016-01-01 00:00:00'),pd.Timestamp('2017-01-01 00:00:00'),closed='both')
range1.overlaps(range2)
However, when I go to apply it to the dataframe columns, it does not. I am not sure if there is something wrong in my syntax, or if this simply cannot be applied to a dataframe. Here are some of the things I have tried (and received the gamut of errors):
start_range = '2017-07-01 00:00:00'
end_current = '2019-07-01 00:00:00'
reporttest_range = pd.Interval(pd.Timestamp(start_range),pd.Timestamp(end_current),closed='both')
reporttest_filter = my_dataframe[my_dataframe['startdate']['enddate'].overlaps(reporttest_range)]
reporttest_filter = my_dataframe[my_dataframe['startdate','enddate'].overlaps(reporttest_range)]
reporttest_filter = my_dataframe[(my_dataframe['startdate','enddate']).overlaps(reporttest_range)]
reporttest_filter = my_dataframe.filter(['startdate','enddate']).overlaps(reporttest_range)
reporttest_filter = my_dataframe.filter['startdate','enddate'].overlaps(reporttest_range)
reporttest_filter = my_dataframe.filter(['startdate','enddate']).overlaps(reporttest_range)
print(reporttest_filter)
Can someone please point me to an efficient way to accomplish this?
As requested, the dataframe output looks like this:
   record   startdate     enddate
0      99  2017-07-01  2018-06-30
1     280  2018-08-01  2021-07-31
2     100  2017-07-01  2018-06-30
3     281  2017-07-01  2018-06-30
You need to create an IntervalIndex from df.startdate and df.enddate and use overlaps against reporttest_range. Your sample returns all True, so I added rows for a False case.
Sample df:
   record   startdate     enddate
0    9931  2017-07-01  2018-06-30
1   28075  2018-08-01  2021-07-31
2   10042  2017-07-01  2018-06-30
3   28108  2017-07-01  2018-06-30
4   28109  2016-07-01  2016-12-30
5   28111  2017-07-02  2018-09-30
iix = pd.IntervalIndex.from_arrays(df.startdate, df.enddate, closed='both')
iix.overlaps(reporttest_range)
Out[400]: array([ True, True, True, True, False, True])
Use it to pick only the overlapping rows:
df[iix.overlaps(reporttest_range)]
Out[401]:
   record   startdate     enddate
0    9931  2017-07-01  2018-06-30
1   28075  2018-08-01  2021-07-31
2   10042  2017-07-01  2018-06-30
3   28108  2017-07-01  2018-06-30
5   28111  2017-07-02  2018-09-30
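Tying this back to the question's own variable names, a minimal end-to-end sketch (assuming my_dataframe holds datetime columns startdate and enddate, as in the question):
reporttest_range = pd.Interval(pd.Timestamp('2017-07-01 00:00:00'), pd.Timestamp('2019-07-01 00:00:00'), closed='both')
iix = pd.IntervalIndex.from_arrays(my_dataframe['startdate'], my_dataframe['enddate'], closed='both')
reporttest_filter = my_dataframe[iix.overlaps(reporttest_range)]
print(reporttest_filter)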

Pandas custom re-sample for time series data

I have time series data at 1-minute frequency. I would like to resample the data to every 5 minutes, and the resampled data should include the data of the first, middle and last time steps.
I have tried it like this, but I am not getting what I am expecting...
def my_fun(array):
    return array[0], array[-1]

df = pd.DataFrame(np.arange(60), index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
If I understood you correctly, you want the data for minutes 0, 2, 4, 5, 7, 9, 10, ... in a new dataframe. A faster way than using resample may be:
df = pd.DataFrame(np.arange(60), index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
l = len(df)
# union of the index positions 0, 2 and 4 of every 5-minute block
df.loc[df.iloc[range(0, l, 5)].index.union(df.iloc[range(2, l, 5)].index).union(df.iloc[range(4, l, 5)].index)]
Output:
                     0
2017-01-01 00:00:00  0
2017-01-01 00:02:00  2
2017-01-01 00:04:00  4
2017-01-01 00:05:00  5
2017-01-01 00:07:00  7
2017-01-01 00:09:00  9
2017-01-01 00:10:00 10
If you just wanted a combined list of your selected data in one row per window, then you were almost there:
def my_fun(array):
    return [array[0], array[2], array[4]]

df = pd.DataFrame({'0': np.arange(60)}, index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
Output:
                                0
2017-01-01 00:00:00     (0, 2, 4)
2017-01-01 00:05:00     (5, 7, 9)
2017-01-01 00:10:00  (10, 12, 14)
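If the window length may vary, the same idea can be written positionally so it always picks the first, middle and last observation of each bin; a sketch (first_mid_last is a hypothetical helper name, not from the question):
def first_mid_last(x):
    # first, middle and last value of each 5-minute bin
    return [x.iloc[0], x.iloc[len(x) // 2], x.iloc[-1]]

df.resample('5T').apply(first_mid_last)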

Add time interval values in new column Pandas

I have a large pandas dataframe (40 million rows) with the following format:
                  ID             DATETIME   TIMESTAMP
81215545953683710540  2017-01-01 17:39:57  1483243205
74994612102903447699  2017-01-01 19:14:12  1483243261
48126186377367976994  2017-01-01 17:19:29  1483243263
23522333658893375671  2017-01-01 12:50:46  1483243266
16194691060240380504  2017-01-01 15:59:23  1483243288
I am trying to assign a value to each row depending on the timestamp, so that I have groups of rows with the same value if they are in the same time interval.
Let's say I have t0 = 1483243205 and I want a different value once TIMESTAMP reaches t0 + 10. So here my time interval would be 10.
I would like something like this:
                  ID             DATETIME   TIMESTAMP  VALUE
81215545953683710540  2017-01-01 17:39:57  1483243205      0
74994612102903447699  2017-01-01 19:14:12  1483243261      5
48126186377367976994  2017-01-01 17:19:29  1483243263      5
23522333658893375671  2017-01-01 12:50:46  1483243266      6
16194691060240380504  2017-01-01 15:59:23  1483243288      8
Here is my code:
df['VALUE'] = ''
t = 1483243205
j = 0
for i in range(0, len(df['TIMESTAMP'])):
    while (df.iloc[i][2]) < (t + 10):
        df['VALUE'][i] = j
        i += 1
    t += 10
    j += 1
I get a warning when executing my code (SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame) and I get the following result:
                  ID             DATETIME   TIMESTAMP  VALUE
81215545953683710540  2017-01-01 17:39:57  1483243205      0
74994612102903447699  2017-01-01 19:14:12  1483243261
48126186377367976994  2017-01-01 17:19:29  1483243263
23522333658893375671  2017-01-01 12:50:46  1483243266
16194691060240380504  2017-01-01 15:59:23  1483243288
It is not the first time I have encountered the warning and I have always overcome it, but I am confused by the fact that I only got a value for the first row.
Does anyone know what I am missing ?
Thanks
I would suggest using pandas' cut method to achieve this, preventing the need to explicitly loop through your DataFrame.
tmin, tmax = df['TIMESTAMP'].min(), df['TIMESTAMP'].max()
bins = list(range(tmin, tmax + 10, 10))   # 10-second bin edges covering the full range
labels = list(range(len(bins) - 1))       # one integer label per bin
df['VALUE'] = pd.cut(df['TIMESTAMP'], bins=bins, labels=labels, include_lowest=True)
                     ID             DATETIME   TIMESTAMP VALUE
0  81215545953683710540  2017-01-01 17:39:57  1483243205     0
1  74994612102903447699  2017-01-01 19:14:12  1483243261     5
2  48126186377367976994  2017-01-01 17:19:29  1483243263     5
3  23522333658893375671  2017-01-01 12:50:46  1483243266     6
4  16194691060240380504  2017-01-01 15:59:23  1483243288     8
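Because the bins here are fixed-width and anchored at the minimum timestamp, plain integer arithmetic gives the same labels without building explicit bin edges; a minimal sketch under that assumption:
# integer division by the 10-second interval width
df['VALUE'] = (df['TIMESTAMP'] - df['TIMESTAMP'].min()) // 10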

How to merge group rows in a dataframe based on differences between datetimes?

I have a dataframe which contains events on each row, with Start and End datetimes.
import pandas as pd
import datetime

df = pd.DataFrame({'Value': [1., 2., 3.],
                   'Start': [datetime.datetime(2017, 1, 1, 0, 0, 0), datetime.datetime(2017, 1, 1, 0, 1, 0), datetime.datetime(2017, 1, 1, 0, 4, 0)],
                   'End': [datetime.datetime(2017, 1, 1, 0, 0, 59), datetime.datetime(2017, 1, 1, 0, 5, 0), datetime.datetime(2017, 1, 1, 0, 6, 0)]},
                  index=[0, 1, 2])
df
Out[7]:
                  End               Start  Value
0 2017-01-01 00:00:59 2017-01-01 00:00:00    1.0
1 2017-01-01 00:05:00 2017-01-01 00:01:00    2.0
2 2017-01-01 00:06:00 2017-01-01 00:04:00    3.0
I would like to group consecutive rows where the difference between the End of one row and the Start of the next row is smaller than a given timedelta.
E.g. here, with a timedelta of 5 seconds I would like to group rows 0 and 1, and with a timedelta of 2 minutes it should yield rows 0, 1 and 2.
A solution would be to compare consecutive rows with their shifted version using .shift(); however, I would need to iterate the comparison multiple times if groups of more than 2 rows need to be merged.
As my df is very large, this is not an option.
threshold = datetime.timedelta(minutes=5)
# gap between each row's Start and the previous row's End;
# a gap larger than the threshold starts a new group
gap = df['Start'] - df['End'].shift()
df['group'] = (gap > threshold).cumsum()
groups = df.groupby('group')
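As a usage sketch, each merged group can then be collapsed into a single event (the aggregation choices here are illustrative, not from the question):
# one row per merged group: earliest Start, latest End, summed Value
merged = groups.agg({'Start': 'min', 'End': 'max', 'Value': 'sum'})
print(merged)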
I assume you are trying to aggregate based on the time difference.
marker = 60
df = df.assign(diff=df.apply(lambda row: (row.End - row.Start).total_seconds() <= marker, axis=1))
for g in df.groupby('diff'):
    print(g[1])
                  End               Start  Value   diff
1 2017-01-01 00:05:00 2017-01-01 00:01:00    2.0  False
2 2017-01-01 00:06:00 2017-01-01 00:04:00    3.0  False
                  End               Start  Value  diff
0 2017-01-01 00:00:59 2017-01-01 00:00:00    1.0   True

How to access last element of a multi-index dataframe

I have a dataframe with IDs and timestamps as a multi-index. The index is sorted by IDs and timestamps, and I want to pick the latest timestamp for each ID. For example:
IDs  timestamp   value
0    2010-10-30      1
     2010-11-30      2
1    2000-01-01    300
     2007-01-01     33
     2010-01-01    400
2    2000-01-01     11
So basically the result I want is:
IDs  timestamp   value
0    2010-11-30      2
1    2010-01-01    400
2    2000-01-01     11
What is the command to do that in pandas?
Given this setup:
import pandas as pd
import numpy as np
import io
content = io.StringIO("""\
IDs timestamp value
0 2010-10-30 1
0 2010-11-30 2
1 2000-01-01 300
1 2007-01-01 33
1 2010-01-01 400
2 2000-01-01 11""")
df = pd.read_table(content, header=0, sep=r'\s+', parse_dates=[1])
df.set_index(['IDs', 'timestamp'], inplace=True)
Using reset_index followed by groupby:
df.reset_index(['timestamp'], inplace=True)
print(df.groupby(level=0).last())
yields
              timestamp  value
IDs
0   2010-11-30 00:00:00      2
1   2010-01-01 00:00:00    400
2   2000-01-01 00:00:00     11
This does not feel like the best solution, however. There should be a way to do this without calling reset_index...
As you point out in the comments, last ignores NaN values. To not skip NaN values, you could use groupby/agg like this:
df.reset_index(['timestamp'], inplace=True)
grouped = df.groupby(level=0)
print(grouped.agg(lambda x: x.iloc[-1]))
One can also use
df.groupby("IDs").tail(1)
This will take the last row of each label in level "IDs" and will not ignore NaN values.
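A quick check on the sample above (a sketch; note that unlike the reset_index approach, tail(1) keeps both index levels on the result):
print(df.groupby(level='IDs').tail(1))
#                 value
# IDs timestamp
# 0   2010-11-30      2
# 1   2010-01-01    400
# 2   2000-01-01     11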
