Vertical box plots on the same chart - python

I have two datasets:
Date Daily Frequency
0 2019-01-01 1
1 2019-01-02 5
2 2019-01-03 11
3 2019-01-04 9
4 2019-01-06 1
5 2019-01-07 8
6 2019-01-08 7
7 2019-01-09 4
8 2019-01-10 5
9 2019-01-11 3
and
Date Daily Frequency
0 2020-01-01 1
1 2020-01-02 13
2 2020-01-03 13
3 2020-01-04 4
4 2020-01-06 1
5 2020-01-07 15
6 2020-01-08 11
7 2020-01-09 12
8 2020-01-10 11
9 2020-01-11 4
I would like to plot vertical box plots on the same chart, one beside the other, in order to compare them.
import seaborn as sns
ax = sns.boxplot( y="Daily Frequency", data=df1)
ax1 = sns.boxplot( y="Daily Frequency", data=df2)
but it draws one box plot inside the other.
Can you please tell me how to create two distinct box plots on the same chart?
Thanks

Try this:
pd.concat([df1,df2], axis=1).boxplot()
Output: a single chart with the two Daily Frequency box plots drawn side by side.
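Because both frames use the same column name, the two boxes come out with identical labels. A minimal sketch that renames each series first so the boxes are distinguishable (the '2019'/'2020' labels are assumptions, not from the question):
import pandas as pd

# give each series a distinct name so each box gets its own label
combined = pd.concat(
    [df1['Daily Frequency'].rename('2019'),
     df2['Daily Frequency'].rename('2020')],
    axis=1)
combined.boxplot()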
Python - Sum of column values between 2 dates

I am trying to create a new column in my dataframe:
Let X be a variable number of days.
                  Date  Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00           5
1  2019-01-01 15:00:00           4
2  2019-01-05 11:00:00           1
3  2019-01-12 12:00:00           3
4  2019-01-15 15:00:00           2
5  2019-02-04 18:00:00           7
For each row, I need to sum its units sold plus all the units sold in the last 10 days (letting X = 10 days).
Desired Result:
                  Date  Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00           5                                    5
1  2019-01-01 15:00:00           4                                    9
2  2019-01-05 11:00:00           1                                   10
3  2019-01-12 12:00:00           3                                    4
4  2019-01-15 15:00:00           2                                    6
5  2019-02-04 18:00:00           7                                    7
I have used the .rolling(window=) method before with integer periods, and I think something like
df = df.rolling("10D").sum()
can help, but I can't get the syntax right. Please help!
Try:
df["Total Units sold in the last 10 days"] = df.rolling(on="Date", window="10D", closed="both").sum()["Units Sold"]
print(df)
Prints:
Date Units Sold Total Units sold in the last 10 days
0 2019-01-01 5 5.0
1 2019-01-01 4 9.0
2 2019-01-05 1 10.0
3 2019-01-12 3 4.0
4 2019-01-15 2 6.0
5 2019-02-04 7 7.0
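Note that a time-based window like "10D" requires the on column to have a datetime dtype and to be monotonically increasing. A minimal preparation sketch, assuming Date was read in as strings:
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date").reset_index(drop=True)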

Get cumulative mean among groups in Python

I am trying to get a cumulative mean in Python among different groups.
I have data as follows:
id date value
1 2019-01-01 2
1 2019-01-02 8
1 2019-01-04 3
1 2019-01-08 4
1 2019-01-10 12
1 2019-01-13 6
2 2019-01-01 4
2 2019-01-03 2
2 2019-01-04 3
2 2019-01-06 6
2 2019-01-11 1
The output I'm trying to get looks something like this:
id date value cumulative_avg
1 2019-01-01 2 NaN
1 2019-01-02 8 2
1 2019-01-04 3 5
1 2019-01-08 4 4.33
1 2019-01-10 12 4.25
1 2019-01-13 6 5.8
2 2019-01-01 4 NaN
2 2019-01-03 2 4
2 2019-01-04 3 3
2 2019-01-06 6 3
2 2019-01-11 1 3.75
I need the cumulative average to restart with each new id.
I can get a variation of what I'm looking for with a single id; for example, if the data set only had the rows where id = 1, then I could use:
df['cumulative_avg'] = df['value'].expanding().mean().shift(1)
I tried to add a groupby to it but I get an error:
df['cumulative_avg'] = df.groupby('id')['value'].expanding().mean().shift(1)
TypeError: incompatible index of inserted column with frame index
Also tried:
df.set_index(['account'])
ValueError: cannot handle a non-unique multi-index!
The actual data I have has millions of rows and thousands of unique ids. Any help with a speedy/efficient way to do this would be appreciated.
For many groups this will perform better because it ditches the apply used in the answer below: take the cumsum divided by the cumcount, subtracting off the row's own value to get the analog of an expanding mean shifted by one. Conveniently, pandas interprets 0/0 as NaN, which handles each group's first row.
gp = df.groupby('id')['value']
df['cum_avg'] = (gp.cumsum() - df['value'])/gp.cumcount()
id date value cum_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000
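To see why this works, take id 1 on 2019-01-08: the cumulative sum up to and including that row is 2 + 8 + 3 + 4 = 17, subtracting the row's own value leaves 13, and cumcount counts the 3 preceding rows, so 13/3 ≈ 4.33, exactly the expanding mean of the prior values.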
After a groupby you can't simply chain methods this way: in your example the shift is no longer performed per group, so you would not get the expected result. There is also an index-alignment problem afterwards, so you can't create a column like this directly. Instead you can do:
df['cumulative_avg'] = df.groupby('id')['value'].apply(lambda x: x.expanding().mean().shift(1))
print (df)
id date value cumulative_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000

Convert column of integers to time in HH:MM:SS format efficiently

I am trying to develop a more efficient approach to a problem. At the moment, the code below returns a string when a row matches a specific value. However, the values are always in the same order, so a loop could make this process more efficient.
Using the df below as an example, integers represent time periods: each increase of 1 equates to a 15-minute period, so 1 == 8:00:00, 2 == 8:15:00, etc. At the moment I repeat this process up to the last time period; if that reaches 80 it becomes very inefficient. Could a loop be incorporated here?
import pandas as pd

d = {'Time': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6]}
df = pd.DataFrame(data=d)

def time_period(row):
    if row['Time'] == 1:
        return '8:00:00'
    if row['Time'] == 2:
        return '8:15:00'
    if row['Time'] == 3:
        return '8:30:00'
    if row['Time'] == 4:
        return '8:45:00'
    if row['Time'] == 5:
        return '9:00:00'
    if row['Time'] == 6:
        return '9:15:00'
    .....
    if row['Time'] == 80:
        return '4:00:00'

df['24Hr Time'] = df.apply(lambda row: time_period(row), axis=1)
print(df)
Out:
Time 24Hr Time
0 1 8:00:00
1 1 8:00:00
2 1 8:00:00
3 2 8:15:00
4 2 8:15:00
5 2 8:15:00
6 3 8:30:00
7 3 8:30:00
8 3 8:30:00
9 4 8:45:00
10 4 8:45:00
11 4 8:45:00
12 5 9:00:00
13 5 9:00:00
14 5 9:00:00
15 6 9:15:00
16 6 9:15:00
17 6 9:15:00
This is possible with some simple timedelta arithmetic:
df['24Hr Time'] = (
    pd.to_timedelta((df['Time'] - 1) * 15, unit='m') + pd.Timedelta(hours=8))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time timedelta64[ns]
dtype: object
If you need a string, use pd.to_datetime with unit and origin:
df['24Hr Time'] = (
    pd.to_datetime((df['Time'] - 1) * 15, unit='m', origin='8:00:00')
      .dt.strftime('%H:%M:%S'))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time object
dtype: object
In general, you want to make a dictionary and map it:
my_dict = {'old_val1': 'new_val1',...}
df['24Hr Time'] = df['Time'].map(my_dict)
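In this case the mapping itself can be built with a loop instead of 80 hand-written branches. A sketch following the other answers' convention that period 1 is 8:00 (note that %H zero-pads, so 8:00:00 appears as 08:00:00):
from datetime import datetime, timedelta

# the date part is a dummy; only the clock time matters for the string
start = datetime(2019, 1, 1, 8, 0)
my_dict = {i: (start + timedelta(minutes=15 * (i - 1))).strftime('%H:%M:%S')
           for i in range(1, 81)}
df['24Hr Time'] = df['Time'].map(my_dict)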
But, in this case, you can do it with a timedelta:
df['24Hr Time'] = pd.to_timedelta(df['Time']*15, unit='T') + pd.to_timedelta('7:45:00')
Output (note that the new column is of type timedelta, not string):
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
I ended up using this:
pd.to_datetime((df.Time-1)*15*60+8*60*60,unit='s').dt.time
0 08:00:00
1 08:00:00
2 08:00:00
3 08:15:00
4 08:15:00
5 08:15:00
6 08:30:00
7 08:30:00
8 08:30:00
9 08:45:00
10 08:45:00
11 08:45:00
12 09:00:00
13 09:00:00
14 09:00:00
15 09:15:00
16 09:15:00
17 09:15:00
Name: Time, dtype: object
A fun way is using pd.timedelta_range and Index.repeat:
n = df.Time.nunique()
c = df.groupby('Time').size()
df['24_hr'] = pd.timedelta_range(start='8 hours', periods=n, freq='15T').repeat(c)
Out[380]:
Time 24_hr
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
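Note that this relies on the frame already being sorted by Time with contiguous values, since the repeated range is laid over the ordered group sizes purely by position.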

Python Data-wrangling

I have a dataframe in Python below:
print (df)
Date Hour Weight
0 2019-01-01 8 1
1 2019-01-01 16 2
2 2019-01-01 24 6
3 2019-01-02 8 10
4 2019-01-02 16 4
5 2019-01-02 24 12
6 2019-01-03 8 10
7 2019-01-03 16 6
8 2019-01-03 24 5
How can I create a column (New_Col) that returns the value of 'Hour' at the lowest value of 'Weight' within each day? I'm expecting:
Date Hour Weight New_Col
2019-01-01 8 1 8
2019-01-01 16 2 8
2019-01-01 24 6 8
2019-01-02 8 10 16
2019-01-02 16 4 16
2019-01-02 24 12 16
2019-01-03 8 10 24
2019-01-03 16 6 24
2019-01-03 24 5 24
Use GroupBy.transform with DataFrameGroupBy.idxmin, but first set Hour as the index, so that idxmin returns the Hour of the minimal Weight in each group:
df['New'] = df.set_index('Hour').groupby('Date')['Weight'].transform('idxmin').values
print (df)
Date Hour Weight New_Col New
0 2019-01-01 8 1 8 8
1 2019-01-01 16 2 8 8
2 2019-01-01 24 6 8 8
3 2019-01-02 8 10 16 16
4 2019-01-02 16 4 16 16
5 2019-01-02 24 12 16 16
6 2019-01-03 8 10 24 24
7 2019-01-03 16 6 24 24
8 2019-01-03 24 5 24 24
Alternative solution:
df['New'] = df['Date'].map(df.set_index('Hour').groupby('Date')['Weight'].idxmin())
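For intuition, the mapped Series pairs each date with the Hour of its minimal Weight (output reconstructed from the sample data):
print(df.set_index('Hour').groupby('Date')['Weight'].idxmin())
Date
2019-01-01     8
2019-01-02    16
2019-01-03    24
Name: Weight, dtype: int64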

Combine and Manipulate two columns as Date using PANDAS

I have a csv file and I am reading it through pandas:
cols=['DATE(GMT)','TIME(GMT)','DATASET']
df=pd.read_csv('datasets.csv', usecols=cols)
csv file content are as follows:
DATE(GMT) TIME(GMT) DATASET
05-01-2018 0 10
05-01-2018 1 15
05-01-2018 2 21
05-01-2018 3 9
05-01-2018 4 25
05-01-2018 5 7
... ... ...
05-02-2018 14 65
Now I need to combine 'DATE(GMT)' and 'TIME(GMT)' into a single datetime column, so that I have only two columns, i.e. DATETIME and DATASET.
You can add the parse_dates parameter to read_csv to get a datetime column:
df = pd.read_csv('datasets.csv', usecols=cols, parse_dates=['DATE(GMT)'])
print (df.dtypes)
DATE(GMT) datetime64[ns]
TIME(GMT) int64
DATASET int64
dtype: object
And then add the time column, converted with to_timedelta:
df['DATE(GMT)'] += pd.to_timedelta(df.pop('TIME(GMT)').astype(str), unit='H')
print (df)
DATE(GMT) DATASET
0 2018-05-01 00:00:00 10
1 2018-05-01 01:00:00 15
2 2018-05-01 02:00:00 21
3 2018-05-01 03:00:00 9
4 2018-05-01 04:00:00 25
5 2018-05-01 05:00:00 7
6 2018-05-02 14:00:00 65
EDIT:
There is a problem: some data are non-numeric:
print (df)
DATE(GMT) TIME(GMT) DATASET
0 05-01-2018 0 10
1 05-01-2018 1 15
2 05-01-2018 2 21
3 05-01-2018 3 9
4 05-01-2018 4 25
5 05-01-2018 s 7
6 05-02-2018 a 65
You can find these rows:
print (df[pd.to_numeric(df['TIME(GMT)'], errors='coerce').isnull()])
DATE(GMT) TIME(GMT) DATASET
5 05-01-2018 s 7
6 05-02-2018 a 65
And then, if needed, replace them by 0 (along with all missing values):
df['TIME(GMT)'] = pd.to_numeric(df['TIME(GMT)'], errors='coerce').fillna(0)
print (df)
DATE(GMT) TIME(GMT) DATASET
0 05-01-2018 0.0 10
1 05-01-2018 1.0 15
2 05-01-2018 2.0 21
3 05-01-2018 3.0 9
4 05-01-2018 4.0 25
5 05-01-2018 0.0 7
6 05-02-2018 0.0 65
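Putting the steps together, a sketch reusing the question's file and column names:
import pandas as pd

cols = ['DATE(GMT)', 'TIME(GMT)', 'DATASET']
df = pd.read_csv('datasets.csv', usecols=cols, parse_dates=['DATE(GMT)'])
# coerce non-numeric hours to 0, then fold the hour into the date
hours = pd.to_numeric(df.pop('TIME(GMT)'), errors='coerce').fillna(0)
df['DATE(GMT)'] += pd.to_timedelta(hours, unit='H')
df = df.rename(columns={'DATE(GMT)': 'DATETIME'})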
