Python dataframe column update: replace as per WHERE condition

Below are the SQL queries that update Date to the new format:
update data set Date=[Time Period]+'-01-01' where Frequency='0'
update data set Date=replace([Time Period],'Q1','-01-01')
where Frequency='2' and substring([Time Period],5,2)='Q1'
update data set Date=replace([Time Period],'Q2','-04-01')
where Frequency='2' and substring([Time Period],5,2)='Q2'
update data set Date=replace([Time Period],'Q3','-07-01')
where Frequency='2' and substring([Time Period],5,2)='Q3'
update data set Date=replace([Time Period],'Q4','-10-01')
where Frequency='2' and substring([Time Period],5,2)='Q4'
update data set Date=replace([Time Period],'M','-')+'-01'
where Frequency='3' and len([Time Period])=7
update data set Date=replace([Time Period],'M','-0')+'-01'
where Frequency='3' and len([Time Period])=6
Now I have loaded the same data into a Python dataframe.
Below is sample data from the dataframe, comma-separated.
The Time Period column is the input and the Date column is the desired output; I need to convert Time Period into the Date format.
Frequency,Time Period,Date
0,2008,2008-01-01
0,1961,1961-01-01
2,2009Q1,2009-01-01
2,1975Q4,1975-10-01
2,2007Q3,2007-07-01
2,1959Q4,1959-10-01
2,1965Q4,1965-10-01
2,2008Q3,2008-07-01
3,1969M2,1969-02-01
3,1994M12,1994-12-01
3,1990M1,1990-01-01
3,1994M10,1994-10-01
3,2012M11,2012-11-01
3,1994M3,1994-03-01
Please let me know how to update Date according to the conditions above in Python.

It is a bit tricky to use a vectorized approach when adding different offsets.
Consider the following approach:
Source DF:
In [337]: df
Out[337]:
Frequency Time Period
0 0 2008
1 0 1961
2 2 2009Q1
3 2 1975Q4
4 2 2007Q3
5 2 1959Q4
6 2 1965Q4
7 2 2008Q3
8 3 1969M2
9 3 1994M12
10 3 1990M1
11 3 1994M10
12 3 2012M11
13 3 1994M3
Solution:
In [338]: %paste
df[['y','mm']] = (df['Time Period']
                  .replace(['Q1', 'Q2', 'Q3', 'Q4'],
                           ['M0', 'M3', 'M6', 'M9'],
                           regex=True)
                  .str.extract(r'(\d{4})M?(\d+)?', expand=True))
df['Date'] = (pd.to_datetime(df.pop('y'), format='%Y', errors='coerce')
                .values.astype('M8[M]')
              + pd.to_numeric(df.pop('mm'), errors='coerce')
                  .fillna(0).astype(int).values * np.timedelta64(1, 'M')
             ).astype('M8[D]')
## -- End pasted text --
Result:
In [339]: df
Out[339]:
Frequency Time Period Date
0 0 2008 2008-01-01
1 0 1961 1961-01-01
2 2 2009Q1 2009-01-01
3 2 1975Q4 1975-10-01
4 2 2007Q3 2007-07-01
5 2 1959Q4 1959-10-01
6 2 1965Q4 1965-10-01
7 2 2008Q3 2008-07-01
8 3 1969M2 1969-03-01
9 3 1994M12 1995-01-01
10 3 1990M1 1990-02-01
11 3 1994M10 1994-11-01
12 3 2012M11 2012-12-01
13 3 1994M3 1994-04-01
EDIT by Scott Boston (please remove if you find a better way): the version above shifts monthly values forward by one month (e.g. 1969M2 becomes 1969-03-01 instead of 1969-02-01); the following corrects that off-by-one.
df[['y','mm']] = (df['Time Period']
                  .replace(['Q1', 'Q2', 'Q3', 'Q4'],
                           ['M1', 'M4', 'M7', 'M10'],
                           regex=True)
                  .str.extract(r'(\d{4})M?(\d+)?', expand=True))
df['Date'] = (pd.to_datetime(df.pop('y'), format='%Y', errors='coerce')
                .values.astype('M8[M]')
              + (pd.to_numeric(df.pop('mm'), errors='coerce')
                   .fillna(1).astype(int).values - 1) * np.timedelta64(1, 'M')
             ).astype('M8[D]')
Output:
Frequency Time Period Date
0 0 2008 2008-01-01
1 0 1961 1961-01-01
2 2 2009Q1 2009-01-01
3 2 1975Q4 1975-10-01
4 2 2007Q3 2007-07-01
5 2 1959Q4 1959-10-01
6 2 1965Q4 1965-10-01
7 2 2008Q3 2008-07-01
8 3 1969M2 1969-02-01
9 3 1994M12 1994-12-01
10 3 1990M1 1990-01-01
11 3 1994M10 1994-10-01
12 3 2012M11 2012-11-01
13 3 1994M3 1994-03-01
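As an alternative (not from the original answer), a per-frequency approach that mirrors the SQL rules one condition at a time may read more clearly. A minimal sketch, assuming Frequency was loaded as a string ('0' annual, '2' quarterly, '3' monthly) as it is compared in the SQL:
import pandas as pd

df['Date'] = pd.NaT  # start with an empty Date column

annual = df['Frequency'] == '0'     # compare to 0, 2, 3 instead if numeric
quarterly = df['Frequency'] == '2'
monthly = df['Frequency'] == '3'

# '2008'   -> 2008-01-01
df.loc[annual, 'Date'] = pd.to_datetime(df.loc[annual, 'Time Period'], format='%Y')
# '2009Q1' -> 2009-01-01 (pandas parses the yyyyQn style natively)
df.loc[quarterly, 'Date'] = pd.PeriodIndex(df.loc[quarterly, 'Time Period'], freq='Q').to_timestamp()
# '1969M2' -> 1969-02-01 (%m accepts one- and two-digit months)
df.loc[monthly, 'Date'] = pd.to_datetime(df.loc[monthly, 'Time Period'], format='%YM%m')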

Related

Pandas/Python groupby and then calculate mean for another column within each group

I have a table like the following. I want to group all the rows by the values in ToD and then calculate the mean of LOS for all the rows in the same group.
This is how I created the DataFrame
df_sim = pd.DataFrame(columns=['ID','Type','Type_n','FT','ta','DoW','ToD','t_departure'])
I tried the following.
df_sim.groupby('ToD').LOS.mean()
The error I get is DataError: No numeric types to aggregate.
What is confusing to me is that the following works. In my mind, the only difference is that I now take the sum instead of the mean.
df_sim[['ToD','LOS']].groupby('ToD').sum()
ID Type Type_n FT ta DoW ToD t_departure LOS
0 0 ESI4 4 0 0.648446 Sun 0 3.87411 3.22567
1 1 ESI2 1 0 0.663228 Sun 0 1.42772 0.764489
2 2 A-ESI3 2 0 4.72354 Sun 4 4.90432 0.180779
3 3 NA-ESI3 3 0 5.26787 Sun 5 5.39109 0.123218
4 4 NA-ESI3 3 0 5.79297 Sun 5 5.98826 0.195283
5 5 A-ESI3 2 0 7.30924 Sun 7 7.49349 0.184249
6 6 A-ESI3 2 0 7.71666 Sun 7 8.20255 0.485886
7 7 NA-ESI3 3 0 8.22392 Sun 8 9.76091 1.53699
8 8 ESI4 4 0 8.30123 Sun 8 8.41346 0.112227
9 9 ESI4 4 0 8.40325 Sun 8 9.91045 1.5072
With help from the comments, we figured out something like this:
df_sim = pd.DataFrame({'ID': int(),
                       'Type': str(),
                       'Type_n': int(),
                       'FT': int(),
                       'ta': float(),
                       'DoW': str(),
                       'ToD': int(),
                       't_departure': float()
                       }, index=[])
However, I am not sure why index=[] is required. But it seems to be working.
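For what it's worth, the root cause is that creating an empty DataFrame from column names alone gives every column object dtype, and values appended later keep that dtype; groupby(...).mean() then finds no numeric columns and raises DataError, while .sum() falls back to Python-level addition on the objects, which is why it appears to work. A minimal sketch of an alternative fix, coercing the column after the frame is filled (column names taken from the question):
import pandas as pd

# coerce the object-dtype column to numeric, then the mean aggregates normally
df_sim['LOS'] = pd.to_numeric(df_sim['LOS'], errors='coerce')
print(df_sim.groupby('ToD')['LOS'].mean())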

How to calculate the average number of days between the first and the second order in a dataframe which contains more than 2 orders per client?

I have a dataframe like the following:
id_cliente id_ordine data_ordine id_medium
0 madinside IML-0042758 2016-08-23 1190408
1 lisbeth19 IML-0071225 2017-02-26 1205650
2 lisbeth19 IML-0072944 2017-03-15 1207056
3 lisbeth19 IML-0077676 2017-05-12 1211395
4 lisbeth19 IML-0077676 2017-05-12 1207056
5 madinside IML-0094979 2017-09-29 1222195
6 lisbeth19 IML-0099675 2017-11-15 1211446
7 lisbeth19 IML-0099690 2017-11-15 1225212
8 lisbeth19 IML-0101439 2017-12-02 1226511
9 lisbeth19 IML-0109883 2018-03-14 1226511
I would like to add three columns:
the first column could be named "number of order per client" and should be the progression of orders made by the same client.
So order IML-0042758 should be 1, IML-0071225 should be 1, IML-0072944 should be 2, IML-0077676 should be 3, IML-0094979 should be 2, and so on..
the second column could be named "days between first and n order of the same client" and shows the "data_ordine" difference (a datetime column) between the different orders made by the same client.
So the values for the first 6 rows would be: 0 (2016-08-23 - 2016-08-23), 0 (2017-02-26 - 2017-02-26), 17 (2017-03-15 - 2017-02-26), 75 (2017-05-12 - 2017-02-26), 75 (2017-05-12 - 2017-02-26), 402 (2017-09-29 - 2016-08-23, madinside's first order).
the third column could be named "days between first and n order of the same id_medium" and shows the "data_ordine" difference (a datetime column) between the different orders per id_medium.
So the values for the first 6 rows would be: 0 (2016-08-23 - 2016-08-23), 0 (2017-02-26 - 2017-02-26), 0 (2017-03-15 - 2017-03-15), 0 (2017-05-12 - 2017-05-12), 58 (2017-05-12 - 2017-03-15 because the medium "1207056" is ordered for the second time), 0 (2017-09-29 - 2017-09-29).
In the end I would like to calculate how long it takes on average for a client to make a second order, a third order, a fourth order and so on.
And how long it takes on average for a client to make a second, third (etc.) order for the same id_medium.
First convert to datetime and sort, so the calculations are reliable.
For the first column we can use groupby + ngroup to label each order, then subtract each client's minimum label so they all start from 1.
For days from the 1st order, use groupby + transform to get each client's first date, then subtract.
The third column is the same; just add id_medium to the grouping.
Code:
df['data_ordine'] = pd.to_datetime(df['data_ordine'])
df = df.sort_values('data_ordine')
df['Num_ords'] = df.groupby(['id_cliente', 'id_ordine']).ngroup()
df['Num_ords'] = df.Num_ords - df.groupby(['id_cliente']).Num_ords.transform('min')+1
df['days_bet'] = (df.data_ordine - df.groupby('id_cliente').data_ordine.transform('min')).dt.days
df['days_bet_id'] = (df.data_ordine - df.groupby(['id_cliente', 'id_medium']).data_ordine.transform('min')).dt.days
Output:
id_cliente id_ordine data_ordine id_medium Num_ords days_bet days_bet_id
0 madinside IML-0042758 2016-08-23 1190408 1 0 0
1 lisbeth19 IML-0071225 2017-02-26 1205650 1 0 0
2 lisbeth19 IML-0072944 2017-03-15 1207056 2 17 0
3 lisbeth19 IML-0077676 2017-05-12 1211395 3 75 0
4 lisbeth19 IML-0077676 2017-05-12 1207056 3 75 58
5 madinside IML-0094979 2017-09-29 1222195 2 402 0
6 lisbeth19 IML-0099675 2017-11-15 1211446 4 262 0
7 lisbeth19 IML-0099690 2017-11-15 1225212 5 262 0
8 lisbeth19 IML-0101439 2017-12-02 1226511 6 279 0
9 lisbeth19 IML-0109883 2018-03-14 1226511 7 381 102
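For the final part of the question (how long it takes on average to reach the second, third, ... order), one option is to group on the columns created above; a sketch, where Num_ords, days_bet and days_bet_id come from the answer and Num_ords_med is a hypothetical helper:
# average days from the first order to the n-th order, across clients
# (only clients who actually reached order n contribute to that mean)
avg_to_nth = df.groupby('Num_ords')['days_bet'].mean()

# same idea per medium: number each (client, medium) pair's orders by date,
# then average the day gaps per order number
df['Num_ords_med'] = (df.groupby(['id_cliente', 'id_medium'])['data_ordine']
                        .rank(method='dense').astype(int))
avg_to_nth_med = df.groupby('Num_ords_med')['days_bet_id'].mean()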

Re-format Dataframe column such that any numeric month substring is replaced with month string

Looking to reformat a string column that is causing errors in Django. My df:
import pandas as pd
data = {'Date_Str': ['2018_11','2018_12','2019_01','2019_02','2019_03','2019_04','2019_05','2019_06','2019_07','2019_08','2019_09','2019_10']}
df = pd.DataFrame(data)
print(df)
Date_Str
0 2018_11
1 2018_12
2 2019_01
3 2019_02
4 2019_03
5 2019_04
6 2019_05
7 2019_06
8 2019_07
9 2019_08
10 2019_09
11 2019_10
My solution:
df['Date_Month'] = df.Date_Str.str[-2:]
mapper = {'01':'Jan', '02':'Feb', '03':'Mar','04':'Apr','05':'May','06':'Jun','07':'Jul','08':'Aug','09':'Sep','10':'Oct','11':'Nov','12':'Dec'}
df['Date_Month_Str'] = df.Date_Str.str[0:4] + '_' + df.Date_Month.map(mapper)
print(df)
Desired output is column Date_Month_Str or simply update Date_Str with yyyy_mmm
Date_Str Date_Month Date_Month_Str
0 2018_11 11 2018_Nov
1 2018_12 12 2018_Dec
2 2019_01 01 2019_Jan
3 2019_02 02 2019_Feb
4 2019_03 03 2019_Mar
5 2019_04 04 2019_Apr
6 2019_05 05 2019_May
7 2019_06 06 2019_Jun
8 2019_07 07 2019_Jul
9 2019_08 08 2019_Aug
10 2019_09 09 2019_Sep
11 2019_10 10 2019_Oct
Can the three lines be reduced to one? Or can Date_Str simply be updated with a one-liner?
Convert column to datetimes and then use Series.dt.strftime:
df['Date_Month_Str'] = pd.to_datetime(df.Date_Str, format='%Y_%m').dt.strftime('%Y_%b')
print(df)
Date_Str Date_Month_Str
0 2018_11 2018_Nov
1 2018_12 2018_Dec
2 2019_01 2019_Jan
3 2019_02 2019_Feb
4 2019_03 2019_Mar
5 2019_04 2019_Apr
6 2019_05 2019_May
7 2019_06 2019_Jun
8 2019_07 2019_Jul
9 2019_08 2019_Aug
10 2019_09 2019_Sep
11 2019_10 2019_Oct
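And to take the question's second option and update Date_Str in place, the same expression can simply be assigned back:
df['Date_Str'] = pd.to_datetime(df.Date_Str, format='%Y_%m').dt.strftime('%Y_%b')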

Python: Pandas dataframe re-arrange rows based on last three digits of Integer in Column

I have the following dataframe:
YearMonth Total Cost
2015009 $11,209,041
2015010 $20,581,043
2015011 $37,079,415
2015012 $36,831,335
2016008 $57,428,630
2016009 $66,754,405
2016010 $45,021,707
2016011 $34,783,970
2016012 $66,215,044
YearMonth is an int64 column. A value in YearMonth such as 2015009 stands for September 2015. I want to re-order the rows so that if the last 3 digits are the same, then I want the rows to appear right on top of each other sorted by year.
Below is my desired output:
YearMonth Total Cost
2015009 $11,209,041
2016009 $66,754,405
2015010 $20,581,043
2016010 $45,021,707
2015011 $37,079,415
2016011 $34,783,970
2015012 $36,831,335
2016012 $66,215,044
2016008 $57,428,630
I have scoured google to try and find how to do this but to no avail.
df['YearMonth'] = pd.to_datetime(df['YearMonth'],format = '%Y0%m')
df['Year'] = df['YearMonth'].dt.year
df['Month'] = df['YearMonth'].dt.month
df.sort_values(['Month','Year'])
YearMonth Total Cost Year Month
4 2016-08-01 $57,428,630 2016 8
0 2015-09-01 $11,209,041 2015 9
5 2016-09-01 $66,754,405 2016 9
1 2015-10-01 $20,581,043 2015 10
6 2016-10-01 $45,021,707 2016 10
2 2015-11-01 $37,079,415 2015 11
7 2016-11-01 $34,783,970 2016 11
3 2015-12-01 $36,831,335 2015 12
8 2016-12-01 $66,215,044 2016 12
This is one way of doing it. There may be a quicker way with fewer steps that doesn't involve converting YearMonth to datetime, but if you have a date, it makes more sense to use it as one.
Another way of doing this is to cast the int column to string and use the string accessor with indexing:
df.assign(sortkey=df.YearMonth.astype(str).str[-3:])\
.sort_values('sortkey')\
.drop('sortkey', axis=1)
Output:
YearMonth Total Cost
4 2016008 $57,428,630
0 2015009 $11,209,041
5 2016009 $66,754,405
1 2015010 $20,581,043
6 2016010 $45,021,707
2 2015011 $37,079,415
7 2016011 $34,783,970
3 2015012 $36,831,335
8 2016012 $66,215,044
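For the "quicker way with fewer steps" hinted at above, the sort key can also be built with integer arithmetic alone. A sketch, assuming the last three digits are always the month (requires pandas >= 1.1 for the key argument of sort_values):
# month = s % 1000, year = s // 1000; sort by (month, year)
df.sort_values('YearMonth', key=lambda s: (s % 1000) * 10000 + s // 1000)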

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning Python (and pandas), and I hope I can explain this well enough. I have a large time series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, with 3 million rows and 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
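As a sanity check (not part of the original answer), the same two-step idea applied to the toy dataset above reproduces the desired means; a sketch with the toy columns hard-coded:
import pandas as pd

# toy data from the edited question (Date/Dow/Hour already extracted)
df_toy = pd.DataFrame({
    'Date': ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
            + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id': [1234] * 14,
    'Dow': [0] * 9 + [1] * 5,
    'Hour': [9] * 8 + [10] + [11] * 5,
    'Count': [1] * 14,
})

# step 1: total tickets per Id, per calendar date, per hour
daily = df_toy.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()

# step 2: average the daily totals over all dates
print(daily.groupby(['Id', 'Dow', 'Hour'])['Count'].mean())
# Id=1234, Dow=0, Hour=9 -> 4.0; Hour=10 -> 1.0; Dow=1, Hour=11 -> 2.5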
You can also group by the 'Id' column and then use the resample function followed by .sum() (the old how='sum' argument is deprecated in recent pandas).
