I have df:
ID,"address","used_at","active_seconds","pageviews"
71ecd2aa165114e5ee292131f1167d8c,"auto.drom.ru",2014-05-17 10:58:59,166,2
71ecd2aa165114e5ee292131f1167d8c,"auto.drom.ru",2016-07-17 17:34:07,92,4
70150aba267f671045f147767251d169,"avito.ru/*/avtomobili",2014-06-15 11:52:09,837,40
bc779f542049bcabb9e68518a215814e,"auto.yandex.ru",2014-01-16 22:23:56,8,1
bc779f542049bcabb9e68518a215814e,"avito.ru/*/avtomobili",2014-01-18 14:38:33,313,5
bc779f542049bcabb9e68518a215814e,"avito.ru/*/avtomobili",2016-07-18 18:12:07,20,1
I need to delete all rows where used_at is later than 2016-06-30. How can I do that?
Use dt.date with boolean indexing:
print (df.used_at.dt.date > pd.to_datetime('2016-06-30').date())
0 False
1 True
2 False
3 False
4 False
5 True
Name: used_at, dtype: bool
print (df[df.used_at.dt.date > pd.to_datetime('2016-06-30').date()])
ID address \
1 71ecd2aa165114e5ee292131f1167d8c auto.drom.ru
5 bc779f542049bcabb9e68518a215814e avito.ru/*/avtomobili
used_at active_seconds pageviews
1 2016-07-17 17:34:07 92 4
5 2016-07-18 18:12:07 20 1
Or you can build the date from year, month and day:
print (df[df.used_at.dt.date > pd.Timestamp(2016, 6, 30).date()])
ID address \
1 71ecd2aa165114e5ee292131f1167d8c auto.drom.ru
5 bc779f542049bcabb9e68518a215814e avito.ru/*/avtomobili
used_at active_seconds pageviews
1 2016-07-17 17:34:07 92 4
5 2016-07-18 18:12:07 20 1
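The boolean masks above select the rows after 2016-06-30. To actually delete them and keep only the earlier rows, invert the condition — a minimal sketch along the same lines (assuming used_at is already a datetime column):
df = df[df.used_at.dt.date <= pd.to_datetime('2016-06-30').date()]
print (df)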
I'm trying to add a calculated column to a dataframe based on a condition that involves another dataframe.
Example:
I have a dataframe Users that contains:
Out[4]:
UserID Active BaseSalaryCOP BaseSalaryUSD FromDate ToDate
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 9405749 2475.20 05/11/2020 05/11/2021
1 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 3831329 1008.24 05/11/2020 04/11/2021
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e True 3775657 993.59 05/11/2020 05/11/2021
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 True 9542508 2511.19 05/11/2020 05/11/2021
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 True 8994035 2366.85 05/11/2020 05/11/2021
And I have another dataframe called Rate that contains the UserID.
I want to add a calculated column with BaseSalaryUSD divided by 18 where the UserID matches and the ToDate matches as well.
Something like: if the Date matches ToDate and the UserID matches, then add a new column that contains Users["BaseSalaryUSD"] / 18:
Out[5]:
AccountID Date rate
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 04/21/2021 137.51
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e 05/11/2021 55.19
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 05/11/2021 139.51
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 05/11/2021 131.49
Any idea?
Thanks
Merge both DataFrames with an outer join, then filter with Series.between and divide the column with Series.div:
Rate['Date'] = pd.to_datetime(Rate['Date'])
Users['FromDate'] = pd.to_datetime(Users['FromDate'])
Users['ToDate'] = pd.to_datetime(Users['ToDate'])
df = Users.merge(Rate.rename(columns={'AccountID':'UserID'}), on='UserID', how='outer')
df = df[df['Date'].between(df['FromDate'], df['ToDate'])]
df['new'] = df['BaseSalaryUSD'].div(18)
print (df)
UserID Active BaseSalaryCOP \
0 557058:36103848-2606-4d87-9af8-b0498f1c6713 True 9405749
2 557058:7df66ef4-b04d-4ce9-9cdc-55751909a61e True 3775657
3 557058:b0a4e46c-9bfe-439e-ae6e-500e3c2a87e2 True 9542508
4 557058:b25dbdb2-aa23-4706-9e50-90b2f66b60a5 True 8994035
BaseSalaryUSD FromDate ToDate Date rate new
0 2475.20 2020-05-11 2021-05-11 2021-04-21 137.51 137.511111
2 993.59 2020-05-11 2021-05-11 2021-05-11 55.19 55.199444
3 2511.19 2020-05-11 2021-05-11 2021-05-11 139.51 139.510556
4 2366.85 2020-05-11 2021-05-11 2021-05-11 131.49 131.491667
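If you only need the three columns from the expected output, you could follow up by selecting and renaming them (a small sketch on top of the merged df above):
out = df[['UserID', 'Date', 'new']].rename(columns={'UserID': 'AccountID', 'new': 'rate'})
print (out)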
I've got a dataframe - and I want to drop specific rows per group ("id"):
id - month - max
1 - 112016 - 41
1 - 012017 - 46
1 - 022017 - 156
1 - 032017 - 164
1 - 042017 - 51
2 - 042017 - 26
2 - 052017 - 156
2 - 062017 - 17
for each "id", find location of first row (sorted by "month") where "max" is >62
keep all rows above (within this group), drop rest of rows
Expected result:
id - month - max
1 - 112016 - 41
1 - 012017 - 46
2 - 042017 - 26
I'm able to identify the first row which has to be deleted per group, but I'm stuck from that point on:
df[df['max'] > 62].sort_values(['month'], ascending=[True]).groupby('id', as_index=False).first()
How can I get rid of the rows?
Best regards,
david
Use:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
#compare with gt (>), take the cumulative sum per group and keep rows where it equals 0
df1 = df[df['max'].gt(62).groupby(df['id']).cumsum().eq(0)]
print (df1)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
Or use a custom function if you also need the first value > 62:
#convert to datetimes
df['month'] = pd.to_datetime(df['month'], format='%m%Y')
#sorting per groups if necessary
df = df.sort_values(['id','month'])
def f(x):
    m = x['max'].gt(62)
    first = m[m].index[0]
    x = x.loc[:first]
    return x
df = df.groupby('id', group_keys=False).apply(f)
print (df)
id month max
0 1 2016-11-01 41
1 1 2017-01-01 46
2 1 2017-02-01 156
5 2 2017-04-01 83
import pandas as pd
datadict = {
'id': [1,1,1,1,1,2,2,2],
'max': [41,46,156,164,51,83,156,17],
'month': ['112016', '012017', '022017', '032017', '042017', '042017', '052017', '062017'],
}
df = pd.DataFrame(datadict)
print (df)
id max month
0 1 41 112016
1 1 46 012017
2 1 156 022017
3 1 164 032017
4 1 51 042017
5 2 83 042017
6 2 156 052017
7 2 17 062017
df = df.loc[df['max']>62,:]
print (df)
id max month
2 1 156 022017
3 1 164 032017
5 2 83 042017
6 2 156 052017
Below are the SQL queries I use to update Date into the new format:
update data set Date=[Time Period]+'-01-01' where Frequency='0'
update data set Date=replace([Time Period],'Q1','-01-01')
where Frequency='2' and substring([Time Period],5,2)='Q1'
update data set Date=replace([Time Period],'Q2','-04-01')
where Frequency='2' and substring([Time Period],5,2)='Q2'
update data set Date=replace([Time Period],'Q3','-07-01')
where Frequency='2' and substring([Time Period],5,2)='Q3'
update data set Date=replace([Time Period],'Q4','-10-01')
where Frequency='2' and substring([Time Period],5,2)='Q4'
update data set Date=replace([Time Period],'M','-')+'-01'
where Frequency='3' and len([Time Period])=7
update data set Date=replace([Time Period],'M','-0')+'-01'
where Frequency='3' and len([Time Period])=6
Now I have loaded the same data into a Python dataframe.
Here is sample data from the dataframe, comma separated.
The Time Period column is the input and the Date column is the output; I need to convert Time Period into the Date format.
Frequency,Time Period,Date
0,2008,2008-01-01
0,1961,1961-01-01
2,2009Q1,2009-04-01
2,1975Q4,1975-10-01
2,2007Q3,2007-04-01
2,1959Q4,1959-10-01
2,1965Q4,1965-07-01
2,2008Q3,2008-07-01
3,1969M2,1969-02-01
3,1994M12,1994-12-01
3,1990M1,1990-01-01
3,1994M10,1994-10-01
3,2012M11,2012-11-01
3,1994M3,1994-03-01
Please let me know how to update Date per the above conditions in Python.
It is a bit tricky to use a vectorized approach when adding different offsets.
Consider the following approach:
Source DF:
In [337]: df
Out[337]:
Frequency Time Period
0 0 2008
1 0 1961
2 2 2009Q1
3 2 1975Q4
4 2 2007Q3
5 2 1959Q4
6 2 1965Q4
7 2 2008Q3
8 3 1969M2
9 3 1994M12
10 3 1990M1
11 3 1994M10
12 3 2012M11
13 3 1994M3
Solution:
import numpy as np

df[['y','mm']] = (df['Time Period']
                     .replace(['Q1', 'Q2', 'Q3', 'Q4'],
                              ['M0', 'M3', 'M6', 'M9'],
                              regex=True)
                     .str.extract(r'(\d{4})M?(\d+)?', expand=True))

df['Date'] = (pd.to_datetime(df.pop('y'), format='%Y', errors='coerce')
                .values.astype('M8[M]')
              + pd.to_numeric(df.pop('mm'), errors='coerce')
                  .fillna(0).astype(int).values * np.timedelta64(1, 'M')
              ).astype('M8[D]')
Result:
In [339]: df
Out[339]:
Frequency Time Period Date
0 0 2008 2008-01-01
1 0 1961 1961-01-01
2 2 2009Q1 2009-01-01
3 2 1975Q4 1975-10-01
4 2 2007Q3 2007-07-01
5 2 1959Q4 1959-10-01
6 2 1965Q4 1965-10-01
7 2 2008Q3 2008-07-01
8 3 1969M2 1969-03-01
9 3 1994M12 1995-01-01
10 3 1990M1 1990-02-01
11 3 1994M10 1994-11-01
12 3 2012M11 2012-12-01
13 3 1994M3 1994-04-01
EDIT by Scott Boston (please remove if you find a better way):
df[['y','mm']] = (df['Time Period']
                     .replace(['Q1', 'Q2', 'Q3', 'Q4'],
                              ['M1', 'M4', 'M7', 'M10'],
                              regex=True)
                     .str.extract(r'(\d{4})M?(\d+)?', expand=True))

df['Date'] = (pd.to_datetime(df.pop('y'), format='%Y', errors='coerce')
                .values.astype('M8[M]')
              + (pd.to_numeric(df.pop('mm'), errors='coerce')
                   .fillna(1).astype(int).values - 1) * np.timedelta64(1, 'M')
              ).astype('M8[D]')
Output:
Frequency Time Period Date
0 0 2008 2008-01-01
1 0 1961 1961-01-01
2 2 2009Q1 2009-01-01
3 2 1975Q4 1975-10-01
4 2 2007Q3 2007-07-01
5 2 1959Q4 1959-10-01
6 2 1965Q4 1965-10-01
7 2 2008Q3 2008-07-01
8 3 1969M2 1969-02-01
9 3 1994M12 1994-12-01
10 3 1990M1 1990-01-01
11 3 1994M10 1994-10-01
12 3 2012M11 2012-11-01
13 3 1994M3 1994-03-01
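As an alternative, if a row-wise (non-vectorized) conversion is acceptable, a simpler sketch using pd.Period for the quarterly values could look like this (assuming the column names shown above):
import pandas as pd

def to_date(tp):
    # annual: '2008' -> 2008-01-01
    if tp.isdigit():
        return pd.to_datetime(tp, format='%Y')
    # quarterly: '2009Q1' -> 2009-01-01 (start of the quarter)
    if 'Q' in tp:
        return pd.Period(tp, freq='Q').to_timestamp()
    # monthly: '1969M2' -> 1969-02-01
    return pd.to_datetime(tp, format='%YM%m')

df['Date'] = df['Time Period'].map(to_date)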
I have a dataframe called houses:
transaction_id house_id date_sale sale_price boolean_2015 \
0 1 1 31 Mar 2016 £880,000 True
3 4 2 31 Mar 2016 £450,000 True
4 5 3 31 Mar 2016 £680,000 True
6 7 4 31 Mar 2016 £1,850,000 True
postcode
0 EC2Y
3 EC2Y
4 EC1Y
6 EC2Y
and I was wondering how to compute the average sale_price for each postcode,
so the output is:
Average
0 EC1Y £123220
1 EC2Y £434930
I tried averages = data.groupby(['postcode'], as_index=False).mean(),
but this did not return sale_price.
Any thoughts?
You can first replace £ and , with an empty string and then convert the sale_price column with to_numeric. Last, cast back to string with astype if you need to add £ to sale_price:
data.sale_price = pd.to_numeric(data.sale_price.str.replace('[£,]', '', regex=True))
averages = data.groupby(['postcode'], as_index=False)['sale_price'].mean()
averages.sale_price = '£' + averages.sale_price.astype(str)
print (averages)
postcode sale_price
0 EC1Y £680000
1 EC2Y £1060000
I am a somewhat beginner programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, and get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
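Applied to the toy data from the edit above, the same two-step idea reproduces the expected means (4, 1 and 2.5) — a quick sketch, where toy is a hypothetical name for a frame built from that sample:
import pandas as pd

# the toy data shown above
toy = pd.DataFrame({
    'Date': ['12/12/2014']*5 + ['19/12/2014']*3 + ['26/12/2014'] + ['27/12/2014']*4 + ['04/01/2015'],
    'Id':   [1234]*14,
    'Dow':  [0]*9 + [1]*5,
    'Hour': [9]*8 + [10] + [11]*5,
    'Count': [1]*14,
})

# sum of counts per Id / Date / Dow / Hour (one row per observed day)
daily = toy.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()

# mean over the observed dates per Id / Dow / Hour
mean_per_slot = daily.groupby(['Id', 'Dow', 'Hour'], as_index=False)['Count'].mean()
print (mean_per_slot)   # -> hour 9: 4.0, hour 10: 1.0, hour 11: 2.5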
You can also use the groupby function on the 'Id' column and then use the resample function, aggregating the counts per hour with sum.
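A minimal sketch of that idea, assuming Start_date is a datetime column and Count holds the per-ticket counts as in the sample above (note that resample also creates zero rows for hours with no tickets, which affects the mean):
import pandas as pd

# hourly ticket totals per Id
hourly = (df.set_index('Start_date')
            .groupby('Id')['Count']
            .resample('H')
            .sum()
            .reset_index())

# mean per Id / day-of-week / hour over all resampled hours
mean_per_hour = (hourly
                 .groupby(['Id',
                           hourly['Start_date'].dt.dayofweek.rename('dow'),
                           hourly['Start_date'].dt.hour.rename('hour')])['Count']
                 .mean())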