I have a dataframe that looks like:
id email domain created_at company
0 1 son#mail.com old.com 2017-01-21 18:19:00 company_a
1 2 boy#mail.com new.com 2017-01-22 01:19:00 company_b
2 3 girl#mail.com nadda.com 2017-01-22 01:19:00 no_company
I need to summarize the data by year, month, and whether the company column has a value other than "no_company":
Desired output:
year month company count
2017 1 has_company 2
no_company 1
The following works, but gives me the count for each individual value in the company column:
new_df = test_df['created_at'].groupby([test_df.created_at.dt.year, test_df.created_at.dt.month, test_df.company]).agg('count')
print(new_df)
result:
year month company
2017 1 company_a 1
company_b 1
no_company 1
Map a new series for has_company/no_company then groupby:
c = df.company.map(lambda x: x if x == 'no_company' else 'has_company')
y = df.created_at.dt.year.rename('year')
m = df.created_at.dt.month.rename('month')
df.groupby([y, m, c]).size()
year month company
2017 1 has_company 2
no_company 1
dtype: int64
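The same has_company/no_company series can also be built with numpy.where instead of map — just an equivalent variant of the mapping step in the answer above:
import numpy as np

# anything other than 'no_company' collapses into 'has_company'
c = pd.Series(np.where(df.company.eq('no_company'), 'no_company', 'has_company'),
              index=df.index, name='company')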
Related
I have a meteorological data set with daily precipitation values for 120 years. I would like to prepare this in such a way that I have monthly average values for 4 climate periods at the end. Example: Average precipitation January, February, March, ... for period 1981 - 2010, average precipitation January, February, March, ... for period 2011 - 2040 and so on.
Data set looks like this (is available as csv file, read in as pandas dataframe):
year month day lon lat value
0 1981 1 1 0 0 0.522592
1 1981 1 2 0 0 2.692495
2 1981 1 3 0 0 0.556698
3 1981 1 4 0 0 0.000000
4 1981 1 5 0 0 0.000000
... ... ... ... ... ... ...
43824 2100 12 27 0 0 0.000000
43825 2100 12 28 0 0 0.185120
43826 2100 12 29 0 0 10.252080
43827 2100 12 30 0 0 13.389290
43828 2100 12 31 0 0 3.523566
Here is my code so far:
csv_path = r'filepath.csv'
df = pd.read_csv(csv_path, delimiter = ';')
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
years = pd.date_range('1981-01-01', periods = 6, freq = '30YS').strftime('%Y')
labels = [f'{a}-{b}' for a, b in zip(years, years[1:])]
(df
 .assign(period = pd.cut(df['year'], bins = years.astype(int), labels = labels, right = False))
 .groupby(df[['year', 'month']].dt.to_period('M'))
 .agg({'period': 'first', 'value': 'sum'})
 .groupby('period')['value'].mean())
The best way is probably to write a loop that iterates over all months and the 4 30-year periods, but unfortunately I can't get this to work. Does anyone have any tips?
Expected Output:
Month Average
0 January 20
1 February 21
2 March 19
3 April 18
To get the total value per month and then the average of those totals over each 30-year period, you need a double groupby:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
years = pd.date_range('1981-01-01', periods=6, freq='30YS').strftime('%Y')
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
(df
.assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
.groupby(df['date'].dt.to_period('M')).agg({'period':'first', 'value': 'sum'})
.groupby('period')['value'].mean()
)
output:
period
1981-2011 3.771785
2011-2041 NaN
2041-2071 NaN
2071-2101 27.350056
2101-2131 NaN
Name: value, dtype: float64
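The NaN rows appear because the sample data only covers 1981 and 2100, so the intermediate periods have no values. If you also want the average per calendar month within each period (closer to the expected Month/Average output), a sketch along the same lines — not part of the original answer — is to group on period, year and month first:
(df
 .assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
 .groupby(['period', 'year', 'month'], observed=True)['value'].sum()   # total per month
 .groupby(level=['period', 'month']).mean()                            # average of those totals per month within each period
)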
older answer
The expected output is not fully clear, but if you want average precipitation per quarter per year:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.to_period('Q')
df.groupby('quarter')['value'].mean()
output:
quarter
1981Q1 0.754357
2100Q4 5.470011
Freq: Q-DEC, Name: value, dtype: float64
or per quarter globally:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.quarter
df.groupby('quarter')['value'].mean()
output:
quarter
1 0.754357
4 5.470011
Name: value, dtype: float64
NB. you can do the same for other periods. For months use to_period('M') / .dt.month
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['period'] = df['date'].dt.to_period('M')
df.groupby('period')['value'].mean()
output:
period
1981-01 0.754357
2100-12 5.470011
Freq: M, Name: value, dtype: float64
My df looks like this.
Policy_No Date
1 10/1/2020
2 20/2/2020
3 20/2/2020
4 23/3/2020
5 18/4/2020
6 30/4/2020
7 30/4/2020
I would like to create a cumulative counter of policies logged on different dates, based on the financial year (April-March):
Date Cum count of policies
10/1/2020 1
20/2/2020 3
23/3/2020 4
18/4/2020 1
30/4/2020 3
18th April 2020 falls in a new financial year, so the counter starts from 0.
Can someone help solve this?
There's a function called cumsum which, together with groupby, does that:
df = pd.DataFrame({"Policy_No":[1,2,3,4,5,6,7],"Date":["10/1/2020","20/2/2020","20/2/2020","23/3/2020","18/4/2020","30/4/2020","30/4/2020"]})
print(df)
#0 1 10/1/2020
#1 2 20/2/2020
#2 3 20/2/2020
#3 4 23/3/2020
#4 5 18/4/2020
#5 6 30/4/2020
#6 7 30/4/2020
df.groupby("Date")["Policy_No"].count().cumsum()
#Date
#10/1/2020 1
#18/4/2020 2
#20/2/2020 4
#23/3/2020 5
#30/4/2020 7
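Note that with Date kept as strings, the index above is sorted lexicographically (10/1, 18/4, 20/2, ...), so the running total does not follow the calendar. Parsing the dates first — a small sketch assuming the day-first format shown — keeps the order chronological:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)   # parse d/m/yyyy strings
df.groupby("Date")["Policy_No"].count().cumsum()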
If you want to do it for each financial year, I think you'll need to create a dataframe for each financial year, apply the above logic, and concat them at the end:
df = ... #dataframe
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)  # compare real dates, not strings
year_2020 = (pd.to_datetime("2020-04-01") <= df["Date"]) & (df["Date"] < pd.to_datetime("2021-04-01"))
df_2020 = df.loc[year_2020].groupby("Date")["Policy_No"].count().cumsum()
year_2021 = (pd.to_datetime("2021-04-01") <= df["Date"]) & (df["Date"] < pd.to_datetime("2022-04-01"))
df_2021 = df.loc[year_2021].groupby("Date")["Policy_No"].count().cumsum()
#concat at the end
df_total = pd.concat((df_2020, df_2021))
Of course, if you cannot write out the year logic by hand (because there are too many years), you can place it in a loop like:
def get_financial_dates():
    """
    Some function that returns the start and end
    of each financial year
    """
    return date_start, date_end

df_total = pd.DataFrame()  # initial dataframe
for date_start, date_end in get_financial_dates():
    idx = (date_start <= df["Date"]) & (df["Date"] < date_end)
    df_temp = df.loc[idx].groupby("Date")["Policy_No"].count().cumsum()
    # concat at the end
    df_total = pd.concat((df_total, df_temp))
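An alternative to looping over the years — just a sketch on the sample data, not part of the original answer — is to derive a financial-year key with to_period('A-MAR') (annual periods ending in March) and take the cumulative sum within each of those groups:
df = pd.DataFrame({"Policy_No": [1, 2, 3, 4, 5, 6, 7],
                   "Date": ["10/1/2020", "20/2/2020", "20/2/2020",
                            "23/3/2020", "18/4/2020", "30/4/2020", "30/4/2020"]})
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# financial year April-March: 'A-MAR' labels each period by the calendar year it ends in
df["fin_year"] = df["Date"].dt.to_period("A-MAR")

# count policies per date, then take a cumulative sum that restarts each financial year
out = (df.groupby(["fin_year", "Date"])["Policy_No"].count()
         .groupby(level="fin_year").cumsum())
print(out)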
I have a dataframe like the following:
df.head(4)
timestamp user_id category
0 2017-09-23 15:00:00+00:00 A Bar
1 2017-09-14 18:00:00+00:00 B Restaurant
2 2017-09-30 00:00:00+00:00 B Museum
3 2017-09-11 17:00:00+00:00 C Museum
I would like to count, for each hour, the number of visitors in each category, and get a dataframe like the following
df
year month day hour category count
0 2017 9 11 0 Bar 2
1 2017 9 11 1 Bar 1
2 2017 9 11 2 Bar 0
3 2017 9 11 3 Bar 1
Assuming you want to group by date and hour, you can use the following code, provided the timestamp column is a datetime column:
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
df['day'] = df.timestamp.dt.day
df['hour'] = df.timestamp.dt.hour
grouped_data = df.groupby(['year','month','day','hour','category']).count()
To get the count of user_id per hour per category, you can use groupby with your datetime:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df_new = df.groupby([df.timestamp.dt.year,
df.timestamp.dt.month,
df.timestamp.dt.day,
df.timestamp.dt.hour,
'category']).count()['user_id']
df_new.index.names = ['year', 'month', 'day', 'hour', 'category']
df_new = df_new.reset_index()
When you have a datetime column in a dataframe, you can use the dt accessor, which gives you access to the different parts of the datetime, e.g. the year.
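Neither answer fills in the hours where a category had zero visitors (such as the 2017 9 11 2 Bar 0 row in the expected output). If you need those rows, one possible approach — a sketch, assuming timestamp is already a datetime — is to count per hour, unstack the categories and reindex over the full hourly range:
# count visitors per hour per category
counts = (df.groupby([df['timestamp'].dt.floor('H'), 'category'])
            .size()
            .unstack('category', fill_value=0))

# reindex over every hour between the first and last timestamp so missing hours show 0
full_hours = pd.date_range(counts.index.min(), counts.index.max(), freq='H')
counts = counts.reindex(full_hours, fill_value=0)
From there you can stack back and pull year/month/day/hour out with the dt accessor to get the exact layout in the question.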
My input dataframe is something like this:
Here, for every company we can have multiple salesids, and each salesid has a unique create date.
CompanyName Salesid Create Date
ABC 1 1-1-2020
ABC 22 4-1-2020
ABC 3 15-1-2020
ABC 4 10-1-2020
XYZ 34 19-2-2020
XYZ 56 23-2-2020
XYZ 23 11-2-2020
XYZ 87 27-2-2020
XYZ 101 5-2-2020
I want to calculate the mean createdate gap for each company:
I am expecting an output in this format:
Name Mean_createdate_gap
ABC 4.66
XYZ 5.5
explanation:
ABC => (3+6+5)/3 = 4.66 (differences between consecutive dates, sorted)
XYZ => (6+8+4+4)/4 = 5.5
For this, we may first need to sort the data, followed by grouping by CompanyName. I am not sure how I am supposed to implement it.
Here you go:
df['Create Date'] = pd.to_datetime(df['Create Date'], format='%d-%m-%Y')
res = df.sort_values(by='Create Date')\
.groupby('CompanyName', sort=False)['Create Date']\
.agg(lambda cd : cd.diff().map(lambda dt: dt.days).mean()).reset_index()\
.rename(columns={'CompanyName': 'Name', 'Create Date': 'Mean_createdate_gap'})
print(res)
Output
Name Mean_createdate_gap
0 ABC 4.666667
1 XYZ 5.500000
Convert the Create column to datetime
df['Create'] = pd.to_datetime(df['Create'], format='%d-%m-%Y')
Sort by this column
df = df.sort_values(by=['Create'])
Do a groupby aggregation taking the mean of the consecutive differences
df.groupby('CompanyName')['Create'].agg(lambda x: x.diff().abs().mean())
CompanyName
ABC 4 days 16:00:00
XYZ 5 days 12:00:00
Name: Create, dtype: timedelta64[ns]
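If you want the gap as a plain number of days (4.67, 5.5) rather than a timedelta, one option is to divide the result by a one-day Timedelta:
df.groupby('CompanyName')['Create'].agg(lambda x: x.diff().abs().mean() / pd.Timedelta(days=1))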
I have a data set of records of the same categories. I want to compare the two date columns within each category.
I want to see whether DATE1 is greater than the DATE2 values of the same CATEGORY, and find the earliest DATE2 it is greater than.
I'm trying this, but I'm not getting the results I am looking for:
df['test'] = np.where(m['DATE1'] < df['DATE2'], Y, N)
CATEGORY DATE1 DATE2 GREATERTHAN GREATERDATE
0 23 2015-01-18 2015-01-15 Y 2015-01-10
1 11 2015-02-18 2015-02-19 N 0
2 23 2015-03-18 2015-01-10 Y 2015-01-10
3 11 2015-04-18 2015-08-18 Y 2015-02-19
4 23 2015-05-18 2015-02-21 Y 2015-01-10
5 11 2015-06-18 2015-08-18 Y 2015-02-19
6 15 2015-07-18 2015-02-18 0 0
df['DATE1'] = pd.to_datetime(df['DATE1'])
df['DATE2'] = pd.to_datetime(df['DATE2'])
df['GREATERTHAN'] = np.where(df['DATE1'] > df['DATE2'], 'Y', 'N')
## Getting the earliest date for which data is available, per category
earliest_dates = df.groupby(['CATEGORY']).apply(lambda x: pd.concat([x['DATE1'], x['DATE2']]).min()).to_frame()
## Merging to get the earliest date column per category
df.merge(earliest_dates, left_on = 'CATEGORY', right_on = earliest_dates.index, how = 'left')
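The merged column comes back named 0 (earliest_dates was built from an unnamed Series), so giving it a name first makes the result easier to read — a small variant using right_index, where EARLIESTDATE is just an illustrative name:
earliest_dates.columns = ['EARLIESTDATE']
df = df.merge(earliest_dates, left_on='CATEGORY', right_index=True, how='left')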