I have a pandas DataFrame that looks like this:
from datetime import date
import pandas as pd
df = pd.DataFrame({'a':['cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust3'],
'date':[date(2017, 12, 15), date(2018, 12, 20), date(2020, 1, 10), date(2017, 12, 15), date(2018, 12, 10), date(2017, 1, 5), date(2018, 1, 15), date(2019, 2, 20)],
'c':[5, 6, 7, 4, 8, 6, 5, 9]})
a date c
0 cust1 2017-12-15 5
1 cust1 2018-12-20 6
2 cust1 2020-01-10 7
3 cust2 2017-12-15 4
4 cust2 2018-12-10 8
5 cust3 2017-01-05 6
6 cust3 2018-01-15 5
7 cust3 2019-02-20 9
'a' = customer
'date' = date when customer paid
'c' = amount customer paid
I need to check whether each customer paid in each year. For customers who historically paid in December but in later years paid in January, I would like to change the January date to a December date. Looking at cust1: historically she paid in December, but she missed the December 2019 payment and paid in January 2020 instead. I would like to move that date to the same day in December of the prior year.
Note: my DataFrame has thousands of rows, with more customers and payment dates throughout the year, but I want to apply this rule only where payments were historically made in December and are made in January in later years.
My resulting DataFrame should look like this:
a date c
0 cust1 2017-12-15 5
1 cust1 2018-12-20 6
2 cust1 2019-12-10 7
3 cust2 2017-12-15 4
4 cust2 2018-12-10 8
5 cust3 2017-01-05 6
6 cust3 2018-01-15 5
7 cust3 2019-02-20 9
EDIT
My DataFrame is slightly more complex than initially described above: a customer can make several payments in any one year.
a date c
0 cust1 2017-06-15 5
1 cust1 2017-12-15 5
2 cust1 2018-06-15 6
3 cust1 2019-01-20 6
4 cust1 2019-06-15 7
5 cust1 2020-01-10 7
6 cust1 2020-06-12 8
7 cust2 2017-12-15 4
8 cust2 2018-12-10 8
9 cust3 2017-01-05 6
10 cust3 2018-01-15 5
11 cust3 2019-02-20 9
Looking at cust1, she always makes two payments per year, but the December 2018 payment was only made in January 2019. I would like to adjust a January date to a December date if the prior year's payment was made in December, and likewise for any subsequent year in which there is a January payment.
So my resulting DataFrame should look like this:
a date c newDate
0 cust1 2017-06-15 5 2017-06-15
1 cust1 2017-12-15 5 2017-12-15
2 cust1 2018-06-15 6 2018-06-15
3 cust1 2019-01-20 6 2018-12-20
4 cust1 2019-06-15 7 2019-06-15
5 cust1 2020-01-10 7 2019-12-10
6 cust1 2020-06-12 8 2020-06-12
7 cust2 2017-12-15 4 2017-12-15
8 cust2 2018-12-10 8 2018-12-10
9 cust3 2017-01-05 6 2017-01-05
10 cust3 2018-01-15 5 2018-01-15
11 cust3 2019-02-20 9 2019-02-20
I tried the following incorporating some of the suggestions below:
from datetime import date
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':['cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust3'],
'date':[date(2017, 6, 15), date(2017, 12, 15), date(2018, 6, 15), date(2019, 1, 20), date(2019, 6, 15), date(2020, 1, 10), date(2020, 6, 12), date(2017, 12, 15), date(2018, 12, 10), date(2017, 1, 5), date(2018, 1, 15), date(2019, 2, 20)],
'c':[5, 5, 6, 6, 7, 7, 8, 4, 8, 6, 5, 9]})
df['date'] = pd.to_datetime(df['date'], errors='coerce')
year_end_month = [12, 1]  # assumption: not defined in the original post; December and January
df_2 = df.loc[df['date'].dt.month.isin(year_end_month)].copy()  # December/January payments only
df_3 = pd.concat([df, df_2]).drop_duplicates(keep=False)  # all remaining payments
s = df_2.groupby('a').date.shift().dt.month  # month of the previous Dec/Jan payment per customer
df_2['newDate'] = np.where(s.eq(12) & df_2.date.dt.month.eq(1),
                           df_2.date - pd.DateOffset(months=1), df_2.date)
df_4 = pd.concat([df_2, df_3])
df_4.newDate = df_4.newDate.fillna(df_4.date)
df_4.sort_values(by=['a', 'date'])
The problem with the above approach is that it works the first time the payment date moves from December to January, but not for subsequent years. Looking at cust1, the first time she switched from December to January was December 2018 to January 2019, and my approach captures this. But it fails to move her 2019 payment, which she made in January 2020, to December 2019. Any idea how this can be solved?
Use groupby with shift to find the rows that need fixing, then apply np.where:
s=df.groupby('a').date.shift().dt.month
df['date']=np.where(s.eq(12) & df.date.dt.month.eq(1), df.date-pd.DateOffset(months=1), df.date)
df
a date c
0 cust1 2017-12-15 5
1 cust1 2018-12-20 6
2 cust1 2019-12-10 7
3 cust2 2017-12-15 4
4 cust2 2018-12-10 8
5 cust3 2017-01-05 6
6 cust3 2018-01-15 5
7 cust3 2019-02-20 9
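For the multi-payment case described in the EDIT, one possible sketch is to flag, per customer, every row that comes after a December payment and then move the January rows among them back by one month. This is only one reading of the rule ("once a customer has paid in December, later January payments count as the prior December"), and the names seen_december and move are hypothetical helpers, not part of the original post. It would also move a January payment for a customer who genuinely pays in both December and January, so it may need tightening for real data.

```python
import pandas as pd

# Sample data from the EDIT section of the question.
df = pd.DataFrame({
    "a": ["cust1"] * 7 + ["cust2"] * 2 + ["cust3"] * 3,
    "date": pd.to_datetime([
        "2017-06-15", "2017-12-15", "2018-06-15", "2019-01-20",
        "2019-06-15", "2020-01-10", "2020-06-12",
        "2017-12-15", "2018-12-10",
        "2017-01-05", "2018-01-15", "2019-02-20",
    ]),
    "c": [5, 5, 6, 6, 7, 7, 8, 4, 8, 6, 5, 9],
})

# Payments must be in chronological order within each customer.
df = df.sort_values(["a", "date"]).reset_index(drop=True)
month = df["date"].dt.month

# True from the customer's first December payment onwards (assumed rule).
seen_december = month.eq(12).astype(int).groupby(df["a"]).cummax().astype(bool)

# January payments made after a December payment are moved back one month.
move = seen_december & month.eq(1)
df["newDate"] = df["date"].where(~move, df["date"] - pd.DateOffset(months=1))

print(df)
```

Because the flag is cumulative rather than based on the immediately preceding row, the second January payment (2020-01-10) is moved as well, which is the case the shift-based attempt missed.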
I have two dataframes as follows
df1
Location  Month  Date   Ratio
A         June   Jun 1  0.2
A         June   Jun 2  0.3
A         June   Jun 3  0.4
B         June   Jun 1  0.6
B         June   Jun 2  0.7
B         June   Jun 3  0.8
And df2
Location  Month  Value
A         June   1000
B         June   2000
The result should be:
df3
Location  Month  Date   Value
A         June   Jun 1  200
A         June   Jun 2  300
A         June   Jun 3  400
B         June   Jun 1  1200
B         June   Jun 2  1400
B         June   Jun 3  1600
How do I go about doing this? Division works without problems, since pandas does a great job of matching indices, but with multiplication the result is all over the place.
Thanks.
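You can use df.merge and df.assign (here df is the frame holding Ratio, called df1 in the question, and df1 is the frame holding Value, called df2 in the question):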
df.assign(Value = df.merge(df1,how='inner',on=['Location','Month'])['Value'].\
mul(df['Ratio']))
#or
# df = df.merge(df1,how='inner',on=['Location','Month'])
# df['Value']*=df['Ratio']
  Location Month   Date  Ratio   Value
0        A  June  Jun 1    0.2   200.0
1        A  June  Jun 2    0.3   300.0
2        A  June  Jun 3    0.4   400.0
3        B  June  Jun 1    0.6  1200.0
4        B  June  Jun 2    0.7  1400.0
5        B  June  Jun 3    0.8  1600.0
Or, using df.set_index:
df.set_index(['Location','Month'],inplace=True)
df1.set_index(['Location','Month'],inplace=True)
df['Value'] = df['Ratio']*df1['Value']
If I understand correctly, and Location is the index of both DataFrames, you can use pandas.Series.mul:
df1["Value"] = df1.Ratio.mul(df2.Value)
df1
         Month   Date  Ratio   Value
Location
A         June  Jun 1    0.2   200.0
A         June  Jun 2    0.3   300.0
A         June  Jun 3    0.4   400.0
B         June  Jun 1    0.6  1200.0
B         June  Jun 2    0.7  1400.0
B         June  Jun 3    0.8  1600.0
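As a self-contained sketch of the merge approach using the question's own names (df1 for the ratios, df2 for the values), the version below also asks merge to check that each (Location, Month) pair occurs at most once in df2. That check uses the validate argument of DataFrame.merge; whether it is wanted depends on the real data, so treat it as an assumption rather than a requirement.

```python
import pandas as pd

# Frames rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "B"],
    "Month": ["June"] * 6,
    "Date": ["Jun 1", "Jun 2", "Jun 3", "Jun 1", "Jun 2", "Jun 3"],
    "Ratio": [0.2, 0.3, 0.4, 0.6, 0.7, 0.8],
})
df2 = pd.DataFrame({
    "Location": ["A", "B"],
    "Month": ["June", "June"],
    "Value": [1000, 2000],
})

# Bring the monthly Value onto every daily row, then scale it by Ratio.
# validate="many_to_one" raises if df2 has duplicate (Location, Month) keys.
df3 = df1.merge(df2, on=["Location", "Month"], how="left", validate="many_to_one")
df3["Value"] = df3["Ratio"] * df3["Value"]
df3 = df3.drop(columns="Ratio")

print(df3)
```

Merging first puts both operands in the same row order, so the multiplication no longer depends on how the two original indices happen to line up.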
I have the following dataframe df:
     Date  number
0  AUG 17     1.0
1  AUG 17     1.6
2  FEB 18     1.0
3  MAR 18     1.7
4  APR 18     6.0
5  Jan 19     1.0
6  Apr 19     2.0
7  Jun 19     7.1
8  Jan 20     5.5
9  Feb 20     8.6
I would like to convert the Date column to a date type holding the last business day (Monday to Friday) of the month, so that I get the following output:
Date number
0 2017-08-31 1.0
1 2017-08-31 1.6
2 2018-02-28 1.0
3 2018-03-30 1.7
4 2018-04-30 6.0
5 2019-01-31 1.0
6 2019-04-30 2.0
7 2019-06-28 7.1
8 2020-01-31 5.5
9 2020-02-28 8.6
NOTICE that some of my months are in CAPS.
I tried:
date = [datetime.datetime.strptime(x,'%b%Y').date() for x in df['Date']]
But it keeps giving me a matching error; I assume it is because some months are in CAPS.
Is this what you are looking for? Use the capitalize method (also available in pandas as .str.capitalize) to parse the dates, then add an offset from pd.offsets to get the last business day of the month:
import pandas as pd
# example df:
df = pd.DataFrame({'Date': ['AUG 17', 'aug 17', 'FEB 18', 'MAR 18'],
'number': [1, 1.6, 1, 1.7]})
# convert to datetime after capitalizing the month name, add offset so you can get last business day of month
df['Date'] = (pd.to_datetime(df['Date'].str.capitalize(), format='%b %y') +
pd.offsets.BMonthEnd(1))
# df
# Date number
# 0 2017-08-31 1.0
# 1 2017-08-31 1.6
# 2 2018-02-28 1.0
# 3 2018-03-30 1.7
I figured out that my mistake was in the format string: the year is abbreviated to two digits, so it should be %y instead of %Y, and there must be a space between the month and the year: %b %y
So to achieve the output I wanted:
import pandas as pd
import datetime
# convert the string dates into date type
df['Date'] = [datetime.datetime.strptime(x,'%b %y').date() for x in df['Date']]
# move each date to the last business day (Monday-Friday) of its month
df = df.assign(Date=df['Date'] + pd.offsets.BMonthEnd(1))
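As a small check of the point made in the self-answer, the sketch below parses upper-case and mixed-case month abbreviations with the same '%b %y' format and then rolls each date forward to the last business day of its month. It assumes an English (or C) locale for %b, and the names raw, parsed and last_bday are hypothetical. Python's strptime matches month names case-insensitively, which is why 'AUG 17' parses once the format itself is right.

```python
from datetime import datetime

import pandas as pd

# A few values from the question, in mixed case.
raw = ["AUG 17", "FEB 18", "MAR 18", "Jun 19", "Feb 20"]

# strptime ignores case for %b, so no capitalising step is needed here.
parsed = pd.Series([datetime.strptime(s, "%b %y") for s in raw])

# Each parsed date is the 1st of the month; BMonthEnd(1) rolls it forward
# to the last business day (Monday-Friday) of that same month.
last_bday = parsed + pd.offsets.BMonthEnd(1)

print(last_bday)
```

March 2018, June 2019 and February 2020 all end on a weekend, so those rows land on the preceding Friday.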
I am using a calendar data set for price prediction for different houses, with a date feature that covers all 365 days of the year. I would like to reduce the data set by taking the average monthly price of each listing in a new column.
input data:
listing_id date price months
1 2020-01-08 75.0 Jan
1 2020-01-09 100.0 Jan
1 2020-02-08 350.0 Feb
2 2020-01-08 465.0 Jan
2 2020-02-08 250.0 Feb
2 2020-02-09 250.0 Feb
Output data:
listing_id date Avg_price months
1 2020-01-08 90.0 Jan
1 2020-02-08 100.0 Feb
2 2020-01-08 50.0 Jan
2 2020-02-08 150.0 Feb
You can get the average price for each month using groupby:
g = df.groupby("months")["price"].mean()
You can then create new columns:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for month, avg in g.items():
    df["average_{}".format(month)] = avg
Example with dummy data:
import pandas as pd
df = pd.DataFrame({'months':['Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar'],
'price':[1, 2, 3, 4, 5, 6]})
Result:
months price average_Feb average_Jan average_Mar
0 Jan 1 2.5 1.0 5.0
1 Feb 2 2.5 1.0 5.0
2 Feb 3 2.5 1.0 5.0
3 Mar 4 2.5 1.0 5.0
4 Mar 5 2.5 1.0 5.0
5 Mar 6 2.5 1.0 5.0
I upvoted Dan's answer.
It may help to see another way to do this.
Additionally, if you ever have data that spans multiple years you may want a month_year column instead.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html
Example:
df = pd.DataFrame({'price':[i for i in range(121)]},
index=pd.date_range(start='12/1/2017',end='3/31/2018'))
df = df.reset_index()
df['month_year'] = (df['index'].dt.month_name() + " " +
                    df['index'].dt.year.astype(str))
df.pivot_table(values='price',columns='month_year')
Result:
In [39]: df.pivot_table(values='price',columns='month_year')
Out[39]:
month_year December 2017 February 2018 January 2018 March 2018
price 15.0 75.5 46.0 105.0
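Since the question asks for one averaged row per listing and month, a possible sketch is to group on both listing_id and months and use named aggregation. The frame is rebuilt from the question's input table; the name monthly is hypothetical, and keeping the first date of each group is an assumption about what the date column should hold in the reduced data. The averages are computed from the input rows only, so they are not expected to match the illustrative figures in the question's output table.

```python
import pandas as pd

# Input data as shown in the question.
df = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2020-01-08", "2020-01-09", "2020-02-08",
        "2020-01-08", "2020-02-08", "2020-02-09",
    ]),
    "price": [75.0, 100.0, 350.0, 465.0, 250.0, 250.0],
    "months": ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
})

# One row per (listing, month): first date in the group and mean price.
# sort=False keeps groups in order of first appearance (Jan before Feb).
monthly = (
    df.groupby(["listing_id", "months"], sort=False)
      .agg(date=("date", "first"), Avg_price=("price", "mean"))
      .reset_index()
)

print(monthly)
```

If the data spans several years, grouping on a year-month key (for example df["date"].dt.to_period("M")) instead of the month name would keep January 2020 and January 2021 apart.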
I've created a pandas DataFrame from an external source using the read_html method. Creating the DataFrame is no problem, but I'm stuck trying to adjust the structure of the first column, 'Month'.
The data I'm scraping is updated once a month at the source, therefore, the solution requires a dynamic approach. So far I've only been able to achieve the desired outcome using .iloc to manually update each row, which works fine until the data is updated at source next month.
This is what my dataframe looks like:
df = pd.read_html(url)[0]  # read_html returns a list of DataFrames; the first table is assumed here
df
Month Value
0 2017 NaN
1 November 1.29
2 December 1.29
3 2018 NaN
4 January 1.29
5 February 1.29
6 March 1.29
7 April 1.29
8 May 1.29
9 June 1.28
10 July 1.28
11 August 1.28
12 September 1.28
13 October 1.26
14 November 1.16
15 December 1.09
16 2019 NaN
17 January 1.25
18 February 1.34
19 March 1.34
20 April 1.34
This is my desired outcome:
df
             Month  Value
 1   November 2017   1.29
 2   December 2017   1.29
 4    January 2018   1.29
 5   February 2018   1.29
 6      March 2018   1.29
 7      April 2018   1.29
 8        May 2018   1.29
 9       June 2018   1.28
10       July 2018   1.28
11     August 2018   1.28
12  September 2018   1.28
13    October 2018   1.26
14   November 2018   1.16
15   December 2018   1.09
17    January 2019   1.25
18   February 2019   1.34
19      March 2019   1.34
20      April 2019   1.34
Right now the best idea I've come up with is to select, extract and append the year to each row in the 'Month' column until 'December' is reached, then switch to the next year, but I have no idea how to implement this in code. Would this be a viable solution (and how could it be implemented), or is there a better way?
Many thanks from a long time reader and first time poster of stackoverflow!
Use ffill keyed on Value: the rows where Value is NaN hold the year, so keep Month only on those rows, forward-fill it, and append the result to each month name. Then drop the year rows:
df.Month=df.Month+' '+df.Month.where(df.Value.isna()).ffill().astype(str)
df.dropna(inplace=True)
df
Out[29]:
             Month  Value
 1   November 2017   1.29
 2   December 2017   1.29
 4    January 2018   1.29
 5   February 2018   1.29
 6      March 2018   1.29
 7      April 2018   1.29
 8        May 2018   1.29
 9       June 2018   1.28
10       July 2018   1.28
11     August 2018   1.28
12  September 2018   1.28
13    October 2018   1.26
14   November 2018   1.16
15   December 2018   1.09
17    January 2019   1.25
18   February 2019   1.34
19      March 2019   1.34
20      April 2019   1.34
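A variant of the same forward-fill idea, sketched on a shortened copy of the question's data, is to detect the year rows from the Month text itself rather than from a NaN in Value. That may be preferable if a real month could ever have a missing Value, which is an assumption about the source rather than something stated in the question. The names is_year, year and out are hypothetical.

```python
import numpy as np
import pandas as pd

# Shortened sample in the same shape as the scraped table.
df = pd.DataFrame({
    "Month": ["2017", "November", "December", "2018", "January", "February"],
    "Value": [np.nan, 1.29, 1.29, np.nan, 1.29, 1.29],
})

# Rows whose Month is all digits are the year header rows.
is_year = df["Month"].str.isdigit()

# Keep the year only on header rows, then carry it down to the months below.
year = df["Month"].where(is_year).ffill()

# Drop the header rows and append the carried-down year to each month name.
out = df.loc[~is_year].copy()
out["Month"] = out["Month"] + " " + year[~is_year]

print(out)
```

Because nothing here depends on the number of rows, the same code keeps working when the source adds a new month or a new year header.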