pandas merge on date range - python

I have two dataframes,
df = pd.DataFrame({'Date': ['2011-01-02', '2011-04-10', '2015-02-02', '2016-03-03'], 'Price': [100, 200, 300, 400]})
df2 = pd.DataFrame({'Date': ['2011-01-01', '2014-01-01'], 'Revenue': [14, 128]})
I want to add df2.Revenue to df to produce the table below, using both Date columns for reference.
Date Price Revenue
2011-01-02 100 14
2011-04-10 200 14
2015-02-02 300 128
2016-03-03 400 128
As shown above, the revenue is assigned by matching df.Date against df2.Date.

Use merge_asof:
df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df3 = pd.merge_asof(df, df2, on='Date')
print (df3)
Date Price Revenue
0 2011-01-02 100 14
1 2011-04-10 200 14
2 2015-02-02 300 128
3 2016-03-03 400 128
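For context, merge_asof does a nearest-key match rather than an exact one: with the default direction='backward', each row of df takes the most recent df2 row whose key is less than or equal to its own, and both frames must be sorted by the key. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2011-01-02', '2011-04-10',
                                           '2015-02-02', '2016-03-03']),
                   'Price': [100, 200, 300, 400]})
df2 = pd.DataFrame({'Date': pd.to_datetime(['2011-01-01', '2014-01-01']),
                    'Revenue': [14, 128]})

# direction='backward' (the default): each df.Date gets the most recent
# df2.Date that is <= it; both frames are already sorted by Date.
df3 = pd.merge_asof(df, df2, on='Date', direction='backward')
```

Passing direction='forward' or direction='nearest' instead changes which side of the key the match comes from.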

Related

How to convert monthly data to yearly data with the value as the mean average over the 12 months? (Python Pandas)

This is what my data looks like:
month total_mobile_subscription
0 1997-01 414000
1 1997-02 423000
2 1997-03 431000
3 1997-04 479000
4 1997-05 510000
.. ... ...
279 2020-04 9056300
280 2020-05 8928800
281 2020-06 8860000
282 2020-07 8768500
283 2020-08 8659000
[284 rows x 2 columns]
Basically, I'm trying to turn this into a dataset indexed by year, where the value is the mean of total mobile subscriptions for that year.
I am not sure what to do as I am still learning this.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'year': ['1997-01', '1997-02', '1997-03', '1998-01', '1998-02', '1998-03'],
    'sale': [500, 1000, 1500, 2000, 1000, 400]
})
for a in range(1, 13):
    x = '-0' + str(a)
    df['year'] = df['year'].str.replace(x, '')
df2 = df.groupby(['year']).mean('sale')
print(df2)
Convert values of month to datetimes and aggregate by years:
y = pd.to_datetime(df['month']).dt.year.rename('year')
df1 = df.groupby(y)['total_mobile_subscription'].mean().reset_index()
df['month'] = pd.to_datetime(df['month'])
df1 = (df.groupby(df['month'].dt.year.rename('year'))['total_mobile_subscription']
.mean().reset_index())
Or aggregate by first 4 values of month column:
df2 = (df.groupby(df['month'].str[:4].rename('year'))['total_mobile_subscription']
.mean().reset_index())
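If the month column is turned into a DatetimeIndex, resample offers an equivalent route. A sketch, assuming a small illustrative frame rather than the asker's full data:

```python
import pandas as pd

df = pd.DataFrame({'month': ['1997-01', '1997-02', '1998-01', '1998-02'],
                   'total_mobile_subscription': [414000, 423000, 500000, 520000]})

# Put the parsed months on the index, then resample to year-end bins
# and take the mean of each year's monthly values.
s = df.set_index(pd.to_datetime(df['month']))['total_mobile_subscription']
yearly = s.resample('Y').mean()
```

The result is indexed by year-end timestamps; groupby on dt.year gives plain integer years instead, so the choice is mostly cosmetic.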

Randomly sample rows based on year-month

data = {'date':['2019-01-01', '2019-01-02', '2020-01-01', '2020-02-02'],
'tweets':["aaa", "bbb", "ccc", "ddd"]}
df = pandas.DataFrame(data)
df['daate'] = pandas.to_datetime(df['date'], infer_datetime_format=True)
So I have an object-type date and a datetime64[ns]-type date. Imagine that I have 100 rows in each year-month. How can I randomly sample 10 rows from each year-month and put them into a data frame? Thanks!
Use DataFrame.groupby per years and months or month periods and use custom lambda function with DataFrame.sample:
df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
.apply(lambda x: x.sample(n=10)))
Or:
df1 = (df.groupby(df['daate'].dt.to_period('m'), group_keys=False)
.apply(lambda x: x.sample(n=10)))
Sample:
data = {'daate':pd.date_range('2019-01-01', '2020-01-22'),
'tweets':np.random.choice(["aaa", "bbb", "ccc", "ddd"], 387)
}
df = pd.DataFrame(data)
df1 = (df.groupby([df['daate'].dt.year, df['daate'].dt.month], group_keys=False)
.apply(lambda x: x.sample(n=10)))
print (df1)
date tweets daate
9 2019-01-10 bbb 2019-01-10
29 2019-01-30 ddd 2019-01-30
17 2019-01-18 ccc 2019-01-18
12 2019-01-13 ccc 2019-01-13
20 2019-01-21 ddd 2019-01-21
.. ... ... ...
381 2020-01-17 bbb 2020-01-17
375 2020-01-11 aaa 2020-01-11
373 2020-01-09 bbb 2020-01-09
368 2020-01-04 aaa 2020-01-04
382 2020-01-18 bbb 2020-01-18
[130 rows x 3 columns]
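On pandas 1.1 or newer, DataFrameGroupBy.sample can replace the apply/lambda entirely. A sketch reusing the same random sample data:

```python
import pandas as pd
import numpy as np

data = {'daate': pd.date_range('2019-01-01', '2020-01-22'),
        'tweets': np.random.choice(["aaa", "bbb", "ccc", "ddd"], 387)}
df = pd.DataFrame(data)

# Group by month period and draw 10 rows from each group directly.
df1 = df.groupby(df['daate'].dt.to_period('m')).sample(n=10)
```

This raises if any group has fewer than n rows, same as the apply version; pass replace=True to sample with replacement in that case.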
import pandas as pd
data = {"date": ["2019-01-01", "2019-01-02", "2020-01-01", "2020-02-02"], "tweets": ["aaa", "bbb", "ccc", "ddd"]}
df = pd.DataFrame(data)
df["daate"] = pd.to_datetime(df["date"], infer_datetime_format=True)
# Just duplicating rows so each date has 100 copies
df = df.loc[df.index.repeat(100)]
# The actual sampling code
available_dates = df["daate"].unique()
sampled_parts = []
for each_date in available_dates:
    rows_with_that_date = df.loc[df["daate"] == each_date]
    sampled_parts.append(rows_with_that_date.sample(5))  # 5 samples per date
sampled_df = pd.concat(sampled_parts)  # DataFrame.append was removed in pandas 2.0
print(len(sampled_df))

Difficult date calculation in DataFrame in Python Pandas?

I have DataFrame like below:
rng = pd.date_range('2020-12-11', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'status': ['active', 'active', 'finished', 'finished', 'active'] })
And I need to create 2 new columns in this DataFrame:
New1 = number of days from the "Date" column until today, for rows with status 'active'
New2 = number of days from the "Date" column until today, for rows with status 'finished'
Below sample result:
Use Series.rsub to subtract from the right side with today's date (a Timestamp floored to days with Timestamp.floor), convert the timedeltas to days with Series.dt.days, and assign the new columns conditionally with Series.where:
rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({ 'Date': rng,
'status': ['active', 'active', 'finished', 'finished', 'active'] })
days = df['Date'].rsub(pd.Timestamp('now').floor('d')).dt.days
df['New1'] = days.where(df['status'].eq('active'))
df['New2'] = days.where(df['status'].eq('finished'))
print (df)
Date status New1 New2
0 2020-12-01 active 13.0 NaN
1 2020-12-02 active 12.0 NaN
2 2020-12-03 finished NaN 11.0
3 2020-12-04 finished NaN 10.0
4 2020-12-05 active 9.0 NaN
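The same conditional assignment can be written with numpy.where, which some readers find more explicit. A sketch on the same frame (exact day counts depend on when it runs, so only the NaN pattern is fixed):

```python
import pandas as pd
import numpy as np

rng = pd.date_range('2020-12-01', periods=5, freq='D')
df = pd.DataFrame({'Date': rng,
                   'status': ['active', 'active', 'finished', 'finished', 'active']})

today = pd.Timestamp('now').normalize()   # midnight today, same as .floor('d')
days = (today - df['Date']).dt.days

# NaN wherever the status does not match, mirroring Series.where above.
df['New1'] = np.where(df['status'] == 'active', days, np.nan)
df['New2'] = np.where(df['status'] == 'finished', days, np.nan)
```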

add rows to pandas dataframe based on days in week

So I'm fairly new to pandas and I run into this problem that I'm not able to fix.
I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
'Day': ['2018-12-31', '2019-01-07'],
'Product_Finished': [1000, 2000],
'Product_Tested': [50, 10]})
df['Day'] = pd.to_datetime(df['Day'], format='%Y-%m-%d')
df
I would like to add rows to my dataframe based on the 'Day' column, ideally adding all the other days of each week while keeping the rest of the columns at the same value. The output should look something like this:
Day Product_Finished Product_Tested
0 2018-12-31 1000 50
1 2019-01-01 1000 50
2 2019-01-02 1000 50
3 2019-01-03 1000 50
4 2019-01-04 1000 50
5 2019-01-05 1000 50
6 2019-01-06 1000 50
7 2019-01-07 2000 10
8 2019-01-08 2000 10
9 2019-01-09 2000 10
10 2019-01-10 2000 10
11 2019-01-11 2000 10
12 2019-01-12 2000 10
13 2019-01-13 2000 10
Any tips would be greatly appreciated, thank you in advance!
You can achieve this by first creating a new DataFrame that contains the desired date range, using pandas.date_range. Then use pandas.merge_asof, which fills each day with the last available value.
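A sketch of that two-step approach (build the daily range, then let merge_asof's default backward match carry each week's values forward):

```python
import pandas as pd

df = pd.DataFrame({'Day': pd.to_datetime(['2018-12-31', '2019-01-07']),
                   'Product_Finished': [1000, 2000],
                   'Product_Tested': [50, 10]})

# Step 1: the full daily range covering both weeks (last day + 6).
days = pd.DataFrame({'Day': pd.date_range(df['Day'].min(),
                                          df['Day'].max() + pd.Timedelta(days=6))})

# Step 2: each day takes the last df row whose Day is <= it.
out = pd.merge_asof(days, df, on='Day')
```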
Alternatively, you can resample by reindexing over a daily date range:
import datetime
import pandas as pd
df = pd.DataFrame({
'Day': ['2018-12-31', '2019-01-07'],
'Product_Finished': [1000, 2000],
'Product_Tested': [50, 10]})
df['Day'] = pd.to_datetime(df['Day'], format='%Y-%m-%d')
df.set_index('Day',inplace=True)
df_Date=pd.date_range(start=df.index.min(), end=(df.index.max()+ datetime.timedelta(days=6)), freq='D')
df=df.reindex(df_Date,method='ffill',fill_value=None)
df.reset_index(inplace=True)

Cumulative sum over days in python

I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work.
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
You can try using df.groupby('date').sum():
example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]
