import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-10', '2015-01-11', '2015-01-12'], 'a': [1,2,3,4]})
df2 = pd.DataFrame({'date': ['2015-01-01', '2015-01-05', '2015-01-11'], 'b': [10,20,30]})
df = df1.merge(df2, on=['date'], how='outer')
df = df.sort_values('date')
print(df)
"like magnetic thing" may not be a good expression in title. I will explain below.
I want each record from df2 to match the first record of df1 whose date is greater than or equal to df2's. For example, I want df2's '2015-01-05' to match df1's '2015-01-10'.
I cannot achieve this by simply merging them with an inner, outer, or left join, though the result above is very close to what I want.
     a        date     b
0  1.0  2015-01-01  10.0
4  NaN  2015-01-05  20.0
1  2.0  2015-01-10   NaN
2  3.0  2015-01-11  30.0
3  4.0  2015-01-12   NaN
How can I achieve this, either from what I have done or in some other way from scratch? The desired output is:
     a        date     b
0  1.0  2015-01-01  10.0
1  2.0  2015-01-10  20.0
2  3.0  2015-01-11  30.0
3  4.0  2015-01-12   NaN
Making sure your dates are dates:
df1.date = pd.to_datetime(df1.date)
df2.date = pd.to_datetime(df2.date)
numpy
np.searchsorted
# for each df2 date, find the position of the first df1 date that is >= it
ilocs = df1.date.values.searchsorted(df2.date.values)
# assign df2's b values onto those df1 rows
df1.loc[df1.index[ilocs], 'b'] = df2.b.values
df1
   a       date     b
0  1 2015-01-01  10.0
1  2 2015-01-10  20.0
2  3 2015-01-11  30.0
3  4 2015-01-12   NaN
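One caveat worth adding (my note, not part of the original answer): searchsorted assumes df1.date is sorted, and a df2 date later than every df1 date returns position len(df1), which would make the indexing above fail. A defensive sketch with a hypothetical valid mask:

ilocs = df1.date.values.searchsorted(df2.date.values)
valid = ilocs < len(df1)  # drop df2 dates that fall after every df1 date
df1.loc[df1.index[ilocs[valid]], 'b'] = df2.b.values[valid]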
pandas
pd.merge_asof gets you really close:
pd.merge_asof(df1, df2, on='date')
   a       date   b
0  1 2015-01-01  10
1  2 2015-01-10  20
2  3 2015-01-11  30
3  4 2015-01-12  30
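The last row picks up 30 again because merge_asof matches each df1 row to the most recent df2 date at or before it. To recover the exact desired output, where each df2 row is consumed only once, one option (my sketch, assuming b values are unique in df2) is to blank out repeated matches:

out = pd.merge_asof(df1, df2, on='date')
# a df2 row should match only the first qualifying df1 row;
# mask later repeats of the same b (assumes b values are unique in df2)
out['b'] = out['b'].mask(out['b'].duplicated())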
Related
I have this data frame:
import pandas as pd

df = pd.DataFrame({'COTA': ['A','A','A','A','A','B','B','B','B'],
                   'Date': ['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
                   'Mark': [1,2,3,4,5,1,2,3,3]})
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to acquire the maximum per COTA, but I want the last one; I used .max() and thought I could get it with .last(), but it didn't work.
Here is my example code:
df['Date'] = pd.to_datetime(df['Date'])
# roll each date to its own year end, e.g. 2021-10-14 -> 2021-12-31
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
# max Mark per COTA and year, then shift the index one year forward
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print(df)
  COTA       Date  Mark   LastYear  Max_MarkLastYear
0    A 2021-10-14     1 2021-12-31               5.0
1    A 2020-10-19     2 2020-12-31               3.0
2    A 2019-10-29     3 2019-12-31               NaN
3    A 2021-09-30     4 2021-12-31               5.0
4    A 2020-09-20     5 2020-12-31               3.0
5    B 2021-10-20     1 2021-12-31               3.0
6    B 2020-10-29     2 2020-12-31               3.0
7    B 2019-10-15     3 2019-12-31               NaN
8    B 2020-10-09     3 2020-12-31               3.0
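Since the stated goal is the last Mark of the previous year rather than the maximum, the same pattern should work with .last() after sorting by date; a sketch under that assumption (Last_MarkLastYear is a hypothetical column name):

# sort so that "last" means the chronologically latest record of the year
s1 = (df.sort_values('Date')
        .groupby(['COTA', 'LastYear'])['Mark']
        .last())
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Last_MarkLastYear'), on=['COTA', 'LastYear'])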
So I have a data frame:
import numpy as np
import pandas as pd

testdf = pd.DataFrame({"loc": ["ab12","bc12","cd12","ab12","bc13","cd12"],
                       "months": ["Jun21","Jun21","July21","July21","Aug21","Aug21"],
                       "dept": ["dep1","dep2","dep3","dep2","dep1","dep3"],
                       "count": [15, 16, 15, 92, 90, 2]})
That looks like this:

    loc  months  dept  count
0  ab12   Jun21  dep1     15
1  bc12   Jun21  dep2     16
2  cd12  July21  dep3     15
3  ab12  July21  dep2     92
4  bc13   Aug21  dep1     90
5  cd12   Aug21  dep3      2

When I pivot it,

df = pd.pivot_table(testdf, values=['count'], index=['loc','dept'], columns=['months'], aggfunc=np.sum).reset_index()
df.columns = df.columns.droplevel(0)
df

the month columns come out in alphabetical order (Aug21, July21, Jun21) rather than in calendar order.
I am looking for a sort function which will sort only the month columns into sequence, and not the first two columns, i.e. loc & dept.
When I try this:

df.sort_values(by = ['Jun21'],ascending = False, inplace = True, axis = 1, ignore_index=True)[2:]

it gives me an error.
I want the columns to be in the sequence Jun21, July21, Aug21, and I am looking for something dynamic, so I won't need to change the sequence manually when the months change.
Any hint will be really appreciated.
It is quite simple if you use groupby:
# aggregate the counts, then move months into the columns
df = testdf.groupby(['loc', 'dept', 'months']).sum().unstack(level=2)
# reorder the month level of the columns explicitly
df = df.reindex(['Jun21', 'July21', 'Aug21'], axis=1, level=1)
Output
count
months Jun21 July21 Aug21
loc dept
ab12 dep1 15.0 NaN NaN
dep2 NaN 92.0 NaN
bc12 dep2 16.0 NaN NaN
bc13 dep1 NaN NaN 90.0
cd12 dep3 NaN 15.0 2.0
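The reindex above hardcodes the month order, while the question asks for something dynamic. One option (a sketch with a hypothetical month_key helper, assuming every label is a month name, full or abbreviated, followed by a two-digit year) is to derive the order from the labels themselves:

import calendar

def month_key(label):
    # split '<month name><2-digit year>' into a sortable (year, month) pair
    name, year = label[:-2], int(label[-2:])
    month = next(i for i, m in enumerate(calendar.month_name) if m.startswith(name))
    return (year, month)

order = sorted(df.columns.get_level_values(1), key=month_key)
df = df.reindex(order, axis=1, level=1)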
We can start by converting the months column to datetime, like so:
>>> testdf.months = (pd.to_datetime(testdf.months, format="%b%y", errors='coerce'))
>>> testdf
    loc     months  dept  count
0  ab12 2021-06-01  dep1     15
1  bc12 2021-06-01  dep2     16
2  cd12 2021-07-01  dep3     15
3  ab12 2021-07-01  dep2     92
4  bc13 2021-08-01  dep1     90
5  cd12 2021-08-01  dep3      2
Then, we apply your code to get the pivot:
>>> df = pd.pivot_table(testdf, values = ['count'], index = ['loc','dept'], columns = ['months'], aggfunc=np.sum).reset_index()
>>> df.columns = df.columns.droplevel(0)
>>> df
months        NaT   NaT 2021-06-01 2021-07-01 2021-08-01
0            ab12  dep1       15.0        NaN        NaN
1            ab12  dep2        NaN       92.0        NaN
2            bc12  dep2       16.0        NaN        NaN
3            bc13  dep1        NaN        NaN       90.0
4            cd12  dep3        NaN       15.0        2.0
And to finish, we can reformat the column names using strftime to get the expected result:
>>> df.columns = df.columns.map(lambda t: t.strftime('%b%y') if pd.notnull(t) else '')
>>> df
months             Jun21  Jul21  Aug21
0       ab12  dep1   15.0    NaN    NaN
1       ab12  dep2    NaN   92.0    NaN
2       bc12  dep2   16.0    NaN    NaN
3       bc13  dep1    NaN    NaN   90.0
4       cd12  dep3    NaN   15.0    2.0
I would like to compare one column of a df with a column of a different df. The columns are a timestamp and a holiday date. I'd like to create a dummy variable that is 1 if the timestamp in df1 matches a date in df2, else 0.
For example, df1:
   timestamp  weight(kg)
0 2016-03-04         4.0
1 2015-02-15         5.0
2 2019-05-04         5.0
3 2018-12-25        29.0
4 2020-01-01        58.0
For example, df2:
      holiday
0  2016-12-25
1  2017-01-01
2  2019-05-01
3  2018-12-26
4  2020-05-26
Ideal output:
   timestamp  weight(kg)  holiday
0 2016-03-04         4.0        0
1 2015-02-15         5.0        0
2 2019-05-04         5.0        0
3 2018-12-25        29.0        1
4 2020-01-01        58.0        1
I have tried writing a function but it is taking very long to calculate:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if
                         x['timestamp'] == y['holiday_dt']
                         else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df
#df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps there is a simpler way to do this, or the function is not well written. Any help will be appreciated.
Use Series.isin and convert the mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
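One caveat (an addition, not part of the answer): isin compares values exactly, so if one column holds strings and the other datetimes, the mask will silently be all zeros. Converting both columns first is a cheap safeguard:

# normalize dtypes before comparing (assumes both columns hold parseable dates)
df1['timestamp'] = pd.to_datetime(df1['timestamp'])
df2['holiday'] = pd.to_datetime(df2['holiday'])
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)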
This isn't a duplicate; I have already referred to post_1 and post_2.
My question is different and not about the agg function: it is about also displaying the grouped-by column during an ffill operation. Though the code works fine, I am sharing the full code so you get an idea; the problem is in the commented line, so look out for that line below.
I have a dataframe like the one given below:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val': [5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What this code (written with the help of jezrael from the forum) does is add missing dates based on a threshold value. The only issue is that I don't see the grouped-by column in the output.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
         .groupby('subject_id')
         .resample('d')
         .last()
         .index
         .to_frame(index=False))
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
df2 = df2.groupby(df2['subject_id']).ffill()  # problem is here
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
As shown in the code above, I tried the approaches below:
df2 = df2.groupby(df2['subject_id']).ffill() # doesn't help
df2 = df2.groupby(df2['subject_id']).ffill().reset_index() # doesn't help
df2 = df2.groupby('subject_id',as_index=False).ffill() # doesn't help
The output is incorrect: the subject_id column is missing.
I expect my output to have the subject_id column as well.
Here are two possible solutions. First, specify all columns in a list after groupby and assign back (groupby.ffill returns only the non-grouping columns, which is why subject_id disappears):
cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill()  # assigning back keeps subject_id intact
Or create an index from the subject_id column and group by the index:
# newer pandas versions
df2 = df2.set_index('subject_id').groupby('subject_id').ffill().reset_index()
# older pandas versions
df2 = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print(df2)
     subject_id       date              time_1  val  day  month  count
0             1 2173-04-03 2173-04-03 12:35:00    5    3    4.0    NaN
1             1 2173-04-03 2173-04-03 12:50:00    5    3    4.0    NaN
2             1 2173-04-04 2173-04-04 12:50:00    5    4    4.0    1.0
3             1 2173-04-05 2173-04-05 12:59:00    5    5    4.0    1.0
32            1 2173-05-04 2173-05-04 13:14:00    5    4    5.0    1.0
33            1 2173-05-05 2173-05-05 13:37:00    1    5    5.0    1.0
95            1 2173-07-06 2173-07-06 13:39:00    6    6    7.0    1.0
96            1 2173-07-07 2173-07-07 13:39:00    6    7    7.0    1.0
97            1 2173-07-08 2173-07-08 11:30:00    5    8    7.0    1.0
98            2 2173-04-08 2173-04-08 16:00:00    5    8    4.0    NaN
99            2 2173-04-09 2173-04-09 22:00:00    8    9    4.0    NaN
100           2 2173-04-10 2173-04-10 22:00:00    8   10    4.0    1.0
101           2 2173-04-11 2173-04-11 04:00:00    3   11    4.0    1.0
102           2 2173-04-12 2173-04-12 04:00:00    3   12    4.0    1.0
103           2 2173-04-13 2173-04-13 04:30:00    4   13    4.0    1.0
104           2 2173-04-14 2173-04-14 08:00:00    6   14    4.0    1.0
I have a dataframe like this:
    time A      time B  2017-11  2017-12  2018-01  2018-02
2017-01-24  2020-01-01      NaN      NaN      NaN      NaN
2016-11-28  2020-01-01      NaN      4.0      2.0      2.0
2017-03-18  2017-12-21      NaN      NaN      NaN      NaN
I want to replace NaN with 0 when the column name falls between time A and time B. For example, for the third row the time range is 2017-03-18 to 2017-12-21, so for the columns of the third row whose names fall in this range: if the value is NaN, replace it with 0; otherwise leave it as it is. Hope it's clear. Thanks.
Maybe not the best solution, but it works.
Here's my test sample:
import numpy as np
import pandas as pd

d = pd.DataFrame([
    {"time A": "2017-01-24", "time B": np.nan, "2016-11": np.nan, "2016-12": np.nan, "2017-01": np.nan, "2017-02": np.nan},
    {"time A": "2016-11-28", "time B": np.nan, "2016-11": np.nan, "2016-12": 4, "2017-01": 2, "2017-02": 2},
    {"time A": "2016-12-18", "time B": "2017-01-01", "2016-11": np.nan, "2016-12": np.nan, "2017-01": np.nan, "2017-02": np.nan},
])
d["time B"].fillna("2020-01-01", inplace=True)
d.set_index(["time A", "time B"], inplace=True)
Initial table:
    time A      time B  2016-11  2016-12  2017-01  2017-02
2017-01-24  2020-01-01      NaN      NaN      NaN      NaN
2016-11-28  2020-01-01      NaN      4.0      2.0      2.0
2016-12-18  2017-01-01      NaN      NaN      NaN      NaN
It looks like time A is an open date and time B is a close date, or something like that. Thus, for convenience, I've filled missing time B values with an arbitrary future date, for example '2020-01-01'.
I don't like working with pivot tables, so I've used df.stack() to stack it, and formatted the date columns:
d_stack = d.stack(dropna=False).reset_index()
d_stack.columns = ["time A", "time B", "month", "value"]
for col in ["time A", "time B"]:
    d_stack[col] = pd.to_datetime(d_stack[col], format="%Y-%m-%d", errors="ignore")
d_stack["month"] = pd.to_datetime(d_stack["month"], format="%Y-%m", errors="ignore")
Now it's more convenient to fill in the missing values:
def fill_existing(x):
    if (x["time A"] <= x["month"] <= x["time B"] and
            np.isnan(x["value"])):
        return 0
    else:
        return x["value"]
d_stack["value"] = d_stack.apply(fill_existing, axis=1)
Output:
      time A      time B       month  value
0 2017-01-24  2020-01-01  2016-11-01    NaN
1 2017-01-24  2020-01-01  2016-12-01    NaN
2 2017-01-24  2020-01-01  2017-01-01    NaN
3 2017-01-24  2020-01-01  2017-02-01    0.0
Finally, format month back and use pd.pivot_table to return to the initial table format:
d_stack["month"] = d_stack["month"].apply(lambda x: x.strftime("%Y-%m"))
pd.pivot_table(d_stack, columns="month", index=["time A", "time B"],
               values="value", aggfunc=np.sum)
Result:
    time A      time B  2016-12  2017-01  2017-02
2016-11-28  2020-01-01      4.0      2.0      2.0
2016-12-18  2017-01-01      NaN      0.0      NaN
2017-01-24  2020-01-01      NaN      NaN      0.0
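For larger frames, a vectorized alternative to the row-wise apply (a sketch, not part of the answer above; it reuses the d defined at the start of this answer) is to broadcast the comparison between the month columns and each row's time A / time B bounds:

month_cols = ["2016-11", "2016-12", "2017-01", "2017-02"]
months = pd.to_datetime(pd.Index(month_cols), format="%Y-%m")
a = pd.to_datetime(d.index.get_level_values("time A"))
b = pd.to_datetime(d.index.get_level_values("time B"))
# rows x columns mask: True where the month falls inside [time A, time B]
in_range = (months.values >= a.values[:, None]) & (months.values <= b.values[:, None])
# replace NaN with 0 only inside the range; everything else stays as-is
d[month_cols] = d[month_cols].where(~(d[month_cols].isna().values & in_range), 0)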
Try this code:
newdf = df[(df.date > some_date) & (df.date < some_other_date)]
newdf = newdf.fillna(0)
newdf is the dataframe you are looking for.