Grouper() and agg() functions produce multiple copies when squashed - python

I have a sample dataframe as given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
                 '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
        'Name': ['xx', 'xx', NaN, 'yy', NaN, NaN],
        'Height': [174, 174, NaN, 160, NaN, NaN],
        'Weight': [74, NaN, NaN, 58, NaN, NaN],
        'Gender': [NaN, 'Male', NaN, NaN, 'Female', NaN],
        'Interests': [NaN, NaN, 'Hiking,Sports', NaN, NaN, 'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data for the same ID on the same date into a single row. The 'Date' column is in timestamp format. I have written code for it:
TRY:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: ''.join(x.dropna().astype(str)))
.reset_index()
).replace('', np.nan)
This gives an output where, if a column has multiple entries with the same value, the final result repeats that value within a single cell, as shown below.
Obtained Output
However, I do not want the values to be repeated if there are multiple entries. The final output should look like the image shown below.
Required Output
The first row should have 'xx' and 174.0 instead of 'xxxx' and '174.0 174.0'.
Any help is greatly appreciated. Thank you.

In your case, replace the join aggregation with first:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.first()
.reset_index()
).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing

Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
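If you prefer to keep the join-style aggregation (so that genuinely different values on the same day would still be concatenated) while avoiding repeats, a minimal sketch is to join only the unique non-null strings; this is an assumption about the desired behaviour rather than something stated in the question:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ' '.join(x.dropna().astype(str).unique()))   # keep each distinct value once
             .reset_index()
          ).replace('', np.nan)
Note that this leaves every column as strings (e.g. '174.0'), whereas the first-based answers above preserve the original dtypes.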

Related

Python How to combine two rows into one under multiple rules

I am trying to combine many pairs of rows each time the code runs. As my example shows, for two rows to be combined, the rules are:
the values in the PT, DS, and SC columns must be the same;
their FS timestamps must be the closest pair;
the combined ID column (string) looks like ID1,ID2;
the combined WT and CB columns (numbers) are the sum();
the combined FS is the later of the two times.
My example is,
df0 = pd.DataFrame({'ID':['1001','1002','1003','1004','2001','2002','2003','2004','3001','3002','3003','3004','4001','4002','4003','4004','5001','5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','D','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAA','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:00','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:04','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:05','2020-10-16 00:00:07','2020-10-16 00:00:01','2020-10-16 00:00:10','2020-10-16 00:10:00','2020-10-16 00:10:40','2020-10-16 00:00:00','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[1,2,3,4,10,11,12,13,20,21,22,23,30,31,32,33,40,41,42,43,53],
'CB':[0.1,0.2,0.3,0.4,1,1.1,1.2,1.3,2,2.1,2.2,2.3,3,3.1,3.2,3.3,4,4.1,4.2,4.3,5.3]})
After running the code once, the new dataframe df1 is:
df1 = pd.DataFrame({'ID':['1001,1002','1003,1004','2001,2002','2003,2004','3001,3002','3003,3004','4001,4002','4003,4004','5001,5002','5003','5004','6001'],
'PT':['B','B','B','B','B','B','B','B','D','D','D','F'],
'DS':['AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAA','AAA','AAB','AAB'],
'SC':['P1','P1','P2','P2','P1','P1','P2','P2','P1','P1','P2','P2'],
'FS':['2020-10-16 00:00:02','2020-10-16 00:00:04','2020-10-16 00:00:01','2020-10-16 00:00:03','2020-10-16 00:00:01','2020-10-16 00:00:07','2020-10-16 00:00:10','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[3,7,21,25,41,45,61,65,81,42,43,53],
'CB':[0.3,0.7,2.1,2.5,4.1,4.5,6.1,6.5,8.1,4.2,4.3,5.3]})
After running the code again on df1, the new dataframe df2 is:
df2 = pd.DataFrame({'ID':['1001,1002,1003,1004','2001,2002,2003,2004','3001,3002,3003,3004','4001,4002,4003,4004','5001,5002,5003','5004','6001'],
'PT':['B','B','B','B','D','D','F'],
'DS':['AAA','AAA','AAB','AAB','AAA','AAB','AAB'],
'SC':['P1','P2','P1','P2','P1','P2','P2'],
'FS':['2020-10-16 00:00:04','2020-10-16 00:00:03','2020-10-16 00:00:07','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:10','2020-10-16 00:00:05'],
'WT':[10,46,86,126,123,43,53],
'CB':[1,4.6,8.6,12.6,12.3,4.3,5.3]})
At this point no further combinations can be made on df2, because no pair of rows meets the rules.
The reason is that I have a memory limit and need to reduce the size of the data without losing information, so I am trying to bundle IDs that share the same features and occur close to each other in time. I plan to run the code repeatedly until the memory issue is gone or no more combinations are possible.
This is a good place to use GroupBy operations.
My source was Wes McKinney's Python for Data Analysis.
df0['ID'] = df0.groupby([df0['PT'], df0['DS'], df0['SC']])['ID'].transform(lambda x: ','.join(x))
max_times = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index = False).max().drop(['WT', 'CB'], axis = 1)
sums_WT_CB = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index = False).sum()
df2 = pd.merge(max_times, sums_WT_CB, on=['ID', 'PT', 'DS', 'SC'])
This code just takes the most recent time for each unique grouping of the columns you specified. If there are other requirements for the FS column, you will have to modify this.
Code to concatenate the IDs came from:
Concatenate strings from several rows using Pandas groupby
Perhaps there's something more straightforward (please comment if so :)
but the following seems to work:
def combine(data):
    return pd.DataFrame(
        {
            "ID": ",".join(map(str, data["ID"])),
            "PT": data["PT"].iloc[0],
            "DS": data["DS"].iloc[0],
            "SC": data["SC"].iloc[0],
            "WT": data["WT"].sum(),
            "CB": data["CB"].sum(),
            "FS": data["FS"].max(),
        },
        index=[0],
    ).reset_index(drop=True)

df_agg = (
    df0.sort_values(["PT", "DS", "SC", "FS"])
    .groupby(["PT", "DS", "SC"])
    .apply(combine)
    .reset_index(drop=True)
)
returns
ID PT DS SC WT CB FS
0 1001,1002,1003,1004 B AAA P1 10 1.0 2020-10-16 00:00:04
1 2001,2002,2003,2004 B AAA P2 46 4.6 2020-10-16 00:00:03
2 3001,3002,3003,3004 B AAB P1 86 8.6 2020-10-16 00:00:07
3 4001,4002,4003,4004 B AAB P2 126 12.6 2020-10-16 00:10:40
4 5001,5003,5002 D AAA P1 123 12.3 2020-10-16 00:10:00
5 5004 D AAB P2 43 4.3 2020-10-16 00:00:10
6 6001 F AAB P2 53 5.3 2020-10-16 00:00:05
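Note that neither answer enforces the question's "closest pair" rule: both collapse a whole (PT, DS, SC) group in one pass. As a rough, hedged sketch only, one pass of pairwise combining could pair consecutive rows after sorting by FS, on the assumption that consecutive timestamps are acceptable pairs (this is not exactly the closest-pair matching described, and the helper name combine_pairs is just illustrative):
def combine_pairs(df):
    out = df.sort_values(['PT', 'DS', 'SC', 'FS']).copy()
    # label consecutive rows within each group 0,0,1,1,2,2,... so each pair is merged;
    # a row left without a partner simply keeps its own values
    out['pair'] = out.groupby(['PT', 'DS', 'SC']).cumcount() // 2
    out = (out.groupby(['PT', 'DS', 'SC', 'pair'], as_index=False)
              .agg({'ID': ','.join, 'FS': 'max', 'WT': 'sum', 'CB': 'sum'}))
    return out.drop(columns='pair')

df1_new = combine_pairs(df0)   # run it again on df1_new to combine further
Running the function repeatedly mirrors the "run the code multiple times" workflow from the question.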

Python (Pandas) How to merge 2 dataframes with different dates in incremental order?

I am trying to merge 2 dataframes by date index in order. Is this possible?
A sample code of what I need to manipulate
Link for sg_df:https://query1.finance.yahoo.com/v7/finance/download/%5ESTI?P=^STI?period1=1442102400&period2=1599955200&interval=1mo&events=history
Link for facemask_compliance_df: https://today.yougov.com/topics/international/articles-reports/2020/05/18/international-covid-19-tracker-update-18-may (YouGov COVID-19 behaviour changes tracker: Wearing a face mask when in public places)
import pandas as pd
from datetime import datetime

# Singapore Index
# Read file
# Format Date
# index date column for easy referencing
sg_df = pd.read_csv("^STI.csv")
conv = lambda x: datetime.strptime(x, "%d/%m/%Y")
sg_df["Date"] = sg_df["Date"].apply(conv)
sg_df.sort_values("Date", inplace = True)
sg_df.set_index("Date", inplace = True)
# Will wear face mask in public
# Read file
# Format Date, Removing time
# index date column for easy referencing
facemask_compliance_df = pd.read_csv("yougov-chart.csv")
convert1 = lambda x: datetime.strptime(x, "%d/%m/%Y %H:%M")
facemask_compliance_df["DateTime"] = facemask_compliance_df["DateTime"].apply(convert1).dt.date
facemask_compliance_df.sort_values("DateTime", inplace = True)
facemask_compliance_df.set_index("DateTime", inplace = True)
sg_df = sg_df.merge(facemask_compliance_df["Singapore"], left_index = True, right_index = True, how = "outer").sort_index()
and I wish to output a table kind of like this.
Kindly let me know if you need any more info; I will provide it shortly if I am able to.
Edit:
This is the issue
data from yougov-chart
I think it is reading the dates even from rows that are not for Singapore.
Use:
merge to merge the two tables.
1.1. on to choose which column to merge on:
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
1.2. the outer option:
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
sort_values to sort by date.
import pandas as pd
df1 = pd.read_csv("^STI.csv")
df1['Date'] = pd.to_datetime(df1.Date)
df2 = pd.read_csv("yougov-chart.csv")
df2['Date'] = pd.to_datetime(df2.DateTime)
result = df2.merge(df1, on='Date', how='outer')
result = result.sort_values('Date')
print(result)
Output:
Date US_GDP_Thousands Mask Compliance
6 2016-02-01 NaN 37.0
7 2017-07-01 NaN 73.0
8 2019-10-01 NaN 85.0
0 2020-02-21 50.0 27.0
1 2020-03-18 55.0 NaN
2 2020-03-19 60.0 NaN
3 2020-03-25 65.0 NaN
4 2020-04-03 70.0 NaN
5 2020-05-14 75.0 NaN
First, use the parse_dates and index_col parameters in read_csv to get a DatetimeIndex in both dataframes, and in the second one remove the times with DatetimeIndex.floor:
sg_df = pd.read_csv("^STI.csv",
parse_dates=['Date'],
index_col=['Date'])
facemask_compliance_df = pd.read_csv("yougov-chart.csv",
parse_dates=['DateTime'],
index_col=['DateTime'])
facemask_compliance_df["DateTime"] = facemask_compliance_df["DateTime"].dt.floor('d')
Then use DataFrame.merge on the indexes with an outer join, and sort by DataFrame.sort_index:
df = sg_df.merge(facemask_compliance_df,
left_index=True,
right_index=True,
how='outer').sort_index()
print (df)
Mask Compliance US_GDP_Thousands
Date
2016-02-01 37.0 NaN
2017-07-01 73.0 NaN
2019-10-01 85.0 NaN
2020-02-21 27.0 50.0
2020-03-18 NaN 55.0
2020-03-19 NaN 60.0
2020-03-25 NaN 65.0
2020-04-03 NaN 70.0
2020-05-14 NaN 75.0
If I remember right, in NumPy you can use np.vstack or np.hstack, depending on how you want to join them together.
In pandas there is pd.concat (https://pandas.pydata.org/docs/user_guide/merging.html), which I have used for combining dataframes.
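For completeness, a minimal sketch of the pd.concat approach mentioned above, assuming both frames are already indexed by date as in the question's code (the variable names are taken from there):
import pandas as pd

# axis=1 aligns the two frames on their date indexes; by default all dates from
# either frame are kept, i.e. the same outer-join behaviour as the merge above
combined = pd.concat([sg_df, facemask_compliance_df["Singapore"]], axis=1).sort_index()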

update data frame based on data from another data frame using pandas python

I have two data frames, df1 and df2. Both have a common first column: SKUCode in df1 and SKU in df2.
df1:
df2:
I want to update df1 and set SKUStatus=0 if SKUCode matches SKU in df2.
I want to add new row to df1 if SKU from df2 has no match to SKUCode.
So after the operation df1 looks like following:
One way I could get this done is via df2.iterrows() and looping through the values; however, I think there must be a neater way of doing this.
Thank you
import pandas as pdx
df1=pdx.DataFrame({'SKUCode':['A','B','C','D'],'ListPrice':[1798,2997,1798,999],'SalePrice':[1798,2997,1798,999],'SKUStatus':[1,1,1,0],'CostPrice':[500,773,525,300]})
df2=pdx.DataFrame({'SKUCode':['X','Y','B'],'Status':[0,0,0],'e_date':['31-05-2020','01-06-2020','01-06-2020']})
df1.merge(df2,left_on='SKUCode')
Try this, using an outer merge, which gives both matching and non-matching records.
In [75]: df_m = df1.merge(df2, on="SKUCode", how='outer')
In [76]: mask = df_m['Status'].isnull()
In [77]: df_m.loc[~mask, 'SKUStatus'] = df_m.loc[~mask, 'Status']
In [78]: df_m[['SKUCode', "ListPrice", "SalePrice", "SKUStatus", "CostPrice"]].fillna(0.0)
output
SKUCode ListPrice SalePrice SKUStatus CostPrice
0 A 1798.0 1798.0 1.0 500.0
1 B 2997.0 2997.0 0.0 773.0
2 C 1798.0 1798.0 1.0 525.0
3 D 999.0 999.0 0.0 300.0
4 X 0.0 0.0 0.0 0.0
5 Y 0.0 0.0 0.0 0.0
I'm not sure exactly if I understood you correctly, but I think you can use .loc, something along the lines of:
df1.loc[df2['Status'] != 0, 'SKUStatus'] = 1
You should have a look at the pd.merge function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).
First rename the column so both frames share the same name (e.g. rename SKU to SKUCode). Then try:
df1.merge(df2, on='SKUCode')
If you provide input data (not screenshots), I can try with the appropriate parameters.
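Based on the sample frames given above (where df2 already uses the name SKUCode, so the rename is a no-op there), a hedged sketch of this merge-based approach could look like the following; it is an illustration, not a drop-in answer:
import pandas as pd

# align the key column names first, in case df2 still calls the column SKU
df2 = df2.rename(columns={'SKU': 'SKUCode'})

merged = df1.merge(df2[['SKUCode', 'Status']], on='SKUCode', how='outer')
# matched (and brand-new) rows take SKUStatus from df2's Status; unmatched rows keep their old value
merged['SKUStatus'] = merged['Status'].fillna(merged['SKUStatus'])
merged = merged.drop(columns='Status')
Any remaining NaNs in the price columns for the newly added SKUs can then be filled as in the answer above (e.g. with fillna(0.0)).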

Pandas - Calculate row values based on prior row value, update the result to be the new row value (and so on)

Below is some dummy data that reflects the data I am working with.
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
# Dummy data that represents a percent change
datelist = pd.date_range(start='1983-01-01', end='1994-01-01', freq='Y')
df1 = pd.DataFrame({"P Change_1": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,)),
"P Change_2": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,))})
#This dataframe contains the rows we want to operate on
df2 = pd.DataFrame({
'Loc1': [None, None, None, None, None, None, None, None, None, None, 2.5415],
'Loc2': [None, None, None, None, None, None, None, None, None, None, 3.2126],})
#Set the datetime index
df1 = df1.set_index(datelist)
df2 = df2.set_index(datelist)
df1:
P Change_1 P Change_2
1984-12-31 -0.172080 -0.231574
1985-12-31 -0.328773 -0.247018
1986-12-31 -0.160834 -0.099079
1987-12-31 -0.457924 0.000266
1988-12-31 0.017374 -0.501916
1989-12-31 -0.349052 -0.438816
1990-12-31 0.034711 0.036164
1991-12-31 -0.415445 -0.415372
1992-12-31 -0.206852 -0.413107
1993-12-31 -0.313341 -0.181030
1994-12-31 -0.474234 -0.118058
df2:
Loc1 Loc2
1984-12-31 NaN NaN
1985-12-31 NaN NaN
1986-12-31 NaN NaN
1987-12-31 NaN NaN
1988-12-31 NaN NaN
1989-12-31 NaN NaN
1990-12-31 NaN NaN
1991-12-31 NaN NaN
1992-12-31 NaN NaN
1993-12-31 NaN NaN
1994-12-31 2.5415 3.2126
DataFrame details:
First off, Loc1 will correspond with P Change_1 and Loc2 corresponds to P Change_2, etc. Looking at Loc1 first, I want to either fill up the DataFrame containing Loc1 and Loc2 with the relevant values or compute a new dataframe that has columns Calc1 and Calc2.
The calculation:
I want to start with the 1994 value of Loc1 and calculate a new value for 1993 by taking Loc1 1993 = Loc1 1994 + (Loc1 1994 * P Change_1 1993). With the values filled in it would be 2.5415 +(-0.313341 * 2.5415) which equals about 1.74514.
This 1.74514 value will replace the NaN value in 1993, and then I want to use that calculated value to get a value for 1992. This means we now compute Loc1 1992 = Loc1 1993 + (Loc1 1993 * P Change_1 1992). I want to carry out this operation row-wise until it gets the earliest value in the timeseries.
What is the best way to go about implementing this row-wise equation? I hope this makes some sense and any help is greatly appreciated!
df = pd.merge(df1, df2, how='inner', right_index=True, left_index=True)  # merging dataframes on date index
df['count'] = range(len(df))  # creating a column, count for easy operation
# divides dataframe in two parts, one part above the not NaN row and one below
da1 = df[df['count'] <= df.dropna().iloc[0]['count']]
da2 = df[df['count'] >= df.dropna().iloc[0]['count']]
da1.sort_values(by=['count'], ascending=False, inplace=True)
g = [da1, da2]
num_col = len(df1.columns)
for w in range(len(g)):
    count = 0
    list_of_col = [list() for i in range(len(g[w]))]
    for item, rows in g[w].iterrows():
        n = []
        if count == 0:
            for p in range(1, num_col + 1):
                n.append(rows[f'Loc{p}'])
        else:
            for p in range(1, num_col + 1):
                n.append(list_of_col[count - 1][p - 1] + list_of_col[count - 1][p - 1] * rows[f'P Change_{p}'])
        list_of_col[count].extend(n)
        count += 1
    tmp = [list() for i in range(num_col)]
    for d_ in range(num_col):
        for x_ in range(len(list_of_col)):
            tmp[d_].append(list_of_col[x_][d_])
    z1 = []
    z1.extend(tmp)
    for i in range(num_col):
        g[w][f'Loc{i+1}'] = z1[i]
da1.sort_values(by=['count'], inplace=True)
final_df = pd.concat([da1, da2[1:]])
calc_df = pd.DataFrame()
for i in range(num_col):
    calc_df[f'Calc{i+1}'] = final_df[f'Loc{i+1}']
print(calc_df)
I have tried to explain all the obscure things I have done in the comments. I have edited my code so that the initial dataframes remain unaffected.
[Edited]: I have edited the code to handle any number of columns in the given dataframes.
[Edited]: If the column names in df1 and df2 are arbitrary, please run this block of code before running the code above. It renames the columns using a list comprehension:
df1.columns = [f'P Change_{i+1}' for i in range(len(df1.columns))]
df2.columns = [f'Loc{i+1}' for i in range(len(df2.columns))]
[EDITED] Perhaps there are better/more elegant ways to do this, but this worked fine for me:
def fill_values(df1, df2, cols1=None, cols2=None):
    if cols1 is None: cols1 = df1.columns
    if cols2 is None: cols2 = df2.columns
    for i in reversed(range(df2.shape[0] - 1)):
        for col1, col2 in zip(cols1, cols2):
            if np.isnan(df2[col2].iloc[i]):
                val = df2[col2].iloc[i+1] + df2[col2].iloc[i+1] * df1[col1].iloc[i]
                df2[col2].iloc[i] = val
    return df1, df2
df1, df2 = fill_values(df1, df2)
print(df2)
Loc1 Loc2
1983-12-31 0.140160 0.136329
1984-12-31 0.169291 0.177413
1985-12-31 0.252212 0.235614
1986-12-31 0.300550 0.261526
1987-12-31 0.554444 0.261457
1988-12-31 0.544976 0.524925
1989-12-31 0.837202 0.935388
1990-12-31 0.809117 0.902741
1991-12-31 1.384158 1.544128
1992-12-31 1.745144 2.631024
1993-12-31 2.541500 3.212600
This assumes that the rows in df1 and df2 correspond perfectly (I'm not querying the index, but only the location). Hope it helps!
Just to be clear, what you need is Loc1[year]=Loc1[next_year] + PChange[year]*Loc1[next_year], right?
The loop below will do what you are looking for, but it assumes that the number of rows in both dataframes is always equal (it works by position instead of matching values in the index). From your description, I think this works for your data.
for i in range(df2.shape[0] - 2, -1, -1):
    # 'P Change_1' contains a space, so use label indexing rather than attribute access
    df2['Loc1'].iloc[i] = df2['Loc1'].iloc[i+1] + (df1['P Change_1'].iloc[i] * df2['Loc1'].iloc[i+1])
Hope this helps :)
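Since the recurrence is just Loc[i] = Loc[i+1] * (1 + PChange[i]), it can also be unrolled without an explicit Python loop. A hedged, vectorized sketch for a single column (assuming, as above, that the rows of df1 and df2 line up positionally and only the last value of Loc1 is known):
import numpy as np

factors = 1 + df1['P Change_1'].to_numpy()   # growth factor for stepping back one row
last = df2['Loc1'].iloc[-1]

# each earlier value is the known final value times the product of the factors
# between that row and the final row; the final row itself keeps a factor of 1
df2['Loc1'] = last * np.concatenate([
    np.cumprod(factors[:-1][::-1])[::-1],
    [1.0],
])
For example, the second-to-last value becomes 2.5415 * (1 - 0.313341) ≈ 1.745144, matching the looped results above; the same pattern can be repeated for Loc2 with 'P Change_2'.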

Using .ix loses headers

How come when I use:
dfa=df1["10_day_mean"].ix["2015":"2015"]
The dataframe dfa has no header?
dfa:
Date
2015-01-10 2.000000
2015-01-20 3.000000
df1:
10day_mean Attenuation Channel1 Channel2 Channel3 Channel4 \
Date
2004-02-27 3.025 2.8640 NaN NaN NaN NaN
Is there a way to change the header of dfa? When I plot it, my legend shows 10_day_mean, and I wish to relabel it as "Daily mean of every 10 days".
Thanks guys
I tried
dfa=dfa.rename(columns={0:"rename"})
and
dfa=dfa.rename(columns={"10day_mean":"rename"})
But then it says None.
Your confusion here is that when you do this:
dfa=df1["10_day_mean"].ix["2015":"2015"]
this returns a Series rather than a DataFrame, so the output doesn't show a column name above the values; instead, the name is shown at the bottom in the summary info as Name.
To get the output you desired you can use double subscripting to force a dataframe with a single column to be returned:
dfa=df1[["10_day_mean"]].ix["2015":"2015"]
Example:
In [90]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[90]:
a b c
0 -1.002036 -1.703049 2.123096
1 0.497920 1.556211 -1.807895
2 0.400020 -0.703138 1.452735
3 -0.296604 -0.227155 -0.311047
4 -0.314948 -0.654925 -0.434458
In [91]:
df['a'].iloc[2:4]
Out[91]:
2 0.400020
3 -0.296604
Name: a, dtype: float64
In [92]:
df[['a']].iloc[2:4]
Out[92]:
a
2 0.400020
3 -0.296604
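To address the relabelling part of the question: a Series has a name rather than a column header, so for the plot legend you can either rename the Series or rename the column of a single-column DataFrame. A small sketch, assuming the column is called "10_day_mean" as in the question and using .loc since .ix is deprecated:
# Series: give it a new name, which is what the legend picks up
dfa = df1["10_day_mean"].loc["2015":"2015"].rename("Daily mean of every 10 days")

# or keep a DataFrame (double brackets) and rename its column
dfa = (df1.loc["2015":"2015", ["10_day_mean"]]
          .rename(columns={"10_day_mean": "Daily mean of every 10 days"}))
dfa.plot(legend=True)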
