Groupby, apply function and combine results in dataframe - python

I would like to group the ids by the Type column and apply a function to each group that returns the first row where the Value column is not NaN, copying that row into a separate data frame.
I got the following so far:
Dummy data:
import pandas as pd

df1 = {'Date': ['04.12.1998', '05.12.1998', '06.12.1998', '04.12.1998', '05.12.1998', '06.12.1998'],
       'Type': [1, 1, 1, 2, 2, 2],
       'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns=['Date', 'Type', 'Value'])
print(df2)
         Date  Type Value
0  04.12.1998     1   NaN
1  05.12.1998     1   100
2  06.12.1998     1   120
3  04.12.1998     2   NaN
4  05.12.1998     2   NaN
5  06.12.1998     2    20
# Attempt: collect the first row with a valid Value (works for one group only)
selectedStockDates = {'Date': [], 'Type': [], 'Value': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns=['Date', 'Type', 'Value'])
first_valid_index = df2['Value'].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?

Let's mask the rows of the dataframe where the Value column is NaN, then group the dataframe on Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
   Type        Date  Value
0   1.0  05.12.1998  100.0
1   2.0  06.12.1998   20.0

Just use groupby and first, but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
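For example, converting the placeholder strings into real missing values first (a minimal sketch, assuming the sample data above):
import pandas as pd

# the 'NaN' strings are ordinary values, not missing data; coerce them to np.nan
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')

# GroupBy.first() skips NaN, so this yields the first valid Value per Type
print(df2.groupby('Type')['Value'].first())
# Type
# 1    100.0
# 2     20.0
# Name: Value, dtype: float64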


How to fill up missing values in a column of one dataframe based on its common rows in another dataframe in pandas

I have two dataframes in Python. I want to update rows in the first dataframe using matching values from the second dataframe, which serves as an override.
Here is an example with sample data and code:
DataFrame 1:
   Code      Name  Value
0     1  Company1    200
1     2  Company2    300
2     3  Company3    400
DataFrame 2:
   Code      Name  Value
0     2  Company2   1000
I want to update dataframe 1 based on matching Code and Name. In this example, dataframe 1 should be updated as below:
   Code      Name  Value
0     1  Company1    200
1     2  Company2   1000
2     3  Company3    400
Note: the row with Code = 2 and Name = Company2 is updated with Value 1000 (coming from DataFrame 2).
import pandas as pd
data1 = {
    'Code': [1, 2, 3],
    'Name': ['Company1', 'Company2', 'Company3'],
    'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns=['Code', 'Name', 'Value'])
data2 = {
    'Code': [2],
    'Name': ['Company2'],
    'Value': [1000],
}
df2 = pd.DataFrame(data2, columns=['Code', 'Name', 'Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
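Note that update upcasts the updated column to float (hence the 200.0 etc. above). Assuming no NaN values remain in the column, a small follow-up to get integers back:
df1['Value'] = df1['Value'].astype(int)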
You can use concat + drop_duplicates, which updates the common rows and also adds any new rows from df2:
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
   Code      Name  Value
0     1  Company1    200
0     2  Company2   1000
2     3  Company3    400
Update, based on the comments below:
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(inplace=True)  # drop=True would discard the Code/Name columns
You can merge the data first and then use numpy.where:
import numpy as np

updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
There is an update function available. Note that it aligns on the index, so with this sample data you need to set a matching index first (as in the first answer above); a bare df1.update(df2) would align row 0 with row 0 and overwrite the wrong company.
Example:
df1.set_index('Code', inplace=True)
df1.update(df2.set_index('Code'))
For more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
#    Code      Name   Value
# 0     1  Company1   200.0
# 1     2  Company2  1000.0
# 2     3  Company3   400.0
Assuming company and code are redundant identifiers, you can also do
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
#    Code      Name  Value
# 0     1  Company1    200
# 1     2  Company2   1000
# 2     3  Company3    400
You can use pd.Series.where on the result of left-joining df1 and df2:
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
in order to cast the values back to integers.
There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how='left', on='Code')
Pandas will create columns with the suffix '_x' (for your left dataframe) and '_y' (for your right dataframe).
You want the ones that came from the right, so drop the '_x' columns and rename the '_y' ones:
for col in df_merged.columns:
    if col.endswith('_x'):
        df_merged.drop(columns=col, inplace=True)
    elif col.endswith('_y'):
        # str.strip('_y') would also eat a real trailing 'y' or '_' from the
        # column name, so slice the two-character suffix off instead
        df_merged.rename(columns={col: col[:-2]}, inplace=True)
Append the dataset
Drop the duplicates by Code
Sort the values
combined_df = df1.append(df2).drop_duplicates(['Code'], keep='last').sort_values('Code')
(In pandas 2.0+, DataFrame.append has been removed; use pd.concat([df1, df2]) instead.)
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
# note: this assumes df2's rows appear in the same order as the matching rows of df1
df1.loc[indexes, 'Value'] = df2['Value'].values

pandas dataframe boolean indexing with multiple conditions from another df

I'm trying to identify the rows shared between 2 dataframes, i.e. where certain columns have the same values in the SAME row.
Example:
import pandas as pd
df = pd.DataFrame([{'energy': 'power', 'id': '123'}, {'energy': 'gas', 'id': '456'}])
df2 = pd.DataFrame([{'energy': 'power', 'id': '456'}, {'energy': 'power', 'id': '123'}])
df =
  energy   id
0  power  123
1    gas  456
df2 =
  energy   id
0  power  456
1  power  123
Therefore, I'm trying to get the rows from df where both energy & id match exactly in the same row of df2.
If I do it like this, I get a wrong result:
df2.loc[(df2['energy'].isin(df['energy'])) & (df2['id'].isin(df['id']))]
because this matches both rows of df2, whereas I would expect only power / 123 to be matched.
How should I do boolean indexing with multiple "dynamic" conditions based on another df's rows, matching the values for the same rows in the other df?
Hope it's clear
pd.merge(df, df2, on=['id','energy'], how='inner')
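If you specifically want a boolean mask over df rather than a merged result, here is a sketch (assuming pandas >= 0.24 for MultiIndex.from_frame):
# build row-wise (energy, id) keys and test membership pair by pair
keys = pd.MultiIndex.from_frame(df[['energy', 'id']])
mask = keys.isin(pd.MultiIndex.from_frame(df2[['energy', 'id']]))
print(df[mask])
#   energy   id
# 0  power  123
Unlike the two independent isin calls in the question, this tests each (energy, id) pair as a unit.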

How to use dropna to drop columns on a subset of columns in Pandas

I want to use Pandas' dropna function on axis=1 to drop columns, but only on a subset of columns with some thresh set. More specifically, I want to pass an argument on which columns to ignore in the dropna operation. How can I do this? Below is an example of what I've tried.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'building': ['bul2', 'bul2', 'cap1', 'cap1'],
'date': ['2019-01-01', '2019-02-01', '2019-01-01', '2019-02-01'],
'rate1': [301, np.nan, 250, 276],
'rate2': [250, 300, np.nan, np.nan],
'rate3': [230, np.nan, np.nan, np.nan],
'rate4': [230, np.nan, 245, np.nan],
})
# Only retain columns with at least 3 non-missing values (thresh=3)
df.dropna(axis=1, thresh=3)
  building        date  rate1
0     bul2  2019-01-01  301.0
1     bul2  2019-02-01    NaN
2     cap1  2019-01-01  250.0
3     cap1  2019-02-01  276.0
# Try to do the same but only apply dropna to the subset of [building, date, rate1, and rate2],
# (meaning do NOT drop rate3 and rate4)
df.dropna(axis=1, thresh=3, subset=['building', 'date', 'rate1', 'rate2'])
KeyError: ['building', 'date', 'rate1', 'rate2']
# Desired subset of columns against which to apply `dropna`.
cols = ['building', 'date', 'rate1', 'rate2']
# Apply `dropna` and see which columns remain.
filtered_cols = df.loc[:, cols].dropna(axis=1, thresh=3).columns
# Use a conditional list comprehension to determine which columns were dropped.
dropped_cols = [col for col in cols if col not in filtered_cols]
# Use a conditional list comprehension to display all columns other than those that were dropped.
new_cols = [col for col in df if col not in dropped_cols]
>>> df[new_cols]
  building        date  rate1  rate3  rate4
0     bul2  2019-01-01  301.0  230.0  230.0
1     bul2  2019-02-01    NaN    NaN    NaN
2     cap1  2019-01-01  250.0    NaN  245.0
3     cap1  2019-02-01  276.0    NaN    NaN
I find it easiest to first count the non-null values in each column and then apply your criteria:
# Count non-null values in each column
notnulls = df.notnull().sum()
# Find columns with at least 3 non-null values (the equivalent of thresh=3)
notnull_cols = notnulls[notnulls >= 3].index
# Subset df to these columns
df[notnull_cols]
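To also honor the subset (i.e. leave rate3 and rate4 untouched), here is a sketch with a hypothetical helper, dropna_subset (the name and signature are mine, not pandas API):
def dropna_subset(df, subset, thresh):
    # count non-missing values, but only in the subset columns
    counts = df[subset].notna().sum()
    # drop the subset columns that fall below the threshold; other columns stay
    return df.drop(columns=counts[counts < thresh].index)

print(dropna_subset(df, ['building', 'date', 'rate1', 'rate2'], thresh=3))
#   building        date  rate1  rate3  rate4
# 0     bul2  2019-01-01  301.0  230.0  230.0
# 1     bul2  2019-02-01    NaN    NaN    NaN
# 2     cap1  2019-01-01  250.0    NaN  245.0
# 3     cap1  2019-02-01  276.0    NaN    NaN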

Pandas: How to fill missing Year, Week columns?

I have a dataframe in which some [Year] & [Week] rows are missing. I have another dataframe that is a reference calendar, from which I can get these missing values. How can I fill in these missing rows using pandas?
I have tried using reindex to set them up, but I am getting the following error
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week': [1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7,8,9,10]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
df1 = df1.reindex(df2, fill_value=0)
print(df1)
You should pass the index instead, i.e. df2.index:
df1.reindex(df2.index,fill_value=0)
Out[851]:
           Value
Year Week
2019 1        20
     2        40
     3         0
     4        60
     5         0
     6        75
     7        90
df2.index.difference(df1.index)
Out[854]:
MultiIndex(levels=[[2019], [3, 5]],
           labels=[[0, 0], [0, 1]],
           names=['Year', 'Week'],
           sortorder=0)
Update: to fill only the internal gaps with 0 and drop the trailing weeks that have no data yet:
s = df1.reindex(df2.index)
s[s.bfill().notnull().values].fillna(0)
Out[877]:
            Value
Year Week
2019 1       20.0
     2       40.0
     3        0.0
     4       60.0
     5        0.0
     6       75.0
     7       90.0
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week': [1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
fill_value = df1['Value'].mean()  # value to fill NaN rows with - choose another statistic if you do not want the mean
df1 = df1.join(df2, how='right')
df1 = df1.fillna(value=fill_value)  # fill the missing data here
print(df1)

extract values from a data frame

The first and the second data frames are as below:
import pandas as pd
d = {'0': [2154,799,1023,4724], '1': [27, 2981, 952,797],'2':[4905,569,4767,569]}
df1 = pd.DataFrame(data=d)
and
d = {'PART_NO': ['J661-03982','661-08913','922-8972','661-00352','661-06291'], 'PART_NO_ENCODED': [2154,799,1023,27,569]}
df2 = pd.DataFrame(data=d)
I want to get the corresponding part_no for each row in df1 so the resulting data frame should look like this:
d={'PART_NO': ['J661-03982','661-00352',''], 'PART_NO_ENCODED': [2154,27,4905]}
df3 = pd.DataFrame(data=d)
This I can achieve like this:
df2.set_index('PART_NO_ENCODED').reindex(df1.iloc[0,:]).reset_index().rename(columns={0:'PART_NO_ENCODED'})
But instead of passing one row at a time with reindex(df1.iloc[0, :]), I want to get the corresponding PART_NO for all the rows in df1. Please help?
You can use the second dataframe as a dictionary of replacements:
df3 = df1.replace(df2.set_index('PART_NO_ENCODED').to_dict()['PART_NO'])
The values that are not in df2 will not be replaced. They have to be identified and discarded:
df3 = df3[df1.isin(df2['PART_NO_ENCODED'].tolist())]
#             0          1          2
# 0  J661-03982  661-00352        NaN
# 1   661-08913        NaN  661-06291
# 2    922-8972        NaN        NaN
# 3         NaN        NaN  661-06291
You can later replace the missing values with '' or any other value of your choice with fillna.
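An equivalent approach, sketched with stack/map/unstack, maps every cell through a Series lookup and then restores the original shape:
# encoded value -> PART_NO lookup
mapping = df2.set_index('PART_NO_ENCODED')['PART_NO']
# flatten df1 to a Series, map each cell, then restore the 2-D shape
df3 = df1.stack().map(mapping).unstack()
print(df3.fillna(''))  # codes missing from df2 (e.g. 4905) become ''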
