I am looking for an iterative way to append values from a maptable to my reference df. In its simplest form, this can be achieved using a left merge.
The challenges that I am facing are:
The new column might already be present in my df (but containing empty values)
There might be duplicate indices (e.g. id1 contains 1 twice and 2 twice)
In each iteration, I only want to fill the empty values in df (column col_to_map)
To give you an example:
import numpy as np
import pandas as pd

df_dict = {
    'id1': [1, 2, 3, None, 1, 2],
    'id2': ['a', 'b', 'c', 'd', None, None],
    'val1': [None, None, None, None, None, None],
    'val2': ['21a', None, None, None, None, None],
}
map_dict = {
    'id1': [5, 4, 3, 5, np.nan, np.nan],
    'id2': ['e', 'd', None, None, 'b', None],
    'val1': ['15e', '14d', '13c', '15e', '12b', 'x1'],
    'val2': ['25e', '24d', None, None, None, 'x2'],
    'val3': ['35e', '34d', '33c', None, '32b', None],
}
df = pd.DataFrame.from_dict(df_dict)
maptable = pd.DataFrame.from_dict(map_dict)

for id_col in ['id1', 'id2']:
    for col_to_map in ['val1', 'val2', 'val3']:
        print(f'map {col_to_map} using id {id_col}')
        # logic here to append only the non-empty values
        df = map_iteration(df=df, maptable=maptable, id_col=id_col, col_to_map=col_to_map)
WHAT I TRIED AND WHERE I AM STUCK
I tried the following for map_iteration(), but I receive the error "ValueError: cannot reindex from a duplicate axis":
def map_iteration(df, maptable, id_col, col_to_map):
    """Map empty values in df col_to_map using the maptable and on the id_col"""
    # Add column to df
    if col_to_map not in df:
        df.insert(len(df.columns), column=col_to_map, value=None)
    # Take out invalid ids in maptable
    maptable = maptable[~maptable[id_col].isnull() & ~maptable[id_col].isna()]
    maptable = maptable[~maptable[id_col].duplicated(keep='first')]
    # Target rows
    elig_rows = df.loc[:, col_to_map].isnull() & ~df.loc[:, id_col].isnull() & ~df.loc[:, id_col].isna()
    # To string ids
    df.loc[:, id_col] = df.loc[:, id_col].astype(str).str.strip()
    maptable.loc[:, id_col] = maptable.loc[:, id_col].astype(str).str.strip()
    # Strip maptable
    m = maptable.loc[:, [id_col, col_to_map]]
    # Merge
    df_left = df.loc[elig_rows, [id_col]].merge(m, how='left', on=id_col)
    # Indexing
    df_left = df_left.set_index(id_col)
    df = df.set_index(id_col)
    # Assign
    df.loc[df.index[elig_rows], col_to_map] = df_left.loc[:, col_to_map]
    # Drop index
    df = df.reset_index()
    return df
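One way to sidestep the duplicate-axis error (a sketch of my own, not guaranteed to cover every edge case above) is to avoid set_index on df entirely and use Series.map, which looks values up by id rather than by index label, so duplicate ids on the df side are harmless:

def map_iteration(df, maptable, id_col, col_to_map):
    """Fill empty values in df[col_to_map] by looking up df[id_col] in maptable."""
    if col_to_map not in df:
        df[col_to_map] = None
    # Drop rows with missing ids and keep only the first occurrence of duplicates
    m = maptable.dropna(subset=[id_col]).drop_duplicates(subset=[id_col], keep='first')
    # Build an id -> value lookup Series with a unique index
    lookup = m.set_index(id_col)[col_to_map]
    # Look up every id; map() aligns on values, so repeated ids in df are fine
    mapped = df[id_col].map(lookup)
    # Fill only the rows where col_to_map is currently empty
    df[col_to_map] = df[col_to_map].fillna(mapped)
    return df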
I came up with a solution:
# Walk over every cell; where it is empty (falsy), copy the value
# from the same position in maptable
for x in df.columns.values:
    for ix, val in enumerate(df[x].values):
        if not df[x].values[ix]:
            df.at[ix, x] = maptable[x].values[ix]
returns:
id1 id2 val1 val2
0 1.0 a 15e 21a
1 2.0 b 14d 24d
2 3.0 c 13c None
3 5.0 d 15e None
Still not the most elegant solution though; someone may come up with a list comprehension looping over two columns of df simultaneously, etc., as in the sketch below.
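For illustration, a zip-based comprehension over the same column of both frames might look like this (my sketch, assuming df and maptable are positionally aligned as in the example):

df['val1'] = [
    m if pd.isnull(v) else v  # keep v when present, otherwise take the maptable value
    for v, m in zip(df['val1'], maptable['val1'])
]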
Well, if this is too cumbersome (and yes it is), one solution may be to find all null values in your df and fill them with the exact correspondences from the map table.
such as:
import numpy as np

# Find null places in df
nulls = np.array(df.isnull())
# Find row and column indices of these null cells
row_ind, col_ind = np.where(nulls == True)
# Fill them
df_cols = df.columns.values
for ix, col in enumerate(col_ind):
    df.at[row_ind[ix], df_cols[col]] = maptable[df_cols[col]].values[row_ind[ix]]
still returns:
id1 id2 val1 val2
0 1.0 a 15e 21a
1 2.0 b 14d 24d
2 3.0 c 13c None
3 5.0 d 15e None
Related
I did a pandas merge and now have two columns - col_x and col_y. I'd like to fill values in col_x with col_y, but only for rows where col_y is not NaN. I'd like to keep the original values in col_x and only replace from col_y if NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'i': [0, 1, 2, 3],
    'c': [np.nan, {'a': 'A'}, np.nan, {'b': 'B'}],
    'd': [{'c': 'C'}, np.nan, {'d': 'D'}, np.nan]
})
Expected output:
i c d
0 {'c':'C'} {'c':'C'}
1 {'a':'A'} np.nan
2 {'d':'D'} {'d':'D'}
3 {'b':'B'} np.nan
Are you just trying to fillna?
df.c.fillna(df.d, inplace=True)
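Note that in recent pandas versions, calling fillna with inplace=True on a column selected via attribute access can trigger chained-assignment warnings; assigning the result back avoids that:

df['c'] = df['c'].fillna(df['d'])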
You can use np.where()
So something like
df['c'] = np.where(df['c'].isna(), df['d'], df['c'])
should do the trick! The first parameter is the condition to check, the second is what to return if the condition is true, and the third is what to return if the condition is false.
Try:
df["c"] = [y if str(x) == "nan" else x for x,y in zip(df.c,df.d)]
There's probably a cleaner way, but this is one line.
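One cleaner built-in alternative, for what it's worth, is Series.combine_first, which keeps values from c and falls back to d where c is null:

df['c'] = df['c'].combine_first(df['d'])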
I've been looking for ways to do this natively for a little while now and can't find a solution.
I have a large dataframe where I would like to set the value in other_col to 'True' for all rows where one of a list of columns is empty.
This works for a single column page_title:
df.loc[df['page_title'].isna(), ['other_col']] = ''
But not when using a list
df.loc[df[['page_title','brand','name']].isna(), ['other_col']] = ''
Any ideas of how I could do this without using Numpy or looping through all rows?
Thanks
Maybe this is what you are looking for:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['1', '2', '3', np.nan],
    'B': ['10', np.nan, np.nan, '40'],
    'C': ['test', 'test', 'test', 'test']})
df.loc[df[['A', 'B']].isna().any(axis=1), ['C']] = 'value'
print(df)
print(df)
Result:
A B C
0 1 10 test
1 2 NaN value
2 3 NaN value
3 NaN 40 value
This lets you choose which columns to check for np.nan and sets a True/False indicator accordingly:
import numpy as np
import pandas as pd

data = {
    'Column1': [1, 2, 3, np.nan],
    'Column2': [1, 2, 3, 4],
    'Column3': [1, 2, np.nan, 4]
}
df = pd.DataFrame(data)
df['other_col'] = np.where((df['Column1'].isna()) | (df['Column2'].isna()) | (df['Column3'].isna()), True, False)
df
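If the list of columns grows, chaining | conditions gets unwieldy; the same indicator can be computed from a list of column names (an equivalent rewrite of the line above):

cols = ['Column1', 'Column2', 'Column3']
df['other_col'] = df[cols].isna().any(axis=1)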
I want to update the dataframe df with the values coming from another dataframe df_new if some condition holds true.
The indexes and the column names of the dataframes do not match. How could it be done?
import pandas as pd

names = ['a', 'b', 'c']
df = pd.DataFrame({
    'val': [10, 10, 10],
}, index=names)

new_names = ['a', 'c', 'd']
df_new = pd.DataFrame({
    'profile': [5, 15, 22],
}, index=new_names)
above_max = df_new['profile'] >= 7
# This works only if indexes of df and df_new match
#df.loc[above_max, 'val'] = df_new['profile']
# expected df:
# val
# a 10
# b 10
# c 15
One idea is Series.reindex, to match the index of the mask with the other DataFrame:
s = df_new['profile'].reindex(df.index)
above_max = s >= 7
df.loc[above_max, 'val'] = s
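Annotated, the idea is that reindex drops 'd' (absent from df) and introduces NaN for 'b', and NaN >= 7 evaluates to False, so 'b' keeps its original value:

s = df_new['profile'].reindex(df.index)  # a -> 5, b -> NaN, c -> 15; 'd' is dropped
above_max = s >= 7                       # False for 'a' and for the NaN at 'b'
df.loc[above_max, 'val'] = s             # only 'c' is overwritten, giving val = 15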
I would like to get the records of dataframe df whose values in column C equal one of a list of specified quantiles.
For a single quantile this works:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'], 'C': [1, 2, 3, 4, 5]})
print(df[df['C'] == df['C'].quantile(q=0.25)])
and outputs:
A C
1 b 2
but it looks clunky to me, and also fails when there are multiple quantiles: print(df[df['C'] == df['C'].quantile(q = [0.25, 0.75])]) throws ValueError: Can only compare identically-labeled Series objects
related to Retrieve the Kth quantile within each group in Pandas
You can do it this way:
All you have to do is keep your desired quantiles in a list, as shown below:
You will have your result in final_df
quantile_list = [0.1, 0.5, 0.4]
final_df = pd.DataFrame(columns=df.columns)
for i in quantile_list:
    temp = df[df['C'] == df['C'].quantile(q=i)]
    final_df = pd.concat([final_df, temp])
final_df.reset_index(drop=True, inplace=True)  # optional, in case you want to reset the index
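A more compact, vectorized alternative (my suggestion, not part of the answer above) is Series.isin, which also sidesteps the identically-labeled-Series error because it compares against a collection of values instead of aligning two Series:

quantiles = df['C'].quantile(q=[0.25, 0.75])
result = df[df['C'].isin(quantiles)]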
I have column names in a dictionary and would like to select those columns from a dataframe.
In the example below, how do I select dictionary values 'b', 'c' and save it in to df1?
import pandas as pd
ds = {'cols': ['b', 'c']}
d = {'a': [2, 3], 'b': [3, 4], 'c': [4, 5]}
df_in = pd.DataFrame(data=d)
print(ds)
print(df_in)
df_out = df_in[[ds['cols']]]
print(df_out)
TypeError: unhashable type: 'list'
Remove nested list - []:
df_out = df_in[ds['cols']]
print(df_out)
b c
0 3 4
1 4 5
According to the ref, you just need to drop one set of brackets.
df_out = df_in[ds['cols']]
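As a side note (my addition): if some of the names in ds['cols'] might be missing from the frame, plain indexing raises a KeyError, whereas DataFrame.reindex returns those columns filled with NaN instead:

df_out = df_in.reindex(columns=ds['cols'])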