Map dictionary values to key values in a dataframe column - python

I have a dataframe:
values
NaN
NaN
[1,2,5]
[2]
[5]
And a dictionary
{nan: nan,
'1': '10',
'2': '11',
'5': '12',}
The dataframe contains keys from the dictionary.
How can I replace these keys with the corresponding values from the same dictionary?
Output:
values
NaN
NaN
[10,11,12]
[11]
[12]
I have tried
so_df['values'].replace(my_dictionary, inplace=True)
so_df.head()

You can use the apply() method of a pandas DataFrame. Check the implementation below:
import pandas as pd
import numpy as np

df = pd.DataFrame([np.nan,
                   np.nan,
                   ['1', '2', '5'],
                   ['2'],
                   ['5']], columns=['values'])
my_dict = {np.nan: np.nan,
           '1': '10',
           '2': '11',
           '5': '12'}
def update(row):
    if isinstance(row['values'], list):
        row['values'] = [my_dict.get(val) for val in row['values']]
    else:
        row['values'] = my_dict.get(row['values'])
    return row

df = df.apply(update, axis=1)
A simple implementation. Just make sure that if your dataframe contains strings, your dictionary keys are strings as well.
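If you only need to transform the single column, a lighter sketch with Series.map avoids rebuilding whole rows (same df and my_dict as above; map_cell is just an illustrative name):
def map_cell(cell):
    # Map every element of a list cell through my_dict;
    # the fallback keeps unmapped values (including NaN) unchanged
    if isinstance(cell, list):
        return [my_dict.get(v, v) for v in cell]
    return my_dict.get(cell, cell)

df['values'] = df['values'].map(map_cell)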

Try:
df['values'] = (pd.to_numeric(df['values'].explode().astype(str).map(my_dict),
                              errors='coerce')
                  .groupby(level=0)
                  .agg(list))
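Broken into steps, the chain does roughly the following (a sketch using the same df and my_dict as in the other answers):
s = df['values'].explode()                   # one row per list element; original index preserved
s = s.astype(str).map(my_dict)               # look each element up as a string key
s = pd.to_numeric(s, errors='coerce')        # unmapped values (e.g. the string 'nan') become NaN
df['values'] = s.groupby(level=0).agg(list)  # collect elements back into one list per row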

Setup
import pandas as pd
import numpy as np

df = pd.DataFrame({'values': [np.nan, np.nan, [1, 2, 5], [2], [5]]})
my_dict = {np.nan: np.nan, '1': '10', '2': '11', '5': '12'}
Use Series.explode with Series.map
df['values'] = (df['values'].explode()
                            .astype(str)
                            .map(my_dict)
                            .dropna()
                            .astype(int)
                            .groupby(level=0)
                            .agg(list))
If there are other strings in your values column, you need pd.to_numeric with errors='coerce'; to keep the original value where nothing matched, you can do:
df['values'] = (pd.to_numeric(df['values'].explode()
                                          .astype(str)
                                          .replace(my_dict),
                              errors='coerce')
                  .dropna()
                  .groupby(level=0)
                  .agg(list)
                  .fillna(df['values']))
Output
values
0 NaN
1 NaN
2 [10, 11, 12]
3 [11]
4 [12]
UPDATE
A solution without explode:
df['values'] = (pd.to_numeric(df['values'].apply(pd.Series)
                                          .stack()
                                          .reset_index(level=1, drop=True)
                                          .astype(str)
                                          .replace(my_dict),
                              errors='coerce')
                  .dropna()
                  .groupby(level=0)
                  .agg(list)
                  .fillna(df['values']))

Related

Apply a function on two pandas tables

I have the following two tables:
>>> df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
... '2': ['mike', '30', 'ana'],
... '3': ['ana', '20', 'mike'],
... '4': ['eve', 'eve', 'eve'],
... '5': ['10', np.NaN, '10'],
... '6': [np.NaN, np.NaN, '20']},
... index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df1
1 2 3 4 5 6
index
ind1 john mike ana eve 10 NaN
ind2 10 30 20 eve NaN NaN
ind3 john ana mike eve 10 20
>>> df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
...                    index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
>>> df2
first_n
index
ind1 4
ind2 4
ind3 3
I also have the following function that reverses a list and gets the first n non-NA elements:
def get_rev_first_n(row, top_n):
    rev_row = [x for x in row[::-1] if x == x]
    return rev_row[:top_n]
>>> get_rev_first_n(['john', 'mike', 'ana', 'eve', '10', np.NaN], 4)
['10', 'eve', 'ana', 'mike']
How would I apply this function to the two tables so that it takes in both df1 and df2 and outputs either a list or columns?
df = pd.concat([df1, df2], axis=1)
df.apply(get_rev_first_n, args=[4], axis=1)  # pass 4 as top_n
axis=1 applies the function to each row; the default axis=0 would pass each column instead, which is not what you want here.
args=[4] is forwarded as the second argument (top_n) of get_rev_first_n.
You can try apply with a lambda on each row of the dataframe. I just concatenated the two df's using concat and applied your method to each row of the resulting dataframe.
Full Code:
import pandas as pd
import numpy as np
def get_rev_first_n(row, top_n):
    rev_row = [x for x in row[::-1] if x == x]
    # rev_row[0] is the first_n value itself (the last column after the concat), so skip it
    return rev_row[1:top_n + 1]
df1 = pd.DataFrame(data={'1': ['john', '10', 'john'],
                         '2': ['mike', '30', 'ana'],
                         '3': ['ana', '20', 'mike'],
                         '4': ['eve', 'eve', 'eve'],
                         '5': ['10', np.NaN, '10'],
                         '6': [np.NaN, np.NaN, '20']},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df2 = pd.DataFrame(data={'first_n': [4, 4, 3]},
                   index=pd.Series(['ind1', 'ind2', 'ind3'], name='index'))
df3 = pd.concat([df1, df2.reindex(df1.index)], axis=1)
df = df3.apply(lambda row: get_rev_first_n(row, row['first_n']), axis=1)
print(df)
Output:
index
ind1    [10, eve, ana, mike]
ind2       [eve, 20, 30, 10]
ind3           [20, 10, eve]
dtype: object
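Using the question's original get_rev_first_n (with the plain rev_row[:top_n] slice), a sketch that skips the concat and looks up each row's n in df2 directly might be:
# row.name is the index label ('ind1', ...), used to fetch that row's first_n from df2
df1.apply(lambda row: get_rev_first_n(row, df2.loc[row.name, 'first_n']), axis=1)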

How can I add a column with dates (starting from today)?

How can I add dates (starting from today) to column C? Column B is an example of what I want to get.
df = pd.DataFrame({'N': ['1', '2', '3', '4'], 'B': ['16.11.2021', '17.11.2021', '18.11.2021', '19.11.2021'], 'C': ['nan', 'nan', 'nan', 'nan']})
If I understood your question correctly, you want something like this:
import datetime

base = datetime.datetime.today()
# len(df) consecutive dates starting from today, formatted as dd.mm.YYYY
date_list = [(base + datetime.timedelta(days=x)).strftime('%d.%m.%Y') for x in range(len(df))]
df['C'] = date_list
Run on 16.11.2021, this produces the same result as in column B.
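An equivalent sketch with pd.date_range (assuming the same df) avoids building the list by hand:
import pandas as pd

# len(df) consecutive days starting today, rendered as dd.mm.YYYY strings
df['C'] = pd.date_range(start=pd.Timestamp.today().normalize(),
                        periods=len(df)).strftime('%d.%m.%Y')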

Merging dataframes based on a column's values of a dataframe

I have a dictionary of dataframes (df_t). Each dataframe has exactly the same two columns but a different number of rows. I merged all the dataframes into one dataframe.
However, I would like to use df_A['label_1'] as a reference and drop df_B['label_2'] and df_C['label_3'], like in the example:
df_A = pd.DataFrame({'A': [1, 0.5, 1, 0.5],'label_1': ['-1', '1', '-1', '1']})
df_B = pd.DataFrame({'A': [1, 1.5, 2.3],'label_2': ['-1', '1','-1']})
df_C = pd.DataFrame({'A': [2.1, 5.5],'label_3': ['-1', '1']})
df_t = {'1': df_A, '2': df_B, '3': df_C}
#d = { k: v.set_index('label') for k, v in df_t.items()}
dfx = pd.concat(df_t, axis=1)
dfx.columns = dfx.columns.droplevel(0)
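Assuming the goal is to keep only df_A's label column after the concat, a minimal sketch (column names taken from the example above) would drop the other two label columns:
dfx = pd.concat(df_t, axis=1)
dfx.columns = dfx.columns.droplevel(0)
dfx = dfx.drop(columns=['label_2', 'label_3'])  # keep the 'A' columns and label_1 only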

Which Pandas function do I need? group_by or pivot

I'm still relatively new to Pandas and I can't tell which of the functions I'm best off using to get to my answer. I have looked at pivot, pivot_table, group_by and aggregate but I can't seem to get it to do what I require. Quite possibly user error, for which I apologise!
I have data like this:
Code to create df:
import pandas as pd
df = pd.DataFrame([
    ['1', '1', 'A', 3, 7],
    ['1', '1', 'B', 2, 9],
    ['1', '1', 'C', 2, 9],
    ['1', '2', 'A', 4, 10],
    ['1', '2', 'B', 4, 0],
    ['1', '2', 'C', 9, 8],
    ['2', '1', 'A', 3, 8],
    ['2', '1', 'B', 10, 4],
    ['2', '1', 'C', 0, 1],
    ['2', '2', 'A', 1, 6],
    ['2', '2', 'B', 10, 2],
    ['2', '2', 'C', 10, 3]
], columns=['Field1', 'Field2', 'Type', 'Price1', 'Price2'])
print(df)
I am trying to get data like this:
My end goal is to end up with one column for A, one for B and one for C, where A uses Price1 and B & C use Price2.
I don't necessarily want the max, min, average or sum of the Price, as theoretically (although unlikely) there could be two different Price1's for the same Fields & Type.
What's the best function to use in Pandas to get to what I need?
Use DataFrame.set_index with DataFrame.unstack for the reshape. The output has a MultiIndex in the columns, so sort the second level with DataFrame.sort_index, flatten the column tuples, and finally turn the Field levels back into columns with reset_index:
df1 = (df.set_index(['Field1', 'Field2', 'Type'])
         .unstack(fill_value=0)
         .sort_index(axis=1, level=1))
df1.columns = [f'{b}-{a}' for a, b in df1.columns]
df1 = df1.reset_index()
print(df1)
Field1 Field2 A-Price1 A-Price2 B-Price1 B-Price2 C-Price1 C-Price2
0 1 1 3 7 2 9 2 9
1 1 2 4 10 4 0 9 8
2 2 1 3 8 10 4 0 1
3 2 2 1 6 10 2 10 3
A solution with DataFrame.pivot_table is also possible, but it aggregates values when the first three columns contain duplicates (using mean by default):
df2 = (df.pivot_table(index=['Field1', 'Field2'],
                      columns='Type',
                      values=['Price1', 'Price2'],
                      aggfunc='mean')
         .sort_index(axis=1, level=1))
df2.columns = [f'{b}-{a}' for a, b in df2.columns]
df2 = df2.reset_index()
print(df2)
Use pivot_table:
pd.pivot_table(df, values=['Price1', 'Price2'],
               index=['Field1', 'Field2'],
               columns='Type').reset_index()
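For the stated end goal (A taken from Price1, B and C taken from Price2), a follow-up sketch on the flattened df1 from the first answer could simply select the wanted columns:
# Column names as produced by the f'{b}-{a}' flattening above
result = df1[['Field1', 'Field2', 'A-Price1', 'B-Price2', 'C-Price2']]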

How do I remove decimals from Pandas to_dict() output

The gist of this post is that I have "23" in my original data, and I want "23" in my resulting dict (not "23.0"). Here's how I've tried to handle it with Pandas.
My Excel worksheet has a coded Region column:
23
11
27
(blank)
25
Initially, I created a dataframe and Pandas set the dtype of Region to float64:
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheet_name=0, header=0)
df
23.0
11.0
27.0
NaN
25.0
Pandas will convert the dtype to object if I use fillna() to replace NaN's with blanks, which seems to eliminate the decimals.
df.fillna('', inplace=True)
df
23
11
27
(blank)
25
Except I still get decimals when I convert the dataframe to a dict:
data = df.to_dict('records')
data
[{'region': 23.0,},
{'region': 27.0,},
{'region': 11.0,},
{'region': '',},
{'region': 25.0,}]
Is there a way I can create the dict without the decimal places? By the way, I'm writing a generic utility, so I won't always know the column names and/or value types, which means I'm looking for a generic solution (vs. explicitly handling Region).
Any help is much appreciated, thanks!
The problem is that after fillna('') your underlying values are still floats, despite the column being of type object:
import numpy as np
import pandas as pd

s = pd.Series([23., 11., 27., np.nan, 25.])
s.fillna('').iloc[0]
23.0
Instead, apply a formatter, then replace
s.apply('{:0.0f}'.format).replace('nan', '').to_dict()
{0: '23', 1: '11', 2: '27', 3: '', 4: '25'}
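Since the utility is generic, a sketch extending this formatter to every float column of a DataFrame could look like the following. It must run before any fillna('') (while the dtypes are still float) and assumes, as with Region, that the floats hold whole-number codes; to_int_strings is just an illustrative name:
def to_int_strings(frame):
    # Format float columns as whole-number strings and blank out NaNs
    out = frame.copy()
    for col in out.select_dtypes(include='float').columns:
        out[col] = out[col].apply('{:0.0f}'.format).replace('nan', '')
    return out

data = to_int_strings(df).to_dict('records')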
Using a custom function takes care of integers and keeps strings as strings:
import pprint

def func(x):
    try:
        return int(x)
    except ValueError:
        return x

df = pd.DataFrame({'region': [1, 2, 3, float('nan')],
                   'col2': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'col2': 'a', 'region': 1},
{'col2': 'b', 'region': 2},
{'col2': 'c', 'region': 3},
{'col2': '', 'region': ''}]
A variation that also keeps floats as floats:
import pprint

def func(x):
    try:
        if int(x) == x:
            return int(x)
        else:
            return x
    except ValueError:
        return x

df = pd.DataFrame({'region1': [1, 2, 3, float('nan')],
                   'region2': [1.5, 2.7, 3, float('nan')],
                   'region3': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'region1': 1, 'region2': 1.5, 'region3': 'a'},
{'region1': 2, 'region2': 2.7, 'region3': 'b'},
{'region1': 3, 'region2': 3, 'region3': 'c'},
{'region1': '', 'region2': '', 'region3': ''}]
You could add dtype=str:
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheet_name=0, header=0, dtype=str)
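With dtype=str every cell comes in as a string, so '23' stays '23'; only blank cells still read as NaN, which fillna('') then turns into empty strings. A sketch of the full round trip (same hypothetical file path):
df = pd.read_excel(filepath, sheet_name=0, header=0, dtype=str)
df.fillna('', inplace=True)
data = df.to_dict('records')  # e.g. [{'region': '23'}, ..., {'region': ''}]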
