How would you use a dictionary to replace values throughout all the columns in a dataframe?
Below is an example, where I attempted to pass the dictionary through the replace function. I have a script that produces various sized dataframes based on a company's manager/employee structure - so the number of columns varies per dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace({di})
print(df)
There is a similar question linked below in which the solution specifies column names, but given that I'm looking at a whole dataframe that would vary in column names/size, I'd like to apply the replace function to the whole dataframe.
Remap values in pandas column with a dict
Thanks!
Don't put {} around your dictionary: {di} is a set literal, and building it raises a TypeError because a dictionary is unhashable and therefore can't be an element of a set. Instead, pass the dictionary in directly:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace(di)
print(df)
Output:
col1 col2 col3 col4
0 w a w w
1 1 2 1 1
2 2 NaN 2 2
col1 col2 col3 col4
0 w a w w
1 A B A A
2 B NaN B B
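If you ever do need to limit the replacement to particular columns (the situation in the linked question), replace also accepts a nested dictionary keyed by column name. A minimal sketch of that variant, reusing the example columns above (df_partial is just an illustrative name):
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
# Outer keys are column names, inner dicts map old values to new ones;
# only col1 and col3 are touched, col2 and col4 keep their original values.
df_partial = df.replace({'col1': di, 'col3': di})
print(df_partial)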
Related
I have a df such as
Letter | Stats
B      | 0
B      | 1
C      | 22
B      | 0
C      | 0
B      | 3
How can I filter for a value in the Letter column and then convert the Stats column for that value into an array?
Basically I want to filter for B and convert its Stats values to an array. Thanks!
Here is one way to do it:
# the function receives a dataframe and a letter as parameters
# and returns the Stats values for that letter as a list
def grp(df, letter):
    return df.loc[df['Letter'].eq(letter), 'Stats'].values.tolist()

# pass the dataframe and the letter
result = grp(df, 'B')
print(result)
[0, 1, 0, 3]
Data used:
data ={'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df=pd.DataFrame(data)
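If what you actually want is a NumPy array rather than a Python list, the same filter can feed .to_numpy() instead of .tolist(). A small sketch of that variant:
# boolean-filter the rows for 'B' and take the Stats column as a NumPy array
arr = df.loc[df['Letter'].eq('B'), 'Stats'].to_numpy()
print(arr)
# [0 1 0 3]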
Although I believe the solution proposed by @Naveed is enough for this problem, one small extension could be suggested.
If you would like to get the result as a pandas Series and compute some statistics on it:
data ={'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)
letter = 'B'
ser = pd.Series(name=letter, data=df.loc[df['Letter'].eq(letter)]['Stats'].values)
print(f"Max value: {ser.max()} | Min value: {ser.min()} | Median value: {ser.median()}") etc.
Output:
Max value: 3 | Min value: 0 | Median value: 0.5
Given the following df:
data = {'identifier': {0: 'a',
1: 'a',
3: 'b',
4: 'b',
5: 'c'},
'gt_50': {0: 1, 1: 1, 3: 0, 4: 0, 5: 0},
'gt_10': {0: 1, 1: 1, 3: 1, 4: 1, 5: 1}}
df = pd.DataFrame(data)
I want to find the number of unique values in the "identifier" column for each column that starts with "gt_", counting only the rows where that column's value is 1.
Expected output:
- gt_50 1
- gt_10 3
I could write a for loop that filters the frame on one gt_ column per iteration and then counts the uniques, but that doesn't feel very clean.
Is there a cleaner way to do this?
Use DataFrame.melt with the gt_ columns (selected via DataFrame.filter) as value_vars to unpivot, then keep the rows equal to 1 with DataFrame.query, and finally count the unique values with DataFrameGroupBy.nunique:
out = (df.melt('identifier', value_vars=df.filter(regex='^gt_').columns)
         .query('value == 1')
         .groupby('variable')['identifier']
         .nunique())
print (out)
variable
gt_10 3
gt_50 1
Name: identifier, dtype: int64
Or:
s = df.set_index('identifier').filter(regex='^gt_').stack()
out = s[s.eq(1)].reset_index().groupby('level_1')['identifier'].nunique()
print (out)
level_1
gt_10 3
gt_50 1
Name: identifier, dtype: int64
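If you prefer something more explicit over the melt/stack reshaping, a dictionary comprehension over the gt_ columns gives the same counts. A small non-vectorized sketch (not part of the answer above; counts is just an illustrative name):
# for each gt_ column, count unique identifiers among the rows where it equals 1
counts = {c: df.loc[df[c].eq(1), 'identifier'].nunique()
          for c in df.filter(regex='^gt_').columns}
print(counts)
# {'gt_50': 1, 'gt_10': 3}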
I have a DF that looks like this.
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3}, 'Value': {0: 'a', 1: 'b', 2: np.nan}})
   ID Value
0   1     a
1   2     b
2   3     c
I'd like to create a dictionary out of it.
So if I run df.to_dict('records'), it gives me
[{'Visual_ID': 1, 'Customer': 'a'},
{'Visual_ID': 2, 'Customer': 'b'},
{'Visual_ID': 3, 'Customer': 'c'}]
However, what I want is the following.
{
1: 'a',
2: 'b',
3: 'c'
}
All of the rows in the DF are unique, so it shouldn't run into a duplicate-key issue.
Try with
d = dict(zip(df.ID, df.Value))
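An equivalent that stays inside pandas (a sketch, not part of the answer above) is to index by ID and convert the remaining column:
# set ID as the index, take the Value column, and turn it into a dict
d = df.set_index('ID')['Value'].to_dict()
print(d)
Both give the same mapping as dict(zip(df.ID, df.Value)).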
Hello, I have a dataframe such as:
COL1 COL2 COL3
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]]
How can I use re.sub(r'.*__', '') within the COL3 sublists and get a new column without everything before '__'?
COL1 COL2 COL3 COL4
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]] [[(HELLO,BY),(HOLLA,BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]] [[(BO,GET)]]
Here is the data:
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
Updated data solution
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
df['COL4'] = df['COL3 '].str.replace(r"([,(])[^(),]*__", r"\1", regex=True)  # regex=True is required in newer pandas
df['COL4']
# => 0 [[(HELLO,BY),(HOLLA,BY)]]
# 1 [[(BO,GET)]]
# Name: COL4, dtype: object
Old data solution
You can use ast.literal_eval to turn the strings in the COL3 column into lists of lists and iterate over them while modifying the tuple items:
import ast
import re
import pandas as pd

data = {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[('OK2_+__HELLO','OJ_+__BY'),('LO_-__HOLLA','KUOJ_+__BY')]]", 1: "[[('JU3_+__BO','UJ3_-__GET')]]"}}
df = pd.DataFrame.from_dict(data)

def repl(m):
    # parse the string into a real list of lists of tuples
    result = []
    for l in ast.literal_eval(m):
        ll = []
        for x, y in l:
            # strip everything up to and including the last '__' in each item
            ll.append((re.sub(r'.*__', '', x), re.sub(r'.*__', '', y)))
        result.append(ll)
    return str(result)

df['COL4'] = df['COL3 '].apply(repl)
df['COL4']
# => 0 [[('HELLO', 'BY'), ('HOLLA', 'BY')]]
# 1 [[('BO', 'GET')]]
You do not need to use str(result) if you are OK to keep the result as a list of lists.
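For instance, if you prefer to keep actual lists, a compact variant could look like this (a sketch; repl_list is just an illustrative name, not from the original answer):
# same transformation, but the result stays a list of lists of tuples
def repl_list(m):
    return [[(re.sub(r'.*__', '', x), re.sub(r'.*__', '', y)) for x, y in l]
            for l in ast.literal_eval(m)]

df['COL4'] = df['COL3 '].apply(repl_list)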
Given the following dataframe:
Node_1 Node_2 Time
A B 6
A B 4
B A 2
B C 5
How can one obtain, using groupby or other methods, the dataframe as follows:
Node_1 Node_2 Mean_Time
A B 4
B C 5
The first row's Mean_Time is obtained by averaging all A->B and B->A routes, i.e. (6 + 4 + 2)/3 = 4.
You could sort each row of the Node_1 and Node_2 columns using np.sort:
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
df.loc[:, nodes.columns] = arr
which results in df now looking like:
Node_1 Node_2 Time
0 A B 6
1 A B 4
2 A B 2
3 B C 5
With the Node columns sorted, you can groupby/agg as usual:
result = df.groupby(cols).agg('mean').reset_index()
import numpy as np
import pandas as pd
data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
cols = nodes.columns.tolist()
df.loc[:, nodes.columns] = arr
result = df.groupby(cols).agg('mean').reset_index()
print(result)
yields
Node_1 Node_2 Time
0 A B 4
1 B C 5
Something along the lines of the following should give you the desired result... This got a lot uglier than it was :D
import pandas as pd
data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)
# Create new column to group by
df["Node"] = df[["Node_1","Node_2"]].apply(lambda x: tuple(sorted(x)),axis=1)
# Create Mean_time column
df["Mean_time"] = df.groupby('Node').transform('mean')
# Drop duplicate rows and drop Node and Time columns
df = df.drop_duplicates("Node").drop(['Node','Time'],axis=1)
print(df)
Returns:
Node_1 Node_2 Mean_time
0 A B 4
3 B C 5
An alternative would be to use:
df = (df.groupby('Node', as_index=False)
        .agg({'Node_1': lambda x: list(x)[0],
              'Node_2': lambda x: list(x)[0],
              'Time': 'mean'})
        .drop('Node', axis=1))
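If the lambdas feel heavy, the built-in 'first' reducer does the same job here. A sketch of that variant, assuming the helper 'Node' column from above has already been added:
df = (df.groupby('Node', as_index=False)
        .agg({'Node_1': 'first', 'Node_2': 'first', 'Time': 'mean'})
        .drop('Node', axis=1))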