How would you use a dictionary to replace values throughout all the columns in a dataframe?
Below is an example, where I attempted to pass the dictionary through the replace function. I have a script that produces various sized dataframes based on a company's manager/employee structure - so the number of columns varies per dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace({di})
print(df)
There is a similar question linked below in which the solution specifies column names, but given that I'm looking at a whole dataframe that would vary in column names/size, I'd like to apply the replace function to the whole dataframe.
Remap values in pandas column with a dict
Thanks!
Don't put {} around your dictionary: {di} is a set literal, and building it raises a TypeError because a dictionary is unhashable and therefore can't be an element of a set. Instead, pass the dictionary in directly:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
print(df)
df = df.replace(di)
print(df)
Output:
col1 col2 col3 col4
0 w a w w
1 1 2 1 1
2 2 NaN 2 2
col1 col2 col3 col4
0 w a w w
1 A B A A
2 B NaN B B
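If you ever do need to limit the replacement to particular columns (the situation in the linked question), replace also accepts a nested dictionary keyed by column name. A minimal sketch of that variant, reusing the example columns above (df_partial is just an illustrative name):
import pandas as pd
import numpy as np
df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}, 'col3': {0: 'w', 1: 1, 2: 2}, 'col4': {0: 'w', 1: 1, 2: 2}})
di = {1: "A", 2: "B"}
# Outer keys are column names, inner dicts map old values to new ones;
# only col1 and col3 are touched, col2 and col4 keep their original values.
df_partial = df.replace({'col1': di, 'col3': di})
print(df_partial)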
Related
I have a df such as
Letter | Stats
B      | 0
B      | 1
C      | 22
B      | 0
C      | 0
B      | 3
How can I filter for a value in the Letter column and then convert the Stats column for that value into an array?
Basically I want to filter for B and convert its Stats values to an array. Thanks!
Here is one way to do it:
# the function receives a dataframe and a letter as parameters
# and returns the Stats values for that letter as a list
def grp(df, letter):
    return df.loc[df['Letter'].eq(letter), 'Stats'].values.tolist()

# pass the dataframe and the letter
result = grp(df, 'B')
print(result)
[0, 1, 0, 3]
Data used:
data ={'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df=pd.DataFrame(data)
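If what you actually want is a NumPy array rather than a Python list, the same filter can feed .to_numpy() instead of .tolist(). A small sketch of that variant:
# boolean-filter the rows for 'B' and take the Stats column as a NumPy array
arr = df.loc[df['Letter'].eq('B'), 'Stats'].to_numpy()
print(arr)
# [0 1 0 3]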
Although I believe the solution proposed by @Naveed is enough for this problem, one small extension could be suggested.
If you would like to get the result as a pandas Series and compute some statistics on it:
data ={'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)
letter = 'B'
ser = pd.Series(name=letter, data=df.loc[df['Letter'].eq(letter)]['Stats'].values)
print(f"Max value: {ser.max()} | Min value: {ser.min()} | Median value: {ser.median()}") etc.
Output:
Max value: 3 | Min value: 0 | Median value: 0.5
Given the following df:
data = {'identifier': {0: 'a',
1: 'a',
3: 'b',
4: 'b',
5: 'c'},
'gt_50': {0: 1, 1: 1, 3: 0, 4: 0, 5: 0},
'gt_10': {0: 1, 1: 1, 3: 1, 4: 1, 5: 1}}
df = pd.DataFrame(data)
I want to find the number of unique values in the "identifier" column for each column that starts with "gt_", counting only the rows where that column's value is 1.
Expected output:
- gt_50 1
- gt_10 3
I could write a for loop that filters the frame on one gt_ column per iteration and then counts the uniques, but that doesn't feel very clean.
Is there a cleaner way to do this?
Use DataFrame.melt with the gt_ columns (selected via DataFrame.filter) as value_vars to unpivot, then keep the rows equal to 1 with DataFrame.query, and finally count the unique values with DataFrameGroupBy.nunique:
out = (df.melt('identifier', value_vars=df.filter(regex='^gt_').columns)
         .query('value == 1')
         .groupby('variable')['identifier']
         .nunique())
print (out)
variable
gt_10 3
gt_50 1
Name: identifier, dtype: int64
Or:
s = df.set_index('identifier').filter(regex='^gt_').stack()
out = s[s.eq(1)].reset_index().groupby('level_1')['identifier'].nunique()
print (out)
level_1
gt_10 3
gt_50 1
Name: identifier, dtype: int64
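If you prefer something more explicit over the melt/stack reshaping, a dictionary comprehension over the gt_ columns gives the same counts. A small non-vectorized sketch (not part of the answer above; counts is just an illustrative name):
# for each gt_ column, count unique identifiers among the rows where it equals 1
counts = {c: df.loc[df[c].eq(1), 'identifier'].nunique()
          for c in df.filter(regex='^gt_').columns}
print(counts)
# {'gt_50': 1, 'gt_10': 3}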
I have a DF that looks like this.
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3}, 'Value': {0: 'a', 1: 'b', 2: np.nan}})
   ID Value
0   1     a
1   2     b
2   3     c
I'd like to create a dictionary out of it.
So if I run df.to_dict('records'), it gives me
[{'Visual_ID': 1, 'Customer': 'a'},
{'Visual_ID': 2, 'Customer': 'b'},
{'Visual_ID': 3, 'Customer': 'c'}]
However, what I want is the following.
{
1: 'a',
2: 'b',
3: 'c'
}
All of the rows in the DF are unique, so it shouldn't run into a duplicate-key issue.
Try with
d = dict(zip(df.ID, df.Value))
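An equivalent that stays inside pandas (a sketch, not part of the answer above) is to index by ID and convert the remaining column:
# set ID as the index, take the Value column, and turn it into a dict
d = df.set_index('ID')['Value'].to_dict()
print(d)
Both give the same mapping as dict(zip(df.ID, df.Value)).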
Hello, I have a dataframe such as:
COL1 COL2 COL3
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]]
How can I use re.sub(r'.*__', '') within the COL3 sublists and get a new column without everything before '__'?
COL1 COL2 COL3 COL4
G1 1 [[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]] [[(HELLO,BY),(HOLLA,BY)]]
G1 2 [[(JU3_+__BO,UJ3_-__GET)]] [[(BO,GET)]]
Here is the data:
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
Updated data solution
data= {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[(OK2_+__HELLO,OJ_+__BY),(LO_-__HOLLA,KUOJ_+__BY)]]", 1: "[[(JU3_+__BO,UJ3_-__GET)]]"}}
df = pd.DataFrame.from_dict(data)
df['COL4'] = df['COL3 '].str.replace(r"([,(])[^(),]*__", r"\1", regex=True)  # regex=True is required in newer pandas
df['COL4']
# => 0 [[(HELLO,BY),(HOLLA,BY)]]
# 1 [[(BO,GET)]]
# Name: COL4, dtype: object
Old data solution
You can use ast.literal_eval to turn the strings in the COL3 column into lists of lists and iterate over them while modifying the tuple items:
import ast
import re
import pandas as pd

data = {'COL1': {0: 'G1', 1: 'G1'}, 'COL2': {0: 1, 1: 2}, 'COL3 ': {0: "[[('OK2_+__HELLO','OJ_+__BY'),('LO_-__HOLLA','KUOJ_+__BY')]]", 1: "[[('JU3_+__BO','UJ3_-__GET')]]"}}
df = pd.DataFrame.from_dict(data)

def repl(m):
    # parse the string into a real list of lists of tuples
    result = []
    for l in ast.literal_eval(m):
        ll = []
        for x, y in l:
            # strip everything up to and including the last '__' in each item
            ll.append((re.sub(r'.*__', '', x), re.sub(r'.*__', '', y)))
        result.append(ll)
    return str(result)

df['COL4'] = df['COL3 '].apply(repl)
df['COL4']
# => 0 [[('HELLO', 'BY'), ('HOLLA', 'BY')]]
# 1 [[('BO', 'GET')]]
You do not need to use str(result) if you are OK to keep the result as a list of lists.
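For instance, if you prefer to keep actual lists, a compact variant could look like this (a sketch; repl_list is just an illustrative name, not from the original answer):
# same transformation, but the result stays a list of lists of tuples
def repl_list(m):
    return [[(re.sub(r'.*__', '', x), re.sub(r'.*__', '', y)) for x, y in l]
            for l in ast.literal_eval(m)]

df['COL4'] = df['COL3 '].apply(repl_list)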
Given the following dataframe:
Node_1 Node_2 Time
A B 6
A B 4
B A 2
B C 5
How can one obtain, using groupby or other methods, the dataframe as follows:
Node_1 Node_2 Mean_Time
A B 4
B C 5
The first row's Mean_Time is obtained by averaging all A->B and B->A routes, i.e. (6 + 4 + 2)/3 = 4.
You could sort each row of the Node_1 and Node_2 columns using np.sort:
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
df.loc[:, nodes.columns] = arr
which results in df now looking like:
Node_1 Node_2 Time
0 A B 6
1 A B 4
2 A B 2
3 B C 5
With the Node columns sorted, you can groupby/agg as usual:
result = df.groupby(cols).agg('mean').reset_index()
import numpy as np
import pandas as pd
data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
cols = nodes.columns.tolist()
df.loc[:, nodes.columns] = arr
result = df.groupby(cols).agg('mean').reset_index()
print(result)
yields
Node_1 Node_2 Time
0 A B 4
1 B C 5
Something along the lines of the following should give you the desired result... This got a lot uglier than it was :D
import pandas as pd
data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)
# Create new column to group by
df["Node"] = df[["Node_1","Node_2"]].apply(lambda x: tuple(sorted(x)),axis=1)
# Create Mean_time column
df["Mean_time"] = df.groupby('Node').transform('mean')
# Drop duplicate rows and drop Node and Time columns
df = df.drop_duplicates("Node").drop(['Node','Time'],axis=1)
print(df)
Returns:
Node_1 Node_2 Mean_time
0 A B 4
3 B C 5
An alternative would be to use:
df = (df.groupby('Node', as_index=False)
        .agg({'Node_1': lambda x: list(x)[0],
              'Node_2': lambda x: list(x)[0],
              'Time': 'mean'})
        .drop('Node', axis=1))
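If the lambdas feel heavy, the built-in 'first' reducer does the same job here. A sketch of that variant, assuming the helper 'Node' column from above has already been added:
df = (df.groupby('Node', as_index=False)
        .agg({'Node_1': 'first', 'Node_2': 'first', 'Time': 'mean'})
        .drop('Node', axis=1))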