I have a dictionary which looks like this: di = {1: "A", 2: "B"}
I would like to apply it to the col1 column of a dataframe similar to:
col1 col2
0 w a
1 1 2
2 2 NaN
to get:
col1 col2
0 w a
1 A 2
2 B NaN
How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/
You can use .replace. For example:
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>> df.replace({"col1": di})
col1 col2
0 w a
1 A 2
2 B NaN
or directly on the Series, i.e. df["col1"] = df["col1"].replace(di).
map can be much faster than replace
If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):
Exhaustive Mapping
In this case, the form is very simple:
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
Although map most commonly takes a function as its argument, it can alternatively take a dictionary or a Series: see the documentation for pandas.Series.map.
Non-Exhaustive Mapping
If you have a non-exhaustive mapping and wish to retain the existing values for non-matches, you can add fillna:
df['col1'].map(di).fillna(df['col1'])
as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
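To make the difference between the two forms concrete, here is a minimal sketch on the frame from the question. The names mapped_only and keep_unmatched are illustrative only and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["w", 1, 2], "col2": ["a", 2, None]})
di = {1: "A", 2: "B"}

# Plain map: "w" has no key in di, so it becomes NaN.
mapped_only = df["col1"].map(di)

# map + fillna: unmatched entries fall back to the original value.
keep_unmatched = df["col1"].map(di).fillna(df["col1"])

print(mapped_only.tolist())
print(keep_unmatched.tolist())
```

The first list has a missing value where "w" was, while the second keeps "w" and only swaps 1 and 2.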
Benchmarks
Using the following data with pandas version 0.23.1:
di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
and testing with %timeit, it appears that map is approximately 10x faster than replace.
Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.
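If you want to repeat the comparison on your own machine, a rough sketch with the standard timeit module could look like the following. The seed, the repetition count and the use of default_rng are my assumptions, not part of the original benchmark, and the timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
df = pd.DataFrame({"col1": rng.integers(1, 9, size=100_000)})

# For an exhaustive mapping both approaches should give the same values.
via_map = df["col1"].map(di)
via_replace = df["col1"].replace(di)
same_result = via_map.tolist() == via_replace.tolist()

# Time a few repetitions of each call.
t_map = timeit.timeit(lambda: df["col1"].map(di), number=5)
t_replace = timeit.timeit(lambda: df["col1"].replace(di), number=5)

print(f"same result: {same_result}")
print(f"map: {t_map:.4f}s  replace: {t_replace:.4f}s")
```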
There is a bit of ambiguity in your question. There are at least three interpretations:
the keys in di refer to index values
the keys in di refer to df['col1'] values
the keys in di refer to index locations (not the OP's question, but thrown in for fun.)
Below is a solution for each case.
Case 1:
If the keys of di are meant to refer to index values, then you could use the update method:
df['col1'].update(pd.Series(di))
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {0: "A", 2: "B"}
# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)
yields
col1 col2
1 w a
2 B 30
0 A NaN
I've modified the values from your original post so it is clearer what update is doing.
Note how the keys in di are associated with index values. The order of the index values -- that is, the index locations -- does not matter.
Case 2:
If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
print(df)
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {10: "A", 20: "B"}
# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'] = df['col1'].replace(di)
print(df)
yields
col1 col2
1 w a
2 A 30
0 B NaN
Note how in this case the keys in di were changed to match values in df['col1'].
Case 3:
If the keys in di refer to index locations, then you could assign by position with iloc:
df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())
since
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
di = {0: "A", 2: "B"}
# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())
print(df)
yields
col1 col2
1 A a
2 10 30
0 B NaN
Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
DSM has the accepted answer, but the coding doesn't seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})
conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)
print(df.head())
You'll see it looks like:
col1 col2 converted_column
0 1 negative -1
1 2 positive 1
2 2 neutral 0
3 3 neutral 0
4 1 positive 1
The docs for pandas.DataFrame.replace are here.
Given that map is faster than replace (@JohnE's solution), you need to be careful with non-exhaustive mappings where you intend to map specific values to NaN. The proper method in this case requires that you mask the Series when you call .fillna, otherwise you undo the mapping to NaN.
import pandas as pd
import numpy as np
d = {'m': 'Male', 'f': 'Female', 'missing': np.nan}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})
keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']
df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))
gender mapped
0 m Male
1 f Female
2 missing NaN
3 Male Male
4 U U
Adding to this question, in case you ever have more than one column to remap in a dataframe:
def remap(data, dict_labels):
    """
    Take a dictionary of labels, dict_labels, and replace the
    (previously label-encoded) values with their strings.
    ex: dict_labels = {'col1': {1: 'A', 2: 'B'}}
    """
    for field, values in dict_labels.items():
        print("I am remapping %s" % field)
        data.replace({field: values}, inplace=True)
    print("DONE")
    return data
Hope it can be useful to someone.
Cheers
Or do apply:
df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
Demo:
>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
col1 col2
0 w a
1 A 2
2 B NaN
>>>
You can update your mapping dictionary with missing pairs from the dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', np.nan]})
map_ = {'a': 'A', 'b': 'B', 'd': np.nan}
# Get mapping from df
uniques = df['col1'].unique()
map_new = dict(zip(uniques, uniques))
# {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', nan: nan}
# Update mapping
map_new.update(map_)
# {'a': 'A', 'b': 'B', 'c': 'c', 'd': nan, nan: nan}
df['col2'] = df['col1'].map(map_new)
Result:
col1 col2
0 a A
1 b B
2 c c
3 d NaN
4 NaN NaN
A nice complete solution that keeps a map of your class labels:
labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})
This way, you can at any point refer to the original class label from labels_dict.
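As a sketch of what referring back to the original label can look like, the dictionary can be inverted and mapped over the encoded column. The frame contents and the names encoded, inverse_dict and decoded are made up for illustration, and map is used here instead of replace purely to keep the example short.

```python
import pandas as pd

# Hypothetical data standing in for the answer's features frame.
features = pd.DataFrame({"col1": ["cat", "dog", "cat", "bird"]})

labels = features["col1"].unique()
labels_dict = dict(zip(labels, range(len(labels))))
encoded = features["col1"].map(labels_dict)

# Invert the dictionary to go from integer code back to the label.
inverse_dict = {code: label for label, code in labels_dict.items()}
decoded = encoded.map(inverse_dict)

print(labels_dict)
print(encoded.tolist())
print(decoded.tolist())
```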
As an extension of what Nico Coallier (apply to multiple columns) and U10-Forward (using apply-style methods) proposed, summarised in a one-liner:
df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x)))
The .transform() call processes each column as a Series, so you can apply the Series method map() inside it.
Finally, and I discovered this behaviour thanks to U10, .get() can be used in the mapped expression: map() calls it on each element of the Series in turn rather than on the Series as a whole.
The .get(x, x) accounts for values not mentioned in your mapping dictionary, which the .map() method would otherwise turn into NaN.
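Here is a small sketch of this idea on the frame from the original question. Assigning the result back with df[cols] = ... is my addition, since transform returns a new frame rather than modifying df.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["w", 1, 2], "col2": ["a", 2, np.nan]})
di = {1: "A", 2: "B"}

# For each selected column, look every element up in di and
# fall back to the element itself when it is not a key.
cols = ["col1", "col2"]
df[cols] = df[cols].transform(lambda col: col.map(lambda x: di.get(x, x)))

print(df)
```

Values that are not keys of di, including the NaN in col2, pass through unchanged.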
Another approach is to apply a regex-based replace function to the column's string values, as below:
import re
def multiple_replace(mapping, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: mapping[mo.group(0)], text)
Once you have defined the function, you can apply it to your dataframe. Because it works on text, the dictionary keys and the column values must be strings:
di = {"1": "A", "2": "B"}
df['col1'] = df.apply(lambda row: multiple_replace(di, str(row['col1'])), axis=1)
Is there a direct way to drop all columns before a matching string in a pandas DataFrame? E.g., if my column 8 contains the string 'Matched', I want to drop columns 0 to 7.
Well, you did not give any information on where and how to look for 'Matched', but let's say that the integer col_num contains the number of the matched column:
col_num = np.where(df == 'Matched')[1][0]
df.drop(columns=df.columns[0:col_num],inplace=True)
will do the drop
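To see those two lines in action, here is a small sketch on a made-up frame (the column names a to d and the values are assumptions). It takes the smallest matching column position, because np.where scans row by row and its first hit is not necessarily the left-most column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2],
                   "b": ["x", "y"],
                   "c": ["Matched", "z"],
                   "d": [3, 4]})

# Boolean array marking every cell equal to 'Matched'.
mask = (df == "Matched").to_numpy()

# np.where returns (row_positions, column_positions) of the True cells.
col_positions = np.where(mask)[1]
col_num = col_positions.min()

# Drop every column to the left of the matched one.
trimmed = df.drop(columns=df.columns[:col_num])
print(trimmed.columns.tolist())
```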
Example
data = {'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'Match1': {0: 4}, 'D': {0: 5}}
df = pd.DataFrame(data)
df
A B C Match1 D
0 1 2 3 4 5
Code
Remove the columns in front of the first Match# column, using boolean indexing:
df.loc[:, df.columns.str.startswith('Match').cumsum() > 0]
result
Match1 D
0 4 5
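The mask in that one-liner can be unpacked as follows. This is only a sketch; the intermediate names are mine, and the np.asarray call is a precaution so that cumsum runs on a plain NumPy boolean array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3], "Match1": [4], "D": [5]})

# True only for column names that start with 'Match'.
starts_with_match = np.asarray(df.columns.str.startswith("Match"), dtype=bool)

# cumsum is 0 before the first True and positive from there onwards,
# so the comparison keeps the matched column and everything after it.
keep = starts_with_match.cumsum() > 0

result = df.loc[:, keep]
print(keep.tolist())
print(result.columns.tolist())
```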
I am trying to filter my dataframe such that when I create a new column output, it displays the "medium" rating. My dataframe has str values, so I convert them to numbers based on a ranking system I have and then I filter out the maximum and minimum rating per row.
I am running into this error:
TypeError: unsupported operand type(s) for &: 'str' and 'bool'
I've created a data frame that pulls str values from my csv file:
df = pd.read_csv('csv path', usecols=['rating1','rating2','rating3'])
And my dataframe looks like this:
rating1 rating2 rating3
0 D D C
1 C B A
2 B B B
I need it to look like this
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3
I have a mapping dictionary that converts the values to numbers.
ranking = {
'D': 1, 'C':2, 'B': 3, 'A' : 4
}
Below you can find the code I use to determine the "medium rating". Basically, if all the ratings are the same, you can pull the minimum rating. If two of the ratings are the same, pull in the lowest rating. If the three ratings differ, filter out the max rating and the min rating.
if df == df.loc[(['rating1'] == df['rating2'] & df['rating1'] == df['rating3'])]:
df['mediumrating'] = df.replace(ranking).min(axis=1)
elif df == df.loc[(['rating1'] == df['rating2'] | df['rating1'] == df['rating3'] | df['rating2'] == df['rating3'])]:
df['mediumrating'] = df.replace(ranking).min(axis=1)
else:
df['mediumrating'] == df.loc[(df.replace(ranking) > df.replace(ranking).min(axis=1) & df.replace(ranking)
Any help on my formatting or process would be welcomed!!
Use np.where:
For the condition, use df.nunique applied to axis=1 and check if the result equals either 1 (all values are the same) or 2 (two different values) with Series.isin.
If True, we need df.min along axis=1.
If False (all unique values), we need df.median along axis=1.
Finally, use astype to turn resulting floats into integers.
import pandas as pd
import numpy as np
data = {'rating1': {0: 'D', 1: 'C', 2: 'B'},
'rating2': {0: 'D', 1: 'B', 2: 'B'},
'rating3': {0: 'C', 1: 'A', 2: 'B'}}
df = pd.DataFrame(data)
ranking = {'D': 1, 'C':2, 'B': 3, 'A' : 4}
df['mediumrating'] = np.where(df.replace(ranking).nunique(axis=1).isin([1,2]),
df.replace(ranking).min(axis=1),
df.replace(ranking).median(axis=1)).astype(int)
print(df)
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3
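To check the logic row by row, the pieces that go into np.where can be computed separately. This is a sketch with my own intermediate names, and it converts the letters with a per-column map rather than replace, which is an assumption on my part to keep the result purely numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"rating1": ["D", "C", "B"],
                   "rating2": ["D", "B", "B"],
                   "rating3": ["C", "A", "B"]})
ranking = {"D": 1, "C": 2, "B": 3, "A": 4}

# Convert letters to numbers, column by column.
ranked = df.apply(lambda col: col.map(ranking))

n_distinct = ranked.nunique(axis=1)  # distinct ratings per row
row_min = ranked.min(axis=1)         # used when 1 or 2 distinct values
row_median = ranked.median(axis=1)   # used when all 3 differ

medium = np.where(n_distinct.isin([1, 2]), row_min, row_median).astype(int)
print(medium.tolist())
```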
Took me a sec to understand what you really meant by filter. Here is some code that should be self-explanatory and should achieve what you're looking for:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['D', 'D', 'C'], ['C', 'B', 'A'], ['B', 'B', 'B']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['rating1', 'rating2', 'rating3'])
# dictionary that maps the rating to a number
rating_map = {'D': 1, 'C': 2, 'B': 3, 'A': 4}
def rating_to_number(rating1, rating2, rating3):
    if rating1 == rating2 and rating2 == rating3:
        return rating_map[rating1]
    elif rating1 == rating2 or rating1 == rating3 or rating2 == rating3:
        return min(rating_map[rating1], rating_map[rating2], rating_map[rating3])
    else:
        return rating_map[sorted([rating1, rating2, rating3])[1]]
# create a new column by applying the rating_to_number function to the other columns, row by row
df['mediumrating'] = df.apply(lambda x: rating_to_number(x['rating1'], x['rating2'], x['rating3']), axis=1)
print(df)
This prints out:
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3
Edit: updated rating_to_number based on your updated question
I have the data frame below with 5 columns. I need to check for a specific string ("-") in all columns and, wherever "-" is found, add the preceding column's value to a new column (F). For example, "-" is located in column B in rows zero and two; hence 'a' and 'c' (the preceding column's values) are added to column F in those rows, and so on.
Source Data Frame:
Desired Data Frame would be:
I have written the code below, but I get a value length error when I try to create the new column (F). I appreciate your support.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
'B': {0: '-', 1: 'a', 2: '-', 3: 'b', 4: 'd'}})
df['C'] = np.where(df['B'].isin(df['A'].values), df['B'], np.nan)
df['C'] = df['C'].map(dict(zip(df.A.values, df.B.values)))
df['D'] = np.where(df['C'].isin(df['B'].values), df['C'], np.nan)
df['D'] = df['D'].map(dict(zip(df.B.values, df['C'].values)))
df['E'] = np.where(df['D'].isin(df['C'].values), df['D'], np.nan)
df['E'] = df['E'].map(dict(zip(df['C'].values, df['D'].values)))
a=np.array(df.iloc[:,:5])
g=[]
for index,x in np.ndenumerate(a):
    temp=[]
    if x=="-":
        temp.append(x-1)
    g.append(temp)
df['F']=g
print(df)
Use DataFrame.where to keep only the values whose next column (found with DataFrame.shift along axis=1) equals '-', turning everything else into missing values; then back-fill the missing values along each row and select the first column by position:
df['F'] = df.where(df.shift(-1, axis=1).eq('-')).bfill(axis=1).iloc[:, 0]
print (df)
A B F
0 a - a
1 b a NaN
2 c - c
3 d b NaN
4 e d NaN
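Splitting the chain into named steps may make it easier to follow. The sketch below uses the two-column frame shown in the output, and the intermediate names are mine.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d", "e"],
                   "B": ["-", "a", "-", "b", "d"]})

# True where the column to the right holds '-'.
next_is_dash = df.shift(-1, axis=1).eq("-")

# Keep only those cells; everything else becomes missing.
kept = df.where(next_is_dash)

# Back-fill along each row, then take the first column by position.
first_kept = kept.bfill(axis=1).iloc[:, 0]
print(first_kept.tolist())
```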
You can do:
df['F']=[i[0][-1] if len(i)>1 else np.nan for i in df.fillna('').sum(axis=1).str.split('-') ]
output:
df['F']
Out[41]:
0 a
1 a
2 c
3 a
4 a
Name: F, dtype: object
List Comprehension Explanation:
fill the NAs in df with '' and sum it across rows
split the sum with -
select the first element after splitting on - if the length is > 1; otherwise - is not present, so fill with np.nan
select the last character of that first element by using [-1]
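The same steps written out one at a time, as a sketch on a small made-up frame (the data and intermediate names are assumptions). The astype(object) call is only a precaution so that the row-wise sum concatenates strings regardless of how pandas stores them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d"],
                   "B": ["-", "a", "-", "b"],
                   "C": [np.nan, "-", np.nan, "c"]})

# 1. Fill missing values with '' and concatenate each row into one string.
joined = df.fillna("").astype(object).sum(axis=1)

# 2. Split each row string on '-'.
parts = joined.str.split("-")

# 3. If there was a '-', take the last character before the first one;
#    otherwise the row has no '-' and gets NaN.
F = [p[0][-1] if len(p) > 1 else np.nan for p in parts]
print(joined.tolist())
print(F)
```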