Pandas - Updating value in row based on condition with .apply() - python

Simplified, I would like to use a function to check a condition on a row and set the value accordingly:
def helper(row):
    if row["A"] == "TEST":
        row["B"] = "WOW"
    else:
        row["C"] = "NO_GO"

moddf = df.apply(helper, axis=1)
I can do this using iterrows() but .apply should be MUCH faster to iterate over 1M rows in a df.

You don't need (and shouldn't use) apply:
import pandas as pd

# toy data
df = pd.DataFrame({'A': ['TEST', 'NO'],
                   'B': ['A', 'B'],
                   'C': list('12')})

s = df['A'] == 'TEST'
df.loc[s, 'B'] = 'WOW'
df.loc[~s, 'C'] = 'NO_GO'
Output:
      A    B      C
0  TEST  WOW      1
1    NO    B  NO_GO
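If you'd rather keep each column's update in a single expression, numpy.where expresses the same two-branch logic. A minimal sketch (the numpy variant is an addition, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['TEST', 'NO'], 'B': ['A', 'B'], 'C': list('12')})

mask = df['A'] == 'TEST'
df['B'] = np.where(mask, 'WOW', df['B'])     # overwrite B only where A == 'TEST'
df['C'] = np.where(~mask, 'NO_GO', df['C'])  # overwrite C on the remaining rows
print(df)
#       A    B      C
# 0  TEST  WOW      1
# 1    NO    B  NO_GO
Both forms are vectorized, which is why they scale to a million rows where a per-row apply does not.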

Related

Python find and replace tool using pandas and a dictionary

Having issues building a find-and-replace tool in Python. The goal is to search a column in an Excel file for a string and swap out every letter of the string based on the key:value pairs of a dictionary, then write the entire new string back to the same cell. So "ABC" should convert to "BCD". I have to find and replace any occurrence of individual characters.
The code below runs without error, but newval never gets created and I don't know why. There are no issues writing data to the cell when newval does get created.
input: df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
expected output: df = pd.DataFrame({'Code1': ['BCD1', 'C5DE', 'D3EF']})
mycolumns = ["Col1", "Col2"]
mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
for x in mycolumns:
    # 1. If the mycolumn value exists in the headerlist of the file
    if x in headerlist:
        # 2. Get column coordinate
        col = df.columns.get_loc(x) + 1
        # 3. iterate through the rows underneath that header
        for ind in df.index:
            # 4. log the row coordinate
            rangerow = ind + 2
            # 5. get the original value of that coordinate
            oldval = df[x][ind]
            for count, y in enumerate(oldval):
                # 6. generate replacement value
                newval = df.replace({y: mydictionary}, inplace=True, regex=True, value=None)
                print("old: " + str(oldval) + " new: " + str(newval))
                # 7. update the cell
                ws.cell(row=rangerow, column=col).value = newval
            else:
                print("not in the string")
    else:
        # print(df)
        print("column doesn't exist in workbook, moving on")
else:
    print("done")
wb.save(filepath)
wb.close()
I know there's something going on with enumerate, and I'm probably not stitching the string back together after I do the replacements? Or maybe a dictionary is the wrong solution for what I am trying to do; the key:value pairs are what led me to use it. I have a little programming background but very little with Python. Appreciate any help.
"newval never gets created and I don't know why."
DataFrame.replace with inplace=True will return None.
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> df = df.replace('ABC1', '999')
>>> df
  Code1
0   999
1  B5CD
2  C3DE
>>> q = df.replace('999', 'zzz', inplace=True)
>>> print(q)
None
>>> df
  Code1
0   zzz
1  B5CD
2  C3DE
An alternative could be to use str.translate on the column (via its .str accessor) to translate the entire Series:
>>> df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})
>>> mydictionary = {'A': 'B', 'B': 'C', 'C': 'D'}
>>> table = str.maketrans('ABC', 'BCD')
>>> df
  Code1
0  ABC1
1  B5CD
2  C3DE
>>> df.Code1.str.translate(table)
0    BCD1
1    C5DD
2    D3DE
Name: Code1, dtype: object
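Since the question's mapping already lives in a dict, it may help to know that str.maketrans also accepts a dict of single-character replacements directly. A minimal sketch; the 'D': 'E' and 'E': 'F' entries are assumptions inferred from the question's expected output:
import pandas as pd

df = pd.DataFrame({'Code1': ['ABC1', 'B5CD', 'C3DE']})

# extended so every letter appearing in the data has a replacement;
# 'D': 'E' and 'E': 'F' are inferred from the expected output above
mydictionary = {'A': 'B', 'B': 'C', 'C': 'D', 'D': 'E', 'E': 'F'}

table = str.maketrans(mydictionary)  # a dict of 1-char strings is accepted
df['Code1'] = df['Code1'].str.translate(table)
print(df)
#   Code1
# 0  BCD1
# 1  C5DE
# 2  D3EF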

Filtering out a row value based on column value using Pandas

How do I filter out a row's value in column B when another column C contains specific text, say "ABC"? In this case "google.com" would be filtered out.
       A             B    C        D
0   True  facebook.com  kxy    19999
1   True    google.com  ABC    21212
2  False     yahoo.com  PoP  3213231
Every time there is "ABC" in col C, the row's value from col B should be appended to a list.
pseudocode:
dataset = pd.read_csv('xyz.csv')
path = []
for value in dataset.C:
    if dataset['C'] == 'abc':
        # append path with row value of Col B
    else:
        # do not append path
path = dataset.loc[dataset.C == 'ABC', 'B'].tolist()
will give you the desired list in one go.
As an alternative, you can use where and list:
path = list(dataset.B.where(dataset.C == 'ABC').dropna())
print(path)
# ['google.com']
inds below will be a pandas Series of boolean values, indicating whether each row's value in column 'C' equals 'ABC'. Once we have that, we can subset dataset and take the values of column 'B':
inds = dataset['C'] == 'ABC'
list(dataset.loc[inds, 'B'])
Direct answer:
filtered_values = dataset.loc[dataset["C"]=='ABC']['B'].tolist()
For understanding purposes:
First get the rows where C="ABC"
filtered_rows = dataset.loc[dataset["C"]=='ABC']
filtered_rows
Output:
      A           B    C      D
1  True  google.com  ABC  21212
From these rows, get the values of only column B and convert this Series into a list with the .tolist() function:
filtered_values = filtered_rows["B"].tolist()
filtered_values
Output:
['google.com']
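To see it end to end, here is a self-contained run against the question's toy data (the DataFrame construction is mine, rebuilt from the table above):
import pandas as pd

dataset = pd.DataFrame({'A': [True, True, False],
                        'B': ['facebook.com', 'google.com', 'yahoo.com'],
                        'C': ['kxy', 'ABC', 'PoP'],
                        'D': [19999, 21212, 3213231]})

# boolean mask on C, then take column B and convert it to a list
path = dataset.loc[dataset.C == 'ABC', 'B'].tolist()
print(path)  # ['google.com']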

Pandas use cell value as dict key to return dict value

My question relates to using the values in a dataframe column as keys in order to return their respective values and run a conditional.
I have a dataframe, df, containing a column "count" that has integers from 1 to 8 and a column "category" that has values either "A", "B", or "C"
I have a dictionary, dct, containing pairs A:2, B:4, C:6
This is my (incorrect) code:
result = df[df["count"] >= dct.get(df["category"])]
So I want to return a dataframe where the "count" value for a given row is equal to or more than the value retrieved from the dictionary using the "category" letter in the same row.
So if there were count values of (1, 2, 6, 6) and category values of (A, B, C, A), the third and fourth rows would be returned in the resultant dataframe.
How do I modify the above code to achieve this?
A good way to go is to add your dictionary into the existing dataframe and then apply a query on the new dataframe:
import pandas as pd
df = pd.DataFrame(data={'count': [4, 5, 6], 'category': ['A', 'B', 'C']})
dct = {'A':5, 'B':4, 'C':-1}
df['min_count'] = df['category'].map(dct)
df = df.query('count>min_count')
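On this toy frame the query keeps the 'B' and 'C' rows (5 > 4 and 6 > -1). One caveat: this answer filters with a strict >, while the question asked for "equal to or more than", so 'count >= min_count' may be what you actually want.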
Following your logic:
import pandas as pd

dct = {'A': 2, 'B': 4, 'C': 6}
df = pd.DataFrame({'count': [1, 2, 5, 6],
                   'category': ['A', 'B', 'C', 'A']})
print('original dataframe')
print(df)
def process_row(x):
    return x['count'] >= dct[x['category']]

f = df.apply(process_row, axis=1)
df = df[f]
print('final output')
print(df)
output:
original dataframe
   count category
0      1        A
1      2        B
2      5        C
3      6        A
final output
   count category
3      6        A
A small modification to your code:
result = df[df['count'] >= df['category'].apply(lambda x: dct[x])]
You cannot directly use dct.get(df['category']) because df['category'] returns a Series, which is unhashable and therefore cannot be used as a dictionary key (dictionary keys need to be hashable objects).
So, apply and lambda to the rescue! :)
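For completeness, Series.map performs the same per-row dictionary lookup without a Python-level lambda. A minimal sketch using the counts and categories from the question:
import pandas as pd

dct = {'A': 2, 'B': 4, 'C': 6}
df = pd.DataFrame({'count': [1, 2, 6, 6],
                   'category': ['A', 'B', 'C', 'A']})

# map turns each category into its threshold, giving a per-row comparison
result = df[df['count'] >= df['category'].map(dct)]
print(result)
#    count category
# 2      6        C
# 3      6        A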

Change entire pandas Series based on conditions

In my pandas DataFrame I want to add a new column (NewCol), based on some conditions that follow from data of another column (OldCol).
To be more specific, my column OldCol contains three types of strings:
BB_sometext
sometext1
sometext 1
I want to differentiate between these three types of strings. Right now, I did this using the following code:
df['NewCol'] = pd.Series()
for i in range(0, len(df)):
    if str(df.loc[i, 'OldCol']).split('_')[0] == "BB":
        df.loc[i, 'NewCol'] = "A"
    elif len(str(df.loc[i, 'OldCol']).split(' ')) == 1:
        df.loc[i, 'NewCol'] = "B"
    else:
        df.loc[i, 'NewCol'] = "C"
Even though this code seems to work, I'm sure there is a better way to do something like this, as this seems very inefficient. Does anyone know a better way to do this? Thanks in advance.
In general, you need something like the following formulation:
>>> df.loc[boolean_test, 'NewCol'] = desired_result
Or, for multiple conditions (Note the parentheses around each condition, and the rather unpythonic & instead of and):
>>> df.loc[(boolean_test1) & (boolean_test2), 'NewCol'] = desired_result
Example
Let's start with an example DataFrame:
>>> df = pd.DataFrame(dict(OldCol=['sometext1', 'sometext 1', 'BB_ccc', 'sometext1']))
Then you'd do:
>>> df.loc[df['OldCol'].str.split('_').str[0] == 'BB', 'NewCol'] = "A"
to set all BB_ rows to A. You could even (optionally, for readability) separate out the boolean condition onto its own line:
>>> oldcol_starts_BB = df['OldCol'].str.split('_').str[0] == 'BB'
>>> df.loc[oldcol_starts_BB, 'NewCol'] = "A"
I like this method because it means the reader doesn't have to work out the logic hidden within the split('_').str[0] part.
Then, to set all rows with no space, which are still not set (i.e. where isnull is true), to B:
>>> oldcol_has_no_space = df['OldCol'].str.find(' ') < 0
>>> newcol_is_null = df['NewCol'].isnull()
>>> df.loc[(oldcol_has_no_space) & (newcol_is_null), 'NewCol'] = 'B'
Then finally, set all remaining values of NewCol to C:
>>> df.loc[df['NewCol'].isnull(), 'NewCol'] = 'C'
>>> df
       OldCol NewCol
0   sometext1      B
1  sometext 1      C
2      BB_ccc      A
3   sometext1      B
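As a compact alternative to the step-by-step masking above, numpy.select evaluates a list of conditions in order and falls through to a default. A minimal sketch on the same toy frame (the numpy variant is my addition, not part of the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(OldCol=['sometext1', 'sometext 1', 'BB_ccc', 'sometext1']))

conditions = [
    df['OldCol'].str.startswith('BB_'),  # "BB_sometext"     -> A
    ~df['OldCol'].str.contains(' '),     # no space, not BB_ -> B
]
df['NewCol'] = np.select(conditions, ['A', 'B'], default='C')
print(df)
#        OldCol NewCol
# 0   sometext1      B
# 1  sometext 1      C
# 2      BB_ccc      A
# 3   sometext1      B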

Filtering dataframes in pandas : use a list of conditions

I have a pandas dataframe with two columns: 'col1' and 'col2'.
I can filter certain values of those two columns using :
df[ (df["col1"]=='foo') & (df["col2"]=='bar')]
Is there any way I can filter both columns at once ?
I tried naively to use the restriction of the dataframe to two columns, but my best guesses for the second part of the equality don't work:
df[df[["col1","col2"]]==['foo','bar']]
yields me this error
ValueError: Invalid broadcasting comparison [['foo', 'bar']] with block values
I need to do this because both the names and the number of columns on which the conditions will be set vary.
To the best of my knowledge, there is no way in Pandas to do exactly what you want. However, although the following solution may not be the prettiest, you can zip a set of parallel lists as follows:
cols = ['col1', 'col2']
conditions = ['foo', 'bar']
df[eval(" & ".join(["(df['{0}'] == '{1}')".format(col, cond)
                    for col, cond in zip(cols, conditions)]))]
The string join results in the following:
>>> " & ".join(["(df['{0}'] == '{1}')".format(col, cond)
for col, cond in zip(cols, conditions)])
"(df['col1'] == 'foo') & (df['col2'] == 'bar')"
Which you then use eval to evaluate, effectively:
df[eval("(df['col1'] == 'foo') & (df['col2'] == 'bar')")]
For example:
df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
>>> df
  col1  col2
0  foo   bar
1  bar  spam
2  baz   ham
>>> df[eval(" & ".join(["(df['{0}'] == {1})".format(col, repr(cond))
...                     for col, cond in zip(cols, conditions)]))]
  col1 col2
0  foo  bar
I would like to point out an alternative to the accepted answer, as eval is not necessary for solving this problem:
import pandas as pd
from functools import reduce

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
cols = ['col1', 'col2']
values = ['foo', 'bar']
conditions = list(zip(cols, values))  # list, so it can be sized and reused

def apply_conditions(df, conditions):
    assert len(conditions) > 0
    comps = [df[c] == v for c, v in conditions]
    result = comps[0]
    for comp in comps[1:]:
        result &= comp
    return result

# equivalent, written with functools.reduce
def apply_conditions(df, conditions):
    assert len(conditions) > 0
    comps = [df[c] == v for c, v in conditions]
    return reduce(lambda c1, c2: c1 & c2, comps[1:], comps[0])

df[apply_conditions(df, conditions)]
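Both definitions build the same boolean mask, so df[apply_conditions(df, conditions)] keeps only row 0 of the toy frame, where col1 == 'foo' and col2 == 'bar'.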
I know I'm late to the party on this one, but if you know that all of your conditions will use the same comparison, you can use functools.reduce. I have a CSV with something like 64 columns, and I have no desire whatsoever to copy and paste them all. This is how I resolved it:
from functools import reduce
import pandas as pd

players = pd.read_csv('players.csv')
# I only want players who have any of the outfield stats over 0.
# That means they have to be an outfielder.
column_named_outfield = lambda x: x.startswith('outfield')
# If a column name starts with outfield, then it is an outfield stat.
# So only include those columns
outfield_columns = filter(column_named_outfield, players.columns)
# Column must have a positive value
has_positive_value = lambda c: players[c] > 0
# We're looking to create a series of filters, so use "map"
list_of_positive_outfield_columns = map(has_positive_value, outfield_columns)
# Given two DF filters, this returns a third representing the "or" condition.
concat_or = lambda x, y: x | y
# Apply the filters through reduce to create a primary filter
is_outfielder_filter = reduce(concat_or, list_of_positive_outfield_columns)
outfielders = players[is_outfielder_filter]
Posting because I ran into a similar issue and found a solution that gets it done in one line, albeit a bit inefficiently:
cols, vals = ["col1","col2"],['foo','bar']
pd.concat([df.loc[df[cols[i]] == vals[i]] for i in range(len(cols))], join='inner')
This is effectively an & across the columns. To have an | across the columns you can omit join='inner' and add a drop_duplicates() at the end.
