Remove substring from multiple string columns in a pandas DataFrame - python

I have a list of columns in a dataframe that I want to run through and perform an operation on. The columns hold datetimes or nothing.
For each column in the list, I would like to trim every value that contains "20" to its first 10 characters, and otherwise leave it as is.
I've tried this a few ways, but get a variety of errors or imperfect results.
The following version throws the error "'str' object has no attribute 'apply'", but if I don't use .astype(str), then I get "argument of type 'datetime.datetime' is not iterable" instead.
df_combined[dateColumns] = df_combined[dateColumns].fillna(notFoundText).astype(str)
print(dateColumns)
for column in dateColumns:
    for row in range(len(column)):
        print(df_combined[column][row])
        if "20" in df_combined[column][row]:
            df_combined[column][row].apply(lambda x: x[:10], axis=1)
            print(df_combined[column][row])
Help! Thanks in advance.

Loops are considered an abomination in pandas. I'd recommend just doing something like this, with str.contains + np.where.
import numpy as np

for c in df.columns:
    # df[c] = df[c].astype(str)  # uncomment this if your columns aren't dtype=str
    df[c] = np.where(df[c].str.contains("20", na=False), df[c].str[:10], df[c])
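As a self-contained sketch of that approach (the column names and values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: datetime-like strings mixed with a placeholder
df = pd.DataFrame({
    "start": ["2021-01-05 09:30:00", "n/a"],
    "end":   ["n/a", "2022-03-17 18:00:00"],
})

for c in df.columns:
    # na=False keeps missing values from tripping up np.where
    df[c] = np.where(df[c].str.contains("20", na=False), df[c].str[:10], df[c])

print(df)
```

Cells containing "20" are cut to 10 characters ("2021-01-05"), while "n/a" passes through untouched.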

IIUC, you want to do this over the entire dataframe.
If so, here is a vectorized way using numpy on all columns at once.
Setup
df = pd.DataFrame([
    ['xxxxxxxx20yyyy', 'z' * 14, 'wwwwwwww20vvvv'],
    ['k' * 14, 'dddddddd20ffff', 'a' * 14]
], columns=list('ABC'))
df
A B C
0 xxxxxxxx20yyyy zzzzzzzzzzzzzz wwwwwwww20vvvv
1 kkkkkkkkkkkkkk dddddddd20ffff aaaaaaaaaaaaaa
Solution
Using numpy.core.defchararray.find and np.where
from numpy.core.defchararray import find
v = df.values.astype(str)
i, j = np.where(find(v, '20') > -1)
v[i, j] = v[i, j].astype('<U10')
df.loc[:] = v
df
A B C
0 xxxxxxxx20 zzzzzzzzzzzzzz wwwwwwww20
1 kkkkkkkkkkkkkk dddddddd20 aaaaaaaaaaaaaa
If you don't want to overwrite the old dataframe, you can create a new one:
pd.DataFrame(v, df.index, df.columns)
A B C
0 xxxxxxxx20 zzzzzzzzzzzzzz wwwwwwww20
1 kkkkkkkkkkkkkk dddddddd20 aaaaaaaaaaaaaa
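As an aside, numpy.core.defchararray is an internal module path; the same function is exposed publicly as np.char.find in current NumPy. A sketch of the same solution using that spelling:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['xxxxxxxx20yyyy', 'z' * 14, 'wwwwwwww20vvvv'],
    ['k' * 14, 'dddddddd20ffff', 'a' * 14]
], columns=list('ABC'))

v = df.values.astype(str)
i, j = np.where(np.char.find(v, '20') > -1)  # np.char is the documented namespace
v[i, j] = v[i, j].astype('<U10')             # truncate matched cells to 10 chars
out = pd.DataFrame(v, df.index, df.columns)
```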


Pandas Styler conditional formatting (red highlight) on last two rows of a dataframe based off column value [duplicate]

I've been trying to print a Pandas dataframe to HTML and highlight entire rows when the value of one specific column for that row exceeds a threshold. I've looked through the Pandas Styler slicing docs and tried to adapt the highlight_max function for this use, but I seem to be failing miserably; if I replace is_max with a check for whether a given row's value is above the threshold (e.g., something like
is_x = df['column_name'] >= threshold
), it isn't apparent what to pass to the function or what to return.
I've also tried to simply define it elsewhere using df.loc, but that hasn't worked too well either.
Another concern also came up: If I drop that column (currently the criterion) afterwards, will the styling still hold? I am wondering if a df.loc would prevent such a thing from being a problem.
This solution lets you pass a column label or a list of column labels, and highlights the entire row if the value in any of those columns exceeds the threshold.
import pandas as pd
import numpy as np

np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
               axis=1)
df.iloc[0, 2] = np.nan

def highlight_greaterthan(s, threshold, column):
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] >= threshold
    return ['background-color: yellow' if is_max.any() else '' for v in is_max]

df.style.apply(highlight_greaterthan, threshold=1.0, column=['C', 'B'], axis=1)
Output:
Or for one column
df.style.apply(highlight_greaterthan, threshold=1.0, column='E', axis=1)
Here is a simpler approach:
Assume you have a 100 x 10 dataframe df, and you want to highlight all the rows where a column, say "duration", is greater than 5.
You first need to define a function that styles the cells. The real trick is that it must return a full row of styles, not a single cell. For example:
def highlight(s):
    if s.duration > 5:
        return ['background-color: yellow'] * len(s)
    else:
        return ['background-color: white'] * len(s)
Note that the return value should be a list with one entry per column (10 here). This is the key part.
Now you can apply this to the dataframe style as:
df.style.apply(highlight, axis=1)
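A minimal runnable sketch of this pattern (the data and threshold are invented for illustration); the Styler call is shown commented out since it only matters when rendering, and rendering requires jinja2:

```python
import pandas as pd

# Hypothetical data: highlight rows whose "duration" exceeds 5
df = pd.DataFrame({"duration": [3, 7, 6], "name": ["a", "b", "c"]})

def highlight(s):
    # s is one row (axis=1); return one CSS string per column
    if s.duration > 5:
        return ['background-color: yellow'] * len(s)
    else:
        return ['background-color: white'] * len(s)

# Attach it when rendering to HTML:
# styled = df.style.apply(highlight, axis=1)

print(highlight(df.iloc[1]))  # row with duration 7 -> all cells yellow
```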
Assume you have the following dataframe and you want to highlight in red the rows where id is greater than 3:
id char date
0 0 s 2022-01-01
1 1 t 2022-02-01
2 2 y 2022-03-01
3 3 l 2022-04-01
4 4 e 2022-05-01
5 5 r 2022-06-01
You can try Styler.set_properties with pandas.IndexSlice
# Subset your original dataframe with condition
df_ = df[df['id'].gt(3)]
# Pass the subset dataframe index and column to pd.IndexSlice
slice_ = pd.IndexSlice[df_.index, df_.columns]
s = df.style.set_properties(**{'background-color': 'red'}, subset=slice_)
s.to_html('test.html')
You can also try Styler.apply with axis=None which passes the whole dataframe.
def styler(df):
    color = 'background-color: {}'.format
    mask = pd.concat([df['id'].gt(3)] * df.shape[1], axis=1)
    style = np.where(mask, color('red'), color('green'))
    return style

s = df.style.apply(styler, axis=None)

Add character to column based on text condition using pandas

I'm trying to do some data cleaning using pandas. Imagine I have a data frame with a column called "Number" containing data like "1203.10", "4221", "3452.11", etc. I want to add an "M" before the numbers that have a decimal point and end with a zero. In this example, that would mean turning "1203.10" into "M1203.10".
I know how to obtain a data frame containing the numbers with a decimal point and ending with zero.
Suppose the data frame is called "df".
pointzero = '[0-9]+[.][0-9]+[0]$'
pz = df[df.Number.str.match(pointzero)]
But I'm not sure how to add the "M" at the beginning once I have "pz". The only way I know is a for loop, but I think there is a better way. Any suggestions would be great!
You can use boolean indexing:
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
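Put together on a tiny made-up frame, the boolean-indexing version looks like:

```python
import pandas as pd

df = pd.DataFrame({'Number': ['1203.10', '4221', '3452.11']})

pointzero = r'[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)      # True only for '1203.10'
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']

print(df.Number.tolist())  # ['M1203.10', '4221', '3452.11']
```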
Alternatively, using str.replace and a slightly different regex:
pointzero = '([0-9]+[.][0-9]+[0]$)'
df['Number'] = df['Number'].str.replace(pointzero, r'M\1', regex=True)
Example:
Number
0 M1203.10
1 4221
2 3452.11
You should include a DataFrame or Series example in your question. For example:
s1 = pd.Series(["1203.10", "4221","3452.11"])
s1
0    1203.10
1       4221
2    3452.11
dtype: object
str.contains + boolean masking
cond1 = s1.str.contains('[0-9]+[.][0-9]+[0]$')
s1.mask(cond1, 'M'+s1)
output:
0 M1203.10
1 4221
2 3452.11
dtype: object

How to drop the i-th row of a data frame

How do I drop row number i of a DataFrame?
I tried the line below, but it is not working.
DF = DF.drop(i)
So I wonder what I am missing there.
You must pass a label to drop. Here drop tries to use i as a label and fails (with a KeyError) as your index probably has other values. Worse, if the index is composed of integers in random order, you might drop an incorrect row without noticing it.
Use:
df.drop(df.index[i])
Example:
df = pd.DataFrame({'col': range(4)}, index=list('ABCD'))
out = df.drop(df.index[2])
output:
col
A 0
B 1
D 3
pitfall
In case of duplicated indices, you might remove unwanted rows!
df = pd.DataFrame({'col': range(4)}, index=list('ABAD'))
out = df.drop(df.index[2])
output (A is incorrectly dropped!):
col
B 1
D 3
workaround:
import numpy as np
out = df[np.arange(len(df)) != i]
drop several indices by position:
import numpy as np
out = df[~np.isin(np.arange(len(df)), [i, j])]
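Demonstrating the duplicate-safe positional workaround on the same example index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': range(4)}, index=list('ABAD'))

i = 2
out = df[np.arange(len(df)) != i]                 # drops only the third row
out2 = df[~np.isin(np.arange(len(df)), [0, 2])]   # drops rows 0 and 2 by position

print(out.index.tolist())  # ['A', 'B', 'D']
```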
You need to add square brackets:
df = df.drop([i])
Try This:
df.drop(df.index[i])

Pandas: Creating a list based on the differences between 2 series

I am writing a custom error message when 2 Pandas series are not equal and want to use '<' to point at the differences.
Here's the workflow for a failed equality:
Convert both lists to pandas: pd.Series(list)
Side-by-side comparison in a dataframe: table = pd.concat([list1, list2], axis=1)
Add column and index names: table.columns = ['...', '...'], table.index = ['...', '...']
Current output:
|Yours|Actual|
|1|1|
|2|2|
|4|3|
Desired output:
|Yours|Actual|-|
|1|1||
|2|2||
|4|3|<|
The naive solution is iterating through each list index and, if the values are not equal, appending '<' to another list and passing that to pd.concat(), but I am looking for a method using Pandas. For example,
error_series = '<' if (abs(yours - actual) >= 1).all(axis=None) else ''
Ideally it would append '<' to a list if the difference between the results is greater than the Margin of Error of 1, otherwise append nothing
Note: Removed the tables because StackOverflow was being picky and not letting me post my question
You can create the DF and give index and column names in one line:
import pandas as pd
list1 = [1,2,4]
list2 = [1,2,10]
df = pd.DataFrame(zip(list1, list2), columns=['Yours', 'Actual'])
Create a boolean mask to find the rows that have a too large difference:
margin_of_error = 1
mask = df.diff(axis=1)['Actual'].abs()>margin_of_error
Add a column to the DF and map the mask to the marker you want:
df['too_different'] = mask.map({True: '<', False: ''})
output:
Yours Actual too_different
0 1 1
1 2 2
2 4 10 <
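The mask-plus-marker steps above can also be collapsed into a single np.where call (a sketch on the same data):

```python
import numpy as np
import pandas as pd

list1 = [1, 2, 4]
list2 = [1, 2, 10]
df = pd.DataFrame(zip(list1, list2), columns=['Yours', 'Actual'])

margin_of_error = 1
df['too_different'] = np.where(
    (df['Yours'] - df['Actual']).abs() > margin_of_error, '<', '')

print(df['too_different'].tolist())  # ['', '', '<']
```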
or you can do something like this:
df = df.assign(diffr=df.apply(lambda x: '<'
                              if abs(x['Yours'] - x['Actual']) >= 1
                              else '', axis=1))
print(df)
'''
   Yours  Actual diffr
0      1       1
1      2       2
2      4      10     <

Python Pandas: Resolving "List Object has no Attribute 'Loc'"

I import a CSV as a DataFrame using:
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv")
Then I'm trying to do a simple replace based on IDs:
df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson'
I get the following error:
AttributeError: 'list' object has no attribute 'loc'
Note, when I print pd.__version__ I get 0.12.0, so it's not a problem (at least as far as I understand) with having a pre-0.11 version. Any ideas?
To pick up from the comment: "I was doing this:"
df = [df.hc== 2]
What you create there is a list containing a "mask": a boolean Series that says which rows fulfilled your condition.
To filter your dataframe on your condition you want to do this:
df = df[df.hc == 2]
A bit more explicit is this:
mask = df.hc == 2
df = df[mask]
If you want to keep the entire dataframe and only replace specific values, there are methods such as replace: Python pandas equivalent for replace. Another (performance-wise great) method is creating a separate DataFrame with the from/to values as columns and using pd.merge to combine it into the existing DataFrame. Setting values through your mask is also possible, but use .loc rather than chained indexing so the assignment actually sticks:
df.loc[mask, 'fname'] = 'Johnson'
But for a larger set of replacements you would want to use one of the two other methods, or use apply with a lambda function (for value transformations). Last but not least: you can use .fillna('bla') to rapidly fill up NA values.
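The merge-based approach mentioned above could be sketched like this (the mapping frame and column names are hypothetical):

```python
import pandas as pd

# Original data and a small from/to mapping keyed on ID
df = pd.DataFrame({'ID': [101, 102, 103], 'fname': ['Ann', 'Bob', 'Mike']})
mapping = pd.DataFrame({'ID': [103], 'fname_new': ['Michael']})

merged = df.merge(mapping, on='ID', how='left')
# Where the mapping matched, take the new value; elsewhere keep the old one
merged['fname'] = merged['fname_new'].fillna(merged['fname'])
merged = merged.drop(columns='fname_new')

print(merged['fname'].tolist())  # ['Ann', 'Bob', 'Michael']
```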
The traceback indicates that df is a list, not a DataFrame as expected in your line of code.
It means that somewhere between df = pd.read_csv("test.csv") and df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson', other code assigned a list object to df. Review that code to find your bug.
@Boud's answer is correct. .loc assignment works fine if the right-hand-side list matches the number of elements being replaced:
In [56]: df = DataFrame(dict(A =[1,2,3], B = [4,5,6], C = [7,8,9]))
In [57]: df
Out[57]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [58]: df.loc[1,['A','B']] = -1,-2
In [59]: df
Out[59]:
A B C
0 1 4 7
1 -1 -2 8
2 3 6 9
