Replacing negative values in specific columns of a dataframe - python

This is driving me crazy! I want to replace all negative values in columns containing string "_p" with the value multiplied by -0.5. Here is the code, where Tdf is a dataframe.
L=list(Tdf.filter(regex='_p').columns)
Tdf[L]=Tdf[L].astype(float)
Tdf[Tdf[L]<0]= Tdf[Tdf[L]<0]*-.5
I get the following error:
"TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value"
I verified that all columns in Tdf[L] are of type float64.
Even more confusing is that when I run essentially the same code, except looping through multiple dataframes, it works:
csv_names = ['Full', 'Missing', 'MinusMissing']
for DF in csv_names:
    L = list(vars()[DF].iloc[:, 1:])
    vars()[DF][L] = vars()[DF][L].astype(float)
    vars()[DF][vars()[DF][L] < 0] = vars()[DF][vars()[DF][L] < 0] * -.5
What am I missing?

Please clarify your question. If your question is about the error,
Tdf[Tdf[L]<0]= Tdf[Tdf[L]<0]*-.5
likely fails due to non-np.nan null values, as the error describes.
If your question is instead "How do I multiply negative values by -0.5 in columns with "_p" in the column name?", try:
for col in Tdf.filter(regex='_p').columns.tolist():
    Tdf[col] = Tdf.apply(lambda row: row[col] * -.5 if row[col] < 0 else row[col], axis=1)
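A row-wise apply works but is slow on large frames. A vectorized alternative is DataFrame.mask, which replaces values where a condition holds and leaves the rest untouched. A minimal sketch with made-up data standing in for Tdf:

```python
import pandas as pd

# Hypothetical dataframe standing in for Tdf
Tdf = pd.DataFrame({"a_p": [-1.0, 2.0], "b_p": [3.0, -4.0], "c": [-5.0, 6.0]})

L = list(Tdf.filter(regex='_p').columns)
# mask() replaces values where the condition is True, leaving the rest as-is;
# columns outside L (like "c") are never touched
Tdf[L] = Tdf[L].mask(Tdf[L] < 0, Tdf[L] * -0.5)
print(Tdf)
```

Because the assignment targets Tdf[L] directly (rather than a boolean-indexed slice of the whole frame), it also sidesteps the mixed-types error above.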

Updated: You just need to filter out the string column, which is of type object, and then work on the data that is left. You can disable the slice warning if you want.
import pandas as pd
import numpy as np
Tdf = pd.DataFrame(columns=["name", "a_p", "b_p", "c"],
                   data=[["a", -1, -2, -3],
                         ["b", 1, -2, 3],
                         ["c", 1, np.nan, 3]])
# get only non object columns
sub_Tdf = Tdf.select_dtypes(exclude='object')
# then work on slice
L = list(sub_Tdf.filter(regex='_p').columns)
sub_Tdf[L] = sub_Tdf[L].astype(float)
sub_Tdf[sub_Tdf[L] < 0] = sub_Tdf[sub_Tdf[L] < 0] * -.5
# see results were applied correctly
print(Tdf)
Output:
  name  a_p  b_p  c
0    a   -1 -2.0 -3
1    b    1 -2.0  3
2    c    1  NaN  3
This will trigger the setting-value-on-slice warning. You can disable it with:
import pandas as pd
pd.options.mode.chained_assignment = None
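Rather than silencing the warning, the changes can be written back into the original Tdf by assigning to Tdf[L] in one step. A sketch of that idea, using where (the mirror image of mask: it keeps values where the condition is True):

```python
import pandas as pd
import numpy as np

Tdf = pd.DataFrame({"name": ["a", "b", "c"],
                    "a_p": [-1, 1, 1],
                    "b_p": [-2.0, -2.0, np.nan],
                    "c": [-3, 3, 3]})

L = list(Tdf.filter(regex='_p').columns)
sub = Tdf[L].astype(float)
# keep non-negative values, replace negatives with value * -0.5;
# NaN fails the >= 0 test but NaN * -0.5 is still NaN, so nulls survive
Tdf[L] = sub.where(sub >= 0, sub * -0.5)
print(Tdf)
```

Since this assigns straight into Tdf, there is no intermediate copy and no SettingWithCopyWarning to suppress.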

Related

Pandas Styler conditional formatting (red highlight) on last two rows of a dataframe based off column value [duplicate]

I've been trying to print out a Pandas dataframe to html and have specific entire rows highlighted if the value of one specific column's value for that row is over a threshold. I've looked through the Pandas Styler Slicing and tried to vary the highlight_max function for such a use, but seem to be failing miserably; if I try, say, to replace the is_max with a check for whether a given row's value is above said threshold (e.g., something like
is_x = df['column_name'] >= threshold
), it isn't apparent how to properly pass such a thing or what to return.
I've also tried to simply define it elsewhere using df.loc, but that hasn't worked too well either.
Another concern also came up: If I drop that column (currently the criterion) afterwards, will the styling still hold? I am wondering if a df.loc would prevent such a thing from being a problem.
This solution allows for you to pass a column label or a list of column labels to highlight the entire row if that value in the column(s) exceeds the threshold.
import pandas as pd
import numpy as np
np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
               axis=1)
df.iloc[0, 2] = np.nan
def highlight_greaterthan(s, threshold, column):
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] >= threshold
    return ['background-color: yellow' if is_max.any() else '' for v in is_max]
df.style.apply(highlight_greaterthan, threshold=1.0, column=['C', 'B'], axis=1)
Output: (styled table with the rows where B or C exceed the threshold highlighted)
Or for one column:
df.style.apply(highlight_greaterthan, threshold=1.0, column='E', axis=1)
Here is a simpler approach:
Assume you have a 100 x 10 dataframe, df. Also assume you want to highlight all the rows corresponding to a column, say "duration", greater than 5.
You first need to define a function that highlights the cells. The real trick is that you need to return a row, not a single cell. For example:
def highlight(s):
    if s.duration > 5:
        return ['background-color: yellow'] * len(s)
    else:
        return ['background-color: white'] * len(s)
Note that the returned value should be a list of 10 entries (corresponding to the number of columns). This is the key part.
Now you can apply this to the dataframe style as:
df.style.apply(highlight, axis=1)
Assume you have the following dataframe and you want to highlight in red the rows where id is greater than 3:
id char date
0 0 s 2022-01-01
1 1 t 2022-02-01
2 2 y 2022-03-01
3 3 l 2022-04-01
4 4 e 2022-05-01
5 5 r 2022-06-01
You can try Styler.set_properties with pandas.IndexSlice
# Subset your original dataframe with condition
df_ = df[df['id'].gt(3)]
# Pass the subset dataframe index and column to pd.IndexSlice
slice_ = pd.IndexSlice[df_.index, df_.columns]
s = df.style.set_properties(**{'background-color': 'red'}, subset=slice_)
s.to_html('test.html')
You can also try Styler.apply with axis=None which passes the whole dataframe.
def styler(df):
    color = 'background-color: {}'.format
    mask = pd.concat([df['id'].gt(3)] * df.shape[1], axis=1)
    style = np.where(mask, color('red'), color('green'))
    return style
s = df.style.apply(styler, axis=None)

Assigning new column based on other columns in Python

In Python I am trying to create a new column (degree) within a dataframe and set its value based on if logic involving two other columns in the dataframe (whether individual rows of one or both of these columns hold null values). For each row, it should assign to the new column the value of one of these two columns, based on the presence of null values.
I have tried the below code, which gives me the following error message:
KeyError: 'degree'
The code is -
for i in basicdataframe.index:
    if pd.isnull(basicdataframe['section_degree'][i]) and pd.isnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['model_degree'][i]
    elif pd.notnull(basicdataframe['section_degree'][i]) and pd.isnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['section_degree'][i]
    elif pd.isnull(basicdataframe['section_degree'][i]) and pd.notnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['model_degree'][i]
    elif pd.notnull(basicdataframe['section_degree'][i]) and pd.notnull(basicdataframe['model_degree'][i]):
        basicdataframe['degree'][i] = basicdataframe['model_degree'][i]
Does anybody know how to achieve this?
Let's say you have pandas Dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
    "section_degree": [1, 2, np.nan, np.nan],
    "model_degree": [np.nan, np.nan, np.nan, 3]
})
You can define a function that will be applied to the DataFrame:
def define_degree(x):
    if pd.isnull(x["section_degree"]) and pd.isnull(x["model_degree"]):
        return x["model_degree"]
    elif pd.notnull(x['section_degree']) and pd.isnull(x['model_degree']):
        return x["section_degree"]
    elif pd.isnull(x['section_degree']) and pd.notnull(x['model_degree']):
        return x["model_degree"]
    elif pd.notnull(x['section_degree']) and pd.notnull(x['model_degree']):
        return x["model_degree"]
df["degree"] = df.apply(define_degree, axis=1)
df
# output
   section_degree  model_degree  degree
0             1.0           NaN     1.0
1             2.0           NaN     2.0
2             NaN           NaN     NaN
3             NaN           3.0     3.0
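It is worth noticing that every branch of the if/elif chain returns model_degree except the one case where only section_degree is present. Under that reading, the whole function collapses to a single fillna; a minimal sketch using the same example data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"section_degree": [1, 2, np.nan, np.nan],
                   "model_degree": [np.nan, np.nan, np.nan, 3]})
# model_degree wins whenever it is present; fall back to section_degree otherwise
df["degree"] = df["model_degree"].fillna(df["section_degree"])
```

This is vectorized, so it avoids both the row loop and the apply call entirely.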
The error occurs because you are trying to assign values inside a column that does not exist yet.
Since you are creating a new column named degree, it makes sense to add the column first with some default value.
basicdataframe['degree'] = ''
This would set an empty string for all rows of the dataframe for this column.
After that, you can set the values.
P.S. Your code is likely to give you warnings about
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
To fix that, you could take help from https://stackoverflow.com/a/20627316/1388513

Change multiple columns in a DataFrame

I am a beginner in Python and made my first venture into Pandas today. What I want to do is to convert several columns from string to float. Here's a quick example:
import numpy as np
import pandas as pd
def convert(str):
    try:
        return float(str.replace(',', ''))
    except:
        return None
df = pd.DataFrame([
    ['A', '1,234', '456,789'],
    ['B', '1',     '---']
], columns=['Company Name', 'X', 'Y'])
I want to convert X and Y to float. The reality has more columns and I don't always know the column names for X and Y so I must use integer indexing.
This works:
df.iloc[:, 1] = df.iloc[:, 1].apply(convert)
df.iloc[:, 2] = df.iloc[:, 2].apply(convert)
This doesn't:
df.iloc[:, 1:2] = df.iloc[:, 1:2].apply(convert)
# Error: could not broadcast input array from shape (2) into shape (2,1)
Is there anyway to apply the convert function on multiple columns at once?
There are several issues with your logic:
The slice 1:2 excludes 2, consistent with list slicing or slice object syntax. Use 1:3 instead.
Applying an element-wise function to a series via pd.Series.apply works. To apply an element-wise function to a dataframe, you need pd.DataFrame.applymap.
Never shadow built-ins: use mystr or x instead of str as a variable or argument name.
When you use a try / except construct, you should generally specify error type(s), in this case ValueError.
Therefore, this is one solution:
def convert(x):
    try:
        return float(x.replace(',', ''))
    except ValueError:
        return None
df.iloc[:, 1:3] = df.iloc[:, 1:3].applymap(convert)
print(df)
  Company Name     X       Y
0            A  1234  456789
1            B     1     NaN
However, your logic is inefficient: you should look to leverage column-wise operations wherever possible. This can be achieved via pd.DataFrame.apply, along with pd.to_numeric applied to each series:
def convert_series(x):
    return pd.to_numeric(x.str.replace(',', ''), errors='coerce')
df.iloc[:, 1:3] = df.iloc[:, 1:3].apply(convert_series)
print(df)
  Company Name     X       Y
0            A  1234  456789
1            B     1     NaN
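If the strings come from a CSV in the first place, the conversion can often be avoided altogether: read_csv accepts a thousands parameter that strips the separators at parse time. A sketch with made-up inline data standing in for the file:

```python
import io
import pandas as pd

# Inline stand-in for a CSV file; thousands=',' strips separators while parsing
csv_data = 'Company Name,X,Y\nA,"1,234","456,789"\nB,1,---\n'
df = pd.read_csv(io.StringIO(csv_data), thousands=',')

# Y still contains the non-numeric '---', so coerce it separately
df['Y'] = pd.to_numeric(df['Y'].astype(str).str.replace(',', ''), errors='coerce')
```

Columns that parse cleanly (like X here) come back numeric straight away; only the ones with stray sentinel values need a follow-up to_numeric pass.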

retrieve all values in column exceeding 5 strings - pandas dataframe

I have a column in a dataframe, df, where all values should be 5 characters long, but due to an error in my code some have erroneous values whose length is either below or above 5. Is there a way to just retrieve these rows?
For your next question, please provide an example df and an expected output.
df = pd.DataFrame({'a' : [1, 2, 3], 'b' : ["jasdjdj", "abcde", "hmmamamam"]})
df[df.b.str.len() != 5]
#gives:
   a          b
0  1    jasdjdj
2  3  hmmamamam
How does this work for you? This will return a dataframe where values meet the condition.
new_DF= your_df[your_df['COLUMN TO CHECK HERE'].str.len() != 5]
print(new_DF)
I think you're looking for a simple masking operation:
is_len5 = lambda string: len(string) == 5  # avoid shadowing the built-in filter
mask = df[col_to_filter].apply(is_len5)  # returns a boolean Series
new_df = df[mask].copy()  # create a new dataframe of rows with length-5 values
You can apply an opposite filter to find items that aren't length 5 on your original dataframe.
For more details on df.apply() look here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

Python Pandas: Resolving "List Object has no Attribute 'Loc'"

I import a CSV as a DataFrame using:
import numpy as np
import pandas as pd
df = pd.read_csv("test.csv")
Then I'm trying to do a simple replace based on IDs:
df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson'
I get the following error:
AttributeError: 'list' object has no attribute 'loc'
Note, when I do print pd.__version__ I get 0.12.0, so it's not a problem (at least as far as I understand) with having a pre-0.11 version. Any ideas?
To pick up from the comment: "I was doing this:"
df = [df.hc== 2]
What you create there is a "mask": an array with booleans that says which part of the index fulfilled your condition.
To filter your dataframe on your condition you want to do this:
df = df[df.hc == 2]
A bit more explicit is this:
mask = df.hc == 2
df = df[mask]
If you want to keep the entire dataframe and only want to replace specific values, there are methods such as replace: Python pandas equivalent for replace. Another (performance-wise great) method would be creating a separate DataFrame with the from/to values as columns and using pd.merge to combine it into the existing DataFrame. And using your index to set values is also possible:
df.loc[mask, 'fname'] = 'Johnson'
But for a larger set of replaces you would want to use one of the two other methods or use "apply" with a lambda function (for value transformations). Last but not least: you can use .fillna('bla') to rapidly fill up NA values.
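The merge-based replacement mentioned above can be sketched roughly like this (the lookup table and the *_new column names are hypothetical, chosen for illustration):

```python
import pandas as pd

# Hypothetical data: replace fname/lname for specific IDs via a lookup table
df = pd.DataFrame({'ID': [101, 103],
                   'fname': ['Ann', 'Mike'],
                   'lname': ['Smith', 'Jones']})
lookup = pd.DataFrame({'ID': [103],
                       'fname_new': ['Michael'],
                       'lname_new': ['Johnson']})

df = df.merge(lookup, on='ID', how='left')
for col in ['fname', 'lname']:
    # take the replacement where the lookup matched, keep the old value otherwise
    df[col] = df[col + '_new'].fillna(df[col])
df = df.drop(columns=['fname_new', 'lname_new'])
```

One left join handles any number of replacement rows at once, which is why this tends to scale better than per-value assignments.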
The traceback indicates to you that df is a list and not a DataFrame as expected in your line of code.
It means that between df = pd.read_csv("test.csv") and df.loc[df.ID == 103, ['fname', 'lname']] = 'Michael', 'Johnson' you have other lines of code that assign a list object to df. Review that piece of code to find your bug.
@Boud's answer is correct. .loc assignment works fine if the right-hand-side list matches the number of elements being replaced:
In [56]: df = DataFrame(dict(A =[1,2,3], B = [4,5,6], C = [7,8,9]))
In [57]: df
Out[57]:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
In [58]: df.loc[1,['A','B']] = -1,-2
In [59]: df
Out[59]:
A B C
0 1 4 7
1 -1 -2 8
2 3 6 9
