I have very inconsistent data in one of my DataFrame columns:
col1
12.0
13,1
NaN
20.3
abc
"12,5"
200.9
I need to standardize this data and find the maximum among the numeric values that are less than 100.
This is my code:
df["col1"] = df["col1"].apply(lambda x: float(str(x).replace(',', '.')) if x.isdigit() else x)
num_temps = pd.to_numeric(df["col1"], errors='coerce')
temps = num_temps[num_temps < 100]
print(temps.max())
It fails when, for example, x is a float: AttributeError: 'float' object has no attribute 'isdigit'.
You can cast the value to a string with str(x), but then for the isdigit test you also have to strip . and , first:
df["col1"] = df["col1"].apply(lambda x: float(str(x).replace(',', '.')) if str(x).replace(',', '').replace('.', '').isdigit() else x)
A simpler option is to cast the values to strings and use Series.str.replace before pd.to_numeric:
num_temps = pd.to_numeric(df["col1"].astype(str).str.replace(',', '.'), errors='coerce')
print (num_temps)
col1
0 12.0
1 13.1
2 NaN
3 20.3
4 NaN
5 12.5
6 200.9
temps = num_temps[num_temps<100]
print(temps.max())
20.3
Alternative:
import numpy as np

def f(x):
    try:
        return float(str(x).replace(',', '.'))
    except ValueError:
        return np.nan
num_temps = df["col1"].apply(f)
print (num_temps)
0 12.0
1 13.1
2 NaN
3 20.3
4 NaN
5 12.5
6 200.9
Name: col1, dtype: float64
This works:
df.replace(",", ".", regex=True).replace("[a-zA-Z]+", np.NaN, regex=True).dropna().max()
Related
I have a data frame which contains some "0" values and looks as below:
df = pd.DataFrame({
    'WARNING': ['4402,43527,0,7628,54337', 4402, 0, 0, '0,1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753,0', '4572,0,8764,8753', 9876, 0, '0,4579,7514']
})
I want to remove the zeroes from the strings where there are multiple values, so that the resulting df looks like this:
df = pd.DataFrame({
    'WARNING': ['4402,43527,7628,54337', 4402, 0, 0, '1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753', '4572,8764,8753', 9876, 0, '4579,7514']
})
However, cells that contain only a single 0 should remain intact. How do I achieve this? This is what I tried:
df = pd.DataFrame({
    'WARNING': ['0,0786,1230,01234,0', 4402, 0, 0, '0,1234,56437,76252', 0, 3602],
    'FAILED': [0, 0, '5555,6753,0', '4572,0,8764,8753', 9876, 0, '0,4579,7514']
})
df.apply(lambda x: x.str.strip('0,|,0')).replace(",0,", ",")
Output:
WARNING FAILED
0 786,1230,01234 NaN
1 NaN NaN
2 NaN 5555,6753
3 NaN 4572,0,8764,8753
4 1234,56437,76252 NaN
5 NaN NaN
6 NaN 4579,7514
I would solve it with a list comprehension.
In [1]: df.apply(lambda col: col.astype(str).apply(lambda x: ','.join([y for y in x.split(',') if y != '0']) if ',' in x else x), axis=0)
Out[1]:
FAILED WARNING
0 0 4402,43527,7628,54337
1 0 4402
2 5555,6753 0
3 4572,8764,8753 0
4 9876 1234,56437,76252
5 0 0
6 4579,7514 3602
Breaking it down:
Iterate over all columns with df.apply(lambda col: ..., axis=0)
Convert each column's values to string with col.astype(str)
Apply a function to each "cell" of col with .apply(lambda x: ...)
The lambda function first checks if ',' exists in x, otherwise returns the original value of x
If ',' in x, it splits x by ',', which creates a list of y's
It keeps only the elements where y != '0'
It joins everything at the end with a ','.join(...)
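For a single cell, the inner expression behaves like this (a quick illustration on one of the sample strings):

x = '4402,43527,0,7628,54337'
','.join([y for y in x.split(',') if y != '0'])
# '4402,43527,7628,54337'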
You can use a regex with a negative lookbehind to replace '0,' only if it is not preceded by another digit.
import re
df.applymap(lambda x: re.sub(r'(?<![0-9])0,', '', str(x)))
WARNING FAILED
0 4402,43527,7628,54337 0
1 4402 0
2 0 5555,6753,0
3 0 4572,8764,8753
4 1234,56437,76252 9876
5 0 0
6 3602 4579,7514
For the test case W-B points out:
s = '0,0999,9990,999'
re.sub(r'(?<![0-9])0,', '', s)
#'0999,9990,999'
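If the trailing ',0' (as in '5555,6753,0') should also be removed to match the desired output exactly, one possible extension (a sketch, not part of the original answer) adds a second alternative with a negative lookahead:

df.applymap(lambda x: re.sub(r'(?<![0-9])0,|,0(?![0-9])', '', str(x)))

A lone '0' cell has no comma around it, so it is left intact.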
How can I sum values in a dataframe that are separated by semicolons?
Got:
col1 col2
2018-03-05 2.1 8
2018-03-06 8 3.1;2
2018-03-07 1;1 8;1
Need:
col1 col2
2018-03-05 2.1 8
2018-03-06 8 5.1
2018-03-07 2 9
You can use apply to process each column - split on ';', cast to float and sum per row:
df = df.apply(lambda x: x.str.split(';', expand=True).astype(float).sum(axis=1))
Or process each value separately with applymap:
df = df.applymap(lambda x: sum(map(float, x.split(';'))))
print (df)
col1 col2
2018-03-05 2.1 8.0
2018-03-06 8.0 5.1
2018-03-07 2.0 9.0
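Both variants assume every value is stored as a string (object dtype); if some cells are already numeric, cast them first. A minimal, self-contained sketch reconstructing the sample frame as strings:

import pandas as pd

df = pd.DataFrame({'col1': ['2.1', '8', '1;1'],
                   'col2': ['8', '3.1;2', '8;1']},
                  index=['2018-03-05', '2018-03-06', '2018-03-07'])

# applymap is deprecated in favour of DataFrame.map in pandas >= 2.1
df = df.applymap(lambda x: sum(map(float, x.split(';'))))
print(df)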
EDIT:
If there is a mix of numeric and string columns, use select_dtypes to exclude the numeric columns and process only the string columns that contain ;:
print (df)
col1 col2 col3
2018-03-05 2.1 8 1
2018-03-06 8 3.1;2 2
2018-03-07 1;1 8;1 8
cols = df.select_dtypes(exclude=np.number).columns
df[cols] = df[cols].apply(lambda x: x.str.split(';', expand=True).astype(float).sum(axis=1))
print (df)
col1 col2 col3
2018-03-05 2.1 8.0 1
2018-03-06 8.0 5.1 2
2018-03-07 2.0 9.0 8
You can use numpy.vectorize if performance is an issue:
res = pd.DataFrame(np.vectorize(lambda x: sum(map(float, x.split(';'))))(df.values),
columns=df.columns, index=df.index)
Performance benchmarking
def jpp(df):
    res = pd.DataFrame(np.vectorize(lambda x: sum(map(float, x.split(';'))))(df.values),
                       columns=df.columns, index=df.index)
    return res

def jez(df):
    return df.applymap(lambda x: sum(map(float, x.split(';'))))
df = pd.concat([df]*1000)
%timeit jpp(df) # 11 ms per loop
%timeit jez(df) # 21.3 ms per loop
You can use:
df['col2'] = df.col2.map(lambda s: sum(float(e) for e in s.split(';')))
My code scrapes information from a website and puts it into a dataframe, but I'm not certain why the order of the operations gives rise to the error: AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Basically, the data scraped has over 20 rows and 10 columns.
Some values are within brackets, e.g. (2,333), and I want to change them to -2333.
Some values are n.a. and I want to change them to numpy.nan.
Some values are - and I want to change them to numpy.nan too.
Doesn't Work
for final_df, engine_name in zip((df_foo, df_bar, df_far), (['engine_foo', 'engine_bar', 'engine_far'])):
    # Replacing necessary items for final clean up
    final_df.replace('-', numpy.nan, inplace=True)
    final_df.replace('n.a.', numpy.nan, inplace=True)
    for i in final_df.columns:
        final_df[i] = final_df[i].str.replace(')', '')
        final_df[i] = final_df[i].str.replace(',', '')
        final_df[i] = final_df[i].str.replace('(', '-')
    # Appending Code to dataframe
    final_df = final_df.T
    final_df.insert(loc=0, column='Code', value=some_code)
    # This produces the error - AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Works
for final_df, engine_name in zip((df_foo, df_bar, df_far), (['engine_foo', 'engine_bar', 'engine_far'])):
    # Replacing necessary items for final clean up
    for i in final_df.columns:
        final_df[i] = final_df[i].str.replace(')', '')
        final_df[i] = final_df[i].str.replace(',', '')
        final_df[i] = final_df[i].str.replace('(', '-')
    final_df.replace('-', numpy.nan, inplace=True)
    final_df.replace('n.a.', numpy.nan, inplace=True)
    # Appending Code to dataframe
    final_df = final_df.T
    final_df.insert(loc=0, column='Code', value=some_code)
    # This doesn't give me any errors and returns me what I want.
Any thoughts on why this happens?
A double replace works for me - the first with regex=True to replace substrings, the second to replace whole values:
np.random.seed(23)
df = pd.DataFrame(np.random.choice(['(2,333)','n.a.','-',2.34], size=(3,3)),
columns=list('ABC'))
print (df)
A B C
0 2.34 - (2,333)
1 n.a. - (2,333)
2 2.34 n.a. (2,333)
df1 = df.replace(['\(','\)','\,'], ['-','',''], regex=True).replace(['-','n.a.'], np.nan)
print(df1)
A B C
0 2.34 NaN -2333
1 NaN NaN -2333
2 2.34 NaN -2333
df1 = df.replace(['-','n.a.'], np.nan).replace(['\(','\)','\,'], ['-','',''], regex=True)
print(df1)
A B C
0 2.34 NaN -2333
1 NaN NaN -2333
2 2.34 NaN -2333
EDIT:
Your error means you are trying to use str.replace on a non-string column (here column B, which became all NaN and is no longer object dtype):
df1 = df.apply(lambda x: x.str.replace('\(','-').str.replace('\)','')
.str.replace(',','')).replace(['-','n.a.'], np.nan)
print(df1)
A B C
0 2.34 NaN -2333
1 NaN NaN -2333
2 2.34 NaN -2333
df1 = (df.replace(['-','n.a.'], np.nan)
         .apply(lambda x: x.str.replace('\(','-')
                           .str.replace('\)','')
                           .str.replace(',','')))
print(df1)
AttributeError: ('Can only use .str accessor with string values, which use np.object_ dtype in pandas', 'occurred at index B')
The dtype of column B is float64 after the first replace:
df1 = df.replace(['-','n.a.'], np.nan)
print(df1)
A B C
0 2.34 NaN (2,333)
1 NaN NaN (2,333)
2 2.34 NaN (2,333)
print (df1.dtypes)
A object
B float64
C object
dtype: object
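If the NaN replacement has to come first, one way to avoid the error is to restrict the .str.replace chain to the columns that are still object dtype - a sketch along those lines (regex=False makes the parentheses literal):

df1 = df.replace(['-', 'n.a.'], np.nan)
# column B is now all NaN and float64, so only touch the object columns
obj_cols = df1.select_dtypes(include='object').columns
df1[obj_cols] = df1[obj_cols].apply(lambda x: x.str.replace('(', '-', regex=False)
                                               .str.replace(')', '', regex=False)
                                               .str.replace(',', '', regex=False))
print(df1)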
How do I apply a custom function to every element of every column if the value is not null?
Let's say I have a data frame of 10 columns, and I want to apply a lower() function to every element of just 4 of those columns if pd.notnull(x), else keep None as the value.
I tried to use it like this:
s.apply(lambda x: change_to_lowercase(x), axis = 1)
def change_to_lowercase(s):
    s['A'] = s['A'].map(lambda x: x.lower() if pd.notnull(x) else x)
    s['B'] = s['B'].map(lambda x: x.lower() if pd.notnull(x) else x)
    s['C'] = s['C'].map(lambda x: x.lower() if pd.notnull(x) else x)
    s['D'] = s['D'].map(lambda x: x.lower() if pd.notnull(x) else x)
    return s
But since my columns are of mixed datatypes (NaN as float, the rest as unicode), this throws an error:
float has no attribute map
How do I get rid of this error?
I think you need DataFrame.applymap because it works elementwise:
L = [[1.5, 'Test', np.nan, 2], ['Test', np.nan, 2,'TEST'], ['Test', np.nan,1.5, 2]]
df = pd.DataFrame(L, columns=list('abcd'))
print (df)
a b c d
0 1.5 Test NaN 2
1 Test NaN 2.0 TEST
2 Test NaN 1.5 2
cols = ['a','b']
#for python 2 change str to basestring
df[cols] = df[cols].applymap(lambda x: x.lower() if isinstance(x, str) else x)
print (df)
a b c d
0 1.5 test NaN 2
1 test NaN 2.0 TEST
2 test NaN 1.5 2
You are applying your function row-wise, so inside it s['A'] is a single value, not a Series, and scalars like floats have no .map method.
You should also check for integers, floats etc. that don't have a .lower() method. So in my opinion it is best to check whether the value is a string, not just whether it is not null.
This works:
s = pd.DataFrame([{'A': 1.5, 'B':"Test", 'C': np.nan, 'D':2}])
s
A B C D
0 1.5 Test NaN 2
s1 = s.apply(lambda x: x[0].lower() if isinstance(x[0], basestring) else x[0]).copy()
s1
A 1.5
B test
C NaN
D 2
dtype: object
For Python 3, check if it is a string with isinstance(x[0], str).
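In Python 3 the same line would be (a direct adaptation of the snippet above):

s1 = s.apply(lambda x: x[0].lower() if isinstance(x[0], str) else x[0]).copy()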
To be able to select columns:
s1 = pd.DataFrame()
columns = ["A", "B"]
for column in columns:
    s1[column] = s[column].apply(lambda x: x.lower() if isinstance(x, str) else x).copy()
s1
A B
0 1.5 test
I'd like to search a pandas DataFrame for minimum values. I need the min in the entire dataframe (across all values), analogous to df.min().min(). However I also need to know the index of the location(s) where this value occurs.
I've tried a number of different approaches:
df.where(df == df.min().min()),
df.where(df == df.min().min()).notnull(), and
val_mask = df == df.min().min(); df[val_mask].
These return a dataframe of NaNs at the non-min locations (or a boolean mask), but I can't figure out a way to get the (row, col) of these locations.
Is there a more elegant way of searching a dataframe for a min/max and returning a list containing all of the locations of the occurrence(s)?
import pandas as pd
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
list_of_lowest = []
for column_name, column in df.iteritems():
    if len(df[column == df.min().min()]) > 0:
        print(column_name, column.where(column == df.min().min()).dropna())
        list_of_lowest.append([column_name, column.where(column == df.min().min()).dropna()])
list_of_lowest
output: [['x', 2 -1.0
Name: x, dtype: float64]]
Based on your revised update:
In [209]:
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,-1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
df
Out[209]:
x y z
0 1 3 4
1 2 5 2
2 -1 -1 3
Then the following would work:
In [211]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna()
Out[211]:
x y
2 -1.0 -1.0
So this uses the boolean mask on the df:
In [212]:
df[df==df.min().min()]
Out[212]:
x y z
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.0 -1.0 NaN
and we call dropna with thresh=1, which drops columns that don't have at least one non-NaN value:
In [213]:
df[df==df.min().min()].dropna(axis=1, thresh=1)
Out[213]:
x y
0 NaN NaN
1 NaN NaN
2 -1.0 -1.0
Probably safer to call again with thresh=1:
In [214]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna(thresh=1)
Out[214]:
x y
2 -1.0 -1.0
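As an aside, if the goal is just a list of (row, col) locations of the global minimum, a more compact sketch of the same masking idea uses stack, which drops the NaN entries of the masked frame:

locations = df[df == df.min().min()].stack().index.tolist()
print(locations)  # [(2, 'x'), (2, 'y')]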