My code scrapes information from a website and puts it into a DataFrame, but I'm not sure why the order of the operations gives rise to this error: AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas.
Basically, the data scraped has over 20 rows and 10 columns.
Some values are within brackets, e.g. (2,333), and I want to change them to -2333.
Some values are n.a. and I want to change them to numpy.nan.
Some values are - and I want to change them to numpy.nan too.
Doesn't Work
for final_df, engine_name in zip((df_foo, df_bar, df_far), ('engine_foo', 'engine_bar', 'engine_far')):
    # Replacing necessary items for final clean up
    final_df.replace('-', numpy.nan, inplace=True)
    final_df.replace('n.a.', numpy.nan, inplace=True)
    for i in final_df.columns:
        final_df[i] = final_df[i].str.replace(')', '')
        final_df[i] = final_df[i].str.replace(',', '')
        final_df[i] = final_df[i].str.replace('(', '-')
    # Appending Code to dataframe
    final_df = final_df.T
    final_df.insert(loc=0, column='Code', value=some_code)
# This produces the error - AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Works
for final_df, engine_name in zip((df_foo, df_bar, df_far), ('engine_foo', 'engine_bar', 'engine_far')):
    # Replacing necessary items for final clean up
    for i in final_df.columns:
        final_df[i] = final_df[i].str.replace(')', '')
        final_df[i] = final_df[i].str.replace(',', '')
        final_df[i] = final_df[i].str.replace('(', '-')
    final_df.replace('-', numpy.nan, inplace=True)
    final_df.replace('n.a.', numpy.nan, inplace=True)
    # Appending Code to dataframe
    final_df = final_df.T
    final_df.insert(loc=0, column='Code', value=some_code)
# This doesn't give me any errors and returns me what I want.
Any thoughts on why this happens?
A double replace works for me: the first with regex=True to replace substrings, the second for whole values:
np.random.seed(23)
df = pd.DataFrame(np.random.choice(['(2,333)','n.a.','-',2.34], size=(3,3)),
columns=list('ABC'))
print (df)
A B C
0 2.34 - (2,333)
1 n.a. - (2,333)
2 2.34 n.a. (2,333)
df1 = df.replace([r'\(', r'\)', ','], ['-', '', ''], regex=True).replace(['-', 'n.a.'], np.nan)
print(df1)
A B C
0 2.34 NaN -2333
1 NaN NaN -2333
2 2.34 NaN -2333
df1 = df.replace(['-', 'n.a.'], np.nan).replace([r'\(', r'\)', ','], ['-', '', ''], regex=True)
print(df1)
A B C
0 2.34 NaN -2333
1 NaN NaN -2333
2 2.34 NaN -2333
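After the cleanup the values are still strings; a small follow-up sketch (assuming df1 from above) converts them to real numbers:
df1 = df1.apply(pd.to_numeric)  # e.g. '-2333' and '2.34' become numeric, NaN stays NaN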
EDIT:
Your error means you are calling str.replace on a non-string column (here all values in column B become NaN, so its dtype is no longer object):
df1 = df.apply(lambda x: x.str.replace(r'\(', '-').str.replace(r'\)', '')
                          .str.replace(',', '')).replace(['-', 'n.a.'], np.nan)
print(df1)
A B C
0 2.34 NaN -2333
1 NaN NaN -2333
2 2.34 NaN -2333
df1 = (df.replace(['-', 'n.a.'], np.nan)
         .apply(lambda x: x.str.replace(r'\(', '-')
                           .str.replace(r'\)', '')
                           .str.replace(',', '')))
print(df1)
AttributeError: ('Can only use .str accessor with string values, which use np.object_ dtype in pandas', 'occurred at index B')
The dtype of column B is float64:
df1 = df.replace(['-','n.a.'], np.nan)
print(df1)
A B C
0 2.34 NaN (2,333)
1 NaN NaN (2,333)
2 2.34 NaN (2,333)
print (df1.dtypes)
A object
B float64
C object
dtype: object
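A defensive variant (a sketch, not from the original answer, with df1 from the replace above): restrict the .str.replace chain to the object (string) columns, so an all-NaN float column like B is simply skipped; regex=True is passed explicitly because newer pandas no longer treats these patterns as regex by default:
str_cols = df1.select_dtypes(include='object').columns  # skips float64 column B
df1[str_cols] = df1[str_cols].apply(lambda x: x.str.replace(r'\(', '-', regex=True)
                                               .str.replace(r'\)', '', regex=True)
                                               .str.replace(',', '', regex=False))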
Related
I am trying to split a column into two columns based on a delimiter. The column presently has text separated by '-'. Some of the values in the column are NaN, so when I run the code below, I get the following error message: ValueError: Columns must be same length as key.
I don't want to delete the NaN values, but I am not sure how to skip them so that the splitting works.
The code I have right now is:
df[['A','B']] = df['A'].str.split('-',expand=True)
Your code works fine with NaN values, but you have to pass n=1 to str.split:
Suppose this dataframe:
df = pd.DataFrame({'A': ['hello-world', np.nan, 'raise-an-exception']})
print(df)
# Output:
A
0 hello-world
1 NaN
2 raise-an-exception
Reproducible error:
df[['A', 'B']] = df['A'].str.split('-', expand=True)
print(df)
# Output:
...
ValueError: Columns must be same length as key
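To see why, inspect the intermediate result: 'raise-an-exception' splits into three parts, so the expanded frame has three columns, which cannot be assigned to the two keys ['A', 'B']:
print(df['A'].str.split('-', expand=True))
# Output:
       0      1          2
0  hello  world       None
1    NaN    NaN        NaN
2  raise     an  exception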
Use n=1:
df[['A', 'B']] = df['A'].str.split('-', n=1, expand=True)
print(df)
# Output:
A B
0 hello world
1 NaN NaN
2 raise an-exception
An alternative is to generate more columns:
df1 = df['A'].str.split('-', expand=True)
df1.columns = df1.columns.map(lambda x: chr(x+65))
print(df1)
# Output:
A B C
0 hello world None
1 NaN NaN NaN
2 raise an exception
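If you want these generated columns back on the original frame, one possible follow-up (a sketch: drop the old A and join the expanded columns):
df = df.drop(columns='A').join(df1)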
Maybe filter them out with loc:
df.loc[df['A'].notna(), ['A','B']] = df.loc[df['A'].notna(), 'A'].str.split('-',expand=True)
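Note the assignment only lines up when the split yields exactly two columns; combining the notna filter with n=1 keeps it consistent. A minimal sketch, assuming the same df as above (.values sidesteps column-label alignment in the assignment):
mask = df['A'].notna()
df['B'] = np.nan  # create the target column first so .loc can assign to both labels
df.loc[mask, ['A', 'B']] = df.loc[mask, 'A'].str.split('-', n=1, expand=True).values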
I have a dataframe containing two columns; A_ID and R_ID.
I want to update R_ID so that it only contains values that are also in A_ID; the rest should be deleted (replaced by NaN). The values should stay at the same position/index. I understand this is essentially an inner join, but with my proposed solutions I ran into several problems.
Example:
import numpy as np
import pandas as pd
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7', [np.nan], [np.nan], '1E4']}
df = pd.DataFrame(data)
print(df)
I tried
df_A_ID = df[["A_ID"]]
df_R_ID = df[["R_ID"]]
new_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID', right_index=True)
and
new_df = pd.concat([df_A_ID, df_R_ID], join="inner")
But with the first option I get a "You are trying to merge on object and int64 columns" error, even though both columns are of dtype object, and with the second one I get an empty DataFrame.
My expected output would be the same dataframe as before but with R_ID only containing values that are also in the column A_ID, at the same index/position:
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': [[np.nan],[np.nan],[np.nan],"1E4",]}
df = pd.DataFrame(data)
print(df)
Set values to NaN with Series.where if they have no match in the other column, testing membership with Series.isin:
#solution working with scalar NaNs
data = {'A_ID': ['1E2', '1E3', '1E4', '1E5'], 'R_ID': ['1E7',np.nan,np.nan,"1E4",]}
df = pd.DataFrame(data)
print(df)
A_ID R_ID
0 1E2 1E7
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
df['R_ID'] = df['R_ID'].where(df["R_ID"].isin(df["A_ID"]))
print(df)
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
Or:
df.loc[~df["R_ID"].isin(df["A_ID"]), 'R_ID'] = np.nan
Use isin:
df['R_ID'] = df['R_ID'].loc[df['R_ID'].isin(df['A_ID'])]
>>> df
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 NaN
3 1E5 1E4
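This works because the boolean indexing keeps only the matching rows, and assigning the result back to df['R_ID'] re-aligns on the index, so every non-matching position becomes NaN. The intermediate, with the scalar-NaN data above:
>>> df['R_ID'].loc[df['R_ID'].isin(df['A_ID'])]
3    1E4
Name: R_ID, dtype: object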
This should work:
df_A_ID = df[["A_ID"]].astype(dtype=pd.StringDtype())
df_R_ID = df[["R_ID"]].astype(dtype=pd.StringDtype()).reset_index()
temp_df = pd.merge(df_A_ID, df_R_ID, how='inner', left_on='A_ID', right_on ='R_ID').set_index('index')
df.loc[~(df_R_ID.isin(temp_df[['R_ID']])['R_ID']).fillna(False),'R_ID'] = [np.nan]
Output
A_ID R_ID
0 1E2 NaN
1 1E3 NaN
2 1E4 1E4
3 1E5 NaN
I have very inconsistent data in one of my DataFrame columns:
col1
12.0
13,1
NaN
20.3
abc
"12,5"
200.9
I need to standardize these data and find a maximum value among numeric values, which should be less than 100.
This is my code:
df["col1"] = df["col1"].apply(lambda x: float(str(x).replace(',', '.')) if x.isdigit() else x)
num_temps = pd.to_numeric(df[col],errors='coerce')
temps = num_temps[num_temps<10]
print(temps.max())
It fails when x is a float: AttributeError: 'float' object has no attribute 'isdigit'.
Cast the value to string with str(x), but then for the test it is also necessary to strip . and , before calling isdigit:
df["col1"] = df["col1"].apply(lambda x: float(str(x).replace(',', '.')) if str(x).replace(',', '').replace('.', '').isdigit() else x)
But here it is simpler to cast the values to strings and then use Series.str.replace with pd.to_numeric:
num_temps = pd.to_numeric(df["col1"].astype(str).str.replace(',', '.'), errors='coerce')
print (num_temps)
0     12.0
1     13.1
2      NaN
3     20.3
4      NaN
5     12.5
6    200.9
Name: col1, dtype: float64
temps = num_temps[num_temps<100]
print(temps.max())
20.3
Alternative:
def f(x):
    try:
        return float(str(x).replace(',', '.'))
    except ValueError:
        return np.nan
num_temps = df["col1"].apply(f)
print (num_temps)
0 12.0
1 13.1
2 NaN
3 20.3
4 NaN
5 12.5
6 200.9
Name: col1, dtype: float64
This works:
df.replace(",", ".", regex=True).replace("[a-zA-Z]+", np.NaN, regex=True).dropna().max()
I want to append a Series to a DataFrame, where the Series's index matches the DataFrame's columns, using pd.concat, but it gives me a surprise:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and a transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will then be (the 'c' entry is dropped):
a b
1 1 2
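If you need to keep the extra key instead of dropping it, the to_frame().T route from above preserves it by adding a new column; a minimal sketch with this df and sr:
out = pd.concat([df, sr.to_frame(1).T])
print(out)
   a  b  c
1  1  2  3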
Can't figure out why .dropna() is not dropping rows with NaN values.
Help please. I've gone through the pandas documentation and don't know what I'm doing wrong.
import pandas as pd
import quandl
df = quandl.get("GOOG/NYSE_SPY")
df2 = quandl.get("YAHOO/AAPL")
date = pd.date_range('2010-01-01', periods = 365)
df3 = pd.DataFrame(index = date)
df3 = df3.join(df['Open'], how = 'inner')
df3.rename(columns = {'Open': 'SPY'}, inplace = True)
df3 = df3.join(df2['Open'], how = 'inner')
df3.rename(columns = {'Open': 'AAPL'}, inplace = True)
df3['Spread'] = df3['SPY'] / df3['AAPL']
df3 = df3 / df3.ix[0]
df3.dropna(how = 'any')
df3.plot()
print(df3)
Change df3.dropna(how = 'any') to df3 = df3.dropna(how = 'any'): dropna returns a new DataFrame by default rather than modifying df3 in place.
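Both forms below fix it; a small sketch:
df3 = df3.dropna(how='any')           # re-assign the returned frame
df3.dropna(how='any', inplace=True)   # or mutate in place instead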
I tried to replicate your problem with a simple csv file:
In [6]: df
Out[6]:
a b
0 1.0 3.0
1 2.0 NaN
2 NaN 6.0
3 5.0 3.0
Both df.dropna(how='any') as well as df1 = df.dropna(how='any') work. Even just df.dropna() works. I am wondering whether your issue is because you are performing a division in the previous line:
df3 = df3 / df3.ix[0]
df3.dropna(how = 'any')
For instance, if I divide by df.ix[1], then since one of its elements is NaN, every element of the corresponding column in the result becomes NaN, and removing NaNs with dropna afterwards removes all rows:
In [17]: df.ix[1]
Out[17]:
a 2.0
b NaN
Name: 1, dtype: float64
In [18]: df2 = df / df.ix[1]
In [19]: df2
Out[19]:
a b
0 0.5 NaN
1 1.0 NaN
2 NaN NaN
3 2.5 NaN
In [20]: df2.dropna()
Out[20]:
Empty DataFrame
Columns: [a, b]
Index: []
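One minimal way around this, as a sketch: drop the NaN rows before normalizing, so the division cannot spread NaNs across whole columns (.iloc replaces the long-deprecated .ix):
df3 = df3.dropna(how='any')
df3 = df3 / df3.iloc[0]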