Pandas error "Can only use .str accessor with string values" - python

I have the following input file:
"Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,
And I am reading it in with:
#!/usr/bin/env python
import pandas as pd
import sys
import numpy as np
filename = sys.argv[1]
df = pd.read_csv(filename,header=None)
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
print df
However, I get the error
df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2241, in __getattr__
    return object.__getattribute__(self, name)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 188, in __get__
    return self.construct_accessor(instance)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 528, in _make_str_accessor
    raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
This worked OK in pandas 0.14 but does not work in pandas 0.17.0.

It's happening because the trailing comma in your input leaves the last column empty, so it gets converted to NaN:
In [417]:
t="""'Name',97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[417]:
        0     1   2   3    4   5     6   7    8     9   10   11   12   13  14  \
0  'Name'  97.7  0A  0A  65M  0A  100M  5M  75M  100M  90M  90M  99M  90M  0#

    15   16
0  0N#  NaN
If you slice your range so it stops before the last column, then it works:
In [421]:
for col in df.columns[2:-1]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[421]:
        0     1  2  3   4  5    6  7   8    9  10  11  12  13  14  15   16
0  'Name'  97.7  0  0  65  0  100  5  75  100  90  90  99  90   0   0  NaN
Alternatively you can just select the cols that are object dtype and run the code (skipping the first col as this is the 'Name' entry):
In [428]:
for col in df.select_dtypes([np.object]).columns[1:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
df
Out[428]:
        0     1  2  3   4  5    6  7   8    9  10  11  12  13  14  15   16
0  'Name'  97.7  0  0  65  0  100  5  75  100  90  90  99  90   0   0  NaN
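If you'd rather not hard-code the slice, here is a minimal defensive sketch that skips the all-NaN trailing column automatically (the expand=False argument assumes pandas 0.18 or later):
for col in df.columns[2:]:
    if df[col].dtype == object:  # the empty trailing column is all-NaN float64, so it's skipped
        df[col] = df[col].str.extract(r'(\d+\.*\d*)', expand=False).astype(float)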

I got this error while working in Eclipse. It turned out that the project interpreter had somehow been reset to Python 2.7 (after an update, I believe). Setting it back to Python 3.6 resolved the issue, though not before several crashes, restarts and warnings.
While I know this is not a solution to the problem posed here, I thought it might be useful for others, as I came to this page after searching for this error.

In this case we have to use the str.replace() method on that Series, but first we have to convert it to str type:
df1 = pd.read_csv("C:\\Users\\Gangwar\\Desktop\\competitions\\cancer prediction\\kaggle_to_students.csv")
# df1.Patient contains values like 's125', 's45', 's588', 's244', 's125', 's123'
df1.Patient = df1.Patient.astype(str)
df1['Patient'] = df1['Patient'].str.replace('s', '').astype(int)

Related

Getting an array with empty values in Flask but correct values in a Python notebook

I am making a basic web application which takes the inputs for a logistic regression model and returns the class in which it lies. Here is the code for the prediction:
test_data = pd.Series([battery_power, blue, clock_speed, dual_sim, fc, four_g,
                       int_memory, m_dep, mobile_wt, n_cores, pc, px_height,
                       px_width, ram, sc_h, sc_w, talk_time, three_g,
                       touch_screen, wifi])
df = pd.read_csv("Users\ADMIN\Desktop\project\mobiledata_clean.csv")
df.drop(['Unnamed: 0', 'price_range'], inplace=True, axis=1)
print(df)
print(test_data)
# scaling the values
xpred = np.array((test_data - df.min()) / (df.max() - df.min())).reshape(1, -1)
print(xpred)
The test_data is:
0 842
1 0
2 2.2
3 0
4 1
5 0
6 7
7 0.6
8 188
9 2
10 2
11 20
12 756
13 2549
14 9
15 7
16 19
17 0
18 0
19 1
dtype: object
Here's the dataframe in df (shown as a screenshot in the original post).
I get a (1, 40) array of null values in the xpred variable. Can someone tell me why this is happening?
The data type of test_data is showing as object; try casting it to float and then doing the operations.
For series: s.astype('int64')
For dataframe: df.astype({'col1': 'int64'})
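A minimal sketch of that cast, using the names from the question. Note that pandas aligns on labels when subtracting, so test_data's index must also match df's column names, otherwise every element comes out NaN; the reindexing line below is an assumption about the intended column order:
test_data = test_data.astype(float)  # was dtype: object
test_data.index = df.columns         # align labels with df's columns (assumes matching length and order)
xpred = np.array((test_data - df.min()) / (df.max() - df.min())).reshape(1, -1)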

Weighted Means for columns in Pandas DataFrame including Nan

I am trying to get the weighted mean for each column (A-F) of a pandas DataFrame, with "Value" as the weight. I can only find solutions for categorical problems, which is not what I need.
The comparable solution for normal means would be
df.mean()
Notice the df has NaN both in columns A-F and in "Value".
       A      B      C      D      E      F       Value
0  17656  61496     83     80    117     99     2902804
1  75078  61179     14      3      6     14     3761964
2  21316  60648     86    NaN    107     93      127963
3   6422  48468  28855  26838  27319  27011      131354
4  12378  42973  47153  46062  46634  42689  3303909572
5  54292  35896     59      6      3     18    27666367
6  21272    NaN    126     12      3      5     9618047
7  26434  35787    113     17      4      8      309943
8  10508  34314  34197   7100     10     10         NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts I don't see it.
I would appreciate any help.
You could use masked arrays, after dropping the rows where the Value column has NaN:
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
              np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
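For comparison, a rough pure-pandas sketch of the same computation without masked arrays (NaN products are skipped by sum(), and each column's denominator only counts the weights of rows where that column is present):
dff = df.dropna(subset=['Value'])
dff.drop('Value', axis=1).apply(
    lambda col: (col * dff['Value']).sum() / dff.loc[col.notnull(), 'Value'].sum())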

sort_values() with key in Python

I have a dataframe where the column names are times (0:00, 0:10, 0:20, ..., 23:50). Right now, they're sorted in a string order (so 0:00 is first and 9:50 is last) but I want to sort them after time (so 0:00 is first and 23:50 is last).
If time is a column, you can use
df = df.sort(columns='Time',key=float)
But 1) that only works if time is a column itself, rather than the column names, and 2) sort() is deprecated so I try to abstain from using it.
I'm trying to use
df = df.sort_index(axis = 1)
but since the column names are in string format, they get sorted according to a string key. I've tried
df = df.sort_index(key=float, axis=1)
but that gives an error message:
Traceback (most recent call last):
File "<ipython-input-112-5663f277da66>", line 1, in <module>
df.sort_index(key=float, axis=1)
TypeError: sort_index() got an unexpected keyword argument 'key'
Does anyone have ideas for how to fix this? So annoying that sort_index() - and sort_values() for that matter - don't have the key argument!!
Try sorting the columns with the sorted builtin function and passing the output to the dataframe for indexing. The following should serve as a working example:
import pandas as pd
records = [(2, 33, 23, 45), (3, 4, 2, 4), (4, 5, 7, 19), (4, 6, 71, 2)]
df = pd.DataFrame.from_records(records, columns = ('0:00', '23:40', '12:30', '11:23'))
df
# 0:00 23:40 12:30 11:23
# 0 2 33 23 45
# 1 3 4 2 4
# 2 4 5 7 19
# 3 4 6 71 2
df[sorted(df,key=pd.to_datetime)]
# 0:00 11:23 12:30 23:40
# 0 2 45 23 33
# 1 3 4 2 4
# 2 4 19 7 5
# 3 4 2 71 6
I hope this helps
Just prepend a leading zero to one-digit hours, e.g. 5:30 -> 05:30. This should be the simplest solution, as you can then simply sort lexically.
Here is a working demo, which implements #MartinKrämer's idea:
import re
In [259]: df
Out[259]:
23:40 0:00 19:19 12:30 09:00 11:23
0 33 2 1 23 12 45
1 4 3 1 2 13 4
2 5 4 1 7 14 19
3 6 4 1 71 14 2
In [260]: df.rename(columns=lambda x: re.sub(r'^(\d{1})\:', r'0\1:', x)).sort_index(axis=1)
Out[260]:
00:00 09:00 11:23 12:30 19:19 23:40
0 2 12 45 23 1 33
1 3 13 4 2 1 4
2 4 14 19 7 1 5
3 4 14 2 71 1 6
I know this question is a few years old, but since it's the top Google result for this question, I wanted to provide the root cause of the error.
The 'key' argument was added to sort_values in version 1.1.0. See the note in the documentation linked below.
pandas.DataFrame.sort_values
This feature will very likely work as you intended if you upgrade to 1.1.0 or higher.
sort_values() with key does not apply here, since the times are the column labels rather than values. However, sort_index() with key (also added in 1.1.0) can do it, reusing Abdou's idea of converting the labels with pd.to_datetime.
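A minimal sketch of that (assumes pandas 1.1.0+; the key function receives the whole column Index and must return something of the same shape):
df.sort_index(axis=1, key=lambda idx: pd.to_datetime(idx, format='%H:%M'))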

align timeseries in pandas

I have 2 time series.
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df.index=pd.to_datetime(df['Time'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
df1.index=pd.to_datetime(df1['Time'])
I am trying to align the time series so the index is in order. I am guessing reindex_like is what I need but not sure how to use it.
Here is my desired output
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Here is what you need:
df.append(df1).sort().reset_index(drop=True)
If you need to compile more pieces together, it is more efficient to use pd.concat(<names of all your dataframes as a list>).
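For example, a quick sketch with a hypothetical third frame df2:
pd.concat([df, df1, df2]).sort_values('Time').reset_index(drop=True)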
P.S. Your code is a bit redundant: you don't need to cast Time into the index if you don't need it there. You can sort values based on any column, like this:
import pandas as pd
df=pd.DataFrame([
['1/10/12',10],
['1/11/12',11],
['1/12/12',13],
['1/14/12',12],
],
columns=['Time','n'])
df1=pd.DataFrame([
['1/13/12',88],
],columns=['Time','n']
)
df.append(df1).sort_values('Time')
You can use concat, sort_index and reset_index:
df = pd.concat([df,df1]).sort_index().reset_index(drop=True)
print df
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
Or you can use ordered_merge:
print pd.ordered_merge(df, df1)
Time n
0 1/10/12 10
1 1/11/12 11
2 1/12/12 13
3 1/13/12 88
4 1/14/12 12
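Note that in newer pandas versions ordered_merge was renamed; the equivalent call should be:
pd.merge_ordered(df, df1)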

python pandas text block to data frame mixed types

I am a python and pandas newbie. I have a text block that has data arranged in columns. The data in the first six columns are integers and the rest are floating point. I tried to create two DataFrames that I could then concatenate:
sect1 = DataFrame(dtype=int)
sect2 = DataFrame(dtype=float)
i = 0
# The first 26 lines are header text
for line in txt[26:]:
    colmns = line.split()
    sect1[i] = colmns[:6]  # Columns with integers
    sect2[i] = colmns[6:]  # Columns with floating point
    i += 1
This causes an AssertionError: Length of values does not match length of index
Here are two lines of data
2013 11 15 0000 56611 0 1.36e+01 3.52e-01 7.89e-02 4.33e-02 3.42e-02 1.76e-02 2.89e+04 5.72e+02 -1.00e+05
2013 11 15 0005 56611 300 1.08e+01 5.50e-01 2.35e-01 4.27e-02 3.35e-02 1.70e-02 3.00e+04 5.50e+02 -1.00e+05
Thanks in advance for the help.
You can use the pandas CSV parser along with StringIO; there is an example in the pandas documentation.
For your sample that will be:
>>> import pandas as pd
>>> from StringIO import StringIO
>>> data = """2013 11 15 0000 56611 0 1.36e+01 3.52e-01 7.89e-02 4.33e-02 3.42e-02 1.76e-02 2.89e+04 5.72e+02 -1.00e+05
... 2013 11 15 0005 56611 300 1.08e+01 5.50e-01 2.35e-01 4.27e-02 3.35e-02 1.70e-02 3.00e+04 5.50e+02 -1.00e+05"""
Load data
>>> df = pd.read_csv(StringIO(data), sep=r'\s+', header=None)
Convert the first three columns to a datetime (optional)
>>> df[0] = df.iloc[:,:3].apply(lambda x:'{}.{}.{}'.format(*x), axis=1).apply(pd.to_datetime)
>>> del df[1]
>>> del df[2]
>>> df
                    0  3      4    5     6      7       8       9      10  \
0 2013-11-15 00:00:00  0  56611    0  13.6  0.352  0.0789  0.0433  0.0342
1 2013-11-15 00:00:00  5  56611  300  10.8  0.550  0.2350  0.0427  0.0335

       11     12   13      14
0  0.0176  28900  572 -100000
1  0.0170  30000  550 -100000
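Alternatively, the date parts can be combined at read time. A sketch using read_csv's parse_dates with a column mapping, which should merge columns 0-2 into one parsed column named 'date':
>>> df = pd.read_csv(StringIO(data), sep=r'\s+', header=None, parse_dates={'date': [0, 1, 2]})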
