Stripping DataFrame column from text to make integer - python

I couldn't find an easy way to do this and none of the complex ways worked. Can you help?
I have a dataframe resulting from a web-scrape. In there I have a data['Milage'] column that has the following result: '80,000 miles'. Obviously that's a string, so I'm looking for a way to erase all content that isn't numeric and convert that string to straight numbers
'80,000 miles' -> '80000'
I tried the following:
data['Milage'] = data['Milage'].str[1:].astype(int)
No idea what the code above does, I took it from another post from here. But I get the following error message:
File "autotrader.py", line 73, in <module>
data['Milage'] = data['Milage'].str[1:].astype(int)
AttributeError: 'str' object has no attribute 'str'
The other solution I tried was this:
data['Milage'] = str(data['Milage']).extract('(\d+)').astype(int)
And the resulting error is as follows:
File "autotrader.py", line 73, in <module>
data['Milage'] = str(data['Milage']).extract('(\d+)').astype(int)
AttributeError: 'str' object has no attribute 'extract'
I would appreciate any help! Thank you

After some testing, the problem turned out to be that data is a dictionary; you need to process df, the DataFrame, instead.
I think you need to remove the non-numeric characters and convert to integers:
df['Milage'] = df['Milage'].str.replace(r'\D', '', regex=True).astype(int)
print(df['Milage'])
0 70000
1 69186
2 46820
3 54000
4 83600
5 139000
6 62000
7 51910
8 86000
9 38000
10 65000
11 119000
12 49500
13 60000
14 35000
15 57187
16 45050
17 80000
18 84330
19 85853
Name: Milage, dtype: int32
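As a minimal sketch of the fix above, assuming the scraped rows are first collected into a list of dictionaries (the `records` name and the sample values are made up for illustration), the column can be cleaned in one step once it lives in a real DataFrame:

```python
import pandas as pd

# Hypothetical scraped rows; the real scraper output may look different.
records = [
    {"Milage": "80,000 miles"},
    {"Milage": "69,186 miles"},
    {"Milage": "46,820 miles"},
]

# Build a DataFrame so that the .str accessor is available on the column.
df = pd.DataFrame(records)

# \D matches every non-digit character; regex=True makes pandas treat the
# pattern as a regular expression instead of a literal string.
df["Milage"] = df["Milage"].str.replace(r"\D", "", regex=True).astype(int)

print(df["Milage"])
```

The pattern removes both the thousands separator and the " miles" suffix before the cast. A value with no digits at all would become an empty string and make astype(int) fail, so such rows would need handling first.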

Related

Pandas apply polyfit to a series against a value of the series

I'm new to the Pandas world and it has been hard to stop thinking sequentially.
I have a Series like:
df['sensor'].head(30)
0 6.8855
1 6.8855
2 6.8875
3 6.8885
4 6.8885
5 6.8895
6 6.8895
7 6.8895
8 6.8905
9 6.8905
10 6.8915
11 6.8925
12 6.8925
13 6.8925
14 6.8925
15 6.8925
16 6.8925
17 6.8925
Name: int_price, dtype: float64
I want to calculate the polynomial fit of the first value against each of the other values and then find an average. I defined a function to do the calculation and I want to apply it to the series.
The function:
def linear_trend(a,b):
    return np.polyfit([1,2],[a,b],1)
The application:
a = pd.Series(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index)))
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
This returns TypeError: No loop matching the specified signature and casting was found for ufunc lstsq_m.
or this:
a = df_plot['sensor'].iloc[0]
df['ref'] = df_plot['sensor'].apply(lambda df_plot: linear_trend(a,df['sensor']))
That returns ValueError: setting an array element with a sequence.
How can I solve this?
I was able to work around my issue by doing the following:
a = pd.Series(data=(df_plot['sensor'].iloc[0] for x in range(len(df_plot.index))), name='sensor_ref')
df_poly = pd.concat([a,df_plot['sensor']],axis=1)
df_plot['slope'] = df_poly[['sensor_ref','sensor']].apply(lambda df_poly: linear_trend(df_poly['sensor_ref'],df_poly['sensor']), axis=1)
If you have a better method, it's welcome.
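One possible simplification, offered as a sketch rather than a definitive answer: a degree-1 fit through the two points (1, a) and (2, b) is exact, so its slope is b - a and its intercept is 2a - b. Under that assumption, the row-wise apply can be replaced by plain column arithmetic. The sample values and the `intercept` column name are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df_plot['sensor'].
df_plot = pd.DataFrame({"sensor": [6.8855, 6.8855, 6.8875, 6.8885, 6.8895]})

# Reference value: the first reading in the series.
ref = df_plot["sensor"].iloc[0]

# Closed form of np.polyfit([1, 2], [ref, value], 1) for every row at once.
df_plot["slope"] = df_plot["sensor"] - ref
df_plot["intercept"] = 2 * ref - df_plot["sensor"]

# Cross-check against np.polyfit, one row at a time (slow, verification only).
check = np.array([np.polyfit([1, 2], [ref, v], 1) for v in df_plot["sensor"]])

print(df_plot)
```

Keeping slope and intercept in separate numeric columns also makes it straightforward to average them afterwards, for example with df_plot['slope'].mean().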

ValueError: could not convert string to float - without positional indication

For a current project, I am planning to run a scikit-learn Stochastic Gradient Boosting algorithm over a CSV data set that includes numerical data.
When calling line sgbr.fit(X_train, y_train) of the script, I am however receiving a ValueError: could not convert string to float: with no further details given on the respective area that cannot be formatted.
I assume that this error is not related to the Python code itself but rather the CSV input. I have however already checked the CSV file to confirm all sections exclusively include floats:
Does anyone have an idea why the ValueError is appearing without further positional indication?
I think there is no direct function that reports the position of the offending value.
You can try converting the column yourself; values that cannot be parsed become NaN:
print (df)
column
0 01
1 02
2 03
3 04
4 05
5 LS
print (pd.to_numeric(df['column'], errors='coerce'))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
Name: column, dtype: float64
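Building on the conversion above, a short sketch of how the offending rows could be located: compare the coerced result with the original column and keep the rows that turned into NaN only because of the conversion. The sample frame mirrors the one shown above; the names `converted` and `bad_rows` are illustrative.

```python
import pandas as pd

# Same toy data as in the answer: five numeric strings and one bad value.
df = pd.DataFrame({"column": ["01", "02", "03", "04", "05", "LS"]})

# Unparseable entries become NaN instead of raising ValueError.
converted = pd.to_numeric(df["column"], errors="coerce")

# Rows that are NaN after conversion but were not missing beforehand.
bad = converted.isna() & df["column"].notna()
bad_rows = df[bad]

print(bad_rows)
```

The index of bad_rows gives the row positions that the ValueError message does not provide.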

Precision lost while using read_csv in pandas

I have files of the below format in a text file which I am trying to read into a pandas dataframe.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I try to read it into a dataframe, I am not getting the last 4 digits
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the complete precision as present in the input file? I have some matrix operations that need to be performed, so I cannot cast it as string.
I figured out that I have to do something about dtype but I am not sure where I should use it.
It is only a display problem, see the docs:
# temporarily set display precision
with pd.option_context('display.precision', 10):
    print(df)
0 1 2 3 4 5 6 7 \
0 895 2015-4-23 19 10000 LA 0.4677978806 0.477346934 0.4089938425
8 9 10 11 12
0 0.8224291972 0.8652525793 0.682994286 0.5139162227 NaN
EDIT: (Thank you Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
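A small sketch combining both suggestions, assuming the sample line from the question is read from an in-memory buffer instead of mockup.txt: float_precision='round_trip' controls how the text is parsed, while display.precision only affects how the values are printed.

```python
import io

import pandas as pd

# The sample row from the question, held in memory for illustration.
line = (
    "895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|"
    "0.8224291972|0.8652525793|0.6829942860|0.5139162227|\n"
)

# round_trip uses the slower but exact string-to-float conversion.
df = pd.read_csv(
    io.StringIO(line),
    header=None,
    delimiter="|",
    float_precision="round_trip",
)

# Show ten decimal places only inside this block.
with pd.option_context("display.precision", 10):
    print(df[5])
```

The trailing "|" on each line produces an extra, empty last column, which is why the output in the answer ends with NaN.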

Pandas error "Can only use .str accessor with string values"

I have the following input file:
"Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,
And I am reading it in with:
#!/usr/bin/env python
import pandas as pd
import sys
import numpy as np
filename = sys.argv[1]
df = pd.read_csv(filename,header=None)
for col in df.columns[2:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
print df
However, I get the error
df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(np.float)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2241, in __getattr__
return object.__getattribute__(self, name)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 188, in __get__
return self.construct_accessor(instance)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 528, in _make_str_accessor
raise AttributeError("Can only use .str accessor with string "
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
This worked OK in pandas 0.14 but does not work in pandas 0.17.0.
It's happening because your last column is empty, so it is converted to NaN:
In [417]:
t="""'Name',97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,"""
df = pd.read_csv(io.StringIO(t), header=None)
df
Out[417]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 'Name' 97.7 0A 0A 65M 0A 100M 5M 75M 100M 90M 90M 99M 90M 0#
15 16
0 0N# NaN
If you slice the column range so that it excludes the last column, then it works:
In [421]:
for col in df.columns[2:-1]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(float)
df
Out[421]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
Alternatively you can just select the cols that are object dtype and run the code (skipping the first col as this is the 'Name' entry):
In [428]:
for col in df.select_dtypes([object]).columns[1:]:
    df[col] = df[col].str.extract(r'(\d+\.*\d*)').astype(float)
df
Out[428]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 'Name' 97.7 0 0 65 0 100 5 75 100 90 90 99 90 0 0 NaN
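For more recent pandas releases, the same idea might be sketched as follows. This is an assumption-laden variant, not the answer's original code: it skips every column that is already numeric (which includes the all-NaN last column), and it passes expand=False so that str.extract returns a Series. The `text_cols` name is illustrative.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# The input line from the question, read from memory for illustration.
t = '"Name",97.7,0A,0A,65M,0A,100M,5M,75M,100M,90M,90M,99M,90M,0#,0N#,\n'
df = pd.read_csv(io.StringIO(t), header=None)

# Skip column 0 (the name) and any column that is already numeric,
# such as 97.7 and the empty trailing column that was read as NaN.
text_cols = [c for c in df.columns[1:] if not is_numeric_dtype(df[c])]

for col in text_cols:
    # expand=False returns a Series for a single capture group.
    df[col] = df[col].str.extract(r"(\d+\.*\d*)", expand=False).astype(float)

print(df)
```

Checking the dtype up front avoids calling .str on a column that holds no strings, which is what triggers the AttributeError in the question.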
I got this error while working in Eclipse. It turned out that the project interpreter was somehow (after an update I believe) reset to Python 2.7. Setting it back to Python 3.6 resolved this issue. It all resulted in several crashes, restarts and warnings. After several minutes of troubles it seems fixed now.
While I know this is not a solution to the problem posed here, I thought it might be useful for others, as I came to this page after searching for this error.
In this case we have to use the str.replace() method on that series, but first we have to convert it to str type:
# df1.Patient holds values such as 's125', 's45', 's588', 's244', 's125', 's123'
df1 = pd.read_csv("C:\\Users\\Gangwar\\Desktop\\competitions\\cancer prediction\\kaggle_to_students.csv")
df1.Patient = df1.Patient.astype(str)
df1['Patient'] = df1['Patient'].str.replace('s','').astype(int)

Inconsistent behavior of ix selection with duplicate indices

Consider the pandas data frame
df = DataFrame({'somedata': [13,24,54]}, index=[1,1,2])
somedata
1 13
1 24
2 54
Executing
df.ix[1, 'somedata']
will return an object
1 13
1 24
Name: somedata, dtype: int64
which has an index:
df.ix[1, 'somedata'].index
Int64Index([1, 1], dtype='int64')
However, executing
df.ix[2, 'somedata']
will return just the number 54, which has no index:
df.ix[2, 'somedata'].index
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-274-3c6e4b1e6441> in <module>()
----> 1 df.ix[2, 'somedata'].index
AttributeError: 'numpy.int64' object has no attribute 'index'
I do not understand this (seemingly?) inconsistent behavior. Is it on purpose? I would expect objects that are returned from the same operation to have the same structure. Further, I need to build my code around this issue, so my question is, how do I detect what kind of object is returned by an ix selection? Currently I am checking for the returned object's len. I wonder if there is a more elegant way, or if one can force the selection, instead of returning just the number 54, to return the similar form
2 54
Name: somedata, dtype: int64
Sorry if this is a stupid question, I could not find an answer to this anywhere.
If you pass a list of indices instead of a single index, you can guarantee that you're going to get a Series back. In other words, instead of
>>> df.loc[1, 'somedata']
1 13
1 24
Name: somedata, dtype: int64
>>> df.loc[2, 'somedata']
54
you could use
>>> df.loc[[1], 'somedata']
1 13
1 24
Name: somedata, dtype: int64
>>> df.loc[[2], 'somedata']
2 54
Name: somedata, dtype: int64
(Note that it's usually a better idea to use loc (or iloc) rather than ix because it's less magical, although that isn't what's causing your issue here.)
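A runnable sketch of the point above, using the frame from the question: selecting with a list label returns a Series whether or not the index value is duplicated, while a bare label returns a Series only for duplicated index values. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"somedata": [13, 24, 54]}, index=[1, 1, 2])

# List selection: the result is a Series in both cases.
dup = df.loc[[1], "somedata"]
single = df.loc[[2], "somedata"]

# Bare label on a unique index value: the result is a plain scalar.
scalar = df.loc[2, "somedata"]

print(type(dup).__name__, type(single).__name__, type(scalar).__name__)
```

If downstream code needs a uniform type, the list form removes the need to check the length or type of the result.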
