Inconsistent behavior of ix selection with duplicate indices

Consider the pandas data frame
df = DataFrame({'somedata': [13,24,54]}, index=[1,1,2])
somedata
1 13
1 24
2 54
Executing
df.ix[1, 'somedata']
will return an object
1 13
1 24
Name: somedata, dtype: int64
which has an index:
df.ix[1, 'somedata'].index
Int64Index([1, 1], dtype='int64')
However, executing
df.ix[2, 'somedata']
will return just the number 54, which has no index:
df.ix[2, 'somedata'].index
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-274-3c6e4b1e6441> in <module>()
----> 1 df.ix[2, 'somedata'].index
AttributeError: 'numpy.int64' object has no attribute 'index'
I do not understand this seemingly inconsistent behavior. Is it intentional? I would expect the same operation to return objects with the same structure. I also need to build my code around this, so my question is: how do I detect what kind of object an ix selection returns? Currently I check the len of the returned object. Is there a more elegant way, or can I force the selection to return a Series of the following form instead of the bare number 54?
2 54
Name: somedata, dtype: int64
Sorry if this is a stupid question, I could not find an answer to this anywhere.

If you pass a list of indices instead of a single index, you can guarantee that you're going to get a Series back. In other words, instead of
>>> df.loc[1, 'somedata']
1 13
1 24
Name: somedata, dtype: int64
>>> df.loc[2, 'somedata']
54
you could use
>>> df.loc[[1], 'somedata']
1 13
1 24
Name: somedata, dtype: int64
>>> df.loc[[2], 'somedata']
2 54
Name: somedata, dtype: int64
(Note that it's usually a better idea to use loc (or iloc) rather than ix because it's less magical, although that isn't what's causing your issue here. ix was later deprecated in pandas 0.20 and removed in pandas 1.0.)
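To detect which kind of object a selection returned, a type check is more direct than checking len. The following is a minimal sketch, assuming a pandas version where loc is used in place of ix; the helper name as_series is hypothetical and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({'somedata': [13, 24, 54]}, index=[1, 1, 2])

def as_series(frame, label, column):
    """Hypothetical helper: always return a Series for one index label."""
    # A list indexer keeps the row dimension, so the result is a Series
    # even when the label occurs only once.
    return frame.loc[[label], column]

scalar_or_series = df.loc[2, 'somedata']   # unique label: a NumPy scalar
dupes = df.loc[1, 'somedata']              # duplicated label: a Series

print(isinstance(scalar_or_series, pd.Series))
print(isinstance(dupes, pd.Series))
print(as_series(df, 2, 'somedata'))
```

Using the list form everywhere avoids the type check entirely, at the cost of always having to unpack a one-element Series when the label is unique.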

Related

Stripping DataFrame column from text to make integer

I couldn't find an easy way to do this and none of the complex ways worked. Can you help?
I have a dataframe resulting from a web-scrape. In there I have a data['Milage'] column that has the following result: '80,000 miles'. Obviously that's a string, so I'm looking for a way to erase all content that isn't numeric and convert that string to straight numbers:
'80,000 miles' -> '80000'
I tried the following:
data['Milage'] = data['Milage'].str[1:].astype(int)
No idea what the code above does, I took it from another post from here. But I get the following error message:
File "autotrader.py", line 73, in <module>
data['Milage'] = data['Milage'].str[1:].astype(int)
AttributeError: 'str' object has no attribute 'str'
The other solution I tried was this:
data['Milage'] = str(data['Milage']).extract('(\d+)').astype(int)
And the resulting error is as follows:
File "autotrader.py", line 73, in <module>
data['Milage'] = str(data['Milage']).extract('(\d+)').astype(int)
AttributeError: 'str' object has no attribute 'extract'
I would appreciate any help! Thank you

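After some testing, the problem turned out to be that data is a dictionary, not a DataFrame; you need to work on the DataFrame df instead.
I think you need to remove the non-numeric characters and convert to integers:
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['Milage'] = df['Milage'].str.replace(r'\D', '', regex=True).astype(int)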
print(df['Milage'])
0 70000
1 69186
2 46820
3 54000
4 83600
5 139000
6 62000
7 51910
8 86000
9 38000
10 65000
11 119000
12 49500
13 60000
14 35000
15 57187
16 45050
17 80000
18 84330
19 85853
Name: Milage, dtype: int32
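The same cleanup as a self-contained sketch. The sample rows are invented for illustration, and it assumes the scraped data starts as a dictionary of lists and that the installed pandas accepts the regex argument of str.replace (0.23 or later):

```python
import pandas as pd

# Made-up sample rows standing in for the scraped data.
data = {'Milage': ['80,000 miles', '69,186 miles', '46,820 miles']}

# Indexing the dict gives a plain Python object without a .str accessor,
# so build a DataFrame first.
df = pd.DataFrame(data)

# \D matches any non-digit character; removing them leaves only the digits.
df['Milage'] = df['Milage'].str.replace(r'\D', '', regex=True).astype(int)
print(df['Milage'].tolist())
```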

Converting exponential notation numbers to strings - explanation

I have a DataFrame from this question:
import io
import pandas as pd
temp=u"""Total,Price,test_num
0,71.7,2.04256e+14
1,39.5,2.04254e+14
2,82.2,2.04188e+14
3,42.9,2.04171e+14"""
df = pd.read_csv(io.StringIO(temp))  # pd.compat.StringIO was removed in later pandas versions
print (df)
Total Price test_num
0 0 71.7 2.042560e+14
1 1 39.5 2.042540e+14
2 2 82.2 2.041880e+14
3 3 42.9 2.041710e+14
If I convert the floats to strings, I get a trailing .0:
print (df['test_num'].astype('str'))
0 204256000000000.0
1 204254000000000.0
2 204188000000000.0
3 204171000000000.0
Name: test_num, dtype: object
The solution is to convert the floats to int64:
print (df['test_num'].astype('int64'))
0 204256000000000
1 204254000000000
2 204188000000000
3 204171000000000
Name: test_num, dtype: int64
print (df['test_num'].astype('int64').astype(str))
0 204256000000000
1 204254000000000
2 204188000000000
3 204171000000000
Name: test_num, dtype: object
The question is: why does it convert this way?
I added this poor explanation, but I feel it should be better:
Poor explanation:
You can check the dtype of the parsed column; it is float64.
print (df['test_num'].dtype)
float64
Converting to string removes the exponential notation but keeps the float representation, so a trailing .0 is added:
print (df['test_num'].astype('str'))
0 204256000000000.0
1 204254000000000.0
2 204188000000000.0
3 204171000000000.0
Name: test_num, dtype: object

When you use pd.read_csv to import data and do not define datatypes, pandas makes an educated guess and in this case decides that column values like "2.04256e+14" are best represented by a float value. Converting this back to a string adds a ".0". As you correctly write, converting to int64 fixes this.
If you know before reading that the column holds only int64 values (and no empty values, which np.int64 cannot handle), you can force this type on import to avoid the unneeded conversions.
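import io
import numpy as np
import pandas as pd
temp=u"""Total,Price,test_num
0,71.7,2.04256e+14
1,39.5,2.04254e+14
2,82.2,2.04188e+14
3,42.9,2.04171e+14"""
# io.StringIO replaces pd.compat.StringIO, which was removed in later pandas versions
df = pd.read_csv(io.StringIO(temp), dtype={2: np.int64})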
print(df)
returns
Total Price test_num
0 0 71.7 204256000000000
1 1 39.5 204254000000000
2 2 82.2 204188000000000
3 3 42.9 204171000000000
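The dtype behavior described above can also be seen without read_csv. This is a short sketch that relies only on Series.astype; the values are taken from the example:

```python
import pandas as pd

# Parsed values such as "2.04256e+14" end up as float64.
s = pd.Series([2.04256e+14, 2.04254e+14], name='test_num')
print(s.dtype)

# Converting the floats directly keeps the float representation.
print(s.astype(str).tolist())

# Going through int64 first drops the fractional part before formatting.
as_text = s.astype('int64').astype(str)
print(as_text.tolist())
```

This is only safe when every value is a whole number that fits in int64 and is small enough (below 2**53) to be stored exactly as a float; larger values may already have lost digits before the conversion.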

Unable to correctly use Pandas Interpolate over a series

I am trying to use the interpolation functionality provided by Pandas, but for some reason I cannot get my Series to adjust to the correct values. I cast them to float64, but that did not appear to help. Any recommendations?
The code:
for feature in price_data:
    print(price_data[feature])
    print("type:")
    print(type(price_data[feature]))
    newSeries = price_data[feature].astype(float).interpolate()
    print("newSeries: ")
    print(newSeries)
The output:
0 178.9000
1 0.0000
2 178.1200
Name: open_price, dtype: object
type:
<class 'pandas.core.series.Series'>
newSeries:
0 178.90
1 0.00
2 178.12
Name: open_price, dtype: float64

The problem is that there is nothing to interpolate: interpolate only fills missing (NaN) values, and 0.0 is a valid number. I'm assuming you want to interpolate the value where the zero is. In that case, replace the zero with np.nan, then interpolate. One way to do this is
price_data.where(price_data != 0, np.nan).interpolate()
0 178.90
1 178.51
2 178.12
Name: open_price, dtype: float64
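A self-contained sketch of the same idea on a single Series. It uses replace rather than where, which is a swapped-in alternative, and it assumes every zero in the column really means "missing":

```python
import numpy as np
import pandas as pd

# Strings, as in the question's object-dtype column.
prices = pd.Series(['178.9000', '0.0000', '178.1200'], name='open_price')

# Convert to float, mark zeros as missing, then fill linearly.
filled = prices.astype(float).replace(0, np.nan).interpolate()
print(filled)
```

With the default linear method, the gap is filled with the midpoint of its two neighbours.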

Rounding up a column

I am new to pandas and Python, and I am having difficulty rounding up all the values in a column. For example,
Example
88.9
88.1
90.2
45.1
I tried using my current code below, but it gave me:
AttributeError: 'str' object has no attribute 'rint'
df.Example = df.Example.round()

You can use numpy.ceil:
In [80]: import numpy as np
In [81]: np.ceil(df.Example)
Out[81]:
0 89.0
1 89.0
2 91.0
3 46.0
Name: Example, dtype: float64
Depending on what you like, you could also change the type:
In [82]: np.ceil(df.Example).astype(int)
Out[82]:
0 89
1 89
2 91
3 46
Name: Example, dtype: int64
Edit
Your error message indicates you're trying just to round (not necessarily up), but are having a type problem. You can solve it like so:
In [84]: df.Example.astype(float).round()
Out[84]:
0 89.0
1 88.0
2 90.0
3 45.0
Name: Example, dtype: float64
Here, too, you can cast at the end to an integer type:
In [85]: df.Example.astype(float).round().astype(int)
Out[85]:
0 89
1 88
2 90
3 45
Name: Example, dtype: int64

I don't have the privilege to comment. This is not a new answer but a comparison of two approaches; only one of them worked for me.
First I tried this https://datatofish.com/round-values-pandas-dataframe/
df['DataFrame column'].apply(np.ceil)
It did not work for me.
Then I tried the above answer
np.ceil(df.Example).astype(int)
It worked.
I hope this will help someone.

understanding math errors in pandas dataframes

I'm trying to generate a new column in a pandas dataframe from other columns and am getting some math errors that I don't understand. Here is a snapshot of the problem and some simplifying diagnostics...
I can generate a data frame that looks pretty good:
import pandas
import math as m
data = {'loc':['1','2','3','4','5'],
        'lat':[61.3850,32.7990,34.9513,14.2417,33.7712],
        'lng':[-152.2683,-86.8073,-92.3809,-170.7197,-111.3877]}
frame = pandas.DataFrame(data)
frame
Out[15]:
lat lng loc
0 61.3850 -152.2683 1
1 32.7990 -86.8073 2
2 34.9513 -92.3809 3
3 14.2417 -170.7197 4
4 33.7712 -111.3877 5
5 rows × 3 columns
I can do simple math (i.e. degrees to radians):
In [32]:
m.pi*frame.lat/180.
Out[32]:
0 1.071370
1 0.572451
2 0.610015
3 0.248565
4 0.589419
Name: lat, dtype: float64
But I can't convert from degrees to radians using the python math library:
In [33]:
m.radians(frame.lat)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-99a986252f80> in <module>()
----> 1 m.radians(frame.lat)
/Users/user/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
72 return converter(self.iloc[0])
73 raise TypeError(
---> 74 "cannot convert the series to {0}".format(str(converter)))
75 return wrapper
76
TypeError: cannot convert the series to <type 'float'>
And can't even convert the values to floats to try to force it to work:
In [34]:
float(frame.lat)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-3311aee92f31> in <module>()
----> 1 float(frame.lat)
/Users/user/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
72 return converter(self.iloc[0])
73 raise TypeError(
---> 74 "cannot convert the series to {0}".format(str(converter)))
75 return wrapper
76
TypeError: cannot convert the series to <type 'float'>
I'm sure there must be a simple explanation and would appreciate your help in finding it. Thanks!

math functions such as math.radians expect a numeric value such as a float, not a sequence such as a pandas.Series.
Instead, you could use numpy.radians, since numpy.radians can accept an array as input:
In [95]: np.radians(frame['lat'])
Out[95]:
0 1.071370
1 0.572451
2 0.610015
3 0.248565
4 0.589419
Name: lat, dtype: float64
Only a Series of length 1 can be converted to a float. So while this works,
In [103]: math.radians(pd.Series([1]))
Out[103]: 0.017453292519943295
in general it does not:
In [104]: math.radians(pd.Series([1,2]))
TypeError: cannot convert the series to <type 'float'>
math.radians is calling float on its argument. Note that you get the same error calling float on pd.Series([1,2]):
In [105]: float(pd.Series([1,2]))
TypeError: cannot convert the series to <type 'float'>
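The difference between the two approaches can be sketched as follows: np.radians works on the whole Series, while math.radians has to be called once per element (here through Series.map). The frame is a shortened copy of the one in the question:

```python
import math

import numpy as np
import pandas as pd

frame = pd.DataFrame({'lat': [61.3850, 32.7990, 34.9513]})

# Vectorized: NumPy applies the conversion to every element at once.
vectorized = np.radians(frame['lat'])

# Element-wise: math.radians receives one float per call.
elementwise = frame['lat'].map(math.radians)

print(np.allclose(vectorized, elementwise))
```

Both give the same numbers; the vectorized form is usually the better default for large columns.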

I had a similar issue but was using a custom function. The solution was to use the apply function:
def monthdiff(x):
    z = (int(x/100) * 12) + (x - int(x/100) * 100)
    return z
series['age'].apply(monthdiff)
Now, I have a new column with my simple (yet beautiful) calculation applied to every line in the data frame!
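For simple arithmetic like this, apply can often be replaced by operations on the whole Series. A sketch under the assumption that the values are non-negative integers whose last two digits are months and whose leading digits are years (that reading of the formula is a guess, and the sample values are made up):

```python
import pandas as pd

def monthdiff(x):
    # Same formula as above, applied to one value at a time.
    return (int(x / 100) * 12) + (x - int(x / 100) * 100)

# Made-up values for illustration.
ages = pd.Series([1203, 506, 11], name='age')

applied = ages.apply(monthdiff)
# Integer division and modulo on the whole Series give the same result
# for non-negative integers.
vectorized = (ages // 100) * 12 + ages % 100
print(applied.tolist())
print(vectorized.tolist())
```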

Try pd.to_numeric(). When I got the same error, this is what worked for me.
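A brief sketch of how pd.to_numeric might fit in, assuming the column holds numbers stored as strings (the sample values are made up). It converts the values to a numeric dtype; it does not by itself make math.radians accept a Series, so np.radians is still used afterwards:

```python
import numpy as np
import pandas as pd

# Made-up values; the last entry is deliberately not a number.
raw = pd.Series(['61.3850', '32.7990', 'n/a'], name='lat')

# errors='coerce' turns unparseable entries into NaN instead of raising.
lat = pd.to_numeric(raw, errors='coerce')
print(lat.dtype)
print(np.radians(lat))
```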
