Understanding math errors in pandas DataFrames - Python

I'm trying to generate a new column in a pandas dataframe from other columns and am getting some math errors that I don't understand. Here is a snapshot of the problem and some simplifying diagnostics...
I can generate a data frame that looks pretty good:
import pandas
import math as m
data = {'loc':['1','2','3','4','5'],
'lat':[61.3850,32.7990,34.9513,14.2417,33.7712],
'lng':[-152.2683,-86.8073,-92.3809,-170.7197,-111.3877]}
frame = pandas.DataFrame(data)
frame
Out[15]:
lat lng loc
0 61.3850 -152.2683 1
1 32.7990 -86.8073 2
2 34.9513 -92.3809 3
3 14.2417 -170.7197 4
4 33.7712 -111.3877 5
5 rows × 3 columns
I can do simple math (i.e. degrees to radians):
In [32]:
m.pi*frame.lat/180.
Out[32]:
0 1.071370
1 0.572451
2 0.610015
3 0.248565
4 0.589419
Name: lat, dtype: float64
But I can't convert from degrees to radians using the python math library:
In [33]:
m.radians(frame.lat)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-99a986252f80> in <module>()
----> 1 m.radians(frame.lat)
/Users/user/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
72 return converter(self.iloc[0])
73 raise TypeError(
---> 74 "cannot convert the series to {0}".format(str(converter)))
75 return wrapper
76
TypeError: cannot convert the series to <type 'float'>
And I can't even convert the values to floats to try to force it to work:
In [34]:
float(frame.lat)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-3311aee92f31> in <module>()
----> 1 float(frame.lat)
/Users/user/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
72 return converter(self.iloc[0])
73 raise TypeError(
---> 74 "cannot convert the series to {0}".format(str(converter)))
75 return wrapper
76
TypeError: cannot convert the series to <type 'float'>
I'm sure there must be a simple explanation and would appreciate your help in finding it. Thanks!

math functions such as math.radians expect a numeric value such as a float, not a sequence such as a pandas.Series.
Instead, you could use numpy.radians, since numpy.radians can accept an array as input:
In [95]: np.radians(frame['lat'])
Out[95]:
0 1.071370
1 0.572451
2 0.610015
3 0.248565
4 0.589419
Name: lat, dtype: float64
Only Series of length 1 can be converted to a float. So while this works,
In [103]: math.radians(pd.Series([1]))
Out[103]: 0.017453292519943295
in general it does not:
In [104]: math.radians(pd.Series([1,2]))
TypeError: cannot convert the series to <type 'float'>
math.radians is calling float on its argument. Note that you get the same error calling float on pd.Series([1,2]):
In [105]: float(pd.Series([1,2]))
TypeError: cannot convert the series to <type 'float'>
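To tie this back to the original goal of building a new column, here is a minimal sketch. The shortened frame and the column name lat_rad are made up for illustration; np.radians and Series.apply are the only library calls involved.

```python
import math

import numpy as np
import pandas as pd

# Shortened version of the frame from the question.
frame = pd.DataFrame({
    "loc": ["1", "2", "3"],
    "lat": [61.3850, 32.7990, 34.9513],
})

# NumPy ufuncs work element-wise and return a Series with the same index,
# so the result can be assigned straight to a new column.
frame["lat_rad"] = np.radians(frame["lat"])

# Slower fallback: call the scalar function once per value.
lat_rad_apply = frame["lat"].apply(math.radians)

print(frame)
print(np.allclose(frame["lat_rad"], lat_rad_apply))
```

The vectorized form is usually preferable; apply is mainly useful when no array-aware equivalent of the function exists.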

I had a similar issue but was using a custom function. The solution was to use the apply function:
def monthdiff(x):
    z = (int(x/100) * 12) + (x - int(x/100) * 100)
    return z
series['age'].apply(monthdiff)
Now, I have a new column with my simple (yet beautiful) calculation applied to every line in the data frame!
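A runnable sketch of the same idea, assuming the values encode years and months as YYMM integers (that encoding and the sample data are guesses for illustration, not something stated above). Written with // and %, the same function works both through apply and directly on the whole Series.

```python
import pandas as pd


def monthdiff(x):
    # Assumption: x is a non-negative integer in YYMM form,
    # e.g. 1203 means 12 years and 3 months.
    return (x // 100) * 12 + x % 100


ages = pd.Series([1203, 506, 11], name="age")

# Element by element, as in the answer above.
via_apply = ages.apply(monthdiff)

# Integer division and modulo are also defined on Series,
# so the function can be called on the whole column at once.
vectorized = monthdiff(ages)

print(via_apply.tolist())
print(vectorized.tolist())
```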

Try pd.to_numeric() on the column to convert its values to a numeric dtype.
When I got the same error, this is what worked for me.
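For reference, a small sketch of what pd.to_numeric does with a column of strings. The sample values are invented; errors="coerce" is an optional setting that turns unparseable entries into NaN instead of raising.

```python
import pandas as pd

# Invented sample: numbers stored as strings, plus one bad entry.
raw = pd.Series(["1", "2.5", "oops"])

# Convert to a numeric dtype; values that cannot be parsed become NaN.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.dtype)
print(numeric.isna().tolist())
```

Note that this changes the dtype of the column; it does not make scalar functions such as math.radians accept a whole Series.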

Related

Create column with data and float data types

I work with a dataframe named emails_visits:
pandas is imported
Rep Doctor Date type
0 1 1 2021-01-25 email
1 1 1 2021-05-29 email
2 1 2 2021-03-15 email
3 1 2 2021-04-02 email
4 1 2 2021-04-29 email
30 1 2 2021-06-01 visit
5 1 3 2021-01-01 email
I want to create a column "date_after" based on the value in column "type": if it equals "visit", the new column should hold the date from column "Date"; otherwise it should be empty.
I use this code:
emails_visits["date_after"]=np.where(emails_visits["type"]=="visit",emails_visits["Date"],np.nan)
However, it raises an error:
emails_visits["date_after"]=np.where(emails_visits["type"]=="visit",emails_visits["Date"],np.nan)
File "<__array_function__ internals>", line 5, in where
TypeError: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[float64]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[datetime64]'>, <class 'numpy.dtype[float64]'>)
How can I fix this?
You can do it like this if you want.
emails_visits['date_after'] = emails_visits.apply(lambda x: x['Date'] if x['type'] == 'visit' else '', axis=1)
The datetime64 dtype of the Date column of emails_visits is incompatible with np.nan, which is a np.float64 value. np.nan means that the value is not a number and only applies to floating-point data, whereas pandas has its own missing-value markers (pd.NA in general, pd.NaT for datetimes). In fact, it is better to use pandas functions here rather than np.where. Here is a simple solution, which leaves a missing value wherever the condition is false:
emails_visits["date_after"] = emails_visits["Date"].where(emails_visits["type"]=="visit")

sklearn/PCA - Error while trying to transform high-dimensional data

I encountered an error while trying to reduce my high-dimensional vectors to 2 dimensions using PCA.
This is my input data, each row has 300 dimensions:
vector
0 [0.01053525, -0.007869658, 0.0024931028, -0.04...
1 [-0.024436072, -0.016484523, 0.03859031, 0.000...
2 [0.015011676, -0.020465894, 0.004854744, -0.00...
3 [-0.010836455, -0.006562917, 0.00265073, 0.022...
4 [-0.018123362, -0.026007563, 0.04781856, -0.03...
... ...
45124 [-0.016111804, -0.041917775, 0.010192914, -0.0...
45125 [0.0311568, -0.013044083, 0.030656694, -0.0126...
45126 [-0.021875003, -0.005635035, 0.0076896898, -0....
45127 [-0.0062000924, -0.041035958, 0.0077403532, 0....
45128 [0.007794927, 0.0019561667, 0.15995999, -0.054...
[45129 rows x 1 columns]
My Code:
data = pd.read_parquet('1.parquet', engine='fastparquet')
reduced = pca.fit_transform(data)
Error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.
Edit
>>data.shape
(45129, 1)
>>data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries, 0 to 45128
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vector 45129 non-null object
dtypes: object(1)
memory usage: 352.7+ KB
Scikit-learn doesn't know how to handle a column that contains an array (list), so you'll need to expand the column. Since each row has an array of the same size, you can do this fairly easily with only 45,000 rows. Once you expand your data, you should be fine.
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame({"a": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})
expanded_df = pd.DataFrame(df.a.tolist())
expanded_df
0 1 2
0 0.01 0.02 0.03
1 0.04 0.40 0.10
pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded_df)
reduced
array([[ 1.93778224e-01, 1.43048962e-17],
[-1.93778224e-01, 1.43048962e-17]])
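Before handing the expanded data to PCA, it may help to confirm that it really is a 2-D numeric array. The sketch below uses a made-up three-row frame with a column named vector, mirroring the question; it only uses NumPy and pandas, and the resulting matrix could then be passed to fit_transform.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the parquet data: one object column holding lists.
data = pd.DataFrame({
    "vector": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1], [0.2, 0.1, 0.0]],
})

# Stack the per-row lists into one (n_rows, n_features) float array.
# This assumes every list has the same length.
matrix = np.array(data["vector"].tolist(), dtype=float)

print(matrix.shape)
print(matrix.dtype)
```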

Why am I receiving an error when I try to sum multiple columns in pandas dataframe?

df=pd.read_excel('Canada.xlsx',sheet_name='Canada by Citizenship',skiprows=range(20),skipfooter=2)
df['Total']=df.iloc[:,'1980':'2013'].sum(axis=1)
Here is the error I received:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1980] of <class 'str'>
I received the dataset from this link
The columns are integers. You can slice with:
df.loc[:, range(1980, 2014)].sum(1)
# or
df.iloc[:, df.columns.get_loc(1980):df.columns.get_loc(2013)+1].sum(1)
0 58639
1 15699
2 69439
3 6
4 15
...
190 97146
191 2
192 2985
193 1677
194 8598
Length: 195, dtype: int64
The iloc indexer accepts only integer positions, not labels. You could use the loc indexer to do it in one line.
Here is what it could look like:
df.loc[:, [1980 + i for i in range(34)] ].sum(axis=1)

Seaborn plots kdeplot but not distplot

I want to plot data from a pandas dataframe column formed from couchdb. This is the code and the output for the data:
print df4.Patient_Age
Doc_ID
000103f8-7f48-4afd-b532-8e6c1028d965 99
00021ec5-9945-47f7-bfda-59cf8918f10b 92
0002510f-fb89-11e3-a6eb-742f68319ca7 32
00025550-9a97-44a4-84d9-1f6f7741f973 73
0002d1b8-b576-4db7-af55-b3f26f7ca63d 49
0002d40f-2b45-11e3-8f66-742f68319ca7 42
000307eb-18a6-47cd-bb03-33e484fad029 18
00033d3d-1345-4739-9522-b41b8db3ee23 42
00036d2e-0a51-4cfb-93d1-3e137a026f19 42
0003b054-5f3b-4553-8104-f71d7a940d84 10
Name: Patient_Age, dtype: object
If I execute this code:
sns.kdeplot(df4.Patient_Age)
the plot is generated as expected. However, when I run this:
sns.distplot(df4.Patient_Age)
I get the following error with distplot:
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
To correct the error, I use:
df4.Patient_Age = [int(i) for i in df4.Patient_Age]
all(isinstance(item,int) for item in df4.Patient_Age)
The output is:
False
What I would like to understand are:
Why was the kdeplot being generated earlier but not the distplot?
When I changed the datatype to int, why do I still get False? And if the data is not int (as indicated by False), why does the distplot work after the transformation?
The problem is that your values are not numeric. If you force them to integers or floats, it will work.
from io import StringIO
import pandas
import seaborn
seaborn.set(style='ticks')
data = StringIO("""\
Doc_ID Age
000103f8-7f48-4afd-b532-8e6c1028d965 99
00021ec5-9945-47f7-bfda-59cf8918f10b 92
0002510f-fb89-11e3-a6eb-742f68319ca7 32
00025550-9a97-44a4-84d9-1f6f7741f973 73
0002d1b8-b576-4db7-af55-b3f26f7ca63d 49
0002d40f-2b45-11e3-8f66-742f68319ca7 42
000307eb-18a6-47cd-bb03-33e484fad029 18
00033d3d-1345-4739-9522-b41b8db3ee23 42
00036d2e-0a51-4cfb-93d1-3e137a026f19 42
0003b054-5f3b-4553-8104-f71d7a940d84 10
""")
df = pandas.read_table(data, sep=r'\s+')
df['Age'] = df['Age'].astype(float)
df.info()
# prints
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
Doc_ID 10 non-null object
Age 10 non-null float64
dtypes: float64(1), object(1)
memory usage: 240.0+ bytes
So then:
seaborn.distplot(df['Age'])
Gives me the expected plot.
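When checking whether a column is ready for plotting, inspecting the dtype tends to be more reliable than testing individual elements with isinstance, because what isinstance reports depends on the Python and pandas versions in use. A minimal sketch, with invented sample values and a Series named after the column in the question:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Numbers stored as strings, as in the question's Patient_Age column.
ages = pd.Series(["99", "92", "32"], name="Patient_Age")
print(is_numeric_dtype(ages))      # the string column is not numeric

# Convert once, then check the dtype instead of each element.
ages_num = pd.to_numeric(ages)
print(ages_num.dtype)
print(is_numeric_dtype(ages_num))
```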

Inconsistent behavior of ix selection with duplicate indices

Consider the pandas data frame
df = DataFrame({'somedata': [13,24,54]}, index=[1,1,2])
somedata
1 13
1 24
2 54
Executing
df.ix[1, 'somedata']
will return an object
1 13
1 24
Name: somedata, dtype: int64
which has an index:
df.ix[1, 'somedata'].index
Int64Index([1, 1], dtype='int64')
However, executing
df.ix[2, 'somedata']
will return just the number 54, which has no index:
df.ix[2, 'somedata'].index
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-274-3c6e4b1e6441> in <module>()
----> 1 df.ix[2, 'somedata'].index
AttributeError: 'numpy.int64' object has no attribute 'index'
I do not understand this (seemingly?) inconsistent behavior. Is it on purpose? I would expect objects that are returned from the same operation to have the same structure. Further, I need to build my code around this issue, so my question is, how do I detect what kind of object is returned by an ix selection? Currently I am checking for the returned object's len. I wonder if there is a more elegant way, or if one can force the selection, instead of returning just the number 54, to return the similar form
2 54
Name: somedata, dtype: int64
Sorry if this is a stupid question, I could not find an answer to this anywhere.
If you pass a list of indices instead of a single index, you can guarantee that you're going to get a Series back. In other words, instead of
>>> df.loc[1, 'somedata']
1 13
1 24
Name: somedata, dtype: int64
>>> df.loc[2, 'somedata']
54
you could use
>>> df.loc[[1], 'somedata']
1 13
1 24
Name: somedata, dtype: int64
>>> df.loc[[2], 'somedata']
2 54
Name: somedata, dtype: int64
(Note that it's usually a better idea to use loc (or iloc) rather than ix because it's less magical, although that isn't what's causing your issue here.)
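The same behavior can be checked with loc alone, which matters because ix is no longer available in current pandas releases. A short sketch using the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"somedata": [13, 24, 54]}, index=[1, 1, 2])

# A single label that occurs once returns a bare scalar...
scalar = df.loc[2, "somedata"]

# ...while a list of labels always returns a Series, even for one match.
single = df.loc[[2], "somedata"]
duplicated = df.loc[[1], "somedata"]

print(type(scalar).__name__)
print(single)
print(duplicated)
```

Wrapping the label in a list removes the need to inspect the type or length of the result afterwards.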
