I notice that many DataFrame methods, if used without parentheses, seem to behave like 'properties', e.g.
In [200]: df = DataFrame(np.random.randn(7, 2))
In [201]: df.head()
Out[201]:
0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
In [202]: df.head
Out[202]:
<bound method DataFrame.head of 0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
5 0.940440 -2.279781
6 1.152370 -2.733546>
How is this done, and is it good practice?
This is with pandas 0.15.1 on Linux.
They are different, and mixing them is not recommended: one clearly shows that it's a bound method (whose repr happens to display the results), whilst the other shows the expected output.
Here's why you should not do this:
In [23]:
t = df.head
In [24]:
t.iloc[0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-24-b523e5ce509d> in <module>()
----> 1 t.iloc[0]
AttributeError: 'function' object has no attribute 'iloc'
In [25]:
t = df.head()
t.iloc[0]
Out[25]:
0 0.712635
1 0.363903
Name: 0, dtype: float64
So, OK, if you don't use parentheses to call the method correctly, you still see an output that appears valid. But if you take a reference to this and try to use it, you are operating on the method rather than on the slice of the df, which is not what you intended.
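A quick way to see the difference is to check what you actually got back; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(7, 2))

t = df.head      # the bound method itself
u = df.head()    # the first five rows, as a DataFrame

print(callable(t), isinstance(t, pd.DataFrame))  # True False
print(callable(u), isinstance(u, pd.DataFrame))  # False True
```

The method's repr includes the DataFrame it is bound to, which is why the un-called form looks so similar in the console.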
I'm working on an ML project to predict answer times on Stack Overflow based on tags. Sample data:
Unnamed: 0 qid i qs qt tags qvc qac aid j as at
0 1 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563372 67183.0 2 1235000501
1 2 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563374 66554.0 0 1235000551
2 3 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563358 15842.0 3 1235000177
3 4 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563413 893.0 18 1235001545
4 5 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563454 11649.0 4 1235002457
I'm stuck at the data cleaning process. I intend to create a new column named 'time_taken' which stores the difference between the at and qt columns.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))
for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
File "<ipython-input-39-9384be9e5531>", line 1, in <module>
val = df.at[0]
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
return super().__getitem__(key)
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem here lies in the indexing of df.at.
The types of df.qt and df.at are
<class 'pandas.core.series.Series'> and
<class 'pandas.core.indexing._AtIndexer'> respectively.
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.
There is, to put it mildly, an easier way to do this.
df['time_taken'] = df['at'] - df.qt
The _AtIndexer issue comes up because .at is a pandas accessor for fast scalar lookup. For this reason, you want to avoid naming columns the same as any Python/pandas attribute. You can get around it just by indexing with df['at'] instead of df.at.
Besides that, this operation (if I'm understanding it correctly) can be done with one short line vs. a long for loop.
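A minimal sketch of both workarounds, using made-up qt/at values rather than the real dataset:

```python
import pandas as pd

# Toy frame whose 'at' column clashes with the pandas .at accessor
df = pd.DataFrame({'qt': [1235000081, 1235000140],
                   'at': [1235000501, 1235000177]})

# Bracket indexing sidesteps the _AtIndexer entirely
df['time_taken'] = df['at'] - df['qt']
print(df['time_taken'].tolist())  # [420, 37]

# Or rename the clashing column so attribute access is safe again
df = df.rename(columns={'at': 'answer_time'})
```

Renaming is the more durable fix: every later `df.at[...]` lookup in the script then means the indexer, never the column.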
I'm currently working with pandas and IPython. Since pandas DataFrames are copied when you perform operations on them, my memory usage increases by 500 MB with every cell. I believe it's because the data gets stored in the Out variable, since this doesn't happen with the default Python interpreter.
How do I disable the Out variable?
The first option you have is to avoid producing output. If you don't really need to see the intermediate results, just avoid them and put all the computations in a single cell.
If you need to actually display that data, you can use the InteractiveShell.cache_size option to set a maximum size for the cache. Setting this value to 0 disables caching.
To do so you have to create a file called ipython_config.py (or ipython_notebook_config.py) under your ~/.ipython/profile_default directory with the contents:
c = get_config()
c.InteractiveShell.cache_size = 0
After that you'll see:
In [1]: 1
Out[1]: 1
In [2]: Out[1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-d74cffe9cfe3> in <module>()
----> 1 Out[1]
KeyError: 1
You can also create different profiles for ipython using the command ipython profile create <name>. This will create a new profile under ~/.ipython/profile_<name> with a default configuration file. You can then launch ipython using the --profile <name> option to load that profile.
Alternatively you can use the %reset out magic to reset the output cache or use the %xdel magic to delete a specific object:
In [1]: 1
Out[1]: 1
In [2]: 2
Out[2]: 2
In [3]: %reset out
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
Flushing output cache (2 entries)
In [4]: Out[1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-d74cffe9cfe3> in <module>()
----> 1 Out[1]
KeyError: 1
In [5]: 1
Out[5]: 1
In [6]: 2
Out[6]: 2
In [7]: v = Out[5]
In [8]: %xdel v # requires a variable name, so you cannot write %xdel Out[5]
In [9]: Out[5] # xdel removes the value of v from Out and other caches
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-573c4eba9654> in <module>()
----> 1 Out[5]
KeyError: 5
I'm having a hard time debugging my code, which worked fine when tested on a small subset of the entire data. I could double-check types to be sure, but the error message is already informative enough: the list I made ended up being a float. But how?
The last three lines which ran:
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments = all_treatments.drop(['DIAGNOS','INDATUMA','date'], axis=1)
all_treatments['tobacco'] = tobacco(diagnoses)
The error:
Traceback (most recent call last):
File "treatments2_noiopro.py", line 97, in <module>
all_treatments['tobacco'] = tobacco(diagnoses)
File "treatments2_noiopro.py", line 13, in tobacco
for codes in codes_column]
TypeError: 'float' object is not iterable
FWIW, the function itself is:
def tobacco(codes_column):
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if codes else False
            for codes in codes_column]
I am using versions pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0 under Linux.
You can use str.split on the series and apply a function to the result:
def tobacco(codes):
    return any('C30' <= code < 'C40' or 'F17' <= code < 'F18' for code in codes)
data = ['C35 C50', 'C36', 'C37', 'C50 C51', 'F1 F2', 'F17', 'F3 F17', '']
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df
DIAGNOS
0 C35 C50
1 C36
2 C37
3 C50 C51
4 F1 F2
5 F17
6 F3 F17
7
df.DIAGNOS.str.split(' ').apply(tobacco)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 False
dtype: bool
Edit:
It seems that using str.contains is significantly faster than both methods.
tobacco_codes = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
data = ['C35 C50', 'C36', 'C37', 'C50 C51', 'F1 F2', 'F17', 'F3 F17', 'C3']
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df.DIAGNOS.str.contains(tobacco_codes)
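One caveat worth noting (my addition, not part of the answer above): the plain alternation also matches any code that merely starts with C30–C39 or F17, such as a hypothetical 'C300'. If that matters for your code lists, regex word boundaries fix it; a small sketch with invented codes:

```python
import pandas as pd

df = pd.DataFrame({'DIAGNOS': ['C35 C50', 'C300', 'F17', 'F171']})

loose = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
strict = r'\b(?:{})\b'.format(loose)  # anchor each code on word boundaries

print(df.DIAGNOS.str.contains(loose).tolist())   # [True, True, True, True]
print(df.DIAGNOS.str.contains(strict).tolist())  # [True, False, True, False]
```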
My guess is that diagnoses is a generator, and since you drop columns in line 2 of your code, this changes the generator. I can't test anything right now, but let me know whether it works if you comment out line 2 of your code.
On the following series:
0 1411161507178
1 1411138436009
2 1411123732180
3 1411167606146
4 1411124780140
5 1411159331327
6 1411131745474
7 1411151831454
8 1411152487758
9 1411137160544
Name: my_series, dtype: int64
This command (convert to timestamp, localize and convert to EST) works:
pd.to_datetime(my_series, unit='ms').apply(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))
but this one fails:
pd.to_datetime(my_series, unit='ms').tz_localize('UTC').tz_convert('US/Eastern')
with:
TypeError Traceback (most recent call last)
<ipython-input-3-58187a4b60f8> in <module>()
----> 1 lua = pd.to_datetime(df[column], unit='ms').tz_localize('UTC').tz_convert('US/Eastern')
/Users/josh/anaconda/envs/py34/lib/python3.4/site-packages/pandas/core/generic.py in tz_localize(self, tz, axis, copy, infer_dst)
3492 ax_name = self._get_axis_name(axis)
3493 raise TypeError('%s is not a valid DatetimeIndex or PeriodIndex' %
-> 3494 ax_name)
3495 else:
3496 ax = DatetimeIndex([],tz=tz)
TypeError: index is not a valid DatetimeIndex or PeriodIndex
and so does this one:
my_series.tz_localize('UTC').tz_convert('US/Eastern')
with:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-0a7cb1e94e1e> in <module>()
----> 1 lua = df[column].tz_localize('UTC').tz_convert('US/Eastern')
/Users/josh/anaconda/envs/py34/lib/python3.4/site-packages/pandas/core/generic.py in tz_localize(self, tz, axis, copy, infer_dst)
3492 ax_name = self._get_axis_name(axis)
3493 raise TypeError('%s is not a valid DatetimeIndex or PeriodIndex' %
-> 3494 ax_name)
3495 else:
3496 ax = DatetimeIndex([],tz=tz)
TypeError: index is not a valid DatetimeIndex or PeriodIndex
As far as I understand, the second approach above (the first one that fails) should work. Why does it fail?
As Jeff's answer mentions, tz_localize() and tz_convert() act on the index, not the data. This was a huge surprise to me too.
Since Jeff's answer was written, Pandas 0.15 added a new Series.dt accessor that helps your use case. You can now do this:
pd.to_datetime(my_series, unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
tz_localize/tz_convert act on the INDEX of the object, not on the values. The easiest fix is to turn it into a DatetimeIndex, then localize and convert. If you then want a Series back, you can use to_series().
In [47]: pd.DatetimeIndex(pd.to_datetime(s, unit='ms')).tz_localize('UTC').tz_convert('US/Eastern')
Out[47]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-19 17:18:27.178000-04:00, ..., 2014-09-19 10:32:40.544000-04:00]
Length: 10, Freq: None, Timezone: US/Eastern
This works fine:
pd.to_datetime(my_series, unit='ms', utc=True).dt.tz_convert('US/Eastern')
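Indeed; a quick self-contained check (using the first two values from the series above) that the utc=True shortcut matches the .dt chain from the earlier answer:

```python
import pandas as pd

my_series = pd.Series([1411161507178, 1411138436009])

a = pd.to_datetime(my_series, unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
b = pd.to_datetime(my_series, unit='ms', utc=True).dt.tz_convert('US/Eastern')

print(a.equals(b))   # True
print(str(a.dtype))  # datetime64[ns, US/Eastern]
```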