Unable to apply methods on timestamps using Series built-ins - python

On the following series:
0 1411161507178
1 1411138436009
2 1411123732180
3 1411167606146
4 1411124780140
5 1411159331327
6 1411131745474
7 1411151831454
8 1411152487758
9 1411137160544
Name: my_series, dtype: int64
This command (convert to timestamp, localize and convert to EST) works:
pd.to_datetime(my_series, unit='ms').apply(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))
but this one fails:
pd.to_datetime(my_series, unit='ms').tz_localize('UTC').tz_convert('US/Eastern')
with:
TypeError Traceback (most recent call last)
<ipython-input-3-58187a4b60f8> in <module>()
----> 1 lua = pd.to_datetime(df[column], unit='ms').tz_localize('UTC').tz_convert('US/Eastern')
/Users/josh/anaconda/envs/py34/lib/python3.4/site-packages/pandas/core/generic.py in tz_localize(self, tz, axis, copy, infer_dst)
3492 ax_name = self._get_axis_name(axis)
3493 raise TypeError('%s is not a valid DatetimeIndex or PeriodIndex' %
-> 3494 ax_name)
3495 else:
3496 ax = DatetimeIndex([],tz=tz)
TypeError: index is not a valid DatetimeIndex or PeriodIndex
and so does this one:
my_series.tz_localize('UTC').tz_convert('US/Eastern')
with:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-0a7cb1e94e1e> in <module>()
----> 1 lua = df[column].tz_localize('UTC').tz_convert('US/Eastern')
/Users/josh/anaconda/envs/py34/lib/python3.4/site-packages/pandas/core/generic.py in tz_localize(self, tz, axis, copy, infer_dst)
3492 ax_name = self._get_axis_name(axis)
3493 raise TypeError('%s is not a valid DatetimeIndex or PeriodIndex' %
-> 3494 ax_name)
3495 else:
3496 ax = DatetimeIndex([],tz=tz)
TypeError: index is not a valid DatetimeIndex or PeriodIndex
As far as I understand, the second approach above (the first one that fails) should work. Why does it fail?

As Jeff's answer mentions, tz_localize() and tz_convert() act on the index, not the data. This was a huge surprise to me too.
Since Jeff's answer was written, Pandas 0.15 added a new Series.dt accessor that helps your use case. You can now do this:
pd.to_datetime(my_series, unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

tz_localize/tz_convert act on the INDEX of the object, not on the values. It's easiest to turn the series into a DatetimeIndex, then localize and convert. If you then want a Series back you can use to_series()
In [47]: pd.DatetimeIndex(pd.to_datetime(s,unit='ms')).tz_localize('UTC').tz_convert('US/Eastern')
Out[47]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-19 17:18:27.178000-04:00, ..., 2014-09-19 10:32:40.544000-04:00]
Length: 10, Freq: None, Timezone: US/Eastern

This also works, letting to_datetime localize to UTC directly:
pd.to_datetime(my_series,unit='ms', utc=True).dt.tz_convert('US/Eastern')
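Putting the working one-liner into a self-contained sketch (the sample values are the first two from the question):

```python
import pandas as pd

# Epoch timestamps in milliseconds, as in the question
s = pd.Series([1411161507178, 1411138436009], name="my_series")

# utc=True makes to_datetime return a tz-aware series directly,
# so only a single .dt.tz_convert is needed afterwards
eastern = pd.to_datetime(s, unit="ms", utc=True).dt.tz_convert("US/Eastern")

# First element matches the answer's output:
# 2014-09-19 17:18:27.178000-04:00
print(eastern)
```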


TypeError while formatting pandas.df.pct_change() output to percentage

I'm trying to calculate the daily returns of stock in percentage format from a CSV file by defining a function.
Here's my code:
def daily_ret(ticker):
    return f"{df[ticker].pct_change()*100:.2f}%"
When I call the function, I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-7122588f1289> in <module>()
----> 1 daily_ret('AAPL')
<ipython-input-39-7dd6285eb14d> in daily_ret(ticker)
1 def daily_ret(ticker):
----> 2 return f"{df[ticker].pct_change()*100:.2f}%"
TypeError: unsupported format string passed to Series.__format__
Where am I going wrong?
A format spec like :.2f applies only to scalar values; f-strings can't format an iterable such as a Series that way.
Use map or apply to format each element instead:
def daily_ret(ticker):
    return (df[ticker].pct_change() * 100).map("{:.2f}%".format)

def daily_ret(ticker):
    return (df[ticker].pct_change() * 100).apply("{:.2f}%".format)
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(1, 6)})
print(daily_ret('A'))
0 nan%
1 100.00%
2 50.00%
3 33.33%
4 25.00%
Name: A, dtype: object
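As an aside, the first row comes out as the string "nan%" because pct_change leaves a leading NaN. If a real NaN is preferable, Series.map accepts na_action='ignore'; a small variant of the same idea:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1, 6)})

def daily_ret(ticker):
    # na_action='ignore' skips NaN values, leaving them as NaN
    # instead of formatting them into the string "nan%"
    return (df[ticker].pct_change() * 100).map("{:.2f}%".format,
                                               na_action='ignore')

out = daily_ret('A')
print(out)
```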

How can I iterate through elements of a Koalas groupby?

I would like to iterate through groups in a dataframe. This is possible in pandas, but when I port this to koalas, I get an error.
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
for a in df.groupby('x'):
    print(a)
Here is the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-35-d4164d1f71e0> in <module>
----> 1 for a in df.groupby('x'):
2 print(a)
/opt/conda/lib/python3.7/site-packages/databricks/koalas/groupby.py in __getitem__(self, item)
2630 if self._as_index and is_name_like_value(item):
2631 return SeriesGroupBy(
-> 2632 self._kdf._kser_for(item if is_name_like_tuple(item) else (item,)),
2633 self._groupkeys,
2634 dropna=self._dropna,
/opt/conda/lib/python3.7/site-packages/databricks/koalas/frame.py in _kser_for(self, label)
721 Name: id, dtype: int64
722 """
--> 723 return self._ksers[label]
724
725 def _apply_series_op(self, op, should_resolve: bool = False):
KeyError: (0,)
Is this kind of group iteration possible in koalas? The koalas documentation kind of implies it is possible - https://koalas.readthedocs.io/en/latest/reference/groupby.html
Groupby iteration is not yet implemented:
https://github.com/databricks/koalas/issues/2014
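Until that issue is resolved, one possible workaround is to iterate manually: collect the distinct key values, then filter once per key. The sketch below shows the pattern in plain pandas; porting it to Koalas via unique() and boolean filtering is assumed to work the same way, but that is untested here:

```python
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

# Grouping on 'y' here (rather than 'x') so at least one group
# has more than one row
groups = {}
for key in pdf['y'].unique():
    groups[key] = pdf[pdf['y'] == key]

for key, group in groups.items():
    print(key, len(group))
```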

Python-Top Ten Function

I'm trying to create a function where the user puts in the year and the output is the top ten countries by expenditures using this Lynda class as a model.
Here's the data frame
df.dtypes
Country Name object
Country Code object
Year int32
CountryYear object
Population int32
GDP float64
MilExpend float64
Percent float64
dtype: object
Country Name Country Code Year CountryYear Pop GDP Expend Percent
0 Aruba ABW 1960 ABW-1960 54208 0.0 0.0 0.0
I've tried this code and got errors:
Code:
def topten(Year):
    simple = df_details_merged.loc[Year].sort('MilExpend', ascending=False).reset_index()
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1
    return simple
topten(1990)
This is the rather big error I received:
Can I get some assistance? I can't even figure out what the error is. :-(
C:\Users\mycomputer\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: sort is deprecated, use sort_values(inplace=True) for INPLACE sorting
from ipykernel import kernelapp as app
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1738 # if kind==mergesort, it can fail for object dtype
-> 1739 return arr.argsort(kind=kind)
1740 except TypeError:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-105-0c974c6a1b44> in <module>()
----> 1 topten(1990)
<ipython-input-104-b8c336014d5b> in topten(Year)
1 def topten(Year):
----> 2 simple = df_details_merged.loc[Year].sort('MilExpend',ascending=False).reset_index()
3 simple = simple.drop(['Country Code', 'CountryYear'],axis=1).head(10)
4 simple.index = simple.index + 1
5
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort(self, axis, ascending, kind, na_position, inplace)
1831
1832 return self.sort_values(ascending=ascending, kind=kind,
-> 1833 na_position=na_position, inplace=inplace)
1834
1835 def order(self, na_last=None, ascending=True, kind='quicksort',
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort_values(self, axis, ascending, inplace, kind, na_position)
1751 idx = _default_index(len(self))
1752
-> 1753 argsorted = _try_kind_sort(arr[good])
1754
1755 if not ascending:
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1741 # stable sort not available for object dtype
1742 # uses the argsort default quicksort
-> 1743 return arr.argsort(kind='quicksort')
1744
1745 arr = self._values
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
The first argument to .loc is the row label.
When you call df_details_merged.loc[1990], pandas finds the row with the label 1990 and returns that row as a Series. So you get back a Series whose index is Country Name, Country Code, ..., with the values taken from that row. Your code then tries to sort this single row by MilExpend, and that's where it fails.
What you need isn't loc but a boolean condition: df[df.Year == Year]. That says "give me the whole dataframe, but only the rows where the Year column contains whatever I've passed in the Year variable (1990 in your example)".
sort will still work for the time being, but is being deprecated, so use sort_values instead. Putting that together:
simple = df_details_merged[df_details_merged.Year == Year].sort_values(by='MilExpend', ascending=False).reset_index()
Then you can go ahead and drop the columns, and fetch the top 10 rows as you're doing now.
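Here is the whole fix as a runnable sketch, with a tiny hypothetical stand-in for df_details_merged (the values are made up):

```python
import pandas as pd

# Hypothetical stand-in for df_details_merged, just enough columns to run
df_details_merged = pd.DataFrame({
    'Country Name': ['A', 'B', 'C'],
    'Country Code': ['AAA', 'BBB', 'CCC'],
    'Year': [1990, 1990, 1991],
    'CountryYear': ['AAA-1990', 'BBB-1990', 'CCC-1991'],
    'MilExpend': [5.0, 9.0, 2.0],
})

def topten(year):
    # Boolean filter on the Year column, then sort_values (sort is deprecated);
    # drop=True discards the old index instead of keeping it as a column
    simple = (df_details_merged[df_details_merged.Year == year]
              .sort_values(by='MilExpend', ascending=False)
              .reset_index(drop=True))
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1  # rank rows starting at 1
    return simple

top = topten(1990)
print(top)
```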

python pandas functions with and without parentheses

I notice that many DataFrame functions if used without parentheses seem to behave like 'properties' e.g.
In [200]: df = DataFrame (np.random.randn (7,2))
In [201]: df.head ()
Out[201]:
0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
In [202]: df.head
Out[202]:
<bound method DataFrame.head of 0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
5 0.940440 -2.279781
6 1.152370 -2.733546>
How is this done and is it good practice ?
This is with pandas 0.15.1 on linux
They are different, and the bare form is not recommended: df.head is just the bound method (whose repr happens to include the data), while df.head() returns the actual output.
Here's why you should not do this:
In [23]:
t = df.head
In [24]:
t.iloc[0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-24-b523e5ce509d> in <module>()
----> 1 t.iloc[0]
AttributeError: 'function' object has no attribute 'iloc'
In [25]:
t = df.head()
t.iloc[0]
Out[25]:
0 0.712635
1 0.363903
Name: 0, dtype: float64
So without parentheses you see output that appears valid, but if you take a reference to it and try to use it, you end up operating on the method itself rather than on the slice of the df you intended.
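To make the distinction concrete, here is a minimal sketch contrasting a genuine property (shape) with an uncalled method (head):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Genuine properties need no parentheses
print(df.shape)    # (3, 1)
print(df.columns)

# head is a method: without () you get the bound method object,
# which is callable but is not a DataFrame
print(callable(df.head))

# Calling it returns an actual DataFrame you can slice
print(df.head(2))
```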

syntax issue with python groupby using a variable mixed with a list or string

I am trying to run a groupby that combines a variable with an extra string for the grouped fields/columns. Can someone help with the syntax? It is one of those things that would probably take me a day to figure out.
Mix ='business_unit','isoname','planning_channel','is_tracked','planning_partner','week'
So the below works:
dfJoinsP2 = dfJoinsP2.groupby(Mix)['joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h'].sum().reset_index()
But when I try and add an extra field called 'Period_Number' I get an error.
dfJoinsP2 = dfJoinsP2.groupby(Mix,'Period_Number')['joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h'].sum().reset_index()
Just to recreate and illustrate your problem:
In [22]:
# define our cols, create a dummy df
cols = ['business_unit','isoname','planning_channel','is_tracked','planning_partner','week','joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h', 'Period Number']
df = pd.DataFrame(columns=cols, data =np.random.randn(5, len(cols)))
df
Out[22]:
business_unit isoname planning_channel is_tracked planning_partner \
0 -0.818644 1.150678 -0.860677 -0.333496 -0.292689
1 0.476575 -0.018507 -1.917119 0.360656 0.381106
2 1.187570 1.105363 1.955066 0.154020 1.996389
3 0.318762 0.962469 0.565538 0.671002 -0.675688
4 -0.070671 -1.717793 -0.085815 0.089589 0.892412
week joined_subs_cmap initial_billed_subs billed_d1 churn_d1 \
0 -0.681875 1.138119 -1.071672 0.409712 -1.066456
1 -0.235040 0.559950 0.082890 -0.372671 0.804438
2 1.707340 0.893437 0.316266 1.852508 -2.554488
3 -2.055322 1.848388 -1.695563 -0.826089 -0.588229
4 -0.325098 0.827455 0.535827 -0.930963 0.211628
churn_24h Period Number
0 1.067530 0.377579
1 0.097042 -1.947681
2 -0.327243 -1.137146
3 0.230110 1.470183
4 1.191042 2.167251
In [23]:
# what you are trying to do
Mix ='business_unit','isoname','planning_channel','is_tracked','planning_partner','week'
df.groupby(Mix, 'Period Number')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-dc75b3902303> in <module>()
1 Mix ='business_unit','isoname','planning_channel','is_tracked','planning_partner','week'
----> 2 df.groupby(Mix, 'Period Number')
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\pandas\core\generic.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze)
2894 if level is None and by is None:
2895 raise TypeError("You have to supply one of 'by' and 'level'")
-> 2896 axis = self._get_axis_number(axis)
2897 return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
2898 sort=sort, group_keys=group_keys, squeeze=squeeze)
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\pandas\core\generic.py in _get_axis_number(self, axis)
294 pass
295 raise ValueError('No axis named {0} for object type {1}'
--> 296 .format(axis, type(self)))
297
298 def _get_axis_name(self, axis):
ValueError: No axis named Period Number for object type <class 'pandas.core.frame.DataFrame'>
So you get a ValueError because 'Period Number' is being interpreted as an axis value which is of course invalid and not what you intended.
The other point here is that the way you defined Mix results in a tuple; if it were a list instead, we could append the additional column of interest and all would be fine:
In [24]:
Mix = ['business_unit','isoname','planning_channel','is_tracked','planning_partner','week']
Mix.append('Period Number')
df.groupby(Mix)['joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h'].sum().reset_index()
Out[24]:
business_unit isoname planning_channel is_tracked planning_partner \
0 -0.818644 1.150678 -0.860677 -0.333496 -0.292689
1 -0.070671 -1.717793 -0.085815 0.089589 0.892412
2 0.318762 0.962469 0.565538 0.671002 -0.675688
3 0.476575 -0.018507 -1.917119 0.360656 0.381106
4 1.187570 1.105363 1.955066 0.154020 1.996389
week Period Number joined_subs_cmap initial_billed_subs billed_d1 \
0 -0.681875 0.377579 1.138119 -1.071672 0.409712
1 -0.325098 2.167251 0.827455 0.535827 -0.930963
2 -2.055322 1.470183 1.848388 -1.695563 -0.826089
3 -0.235040 -1.947681 0.559950 0.082890 -0.372671
4 1.707340 -1.137146 0.893437 0.316266 1.852508
churn_d1 churn_24h
0 -1.066456 1.067530
1 0.211628 1.191042
2 -0.588229 0.230110
3 0.804438 0.097042
4 -2.554488 -0.327243
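An equivalent fix without mutating Mix is to convert the tuple and concatenate at the call site; a minimal sketch with made-up data and a trimmed-down set of the post's columns:

```python
import pandas as pd

# Made-up data using a trimmed-down set of the post's columns
df = pd.DataFrame({
    'business_unit': ['N', 'N', 'S'],
    'Period Number': [1, 1, 2],
    'billed_d1': [10, 20, 30],
})

# A bare comma-separated assignment creates a tuple; list() converts it
# so extra key columns can be concatenated without touching Mix itself
Mix = ('business_unit',)
out = df.groupby(list(Mix) + ['Period Number'])['billed_d1'].sum().reset_index()
print(out)
```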
