I'm trying to create a function where the user enters a year and the output is the top ten countries by military expenditure, using this Lynda class as a model.
Here's the dataframe:
df.dtypes
Country Name object
Country Code object
Year int32
CountryYear object
Population int32
GDP float64
MilExpend float64
Percent float64
dtype: object
Country Name Country Code Year CountryYear Population GDP MilExpend Percent
0 Aruba ABW 1960 ABW-1960 54208 0.0 0.0 0.0
I've tried this code and got errors:
Code:
def topten(Year):
    simple = df_details_merged.loc[Year].sort('MilExpend', ascending=False).reset_index()
    simple = simple.drop(['Country Code', 'CountryYear'], axis=1).head(10)
    simple.index = simple.index + 1
    return simple

topten(1990)
Can I get some assistance? I can't even figure out what the error is. :-( This is the rather big error I received:
C:\Users\mycomputer\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: sort is deprecated, use sort_values(inplace=True) for INPLACE sorting
from ipykernel import kernelapp as app
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1738 # if kind==mergesort, it can fail for object dtype
-> 1739 return arr.argsort(kind=kind)
1740 except TypeError:
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-105-0c974c6a1b44> in <module>()
----> 1 topten(1990)
<ipython-input-104-b8c336014d5b> in topten(Year)
1 def topten(Year):
----> 2 simple = df_details_merged.loc[Year].sort('MilExpend',ascending=False).reset_index()
3 simple = simple.drop(['Country Code', 'CountryYear'],axis=1).head(10)
4 simple.index = simple.index + 1
5
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort(self, axis, ascending, kind, na_position, inplace)
1831
1832 return self.sort_values(ascending=ascending, kind=kind,
-> 1833 na_position=na_position, inplace=inplace)
1834
1835 def order(self, na_last=None, ascending=True, kind='quicksort',
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in sort_values(self, axis, ascending, inplace, kind, na_position)
1751 idx = _default_index(len(self))
1752
-> 1753 argsorted = _try_kind_sort(arr[good])
1754
1755 if not ascending:
C:\Users\mycomputer\Anaconda3\lib\site-packages\pandas\core\series.py in _try_kind_sort(arr)
1741 # stable sort not available for object dtype
1742 # uses the argsort default quicksort
-> 1743 return arr.argsort(kind='quicksort')
1744
1745 arr = self._values
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'str'
The first argument to .loc is the row label.
When you call df_details_merged.loc[1990], pandas finds the row with the label 1990 and returns that row as a Series. So you get back a Series whose index is Country Name, Country Code, ..., with the values taken from that row. Your code then tries to sort this by MilExpend, and that's where it fails.
What you need isn't loc but a simple condition: df[df.Year == Year]. That says "give me the whole dataframe, but only the rows where the Year column equals whatever I've specified in the Year variable" (1990 in your example).
sort will still work for the time being, but is being deprecated, so use sort_values instead. Putting that together:
simple = df_details_merged[df_details_merged.Year == Year].sort_values(by='MilExpend', ascending=False).reset_index()
Then you can go ahead and drop the columns, and fetch the top 10 rows as you're doing now.
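Putting all of that together, here is a sketch of the full corrected function (assuming the column names shown in your dtypes output):

def topten(year):
    # keep only the rows for the requested year, then rank by military expenditure
    simple = (df_details_merged[df_details_merged.Year == year]
              .sort_values(by='MilExpend', ascending=False)
              .reset_index(drop=True)
              .drop(['Country Code', 'CountryYear'], axis=1)
              .head(10))
    simple.index = simple.index + 1  # label the rows 1-10 instead of 0-9
    return simple

topten(1990)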
I am grouping the data by gender to visualize what the Churn Rate (1 - Retention Rate) looks like for males and females.
My code:
df_data.groupby(by='gender')['Churn'].mean()
The error:
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-46-75992efc6958> in <module>()
----> 1 df_data.groupby(by='gender')['Churn'].mean()
/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py in mean(self, numeric_only)
1396 "mean",
1397 alt=lambda x, axis: Series(x).mean(numeric_only=numeric_only),
-> 1398 numeric_only=numeric_only,
1399 )
1400
/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
1051
1052 if len(output) == 0:
-> 1053 raise DataError("No numeric types to aggregate")
1054
1055 return self._wrap_aggregated_output(output, index=self.grouper.result_index)
DataError: No numeric types to aggregate
All your columns, even those that look like numbers, are strings. You must convert the "numeric" columns into actual numeric columns with .astype(int) before applying .mean(). For example:
df.tenure = df.tenure.astype(int)
For your specific case:
df_data.Churn = df_data.Churn.astype(int)
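A minimal sketch of the whole fix, assuming Churn is stored as numeric strings such as '0'/'1' (the data here is made up for illustration):

import pandas as pd

# hypothetical data mirroring the question: everything read in as strings
df_data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'Churn':  ['0', '1', '1', '0'],   # numeric values stored as strings
})

df_data.Churn = df_data.Churn.astype(int)             # strings -> integers
print(df_data.groupby(by='gender')['Churn'].mean())   # now .mean() works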
I am loading multiple parquet files containing timeseries data together, but the resulting dask dataframe has unknown divisions, so I can't apply various time series operations to it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 @derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link: https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says,
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple timeseries files as a dask dataframe so that pandas-style timeseries operations can be applied to it?
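A hedged sketch of one common workaround: if each parquet file is already sorted by timestamp and the files do not overlap, set_index with sorted=True lets dask establish divisions without a full shuffle (this assumes the Timestamps column is a datetime type):

import dask.dataframe as dd

df = dd.read_parquet('/path/to/*.parquet')
# declare the data already sorted so dask can set divisions cheaply
df = df.set_index('Timestamps', sorted=True)
df_resampled = df.resample('1T').mean().compute()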
In the stats question below, I am attempting a two-sample independent t-test in Python.
An analyst at a department store wants to evaluate a recent credit card promotion. To this end, 500 cardholders were randomly selected. Half received an ad promoting a reduced interest rate on purchases made over the next three months, and half received a standard seasonal ad. Is the promotion effective to increase sales?
Below is my code. I am making some mistake in writing it; please help.
from scipy import stats
std_promo = cust[(cust['insert'] == 'Standard')]
new_promo = cust[(cust['insert'] == 'New Promotion')]
print(std_promo.head(3))
print(new_promo.head(3))
id insert dollars
0 148 Standard 2232.771979
2 973 Standard 2327.092181
3 1096 Standard 1280.030541
id insert dollars
1 572 New Promotion 1403.807542
4 1541 New Promotion 1513.563200
5 1947 New Promotion 1729.627996
print (std_promo.mean())
print (new_promo.mean())
id 69003.000000
dollars 1566.389031
dtype: float64
id 64998.244000
dollars 1637.499983
dtype: float64
print (std_promo.std())
print (new_promo.std())
id 37753.106923
dollars 346.673047
dtype: float64
id 38508.218870
dollars 356.703169
dtype: float64
stats.ttest_ind(a= std_promo, b= new_promo, equal_var= True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-76-b40f7d9d7a3e> in <module>
1 stats.ttest_ind(a= std_promo,
----> 2 b= new_promo)
~\Anaconda3\lib\site-packages\scipy\stats\stats.py in ttest_ind(a, b, axis, equal_var, nan_policy)
4163 return Ttest_indResult(np.nan, np.nan)
4164
-> 4165 v1 = np.var(a, axis, ddof=1)
4166 v2 = np.var(b, axis, ddof=1)
4167 n1 = a.shape[axis]
~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in var(a, axis, dtype, out, ddof, keepdims)
3365
3366 return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3367 **kwargs)
3368
3369
~\Anaconda3\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
108 if isinstance(arrmean, mu.ndarray):
109 arrmean = um.true_divide(
--> 110 arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
111 else:
112 arrmean = arrmean.dtype.type(arrmean / rcount)
TypeError: unsupported operand type(s) for /: 'str' and 'int'
I think you need to change:
stats.ttest_ind(a= std_promo, b= new_promo, equal_var= True)
to
stats.ttest_ind(a= std_promo.dollars, b= new_promo.dollars, equal_var= True)
I created a DataFrame similar to yours, ran this using dollars, and it worked:
Ttest_indResult(statistic=7.144078895160622, pvalue=9.765848295636031e-05)
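The original call fails because passing the whole DataFrame includes the string column insert, so NumPy ends up trying to divide strings. A self-contained sketch of the fix, with made-up data mirroring the question's columns:

import pandas as pd
from scipy import stats

# hypothetical data with the same shape as the question's df
cust = pd.DataFrame({
    'insert':  ['Standard', 'New Promotion'] * 4,
    'dollars': [2232.77, 1403.81, 2327.09, 1513.56,
                1280.03, 1729.63, 1566.39, 1637.50],
})

std_promo = cust[cust['insert'] == 'Standard']
new_promo = cust[cust['insert'] == 'New Promotion']

# pass only the numeric column, not the whole frame
print(stats.ttest_ind(a=std_promo.dollars, b=new_promo.dollars, equal_var=True))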
I am trying to run a groupby that combines a variable and an extra string as the grouping fields/columns. Can someone help with the syntax? It is one of those things that would probably take me a day to figure out.
Mix ='business_unit','isoname','planning_channel','is_tracked','planning_partner','week'
So the below works:
dfJoinsP2 = dfJoinsP2.groupby(Mix)['joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h'].sum().reset_index()
But when I try and add an extra field called 'Period_Number' I get an error.
dfJoinsP2 = dfJoinsP2.groupby(Mix,'Period_Number')['joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h'].sum().reset_index()
Just to recreate and illustrate your problem:
In [22]:
# define our cols, create a dummy df
cols = ['business_unit','isoname','planning_channel','is_tracked','planning_partner','week','joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h', 'Period Number']
df = pd.DataFrame(columns=cols, data =np.random.randn(5, len(cols)))
df
Out[22]:
business_unit isoname planning_channel is_tracked planning_partner \
0 -0.818644 1.150678 -0.860677 -0.333496 -0.292689
1 0.476575 -0.018507 -1.917119 0.360656 0.381106
2 1.187570 1.105363 1.955066 0.154020 1.996389
3 0.318762 0.962469 0.565538 0.671002 -0.675688
4 -0.070671 -1.717793 -0.085815 0.089589 0.892412
week joined_subs_cmap initial_billed_subs billed_d1 churn_d1 \
0 -0.681875 1.138119 -1.071672 0.409712 -1.066456
1 -0.235040 0.559950 0.082890 -0.372671 0.804438
2 1.707340 0.893437 0.316266 1.852508 -2.554488
3 -2.055322 1.848388 -1.695563 -0.826089 -0.588229
4 -0.325098 0.827455 0.535827 -0.930963 0.211628
churn_24h Period Number
0 1.067530 0.377579
1 0.097042 -1.947681
2 -0.327243 -1.137146
3 0.230110 1.470183
4 1.191042 2.167251
In [23]:
# what you are trying to do
Mix ='business_unit','isoname','planning_channel','is_tracked','planning_partner','week'
df.groupby(Mix, 'Period Number')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-dc75b3902303> in <module>()
1 Mix ='business_unit','isoname','planning_channel','is_tracked','planning_partner','week'
----> 2 df.groupby(Mix, 'Period Number')
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\pandas\core\generic.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze)
2894 if level is None and by is None:
2895 raise TypeError("You have to supply one of 'by' and 'level'")
-> 2896 axis = self._get_axis_number(axis)
2897 return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
2898 sort=sort, group_keys=group_keys, squeeze=squeeze)
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\pandas\core\generic.py in _get_axis_number(self, axis)
294 pass
295 raise ValueError('No axis named {0} for object type {1}'
--> 296 .format(axis, type(self)))
297
298 def _get_axis_name(self, axis):
ValueError: No axis named Period Number for object type <class 'pandas.core.frame.DataFrame'>
So you get a ValueError because 'Period Number' is being interpreted as an axis value, which is of course invalid and not what you intended.
The other point here is that the way you defined Mix results in a tuple; if instead it were a list, we could append the additional column of interest and all would be fine:
In [24]:
Mix = ['business_unit','isoname','planning_channel','is_tracked','planning_partner','week']
Mix.append('Period Number')
df.groupby(Mix)['joined_subs_cmap', 'initial_billed_subs', 'billed_d1', 'churn_d1' , 'churn_24h'].sum().reset_index()
Out[24]:
business_unit isoname planning_channel is_tracked planning_partner \
0 -0.818644 1.150678 -0.860677 -0.333496 -0.292689
1 -0.070671 -1.717793 -0.085815 0.089589 0.892412
2 0.318762 0.962469 0.565538 0.671002 -0.675688
3 0.476575 -0.018507 -1.917119 0.360656 0.381106
4 1.187570 1.105363 1.955066 0.154020 1.996389
week Period Number joined_subs_cmap initial_billed_subs billed_d1 \
0 -0.681875 0.377579 1.138119 -1.071672 0.409712
1 -0.325098 2.167251 0.827455 0.535827 -0.930963
2 -2.055322 1.470183 1.848388 -1.695563 -0.826089
3 -0.235040 -1.947681 0.559950 0.082890 -0.372671
4 1.707340 -1.137146 0.893437 0.316266 1.852508
churn_d1 churn_24h
0 -1.066456 1.067530
1 0.211628 1.191042
2 -0.588229 0.230110
3 0.804438 0.097042
4 -2.554488 -0.327243
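If you'd rather not mutate Mix in place, a hedged equivalent is to build the key list inline (double brackets on the column selection, which newer pandas versions require):

df.groupby(list(Mix) + ['Period Number'])[['joined_subs_cmap', 'initial_billed_subs',
    'billed_d1', 'churn_d1', 'churn_24h']].sum().reset_index()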
On the following series:
0 1411161507178
1 1411138436009
2 1411123732180
3 1411167606146
4 1411124780140
5 1411159331327
6 1411131745474
7 1411151831454
8 1411152487758
9 1411137160544
Name: my_series, dtype: int64
This command (convert to timestamp, localize and convert to EST) works:
pd.to_datetime(my_series, unit='ms').apply(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))
but this one fails:
pd.to_datetime(my_series, unit='ms').tz_localize('UTC').tz_convert('US/Eastern')
with:
TypeError Traceback (most recent call last)
<ipython-input-3-58187a4b60f8> in <module>()
----> 1 lua = pd.to_datetime(df[column], unit='ms').tz_localize('UTC').tz_convert('US/Eastern')
/Users/josh/anaconda/envs/py34/lib/python3.4/site-packages/pandas/core/generic.py in tz_localize(self, tz, axis, copy, infer_dst)
3492 ax_name = self._get_axis_name(axis)
3493 raise TypeError('%s is not a valid DatetimeIndex or PeriodIndex' %
-> 3494 ax_name)
3495 else:
3496 ax = DatetimeIndex([],tz=tz)
TypeError: index is not a valid DatetimeIndex or PeriodIndex
and so does this one:
my_series.tz_localize('UTC').tz_convert('US/Eastern')
with:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-0a7cb1e94e1e> in <module>()
----> 1 lua = df[column].tz_localize('UTC').tz_convert('US/Eastern')
/Users/josh/anaconda/envs/py34/lib/python3.4/site-packages/pandas/core/generic.py in tz_localize(self, tz, axis, copy, infer_dst)
3492 ax_name = self._get_axis_name(axis)
3493 raise TypeError('%s is not a valid DatetimeIndex or PeriodIndex' %
-> 3494 ax_name)
3495 else:
3496 ax = DatetimeIndex([],tz=tz)
TypeError: index is not a valid DatetimeIndex or PeriodIndex
As far as I understand, the second approach above (the first one that fails) should work. Why does it fail?
As Jeff's answer mentions, tz_localize() and tz_convert() act on the index, not the data. This was a huge surprise to me too.
Since Jeff's answer was written, Pandas 0.15 added a new Series.dt accessor that helps your use case. You can now do this:
pd.to_datetime(my_series, unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
tz_localize/tz_convert act on the INDEX of the object, not on the values. It's easiest to simply turn it into an index, then localize and convert. If you then want a Series back, you can use to_series():
In [47]: pd.DatetimeIndex(pd.to_datetime(s,unit='ms')).tz_localize('UTC').tz_convert('US/Eastern')
Out[47]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-19 17:18:27.178000-04:00, ..., 2014-09-19 10:32:40.544000-04:00]
Length: 10, Freq: None, Timezone: US/Eastern
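For example, a sketch of the same chain converted back to a Series (assuming s is the series shown above):

eastern = (pd.DatetimeIndex(pd.to_datetime(s, unit='ms'))
           .tz_localize('UTC')
           .tz_convert('US/Eastern')
           .to_series())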
This works fine:
pd.to_datetime(my_series, unit='ms', utc=True).dt.tz_convert('US/Eastern')