What is the best way to use pandas.pivot_table to calculate aggregate functions over the whole table without providing a grouping?
For example, I want to calculate the sum of A, B and C into one table with a single row, without grouping by any of the columns:
>>> x = pd.DataFrame({'A':[1,2,3],'B':[8,7,6],'C':[0,3,2]})
>>> x
A B C
0 1 8 0
1 2 7 3
2 3 6 2
>>> x.pivot_table(values=['A','B','C'],aggfunc=np.sum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/tools/pivot.py", line 103, in pivot_table
grouped = data.groupby(keys)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/generic.py", line 2434, in groupby
sort=sort, group_keys=group_keys, squeeze=squeeze)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 789, in groupby
return klass(obj, by, **kwds)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 238, in __init__
level=level, sort=sort)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 1622, in _get_grouper
raise ValueError('No group keys passed!')
ValueError: No group keys passed!
Also, I would like to use a custom aggfunc; the np.sum above is just an example.
Thanks.
I think you're asking how to apply a function to every column of a DataFrame. To do this, call the apply method of your DataFrame:
def myfunc(col):
    return np.sum(col)
x.apply(myfunc)
Out[1]:
A 6
B 21
C 5
dtype: int64
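If a one-row table is wanted rather than a Series, the result of apply can be reshaped. This is a minimal sketch, assuming the frame x from the question; the function col_range is made up here to stand in for any custom aggregation:

```python
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3], 'B': [8, 7, 6], 'C': [0, 3, 2]})

def col_range(col):
    # Hypothetical custom aggregation: largest value minus smallest value.
    return col.max() - col.min()

# apply calls the function once per column and returns a Series
# indexed by column name.
per_column = x.apply(col_range)

# to_frame().T turns that Series into a DataFrame with a single row.
single_row = per_column.to_frame().T
print(single_row)
```

No grouping key is involved, so the "No group keys passed!" error does not come up.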
I had the same error when calling pivot_table on a pandas DataFrame:
import numpy as np
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values='weekly_sales')
# Print mean_sales_by_type
print(mean_sales_by_type)
Here's the error:
File "<stdin>", line 889, in __init__
grouper, exclusions, obj = get_grouper(
File "<stdin>", line 896, in get_grouper
raise ValueError("No group keys passed!")
ValueError: No group keys passed!
I finally fixed it by specifying the index argument of pivot_table (after values):
mean_sales_by_type = sales.pivot_table(values='weekly_sales',index='type')
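In your case, pivot_table likewise needs a grouping key passed as index (or columns). Your frame has no such column, so one would have to be added first, for example a constant one:
x.assign(all_rows=0).pivot_table(values=['A','B','C'], index='all_rows', aggfunc=np.sum)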
I have a DataFrame with about 400,000 rows. When I try to get all the data using iloc, it throws out-of-bounds errors. Here is what I have tried:
index_second_update = the_data.index.tolist()
the_data.iloc[index_second_update]
Traceback (most recent call last):
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2130, in _get_list_axis
return self.obj.take(key, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/generic.py",
line 3604, in take
indices, axis=self._get_block_manager_axis(axis), verify=True
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py",
line 1389, in take
indexer = maybe_convert_indices(indexer, n)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexers.py",
line 201, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 1424, in getitem
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2148, in _getitem_axis
return self._get_list_axis(key, axis=axis)
File "/home/dev/.local/lib/python3.6/site-packages/pandas/core/indexing.py",
line 2133, in _get_list_axis
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
Some more details:
len(index_second_update) = 446882
index_second_update == the_data.index.tolist()
True
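The strange thing is that it breaks down at around 200,000 rows. Up until then it works perfectly fine.
iloc selects rows by integer position, while loc selects them by label, and a row's label is not necessarily its row number. Your index values are labels, so passing them to iloc fails as soon as one of them is larger than the last valid position.
Here is code that will work for you; it accesses the data by row label: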
index_second_update = the_data.index.tolist()
the_data.loc[index_second_update]
Or, even more simply:
the_data.loc[the_data.index]
As an example of an index that does not consist of row numbers, look at the DataFrame below, where the rows are labeled by name:
import pandas as pd
csv = """\
Name,Birth Year
Joe,2000
Bill,1998
Mike,1996
Frank,1995"""
from io import StringIO
df = pd.read_csv(StringIO(csv))
df.set_index('Name')
Birth Year
Name
Joe 2000
Bill 1998
Mike 1996
Frank 1995
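The difference between the two indexers shows up on a small frame whose integer labels have gaps. This is a sketch with made-up data (the frame and its values are illustrative, not taken from the question):

```python
import pandas as pd

# Three rows whose labels are 0, 2 and 5, as can happen after filtering rows.
df = pd.DataFrame({'value': [10, 20, 30]}, index=[0, 2, 5])
labels = df.index.tolist()

# loc looks rows up by label, so every label in the index is valid.
by_label = df.loc[labels]

# iloc looks rows up by position; only positions 0, 1 and 2 exist here,
# so the label 5 is out of bounds.
try:
    df.iloc[labels]
    positional_lookup_failed = False
except IndexError:
    positional_lookup_failed = True

print(by_label)
print(positional_lookup_failed)
```

If positional access is really what is needed, calling reset_index(drop=True) first makes labels and positions coincide again.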
With this example DataFrame: df = pd.DataFrame([['A-3', 'B-4'], ['C-box', 'D1-go']])
Calling extract on individual columns as series works fine:
df.iloc[:, 0].str.extract('-(.+)')
df.iloc[:, 1].str.extract('-(.+)')
and also on the other axis:
df.iloc[0, :].str.extract('-(.+)')
df.iloc[1, :].str.extract('-(.+)')
So, I'd expect using apply would work (by applying extract to each column):
df.apply(lambda s: s.str.extract('-(.+)'), axis=0)
But it throws this error:
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-588-70b1808d5457>", line 2, in <module>
df.apply(lambda s: s.str.extract('-(.+)'))
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 260, in apply_standard
return self.wrap_results()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 308, in wrap_results
return self.wrap_results_for_axis()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 340, in wrap_results_for_axis
result = self.obj._constructor(data=results)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 51, in arrays_to_mgr
index = extract_index(arrays)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 308, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Using axis=1 yields an unexpected result, a Series with each row being a Series:
Out[2]:
0 0
0 3
1 4
1 0
0 box
1 go
dtype: object
I'm using apply because I think it would give the fastest execution time, but I'm open to other suggestions.
You can use split instead.
df.apply(lambda s: s.str.split('-', expand=True)[1])
Out[1]:
0 1
0 3 4
1 box go
The default value of expand in str.extract is True, in which case it returns a DataFrame. Since you are applying it to multiple columns, apply receives one DataFrame per column and cannot combine them. Set expand to False so that each call returns a Series:
df.apply(lambda x: x.str.extract('-(.*)', expand = False))
0 1
0 3 4
1 box go
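The effect of expand on the return type can be checked directly. This is a small sketch reusing the example frame from the question; the variable names are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame([['A-3', 'B-4'], ['C-box', 'D1-go']])
pattern = r'-(.+)'

# With expand=True a single capture group still comes back as a DataFrame.
as_frame = df[0].str.extract(pattern, expand=True)

# With expand=False a single capture group comes back as a Series,
# which is what apply needs in order to rebuild a DataFrame column by column.
as_series = df[0].str.extract(pattern, expand=False)

result = df.apply(lambda s: s.str.extract(pattern, expand=False))
print(result)
```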
I'm using the pandas DataFrame groupby feature. I want to group by column c_b and calculate the unique count for columns c_a and c_c.
Expected results,
c_b,c_a_unique_count,c_c_unique_count
python,2,2
c++,2,2
I ran into a strange error about an unhashable type. Does anyone have any ideas? Thanks.
Input file,
c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,1.0
go,c++,std,0.0
Source code,
sample = pd.read_csv('123.csv', header=None, skiprows=1,
dtype={0:str, 1:str, 2:str, 3:float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
sampleGroup = sample.groupby('c_b')
results = sampleGroup.count()[:,[0,2]]
results.to_csv(derivedFeatureFile, index= False)
Error message,
Traceback (most recent call last):
File "/Users/foo/personal/featureExtraction/kaggleExercise.py", line 134, in <module>
unitTest()
File "/Users/foo/personal/featureExtraction/kaggleExercise.py", line 129, in unitTest
results = sampleGroup.count()[:,[0,2]]
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1348, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
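For the number of unique elements in each group, you can use nunique. Select the columns with a list (double brackets); indexing a groupby with a bare tuple of column names is no longer supported in recent pandas versions:
df.groupby('c_b')[['c_a', 'c_c']].agg(pd.Series.nunique)
Out:
c_a c_c
c_b
c++ 2 2
python 2 2
To keep c_b as a regular column, pass as_index=False:
df.groupby('c_b', as_index=False)[['c_a', 'c_c']].agg(pd.Series.nunique)
Out:
c_b c_a c_c
0 c++ 2 2
1 python 2 2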
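For completeness, here is a self-contained sketch that starts from the input file in the question and produces the column names of the expected output. It reads the data from an in-memory string instead of 123.csv, and the rename step is an addition made here for illustration:

```python
import pandas as pd
from io import StringIO

csv_text = """c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,1.0
go,c++,std,0.0"""

sample = pd.read_csv(StringIO(csv_text))

# nunique counts distinct values per group; count() would count rows instead.
results = (
    sample.groupby('c_b')[['c_a', 'c_c']]
    .nunique()
    .rename(columns={'c_a': 'c_a_unique_count', 'c_c': 'c_c_unique_count'})
    .reset_index()
)
print(results.to_csv(index=False))
```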
I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())
The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:
domain_name domain_length
0 webmagnat.ro 12
1 nickelfreesolutions.com 23
2 scheepvaarttelefoongids.nl 26
3 tursan.net 10
4 plannersanonymous.com 21
domain_name object
domain_length float64
dtype: object
Traceback (most recent call last):
File "nlargest_test.py", line 9, in <module>
print(top_3.head())
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
result = result.compute()
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
return compute(self, **kwargs)[0]
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object
Traceback
---------
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
f = lambda df: df.nlargest(n, columns)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
return self._nsorted(columns, n, 'nlargest', keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
ser = getattr(self[columns[0]], method)(n, keep=keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
return algos.select_n(self, n=n, keep=keep, method='nlargest')
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
I'm confused, because I'm calling nlargest on the column which is of type float64 but still get this error saying it cannot be called on dtype object. Also this works fine in pandas. How can I get the n longest entries from a DataFrame?
I was helped by explicit type conversion:
df['column'].astype(str).astype(float).nlargest(5)
This is how my first data frame looks.
This is how my new data frame looks after getting the top 5:
station_count.nlargest(5, 'count')
You have to call nlargest on a column with a numeric dtype (for example int), not on a string column, so that the values can be compared.
The first argument is always n, followed by the name of the numeric column.
I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?
Pandas example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: df['y'] = df.x.map(len)
In [4]: df
Out[4]:
x y
0 a 1
1 bb 2
2 ccc 3
3 dddd 4
In [5]: df.nlargest(3, 'y')
Out[5]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Dask dataframe example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf['y'] = ddf.x.map(len)
In [6]: ddf.nlargest(3, 'y').compute()
Out[6]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Alternatively, perhaps this is just working now on the git master version?
You only need to change the type of the respective column to int or float using .astype().
For example, in your case:
top_3 = df['domain_length'].astype(float).nlargest(3)
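The error and the fix can be reproduced in plain pandas. This is a sketch under the assumption that the length column ended up with object dtype; the frame below reuses the five domain names printed in the question and forces that dtype on purpose:

```python
import pandas as pd

df = pd.DataFrame({'domain_name': [
    'webmagnat.ro',
    'nickelfreesolutions.com',
    'scheepvaarttelefoongids.nl',
    'tursan.net',
    'plannersanonymous.com',
]})

# Store the lengths with object dtype to simulate the problematic column.
lengths = [len(name) for name in df['domain_name']]
df['domain_length'] = pd.Series(lengths, dtype=object)

# nlargest refuses object columns even when every value is a number.
try:
    df.nlargest(3, 'domain_length')
    object_dtype_rejected = False
except TypeError:
    object_dtype_rejected = True

# After an explicit conversion to a numeric dtype the call works.
df['domain_length'] = df['domain_length'].astype('int64')
top_3 = df.nlargest(3, 'domain_length')
print(top_3)
```

Unlike the one-liner above, calling nlargest on the DataFrame keeps the domain_name column in the result.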
If you want the most frequent values of a string column, you can combine value_counts() with nlargest(n), where n is the number of values you want returned.
df['your_column'].value_counts().nlargest(3)
This returns the 3 most frequent values in that column, together with their counts.
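A short sketch of that pattern on made-up data (the column name tld and its values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'tld': ['com', 'net', 'com', 'ro', 'com', 'net']})

# value_counts returns an integer Series indexed by the distinct values,
# so nlargest works on it even though the original column holds strings.
top_2 = df['tld'].value_counts().nlargest(2)
print(top_2)
```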
I am doing a transformation on a variable from a pandas dataframe and then I would like to replace the column with my new values. The problem seems to be that after the transformation, the length of the array is not the same as the length of my dataframe's index. I don't think that is true though.
>>> df['variable'] = stats.boxcox(df.variable)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2119, in __setitem__
self._set_item(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2165, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2205, in _sanitize_column
raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
When I check the lengths, they do disagree: len() of the boxcox result is 2, yet the printed output says the transformed data has length 50000. What is going on here?
>>> len(df)
50000
>>> len(stats.boxcox(df.variable))
2
>>> stats.boxcox(df.variable)
(0 -0.079496
1 -0.117982
2 -0.104637
...
49985 -0.041300
49986 0.651771
49987 -0.115660
49988 -0.118034
49998 -0.118014
49999 -0.034076
Name: feat9, Length: 50000, dtype: float64, 8.4721358117221772)
>>>
You can see in your example that the result of boxcox is a tuple. This is consistent with the documentation, which indicates that boxcox returns a tuple of the transformed data and a lambda value. Notice in the example on that page that it does:
xt, _ = stats.boxcox(x)
...showing again that boxcox returns a 2-tuple.
You should be doing df['variable'] = stats.boxcox(df.variable)[0].
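The unpacking can be sketched end to end as follows. The data is randomly generated here purely for illustration (boxcox requires strictly positive input), and the column name variable follows the question:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up, strictly positive sample data.
rng = np.random.default_rng(0)
df = pd.DataFrame({'variable': rng.exponential(size=100) + 0.1})

# Without an explicit lmbda argument, boxcox returns a 2-tuple:
# the transformed values and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(df['variable'].to_numpy())

# Only the first element has one value per row, so only it can be
# assigned back to the column.
df['variable'] = transformed
print(len(df), fitted_lambda)
```

Keeping fitted_lambda around is useful if the transform needs to be inverted later or applied to new data with the same parameter.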