pandas group by confusion -- unhashable type - python

I'm using the pandas DataFrame groupby feature. I want to group by column c_b and compute the number of unique values in column c_a and in column c_c.
Expected results:
c_b,c_a_unique_count,c_c_unique_count
python,2,2
c++,2,2
I ran into a strange error about an unhashable type. Does anyone have any ideas? Thanks.
Input file,
c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,1.0
go,c++,std,0.0
Source code,
sample = pd.read_csv('123.csv', header=None, skiprows=1,
                     dtype={0: str, 1: str, 2: str, 3: float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
sampleGroup = sample.groupby('c_b')
results = sampleGroup.count()[:, [0, 2]]
results.to_csv(derivedFeatureFile, index=False)
Error message,
Traceback (most recent call last):
File "/Users/foo/personal/featureExtraction/kaggleExercise.py", line 134, in <module>
unitTest()
File "/Users/foo/personal/featureExtraction/kaggleExercise.py", line 129, in unitTest
results = sampleGroup.count()[:,[0,2]]
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1348, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type

For the number of unique elements in each group, you can use:
df.groupby('c_b')['c_a', 'c_d'].agg(pd.Series.nunique)
Out:
        c_a  c_d
c_b
c++       2    2
python    2    2
df.groupby('c_b', as_index=False)['c_a', 'c_d'].agg(pd.Series.nunique)
Out:
      c_b  c_a  c_d
0     c++    2    2
1  python    2    2
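The question asks for the unique counts of c_a and c_c specifically. A minimal sketch of the same idea with those columns, using the groupby nunique shorthand (note the list inside the indexer; recent pandas rejects the bare `'c_a', 'c_c'` form used above):

```python
import pandas as pd

# Recreate the question's input file in memory
sample = pd.DataFrame({
    'c_a': ['hello', 'hi', 'ho', 'ho', 'go'],
    'c_b': ['python', 'python', 'c++', 'c++', 'c++'],
    'c_c': ['numpy', 'pandas', 'vector', 'std', 'std'],
    'c_d': [0.0, 1.0, 0.0, 1.0, 0.0],
})

# Count distinct values of c_a and c_c per group of c_b
results = sample.groupby('c_b', as_index=False)[['c_a', 'c_c']].nunique()
print(results)
# Both groups have 2 unique c_a values and 2 unique c_c values,
# matching the expected results in the question
```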

Related

Type error: only integer scalar arrays can be converted to a scalar index when doing .loc with pandas DataFrame

I have a simple df:
a = pd.DataFrame([[1,2,3,5,8],['jack','jeff',np.nan,np.nan,'tesla']])
a.index = [['number','name']]
a=a.T
and it looks like this:
number name
0 1 jack
1 2 jeff
2 3 NaN
3 5 NaN
4 8 tesla
When I try to do a .loc like a.loc[a['number']==5], I get this type error:
Traceback (most recent call last):
File "c:\Users\Administrator\Documents\proj\test.py", line 13, in <module>
a.loc[a['number']==5]
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2878, in __getitem__
return self._get_item_cache(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 3541, in _get_item_cache
values = self._mgr.iget(loc)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 988, in iget
block = self.blocks[self.blknos[i]]
TypeError: only integer scalar arrays can be converted to a scalar index
I searched for this error and tried some solutions, like using a.loc[np.array(a)['number']==5] or reinstalling pandas and numpy or Anaconda, but none of them worked.
My pandas version is 1.3 and my numpy version is 1.19.2.
The reason is that your columns are a MultiIndex:
a.columns
#MultiIndex([('number',),
# ( 'name',)],
# )
The error occurs when you do a['number']. Replacing the index assignment with a plain list instead of a list of lists should fix it; i.e., instead of:
a.index = [['number','name']]
Do:
a.index = ['number','name']
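A minimal sketch of the fix, reusing the question's data, showing that with a flat index the boolean .loc works:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame([[1, 2, 3, 5, 8],
                  ['jack', 'jeff', np.nan, np.nan, 'tesla']])

# Plain list -> flat Index (a list of lists would create a MultiIndex)
a.index = ['number', 'name']
a = a.T

# a['number'] is now a Series, so the boolean mask works
row = a.loc[a['number'] == 5]
print(row)  # the single row where number == 5
```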

DataFrame.apply with str.extract throws error even though function works on each column-series

With this example DataFrame: df = pd.DataFrame([['A-3', 'B-4'], ['C-box', 'D1-go']])
Calling extract on individual columns as series works fine:
df.iloc[:, 0].str.extract('-(.+)')
df.iloc[:, 1].str.extract('-(.+)')
and also on the other axis:
df.iloc[0, :].str.extract('-(.+)')
df.iloc[1, :].str.extract('-(.+)')
So, I'd expect using apply would work (by applying extract to each column):
df.apply(lambda s: s.str.extract('-(.+)'), axis=0)
But it throws this error:
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-588-70b1808d5457>", line 2, in <module>
df.apply(lambda s: s.str.extract('-(.+)'))
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 260, in apply_standard
return self.wrap_results()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 308, in wrap_results
return self.wrap_results_for_axis()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 340, in wrap_results_for_axis
result = self.obj._constructor(data=results)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 51, in arrays_to_mgr
index = extract_index(arrays)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 308, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Using axis=1 yields an unexpected result: a Series where each element is itself a Series:
Out[2]:
0 0
0 3
1 4
1 0
0 box
1 go
dtype: object
I'm using apply because I think it would give the fastest execution time, but I'm open to other suggestions.
You can use split instead.
df.apply(lambda s: s.str.split('-', expand=True)[1])
Out[1]:
0 1
0 3 4
1 box go
The default for the expand parameter in str.extract is True, and it returns a DataFrame. Since you are applying it to multiple columns, it tries to return multiple DataFrames. Set expand to False to handle that:
df.apply(lambda x: x.str.extract('-(.*)', expand = False))
0 1
0 3 4
1 box go
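For completeness, here is the expand=False fix run end to end on the question's own DataFrame and regex:

```python
import pandas as pd

df = pd.DataFrame([['A-3', 'B-4'], ['C-box', 'D1-go']])

# expand=False makes str.extract return a Series per column,
# so apply can reassemble the pieces into a DataFrame
out = df.apply(lambda s: s.str.extract('-(.*)', expand=False))
print(out)
# values: [['3', '4'], ['box', 'go']]
```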

Python pandas, errors using .str.contains to search dataframe column for substring

I'd appreciate any help you could offer on this issue using python 2.7, pandas 0.22, and easygui 0.98.1.
I'm trying to load a csv into pandas, assign column names from a user-chosen list using easygui (which returns a list of strings, I think), and search a certain column of the dataframe for a substring.
import easygui as eg
import pandas as pd
# define vars_vars from choices of imagej outputs
vars_vars = eg.multchoicebox(
    msg="\n\n\n\nPlease highlight variables included in ImageJ analysis:",
    title="IF Analysis - 2017",
    choices=["integrated density", "mean",
             "mean grey value", "area fraction"])
# add required imagej columns for later processing
vars_vars.insert(0, "label")
vars_vars.insert(0, "#")
#User input for csv file
file = eg.fileopenbox()
# load into dataframe using pandas and assign columns using chosen variables
df = pd.read_csv(file, header=None, names=None)
df.columns = vars_vars
# Search 'label' column for certain substring
df[df['label'].str.contains('substring')]
But I'm getting this error:
Traceback (most recent call last):
File "C:/Users/User/.PyCharmCE2017.3/config/scratches/scratch.py", line 61, in <module>
df[df['label'].str.contains('nsv')]
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
I tried googling it and found these fixes:
df['label'] = df['label'].map(str)
df['label'] = df['label'].astype(str)
df['label'] = df['label'].astype(baseline)
and variations of those where I call the entire dataframe rather than df['label'].
However, none of these fix lines raises an error itself, but the subsequent .str.contains line invariably fails with a similar error: the DataFrame object has no attribute map (for .map(str)) or str (for the .astype(...) lines).
print type(df['label'])
print df.dtypes
print df['label'].head()
print (df['label'].info())
print type(df['label'][0])
returns
<class 'pandas.core.frame.DataFrame'>
# int64
label object
integrated density int64
dtype: object
label
0 nc4_al1_I+pP_4x_contra_ctx_blue.tif
1 nc4_al1_I+pP_4x_contra_ctx_green.tif
2 nc4_al1_I1+pP_4x_contra_ctx_red.tif
3 nc4_al1_I+pP_4x_contra_hc_blue.tif
4 nc4_al1_I+pP_4x_contra_hc_green.tif
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 695 entries, 0 to 694
Data columns (total 1 columns):
(label,) 695 non-null object
dtypes: object(1)
memory usage: 5.5+ KB
None
Traceback (most recent call last):
File "C:/Users/Shon/.PyCharmCE2017.3/config/scratches/scratch.py", line 67, in <module>
print type(df['label'][0])
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\frame.py", line 2137, in __getitem__
return self._getitem_multilevel(key)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\frame.py", line 2181, in _getitem_multilevel
loc = self.columns.get_loc(key)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\indexes\multi.py", line 2072, in get_loc
loc = self._get_level_indexer(key, level=0)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\indexes\multi.py", line 2362, in _get_level_indexer
loc = level_index.get_loc(key)
File "C:\Users\User\Miniconda2\envs\test2\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
I'd greatly appreciate any help you guys could offer.
UPDATE:
Turns out the issue was having a multiindex dataframe.
df.columns.map(''.join).str.strip()
Resulted in changing the column into a series upon printing its type, which enabled me to correctly .str.contains the data.
Looking at the output of df['label'].info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 695 entries, 0 to 694
Data columns (total 1 columns):
(label,) 695 non-null object
dtypes: object(1) memory usage: 5.5+ KB None
It appears that the column heading in this dataframe is a tuple with an empty element, which indicates a MultiIndex column heading whose second level is empty.
I recommend first trying to fix the pd.read_csv call to prevent the creation of a MultiIndex. Failing that, using:
df.columns = df.columns.map(''.join).str.strip()
will flatten that MultiIndex column header to a single-level column index, making df['label'] a pd.Series and allowing the use of the string accessor and contains.
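A minimal reproduction of the MultiIndex problem and the flattening fix. The tuple column name mimics the (label,) heading in the question's output; the file names are hypothetical:

```python
import pandas as pd

# A dict with a tuple key produces MultiIndex columns,
# like the question's read_csv did
df = pd.DataFrame({('label', ''): ['ctx_blue.tif', 'nsv_hc.tif', 'ctx_red.tif']})

print(type(df['label']))   # a DataFrame here, so .str would fail

# Flatten the MultiIndex into plain string column names
df.columns = df.columns.map(''.join).str.strip()

print(type(df['label']))   # now a Series, so .str works
matches = df[df['label'].str.contains('nsv')]
print(matches)
```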

How to get n longest entries of DataFrame?

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())
The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:
domain_name domain_length
0 webmagnat.ro 12
1 nickelfreesolutions.com 23
2 scheepvaarttelefoongids.nl 26
3 tursan.net 10
4 plannersanonymous.com 21
domain_name object
domain_length float64
dtype: object
Traceback (most recent call last):
File "nlargest_test.py", line 9, in <module>
print(top_3.head())
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
result = result.compute()
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
return compute(self, **kwargs)[0]
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object
Traceback
---------
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
f = lambda df: df.nlargest(n, columns)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
return self._nsorted(columns, n, 'nlargest', keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
ser = getattr(self[columns[0]], method)(n, keep=keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
return algos.select_n(self, n=n, keep=keep, method='nlargest')
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
I'm confused, because I'm calling nlargest on the column which is of type float64 but still get this error saying it cannot be called on dtype object. Also this works fine in pandas. How can I get the n longest entries from a DataFrame?
I was helped by explicit type conversion:
df['column'].astype(str).astype(float).nlargest(5)
This is how my first data frame looks, and this is how my new data frame looks after getting the top 5 (screenshots in the original post):
station_count.nlargest(5, 'count')
You have to apply the nlargest command to a column with an int data type, not a string, so it can calculate the count. The top-n number is always followed by its corresponding column, which must be int-typed.
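In plain pandas, the dtype fix the answers describe can be sketched like this (hypothetical data mirroring the question's output, with the length column arriving as strings):

```python
import pandas as pd

# domain_length came through as object dtype (strings),
# which is exactly what nlargest rejects
df = pd.DataFrame({
    'domain_name': ['webmagnat.ro', 'nickelfreesolutions.com',
                    'scheepvaarttelefoongids.nl', 'tursan.net'],
    'domain_length': ['12', '23', '26', '10'],
})

# Cast to a numeric dtype first; then nlargest works
df['domain_length'] = df['domain_length'].astype(int)
top_3 = df.nlargest(3, 'domain_length')
print(top_3['domain_name'].tolist())
# ['scheepvaarttelefoongids.nl', 'nickelfreesolutions.com', 'webmagnat.ro']
```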
I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?
Pandas example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: df['y'] = df.x.map(len)
In [4]: df
Out[4]:
x y
0 a 1
1 bb 2
2 ccc 3
3 dddd 4
In [5]: df.nlargest(3, 'y')
Out[5]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Dask dataframe example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf['y'] = ddf.x.map(len)
In [6]: ddf.nlargest(3, 'y').compute()
Out[6]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Alternatively, perhaps this is just working now on the git master version?
You only need to change the type of the respective column to int or float using .astype().
For example, in your case:
top_3 = df['domain_length'].astype(float).nlargest(3)
If you want to get the values with the most occurrences from a string-type column, you can use value_counts() with nlargest(n), where n is the number of elements you want to return.
df['your_column'].value_counts().nlargest(3)
This will return the top 3 occurrences from that column.

pandas pivot_table without grouping

What is the best way to use pandas.pivot_table to calculate aggregated functions over the whole table without providing the grouping?
For example, if I want to calculate the sum of A, B, C into one table with a single row, without grouping by any of the columns:
>>> x = pd.DataFrame({'A':[1,2,3],'B':[8,7,6],'C':[0,3,2]})
>>> x
A B C
0 1 8 0
1 2 7 3
2 3 6 2
>>> x.pivot_table(values=['A','B','C'],aggfunc=np.sum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/tools/pivot.py", line 103, in pivot_table
grouped = data.groupby(keys)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/generic.py", line 2434, in groupby
sort=sort, group_keys=group_keys, squeeze=squeeze)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 789, in groupby
return klass(obj, by, **kwds)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 238, in __init__
level=level, sort=sort)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 1622, in _get_grouper
raise ValueError('No group keys passed!')
ValueError: No group keys passed!
Also, I would like to use custom aggfunc, and the above np.sum is just an example.
Thanks.
I think you're asking how to apply a function to all columns of a DataFrame. To do this, call the apply method of your DataFrame:
def myfunc(col):
    return np.sum(col)

x.apply(myfunc)
Out[1]:
A     6
B    21
C     5
dtype: int64
I had the same error while using pivot_table on a pandas DataFrame:
import numpy as np
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values='weekly_sales')
# Print mean_sales_by_type
print(mean_sales_by_type)
Here's the error:
File "<stdin>", line 889, in __init__
grouper, exclusions, obj = get_grouper(
File "<stdin>", line 896, in get_grouper
raise ValueError("No group keys passed!")
ValueError: No group keys passed!
I finally got it fixed by specifying the index argument of the pivot_table function (after values):
mean_sales_by_type = sales.pivot_table(values='weekly_sales',index='type')
In your case, try this (the value=[] argument is the addition):
x.pivot_table(values=['A','B','C'], value=[], aggfunc=np.sum)
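As a groupless alternative (a sketch, not what pivot_table itself does): DataFrame.agg applies any function, including a custom one, to every column, and the result can be reshaped into the single-row table the question asks for:

```python
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3], 'B': [8, 7, 6], 'C': [0, 3, 2]})

# Any per-column callable works here, not just built-in names like 'sum'
def total(col):
    return col.sum()

# agg reduces each column to a scalar; to_frame().T makes it a one-row table
row = x.agg(total).to_frame().T
print(row.to_dict())  # {'A': {0: 6}, 'B': {0: 21}, 'C': {0: 5}}
```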
