I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())
The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:
domain_name domain_length
0 webmagnat.ro 12
1 nickelfreesolutions.com 23
2 scheepvaarttelefoongids.nl 26
3 tursan.net 10
4 plannersanonymous.com 21
domain_name object
domain_length float64
dtype: object
Traceback (most recent call last):
File "nlargest_test.py", line 9, in <module>
print(top_3.head())
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
result = result.compute()
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
return compute(self, **kwargs)[0]
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object
Traceback
---------
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
f = lambda df: df.nlargest(n, columns)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
return self._nsorted(columns, n, 'nlargest', keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
ser = getattr(self[columns[0]], method)(n, keep=keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
return algos.select_n(self, n=n, keep=keep, method='nlargest')
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
I'm confused, because I'm calling nlargest on the column which is of type float64 but still get this error saying it cannot be called on dtype object. Also this works fine in pandas. How can I get the n longest entries from a DataFrame?
I was helped by explicit type conversion:
df['column'].astype(str).astype(float).nlargest(5)
This is how my first data frame look.
This is how my new data frame looks after getting top 5.
'''
station_count.nlargest(5,'count')
'''
You have to give (nlargest) command to a column who have int data type and not in string so it can calculate the count.
Always top n number followed by its corresponding column that is int type.
I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?
Pandas example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: df['y'] = df.x.map(len)
In [4]: df
Out[4]:
x y
0 a 1
1 bb 2
2 ccc 3
3 dddd 4
In [5]: df.nlargest(3, 'y')
Out[5]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Dask dataframe example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf['y'] = ddf.x.map(len)
In [6]: ddf.nlargest(3, 'y').compute()
Out[6]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Alternatively, perhaps this is just working now on the git master version?
You only need to change the type of respective column to int or float using .astype().
For example, in your case:
top_3 = df['domain_length'].astype(float).nlargest(3)
If you want to get the values with the most occurrences from a String type column you may use value_counts() with nlargest(n), where n is the number of elements you want to bring.
df['your_column'].value_counts().nlargest(3)
It will bring the top 3 occurrences from that column.
Related
Maybe the problem is easy but I worked for hours to find the problem but I couldn't find any solution.
In the first part I build a dataframe taking data from HERE. I had some problems to extract what I wanted (UK cumulative covid case day by day) but eventually I managed to get the right shape. My final dataframe size is 1 column x about 600 rows.
Now the code looks like this.
#BUILDING DATAFRAME
pd.set_option('display.max_rows', None)
df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df=df[df["Country/Region"]=="United Kingdom"]
N = 11
df = df.iloc[N: , :]
df=df.drop(columns=["Province/State","Lat","Long","Country/Region"])
df = df.columns.to_frame().T.append(df, ignore_index=True)
df.columns = range(len(df.columns))
df=df.T
df = df.rename(columns={0: 'date', 1: 'nuovi_casi'})
df['nuovi_casi'] = df['nuovi_casi']+10000
M = 1
df = df.iloc[M: , :]
df=df.drop(columns=["date"])
#HERE IS THE PROBLEM
ts = df[['nuovi_casi']].dropna()
sts = ts.nuovi_casi
sts.index.name = None
ts_log = np.log(1+sts).dropna()
I had to add the line df['nuovi_casi'] = df['nuovi_casi']+10000 since my dataframe had some 0 values. So I listed all values and checked with another Python program to be sure and now my dataframe has all values above 10000.
When I run the code, the second part raises an error like this:
AttributeError: 'int' object has no attribute 'log'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\\project_env\UK.py", line 69, in <module>
ts_log = np.log(1+sts).dropna()
File "C:\Users\\project_env\lib\site-packages\pandas\core\generic.py", line
1933, in __array_ufunc__
return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File "C:\Users\\project_env\lib\site-packages\pandas\core\arraylike.py", line
274, in array_ufunc
result = getattr(ufunc, method)(*inputs, **kwargs)
TypeError: loop of ufunc does not support argument 0 of type int which has no
callable log method
Then I re-checked all values but they are all positive and the values are not that high since a similar code (which works perfeclty) can handle numbers above 45 millions, and I cannot understand where is the problem.
Can you please find a solution? Thanks!
It appears that sts is a pandas Series
In [72]: import pandas as pd
log works for numeric series:
In [73]: ts = pd.Series([1,2,3])
In [74]: np.log(ts)
Out[74]:
0 0.000000
1 0.693147
2 1.098612
dtype: float64
even None doesn't bother it
In [75]: ts = pd.Series([1,2,3,None])
In [76]: np.log(ts)
Out[76]:
0 0.000000
1 0.693147
2 1.098612
3 NaN
dtype: float64
but put a string in the Series, and you get a object dtype Series/array:
In [77]: ts = pd.Series([1,2,3,'astr'])
In [78]: np.log(ts)
AttributeError: 'int' object has no attribute 'log'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-78-b583cbb218e4>", line 1, in <module>
np.log(ts)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 2032, in __array_ufunc__
return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/arraylike.py", line 364, in array_ufunc
result = getattr(ufunc, method)(*inputs, **kwargs)
TypeError: loop of ufunc does not support argument 0 of type int which has no callable log method
In [79]: ts
Out[79]:
0 1
1 2
2 3
3 astr
dtype: object
I have a simple df:
a = pd.DataFrame([[1,2,3,5,8],['jack','jeff',np.nan,np.nan,'tesla']])
a.index = [['number','name']]
a=a.T
and it looks like this:
number name
0 1 jack
1 2 jeff
2 3 NaN
3 5 NaN
4 8 tesla
When I am tring to do a .loc like a.loc[a['number']==5], I got this type error:
Traceback (most recent call last):
File "c:\Users\Administrator\Documents\proj\test.py", line 13, in <module>
a.loc[a['number']==5]
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2878, in __getitem__
return self._get_item_cache(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 3541, in _get_item_cache
values = self._mgr.iget(loc)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 988, in iget
block = self.blocks[self.blknos[i]]
TypeError: only integer scalar arrays can be converted to a scalar index
I searched this error and tried some solutions like using a.loc[np.array(a)['number']==5] or reinstall pandas and numpy or anaconda but they are not working.
My pandas version is 1.3 and numpy version is 1.19.2
The reason being that your column is MultiIndex:
a.columns
#MultiIndex([('number',),
# ( 'name',)],
# )
The error occurs when you do a['number']. Replacing the index rename with a list instead of list of lists should fix, i.e. instead of:
a.index = [['number','name']]
Do:
a.index = ['number','name']
Using Pandas data frame group by feature and I want to group by column c_b and calculate unique count for column c_a and column c_c. My expected results are,
Expected results,
c_b,c_a_unique_count,c_c_unique_count
python,2,2
c++,2,2
Met with strange error about unhashable type, does anyone have any ideas? Thanks.
Input file,
c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,1.0
go,c++,std,0.0
Source code,
sample = pd.read_csv('123.csv', header=None, skiprows=1,
dtype={0:str, 1:str, 2:str, 3:float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
sampleGroup = sample.groupby('c_b')
results = sampleGroup.count()[:,[0,2]]
results.to_csv(derivedFeatureFile, index= False)
Error message,
Traceback (most recent call last):
File "/Users/foo/personal/featureExtraction/kaggleExercise.py", line 134, in <module>
unitTest()
File "/Users/foo/personal/featureExtraction/kaggleExercise.py", line 129, in unitTest
results = sampleGroup.count()[:,[0,2]]
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/Users/foo/miniconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1348, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
For the number of unique elements in each group, you can use:
df.groupby('c_b')['c_a', 'c_d'].agg(pd.Series.nunique)
df.groupby('c_b')['c_a', 'c_d'].agg(pd.Series.nunique)
Out:
c_a c_d
c_b
c++ 2 2
python 2 2
df.groupby('c_b', as_index=False)['c_a', 'c_d'].agg(pd.Series.nunique)
Out:
c_b c_a c_d
0 c++ 2 2
1 python 2 2
What is the best way to use pandas.pivot_table to calculate aggregated functions over the whole table without providing the grouping?
For example, if I want to calculate the sum of A,B,C into one table with a single row without grouping by any of the columsn:
>>> x = pd.DataFrame({'A':[1,2,3],'B':[8,7,6],'C':[0,3,2]})
>>> x
A B C
0 1 8 0
1 2 7 3
2 3 6 2
>>> x.pivot_table(values=['A','B','C'],aggfunc=np.sum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/tools/pivot.py", line 103, in pivot_table
grouped = data.groupby(keys)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/generic.py", line 2434, in groupby
sort=sort, group_keys=group_keys, squeeze=squeeze)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 789, in groupby
return klass(obj, by, **kwds)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 238, in __init__
level=level, sort=sort)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 1622, in _get_grouper
raise ValueError('No group keys passed!')
ValueError: No group keys passed!
Also, I would like to use custom aggfunc, and the above np.sum is just an example.
Thanks.
I think you're asking how to apply a function to all columns of a Data Frame: To do this call the apply method of your dataframe:
def myfunc(col):
return np.sum(col)
x.apply(myfunc)
Out[1]:
A 6
B 21
C 5
dtype: int64
I had the same error, I was using pivot_table argument on a Pandas data frame,
import numpy as np
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values='weekly_sales')
# Print mean_sales_by_type
print(mean_sales_by_type)
Here's the error:
File "<stdin>", line 889, in __init__
grouper, exclusions, obj = get_grouper(
File "<stdin>", line 896, in get_grouper
raise ValueError("No group keys passed!")
ValueError: No group keys passed!
Finally got it fixed it by specifying the index argument of the pivot_table function (after values)
mean_sales_by_type = sales.pivot_table(values='weekly_sales',index='type')
in your case try this:-
x.pivot_table(values=['A','B','C'],**value=[]**,aggfunc=np.sum)
I have a Python Pandas DataFrame:
df = pd.DataFrame(np.random.rand(5,3),columns=list('ABC'))
print df
A B C
0 0.041761178 0.60439116 0.349372206
1 0.820455992 0.245314299 0.635568504
2 0.517482167 0.7257227 0.982969949
3 0.208934899 0.594973111 0.671030326
4 0.651299752 0.617672419 0.948121305
Question:
I would like to add the first column to the whole dataframe. I would like to get this:
A B C
0 0.083522356 0.646152338 0.391133384
1 1.640911984 1.065770291 1.456024496
2 1.034964334 1.243204867 1.500452116
3 0.417869798 0.80390801 0.879965225
4 1.302599505 1.268972171 1.599421057
For the first row:
A: 0.04176 + 0.04176 = 0.08352
B: 0.04176 + 0.60439 = 0.64615
etc
Requirements:
I cannot refer to the first column using its column name.
eg.: df.A is not acceptable; df.iloc[:,0] is acceptable.
Attempt:
I tried this using:
print df.add(df.iloc[:,0], fill_value=0)
but it is not working. It returns the error message:
Traceback (most recent call last):
File "C:test.py", line 20, in <module>
print df.add(df.iloc[:,0], fill_value=0)
File "C:\python27\lib\site-packages\pandas\core\ops.py", line 771, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2939, in _combine_series
return self._combine_match_columns(other, func, level=level, fill_value=fill_value)
File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2975, in _combine_match_columns
fill_value)
NotImplementedError: fill_value 0 not supported
Is it possible to take the sum of all columns of a DataFrame with the first column?
That's what you need to do:
df.add(df.A, axis=0)
Example:
>>> df = pd.DataFrame(np.random.rand(5,3),columns=['A','B','C'])
>>> col_0 = df.columns.tolist()[0]
>>> print df
A B C
0 0.502962 0.093555 0.854267
1 0.165805 0.263960 0.353374
2 0.386777 0.143079 0.063389
3 0.639575 0.269359 0.681811
4 0.874487 0.992425 0.660696
>>> df = df.add(df.col_0, axis=0)
>>> print df
A B C
0 1.005925 0.596517 1.357229
1 0.331611 0.429766 0.519179
2 0.773553 0.529855 0.450165
3 1.279151 0.908934 1.321386
4 1.748975 1.866912 1.535183
>>>
I would try something like this:
firstol = df.columns[0]
df2 = df.add(df[firstcol], axis=0)
I used a combination of the above two posts to answer this question.
Since I cannot refer to a specific column by its name, I cannot use df.add(df.A, axis=0). But this is along the correct lines. Since df += df[firstcol] produced a dataframe of NaNs, I could not use this approach, but the way that this solution obtains a list of columns from the dataframe was the trick I needed.
Here is how I did it:
col_0 = df.columns.tolist()[0]
print(df.add(df[col_0], axis=0))
You can use numpy and broadcasting for this:
df = pd.DataFrame(df.values + df['A'].values[:, None],
columns=df.columns)
I expect this to be more efficient than series-based methods.