replacing pandas dataframe variable values with a numpy array - python

I am doing a transformation on a variable from a pandas dataframe and then I would like to replace the column with my new values. The problem seems to be that after the transformation, the length of the array is not the same as the length of my dataframe's index. I don't think that is true though.
>>> df['variable'] = stats.boxcox(df.variable)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2119, in __setitem__
self._set_item(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2165, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2205, in _sanitize_column
raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
When I check the lengths, they seem to disagree. len() of the boxcox result says it is 2, but the printed output says the Series inside it has length 50000. What is going on here?
>>> len(df)
50000
>>> len(stats.boxcox(df.variable))
2
>>> stats.boxcox(df.variable)
(0 -0.079496
1 -0.117982
2 -0.104637
...
49985 -0.041300
49986 0.651771
49987 -0.115660
49988 -0.118034
49998 -0.118014
49999 -0.034076
Name: feat9, Length: 50000, dtype: float64, 8.4721358117221772)
>>>

You can see in your example that the result of boxcox is a tuple. This is consistent with the documentation, which indicates that boxcox returns a tuple of the transformed data and a lambda value. Notice in the example on that page that it does:
xt, _ = stats.boxcox(x)
. . . showing again that boxcox returns a 2-tuple.
You should be doing df['variable'] = stats.boxcox(df.variable)[0].
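For completeness, a minimal sketch of unpacking the tuple explicitly, which also keeps the fitted lambda around in case you want to invert the transform later:
from scipy import stats
transformed, lmbda = stats.boxcox(df['variable'])  # boxcox returns (data, lambda)
df['variable'] = transformed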

Related

Type error: only integer scalar arrays can be converted to a scalar index when doing .loc with pandas DataFrame

I have a simple df:
import numpy as np
import pandas as pd
a = pd.DataFrame([[1,2,3,5,8],['jack','jeff',np.nan,np.nan,'tesla']])
a.index = [['number','name']]
a = a.T
and it looks like this:
number name
0 1 jack
1 2 jeff
2 3 NaN
3 5 NaN
4 8 tesla
When I try to do a .loc like a.loc[a['number']==5], I get this type error:
Traceback (most recent call last):
File "c:\Users\Administrator\Documents\proj\test.py", line 13, in <module>
a.loc[a['number']==5]
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2878, in __getitem__
return self._get_item_cache(key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 3541, in _get_item_cache
values = self._mgr.iget(loc)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 988, in iget
block = self.blocks[self.blknos[i]]
TypeError: only integer scalar arrays can be converted to a scalar index
I searched for this error and tried some solutions, like using a.loc[np.array(a)['number']==5] or reinstalling pandas, numpy, and Anaconda, but they did not work.
My pandas version is 1.3 and my numpy version is 1.19.2.
The reason is that your columns are a MultiIndex (after the transpose):
a.columns
#MultiIndex([('number',),
# ( 'name',)],
# )
The error occurs when you do a['number']. Assigning the index as a plain list instead of a list of lists should fix it, i.e. instead of:
a.index = [['number','name']]
Do:
a.index = ['number','name']
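Putting it together, a minimal sketch of the fixed construction, using the data from the question:
import numpy as np
import pandas as pd
a = pd.DataFrame([[1, 2, 3, 5, 8], ['jack', 'jeff', np.nan, np.nan, 'tesla']])
a.index = ['number', 'name']    # plain list, not a list of lists
a = a.T
print(a.loc[a['number'] == 5])  # selects the row where number == 5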

Python DataFrame TypeError: only integer scalar arrays can be converted to a scalar index

I know there are several questions about this error already. But in this particular case I'm not sure whether there is already a solution for my problem.
I have this part of the code and I want to print the column "y" of the DataFrame df.
The following error occurs:
TypeError: only integer scalar arrays can be converted to a scalar index
labels = []
xvectors = []
for i in data:
    labels.append(i[0])
    xvectors.append(i[1])
X = np.array(xvectors)
y = np.array(labels)
feat_cols = ['xvec' + str(i) for i in range(X.shape[1])]
print(feat_cols)
df = pd.DataFrame(X, columns=[feat_cols])
df['y'] = y
#df['label'] = df['y'].apply(lambda i: str(i))
print(df['y'])
X, y = None, None
Printing the whole DataFrame is possible. This looks like:
xvec0 xvec1 xvec2 xvec3 xvec4 ... xvec508 xvec509 xvec510 xvec511 y
0 3.397163 -1.112423 0.414708 0.563083 1.371336 ... 1.201095 -0.076261 -0.620443 -1.231465 DA01_03
1 0.159473 1.884818 -1.511547 -0.153500 -0.635701 ... -1.217205 -1.922081 0.878613 0.087912 DA01_06
2 1.089404 0.331919 -1.027480 0.594129 -2.473234 ... -3.505570 -3.509632 -0.553128 -0.453307 DA01_10
3 0.183993 -1.741467 -0.142570 -3.158320 4.355789 ... 3.857311 3.142393 0.991663 -2.842322 DA01_14
This is the whole error message:
print(df['y'])
File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 2958, in __getitem__
return self._get_item_cache(key)
File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line 3270, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py", line 960, in get
return self.iget(loc)
File "/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py", line 977, in iget
block = self.blocks[self._blknos[i]]
TypeError: only integer scalar arrays can be converted to a scalar index
I think it has something to do with the numpy array.
Thank you in advance!
Ah, you pass your columns argument as a list within a list (feat_cols is already of type list). This makes your column headers two-dimensional: you can see df.info() shows them ranging from (xvec0,) to ... instead of xvec0.
Passing columns=feat_cols should do the trick :-)
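A minimal sketch of the difference, with a small hypothetical X and y standing in for the question's data:
import numpy as np
import pandas as pd
X = np.random.rand(3, 4)
y = np.array(['DA01_03', 'DA01_06', 'DA01_10'])
feat_cols = ['xvec' + str(i) for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feat_cols)  # plain list, not [feat_cols]
df['y'] = y
print(df['y'])  # works, because the column labels are one-dimensional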

How to create a diff column with the previous period value in python?

I'm just trying to create a column in my dataframe with the difference between the column value and the same column of the previous month. If the previous month doesn't exist, don't calculate the difference.
Result table example
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['ID'], df_ranking['DATE'])['POINTS'].shift(1)
But the error message I get is:
Traceback (most recent call last):
File "C:/Users/jhoyo/PycharmProjects/Tennis-Ranking/venv/ranking_2_db.py", line 95, in <module>
df_ranking['cat_race'] = df_ranking.groupby(df_ranking['licencia'], df_ranking['date'])['puntos'].shift(1)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 7629, in groupby
axis = self._get_axis_number(axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 351, in _get_axis_number
axis = cls._AXIS_ALIASES.get(axis, axis)
File "C:\Users\jhoyo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 1816, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
You have to define the groupby like this:
df_ranking['cat_race'] = df_ranking.groupby(['ID','Date'])['POINTS'].shift(1)
Hope it will work
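For reference, the original error comes from passing two Series as separate positional arguments, so the second one is interpreted as the axis parameter (hence the axis = self._get_axis_number(axis) call in the traceback). A minimal sketch of going one step further and computing the actual difference, assuming the 'ID', 'DATE', and 'POINTS' columns from the first snippet:
# Sort so that shift(1) refers to the previous period within each ID,
# then subtract to get the difference (NaN where no previous month exists).
df_ranking = df_ranking.sort_values(['ID', 'DATE'])
df_ranking['cat_race'] = df_ranking['POINTS'] - df_ranking.groupby('ID')['POINTS'].shift(1)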

How to get n longest entries of DataFrame?

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this:
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
print(df.head())
print(df.dtypes)
top_3 = df.nlargest(3, 'domain_length')
print(top_3.head())
The file opendns-random-domains.txt contains just a long list of domain names. This is what the output of the above code looks like:
domain_name domain_length
0 webmagnat.ro 12
1 nickelfreesolutions.com 23
2 scheepvaarttelefoongids.nl 26
3 tursan.net 10
4 plannersanonymous.com 21
domain_name object
domain_length float64
dtype: object
Traceback (most recent call last):
File "nlargest_test.py", line 9, in <module>
print(top_3.head())
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 382, in head
result = result.compute()
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 86, in compute
return compute(self, **kwargs)[0]
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 484, in get_async
raise(remote_exception(res, tb))
dask.async.TypeError: Cannot use method 'nlargest' with dtype object
Traceback
---------
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/dask/dataframe/core.py", line 2040, in <lambda>
f = lambda df: df.nlargest(n, columns)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3355, in nlargest
return self._nsorted(columns, n, 'nlargest', keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/frame.py", line 3318, in _nsorted
ser = getattr(self[columns[0]], method)(n, keep=keep)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/series.py", line 1898, in nlargest
return algos.select_n(self, n=n, keep=keep, method='nlargest')
File "/home/work/Dokumente/ModZero/Commerzbank/DNS_und_Proxylog-Analyse/dask-log-analyzer/venv/lib/python3.5/site-packages/pandas/core/algorithms.py", line 559, in select_n
raise TypeError("Cannot use method %r with dtype %s" % (method, dtype))
I'm confused because I'm calling nlargest on the column, which is of type float64, but I still get this error saying it cannot be used with dtype object. Also, this works fine in pandas. How can I get the n longest entries from a DataFrame?
I was helped by explicit type conversion:
df['column'].astype(str).astype(float).nlargest(5)
This is how my first data frame looks.
This is how my new data frame looks after getting the top 5.
station_count.nlargest(5, 'count')
You have to apply nlargest to a column that has an int data type, not a string, so it can calculate the count. Always pass the top-n number first, followed by its corresponding int-type column.
I tried to reproduce your problem but things worked fine. Can I recommend that you produce a Minimal Complete Verifiable Example?
Pandas example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: df['y'] = df.x.map(len)
In [4]: df
Out[4]:
x y
0 a 1
1 bb 2
2 ccc 3
3 dddd 4
In [5]: df.nlargest(3, 'y')
Out[5]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Dask dataframe example
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': ['a', 'bb', 'ccc', 'dddd']})
In [3]: import dask.dataframe as dd
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf['y'] = ddf.x.map(len)
In [6]: ddf.nlargest(3, 'y').compute()
Out[6]:
x y
3 dddd 4
2 ccc 3
1 bb 2
Alternatively, perhaps this is just working now on the git master version?
You only need to change the type of the respective column to int or float using .astype().
For example, in your case:
top_3 = df['domain_length'].astype(float).nlargest(3)
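A minimal sketch of how that might look in the dask version from the question (assuming the explicit cast is all that is needed):
import dask.dataframe as dd
df = dd.read_csv("opendns-random-domains.txt", header=None, names=['domain_name'])
df['domain_length'] = df.domain_name.map(len)
# Cast explicitly so nlargest sees a numeric dtype instead of object.
df['domain_length'] = df['domain_length'].astype('int64')
top_3 = df.nlargest(3, 'domain_length')
print(top_3.compute())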
If you want to get the values with the most occurrences from a string-type column, you may use value_counts() with nlargest(n), where n is the number of elements you want to return.
df['your_column'].value_counts().nlargest(3)
It will return the top 3 occurrences from that column.

pandas pivot_table without grouping

What is the best way to use pandas.pivot_table to calculate aggregated functions over the whole table without providing the grouping?
For example, if I want to calculate the sum of A, B, and C into one table with a single row, without grouping by any of the columns:
>>> x = pd.DataFrame({'A':[1,2,3],'B':[8,7,6],'C':[0,3,2]})
>>> x
A B C
0 1 8 0
1 2 7 3
2 3 6 2
>>> x.pivot_table(values=['A','B','C'],aggfunc=np.sum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/tools/pivot.py", line 103, in pivot_table
grouped = data.groupby(keys)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/generic.py", line 2434, in groupby
sort=sort, group_keys=group_keys, squeeze=squeeze)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 789, in groupby
return klass(obj, by, **kwds)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 238, in __init__
level=level, sort=sort)
File "/tool/pandora64/.package/python-2.7.5/lib/python2.7/site-packages/pandas/core/groupby.py", line 1622, in _get_grouper
raise ValueError('No group keys passed!')
ValueError: No group keys passed!
Also, I would like to use custom aggfunc, and the above np.sum is just an example.
Thanks.
I think you're asking how to apply a function to all columns of a DataFrame. To do this, call the apply method of your dataframe:
def myfunc(col):
    return np.sum(col)
x.apply(myfunc)
Out[1]:
A 6
B 21
C 5
dtype: int64
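Since the question also asks about custom aggfuncs, the same apply pattern works with any callable; a minimal sketch using a hypothetical value_range aggregation on the question's data:
import pandas as pd
x = pd.DataFrame({'A': [1, 2, 3], 'B': [8, 7, 6], 'C': [0, 3, 2]})
def value_range(col):
    return col.max() - col.min()
print(x.apply(value_range))               # one value per column, no grouping needed
print(x.apply(value_range).to_frame().T)  # as a single-row table, as asked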
I had the same error. I was using pivot_table on a pandas data frame:
import numpy as np
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values='weekly_sales')
# Print mean_sales_by_type
print(mean_sales_by_type)
Here's the error:
File "<stdin>", line 889, in __init__
grouper, exclusions, obj = get_grouper(
File "<stdin>", line 896, in get_grouper
raise ValueError("No group keys passed!")
ValueError: No group keys passed!
Finally got it fixed by specifying the index argument of the pivot_table function (after values):
mean_sales_by_type = sales.pivot_table(values='weekly_sales',index='type')
In your case, try this:
x.pivot_table(values=['A','B','C'], value=[], aggfunc=np.sum)
