My pandas DataFrame looks like this:
Date Ticker Open High Low Adj Close Adj_Close Volume
2016-04-18 vws.co 445.0 449.2 441.7 447.3 447.3 945300
2016-04-19 vws.co 449.0 455.8 448.3 450.9 450.9 907700
2016-04-20 vws.co 451.0 452.5 435.4 436.6 436.6 1268100
2016-04-21 vws.co 440.1 442.9 428.4 435.5 435.5 1308300
2016-04-22 vws.co 435.5 435.5 435.5 435.5 435.5 0
2016-04-25 vws.co 431.0 436.7 424.4 430.0 430.0 1311700
2016-04-18 nflx 109.9 110.7 106.02 108.4 108.4 27001500
2016-04-19 nflx 99.49 101.37 94.2 94.34 94.34 55623900
2016-04-20 nflx 94.34 96.98 93.14 96.77 96.77 25633600
2016-04-21 nflx 97.31 97.38 94.78 94.98 94.98 19859400
2016-04-22 nflx 94.85 96.69 94.21 95.9 95.9 15786000
2016-04-25 nflx 95.7 95.75 92.8 93.56 93.56 14965500
I have a program in which one of the functions, itself containing nested functions, successfully runs a groupby.
That line looks like this:
df['MA3'] = df.groupby('Ticker').Adj_Close.transform(lambda group: pd.rolling_mean(group, window=3))
See my initial question and the data format here:
Select only one value in df col rows in same df for calc results from different val, and calc df only on one ticker at a time
It has now dawned on me that rather than doing the groupby in each nested function (of which I have five), I would rather run the groupby once in the main program that calls the top function, so all the nested functions can work on the already-filtered dataframe.
How do I apply groupby in my main function to filter my dataframe, so I only work on one ticker (value in the 'Ticker' column) at a time?
The 'Ticker' column contains company identifiers such as 'aapl', 'msft' and 'nflx', with time-series data for a given window.
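As a hedged sketch of the idea (the sample frame and the diff_calc helper below are stand-ins, not your actual program): run the groupby once in the main program, then hand each per-ticker group to the worker functions.

```python
import pandas as pd

# Minimal stand-in for df_all (hypothetical sample data)
df_all = pd.DataFrame({
    'Ticker': ['nflx', 'nflx', 'vws.co', 'vws.co'],
    'Adj_Close': [108.4, 94.34, 447.3, 450.9],
})

def diff_calc(group):
    # Works on one ticker's rows only; no further groupby needed inside
    group = group.copy()
    group['Difference'] = group['Adj_Close'].diff()
    return group

# Group once, then run every per-ticker function on each group
pieces = [diff_calc(group) for ticker, group in df_all.groupby('Ticker')]
result = pd.concat(pieces)
print(result)
```

Each nested function then receives a frame already restricted to a single ticker, so the groupby cost is paid only once.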
Thanks a lot, Karasinski. This is close to what I want, but I get an error.
When I run:
def Screener(df_all, group):
    # Copy df_all to df for single ticker operations
    df = df_all.copy()

    def diff_calc(df, ticker):
        df['Difference'] = df['Adj_Close'].diff()
        return df

    df = diff_calc(df, ticker)
    return df_all

for ticker in stocklist:
    df_all[['Difference']] = df_all.groupby('Ticker').Adj_Close.apply(Screener, ticker)
I get this error:
Traceback (most recent call last):
File "<ipython-input-2-d7c1835f6b2a>", line 1, in <module>
runfile('C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py', wdir='C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox')
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 85, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py", line 144, in <module>
df_all[['Difference']] = df_all.groupby('Ticker').Adj_Close.apply(Screener, ticker)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 663, in apply
return self._python_apply_general(f)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 667, in _python_apply_general
self.axis)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 1286, in apply
res = f(group)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 659, in f
return func(g, *args, **kwargs)
File "C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py", line 112, in Screener
df = diff_calc(df, ticker)
File "C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py", line 70, in diff_calc
df['Difference'] = df['Adj_Close'].diff()
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\series.py", line 514, in __getitem__
result = self.index.get_value(self, key)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\tseries\index.py", line 1221, in get_value
raise KeyError(key)
KeyError: 'Adj_Close'
And when I use functools like so
df_all = functools.partial(df_all.groupby('Ticker').Adj_Close.apply(Screener, ticker))
I get the same error as above...
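A likely cause of the KeyError: `df_all.groupby('Ticker').Adj_Close.apply(...)` hands each group to Screener as a Series of Adj_Close values, not as a DataFrame, so `df['Adj_Close']` inside diff_calc looks the string up in the Series' DatetimeIndex and fails (hence the traceback ending in `tseries\index.py`). A minimal sketch of the difference, on hypothetical sample data:

```python
import pandas as pd

df_all = pd.DataFrame({
    'Ticker': ['nflx', 'nflx', 'vws.co'],
    'Adj_Close': [108.4, 94.34, 447.3],
})

# Selecting the column first hands a Series to the applied function...
kind_series = df_all.groupby('Ticker').Adj_Close.apply(lambda g: type(g).__name__)

# ...while grouping the whole frame hands a DataFrame, so g['Adj_Close'] works
kind_frame = df_all.groupby('Ticker').apply(lambda g: type(g).__name__)

print(kind_series.iloc[0], kind_frame.iloc[0])
```

So grouping the whole frame (`df_all.groupby('Ticker').apply(Screener, ...)`) would give Screener a DataFrame with an 'Adj_Close' column to index.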
Traceback (most recent call last):
File "<ipython-input-5-d7c1835f6b2a>", line 1, in <module>
runfile('C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py', wdir='C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox')
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 85, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py", line 148, in <module>
df_all = functools.partial(df_all.groupby('Ticker').Adj_Close.apply(Screener, [ticker]))
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 663, in apply
return self._python_apply_general(f)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 667, in _python_apply_general
self.axis)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 1286, in apply
res = f(group)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 659, in f
return func(g, *args, **kwargs)
File "C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py", line 114, in Screener
df = diff_calc(df, ticker)
File "C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py", line 72, in diff_calc
df['Difference'] = df['Adj_Close'].diff()
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\series.py", line 514, in __getitem__
result = self.index.get_value(self, key)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-
3.3.5.amd64\lib\site-packages\pandas\tseries\index.py", line 1221, in get_value
raise KeyError(key)
KeyError: 'Adj_Close'
Edit, following Karasinski's edit of 31/5:
When I run Karasinski's latest suggestion I get this error:
mmm
mmm
nflx
vws.co
Traceback (most recent call last):
File "<ipython-input-4-d7c1835f6b2a>", line 1, in <module>
runfile('C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py', wdir='C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox')
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 85, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/Morten/Documents/Design/Python/CrystalBall - Local - Git/Git - CrystalBall/sandbox/screener_test simple for StockOverflowNestedFct_Getstock.py", line 173, in <module>
df_all[['mean', 'max', 'median', 'min']] = df_all.groupby('Ticker').apply(group_func)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 663, in apply
return self._python_apply_general(f)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 670, in _python_apply_general
not_indexed_same=mutated)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 2785, in _wrap_applied_output
not_indexed_same=not_indexed_same)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\groupby.py", line 1142, in _concat_objects
result = result.reindex_axis(ax, axis=self.axis)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\frame.py", line 2508, in reindex_axis
fill_value=fill_value)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\generic.py", line 1841, in reindex_axis
{axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\generic.py", line 1865, in _reindex_with_indexers
copy=copy)
File "C:\Program Files\WinPython-64bit-3.3.5.7\python-3.3.5.amd64\lib\site-packages\pandas\core\internals.py", line 3144, in reindex_indexer
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
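For context on that last error: "cannot reindex from a duplicate axis" tends to mean the index being aligned against has repeated labels, and here the Date index repeats because each date appears once per ticker, so the apply result cannot be aligned back unambiguously. A tiny reproduction on hypothetical data (the exact message wording varies across pandas versions, so only the exception type is checked):

```python
import pandas as pd

# The same date appears once per ticker -> duplicated index labels
idx = pd.to_datetime(['2016-04-18', '2016-04-18'])
s = pd.Series([1, 2], index=idx)

try:
    s.reindex(pd.to_datetime(['2016-04-18', '2016-04-19']))
    raised = False
except ValueError:
    raised = True
print(raised)
```

Resetting the index, or indexing on (Date, Ticker) pairs so labels are unique, is the usual way out of this alignment problem.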
From an answer to your previous question, we can set up with:
import pandas as pd
from StringIO import StringIO
text = """Date Ticker Open High Low Adj_Close Volume
2015-04-09 vws.co 315.000000 316.100000 312.500000 311.520000 1686800
2015-04-10 vws.co 317.000000 319.700000 316.400000 312.700000 1396500
2015-04-13 vws.co 317.900000 321.500000 315.200000 315.850000 1564500
2015-04-14 vws.co 320.000000 322.400000 318.700000 314.870000 1370600
2015-04-15 vws.co 320.000000 321.500000 319.200000 316.150000 945000
2015-04-16 vws.co 319.000000 320.200000 310.400000 307.870000 2236100
2015-04-17 vws.co 309.900000 310.000000 302.500000 299.100000 2711900
2015-04-20 vws.co 303.000000 312.000000 303.000000 306.490000 1629700
2016-03-31 mmm 166.750000 167.500000 166.500000 166.630005 1762800
2016-04-01 mmm 165.630005 167.740005 164.789993 167.529999 1993700
2016-04-04 mmm 167.110001 167.490005 165.919998 166.399994 2022800
2016-04-05 mmm 165.179993 166.550003 164.649994 165.809998 1610300
2016-04-06 mmm 165.339996 167.080002 164.839996 166.809998 2092200
2016-04-07 mmm 165.880005 167.229996 165.250000 167.160004 2721900"""
df = pd.read_csv(StringIO(text), delim_whitespace=1, parse_dates=[0], index_col=0)
You can then make a function which calculates whatever statistics you'd like, such as:
def various_indicators(group):
    mean = pd.rolling_mean(group, window=3)
    max = pd.rolling_max(group, window=3)
    median = pd.rolling_median(group, window=3)
    min = pd.rolling_min(group, window=3)
    return pd.DataFrame({'mean': mean,
                         'max': max,
                         'median': median,
                         'min': min})
To assign these new columns to your dataframe, do a groupby and then apply the function:
df[['mean', 'max', 'median', 'min']] = df.groupby('Ticker').Adj_Close.apply(various_indicators)
EDIT
Regarding your further questions in the comments on the answer: to extract additional information from the dataframe, you should instead pass the entire group rather than just the single column.
def group_func(group):
    ticker = group.Ticker.unique()[0]
    adj_close = group.Adj_Close
    return Screener(ticker, adj_close)

def Screener(ticker, adj_close):
    print(ticker)
    mean = pd.rolling_mean(adj_close, window=3)
    max = pd.rolling_max(adj_close, window=3)
    median = pd.rolling_median(adj_close, window=3)
    min = pd.rolling_min(adj_close, window=3)
    return pd.DataFrame({'mean': mean,
                         'max': max,
                         'median': median,
                         'min': min})
You can then assign these columns in a similar way as above
df[['mean', 'max', 'median', 'min']] = df.groupby('Ticker').apply(group_func)
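As an aside, `pd.rolling_mean` and friends were later removed from pandas; under the method-based API the same per-group rolling statistics can be sketched like this (assuming a frame with 'Ticker' and 'Adj_Close' columns; the sample values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['nflx'] * 4 + ['vws.co'] * 4,
    'Adj_Close': [108.4, 94.34, 96.77, 94.98, 447.3, 450.9, 436.6, 435.5],
})

# transform keeps the result aligned to the original rows per group
grouped = df.groupby('Ticker')['Adj_Close']
df['mean'] = grouped.transform(lambda s: s.rolling(window=3).mean())
df['max'] = grouped.transform(lambda s: s.rolling(window=3).max())
print(df)
```

The first window-1 rows of each ticker are NaN, exactly as with the old rolling_* functions.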
Related
The pandas-market-calendars module documentation gives a quickstart example of how to create a schedule here: https://pypi.org/project/pandas-market-calendars/
I copy-pasted their quickstart example:
import pandas_market_calendars as mcal
# Create a calendar
nyse = mcal.get_calendar('NYSE')
# Show available calendars
print(mcal.get_calendar_names())
early = nyse.schedule(start_date='2012-07-01', end_date='2012-07-10')
print(early)
In their example, we should get a nice pretty schedule output:
market_open market_close
=========== ========================= =========================
2012-07-02 2012-07-02 13:30:00+00:00 2012-07-02 20:00:00+00:00
2012-07-03 2012-07-03 13:30:00+00:00 2012-07-03 17:00:00+00:00
2012-07-05 2012-07-05 13:30:00+00:00 2012-07-05 20:00:00+00:00
2012-07-06 2012-07-06 13:30:00+00:00 2012-07-06 20:00:00+00:00
2012-07-09 2012-07-09 13:30:00+00:00 2012-07-09 20:00:00+00:00
2012-07-10 2012-07-10 13:30:00+00:00 2012-07-10 20:00:00+00:00
Instead, I get an error:
ValueError: The dtype of 'values' is incorrect. Must be 'datetime64[ns]'. Got object instead.
What am I doing wrong? I thought I input the dates exactly as the example showed. Why does it require a different format? Some people have run the above code without error. My traceback is:
Traceback (most recent call last):
File "D:/Strawberry/Strawberry_Omega/tests/pandas_mkt_clndr_test.py", line 9, in <module>
early = nyse.schedule(start_date='2012-07-01', end_date='2012-07-10')
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas_market_calendars\market_calendar.py", line 632, in schedule
adjusted = schedule.loc[_close_adj].apply(
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\frame.py", line 7547, in apply
return op.get_result()
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py", line 180, in get_result
return self.apply_standard()
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py", line 255, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\apply.py", line 284, in apply_series_generator
results[i] = self.f(v)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas_market_calendars\market_calendar.py", line 633, in <lambda>
lambda x: x.where(x.le(x["market_close"]), x["market_close"]), axis= 1)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 9001, in where
return self._where(
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 8741, in _where
cond, _ = cond.align(self, join="right", broadcast_axis=1)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\series.py", line 4274, in align
return super().align(
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 8556, in align
return self._align_series(
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 8664, in _align_series
right = other._reindex_indexer(join_index, ridx, copy)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\series.py", line 4241, in _reindex_indexer
return self.copy()
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 5660, in copy
data = self._mgr.copy(deep=deep)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\managers.py", line 802, in copy
res = self.apply("copy", deep=deep)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\managers.py", line 406, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 679, in copy
return self.make_block_same_class(values, ndim=self.ndim)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 261, in make_block_same_class
return type(self)(values, placement=placement, ndim=ndim)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 1544, in __init__
values = self._maybe_coerce_values(values)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 2170, in _maybe_coerce_values
values = self._holder(values)
File "C:\Users\Wes\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\arrays\datetimes.py", line 254, in __init__
raise ValueError(
ValueError: The dtype of 'values' is incorrect. Must be 'datetime64[ns]'. Got object instead.
I am trying to apply a few functions to a pandas dataframe, but I am getting an OutOfBoundsDatetime error.
def data_cleanser(data):
    clean_df = data.str.strip().str.replace(r'\\', '')
    return clean_df
connection = pyodbc.connect(conn)
sql = 'SELECT * FROM {}'.format(tablename)
df = pd.read_sql_query(sql, connection)
df.replace([None], np.nan, inplace=True)
df.fillna('', inplace=True)
df=df.applymap(str)
df = df.apply(lambda x: data_cleanser(x))
Error message:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Python_Scripts/lib64/python3.4/site-packages/pandas/core/internals.py", line 763, in replace
copy=not inplace) for b in blocks]
File "/Python_Scripts/lib64/python3.4/site-packages/pandas/core/internals.py", line 763, in <listcomp>
copy=not inplace) for b in blocks]
File "/Python_Scripts/lib64/python3.4/site-packages/pandas/core/internals.py", line 2135, in convert
blocks = self.split_and_operate(None, f, False)
File "/Python_Scripts/lib64/python3.4/site-packages/pandas/core/internals.py", line 478, in split_and_operate
nv = f(m, v, i)
File "/Python_Scripts/lib64/python3.4/site-packages/pandas/core/internals.py", line 2125, in f
values = fn(v.ravel(), **fn_kwargs)
File "/Python_Scripts/lib64/python3.4/site-packages/pandas/core/dtypes/cast.py", line 807, in soft_convert_objects
values = lib.maybe_convert_objects(values, convert_datetime=datetime)
File "pandas/_libs/src/inference.pyx", line 1290, in pandas._libs.lib.maybe_convert_objects
File "pandas/_libs/tslib.pyx", line 1575, in pandas._libs.tslib.convert_to_tsobject
File "pandas/_libs/tslib.pyx", line 1669, in pandas._libs.tslib.convert_datetime_to_tsobject
File "pandas/_libs/tslib.pyx", line 1848, in pandas._libs.tslib._check_dts_bounds
pandas._libs.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 07:00:00
Sample data frame:
appointment_id
start_time
end_time
emp_id
302205
2016-10-26 17:30:00
2016-10-26 18:30:00
45807
462501
2017-04-10 13:00:00
NaT
45807
How can I avoid this error?
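The traceback points at a datetime-like string (a year-1 timestamp, `1-01-01 07:00:00`) that falls outside the range pandas can represent with nanosecond precision (roughly years 1677 to 2262). One hedged way around it, sketched on a small hypothetical sample, is to convert such columns explicitly with `errors='coerce'`, so out-of-range values become NaT instead of raising:

```python
import pandas as pd

# Hypothetical column containing one out-of-range timestamp
s = pd.Series(['2016-10-26 17:30:00', '0001-01-01 07:00:00'])

# coerce turns unparseable / out-of-bounds values into NaT instead of raising
converted = pd.to_datetime(s, errors='coerce')
print(converted)
```

Converting the datetime columns this way before the blanket `replace`/`applymap` calls keeps pandas' implicit datetime inference from tripping over the bad value.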
I am currently upgrading from Python 2 to Python 3 (Anaconda, 32-bit).
Upgrading to 64-bit is not an option at the moment.
Upon calling my function, I am getting a MemoryError.
def M2():
    print('Loading datasets...')
    e1 = pd.read_csv(working_dir + "E1.txt", sep=',')
E1.txt is 300,000 KB (about 300 MB).
Is there a better way of reading in this data?
Update
I do not want to use chunksize, as that will not read my data into a single dataframe.
I have narrowed my .txt file down from 300,000 KB to 50,000 KB and still hit the memory issue.
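One hedged option, if chunking is off the table, is to tell `read_csv` smaller dtypes up front, so the frame needs roughly half the RAM of the 64-bit defaults. A sketch on a hypothetical two-column sample (the real column names and suitable dtypes would come from your E1.txt):

```python
import io

import numpy as np
import pandas as pd

# Small stand-in for E1.txt (hypothetical columns)
text = "id,price\n1,2.5\n2,3.5\n3,4.5\n"

# float32/int32 use half the memory of the default float64/int64 columns
e1 = pd.read_csv(io.StringIO(text), dtype={'id': np.int32, 'price': np.float32})
print(e1.dtypes)
```

Reading only the columns you actually need (`usecols=`) compounds the savings on a 32-bit process with its ~2 GB address-space ceiling.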
Traceback:
Traceback (most recent call last):
File "<ipython-input-99-99e71d524b4b>", line 1, in <module>
runfile('C:/AppData/FinRecon/py_code/python3/DataJoin.py', wdir='C:/AppData/FinRecon/py_code/python3')
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 474, in <module>
M2()
File "C:/AppData/FinRecon/py_code/python3/DataJoin.py", line 31, in M2
e1 = pd.read_csv(working_dir+"E1.txt",sep=',')
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\io\parsers.py", line 1154, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\construction.py", line 61, in arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1666, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1734, in form_blocks
int_blocks = _multi_blockify(items_dict['IntBlock'])
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1819, in _multi_blockify
values, placement = _stack_arrays(list(tup_block), dtype)
File "C:\Users\stack\AppData\Local\Continuum\anaconda3\anaconda3_32bit\lib\site-packages\pandas\core\internals\managers.py", line 1861, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
I can't figure out what the problem in the given code is.
I am using dask to merge several dataframes. After merging I want to find the unique values from one of the columns. I am getting a TypeError while converting from dask to pandas using unique().compute(). It says that a str cannot be interpreted as an int, yet for some of the files the code passes through and for some it doesn't. I also cannot find a problem with the data structure.
Any suggestions?
import pandas as pd
import dask.dataframe as dd
# Everything is fine until merging
# I have put several print(markers) to find the problem code
print('dask cols')
print(df_by_dask_merged.columns)
print()
print(dask_cols)
print()
print('find unique contigs values in dask dataframe')
pd_df = df_by_dask_merged['contig']
print(pd_df)
print()
print('mark 02')
# this is the problem code ??
pd_df_contig = pd_df.unique().compute()
print(pd_df_contig)
print('mark 03')
Output on Terminal:
dask cols
Index(['contig', 'pos', 'ref', 'all-alleles', 'ms01e_PI', 'ms01e_PG_al',
'ms02g_PI', 'ms02g_PG_al', 'all-freq'],
dtype='object')
['contig', 'pos', 'ref', 'all-alleles', 'ms01e_PI', 'ms01e_PG_al', 'ms02g_PI', 'ms02g_PG_al', 'all-freq']
find unique contigs values in dask dataframe
Dask Series Structure:
npartitions=1
int64
...
Name: contig, dtype: int64
Dask Name: getitem, 52 tasks
mark 02
Traceback (most recent call last):
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2145, in get_value
return tslib.get_value_box(s, key)
File "pandas/tslib.pyx", line 880, in pandas.tslib.get_value_box (pandas/tslib.c:17368)
File "pandas/tslib.pyx", line 889, in pandas.tslib.get_value_box (pandas/tslib.c:17042)
TypeError: 'str' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "merge_haplotype.py", line 305, in <module>
main()
File "merge_haplotype.py", line 152, in main
pd_df_contig = pd_df.unique().compute()
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/base.py", line 155, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/base.py", line 404, in compute
results = get(dsk, keys, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/threaded.py", line 75, in get
pack_exception=pack_exception, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 521, in get_async
raise_exception(exc, tb)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 67, in reraise
raise exc
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 290, in execute_task
result = _execute_task(task, data)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/local.py", line 271, in _execute_task
return func(*args2)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py", line 3404, in apply_and_enforce
df = func(*args, **kwargs)
File "/home/everestial007/anaconda3/lib/python3.5/site-packages/dask/utils.py", line 687, in __call__
return getattr(obj, self.method)(*args, **kwargs)
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 4133, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 4229, in _apply_standard
results[i] = func(v)
File "merge_haplotype.py", line 249, in <lambda>
apply(lambda row : update_cols(row, sample_name), axis=1, meta=(int))
File "merge_haplotype.py", line 278, in update_cols
if 'N|N' in df_by_dask[sample_name + '_PG_al']:
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/core/series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2153, in get_value
raise e1
File "/home/everestial007/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas/index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas/index.c:3041)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4024)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13161)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13115)
KeyError: ('ms02g_PG_al', 'occurred at index 0')
I am trying to get the matching elements of two pandas tables by indexing the data and merging them. I use this on a very large amount of data (millions of rows). The first table (df) is constant, and the second (d2) changes in every loop, with its new elements being merged against the first table.
Here is my code for this process:
df = pd.read_csv("inputfile.csv", header=None)
d1 = pd.DataFrame(df).set_index(0)

for i in range(0, len(df)):
    try:
        follower_id = twitter.get_followers_ids(user_id=df.iloc[i][0], cursor=next_cursor)
        f = follower_id['ids']
        json.dumps(f)
        d2 = pd.DataFrame(f).set_index(0)
        match_result = pd.merge(d1, d2, left_index=True, right_index=True)
        fk = [df.iloc[i][0] for number in range(len(match_result))]
        DF = pd.DataFrame(fk)
        DF.to_csv(r'output1.csv', header=None, sep=' ', index=None)
        match_result.to_csv(r'output2.csv', header=None, sep=' ')
    except Exception:
        raise  # (the original except body was lost in the paste)
I have experienced that this code runs well for a while, but after that (probably related to the second database's size, which changes every loop) the program gives me the following error message and stops running:
Traceback (most recent call last):
File "halozat3.py", line 39, in <module>
d2 = pd.DataFrame(f).set_index(0) #1Trump koveto kovetolistaja
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 2372, in set_index
level = frame[col].values
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1678, in __getitem__
return self._getitem_column(key)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1685, in _getitem_column
return self._get_item_cache(key)
File "/usr/lib/python2.7/dist-packages/pandas/core/generic.py", line 1052, in _get_item_cache
values = self._data.get(item)
File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 2565, in get
loc = self.items.get_loc(item)
File "/usr/lib/python2.7/dist-packages/pandas/core/index.py", line 1181, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "index.pyx", line 129, in pandas.index.IndexEngine.get_loc (pandas/index.c:3656)
File "index.pyx", line 149, in pandas.index.IndexEngine.get_loc (pandas/index.c:3534)
File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:7035)
File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6976)
KeyError: 0
What could be the problem?
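One plausible cause, given the traceback points at `pd.DataFrame(f).set_index(0)`: if the API returns an empty `ids` list for some user, `pd.DataFrame(f)` has no column `0` to set as the index, which raises exactly this `KeyError: 0`. A hedged guard might look like:

```python
import pandas as pd

f = []  # e.g. a user with no followers (hypothetical case)

# An empty list yields a frame with no columns, so set_index(0) would raise KeyError: 0
if f:
    d2 = pd.DataFrame(f).set_index(0)
else:
    d2 = pd.DataFrame()  # skip or handle the empty case instead

print(0 in pd.DataFrame(f).columns)
```

Logging which user id produced the empty list would confirm whether this is indeed what breaks the loop.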
Do you have only one row in your dataframe?
You should write as many rows as you need.
Look: