I have a dataframe HH that looks like this:
   end_Date  latitude  longitude start_Date
0  9/5/2014   41.8927   -90.4031   4/1/2014
1  9/5/2014   41.8928   -90.4031   4/1/2014
2  9/5/2014   41.8927   -90.4030   4/1/2014
3  9/5/2014   41.8928   -90.4030   4/1/2014
4  9/5/2014   41.8928   -90.4029   4/1/2014
5  9/5/2014   41.8923   -90.4028   4/1/2014
I am trying to parallelize my function using the multiprocessing package in Python. Here's what I wrote:
from multiprocessing import Pool
import time

if __name__ == '__main__':
    pool = Pool(200)
    start = time.time()
    print "Hello"
    H = pool.map(funct_parallel, HH)
    pool.close()
    pool.join()
When I run this code, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Desktop/testparallel.py", line 198, in <module>
H = pool.map(funct_parallel, HH)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 567, in get
raise self._value
TypeError: string indices must be integers, not str
Not sure where I am going wrong?
pool.map requires an iterable as its second argument, which it feeds to the function; see the docs.
If you iterate over a DataFrame, you get the column names, hence the complaint about string indices.
for i in df:
    print(i)
end_Date
latitude
longitude
start_Date
You instead need to break the DataFrame into pieces that can be processed in parallel by the pool, for instance by reading the file in chunks as explained in the I/O docs, or by splitting the frame as sketched below.
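For instance, a minimal sketch using numpy.array_split (an assumption: funct_parallel is meant to operate on a piece of the frame; the placeholder body here is illustrative only):

import numpy as np
from multiprocessing import Pool

def funct_parallel(chunk):
    # placeholder body: real per-chunk work on the DataFrame goes here
    return len(chunk)

if __name__ == '__main__':
    chunks = np.array_split(HH, 4)  # 4 roughly equal DataFrame pieces
    pool = Pool(4)                  # one worker per chunk is a saner start than 200
    H = pool.map(funct_parallel, chunks)
    pool.close()
    pool.join()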
I use Python 3.8 with pandas. I have some data in a csv file. I load the data into a pandas dataframe and try to delete parts of the Client_Name strings. Some Client_Names look like Client_Name = myserv(4234), so I have to remove the (4234) to make Client_Name = myserv. In short, I have to strip the parentheses (4234) from the Client_Names.
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
The code above does clean the parentheses from the Client_Names. My problem is that if the (4234) is in the first row of the dataframe, it raises an error. If (4234) is in any other row, there is no problem.
The working data is:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1,5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3(4234),6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5(4234),3
5,2018-10-14T21:02:29Z,myserv6(4234),3
When I run my code it removes the (4234)s and the data turns into the format below:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1,5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3,6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5,3
5,2018-10-14T21:02:29Z,myserv6,3
But if the (4234) is in the first row, as below, my code throws an error:
,time,Client_Name,Minutes
0,2018-10-14T21:01:00Z,myserv1(4234),5
1,2018-10-14T21:01:00Z,myserv2,5
2,2018-10-14T21:01:00Z,myserv3,6
3,2018-10-14T21:01:00Z,myserv4,6
4,2018-10-14T21:02:07Z,myserv5,3
5,2018-10-14T21:02:29Z,myserv6,3
The error is:
test1.py:97: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 972, in __setitem__
self._set_with_engine(key, value)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 1005, in _set_with_engine
loc = self.index._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '0 True
1 False
2 False
3 False
4 False
...
116 False
117 False
118 False
119 False
120 False
Name: Client_Name, Length: 121, dtype: bool' is an invalid key
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test1.py", line 97, in <module>
df.Client_Name[df.Client_Name.str.contains('\(')] = df.Client_Name.str.split("(")[0]
File "/usr/local/lib/python3.8/dist-packages/pandas/core/series.py", line 992, in __setitem__
self._where(~key, value, inplace=True)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 9129, in _where
new_data = self._mgr.putmask(
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 579, in putmask
return self.apply(
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 427, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py", line 1144, in putmask
raise ValueError("cannot assign mismatch length to masked array")
ValueError: cannot assign mismatch length to masked array
Your slicing method generates a copy, which you then modify; this is what triggers the warning.
You could use instead:
df['Client_Name'] = df['Client_Name'].str.replace('\(.*?\)', '', regex=True)
output:
time Client_Name Minutes
0 2018-10-14T21:01:00Z myserv1 5
1 2018-10-14T21:01:00Z myserv2 5
2 2018-10-14T21:01:00Z myserv3 6
3 2018-10-14T21:01:00Z myserv4 6
4 2018-10-14T21:02:07Z myserv5 3
5 2018-10-14T21:02:29Z myserv6 3
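If you prefer to keep the split-based approach, here is a sketch that avoids both the chained assignment and the indexing pitfall: .str[0] takes the first piece of each row's split, whereas the original [0] selected the single row labelled 0.

# select via .loc to write back to the original frame, not a copy
mask = df['Client_Name'].str.contains(r'\(')
df.loc[mask, 'Client_Name'] = df['Client_Name'].str.split('(').str[0]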
I have a data frame that contains hundreds of stock tickers along with price information for each ticker each day. It looks something like this:
ticker date open high low close volume
0 ZEST 2011-01-03 537.500 537.500 537.500 537.500 2.0
1 WHG 2011-01-03 40.230 40.660 40.110 40.500 9200.0
2 ZEST 2011-01-04 31.300 31.660 31.160 31.580 34397100.0
3 WHG 2011-01-04 17.030 17.150 16.870 16.900 621800.0
4 ZEST 2011-01-05 31.230 31.230 30.960 31.030 1273300.0
5 WHG 2011-01-05 31.230 31.230 30.960 31.030 1273300.0
I am trying to calculate indicator values for each security, so I am grouping by ticker and applying functions to each group:
import pandas as pd
import talib as ta
df = pd.read_csv(r"C:\BulkInsert.csv", index_col=False)
def my_ATR(df):
    return ta.ATR(
        high=df.high,
        low=df.low,
        close=df.close,
        timeperiod=14
    )

def my_RSI(df):
    return ta.RSI(
        close=df.close,
        timeperiod=14
    )
new_ATR = df.groupby("ticker", as_index=False).apply(my_ATR)
df['ATR'] = new_ATR.reset_index(level=0, drop=True)
new_RSI = df.groupby("ticker", as_index=False).apply(my_RSI)
df['RSI'] = new_RSI.reset_index(level=0, drop=True)
The my_ATR function calculates an ATR value on the grouped securities without issue, but the my_RSI function returns the following error:
Traceback (most recent call last):
File "C:\Users\noozak\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1253, in apply
result = self._python_apply_general(f, self._selected_obj)
File "C:\Users\noozak\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\groupby.py", line 1287, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
File "C:\Users\noozak\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\ops.py", line 783, in apply
result_values, mutated = splitter.fast_apply(f, sdata, group_keys)
File "C:\Users\noozak\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\groupby\ops.py", line 1328, in fast_apply
return libreduction.apply_frame_axis0(sdata, f, names, starts, ends)
File "pandas\_libs\reduction.pyx", line 381, in pandas._libs.reduction.apply_frame_axis0
File "c:\Users\noozak\OneDrive\Desktop\APP\.vscode\test1.py", line 39, in my_RSI
return ta.RSI(
File "C:\Users\noozak\AppData\Local\Programs\Python\Python39\lib\site-packages\talib\__init__.py", line 35, in wrapper
result = func(*args, **kwargs)
File "_func.pxi", line 4344, in talib._ta_lib.RSI
TypeError: RSI() takes at least 1 positional argument (0 given)
I am not sure what is different between these two lines and why the arguments are passed to my_ATR and not my_RSI.
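One likely explanation (an assumption, based on TA-Lib's documented signatures): ta.ATR declares named inputs high, low, and close, so those keyword arguments bind, while ta.RSI's single price input is named real, so close= does not bind and the wrapper sees zero positional arguments. Passing the series positionally sidesteps this:

def my_RSI(df):
    # RSI's input parameter is named `real` in TA-Lib, not `close`,
    # so pass the closing prices positionally (assumption)
    return ta.RSI(df.close, timeperiod=14)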
I am trying to convert a function that generates a Boolean index based on a date and a name to work with Numba, but I get an error.
My project starts with a DataFrame TS_Flujos with the following structure.
Fund name  Date    Var Commitment  Cash flow
Fund 1     Date 1  100             -20
Fund 1     Date 2  10              -2
Fund 1     Date 3  0               +10
Fund 2     Date 3  100             0
Fund 2     Date 4  0               -10
Fund 3     Date 2  100             -20
Fund 3     Date 3  20              30
Each line is a cash flow of a specific fund. For each line I need to calculate the commitment cumulated to date and subtract the amount funded to date, the "unfunded" amount. For that, I iterate over the DataFrame TS_Flujos, identify the fund and date of each row, and use a Boolean index to find the other relevant rows in the DataFrame, the ones for the same fund with dates prior to the current one, with the following function:
def date_and_fund(date, fund, c_dates, c_funds):
    i1 = (c_dates <= date)
    i2 = (c_funds == fund)
    result = i1 & i2
    return result
And I run the following loop:
n_flujos = TS_Flujos.to_numpy()
for index in range(len(n_flujos)):
    f = n_dates[index]
    p = n_funds[index]
    date_fund = date_and_fund(f, p, n_dates, n_funds)
    TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
This is a simplification; I also segregate the cash flows by type and calculate many other indicators for each row. For now I have 44,000 rows, but this number should increase a lot in the future, and this loop already takes 1 to 2 minutes depending on the computer. I am worried about the speed once the cash flow database grows tenfold, and this is a small part of the total project. I have tried to use your previous answer to optimize it, but I can't find a way to vectorize or use a list comprehension here.
Because there is no dependency between calculations, I tried to parallelize the code with Numba.
@njit(parallel=True)
def cashflow(array_cashflows):
    for index in prange(len(array_cashflows)):
        f = n_dates[index]
        p = n_funds[index]
        date_fund = date_and_fund(f, p, n_dates, n_funds)
        TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
    return

cashflow(n_flujos)
But I get the following error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/ferna/OneDrive/Python/Datalts/Dataltsweb.py", line 347, in <module>
flujos(n_fecha)
File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 415, in _compile_for_args
error_rewrite(e, 'typing')
File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 358, in error_rewrite
reraise(type(e), e, None)
File "C:\Users\ferna\venv\lib\site-packages\numba\core\utils.py", line 80, in reraise
raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'date_and_pos': cannot determine Numba type of <class 'function'>
File "Dataltsweb.py", line 324:
def flujos(array_flujos):
<source elided>
p = n_idpos[index]
fecha_pos = date_and_pos(f, p, n_fecha, n_idpos)
^
Given the way you have structured your code, you won't gain any performance by using Numba. You're using the decorator on a function that is already vectorized and will perform fast. What would make sense is to speed up the main loop, not just CapComp_MO.
In relation to the error, it seems to be a typing issue. Try adding explicit type signatures and see if that solves it; here are Numba's datatypes for datetime objects.
I'd also recommend avoiding .iterrows() for performance reasons; see this post for an explanation.
As a side note, t1[:] takes a full slice and is the same as t1.
Also, if you add a minimal example (code and dataframes), it might help in improving your current approach. It looks like you're just indexing in each iteration, so you might not need to loop at all if you use numpy.
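For instance, if the cumulated commitment is just a running total of Var Commitment within each fund, a vectorized sketch (assumptions: column names taken from your sample, and at most one row per fund per date, since ties would need extra care):

# sort so each row's running total only covers earlier-or-equal dates,
# then take the cumulative sum within each fund
TS_Flujos = TS_Flujos.sort_values(['Fund name', 'Date'])
TS_Flujos['Cumulated commitment'] = (
    TS_Flujos.groupby('Fund name')['Var Commitment'].cumsum()
)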
I want to import a txt file, and do a few basic actions on it.
For some reason I keep getting an unhashable type error, and I'm not sure what the issue is:
def loadAndPrepData(filepath):
    import pandas as pd
    pd.set_option('display.width', 200)
    dataFrame = pd.read_csv(filepath, header=0, sep='\t')  # set header to first row and sep by tab
    df = dataFrame[0:639,:]
    print df

filepath = 'United States Cancer Statistics, 1999-2011 Incidencet.txt'
loadAndPrepData(filepath)
Traceback:
Traceback (most recent call last):
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 16, in <module>
loadAndPrepData(filepath)
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 12, in loadAndPrepData
df = dataFrame[0:639,:]
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\frame.py", line 1797, in __getitem__
return self._getitem_column(key)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\frame.py", line 1804, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\generic.py", line 1082, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type
The problem is that the item getter ([]) needs hashable keys. When you provide it with [:] this is fine, but when you provide it with [:,:] (a tuple of slices), you get this error.
pd.DataFrame({"foo":range(1,10)})[:,:]
TypeError: unhashable type
While this works just fine:
pd.DataFrame({"foo":range(1,10)})[:]
However, you should be using .loc no matter how you want to slice.
pd.DataFrame({"foo":range(1,10)}).loc[:,:]
foo
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
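Applied to the function in the question, the row slice is positional, so .iloc is the natural fit (a sketch; .loc here would slice by label and include the row labelled 639):

df = dataFrame.iloc[0:639, :]  # first 639 rows, all columns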
df is an object created by pandas which contains 13 columns of data, from which I want to push data into JIRA by creating new issues. It is a 272x13 object. Each column represents a different field for an issue in JIRA. Every new issue created in JIRA should get info from just two columns in df: Summary and Comments.
How do I extract every value from the two columns as I go through each row in a for loop? I only want the string values from each row and column, no index, no object. My code is below:
from jira.client import JIRA
import pandas as pd
df = pd.read_csv('C:\\Python27\\scripts\\export.csv')
# Set the column names from the export.csv file equal to variables using the
# pandas python module
# Loop to create new issues
for row in df.iterrows():
    summ = str(df.loc[row.index, 'Summary'])[:30]
    comments = str(df.loc[row.index, 'Comments'])
    jira.create_issue(project={'key': 'DEL'}, summary=summ, description=comments, issuetype={'name': 'Bug'})
When I do this I get the error:
Traceback (most recent call last):
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\JIRAprocess_Delta.py", line 86, in <module>
summ = str(df.loc[row.index, 'Summary'])[:30]
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 669, in __getitem__
return self._getitem_tuple(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 252, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 361, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 758, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 60, in _get_label
return self.obj._xs(label, axis=axis, copy=True)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\frame.py", line 2281, in xs
loc = self.index.get_loc(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\index.py", line 755, in get_loc
return self._engine.get_loc(key)
File "index.pyx", line 130, in pandas.index.IndexEngine.get_loc (pandas\index.c:3238)
File "index.pyx", line 147, in pandas.index.IndexEngine.get_loc (pandas\index.c:3085)
File "index.pyx", line 293, in pandas.index.Int64Engine._check_type (pandas\index.c:5393)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\series.py", line 523, in __hash__
raise TypeError('unhashable type')
TypeError: unhashable type
Here is some example data that is showing up in JIRA for every single issue created in the comments field:
Issue 1:
0 NaN
1 Found that the Delta would leak packets receiv...
2 The Delta will reset everytime you disconnect ...
3 NaN
4 It's supposed to get logged when CP takes to l...
5 Upon upgrading the IDS via the BioMed menu, th...
6 Upon upgrading the IDS via the BioMed menu, th...
7 Upon upgrading the IDS via the BioMed menu, th...
8 Increased Fusion heap size and the SCC1 Initia...
9 Recheck using build 142+, after Matt delivers ...
10 When using WPA2, there is EAPOL key echange go...
11 When using WPA2, there is EAPOL key echange go...
12 NaN
13 NaN
14 NaN
...
I only want each issue to have its own string values, and not the index numbers or the NaN to show up like this:
Issue 1:
Issue 2: Found that the Delta would leak packets receiv...
Issue 3: The Delta will reset everytime you disconnect ...
...
The problem is in the use of iterrows.
From the documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html), df.iterrows() iterates over DataFrame rows as (index, Series) pairs.
What you need is to replace row.index with row[0], which gives you the index of the dataframe row you are iterating over:
for row in df.iterrows():
    summ = str(df.loc[row[0], 'Summary'])[:30]
    comments = str(df.loc[row[0], 'Comments'])
By the way, I think you don't need iterrows at all:
for row_index in df.index:
    summ = str(df.loc[row_index, 'Summary'])[:30]
    comments = str(df.loc[row_index, 'Comments'])
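Another option is to unpack the (index, Series) pair that iterrows yields, so each field is read straight off the row; the pd.isnull guard is an assumption about how you want the NaN comments handled:

for idx, row in df.iterrows():
    summ = str(row['Summary'])[:30]
    # avoid sending the literal string 'nan' to JIRA for missing comments
    comments = '' if pd.isnull(row['Comments']) else str(row['Comments'])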