Boolean index with Numba with strings and datetime64 - python

I am trying to convert a function that generates a Boolean index based on a date and a name to work with Numba, but I get an error.
My project starts with a DataFrame TS_Flujos with a structure as follows:
Fund name   Date     Var Commitment   Cash flow
Fund 1      Date 1   100              -20
Fund 1      Date 2   10               -2
Fund 1      Date 3   0                +10
Fund 2      Date 3   100              0
Fund 2      Date 4   0                -10
Fund 3      Date 2   100              -20
Fund 3      Date 3   20               30
Each line is a cash flow of a specific fund. For each line I need to calculate the commitment cumulated to date and subtract the amount funded to date, the "unfunded". For that, I iterate over the DataFrame TS_Flujos, identify the fund and date of the row, and use a Boolean index to find the other "relevant rows" in the DataFrame: the ones for the same fund with dates prior to the current one, using the following function:
def date_and_fund(date, fund, c_dates, c_funds):
    i1 = (c_dates <= date)
    i2 = (c_funds == fund)
    result = i1 & i2
    return result
And I run the following loop:
# n_dates, n_funds and n_varCommitment hold the corresponding columns as NumPy arrays
n_flujos = TS_Flujos.to_numpy()
for index in range(len(n_flujos)):
    f = n_dates[index]
    p = n_funds[index]
    date_fund = date_and_fund(f, p, n_dates, n_funds)
    TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
This is a simplification; I also have to segregate the cash flows by type and calculate many other indicators for each row. For now I have 44,000 rows, but this number should increase a lot in the future, and this loop already takes 1 to 2 minutes depending on the computer. I am worried about the speed once the cash flow database grows 10x, and this is a small part of the total project. I have tried to understand how to use your previous answer to optimize it, but I can't find a way to vectorize or use a list comprehension here.
Because there is no dependency between calculations, I tried to parallelize the code with Numba.
@njit(parallel=True)
def cashflow(array_cashflows):
    for index in prange(len(array_cashflows)):
        f = n_dates[index]
        p = n_funds[index]
        date_fund = date_and_fund(f, p, n_dates, n_funds)
        TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
    return

cashflow(n_dates)
But I get the following error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/ferna/OneDrive/Python/Datalts/Dataltsweb.py", line 347, in <module>
flujos(n_fecha)
File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 415, in _compile_for_args
error_rewrite(e, 'typing')
File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 358, in error_rewrite
reraise(type(e), e, None)
File "C:\Users\ferna\venv\lib\site-packages\numba\core\utils.py", line 80, in reraise
raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'date_and_pos': cannot determine Numba type of <class 'function'>
File "Dataltsweb.py", line 324:
def flujos(array_flujos):
<source elided>
p = n_idpos[index]
fecha_pos = date_and_pos(f, p, n_fecha, n_idpos)
^

Given the way that you have structured your code, you won't gain any performance by using Numba. You're using the decorator on a function that is already vectorized and will already perform fast. What would make sense is to try to speed up the main loop, not just CapComp_MO.
In relation to the error, it seems to have to do with the types. Try adding explicit typing to see if it solves the issue; here are Numba's datatypes for datetime objects.
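For what it's worth, the "Untyped global name ... cannot determine Numba type of <class 'function'>" message usually means that a helper called from nopython code is not itself compiled, so decorating the helper with @njit as well is part of any fix. Below is a minimal sketch of one possible restructuring, not your final code: it passes the arrays in as arguments instead of relying on globals, and side-steps the string/datetime64 typing issues by factorizing fund names to integers and viewing dates as int64 beforehand (the preprocessing lines are hypothetical and assume the columns from the question).

import numpy as np
import pandas as pd
from numba import njit, prange

@njit  # the helper must be compiled too, or Numba cannot type the call below
def date_and_fund(date, fund, c_dates, c_funds):
    return (c_dates <= date) & (c_funds == fund)

@njit(parallel=True)
def cashflow(n_dates, n_funds, n_varCommitment):
    out = np.empty(len(n_dates))
    for index in prange(len(n_dates)):
        mask = date_and_fund(n_dates[index], n_funds[index], n_dates, n_funds)
        out[index] = n_varCommitment[mask].sum()
    return out

# Hypothetical preprocessing, assuming the TS_Flujos columns from the question:
# n_dates = TS_Flujos['Date'].values.view('int64')     # datetime64 -> int64
# n_funds = pd.factorize(TS_Flujos['Fund name'])[0]    # strings -> int codes
# n_varCommitment = TS_Flujos['Var Commitment'].to_numpy(np.float64)
# TS_Flujos['Cumulated commitment'] = cashflow(n_dates, n_funds, n_varCommitment)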
I'd also recommend avoiding .iterrows() for performance reasons; see this post for an explanation.
As a side note, t1[:] takes a full slice, and is the same as t1.
Also, if you add a minimal example (code and dataframes), it might help in improving your current approach. It looks like you're just indexing in each iteration, so you might not need to loop at all if you use numpy or pandas, as in the sketch below.
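For instance, if the cumulated commitment is just a running sum of Var Commitment per fund up to each row's date, a loop-free pandas version might look like this (a sketch assuming the column names from the question, and that date ties may accumulate in row order):

import pandas as pd

# sort chronologically, take a cumulative sum within each fund,
# then restore the original row order
TS_Flujos = TS_Flujos.sort_values('Date')
TS_Flujos['Cumulated commitment'] = (
    TS_Flujos.groupby('Fund name')['Var Commitment'].cumsum()
)
TS_Flujos = TS_Flujos.sort_index()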

Related

Python - Pandas, csv row iteration to divide by half based on criteria met

I work with and collect data from the meters on the side of people's houses in our service area. I worked on a Python script to send high/low voltage alerts to my email whenever they occur, but the voltage originally came in as twice what it actually was (so instead of 125V it showed 250V), so I used pandas to divide the entire column by half... Well, it turns out a small handful of meters were programmed to send back the actual voltage of 125... So I can no longer halve the column and now must iterate and divide individually. I'm a bit new to scripting, so my problem might be simple.
df = pd.read_csv(csvPath)
if "timestamp" not in df:
    df.insert(0, 'timestamp', currenttimestamp)
for i in df['VoltageA']:
    if df['VoltageA'][i] > 200:
        df['VoltageA'][i] = df['VoltageA'][i]/2
        df['VoltageAMax'][i] = df['VoltageAMax'][i]/2
        df['VoltageAMin'][i] = df['VoltageAMin'][i]/2
df.to_csv(csvPath, index=False)
Timestamp is there just as a 'key' to avoid duplicate errors later in the same day.
The error I am getting:
Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
return self._range.index(new_key)
ValueError: 250 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\admin\Documents\script.py", line 139, in <module>
updateTable(csvPath, tableName, truncate, email)
File "C:\Users\admin\Documents\script.py", line 50, in updateTable
if df['VoltageA'][i]>200.0:
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 250.0
If this isn't enough and you actually need to see a csv snippet, let me know. Just trying not to put unnecessary info here. Note, the first VoltageA value is 250.0
The example code below shows how to use loc to conditionally change the values in multiple columns.
import pandas as pd

df = pd.DataFrame({
    'voltageA': [190, 230, 250, 100],
    'voltageMax': [200, 200, 200, 200],
    'voltageMin': [100, 100, 100, 100]
})
df.loc[df['voltageA'] > 200, ['voltageA', 'voltageMax', 'voltageMin']] = (
    df.loc[df['voltageA'] > 200, ['voltageA', 'voltageMax', 'voltageMin']] / 2
)
df
Output

   voltageA  voltageMax  voltageMin
0       190         200         100
1       115         100          50
2       125         100          50
3       100         200         100
The data in the 2nd and 3rd rows were divided by 2 because the original value of voltageA in those rows exceeds 200. (Incidentally, the KeyError in your loop arises because for i in df['VoltageA'] iterates over the column's values, not its index, so the first value, 250.0, ends up being used as a row label that doesn't exist.)

Having trouble with - class 'pandas.core.indexing._AtIndexer'

I'm working on an ML project to predict answer times on Stack Overflow based on tags. Sample data:
Unnamed: 0 qid i qs qt tags qvc qac aid j as at
0 1 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563372 67183.0 2 1235000501
1 2 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563374 66554.0 0 1235000551
2 3 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563358 15842.0 3 1235000177
3 4 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563413 893.0 18 1235001545
4 5 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563454 11649.0 4 1235002457
I'm stuck at the data cleaning process. I intend to create a new column named 'time_taken' which stores the difference between the at and qt columns.
Code:
import pandas as pd
import numpy as np

df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))
for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
File "<ipython-input-39-9384be9e5531>", line 1, in <module>
val = df.at[0]
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
return super().__getitem__(key)
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem here lies in the indexing with df.at.
The types of df.at and df.qt are
<class 'pandas.core.indexing._AtIndexer'> and
<class 'pandas.core.series.Series'> respectively.
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.
There is, to put it mildly, an easier way to do this.
df['time_taken'] = df['at'] - df.qt
The _AtIndexer issue comes up because .at is a pandas accessor (its label-based scalar indexer), so a column named 'at' gets shadowed. You want to avoid naming columns after existing pandas attributes and methods for this reason. You can get around it just by indexing with df['at'] instead of df.at.
Besides that, this operation — if I'm understanding it — can be done with one short line vs. a long for loop.
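A small self-contained demonstration of the name collision:

import pandas as pd

df = pd.DataFrame({'qt': [1, 2], 'at': [3, 5]})
print(type(df.at))     # <class 'pandas.core.indexing._AtIndexer'> (accessor shadows the column)
print(type(df['at']))  # <class 'pandas.core.series.Series'>
df['time_taken'] = df['at'] - df['qt']  # vectorized; no loop required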

error in parallel processing in python using dataframe

I have a dataframe HH that looks like this:
end_Date latitude longitude start_Date
0 9/5/2014 41.8927 -90.4031 4/1/2014
1 9/5/2014 41.8928 -90.4031 4/1/2014
2 9/5/2014 41.8927 -90.4030 4/1/2014
3 9/5/2014 41.8928 -90.4030 4/1/2014
4 9/5/2014 41.8928 -90.4029 4/1/2014
5 9/5/2014 41.8923 -90.4028 4/1/2014
I am trying to parallelize my function using the multiprocessing package in Python. Here's what I wrote:
from multiprocessing import Pool
import time

if __name__ == '__main__':
    pool = Pool(200)
    start = time.time()
    print "Hello"
    H = pool.map(funct_parallel, HH)
    pool.close()
    pool.join()
When I run this code, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Desktop/testparallel.py", line 198, in <module>
H = pool.map(funct_parallel, HH)
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\multiprocessing\pool.py", line 567, in get
raise self._value
TypeError: string indices must be integers, not str
Not sure where I am going wrong?
pool.map requires an iterable as its second argument, which it feeds to the function; see the docs.
If you iterate over a DataFrame, you get the column names, hence the complaint about string indices:
for i in df:
    print(i)

end_Date
latitude
longitude
start_Date
You need instead to break the DataFrame into pieces that can be processed in parallel by the pool, for instance by reading the file in chunks as explained in the I/O docs.
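A minimal sketch of that idea, assuming funct_parallel is rewritten to accept a DataFrame chunk (numpy.array_split is one convenient way to cut the frame; the worker body here is a placeholder):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def funct_parallel(chunk):
    # placeholder for the real per-chunk work on a sub-DataFrame
    return len(chunk)

if __name__ == '__main__':
    HH = pd.read_csv('data.csv')    # hypothetical source of the frame
    chunks = np.array_split(HH, 8)  # 8 roughly equal pieces; Pool(200) is far more than most machines can use
    pool = Pool(8)
    H = pool.map(funct_parallel, chunks)
    pool.close()
    pool.join()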

ARIMA.from_formula with pandas dataframe

Currently, I am trying to fit a seasonal ARIMA model with a 2nd-order autoregressive component, a 60-day lag, and a non-stationary series. When I pass my formula, DataFrame, and time index to the function, it returns an error. I am confused by this function because I know that I need to input the order (p, d, q) somehow, but its parameters don't include it. Here is my code below:
wue_formula = ' WUE ~ 1 + SFO3 + PAR + Ta + VPD'
model = tsa.arima_model.ARIMA.from_formula(wue_formula,gs_residual_df,subset = gs_residual_df.index)
File "<ipython-input-67-d518e1f9e7cc>", line 1, in <module>
tsa.arima_model.ARIMA.from_formula(wue_formula,gs_residual_df,subset = gs_residual_df.index)
File "/Users/JasonDucker/anaconda/lib/python2.7/site-packages/statsmodels/base/model.py", line 99, in from_formula
mod = cls(endog, exog, *args, **kwargs)
File "/Users/JasonDucker/anaconda/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py", line 872, in __new__
p, d, q = order
ValueError: too many values to unpack
Statsmodels' website has been down the past few days and I can't read the documentation for this function to understand my problem fully. Some help with this code would be greatly appreciated!
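One hedged pointer, based on the legacy statsmodels.tsa.arima_model API that this traceback comes from (removed in statsmodels 0.13): the (p, d, q) order is a constructor argument, not something from_formula accepts, and seasonal terms need the statespace SARIMAX class instead. A sketch under those assumptions, with illustrative order values:

from statsmodels.tsa.arima_model import ARIMA  # legacy API

# gs_residual_df is the asker's DataFrame; column names follow the formula
y = gs_residual_df['WUE']
X = gs_residual_df[['SFO3', 'PAR', 'Ta', 'VPD']]

model = ARIMA(y, order=(2, 1, 0), exog=X)  # order=(p, d, q); values are illustrative
result = model.fit()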

Extracting Data From a Pandas object to put into JIRA

df is an object created by pandas which contains 13 columns of data; I want to input data from just two of those columns into JIRA by creating new issues. It is a 272x13 object. Each column represents a different field for an issue in JIRA. Every new issue created in JIRA should get info from just two columns in df: Summary and Comments.
How do I extract every value from the two columns as I go through each row in a for loop? I only want the string values from each row and column, no index, no object. My code is below:
from jira.client import JIRA
import pandas as pd

df = pd.read_csv('C:\\Python27\\scripts\\export.csv')
# Set the column names from the export.csv file equal to variables using the
# pandas python module

# Loop to create new issues
for row in df.iterrows():
    summ = str(df.loc[row.index, 'Summary'])[:30]
    comments = str(df.loc[row.index, 'Comments'])
    jira.create_issue(project={'key': 'DEL'}, summary=summ, description=comments, issuetype={'name': 'Bug'})
When I do this I get the error:
Traceback (most recent call last):
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\JIRAprocess_Delta.py", line 86, in <module>
summ = str(df.loc[row.index, 'Summary'])[:30]
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 669, in __getitem__
return self._getitem_tuple(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 252, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 361, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 758, in _getitem_axis
return self._get_label(key, axis=axis)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\indexing.py", line 60, in _get_label
return self.obj._xs(label, axis=axis, copy=True)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\frame.py", line 2281, in xs
loc = self.index.get_loc(key)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\index.py", line 755, in get_loc
return self._engine.get_loc(key)
File "index.pyx", line 130, in pandas.index.IndexEngine.get_loc (pandas\index.c:3238)
File "index.pyx", line 147, in pandas.index.IndexEngine.get_loc (pandas\index.c:3085)
File "index.pyx", line 293, in pandas.index.Int64Engine._check_type (pandas\index.c:5393)
File "C:\Python27\CQPython\cqpython-read-only\src\clearquest\pandas\core\series.py", line 523, in __hash__
raise TypeError('unhashable type')
TypeError: unhashable type
Here is some example data that is showing up in JIRA for every single issue created in the comments field:
Issue 1:
0 NaN
1 Found that the Delta would leak packets receiv...
2 The Delta will reset everytime you disconnect ...
3 NaN
4 It's supposed to get logged when CP takes to l...
5 Upon upgrading the IDS via the BioMed menu, th...
6 Upon upgrading the IDS via the BioMed menu, th...
7 Upon upgrading the IDS via the BioMed menu, th...
8 Increased Fusion heap size and the SCC1 Initia...
9 Recheck using build 142+, after Matt delivers ...
10 When using WPA2, there is EAPOL key echange go...
11 When using WPA2, there is EAPOL key echange go...
12 NaN
13 NaN
14 NaN
...
I only want each issue to have its own string values, and not the index numbers or the NaN to show up like this:
Issue 1:
Issue 2: Found that the Delta would leak packets receiv...
Issue 3: The Delta will reset everytime you disconnect ...
...
The problem is in the use of iterrows.
From the documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html), df.iterrows() iterates over DataFrame rows as (index, Series) pairs.
What you need is to replace row.index with row[0], which gives you the index of the row you are iterating over:
for row in df.iterrows():
    summ = str(df.loc[row[0], 'Summary'])[:30]
    comments = str(df.loc[row[0], 'Comments'])

By the way, I think you don't need iterrows at all:

for row_index in df.index:
    summ = str(df.loc[row_index, 'Summary'])[:30]
    comments = str(df.loc[row_index, 'Comments'])
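A sketch of a variant that also handles the NaN comments visible in the example output, assuming jira is an authenticated client as in the question:

import pandas as pd

df = pd.read_csv('C:\\Python27\\scripts\\export.csv')

# iterate the two columns directly; no index or Series repr leaks into the strings
for summ, comments in zip(df['Summary'], df['Comments']):
    if pd.isnull(comments):  # the NaN rows seen in the output above
        comments = ''
    jira.create_issue(project={'key': 'DEL'}, summary=str(summ)[:30],
                      description=str(comments), issuetype={'name': 'Bug'})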
