Can the performance of index-iteration code be optimized with multithreading? - python

I have 2 dataframes. The first one (900 lines) contains the corrections that have been applied to deals. The second one (140 000 lines) contains the list of deals with the corrected values. What I am trying to do is put the old values back.
To link the corrected deals to the corrections I have to compare a number of attributes. The correction dataframe (900 lines) holds the old and the new value for each corrected attribute. However, each correction may concern a different attribute, so for every possible corrected attribute (in the correction dataframe) I compare the new value with the old one to check whether that attribute was corrected. If it was, I put the old value back. Note that a single correction can apply to several deals that share the same data in the identifying fields.
Finally, I create a new column in the deals dataframe (140 000 lines) holding a boolean that is True when a deal has been un-corrected, False otherwise.
My code right now is quite crude; I wanted to factor it a bit, but the iteration got in the way. It runs, but it has to go through 900 × 140 000 combinations. On a quad-core VM with 12 GB of RAM it finished in about 1 h 20 min.
How can I improve performance? Is multithreading usable in this case?
Here is a shortened version of my code; just imagine that the number of if statements is 10 times bigger.
def CreationUniqueid(dataframe, Correction):
    # creating a new column to mark the rows we un-corrected
    dataframe['Modified'] = 0
    # getting the link between the corrections and deals
    b = 0
    for index in Correction.index:
        b += 1  # just a counter to see the progression of the program
        c = 0
        for index1 in dataframe.index:
            c += 1
            a = 0
            print('Handling correction ' + str(b) + ' and deal ' + str(c))  # printing progress
            if (Correction.get_value(index, 'BO Branch Code') == dataframe.get_value(index1, 'Wings Branch') and
                    Correction.get_value(index, 'Profit Center') == dataframe.get_value(index1, 'Profit Center')):
                print('level 1 success')
                if ((Correction.get_value(index, 'BO Trade Id') == dataframe.get_value(index1, 'Trade Id') and Correction.get_value(index, 'BO Trade Id') != '#') or
                        (Correction.get_value(index, 'Emetteur Trade Id') == dataframe.get_value(index1, 'Emetteur Trade Id') == '#' and Correction.get_value(index, 'BO Trade Id') == dataframe.get_value(index1, 'Trade Id'))):
                    print('identification success')
                    # putting the dataframe back to the old state; we need the data in the bad shape
                    # so the computer can learn what a bad trade is and what is normal
                    if Correction.get_value(index, 'Risk Category') != Correction.get_value(index, 'Risk Category _M') and Correction.get_value(index, 'Risk Category _M') != '':
                        dataframe.set_value(index1, 'Risk Category', Correction.get_value(index, 'Risk Category'))
                        a = 1
                        print('Corr 1')
                    if Correction.get_value(index, 'CEC Ricos') != Correction.get_value(index, 'CEC Ricos _M') and Correction.get_value(index, 'CEC Ricos _M') != '':
                        dataframe.set_value(index1, 'CEC Ricos', Correction.get_value(index, 'CEC Ricos'))
                        a = 1
                        print('Corr 2')
                    if Correction.get_value(index, 'Product Line') != Correction.get_value(index, 'Product Line _M') and Correction.get_value(index, 'Product Line _M') != '':
                        dataframe.set_value(index1, 'Product Line Code Ricos', Correction.get_value(index, 'Product Line'))
                        a = 1
                        print('corr 3')
    return dataframe
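
For scale, the nested loops run roughly 126 million inner-loop iterations (900 × 140 000), each doing several get_value calls, and multithreading is unlikely to help much because the loop is pure Python and largely GIL-bound. The usual fix is to vectorise the matching with a merge. Below is a minimal sketch of that idea, under assumptions: the column names are copied from the code above, the '#' / 'Emetteur Trade Id' special case is deliberately left out, and revert_corrections is a hypothetical helper, not the original function.

import pandas as pd

def revert_corrections(deals, corrections):
    # (old-value column in corrections, corrected "_M" column in corrections,
    #  target column in the deals dataframe) -- names copied from the code above
    attributes = [
        ('Risk Category', 'Risk Category _M', 'Risk Category'),
        ('CEC Ricos', 'CEC Ricos _M', 'CEC Ricos'),
        ('Product Line', 'Product Line _M', 'Product Line Code Ricos'),
    ]

    deals = deals.copy()
    deals['Modified'] = False

    # Keep only the correction columns we need and prefix them so the merge
    # cannot clash with deal columns of the same name.
    corr_cols = ['BO Branch Code', 'Profit Center', 'BO Trade Id']
    corr_cols += [c for pair in attributes for c in pair[:2]]
    corr = corrections[corr_cols].add_prefix('corr_')

    # Link every deal to its matching corrections in one pass (assumes the deals
    # index is the default RangeIndex, so reset_index() yields an 'index' column).
    merged = deals.reset_index().merge(
        corr,
        left_on=['Wings Branch', 'Profit Center', 'Trade Id'],
        right_on=['corr_BO Branch Code', 'corr_Profit Center', 'corr_BO Trade Id'],
        how='inner',
    )

    for old_col, new_col, deal_col in attributes:
        # This attribute was corrected if old and new differ and "_M" is not empty.
        mask = ((merged['corr_' + old_col] != merged['corr_' + new_col])
                & (merged['corr_' + new_col] != ''))
        hit = merged.loc[mask].drop_duplicates(subset='index', keep='last')
        deals.loc[hit['index'], deal_col] = hit['corr_' + old_col].values
        deals.loc[hit['index'], 'Modified'] = True

    return deals

Because the matching is done once inside the merge rather than 126 million times in Python, this kind of rewrite typically runs in seconds rather than hours on data of this size.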

Related

Getting rid of duplicates in text strings in new column by identifying differences in original data and using this difference in new column

I have a somewhat more general question about the process of working with text data.
My goal is to create unique short labels/descriptions for products from existing long descriptions, based on specific rules.
In practice it looks like this: I take the data you see in the Existing_Long_Description column and, based on rules and loops in Python, I change it into the data in the New_Label column.
Existing_Long_Description | New_Label
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm blac | Edge protector BLACK RNG 1-2MM L=10M
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm red | Edge protector RED RNG 1-2MM L=10M
This shortening to the desired format is not a problem. The problem starts when checking the uniqueness of the New_Label column, because the shortening can create duplicates:
Existing_Long_Description | New_Label
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1 | Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6 | Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=8 | Draw-in collet chuck dm 1-10MM
To solve this I need to add some distinguishing factor to my New_Label column, based on the differences in Existing_Long_Description.
The problem is that the duplicates may involve an unknown number of articles.
I thought about the following process:
Identify the duplicates in Existing_Long_Description - if there are duplicates there, I know those can't be resolved in New_Label.
Identify the duplicates in the New_Label column that are not in the selection above - I know these can be resolved.
For those that can be resolved, run some kind of distinguisher to find where they differ and extract this difference into another column, to decide later what to add to New_Label.
Does what I want to do make sense? As I am doing this for the first time I am wondering: is there a way of working that you would recommend?
I have read some articles like this: Find the similarity metric between two strings,
and elsewhere on Stack Overflow I read about https://docs.python.org/3/library/difflib.html,
which I am planning to use, but it still feels rather inefficient to me, and maybe someone here can help me.
Thanks!
A relational database would be a good fit for this problem,
with appropriate UNIQUE indexes configured.
But let's assume you're going to solve it in memory, rather than on disk.
Assume that get_longs() will read long descriptions from your data source.
dup long descriptions
Avoid processing like this:
longs = []
for long in get_longs():
    if long not in longs:
        longs.append(long)
Why?
It is quadratic, running in O(N^2) time, for N descriptions.
Each in takes linear O(N) time,
and we perform N such operations on the list.
To process 1000 parts would regrettably require a million operations.
Instead, take care to use an appropriate data structure, a set:
longs = set(get_longs())
That's enough to quickly de-dup the long descriptions, in linear time.
dup short descriptions
Now the fun begins.
You explained that you already have a function that works like a champ.
But we must adjust its output in the case of collisions.
class Dedup:
    def __init__(self):
        self.short_to_long = {}

    def get_shorts(self):
        """Produces unique short descriptions."""
        for long in sorted(set(get_longs())):
            short = summary(long)
            orig_long = self.short_to_long.get(short)
            if orig_long:
                short = self.deconflict(short, orig_long, long)
            self.short_to_long[short] = long
            yield short

    def deconflict(self, short, orig_long, long):
        """Produces a novel short description that won't conflict with existing ones."""
        for word in sorted(set(long.split()) - set(orig_long.split())):
            short += f' {word}'
            if short not in self.short_to_long:  # Yay, we win!
                return short
        # Boo, we lose.
        raise ValueError(f"Sorry, can't find a good description: {short}\n{orig_long}\n{long}")
The expression that subtracts one set from another is answering the question,
"What words in long would help me to uniqueify this result?"
Now of course, some of them may have already been used
by other short descriptions, so we take care to check for that.
Given several long descriptions
that collide in the way you're concerned about,
the 1st one will have the shortest description,
and ones appearing later will tend to have longer "short" descriptions.
The approach above is a bit simplistic, but it should get you started.
It does not, for example, distinguish between "claw hammer" and "hammer claw".
Both strings survive the initial uniqueification,
but then there are no more words to help with deconflicting.
For your use case the approach above is likely to be "good enough".
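
For illustration, here is a small usage sketch; get_longs() and summary() are toy stand-ins assumed for the example, not part of the answer above.

def get_longs():
    return [
        'Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1',
        'Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6',
    ]

def summary(long):
    # Toy shortener: keep only the first four words, which makes the two inputs collide.
    return ' '.join(long.split()[:4])

print(list(Dedup().get_shorts()))
# ['Draw-in collet chuck ER', 'Draw-in collet chuck ER L=6']
# The second entry is deconflicted by appending 'L=6', the only word that differs.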

pandas limited resample / windowed replacing of multiple rows of values

I am working with weather data for PV modules. The irradiance dataset (a regular time series with 1-second data) I've been given shows an issue that occurs often in this field: occasionally a zero value shows up when it shouldn't (during daytime), e.g. due to an instrument or data-writing error.
My solution that worked in the past is as below:
df['PoA_corr'] = df['PoA'].replace(0,np.nan).resample('1s').mean().interpolate(method='linear',axis=0).ffill().bfill()
where PoA is the original column (with issues) and PoA_corr is my attempt at correcting the errors.
However, as can be seen from the image below, not all of the erroneous points have been corrected appropriately: the point where PoA == 0 is preceded and followed by 1-4 points that are also incorrect, i.e. the whole "V" shape in the data, not just the single point == 0, needs to be replaced by a line interpolated between the pre- and post-"V" points.
I have a few ideas in mind, but am stumped as to which is best, and which would be most pythonic (or able to be made so).
Get a list of indices where PoA == 0, look 3 seconds (rows) above, then replace 6-8 s (= 6-8 rows) of data. I manage to find the list of points during the day using between_time, and the point above using a timedelta, yet I don't know how to replace/overwrite the subsequent 6-8 rows (or interpolate between point "X-4" and "X+4", where X is the location where PoA == 0). The df is large (2.3 GB), so I'm loath to use a for loop on this. At the moment, my list of datetimes where PoA == 0 during the day is found as:
df.between_time('09:00','16:00').loc[df['PoA']==0]['datetime']
Do some form of moving window on the data, so that if any value within the window == 0, interpolate between the first and last values of the window (a sketch of this idea follows below). Here I'm stumped as to how that could be done.
Is the solution to be found within pandas, or are numpy or pure python advisable?
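
Here is a minimal sketch of the moving-window idea, assuming df has a 1-second DatetimeIndex and a 'PoA' column as above; the ±4-row expansion width and the 09:00-16:00 daytime filter are assumptions to tune, and this is an illustration rather than a drop-in solution.

import pandas as pd

window = 4  # rows to blank on each side of a zero; tune to the width of the "V"

# True wherever a suspicious daytime zero occurs (roughly 09:00-16:00, as in the question).
hours = df.index.to_series().dt.hour
zero_mask = (df['PoA'] == 0) & hours.between(9, 16)

# Expand the mask so it also covers the neighbouring points of the "V".
bad = (zero_mask.astype(int)
                .rolling(2 * window + 1, center=True, min_periods=1)
                .max()
                .astype(bool))

# Blank the whole "V" and interpolate across it in time, keeping the original column intact.
df['PoA_corr'] = df['PoA'].mask(bad).interpolate(method='time').ffill().bfill()

Because everything here is a vectorised column operation, it avoids a Python-level loop over the 2.3 GB frame.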

Debug Exact Cover Pentominoes, Wikipedia example incomplete? OR... I'm misunderstanding something (includes code)

The Problem:
I've implemented Knuth's DLX "dancing links" algorithm for Pentominoes in two completely different ways and am still getting incorrect solutions. The trivial Wikipedia example works OK (https://en.wikipedia.org/wiki/Knuth%27s_Algorithm_X#Example), but more complex examples fail.
Debugging the full Pentominoes game requires a table with almost 2,000 entries, so I came up with a greatly reduced puzzle (pictured below) that is still complex enough to show the errant behavior.
Below is my trivial 3x5 Pentominoes example, using only 3 pieces to place. I can work through the algorithm with pen and paper, and sure enough my code is doing exactly what I told it to, but on the very first step, it nukes all of my rows! When I look at the connectedness, the columns certainly do seem to be OK. So clearly I'm misunderstanding something.
The Data Model:
This is the trivial solution I'm trying to get DLX to solve:
Below is the "moves" table, which encodes all the valid moves that the 3 pieces can make. (I filter out moves where a piece would create a hole size not divisible by 5)
The left column is the encoded move; for example, the first row is
piece "L", placed at 0,0, then rotated ONE 90-degree turn counter-clockwise.
Then comes a vertical bar (|) delimiter.
The first 3 columns are the selector bits for which piece I'm referring to.
Since "l" is the first piece (of only 3), it has a 1 in the leftmost column.
The next 15 columns are 1 bit for every spot on a 3x5 pentominoes board.
l_0,0_rr10|100111100001000000
l_0,1_rr10|100011110000100000
l_1,1_rr10|100000000111100001
l_0,0_rr01|100111101000000000
l_0,1_rr01|100011110100000000
l_1,0_rr01|100000001111010000
l_0,0_rr30|100100001111000000
l_1,0_rr30|100000001000011110
l_1,1_rr30|100000000100001111
l_0,1_rr01|100000010111100000
l_1,0_rr01|100000000001011110
l_1,1_rr01|100000000000101111
t_0,1_rr00|010011100010000100
t_0,0_rr10|010100001110010000
t_0,1_rr20|010001000010001110
t_0,2_rr30|010000010011100001
y_1,0_rr00|001000000100011110
y_1,1_rr00|001000000010001111
y_1,0_rr01|001000000100011110
y_1,1_rr01|001000000010001111
y_0,0_rr20|001111100010000000
y_0,1_rr20|001011110001000000
y_0,0_rr01|001111100100000000
y_0,1_rr01|001011110010000000
An Example Failure:
The First Move kills all the rows of my array (disregarding the numeric header row and column)
Following the wikipedia article cited earlier, I do:
Look for minimum number of bits set in a column
4 is the minimum count, and column 2 is the leftmost column with that count
I choose the first row intersecting with column 2, which is row 13.
Column 2 and row 13 will be added to the columns and rows to be "covered" (aka deleted)
Now I look at row 13 and find all intersecting columns: 2, 5, 6, 7, 11 & 16
Now I look at all the rows that intersect with any 1 in any of those columns - THIS seems to be the problematic step - that criterion selects ALL 24 data rows for removal.
Since the board is empty, the system thinks it has found a valid solution.
Here's a picture of my pen-and-paper version of the algorithm:
Given the requests for code, I'm now attaching it. The comments at the top explain where to look.
Here's the code:
https://gist.github.com/ttennebkram/8bd27adece6fb3a5cd1bdb4ab9b51166
Second Test
There's a second 3x5 puzzle I thought of, but it hits the same problem the first example has. For the record, the second 3x5 is:
# Tiny Set 2: 3x5
# u u v v v
# u p p p v
# u u p p v
The issue you're seeing with your hand-run of the algorithm is that a matrix with no rows is not a solution. You need to eliminate all the columns; just getting rid of the rows is a failure. Your example run still has 12 columns left to cover, so it's not a success.
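For reference, here is a minimal set-based Algorithm X sketch (plain sets, not dancing links, and not your code) written only to make that termination condition explicit: success means no columns are left to cover, while running out of rows while columns remain is a dead end that forces backtracking.

def algorithm_x(columns, rows, partial=None):
    """columns: set of column ids still to cover.
    rows: dict mapping row id -> set of column ids that row covers.
    Yields lists of row ids forming exact covers."""
    if partial is None:
        partial = []
    if not columns:                 # success: every column is covered
        yield list(partial)
        return
    # Choose the column with the fewest candidate rows (Knuth's heuristic).
    col = min(columns, key=lambda c: sum(1 for r in rows.values() if c in r))
    for rid, rcols in list(rows.items()):
        if col not in rcols:
            continue
        partial.append(rid)
        # Cover: remove every column this row touches and every row that clashes
        # with it; note that the columns shrink too, not just the rows.
        new_columns = columns - rcols
        new_rows = {r: cs for r, cs in rows.items() if not (cs & rcols)}
        yield from algorithm_x(new_columns, new_rows, partial)
        partial.pop()

# The Wikipedia example instance (columns 1-7) yields its single cover B, D, F:
rows = {
    'A': {1, 4, 7}, 'B': {1, 4}, 'C': {4, 5, 7},
    'D': {3, 5, 6}, 'E': {2, 3, 6, 7}, 'F': {2, 7},
}
print(list(algorithm_x(set(range(1, 8)), rows)))   # [['B', 'D', 'F']]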
Your exact cover implementation seems OK for the reduced instance, but the plotter was broken. I fixed it by changing
boardBitmap = fullBitmap[12:]
to
boardBitmap = fullBitmap[3:]
in plotMoveToBoard_np, since there are only three pieces in the reduced instance.
EDIT: there's also a problem with how you generate move names. There are distinct moves with the same name. There are also duplicate moves (which don't affect correctness but do affect performance). I changed
- g_rowNames.append(rowName)
+ g_rowNames.append(str(hash(str(finalBitmask))))
and 3x20 starts working as it should. (That's not a great way to generate the names, because in theory the hashes could collide, but it's one line.)
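If you want to rule out even the theoretical hash collision, one alternative (a sketch, not the change above) is to key the name on the full bitmask string and hand out one short sequential id per distinct bitmask; finalBitmask below stands for the same value passed to str(hash(str(finalBitmask))).

nameByBitmask = {}

def rowNameFor(finalBitmask):
    """Same name for the same bitmask, distinct names for distinct bitmasks."""
    key = str(finalBitmask)
    if key not in nameByBitmask:
        nameByBitmask[key] = 'row%d' % len(nameByBitmask)
    return nameByBitmask[key]

# ... then: g_rowNames.append(rowNameFor(finalBitmask))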

Dividing Pandas DataFrame rows into similar time-based groups

I have a DataFrame with the results of a marathon race, where each row represents a runner and the columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and Place (int). A scatter plot of start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there's at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and every time we find a gap assign a new heat number.
This can be done easily using the diff-compare-cumsum pattern:
import pandas as pd

starts = df["Start Time"].copy()
starts = starts.sort_values()   # the older in-place Series.sort() is gone; sort_values() is the current equivalent
dt = starts.diff()
heat = (dt > pd.Timedelta(seconds=60)).cumsum()
heat = heat.sort_index()
which correctly picks up the 16 (apparent) groups, here coloured by heat number:
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats. So you can a priori select the number of classes to be 16.
import matplotlib.pyplot as plt
import jenks   # drewda's gist from the link above, saved locally as jenks.py

k = jenks.getJenksBreaks(full['Start Num'].tolist(), 16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15,15))
[plt.axvline(x) for x in k]
From your sample data, we see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins:

fill missing values in python array

Using: Python 2.7.1 on Windows
Hello, I fear this question has a very simple answer, but I just can't seem to find an appropriate and efficient solution (I have limited Python experience). I am writing an application that downloads historic weather data from a third-party API (Wunderground). The thing is, sometimes there's no value for a given hour (e.g., we have 20 degrees at 5 AM, no value for 6 AM, and 21 degrees at 7 AM). I need exactly one temperature value for every hour, so I figured I could just fit the data I do have and evaluate the points I'm missing (using SciPy's polyfit). That's all cool; however, I am having problems getting my program to detect whether the list has missing hours and, if so, insert the missing hour and calculate a temperature value. I hope that makes sense.
My attempt at handling the hours and temperatures list is the following:
from scipy import polyfit

# Evaluate a simple quadratic function
def tempcal(array, x):
    return array[0]*x**2 + array[1]*x + array[2]

# Sample data; note it has missing hours.
# My final hrs list should look like range(25), with matching temperatures at every point
hrs = [1,2,3,6,9,11,13,14,15,18,19,20]
temps = [14.0,14.5,14.5,15.4,17.8,21.3,23.5,24.5,25.5,23.4,21.3,19.8]

# Fit coefficients
coefs = polyfit(hrs, temps, 2)

# Cycle control
i = 0
done = False
while not done:
    # It has a missing hour, insert it and calculate a temperature
    if hrs[i] != i:
        hrs.insert(i, i)
        temps.insert(i, tempcal(coefs, i))
    # We are done, leave now
    if i == 24:
        done = True
    i += 1
I can see why this isn't working: the program will eventually try to access indexes out of range for the hrs list. I am also aware that modifying a list's length inside a loop has to be done carefully. Surely enough, I am either not being careful enough or overlooking a simpler solution altogether.
In my googling attempts to help myself I came across pandas (the library), but I feel I can solve this problem without it (and I would rather do so).
Any input is greatly appreciated. Thanks a lot.
The error happens when i is equal to 21: that means the twenty-second value in the list, but there are only 21 values.
In the future I recommend using PyCharm with breakpoints for debugging, or a try-except construction.
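
For what it's worth, here is a minimal sketch of one way to rewrite the loop so it cannot index past the end of the list, keeping the asker's whole-dataset fit; it reuses hrs, temps, tempcal and coefs from the question.

# Walk the target hours 0..24; because we go in increasing order, a missing
# hour always belongs at list index == hour.
for hour in range(25):
    if hour not in hrs:
        hrs.insert(hour, hour)
        temps.insert(hour, tempcal(coefs, hour))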
Not sure I would recommend this way of interpolating values; I would have used the closest points surrounding the missing values rather than the whole dataset. But using numpy, your proposed way is fairly straightforward.
import numpy as np

hrs = np.array(hrs)
temps = np.array(temps)

newTemps = np.empty((25))
newTemps.fill(-300)  # just fill it with some invalid data; temperatures don't go this low, so it should be safe

# fill in original values
newTemps[hrs - 1] = temps

# Get indices of missing values
missing = np.nonzero(newTemps == -300)[0]

# Calculate and insert missing values
newTemps[missing] = tempcal(coefs, missing + 1)
