Check numpy array values between 2 columns of a pandas DataFrame - python

I have a numpy array of size (352, 5) and also a pandas DataFrame.
The objective is to check whether the first and second columns of the numpy array fall within the ranges defined by two pairs of DataFrame columns, and if so, to get the index of that particular row and then do something with it.
Example:
stats = [[    246    1102    1678    2214  172182]
         [    678    1005    1688    2214 3528850]
         [   1031     241      17      23     331]]
df:
 hpos  hpos_end  vpos  vpos_end
  245       298  1100      1124
  672       685  1000      1010
Result:
stats[0] matches the very first row of df, since 246 lies between 245 and 298 and 1102 lies between 1100 and 1124; the same goes for the next element of stats. I want to obtain the index of the row it lies in (if it does).
My approach till now:
for x, y, w, h, area in stats[:]:
    for row in df.itertuples():
        if x in range(int(df['HPOS'][row.Index]), int(df['HPOS_END'][row.Index])) and y in range(
                int(df['VPOS'][row.Index]), int(df['VPOS_END'][row.Index])):
            desired_index = row.Index
Is there a faster/more optimal way to achieve this? Iterating over a DataFrame is the last thing I would like to do.
Note: both the numpy array and the df are already sorted in ascending order, based on the first two columns for the numpy array and on ['hpos', 'vpos'] for the df.
Any help will be appreciated, Thank you :)
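One possible vectorized sketch (not from the original post), assuming the df columns are named hpos, hpos_end, vpos, vpos_end as in the example and keeping range()'s exclusive upper bound:

import numpy as np
import pandas as pd

stats = np.array([[246, 1102, 1678, 2214, 172182],
                  [678, 1005, 1688, 2214, 3528850],
                  [1031, 241, 17, 23, 331]])
df = pd.DataFrame({'hpos': [245, 672], 'hpos_end': [298, 685],
                   'vpos': [1100, 1000], 'vpos_end': [1124, 1010]})

# Compare every stats row against every df row at once via broadcasting.
x = stats[:, 0][:, None]                       # shape (n_stats, 1)
y = stats[:, 1][:, None]

in_h = (x >= df['hpos'].values) & (x < df['hpos_end'].values)   # (n_stats, n_rows)
in_v = (y >= df['vpos'].values) & (y < df['vpos_end'].values)

stat_idx, row_idx = np.nonzero(in_h & in_v)    # matching (stats row, df row) pairs
print(list(zip(stat_idx.tolist(), row_idx.tolist())))           # [(0, 0), (1, 1)]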

Related

Numpy array for linear regression

I have a numpy array that has 3 rows and 3 columns, like this:
100 200 300
233 699 999
566 655 895
and I want to create a numpy array like this for my linear regression:
100 200 300 1
233 699 999 1
566 655 895 1
This is my code:
X = np.hstack((x[:,0], x[:,1], x[:,2], np.ones(x.shape[0])))
Please, how can I edit my code to get my target?
You don't need to break x apart inside hstack.
You should also pass a 2-dimensional shape to ones, because 1-dimensional arrays are not what you want to stack here.
X = np.hstack((x, np.ones((x.shape[0], 1))))
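For example, applying the corrected call to the 3x3 array from the question:

import numpy as np

x = np.array([[100, 200, 300],
              [233, 699, 999],
              [566, 655, 895]])

# np.ones((x.shape[0], 1)) is a column vector, so hstack appends it as a 4th column
X = np.hstack((x, np.ones((x.shape[0], 1))))
print(X)
# [[100. 200. 300.   1.]
#  [233. 699. 999.   1.]
#  [566. 655. 895.   1.]]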

Calculate and add up Data from a reference dataframe

I have two pandas dataframes. The first one contains some data that I want to multiply with the second dataframe, which is a reference table.
So in my example I want to get a new column in df1 for every column in my reference table - where each entry is the sum of the products across that row.
Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562 * 0.003115) + (1332 * 0.000267) = 658
In Excel VBA I iterated through both tables and did it that way - but it took very long. I've read that pandas is way better for this without iterating.
df1 = pd.DataFrame({'Index': ['205368421', '206321177', '202574796', '200212811', '204376114'],
                    'L1.09A': [1205, 1253, 1852, 1452, 1653],
                    'L1.10A': [7562, 7400, 5700, 4586, 4393],
                    'L1.10C': [1332, 0, 700, 1180, 290]})
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
                    'R21 17': [0.526499, 0.003115, 0.000267],
                    'R21 26': [0.458956, 0, 0.001819]})
    Index  L1.09A  L1.10A  L1.10C
205368421    1205    7562    1332
206321177    1253    7400       0
202574796    1852    5700     700
200212811    1452    4586    1180
204376114    1653    4393     290

WorkerID    R21 17    R21 26
L1.09A    0.526499  0.458956
L1.10A    0.003115         0
L1.10C    0.000267  0.001819
I want this:
    Index  L1.09A  L1.10A  L1.10C  R21 17  R21 26
205368421    1205    7562    1332     658     555
206321177    1253    7400       0     683     575
202574796    1852    5700     700     993     851
200212811    1452    4586    1180     779     669
204376114    1653    4393     290     884     759
I would be okay with some hints. Someone told me this might be matrix multiplication, so .dot() would be helpful. Is this the right direction?
Edit:
I have now done the following:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df_multiplied = df1_sorted @ df2_sorted
This works with my example dataframes, but not with my real dataframes.
My real ones have these dimensions: df1_sorted (10429, 69) and df2_sorted (69, 18).
It should work, but my df_multiplied is full of NaN.
Alright, I did it!
I had to replace all nan with 0.
So the final solution is:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df1_sorted = df1_sorted.fillna(0)
df2_sorted = df2_sorted.fillna(0)
df_multiplied = df1_sorted @ df2_sorted
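As a follow-up on the .dot() hint above, a minimal sketch using the example frames from the question (rounding to whole numbers to match the desired output); the column alignment via reindex is an assumption about how the real frames line up:

import pandas as pd

df1 = pd.DataFrame({'Index': ['205368421', '206321177', '202574796', '200212811', '204376114'],
                    'L1.09A': [1205, 1253, 1852, 1452, 1653],
                    'L1.10A': [7562, 7400, 5700, 4586, 4393],
                    'L1.10C': [1332, 0, 700, 1180, 290]}).set_index('Index')
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
                    'R21 17': [0.526499, 0.003115, 0.000267],
                    'R21 26': [0.458956, 0, 0.001819]}).set_index('WorkerID')

# Align df2's rows to df1's columns (filling any gaps with 0), then matrix-multiply.
products = df1.dot(df2.reindex(df1.columns).fillna(0))

# Join the rounded products back onto df1 to get the layout shown in the question.
result = df1.join(products.round().astype(int))
print(result)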

Keep one line out of many that starts from one point

I am working on a project with OpenCV and python but stuck on this small problem.
I have the end-point coordinates of many lines stored in a list. Sometimes more than one line is detected starting from a single point. Among these lines, I want to keep the line of shortest length and eliminate all the others, so that my image contains no point from which more than one line is drawn.
My variable which stores the information (coordinates of both end-points) of all the lines initially detected is as follows:
var = [[Line1_EndPoint1, Line1_EndPoint2],
       [Line2_EndPoint1, Line2_EndPoint2],
       [Line3_EndPoint1, Line3_EndPoint2],
       [Line4_EndPoint1, Line4_EndPoint2],
       [Line5_EndPoint1, Line5_EndPoint2]]
where LineX_EndPointY (line number "X", endpoint "Y" of that line) is of the form [x, y], where x and y are the coordinates of that point in the image.
Can someone suggest how to solve this problem?
You can modify the way the line data is stored; if you do, please explain your data structure and how it is created.
Example of such data:
[[[551, 752], [541, 730]],
[[548, 738], [723, 548]],
[[285, 682], [226, 676]],
[[416, 679], [345, 678]],
[[345, 678], [388, 674]],
[[249, 679], [226, 676]],
[[270, 678], [388, 674]],
[[472, 650], [751, 473]],
[[751, 473], [716, 561]],
[[731, 529], [751, 473]]]
Python code would be appreciated.
A Numpy solution
The same result as in my first answer can be achieved based solely
on Numpy.
First define 2 functions:
Compute the square of the length of a line:
def sqLgth(line):
    p1, p2 = line
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2
Convert a vector (a 1D array) to a column array (a 2D array with a single column):
def toColumn(tbl):
    return tbl.reshape(-1, 1)
Both will be used later.
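Note that the steps below assume var is a Numpy array of shape (10, 2, 2); if you start from the nested list shown in the question, convert it first:

import numpy as np

# Same data as in the question, as an ndarray of shape (10, 2, 2)
var = np.array([[[551, 752], [541, 730]],
                [[548, 738], [723, 548]],
                [[285, 682], [226, 676]],
                [[416, 679], [345, 678]],
                [[345, 678], [388, 674]],
                [[249, 679], [226, 676]],
                [[270, 678], [388, 674]],
                [[472, 650], [751, 473]],
                [[751, 473], [716, 561]],
                [[731, 529], [751, 473]]])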
Then proceed as follows:
Get the number of lines:
lineNo = var.shape[0]
Generate line indices (the content of the lineInd column in the points array, which will be created later):
id = np.repeat(np.arange(lineNo), 2)
Generate "origin indicators" (1 - start, 2 - end), to ease analysis
of any intermediate printouts:
origin = np.tile(np.array([1, 2]), lineNo)
Compute line lengths (the content of lgth column in points):
lgth = np.repeat([ sqLgth(line) for line in var ], 2)
Create a list of points with some additional data (consecutive
columns contain origin, lineInd, x, y and lgth):
points = np.hstack([toColumn(origin), toColumn(id),
                    var.reshape(-1, 2), toColumn(lgth)])
Compute the "criterion array" to sort:
r = np.core.records.fromarrays(points[:, 2:].transpose(),
                               names='x, y, lgth')
Sort points (by x, y and lgth):
points = points[r.argsort()]
Compute "inverse unique indices" to points:
_, inv = np.unique(points[:,2:4], axis=0, return_inverse=True)
Shift inv by 1 position:
rInv = np.roll(inv,1)
Will be used in the next step, to get the previous element.
Generate a list of line indices to drop:
toDrop = points[[ i for i in range(2 * lineNo)
                  if inv[i] == rInv[i] ], 1]
Row indices (in points array) are indices of repeated points (elements
in inv equal to the previous element).
The column index (1) selects the lineInd column.
The whole result (toDrop) is a list of indices of "owning" lines
(containing the repeated points).
Generate the result: var stripped from lines selected in the
previous step:
var2 = np.delete(var, toDrop, axis=0)
To print the reduced list of lines, you can run:
for line in var2:
    print(f'{line[0]}, {line[1]}')
The result is:
[551 752], [541 730]
[548 738], [723 548]
[345 678], [388 674]
[249 679], [226 676]
[731 529], [751 473]
To fully comprehend how this code works:
execute each step separately,
print the result,
compare it with printouts from previous steps.
Sometimes it is instructive to print even parts of instructions separately, e.g. var.reshape(-1, 2), which converts your var (of shape (10, 2, 2)) into a 2D array of points (each row is a point).
The whole result is of course just the same as in my first answer, but since you wrote you had little experience with Pandas, you can now compare both methods and see the cases where Pandas lets you do something more easily and intuitively.
Good examples are e.g. sorting by some columns or finding duplicated rows.
In Pandas it is a matter of a single instruction with suitable parameters, whereas in Numpy you have to use more instructions and know various details and tricks to do just the same.
I decided that it is easier to write a solution based on Pandas.
The reasons are that:
I can use column names (the code is more readable),
the Pandas API is more powerful, although it works more slowly than "pure" Numpy.
Proceed as follows:
Convert var to a DataFrame:
lines = pd.DataFrame(var.reshape(10, 4), columns=pd.MultiIndex.from_product(
    (['P1', 'P2'], ['x', 'y'])))
The initial part of lines is:
    P1        P2
     x    y    x    y
0  551  752  541  730
1  548  738  723  548
2  285  682  226  676
3  416  679  345  678
Compute the square of the length of each line:
lines[('', 'lgth')] = (lines[('P1', 'x')] - lines[('P2', 'x')]) ** 2 \
                    + (lines[('P1', 'y')] - lines[('P2', 'y')]) ** 2
lines.columns = lines.columns.droplevel()
I deliberately "stopped" at squares of length, because it is
enough for comparing lengths (computing the root would not change the
result of the comparison).
Note also that the first level of the MultiIndex on columns was needed
only to express the columns of interest more easily. Further on it will
not be needed, so I dropped it.
This time I put the full content of lines:
     x    y    x    y    lgth
0  551  752  541  730     584
1  548  738  723  548   66725
2  285  682  226  676    3517
3  416  679  345  678    5042
4  345  678  388  674    1865
5  249  679  226  676     538
6  270  678  388  674   13940
7  472  650  751  473  109170
8  751  473  716  561    8969
9  731  529  751  473    3536
The next step is to compute points DataFrame, where all points (start and
end of each line) are in the same columns, along with the (square) length of
the corresponding line:
points = pd.concat([lines.iloc[:, [0, 1, 4]],
                    lines.iloc[:, [2, 3, 4]]], keys=['P1', 'P2'])\
         .sort_values(['x', 'y', 'lgth']).reset_index(level=1)
Now I used iloc to specify columns (the first time for the starting points
and the second for the ending points).
To make this DataFrame easier to read, I passed keys to include "origin
indicators", and then I sorted the rows.
The content is:
   level_1    x    y    lgth
P2       5  226  676     538
P2       2  226  676    3517
P1       5  249  679     538
P1       6  270  678   13940
P1       2  285  682    3517
P1       4  345  678    1865
P2       3  345  678    5042
P2       4  388  674    1865
P2       6  388  674   13940
P1       3  416  679    5042
P1       7  472  650  109170
P2       0  541  730     584
P1       1  548  738   66725
P1       0  551  752     584
P2       8  716  561    8969
P2       1  723  548   66725
P1       9  731  529    3536
P2       9  751  473    3536
P1       8  751  473    8969
P2       7  751  473  109170
Note e.g. that point 226, 676 occurs twice. The first time it occurred
in line 5 and the second in line 2 (indices in var and lines).
To find indices of rows to drop, run:
toDrop = points[points.duplicated(subset=['x', 'y'])]\
         .level_1.reset_index(drop=True)
To easier comprehend how this code works, run it step by step and
inspect results of each step.
The result is:
0 2
1 3
2 6
3 8
4 7
Name: level_1, dtype: int64
Note that the left column above is only the index (it doesn't matter).
The real information is in the right column (values).
To show lines that should be left, run:
result = lines.drop(toDrop)
getting:
     x    y    x    y   lgth
0  551  752  541  730    584
1  548  738  723  548  66725
4  345  678  388  674   1865
5  249  679  226  676    538
9  731  529  751  473   3536
The above result doesn't contain e.g.:
line 2, as point 226, 676 also occurred in line 5,
line 3, as point 345, 678 also occurred in line 4.
Just these lines (2 and 3) have been dropped, because each is longer
than the other line sharing the same point (see earlier partial results).
Maybe this is enough; but if you need to drop the "duplicated" lines from
var (the original Numpy array) and save the result in another
variable, run:
var2 = np.delete(var, toDrop, axis=0)

Selecting rows with lowest values based on a combination of two columns in pandas

I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve the rows which have the smallest time associated with x and y. Basically, for every element of y, I want to find the row with the smallest time value, but I want to exclude those that have time 0.0 (which happens when x has the same value as y).
So, for example, the fastest way to get to y-0 is by starting from x-225, and so on; therefore it could be the case that x repeats itself, but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
Up until now I have tried groupby, but I'm only getting the same x as y.
print(df.groupby('id_y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("id_y")["time"].idxmin()]
Just to point out one thing: I'm open to options other than groupby; if there are other ways, that is very good.
So you first need to remove the rows with time equal to 0 by boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Similar alternative with filter by query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')

Calculate new column as the mean of other columns in pandas [duplicate]

This question already has answers here:
Row-wise average for a subset of columns with missing values
(3 answers)
Closed 5 years ago.
I have this data frame and I would like to calculate a new column as the mean of salary_1, salary_2 and salary_3:
df = pd.DataFrame({
    'salary_1': [230, 345, 222],
    'salary_2': [235, 375, 292],
    'salary_3': [210, 385, 260]
})
salary_1 salary_2 salary_3
0 230 235 210
1 345 375 385
2 222 292 260
How can I do it in pandas in the most efficient way? Actually I have many more columns and I don't want to write this one by one.
Something like this:
salary_1 salary_2 salary_3 salary_mean
0 230 235 210 (230+235+210)/3
1 345 375 385 ...
2 222 292 260 ...
Use .mean(). By specifying the axis you can take the average across each row or each column.
df['average'] = df.mean(axis=1)
df
returns
salary_1 salary_2 salary_3 average
0 230 235 210 225.000000
1 345 375 385 368.333333
2 222 292 260 258.000000
If you only want the mean of a few you can select only those columns. E.g.
df['average_1_3'] = df[['salary_1', 'salary_3']].mean(axis=1)
df
returns
salary_1 salary_2 salary_3 average_1_3
0 230 235 210 220.0
1 345 375 385 365.0
2 222 292 260 241.0
An easy way to solve this problem is shown below:
col = df.loc[:, "salary_1":"salary_3"]
where "salary_1" is the name of the start column and "salary_3" the name of the end column.
df['salary_mean'] = col.mean(axis=1)
df
This will give you a new column in the dataframe that shows the mean of the selected columns.
This approach is really helpful when you have a large set of columns, or when you need to operate only on some selected columns rather than on all of them.
