How to (efficiently) check if any two elements differ by 10 - python

Suppose I have the following column in a pandas dataframe:
x
1 589
2 354
3 692
4 474
5 739
6 731
7 259
8 723
9 497
10 48
Note: I've changed the indexing to start at 1 (see test data).
I simply wish to test whether the difference between any two of the items in this column is less than 10.
Final result: no two elements should have an absolute difference of less than 10.
Goal:
x
1 589
2 354
3 692
4 474
5 749 #
6 731
7 259
8 713 #
9 497
10 48
Perhaps this could be done using:
for index, row in df.iterrows():
However, that has not been successful thus far...
Given I'm looking to perform element-wise comparisons, I don't expect staggering speed...
Test Data:
import pandas as pd
stim_numb = 10  # number of rows of test data
df = pd.DataFrame(index=range(1, stim_numb + 1), columns=['x'])
df['x'] = [589, 354, 692, 474, 739, 731, 259, 723, 497, 48]

One solution might be to sort the list, then compare consecutive items, adding 10 whenever the difference is too small, and then sorting the list back to the original order (if necessary).
from operator import itemgetter
lst = [589, 354, 692, 474, 739, 731, 259, 723, 497, 48]
# temp is list as pairs of original index and value, sorted by value
temp = [[i, e] for i, e in sorted(enumerate(lst), key=itemgetter(1))]
last = None
for item in temp:
    while last is not None and item[1] < last + 10:
        item[1] += 10
    last = item[1]
# sort the list back to original order using the index from the tuple
lst_new = [e for i, e in sorted(temp, key=itemgetter(0))]
Result is [589, 354, 692, 474, 759, 741, 259, 723, 497, 48]
This is using plain Python lists; maybe it can be done more elegantly in Pandas or Numpy.
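If you prefer Numpy, a minimal sketch of the same idea (argsort, adjust in sorted order, then undo the sort) could look like this; it assumes the same example data and reproduces the result above:
import numpy as np
lst = np.array([589, 354, 692, 474, 739, 731, 259, 723, 497, 48])
order = np.argsort(lst)              # positions that would sort the values
vals = lst[order].copy()
# bump each value by 10 until it clears its (already adjusted) predecessor
for i in range(1, len(vals)):
    while vals[i] < vals[i - 1] + 10:
        vals[i] += 10
lst_new = np.empty_like(lst)
lst_new[order] = vals                # undo the sort to restore the original order
print(lst_new)                       # [589 354 692 474 759 741 259 723 497  48]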

Keep one line out of many that starts from one point

I am working on a project with OpenCV and Python but am stuck on this small problem.
I have the endpoint coordinates of many lines stored in a list. Sometimes more than one line is detected starting from a single point. From among these lines, I want to keep the line of shortest length and eliminate all the other lines, so that my image will contain no point from which more than one line is drawn.
My variable which stores the information (coordinates of both endpoints) of all the lines initially detected is as follows:
var = [[Line1_EndPoint1, Line1_EndPoint2],
       [Line2_EndPoint1, Line2_EndPoint2],
       [Line3_EndPoint1, Line3_EndPoint2],
       [Line4_EndPoint1, Line4_EndPoint2],
       [Line5_EndPoint1, Line5_EndPoint2]]
where LineX_EndPointY (line number "X", endpoint "Y" of that line) is of the form [x, y], with x and y being the coordinates of that point in the image.
Can someone suggest how to solve this problem?
You can modify the way the line data is stored. If you do, please explain your data structure and how it is created.
Example of such data:
[[[551, 752], [541, 730]],
[[548, 738], [723, 548]],
[[285, 682], [226, 676]],
[[416, 679], [345, 678]],
[[345, 678], [388, 674]],
[[249, 679], [226, 676]],
[[270, 678], [388, 674]],
[[472, 650], [751, 473]],
[[751, 473], [716, 561]],
[[731, 529], [751, 473]]]
Python code would be appreciated.
A Numpy solution
The same result as in my first answer can be achieved using Numpy alone.
First define 2 functions:
Compute the square of the length of a line:
def sqLgth(line):
    p1, p2 = line
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2
Convert a vector (a 1D array) to a column array (a 2D array with a single column):
def toColumn(tbl):
    return tbl.reshape(-1, 1)
Both will be used later.
Then proceed as follows:
Get the number of lines:
lineNo = var.shape[0]
Generate line indices (the content of the lineInd column in the points array, which will be created later):
id = np.repeat(np.arange(lineNo), 2)
Generate "origin indicators" (1 - start, 2 - end), to ease analysis
of any intermediate printouts:
origin = np.tile(np.array([1, 2]), lineNo)
Compute line lengths (the content of lgth column in points):
lgth = np.repeat([ sqLgth(line) for line in var ], 2)
Create a list of points with some additional data (consecutive
columns contain origin, lineInd, x, y and lgth):
points = np.hstack([toColumn(origin), toColumn(id),
                    var.reshape(-1, 2), toColumn(lgth)])
Compute the "criterion array" to sort:
r = np.core.records.fromarrays(points[:, 2:].transpose(),
                               names='x, y, lgth')
Sort points (by x, y and lgth):
points = points[r.argsort()]
Compute "inverse unique indices" to points:
_, inv = np.unique(points[:,2:4], axis=0, return_inverse=True)
Shift inv by 1 position:
rInv = np.roll(inv,1)
It will be used in the next step, to get the previous element.
Generate a list of line indices to drop:
toDrop = points[[ i for i in range(2 * lineNo)
                  if inv[i] == rInv[i] ], 1]
The row indices (in the points array) are the indices of repeated points (elements of inv equal to the previous element).
The column index (1) specifies the lineInd column.
The whole result (toDrop) is a list of indices of the "owning" lines (those containing the repeated points).
Generate the result: var stripped of the lines selected in the previous step:
var2 = np.delete(var, toDrop, axis=0)
To print the reduced list of lines, you can run:
for line in var2:
    print(f'{line[0]}, {line[1]}')
The result is:
[551 752], [541 730]
[548 738], [723 548]
[345 678], [388 674]
[249 679], [226 676]
[731 529], [751 473]
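For reference, a consolidated, runnable version of the steps above could look like this (a sketch: it reuses the sqLgth and toColumn helpers defined earlier and builds var from the example data in the question):
import numpy as np
var = np.array([[[551, 752], [541, 730]], [[548, 738], [723, 548]],
                [[285, 682], [226, 676]], [[416, 679], [345, 678]],
                [[345, 678], [388, 674]], [[249, 679], [226, 676]],
                [[270, 678], [388, 674]], [[472, 650], [751, 473]],
                [[751, 473], [716, 561]], [[731, 529], [751, 473]]])
lineNo = var.shape[0]
origin = np.tile(np.array([1, 2]), lineNo)           # 1 - start, 2 - end
id = np.repeat(np.arange(lineNo), 2)                 # owning line of each point
lgth = np.repeat([ sqLgth(line) for line in var ], 2)
points = np.hstack([toColumn(origin), toColumn(id),
                    var.reshape(-1, 2), toColumn(lgth)])
r = np.core.records.fromarrays(points[:, 2:].transpose(), names='x, y, lgth')
points = points[r.argsort()]                         # sort by x, y, lgth
_, inv = np.unique(points[:, 2:4], axis=0, return_inverse=True)
rInv = np.roll(inv, 1)
toDrop = points[[ i for i in range(2 * lineNo) if inv[i] == rInv[i] ], 1]
var2 = np.delete(var, toDrop, axis=0)
for line in var2:
    print(f'{line[0]}, {line[1]}')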
To fully comprehend how this code works:
execute each step separately,
print the result,
compare it with printouts from previous steps.
Sometimes it is instructive to print even parts of instructions separately, e.g. var.reshape(-1, 2), which converts your var (of shape (10, 2, 2)) into a 2D array of points (each row is a point).
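For instance, printing the first few rows of that expression shows the lines flattened into consecutive point rows:
print(var.reshape(-1, 2)[:4])
# [[551 752]
#  [541 730]
#  [548 738]
#  [723 548]]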
The whole result is of course the same as in my first answer, but since you wrote that you have little experience with Pandas, you can now compare both methods and see the cases where Pandas lets you do something more easily and more intuitively.
Good examples are sorting by some columns or finding duplicated rows.
In Pandas each is a matter of a single instruction with suitable parameters, whereas in Numpy you have to use more instructions and know various details and tricks to do the same thing.
I decided that it is easier to write a solution based on Pandas. The reasons are that:
I can use column names (the code is more readable),
the Pandas API is more powerful, although it works more slowly than "pure" Numpy.
Proceed as follows:
Convert var to a DataFrame:
lines = pd.DataFrame(var.reshape(10, 4), columns=pd.MultiIndex.from_product(
    (['P1', 'P2'], ['x', 'y'])))
The initial part of lines is:
    P1        P2
     x    y    x    y
0  551  752  541  730
1  548  738  723  548
2  285  682  226  676
3  416  679  345  678
Compute the square of the length of each line:
lines[('', 'lgth')] = (lines[('P1', 'x')] - lines[('P2', 'x')]) ** 2\
                    + (lines[('P1', 'y')] - lines[('P2', 'y')]) ** 2
lines.columns = lines.columns.droplevel()
I deliberately "stopped" at the squares of the lengths, because comparing squared lengths is enough (computing the root will not change the result of the comparison).
Note also that the first level of the MultiIndex on columns was needed only to make it easier to refer to the columns of interest. Further on it will not be needed, so I dropped it.
This time I put the full content of lines:
     x    y    x    y    lgth
0  551  752  541  730     584
1  548  738  723  548   66725
2  285  682  226  676    3517
3  416  679  345  678    5042
4  345  678  388  674    1865
5  249  679  226  676     538
6  270  678  388  674   13940
7  472  650  751  473  109170
8  751  473  716  561    8969
9  731  529  751  473    3536
The next step is to compute the points DataFrame, where all points (start and end of each line) are in the same columns, along with the (squared) length of the corresponding line:
points = pd.concat([lines.iloc[:, [0, 1, 4]],
                    lines.iloc[:, [2, 3, 4]]], keys=['P1', 'P2'])\
    .sort_values(['x', 'y', 'lgth']).reset_index(level=1)
Here I used iloc to select columns (the first time for starting points and the second for ending points).
To make this DataFrame easier to read, I passed keys, to include "origin indicators", and then I sorted the rows.
The content is:
    level_1    x    y    lgth
P2        5  226  676     538
P2        2  226  676    3517
P1        5  249  679     538
P1        6  270  678   13940
P1        2  285  682    3517
P1        4  345  678    1865
P2        3  345  678    5042
P2        4  388  674    1865
P2        6  388  674   13940
P1        3  416  679    5042
P1        7  472  650  109170
P2        0  541  730     584
P1        1  548  738   66725
P1        0  551  752     584
P2        8  716  561    8969
P2        1  723  548   66725
P1        9  731  529    3536
P2        9  751  473    3536
P1        8  751  473    8969
P2        7  751  473  109170
Note e.g. that point 226, 676 occurs twice. The first time it occurred
in line 5 and the second in line 2 (indices in var and lines).
To find indices of rows to drop, run:
toDrop = points[points.duplicated(subset=['x', 'y'])]\
    .level_1.reset_index(drop=True)
To understand more easily how this code works, run it step by step and inspect the results of each step.
The result is:
0 2
1 3
2 6
3 8
4 7
Name: level_1, dtype: int64
Note that the left column above is only the index (it doesn't matter).
The real information is in the right column (the values).
To show the lines that should be kept, run:
result = lines.drop(toDrop)
getting:
     x    y    x    y   lgth
0  551  752  541  730    584
1  548  738  723  548  66725
4  345  678  388  674   1865
5  249  679  226  676    538
9  731  529  751  473   3536
The above result doesn't contain, e.g.:
line 2, as point 226, 676 also occurred in line 5,
line 3, as point 345, 678 also occurred in line 4.
These lines (2 and 3) have been dropped because they are longer than the respective lines mentioned second (see the earlier partial results).
Maybe this is enough, or if you need to drop the "duplicated" lines from
var (the original Numpy array), and save the result in another
variable, run:
var2 = np.delete(var, toDrop, axis=0)

Pandas groupby 7 days

How do I sum my data counts by week, and if the last week is not yet complete, calculate the average "normalization"?
Let's say these are my lists:
days = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
Thanks
Here is a solution with Pandas:
1) Create dataframe:
df = pd.DataFrame({'days':days,'counts': counts})
df['week'] = df.days // 7 # adding week column
2) Calculate sum and mean by week, then produce the normalized sum:
d2 = df.groupby('week').agg({'counts':['sum','mean']}) # sum and mean per week
d2['norm_sum'] = d2[('counts','mean')] * 7
3) output:
print (d2)
     counts                   norm_sum
        sum         mean
week
0     10102  1683.666667  11785.666667
1     10158  1451.142857  10158.000000
2      8787  1255.285714   8787.000000
3      4695   670.714286   4695.000000
4      3814   953.500000   6674.500000
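If the normalization should only be applied to incomplete weeks rather than to every week, a possible variant (a sketch, not part of the original answer) is:
import numpy as np
d2 = df.groupby('week')['counts'].agg(['sum', 'mean', 'size'])
# keep the plain sum for full weeks, extrapolate incomplete weeks to 7 days
d2['norm_sum'] = np.where(d2['size'] < 7, d2['mean'] * 7, d2['sum'])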
I do not know how to use pandas in this case, but I would do it using built-in Python modules in the following way:
from collections import defaultdict
from statistics import mean
days = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
weeks = [d//7 for d in days]
avg_count = int(mean(counts))
weeks = weeks + [weeks[-1]]*(len(weeks)%7) # pad weeks to multiply of 7
counts = counts + [avg_count]*(len(counts)%7) # pad counts to multiply of 7
count_per_week = defaultdict(int)
for w, c in zip(weeks, counts):
    count_per_week[w] += c
print(dict(count_per_week))
Output:
{0: 10102, 1: 10158, 2: 8787, 3: 4695, 4: 3814}
Note that I assume the average is a reasonable filler value, which does not always have to hold true. defaultdict(int), when asked for a non-existing key, will set that key's value to int(), which is 0.
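A tiny illustration of that defaultdict behaviour:
from collections import defaultdict
d = defaultdict(int)
d['missing'] += 1    # no KeyError: the absent key starts at int(), i.e. 0
print(d['missing'])  # 1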
This is my approach:
The data:
counts = [1839,1334,2241,2063,1216,1409,1614,1860,1298,1140,1122,2153,971,1650,1835,889,653,484,2078,1198,426,684,910,701,851,360,763,402,1853,400,1159]
To array:
import numpy as np
counts = np.array(counts)
Reshape: (thanks to https://stackoverflow.com/users/4427777/daniel-f)
def shapeshifter(num_col, my_array):
    return np.lib.pad(my_array, (0, num_col - len(my_array) % num_col),
                      'constant', constant_values=0).reshape(-1, num_col)
data = shapeshifter(7, counts)
array([[1839, 1334, 2241, 2063, 1216, 1409, 1614],
       [1860, 1298, 1140, 1122, 2153,  971, 1650],
       [1835,  889,  653,  484, 2078, 1198,  426],
       [ 684,  910,  701,  851,  360,  763,  402],
       [1853,  400, 1159,    0,    0,    0,    0]])
To dataframe with zeros converted to NaN:
df = pd.DataFrame(data)
df[df == 0] = np.nan
Fill missing values with the mean value of the month:
df.fillna(counts.mean())
      0     1     2            3            4            5            6
0  1839  1334  2241  2063.000000  1216.000000  1409.000000  1614.000000
1  1860  1298  1140  1122.000000  2153.000000   971.000000  1650.000000
2  1835   889   653   484.000000  2078.000000  1198.000000   426.000000
3   684   910   701   851.000000   360.000000   763.000000   402.000000
4  1853   400  1159  1211.483871  1211.483871  1211.483871  1211.483871
Get the sum by row or week:
df.sum(axis=1)
0 11716.0
1 10194.0
2 7563.0
3 4671.0
4 3412.0
dtype: float64

Is there a way to remove similar (numerical) elements from array in python

I have a function which produces an array as such:
[ 14 48 81 111 112 113 114 148 179 213 247 279 311 313 314 344 345 346]
which corresponds to data values where a curve crosses the x axis. As the data is imperfect, it generates false positives, where my output array has elements that are all very close to each other, e.g. [111 112 113 114]. I need to remove the false positives from this array but still retain the initial positive around which the false positives appear. Basically I need my function to produce an array more like
[ 14 48 81 112 148 179 213 247 279 313 345]
where the false positives from imperfect data have been removed.
Here is a possible approach:
arr = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]
def filter_arr(arr, offset):
    filtered_nums = set()
    for num in sorted(arr):
        # Check if there are any "similar" numbers already found
        if any(num + x in filtered_nums for x in range(-offset, offset + 1)):
            continue
        else:
            filtered_nums.add(num)
    return list(sorted(filtered_nums))
Then you can apply the filtering with any offset that you think makes the most sense.
filter_arr(arr, offset=5)
Output: [14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 344]
This can do it:
# arr is the array you want, num is the allowed difference between them
def check(arr, num):
    for r in arr:
        for c in arr[:]:  # iterate over a copy so removing from arr is safe
            if c != r and abs(r - c) < num + 1:
                arr.remove(c)
    return arr
yourarray = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]
print(check(yourarray, 1))
I would do it the following way:
Conceptually:
Let's say that the "ten" of a number is how many whole tens fit into the given number, for example the ten of 111 is 11, the ten of 247 is 24 and the ten of 250 is 25, and so on.
For our data: if a number with a given ten already exists, discard the new number.
Code:
data = [14,48,81,111,112,113,114,148,179,213,247,279,311,313,314,344,345,346]
cleaned = [i for inx,i in enumerate(data) if not i//10 in [j//10 for j in data[:inx]]]
print(cleaned) #[14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 344]
Note that 10 is only an example value that you can replace with another one; a bigger value means more elements will potentially be removed. Keep in mind that a specific trait of this solution is that some close pairs of values (for 10, for example 109 and 110) will be treated as different and will both stay in the output list, so you need to check whether that is a problem in your use case.
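A quick illustration of that boundary effect (with divisor 10):
print(109 // 10, 110 // 10)  # 10 11 -> close values in different tens both stay
print(110 // 10, 119 // 10)  # 11 11 -> values up to 9 apart collapse into one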

Making pair by 2 rows and slicing from each row

I have a dataframe like:
    x1    y1    x2    y2
0  149  2653  2152  2656
1  149  2465  2152  2468
2  149  1403  2152  1406
3  149  1215  2152  1218
4  170  2692  2170  2695
5  170  2475  2170  2478
6  170  1413  2170  1416
7  170  1285  2170  1288
I need to pair every two rows of the dataframe by index, i.e. [0,1], [2,3], [4,5], [6,7], etc.,
and extract x1, y1 from the first row of the pair and x2, y2 from the second row of the pair, and similarly for each pair of rows.
Sample Output:
[[149,2653,2152,2468],[149,1403,2152,1218],[170,2692,2170,2478],[170,1413,2170,1288]]
Please feel free to ask if it's not clear.
So far I have tried grouping by pairs and the shift operation,
but I didn't manage to make the paired records.
Python solution:
Select values of columns by positions to lists:
a = df[['x2', 'y2']].iloc[1::2].values.tolist()
b = df[['x1', 'y1']].iloc[0::2].values.tolist()
And then zip and join together in list comprehension:
L = [y + x for x, y in zip(a, b)]
print (L)
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
Thank you, @user2285236, for another solution:
L = np.concatenate([df.loc[::2, ['x1', 'y1']], df.loc[1::2, ['x2', 'y2']]], axis=1).tolist()
Pure pandas solution:
First DataFrameGroupBy.shift within each group of 2 rows:
df[['x2', 'y2']] = df.groupby(np.arange(len(df)) // 2)[['x2', 'y2']].shift(-1)
print (df)
    x1    y1      x2      y2
0  149  2653  2152.0  2468.0
1  149  2465     NaN     NaN
2  149  1403  2152.0  1218.0
3  149  1215     NaN     NaN
4  170  2692  2170.0  2478.0
5  170  2475     NaN     NaN
6  170  1413  2170.0  1288.0
7  170  1285     NaN     NaN
Then remove NaNs rows, convert to int and then to list:
print (df.dropna().astype(int).values.tolist())
[[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
[170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]
Here's one solution via numpy.hstack. Note it is natural to feed numpy arrays directly to pd.DataFrame, since this is how Pandas stores data internally.
import numpy as np
arr = np.hstack((df[['x1', 'y1']].values[::2],
                 df[['x2', 'y2']].values[1::2]))
res = pd.DataFrame(arr)
print(res)
     0     1     2     3
0  149  2653  2152  2468
1  149  1403  2152  1218
2  170  2692  2170  2478
3  170  1413  2170  1288
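If the nested list from the question is needed instead of a DataFrame, the same array can simply be converted:
print(arr.tolist())
# [[149, 2653, 2152, 2468], [149, 1403, 2152, 1218],
#  [170, 2692, 2170, 2478], [170, 1413, 2170, 1288]]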
Here's a solution using a custom iterator based on iterrows(), but it's a bit clunky:
import pandas as pd
df = pd.DataFrame(columns=['x1', 'y1', 'x2', 'y2'], data=
    [[149, 2653, 2152, 2656], [149, 2465, 2152, 2468], [149, 1403, 2152, 1406], [149, 1215, 2152, 1218],
     [170, 2692, 2170, 2695], [170, 2475, 2170, 2478], [170, 1413, 2170, 1416], [170, 1285, 2170, 1288]])
def iter_oddeven_pairs(df):
    row_it = df.iterrows()
    try:
        while True:
            _, row = next(row_it)
            yield row[0:2]
            _, row = next(row_it)
            yield row[2:4]
    except StopIteration:
        pass
print(pd.concat([pair for pair in iter_oddeven_pairs(df)]))

Unsure why program similar to bubble-sort is not working

I have been working on a programming challenge, problem here, which basically states:
Given integer array, you are to iterate through all pairs of neighbor
elements, starting from beginning - and swap members of each pair
where first element is greater than second.
And then return the number of swaps made and the checksum of the final answer. My program seemingly does both the sorting and the checksum, but my final answer is off for everything but the test input they gave.
So: 1 4 3 2 6 5 -1
Results in the correct output: 3 5242536 with my program.
But something like:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
Results in: 39 1291223 when the correct answer is 39 3485793.
Here's what I have at the moment:
# Python 2.7
def check_sum(data):
    data = [str(x) for x in str(data)[::]]
    numbers = len(data)
    result = 0
    for number in range(numbers):
        result += int(data[number])
        result *= 113
        result %= 10000007
    return(str(result))
def bubble_in_array(data):
    numbers = data[:-1]
    numbers = [int(x) for x in numbers]
    swap_count = 0
    for x in range(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            temp = numbers[x+1]
            numbers[x+1] = numbers[x]
            numbers[x] = temp
            swap_count += 1
    raw_number = int(''.join([str(x) for x in numbers]))
    print('%s %s') % (str(swap_count), check_sum(raw_number))
bubble_in_array(raw_input().split())
Does anyone have any idea where I am going wrong?
The issue is with your way of calculating Checksum. It fails when the array has numbers with more than one digit. For example:
2 96 7439 92999 240 70748 3 842 74 706 4 86 7 463 1871 7963 904 327 6268 20955 92662 278 57 8 5912 724 70916 13 388 1 697 99666 6924 2 100 186 37504 1 27631 59556 33041 87 9 45276 -1
You are calculating Checksum for 2967439240707483842747064867463187179639043276268209559266227857859127247091613388169792999692421001863750412763159556330418794527699666
digit by digit while you should calculate the Checksum of [2, 96, 7439, 240, 70748, 3, 842, 74, 706, 4, 86, 7, 463, 1871, 7963, 904, 327, 6268, 20955, 92662, 278, 57, 8, 5912, 724, 70916, 13, 388, 1, 697, 92999, 6924, 2, 100, 186, 37504, 1, 27631, 59556, 33041, 87, 9, 45276, 99666]
The fix:
# Python 2.7
def check_sum(data):
    result = 0
    for number in data:
        result += number
        result *= 113
        result %= 10000007
    return(result)
def bubble_in_array(data):
    numbers = [int(x) for x in data[:-1]]
    swap_count = 0
    for x in xrange(len(numbers)-1):
        if numbers[x] > numbers[x+1]:
            numbers[x+1], numbers[x] = numbers[x], numbers[x+1]
            swap_count += 1
    print('%d %d') % (swap_count, check_sum(numbers))
bubble_in_array(raw_input().split())
More notes:
To swap two variables in Python, you don't need to use a temp variable, just use a, b = b, a.
In Python 2.x, use xrange instead of range.
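If you are on Python 3 rather than 2.7, the same fix might look like this (a sketch: input() replaces raw_input(), range replaces xrange, and print is a function):
def check_sum(data):
    result = 0
    for number in data:
        result = (result + number) * 113 % 10000007
    return result
def bubble_in_array(data):
    numbers = [int(x) for x in data[:-1]]
    swap_count = 0
    for x in range(len(numbers) - 1):
        if numbers[x] > numbers[x + 1]:
            numbers[x + 1], numbers[x] = numbers[x], numbers[x + 1]
            swap_count += 1
    print(swap_count, check_sum(numbers))
bubble_in_array(input().split())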
