Difference between all values in CSV columns - Python

First of all, I'm sorry for my poor level of Python. My problem is the following:
1) I have read a lot of answers on this site, but nothing works for me
(np.abs(a.values[:,np.newaxis]-a2.values), plain np.diff(), and a lot of other ways).
2) I have a CSV file of the following form:
A 12 43 51 10 74
B 14 32 31 27 23
C 13 62 13 33 82
D 18 31 73 70 42
and I need the difference between all pairs of values within each row, so for row A:
A: 12-43 12-51 12-10 12-74 ... 43-12 43-51 43-10 43-74 ...
and likewise for rows B, C, and D with their own values.
After that I need to square each of these differences (12-43, 12-51, 12-10, 12-74, ...).
I understand that pandas works well with tables, but how can I do this?
And if you can, please tell me which way I should go to cut off the 10% most extreme results. Thank you very much for your attention and future help.

pandas doesn't easily accept arrays as elements, so numpy is a good help here.
First compute all the pairwise differences, row by row (axis=1):
import io
import numpy as np
import pandas as pd

data="""
A 12 43 51 10 74
B 14 32 31 27 23
C 13 62 13 33 82
D 18 31 73 70 42
"""
df = pd.read_table(io.StringIO(data), header=None, index_col=0, sep=' ')
all_differences = np.apply_along_axis(lambda x: np.subtract.outer(x, x).ravel(), axis=1, arr=df)
Then sort each row to prepare the cut-off:
all_differences.sort(axis=1)
and select the values to keep, discarding the n zeros resulting from L[i]-L[i] (after sorting, they occupy the middle positions n(n-1)/2 to n(n+1)/2 - 1 of each row) and trimming 5% off each tail. Finally, square what remains:
n = df.shape[1]
cutoff = [i for i in range(n*n) if n*n*5//100 <= i < n*(n-1)//2 or n*(n+1)//2 <= i < n*n*95//100]
res = all_differences[:, cutoff]**2
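As a quick sanity check (a sketch, assuming the df, all_differences, cutoff, and res built above), each row starts with n*n = 25 pairwise differences and keeps 17 of them after the n zeros and the two 5% tails are discarded:
print(all_differences.shape)  # (4, 25)
print(len(cutoff))            # 17 indices kept per row
print(res.shape)              # (4, 17)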

I suggest using numpy. For computing the differences you can do:
>>> import numpy
>>> a = numpy.array([[12, 43, 51, 10, 74],
...                  [14, 32, 31, 27, 23],
...                  [13, 62, 13, 33, 82],
...                  [18, 31, 73, 70, 42]])
>>> difference_matrix = numpy.repeat(a, a.shape[-1], axis=-1) - numpy.tile(a, a.shape[-1])
>>> difference_matrix
array([[  0, -31, -39,   2, -62,  31,   0,  -8,  33, -31,  39,   8,   0,
         41, -23,  -2, -33, -41,   0, -64,  62,  31,  23,  64,   0],
       [  0, -18, -17, -13,  -9,  18,   0,   1,   5,   9,  17,  -1,   0,
          4,   8,  13,  -5,  -4,   0,   4,   9,  -9,  -8,  -4,   0],
       [  0, -49,   0, -20, -69,  49,   0,  49,  29, -20,   0, -49,   0,
        -20, -69,  20, -29,  20,   0, -49,  69,  20,  69,  49,   0],
       [  0, -13, -55, -52, -24,  13,   0, -42, -39, -11,  55,  42,   0,
          3,  31,  52,  39,  -3,   0,  28,  24,  11, -31, -28,   0]])
If you want to square the result, applying ** 2 to the matrix squares each element:
>>> difference_matrix ** 2
array([[   0,  961, 1521,    4, 3844,  961,    0,   64, 1089,  961, 1521,
          64,    0, 1681,  529,    4, 1089, 1681,    0, 4096, 3844,  961,
         529, 4096,    0],
       [   0,  324,  289,  169,   81,  324,    0,    1,   25,   81,  289,
           1,    0,   16,   64,  169,   25,   16,    0,   16,   81,   81,
          64,   16,    0],
       [   0, 2401,    0,  400, 4761, 2401,    0, 2401,  841,  400,    0,
        2401,    0,  400, 4761,  400,  841,  400,    0, 2401, 4761,  400,
        4761, 2401,    0],
       [   0,  169, 3025, 2704,  576,  169,    0, 1764, 1521,  121, 3025,
        1764,    0,    9,  961, 2704, 1521,    9,    0,  784,  576,  121,
         961,  784,    0]])
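For the same result without repeat/tile, broadcasting is an alternative worth knowing (a sketch on the same array a as above); it keeps the differences as a 5x5 matrix per row instead of a flat vector:
>>> diff = a[:, :, None] - a[:, None, :]  # diff[r, i, j] = a[r, i] - a[r, j], shape (4, 5, 5)
>>> (diff ** 2).reshape(len(a), -1)       # the same 25 squared values per row as above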

Related

Matrix row masking at different indices

I have a 2-D array in which I want to find the max value per row and then the next max value that is not within +/- n of the previous value. For example, I have the following matrix:
results =
array([[ 33, 108, 208,  96,  96, 112,  18, 208,  33, 323,  60,  42],
       [ 51,   6,  39, 112, 160, 144, 342, 195,  27, 136,  42,  54],
       [ 12, 176, 266, 162,  45,  70, 156, 198, 143,  56, 342, 130],
       [ 22, 288, 304, 162,  21, 238, 156, 126, 165,  91, 144, 130],
       [342, 120,  36,  51,  10, 128, 156, 272,  32,  98, 192, 288]])
row_max_index = results.argmax(1)
row_max_index #show max index
array([ 9,  6, 10,  2,  0])
Now I'd like to get the next max value not within say +/- 2 of the current max.
Here is what I have but it feels sloppy:
maskIndx = np.c_[row_max_index-2, row_max_index-1, row_max_index, row_max_index+1, row_max_index+2]%12
maskIndx #show windowed index
array([[ 7,  8,  9, 10, 11],
       [ 4,  5,  6,  7,  8],
       [ 8,  9, 10, 11,  0],
       [ 0,  1,  2,  3,  4],
       [10, 11,  0,  1,  2]])
results[np.meshgrid(np.arange(5), np.arange(5))[1], maskIndx] = 0 #uses array indexing
results #show results
array([[ 33, 108, 208,  96,  96, 112,  18,   0,   0,   0,   0,   0],
       [ 51,   6,  39, 112,   0,   0,   0,   0,   0, 136,  42,  54],
       [  0, 176, 266, 162,  45,  70, 156, 198,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0, 238, 156, 126, 165,  91, 144, 130],
       [  0,   0,   0,  51,  10, 128, 156, 272,  32,  98,   0,   0]])
next_max_index = results.argmax(1)
array([2, 9, 2, 5, 7])
Any ideas on doing this faster through indexing/windowing?
You can create a mask around the computed max indices by taking an array of column indices and subtracting the row max indices, and then use the masked-array API to recompute the argmax:
import numpy as np
results = ...  # the matrix from the question
row_max_index = results.argmax(axis=1, keepdims=True)
indices = np.arange(results.shape[1])
mask = np.abs(indices - row_max_index) <= 2
out = np.ma.array(results, mask=mask).argmax(axis=1)
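For reference, a runnable version on the matrix from the question (note that, unlike the modulo-12 window in the question, this mask does not wrap around the row edges, so the last row yields 11 rather than 7):
import numpy as np

results = np.array([[ 33, 108, 208,  96,  96, 112,  18, 208,  33, 323,  60,  42],
                    [ 51,   6,  39, 112, 160, 144, 342, 195,  27, 136,  42,  54],
                    [ 12, 176, 266, 162,  45,  70, 156, 198, 143,  56, 342, 130],
                    [ 22, 288, 304, 162,  21, 238, 156, 126, 165,  91, 144, 130],
                    [342, 120,  36,  51,  10, 128, 156, 272,  32,  98, 192, 288]])

row_max_index = results.argmax(axis=1, keepdims=True)  # keepdims requires numpy >= 1.22
mask = np.abs(np.arange(results.shape[1]) - row_max_index) <= 2
print(np.ma.array(results, mask=mask).argmax(axis=1))  # [ 2  9  2  5 11]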

How to handle negative values before CNN

I am going to generate my train and test datasets from an image representing volume values. The image contains values in the range -25 to 75, and I want to ignore the negative ones in the preprocessing step. Could anyone tell me how I should treat negative values? Is there a way to set the negative values to zero (or no-data) without changing the positive pixel values?
I can't advise on whether this should be done, but if you want to turn all your negative values into 0 you can use tf.maximum:
import tensorflow as tf
x = tf.random.uniform((10, 10), -25, 75, dtype=tf.int32)
<tf.Tensor: shape=(10, 10), dtype=int32, numpy=
array([[ 57, -11, 48, 43, 29, 21, 15, 42, -9, 12],
[ 18, 67, -9, -21, 6, 27, 50, -1, 72, 51],
[ 2, 22, 70, 49, 50, -10, 67, 4, 59, -10],
[-13, 39, 60, -20, -15, -17, 51, 73, -23, 21],
[ 28, 8, 48, 66, -13, -3, 44, 35, 23, 45],
[-24, 30, 16, 25, 34, -13, 24, 49, 50, -10],
[-24, 25, -1, 35, 67, 45, 27, 6, 65, 4],
[ 20, -5, 41, -14, -10, 40, 21, 69, 13, 14],
[ 53, -2, 6, 0, -13, 28, 11, -11, 29, 17],
[ 15, 40, 61, 56, 3, 56, 12, -12, 19, 0]])>
Here's the magic:
tf.maximum(x, 0)
<tf.Tensor: shape=(10, 10), dtype=int32, numpy=
array([[57,  0, 48, 43, 29, 21, 15, 42,  0, 12],
       [18, 67,  0,  0,  6, 27, 50,  0, 72, 51],
       [ 2, 22, 70, 49, 50,  0, 67,  4, 59,  0],
       [ 0, 39, 60,  0,  0,  0, 51, 73,  0, 21],
       [28,  8, 48, 66,  0,  0, 44, 35, 23, 45],
       [ 0, 30, 16, 25, 34,  0, 24, 49, 50,  0],
       [ 0, 25,  0, 35, 67, 45, 27,  6, 65,  4],
       [20,  0, 41,  0,  0, 40, 21, 69, 13, 14],
       [53,  0,  6,  0,  0, 28, 11,  0, 29, 17],
       [15, 40, 61, 56,  3, 56, 12,  0, 19,  0]])>
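If you'd rather make the selection explicit, tf.where gives the same result (a sketch on the same tensor x; for float tensors, tf.nn.relu(x) is another equivalent):
y = tf.where(x < 0, 0, x)  # 0 where x is negative, x elsewhere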

Vectorizing assignment of a tensor to a slice in PyTorch

I'm trying to vectorize a slice assignment of the form
for i in range(a.shape[1]):
    for j in range(a.shape[2]):
        a[:,i,j,:,i:i+b.shape[2],j:j+b.shape[3]] = b
where b itself is an array. This is because the nested Python loop is too inefficient and is taking up most of the runtime. Is there a way to do this?
For a simpler case, consider the following:
for i in range(a.shape[1]):
    a[:,i,:,i:i+b.shape[2]] = b
This is what b and a might look like (screenshots in the original post); you can see the diagonal, "sliding" structure of the resulting matrix.
We can leverage np.lib.stride_tricks.as_strided-based view_as_windows from scikit-image to get sliding windowed views into a zero-padded version of the input; being views, they are efficient in memory and performance. (More info on the use of as_strided-based view_as_windows.)
Hence, for the simpler case, it would be -
import numpy as np
from skimage.util.shape import view_as_windows

def sliding_2D_windows(b, outshp_axis1):
    # outshp_axis1 is the desired output's shape along axis=1
    n = outshp_axis1 - 1
    b1 = np.pad(b, ((0,0), (0,0), (n,n)), 'constant')
    w_shp = (1, b1.shape[1], b.shape[2] + n)
    return view_as_windows(b1, w_shp)[..., 0, ::-1, 0, :, :]
Sample run -
In [192]: b
Out[192]:
array([[[54, 57, 74, 77],
        [77, 19, 93, 31],
        [46, 97, 80, 98]],

       [[98, 22, 68, 75],
        [49, 97, 56, 98],
        [91, 47, 35, 87]]])

In [193]: sliding_2D_windows(b, outshp_axis1=3)
Out[193]:
array([[[[54, 57, 74, 77,  0,  0],
         [77, 19, 93, 31,  0,  0],
         [46, 97, 80, 98,  0,  0]],

        [[ 0, 54, 57, 74, 77,  0],
         [ 0, 77, 19, 93, 31,  0],
         [ 0, 46, 97, 80, 98,  0]],

        [[ 0,  0, 54, 57, 74, 77],
         [ 0,  0, 77, 19, 93, 31],
         [ 0,  0, 46, 97, 80, 98]]],

       [[[98, 22, 68, 75,  0,  0],
         [49, 97, 56, 98,  0,  0],
         [91, 47, 35, 87,  0,  0]],
        ....
        [[ 0,  0, 98, 22, 68, 75],
         [ 0,  0, 49, 97, 56, 98],
         [ 0,  0, 91, 47, 35, 87]]]])
Assuming b has shape (2, 3, x1) and a has shape (2, x2-x1+1, 3, x2). From your screenshot, we can infer that x1=4 and x2=6.
import numpy as np
b_shape = (2,3,4)
a_shape = (2,3,3,6)
b = np.arange(1,25).reshape(b_shape)
#array([[[ 1, 2, 3, 4],
# [ 5, 6, 7, 8],
# [ 9, 10, 11, 12]],
#
# [[13, 14, 15, 16],
# [17, 18, 19, 20],
# [21, 22, 23, 24]]])
c = np.pad(b, (*[(0,0) for _ in range(len(b_shape[:-1]))], (0,a_shape[-1]-b_shape[-1])), 'constant')
#array([[[ 1, 2, 3, 4, 0, 0],
# [ 5, 6, 7, 8, 0, 0],
# [ 9, 10, 11, 12, 0, 0]],
#
# [[13, 14, 15, 16, 0, 0],
# [17, 18, 19, 20, 0, 0],
# [21, 22, 23, 24, 0, 0]]])
a = np.stack([np.roll(c, shift=i) for i in range(a_shape[-1]-b_shape[-1]+1)], axis=1) # rolling the flattened array is safe here because every row ends in enough zeros
# array([[[[ 1, 2, 3, 4, 0, 0],
# [ 5, 6, 7, 8, 0, 0],
# [ 9, 10, 11, 12, 0, 0]],
# [[ 0, 1, 2, 3, 4, 0],
# [ 0, 5, 6, 7, 8, 0],
# [ 0, 9, 10, 11, 12, 0]],
# [[ 0, 0, 1, 2, 3, 4],
# [ 0, 0, 5, 6, 7, 8],
# [ 0, 0, 9, 10, 11, 12]]],
# [[[13, 14, 15, 16, 0, 0],
# [17, 18, 19, 20, 0, 0],
# [21, 22, 23, 24, 0, 0]],
# [[ 0, 13, 14, 15, 16, 0],
# [ 0, 17, 18, 19, 20, 0],
# [ 0, 21, 22, 23, 24, 0]],
# [[ 0, 0, 13, 14, 15, 16],
# [ 0, 0, 17, 18, 19, 20],
# [ 0, 0, 21, 22, 23, 24]]]])
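Since the question is about PyTorch, here is a minimal sketch of the same pad-and-slide idea using torch's built-in unfold for the simpler 3-D case (sliding_windows_torch and its argument names are my own):
import torch
import torch.nn.functional as F

def sliding_windows_torch(b, out_axis1):
    # b: (batch, rows, x1) -> (batch, out_axis1, rows, x1 + out_axis1 - 1)
    n = out_axis1 - 1
    b1 = F.pad(b, (n, n))                # zero-pad the last dim on both sides
    w = b1.unfold(2, b.shape[2] + n, 1)  # windows: (batch, rows, out_axis1, x1 + n)
    return w.flip(2).transpose(1, 2).contiguous()

b = torch.arange(1, 25).reshape(2, 3, 4)
a = sliding_windows_torch(b, 3)          # same (2, 3, 3, 6) result as above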

Identify blocks/groups of words in a document based on a key word and positional data?

Consider the following input data table.
import pandas as pd
#Pandas settings to see all the data when printing
pd.set_option('display.max_columns', None) # or 1000
pd.set_option('display.max_rows', None) # or 1000
pd.set_option('display.width', 500)
#Load the data
data_array = [[576, 60, 279, 28, 2, 'LzR', 0, 0], [578, 17, 318, 23, 3, 'U', 0, 0], [371, 21, 279, 24, 2, 'K', 0, 0], [373, 134, 317, 25, 3, 'mq77MJc', 0, 0], [537, 32, 317, 25, 3, '53', 0, 0], [373, 201, 355, 25, 4, '7Q7NZzkAzN', 0, 0], [538, 118, 393, 24, 5, 'oNNbgA', 0, 0], [680, 39, 392, 26, 5, 'J9', 0, 0], [1509, 155, 260, 154, 2, 'd', 0, 0], [1731, 98, 268, 123, 2, 'z8', 0, 0], [1876, 385, 271, 120, 2, 'rUqNDY', 0, 0], [1640, 197, 590, 21, 7, 't5gNVHDXQVJ', 0, 0], [1989, 270, 589, 22, 7, 't3I81fBOE9caUfb', 0, 0], [352, 80, 645, 25, 8, 'i5f3', 0, 1], [454, 245, 645, 25, 8, 'KrqcRA7Se7X7', 1, 1], [719, 60, 645, 27, 8, 'bpN', 0, 1], [1640, 161, 642, 22, 8, 'skAzt6Np4', 0, 0], [1822, 51, 643, 21, 8, 'K59', 0, 0], [2082, 177, 642, 22, 8, 'cwyN7wsMhE', 0, 0], [353, 220, 683, 25, 9, 'O8coFUwMUbE', 0, 1], [597, 17, 683, 25, 9, 'L', 0, 1], [1640, 234, 695, 22, 9, 'oVWEKowWnbT2y', 0, 0], [2080, 179, 695, 22, 9, 'FvjigCiC7h', 0, 0], [351, 79, 721, 24, 10, 'OQN3', 0, 1], [476, 202, 720, 25, 10, 'S2gcfJIDze', 0, 1], [2062, 69, 775, 22, 11, 'n9lN', 0, 0], [2155, 8, 775, 21, 11, 'G', 0, 0], [2188, 35, 775, 21, 11, '9X', 0, 0], [2246, 8, 775, 21, 11, 'v', 0, 0], [353, 81, 1003, 21, 13, 'c7ox8', 0, 0], [461, 325, 1003, 22, 13, 'o9GmMYAW4RrpPBY64p', 0, 0], [351, 101, 1037, 22, 14, '9NF7ii', 0, 0], [477, 146, 1037, 21, 14, 'MwTlIkU9', 0, 0], [350, 70, 1071, 22, 15, 'J5XF', 0, 0], [443, 87, 1071, 22, 15, '3m4tM', 0, 0], [553, 32, 1071, 22, 15, 'Ck', 0, 0], [609, 10, 1071, 22, 15, '5', 0, 0], [643, 53, 1071, 22, 15, 'X7Y', 0, 0], [1568, 135, 1092, 20, 16, 'P4', 0, 0], [352, 142, 1105, 22, 16, 'Pjs1GYSG', 0, 0], [516, 45, 1105, 22, 16, 'o9V', 0, 0], [588, 106, 1105, 22, 16, 'WRI8oY', 0, 0], [1563, 132, 1117, 20, 16, '3cZY', 0, 0], [350, 69, 1140, 21, 17, 'GW3y', 0, 0], [441, 35, 1139, 22, 17, 'EO', 0, 0], [497, 51, 1139, 28, 17, 'toN', 0, 0], [570, 49, 1140, 21, 17, 'k11', 0, 0], [643, 51, 1139, 22, 17, 'pod', 0, 0], [715, 89, 1140, 21, 17, '6SQfv', 0, 0], [825, 83, 1139, 22, 17, 'CzC2M', 0, 0], [934, 102, 1140, 21, 17, 'aowjQC', 0, 0], [1062, 51, 1140, 21, 17, 'BtC', 0, 0], [1558, 136, 1142, 20, 17, 'XhJ', 0, 0], [1722, 336, 1115, 25, 16, 'OgtXP2nxOwP7Gb3I', 0, 0], [352, 125, 1174, 21, 18, 'zYmvutc', 0, 0], [498, 45, 1174, 21, 18, 'JvN', 0, 0], [570, 124, 1174, 21, 18, 'TyZdJG4', 0, 0], [352, 64, 1207, 22, 19, 'Lvam', 0, 0], [443, 45, 1208, 21, 19, 'Onk', 0, 0], [516, 123, 1208, 21, 19, 'bPgi7tF', 0, 0], [1946, 12, 1231, 11, 20, 'I', 0, 0], [351, 106, 1241, 23, 20, 'xbAa7n', 0, 0], [479, 306, 1242, 22, 20, 'NEn7uifO17vkyzVVp', 0, 0], [1300, 142, 1242, 27, 20, 'dZukserV', 0, 0], [352, 178, 1275, 34, 21, 'qrxWKyJjjn', 0, 0], [557, 60, 1275, 28, 21, '2Ri5', 0, 0], [1354, 88, 1276, 27, 21, 'ZCp3F', 0, 0], [1558, 197, 1231, 63, 20, 'YgoGs', 0, 0], [1787, 96, 1247, 63, 20, 'Um', 0, 0], [1913, 268, 1231, 63, 20, 'YL7fkaV', 0, 0], [351, 70, 1309, 23, 22, 'kcGD', 0, 0], [443, 142, 1309, 23, 22, 'lGAx6Ljx', 0, 0], [605, 35, 1310, 21, 22, 'Hm', 0, 0], [661, 142, 1310, 27, 22, 'S8gZ5tPE', 0, 0], [1302, 135, 1310, 27, 22, 'gjgVPImz', 0, 0], [1743, 12, 1329, 11, 23, 'Z', 0, 0], [2055, 16, 1324, 17, 23, 'i', 0, 0], [353, 11, 1344, 21, 24, 'L', 0, 0], [386, 53, 1344, 21, 24, 'Q5J', 0, 0], [1300, 142, 1344, 27, 24, '9L9ScEj2', 0, 0], [1558, 400, 1345, 63, 24, 'S8YyUDnXd', 0, 0], [1993, 91, 1345, 62, 24, '4P', 0, 0], [1555, 102, 1605, 35, 25, 'kbGP', 0, 2], [1674, 371, 1605, 44, 25, 'DO1tvoEyiX9AVz6Q', 0, 2], [2062, 147, 1605, 44, 25, 'DtQAa3', 2, 2], [1554, 53, 1669, 35, 26, 'pg', 0, 2], [1624, 104, 1660, 34, 26, 'ZPsJ', 
0, 2], [1746, 221, 1659, 38, 26, '7CBPYAUA', 0, 2], [1987, 50, 1657, 46, 26, 'AL', 0, 2], [1555, 407, 1714, 44, 27, 'LA3ShdHUE3DAoOkfiB', 0, 2], [188, 1826, 2340, 3, 29, '4', 0, 0], [2024, 217, 2309, 34, 28, 'DLpZXhKepjdcyW', 0, 0], [2239, 119, 2310, 33, 28, '28otEfj9', 0, 0], [230, 77, 2349, 23, 29, 'Th1YC4R', 0, 0], [476, 89, 2349, 18, 29, 'uFRt5qEx', 0, 0], [1140, 463, 2388, 35, 30, 'Mxcsoj1MOubuEB33', 0, 0], [1708, 40, 2372, 17, 30, 'OfA', 0, 9], [1758, 81, 2372, 22, 30, 'ZQoO7mwr', 0, 9], [1848, 3, 2372, 17, 30, 'M', 0, 9], [1860, 134, 2372, 22, 30, 'IvtUnQ4Zxc29A', 0, 9], [2002, 20, 2376, 13, 30, '3V', 0, 9], [2029, 32, 2372, 17, 30, '6t8', 0, 9], [2070, 133, 2372, 17, 30, 'PdCWscuWGHR', 0, 9], [1709, 171, 2398, 22, 30, 'RsW4Oj1Lhf1ljQV4G', 0, 9], [1890, 148, 2398, 22, 30, 'VSUJUa3tuYIhiXxP', 9, 9], [2048, 34, 2398, 17, 30, 'aAm', 0, 9], [2089, 21, 2403, 12, 30, 'uY', 0, 9], [2118, 53, 2398, 17, 30, '6DDFv', 0, 9], [2179, 28, 2398, 17, 30, 'DKJ', 0, 9], [2214, 66, 2398, 17, 30, 'NBmY9BD', 0, 9], [2289, 57, 2398, 18, 30, 'sYsrT', 0, 9], [1708, 25, 2425, 17, 31, 'jGk', 0, 9], [1736, 34, 2429, 13, 31, 'oX', 0, 9], [1778, 93, 2425, 17, 31, 'OvpfEyhHso', 0, 9], [120, 131, 2510, 23, 32, 'rZCsYsA6im2b', 0, 0], [260, 25, 2515, 18, 32, 'G6', 0, 0], [295, 107, 2510, 18, 32, 'd6eYwhzZuS', 0, 0], [132, 88, 2582, 22, 34, 'Xc84', 3, 3], [231, 223, 2582, 22, 34, 'MnMcBUHVmhl2', 0, 3], [463, 47, 2582, 22, 34, 'Vto', 0, 3], [132, 194, 2616, 22, 35, 'B4f1f4KpCHC', 0, 3], [338, 14, 2616, 22, 35, 'W', 0, 3], [131, 64, 2650, 22, 36, 'UW6t', 0, 3], [216, 181, 2650, 22, 36, 'hLULWi7xdj', 0, 3], [1044, 175, 2510, 18, 32, 'F9f7jvsfmjnXbK', 0, 0], [1226, 25, 2515, 18, 32, 'Vk', 0, 0], [1261, 177, 2510, 23, 32, 'TBlYLSoItzHKpG', 0, 0], [1054, 132, 2544, 22, 33, 'u4vvPgHd', 0, 0], [1053, 36, 2590, 21, 34, 'lN', 0, 4], [1101, 107, 2589, 23, 34, 'ieee4D', 0, 4], [1218, 47, 2589, 23, 34, 'kD6', 0, 4], [1054, 122, 2623, 23, 35, 'Ngf2xWa', 0, 4], [1189, 132, 2624, 22, 35, 'N27RyHsP', 0, 4], [1054, 204, 2657, 23, 36, 'e97JFxWTXfS', 0, 4], [1262, 43, 2658, 22, 36, 'p', 4, 4], [1054, 65, 2692, 22, 37, 'mle1', 0, 4], [1139, 186, 2691, 23, 37, 'o6tA5wFrK', 0, 4], [1337, 39, 2691, 23, 37, 'W3', 0, 4], [1709, 175, 2510, 18, 32, 'DQm27gIhcjmkdB', 0, 0], [1892, 25, 2515, 18, 32, '4Z', 0, 0], [1927, 176, 2510, 23, 32, 'rAP1PxzMyqkxdY', 0, 0], [1720, 132, 2544, 22, 33, 'JpsQeikW', 0, 0], [1719, 35, 2590, 21, 34, 'hD', 0, 5], [1766, 107, 2589, 23, 34, '3vzIwR', 0, 5], [1884, 47, 2589, 23, 34, 'kHw', 0, 5], [1720, 122, 2623, 23, 35, 'MYOKedL', 0, 5], [1854, 132, 2624, 22, 35, 'K8JXFVII', 5, 5], [1720, 204, 2657, 23, 36, 'bBkPRmgyfVp', 0, 5], [1928, 43, 2658, 22, 36, 'j', 0, 5], [1719, 65, 2692, 22, 37, 'RfU4', 0, 5], [1805, 185, 2691, 23, 37, 'wtK1L23Q4', 0, 5], [2003, 38, 2692, 22, 37, 'yY', 0, 5], [130, 255, 2804, 23, 38, 'jgoGjNh2DoLnb2b4PGonGvU', 0, 0], [1044, 117, 2804, 18, 38, 'qGXS7f7gRHy', 0, 0], [1168, 38, 2804, 18, 38, 'UQI', 0, 0], [1215, 102, 2804, 18, 38, 'P764bscKkx', 0, 0], [1320, 38, 2804, 18, 38, 'OtH', 0, 0], [1368, 58, 2804, 18, 38, 'VhrUJ', 0, 0], [1709, 100, 2804, 23, 38, 'zjQgoufCGU', 0, 0], [131, 55, 2852, 21, 40, 'piH', 0, 0], [198, 41, 2858, 15, 40, 'wU6P', 0, 0], [281, 124, 2852, 21, 40, 'riQCT4RX', 0, 0], [454, 138, 2852, 27, 40, 'jSAJPlWhyRE', 0, 0], [612, 77, 2852, 21, 40, 'nVS97', 0, 0], [131, 227, 2886, 21, 41, 'zExU7Poi4QW', 0, 0], [375, 235, 2886, 21, 41, 'pLTfHVP1qzb7Mh2', 0, 0], [138, 100, 2957, 15, 42, 'fv8', 0, 0], [1404, 4, 2978, 4, 42, 'B', 0, 0], [130, 103, 2975, 34, 42, 'qpg', 0, 0], 
[253, 252, 2974, 19, 42, 'T9SOmYWl4CUrdt8o', 0, 0], [1078, 3, 2972, 40, 42, 'S5', 0, 0], [1103, 62, 2978, 28, 42, 'L6W', 0, 0], [1181, 56, 2978, 28, 42, 'ep1', 0, 0], [1253, 118, 2978, 28, 42, 'oKhrqlI', 0, 0], [1384, 45, 2985, 21, 42, 'OyP', 0, 0], [1444, 132, 2978, 28, 42, 'mvg8Bw5', 0, 0], [1593, 55, 2972, 76, 42, 'eG', 0, 0], [218, 5, 3074, 18, 44, 'z', 0, 0], [231, 72, 3058, 18, 44, 'x1Pat7', 0, 0], [605, 5, 3074, 18, 44, 'P', 0, 0], [617, 39, 3058, 18, 44, 'dNT', 0, 0], [1053, 146, 3058, 23, 44, 'q7CLeOJhnI1oa', 0, 0], [1802, 5, 3074, 18, 44, '6', 0, 0], [1815, 72, 3058, 18, 44, 'acKa9h', 0, 0], [2119, 50, 3057, 35, 44, 'uGH', 0, 0], [461, 129, 3125, 29, 45, 'p6L5U', 0, 0], [623, 44, 3125, 29, 45, 'dC', 0, 0], [1046, 266, 3125, 29, 45, '9HBoqUyRbg', 0, 0], [1975, 129, 3125, 29, 45, 'qH1ph', 0, 0], [2136, 45, 3125, 29, 45, 'gG', 0, 0], [218, 5, 3183, 20, 46, 'j', 0, 0], [605, 5, 3183, 20, 46, 'o', 0, 0], [119, 24, 3213, 18, 47, 'QDN', 0, 8], [153, 94, 3213, 18, 47, 'EleVpvP4', 0, 8], [256, 105, 3213, 23, 47, 'dq9L2xQO7', 0, 8], [370, 7, 3223, 2, 47, 'n', 0, 8], [386, 69, 3212, 24, 47, 'L9EKl', 0, 8], [464, 83, 3213, 23, 47, 'AnF2rBIN', 0, 8], [555, 19, 3214, 17, 47, 'k6', 0, 8], [582, 62, 3213, 18, 47, 'y3M3kx', 8, 8], [654, 2, 3213, 18, 47, '1', 0, 8], [666, 139, 3212, 19, 47, 'SkmavPFrrrSv', 0, 8], [808, 52, 3213, 18, 47, 'bJ5S', 0, 8], [200, 100, 3316, 29, 50, 'NmNa', 0, 7], [336, 675, 3316, 29, 50, 'vB759g8XWkL7XXe5tCHZs7tAF', 7, 7], [1046, 42, 3203, 23, 47, 'v4T', 0, 0], [1095, 150, 3202, 19, 47, 'NH7vM6', 0, 0], [1251, 24, 3199, 22, 47, '47', 0, 0], [1802, 5, 3183, 20, 46, 'B', 0, 0], [2119, 5, 3183, 20, 46, 'b', 0, 0], [1714, 254, 3213, 23, 47, '2Za9eGyQyKp4S2rVYahzJNM', 0, 0], [1715, 55, 3261, 21, 48, 'djv', 0, 6], [1781, 41, 3267, 15, 48, '3WHD', 0, 6], [1864, 124, 3261, 21, 48, '8ucAV2oj', 0, 6], [2037, 139, 3261, 27, 48, 'baUoLawp6rY', 0, 6], [2196, 76, 3261, 21, 48, 'sRheu', 6, 6], [1715, 226, 3295, 21, 49, 'hAfhkKsI7Jx', 0, 6], [1959, 234, 3295, 21, 49, 'quecbSW4gEdjSGG', 0, 6], [1715, 176, 3329, 27, 50, 'ciaZR8NxiuEXr1', 0, 6], [1910, 140, 3329, 21, 50, 'vicUyHPNcN', 0, 6]]
data_pd = pd.DataFrame(data_array,columns=["left", "width", "top", "height", "lineNr", "randomWord", "keyWord", "keyWordGroup"])
print(data_pd)
The table contains a main column randomWord and a few other columns with positional coordinates of each word within the document.
To help visualize the data, I wrote this code, which renders the table as an image for a better understanding of the problem:
from PIL import Image, ImageFont, ImageDraw # pip install Pillow
import random
#Create an empty image object
new_im = Image.new('RGB', ((data_pd["left"]+data_pd["width"]).max() + data_pd["left"].min(), (data_pd["top"]+data_pd["height"]).max() + data_pd["top"].min()), (255,255,255))
draw_new_im = ImageDraw.Draw(new_im)
#Create a dictionary with a random color for each unique keyWordGroup
uniqGroups = data_pd["keyWordGroup"].unique()
colors = {}
for g in uniqGroups:
    if g == 0:
        colors[str(g)] = "black" # assign black for words without a group
    else:
        colors[str(g)] = "#" + ''.join(random.choice('0123456789ABCDEF') for j in range(6)) # generate a random color
#Write text to the image
for i, row in data_pd.iterrows():
    draw_new_im.text((int(row["left"]), int(row["top"])), str(row["randomWord"]), fill=colors[str(row["keyWordGroup"])], font=ImageFont.truetype("arial.ttf", int(row["height"])))
#Save the image
new_im.save("TestImage.jpg")
As you can see, we have the keyWord column. It contains the IDs of some key words, and we need to find the block/group of text closest to each of them. As you can see in the generated image, for each keyWord ID we find all the words in its proximity, which form a block of text.
The question of this post is the following: how can we identify the group/block of text closest to each key word in the keyWord column?
The output I am looking for is in the keyWordGroup column, which is an example of which words were assigned to the key words.
Is there any method we could use to find these blocks of text based on the keywords and the rest of the positional data given?
The solution comprises two steps:
Group words to the closest keyword (I wouldn't call it clustering, as the centers of the groups are already given here, as opposed to clustering, where you try to find clusters with no a priori known locations).
Remove outliers that don't seem to really belong to a keyword, although that keyword is the closest by distance.
Grouping is straightforward by assigning keyword numbers by distance using vector quantization. The only thing we have to bear in mind here is that the keyword numbers in the original dataframe don't appear in an ordered sequence, whereas vq numbers groups sequentially starting at 0. That's why we have to map the new keyword group numbers to the given keyword numbers.
Removing outliers can be done in different ways, and there were no strict requirements in the question on how the keyword groups should be formed. I chose a very simple approach: take the mean and standard deviation of the distances from the keyword to all keyword group members, and consider words with distances greater than mean + x * stddev as outliers. A choice of x = 1.5 gives good results.
#Load the data: data_array and data_pd as defined in the question above
### group words to keywords
from scipy.cluster.vq import vq
keywords = pd.concat([data_pd[data_pd.keyWord!=0].left + data_pd[data_pd.keyWord!=0].width/2, data_pd[data_pd.keyWord!=0].top + data_pd[data_pd.keyWord!=0].height/2], axis=1)
words = pd.concat([data_pd[data_pd.keyWord==0].left + data_pd[data_pd.keyWord==0].width/2, data_pd[data_pd.keyWord==0].top + data_pd[data_pd.keyWord==0].height/2], axis=1)
res = vq(words.to_numpy(), keywords.to_numpy())
### remove outliers
import numpy as np
factor = 1.5
limits = []
# calculate limit as limit = mean + factor * stddev for each keyWord
for i in range(len(keywords)):
    limits.append(np.mean(res[1][res[0]==i]) + factor * np.std(res[1][res[0]==i]))
# mark words with distance > limit as outliers
for i in range(len(res[0])):
    if res[1][i] > limits[res[0][i]]:
        res[0][i] = -1
### assign results to dataframe
words['keyWordGroupNew'] = res[0] + 1
keywords['keyWordGroupNew'] = range(1, len(keywords) + 1)
data_pd = pd.concat([data_pd, pd.concat([words['keyWordGroupNew'], keywords['keyWordGroupNew']])], axis=1, join='outer')
# renumber keyWordGroup according to keyWord numbering
dic = dict(zip(range(1, len(keywords) + 1), data_pd[data_pd.keyWord!=0]['keyWord']))
dic[0] = 0
data_pd.keyWordGroupNew = data_pd.keyWordGroupNew.map(dic)
from PIL import Image, ImageFont, ImageDraw # pip install Pillow
import random
#Create an empty image object
new_im = Image.new('RGB', ((data_pd["left"]+data_pd["width"]).max() + data_pd["left"].min(), (data_pd["top"]+data_pd["height"]).max() + data_pd["top"].min()), (255,255,255))
draw_new_im = ImageDraw.Draw(new_im)
#Create a dictionary with a color for each unique keyWordGroupNew
uniqGroups = data_pd["keyWordGroupNew"].unique()
colors = {}
i = 0
for g in uniqGroups:
    if g == 0:
        colors[str(g)] = "black" # assign black for words without a group
    else:
        colors[str(g)] = "hsl(" + str(70 + i * 290 / (len(uniqGroups) - 2)) + ",100%,50%)"
        i += 1
#Write text to the image
for i, row in data_pd.iterrows():
    draw_new_im.text((int(row["left"]), int(row["top"])), str(row["randomWord"]), fill=colors[str(row["keyWordGroupNew"])], font=ImageFont.truetype("arial.ttf", int(row["height"])))
    if row["keyWord"] > 0:
        draw_new_im.rectangle([row["left"], row["top"], row["left"]+row["width"], row["top"]+row["height"]], outline=colors[str(row["keyWordGroupNew"])])
#Save the image
new_im.save("out-std.jpg")
As you can see in the code, I also made two small improvements to the image generation: the colors are uniformly distributed over the hue range from yellow to red, and frames are drawn around the keywords.
Another approach to outlier detection is called the local outlier factor. This technique marks isolated words that are not surrounded by other group members as outliers.
### group words to keywords
from scipy.cluster.vq import vq
keywords = pd.concat([data_pd[data_pd.keyWord!=0].left + data_pd[data_pd.keyWord!=0].width/2, data_pd[data_pd.keyWord!=0].top + data_pd[data_pd.keyWord!=0].height/2], axis=1)
words = pd.concat([data_pd[data_pd.keyWord==0].left + data_pd[data_pd.keyWord==0].width/2, data_pd[data_pd.keyWord==0].top + data_pd[data_pd.keyWord==0].height/2], axis=1)
res = vq(words.to_numpy(), keywords.to_numpy())
# assign results
words['keyWordGroupNew'] = res[0] + 1
keywords['keyWordGroupNew'] = range(1, len(keywords) + 1)
### remove outliers
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(n_neighbors=10, contamination=0.1)
for i in range(1, len(keywords)+1):
    y_pred = clf.fit_predict(words[words.keyWordGroupNew==i].iloc[:,0:2].to_numpy())
    words.loc[words[words.keyWordGroupNew==i][y_pred==-1].index, 'keyWordGroupNew'] = 0 # mark as outlier
### save results to dataframe
data_pd = pd.concat([data_pd, pd.concat([words['keyWordGroupNew'],keywords['keyWordGroupNew']])], axis=1, join='outer')
# renumber keyWordGroup according to keyWord numbering
dic = dict(zip(range(1, len(keywords) + 1), data_pd[data_pd.keyWord!=0]['keyWord']))
dic[0] = 0
data_pd.keyWordGroupNew = data_pd.keyWordGroupNew.map(dic)
# image generation as in previous example ...
This doesn't work as well for relatively small groups, and the results are visually not as good as with the other method when the keyword is located outside the center of the group.
The result for contamination = 0.1, which is a commonly used value, is here. For details see the original paper and the sklearn docs.
Conclusion: both approaches give satisfactory results that can be tuned by adjusting the factor x or the contamination, respectively.

Encode numpy array using uncompressed RLE for COCO dataset

To create a COCO dataset of annotated images, you need to convert binary masks into either polygons or uncompressed run length encoding representations depending on the type of object.
The pycocotools library has functions to encode and decode into and from compressed RLE, but nothing for polygons and uncompressed RLE.
I can use skimage's measure library to generate polygons of masks, but I'm not sure how to create uncompressed RLEs.
I can use this RLE encoder to create a representation of RLE from an image, but I'm not sure what format COCO expects. COCO just mentions that they use a "custom Run Length Encoding (RLE) scheme"
For example,
ground_truth_binary_mask = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=np.uint8)
fortran_ground_truth_binary_mask = np.asfortranarray(ground_truth_binary_mask)
rle(fortran_ground_truth_binary_mask)
outputs:
(array([26, 36, 46, 56, 61]), array([3, 3, 3, 3, 1]))
This is what a COCO RLE looks like:
{
  "segmentation": {
    "counts": [
      272, 2, 4, 4, 4, 4, 2, 9, 1, 2, 16, 43, 143, 24, 5, 8, 16, 44, 141, 25,
      8, 5, 17, 44, 140, 26, 10, 2, 17, 45, 129, 4, 5, 27, 24, 5, 1, 45, 127, 38,
      23, 52, 125, 40, 22, 53, 123, 43, 20, 54, 122, 46, 18, 54, 121, 54, 12, 53, 119, 57,
      11, 53, 117, 59, 13, 51, 117, 59, 13, 51, 117, 60, 11, 52, 117, 60, 10, 52, 118, 60,
      9, 53, 118, 61, 8, 52, 119, 62, 7, 52, 119, 64, 1, 2, 2, 51, 120, 120, 120, 101,
      139, 98, 142, 96, 144, 93, 147, 90, 150, 87, 153, 85, 155, 82, 158, 76, 164, 66, 174, 61,
      179, 57, 183, 54, 186, 52, 188, 49, 191, 47, 193, 21, 8, 16, 195, 20, 13, 8, 199, 18,
      222, 17, 223, 16, 224, 16, 224, 15, 225, 15, 225, 15, 225, 15, 225, 15, 225, 15, 225, 15,
      225, 15, 225, 14, 226, 14, 226, 14, 39, 1, 186, 14, 39, 3, 184, 14, 39, 4, 183, 13,
      40, 6, 181, 14, 39, 7, 180, 14, 39, 9, 178, 14, 39, 10, 177, 14, 39, 11, 176, 14,
      38, 14, 174, 14, 36, 19, 171, 15, 33, 32, 160, 16, 30, 35, 159, 18, 26, 38, 158, 19,
      23, 41, 157, 20, 19, 45, 156, 21, 15, 48, 156, 22, 10, 53, 155, 23, 9, 54, 154, 23,
      8, 55, 154, 24, 7, 56, 153, 24, 6, 57, 153, 25, 5, 57, 153, 25, 5, 58, 152, 25,
      4, 59, 152, 26, 3, 59, 152, 26, 3, 59, 152, 27, 1, 60, 152, 27, 1, 60, 152, 86,
      154, 80, 160, 79, 161, 42, 8, 29, 161, 41, 11, 22, 2, 3, 161, 40, 13, 18, 5, 3,
      161, 40, 15, 2, 5, 8, 7, 2, 161, 40, 24, 6, 170, 35, 30, 4, 171, 34, 206, 34,
      41, 1, 164, 34, 39, 3, 164, 34, 37, 5, 164, 34, 35, 10, 161, 36, 1, 3, 28, 17,
      155, 41, 27, 16, 156, 41, 26, 17, 156, 41, 26, 16, 157, 27, 4, 10, 25, 16, 158, 27,
      6, 8, 11, 2, 12, 6, 2, 7, 159, 27, 7, 14, 3, 4, 19, 6, 160, 26, 8, 22,
      18, 5, 161, 26, 8, 22, 18, 4, 162, 26, 8, 23, 15, 4, 164, 23, 11, 23, 11, 7,
      165, 19, 17, 22, 9, 6, 167, 19, 22, 18, 8, 3, 170, 18, 25, 16, 7, 1, 173, 17,
      28, 15, 180, 17, 30, 12, 181, 16, 34, 6, 184, 15, 225, 14, 226, 13, 227, 12, 228, 11,
      229, 10, 230, 9, 231, 9, 231, 9, 231, 9, 231, 8, 232, 8, 232, 8, 232, 8, 232, 8,
      232, 8, 232, 7, 233, 7, 233, 7, 233, 7, 233, 8, 232, 8, 232, 8, 232, 9, 231, 9,
      231, 9, 231, 10, 230, 10, 230, 11, 229, 13, 227, 14, 226, 16, 224, 17, 223, 19, 221, 23,
      217, 31, 3, 5, 201, 39, 201, 39, 201, 39, 201, 39, 201, 39, 201, 40, 200, 40, 200, 41,
      199, 41, 199, 41, 199, 22, 8, 12, 198, 22, 12, 8, 198, 22, 14, 6, 198, 22, 15, 6,
      197, 22, 16, 5, 197, 22, 17, 5, 196, 22, 18, 4, 196, 22, 19, 4, 195, 22, 19, 5,
      194, 22, 20, 4, 194, 25, 21, 1, 193, 27, 213, 29, 211, 30, 210, 35, 6, 6, 193, 49,
      191, 50, 190, 50, 190, 51, 189, 51, 189, 52, 188, 53, 187, 53, 187, 54, 186, 54, 186, 54,
      186, 55, 185, 55, 185, 55, 185, 55, 185, 55, 185, 55, 185, 55, 185, 55, 185, 55, 185, 55,
      185, 55, 185, 55, 185, 55, 185, 55, 185, 28, 1, 26, 185, 23, 11, 21, 185, 20, 17, 17,
      186, 18, 21, 15, 186, 16, 23, 14, 187, 14, 25, 14, 187, 14, 26, 12, 188, 14, 28, 10,
      188, 14, 226, 14, 226, 16, 224, 17, 223, 19, 221, 20, 220, 22, 218, 24, 18, 3, 12, 3,
      180, 25, 10, 1, 4, 6, 10, 6, 178, 28, 7, 12, 8, 8, 177, 49, 3, 12, 176, 65,
      175, 67, 173, 69, 171, 53, 3, 14, 170, 37, 20, 9, 4, 1, 169, 36, 21, 8, 175, 35,
      22, 7, 176, 34, 23, 7, 176, 34, 23, 6, 177, 35, 22, 6, 177, 35, 22, 8, 175, 35,
      23, 9, 173, 35, 205, 36, 204, 39, 201, 43, 197, 48, 36, 1, 155, 48, 35, 3, 154, 49,
      33, 5, 154, 48, 32, 6, 155, 49, 27, 10, 155, 51, 24, 11, 154, 54, 21, 11, 155, 56,
      19, 11, 155, 56, 18, 11, 156, 56, 17, 11, 157, 56, 16, 12, 157, 56, 14, 13, 159, 56,
      12, 13, 160, 61, 5, 14, 162, 78, 165, 75, 167, 73, 168, 72, 170, 70, 171, 69, 173, 67,
      176, 64, 179, 61, 182, 58, 183, 57, 185, 54, 187, 53, 188, 51, 191, 49, 192, 47, 195, 45,
      196, 43, 198, 42, 199, 40, 201, 38, 203, 36, 205, 34, 207, 32, 210, 28, 213, 26, 216, 22,
      221, 16, 228, 8, 10250
    ],
    "size": [240, 320]
  }
}
Information on the format is available here: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/mask.py
RLE is a simple yet efficient format for storing binary masks. RLE
first divides a vector (or vectorized image) into a series of
piecewise constant regions and then for each piece simply stores the
length of that piece. For example, given M=[0 0 1 1 1 0 1] the RLE
counts would be [2 3 1 1], or for M=[1 1 1 1 1 1 0] the counts would
be [0 6 1] (note that the odd counts are always the numbers of zeros).
import numpy as np
from itertools import groupby

def binary_mask_to_rle(binary_mask):
    rle = {'counts': [], 'size': list(binary_mask.shape)}
    counts = rle.get('counts')
    for i, (value, elements) in enumerate(groupby(binary_mask.ravel(order='F'))):
        if i == 0 and value == 1:
            counts.append(0)
        counts.append(len(list(elements)))
    return rle
test_list_1 = np.array([0, 0, 1, 1, 1, 0, 1])
test_list_2 = np.array([1, 1, 1, 1, 1, 1, 0])
print(binary_mask_to_rle(test_list_1))
print(binary_mask_to_rle(test_list_2))
output:
{'counts': [2, 3, 1, 1], 'size': [7]}
{'counts': [0, 6, 1], 'size': [7]}
You can use mask.frPyObjects(rle, size_x, size_y) to encode the RLE, and then do all the usual mask operations.
import json
import numpy as np
from pycocotools import mask
from skimage import measure
ground_truth_binary_mask = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
                                     [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=np.uint8)
fortran_ground_truth_binary_mask = np.asfortranarray(ground_truth_binary_mask)
encode the mask to RLE:
rle = binary_mask_to_rle(fortran_ground_truth_binary_mask)
print(rle)
output:
{'counts': [6, 1, 40, 4, 5, 4, 5, 4, 21], 'size': [9, 10]}
compress the RLE, and then decode:
compressed_rle = mask.frPyObjects(rle, rle.get('size')[0], rle.get('size')[1])
mask.decode(compressed_rle)
output:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
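Once you have the compressed RLE, the rest of the pycocotools mask API applies directly, for example (using the compressed_rle from above; this mask covers 13 pixels):
print(mask.area(compressed_rle))    # number of foreground pixels
print(mask.toBbox(compressed_rle))  # [x, y, width, height] of the mask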
As an improvement on @waspinator's answer, this version is about 35% faster:
def binary_mask_to_rle(binary_mask):
    rle = {'counts': [], 'size': list(binary_mask.shape)}
    counts = rle.get('counts')
    last_elem = 0
    running_length = 0
    for elem in binary_mask.ravel(order='F'):
        if elem != last_elem:
            counts.append(running_length)
            running_length = 0
            last_elem = elem
        running_length += 1
    counts.append(running_length)
    return rle
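If the loop is still too slow on large masks, the run lengths can also be computed fully vectorized with numpy (a sketch with my own function name binary_mask_to_rle_np; same output format as above):
import numpy as np

def binary_mask_to_rle_np(binary_mask):
    flat = binary_mask.ravel(order='F')
    # indices where the value changes, plus both ends, delimit the runs
    changes = np.flatnonzero(flat[1:] != flat[:-1]) + 1
    boundaries = np.concatenate(([0], changes, [flat.size]))
    counts = np.diff(boundaries).tolist()
    if flat[0] == 1:
        counts = [0] + counts  # COCO counts always start with a run of zeros
    return {'counts': counts, 'size': list(binary_mask.shape)}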
In order to decode the binary masks encoded in the COCO annotations, you need to first use the COCO API to get the RLE and then use OpenCV to get the contours, like this:
# Import libraries
import numpy as np
import cv2
import json
from pycocotools import mask
# Read the annotations
file_path = "coco/annotations/stuff_annotations_trainval2017/stuff_train2017.json"
with open(file_path, 'r') as f:
    data = json.load(f)
updated_data = []
# For each annotation
for annotation in data['annotations']:
    # Initialize variables
    obj = {}
    segmentation = []
    segmentation_polygons = []
    # Decode the binary mask
    mask_list = mask.decode(annotation['segmentation'])
    mask_list = np.ascontiguousarray(mask_list, dtype=np.uint8)
    # OpenCV 3.x returns three values here; in OpenCV 4.x drop mask_new
    mask_new, contours, hierarchy = cv2.findContours(mask_list, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    # Get the contours (a valid polygon needs more than 4 coordinates, i.e. more than two points)
    for contour in contours:
        contour = contour.flatten().tolist()
        if len(contour) > 4:
            segmentation.append(contour)
    if len(segmentation) == 0:
        continue
    # Get the polygons as (x, y) coordinate pairs
    for i, segment in enumerate(segmentation):
        polygon = []
        polygons = []
        for j in range(len(segment)):
            polygon.append(segment[j])
            if (j+1) % 2 == 0:
                polygons.append(polygon)
                polygon = []
        segmentation_polygons.append(polygons)
    # Save the segmentation and polygons for the current annotation
    obj['segmentation'] = segmentation
    obj['segmentation_polygons'] = segmentation_polygons
    updated_data.append(obj)
Note: only the COCO stuff 2017 annotations use binary masks; the COCO person 2017 annotations don't, so you don't need to decode the latter or find their contours.
Inspired by this solution.
Do it like this:
import numpy as np
from pycocotools import mask as m
# Create a binary mask (pycocotools expects uint8 in Fortran order)
n_a = np.random.normal(size=(10, 10))
b_a = np.asfortranarray(n_a > 0.5, dtype=np.uint8)
# Encode and decode the mask
e_a = m.encode(b_a)
d_a = m.decode(e_a)
# Decode a byte-string RLE-encoded mask; here pred is assumed to be your own
# dict in which 'counts' was stored as the string repr of a bytes object
sz = pred['mask']['size']
c = pred['mask']['counts'][2:-1]  # strip the b'...' wrapper of the stringified bytes
es = str.encode(c)
t = {'size': sz, 'counts': es}
dm = m.decode(t)
