Data transformation in Python

The input to my function is a list of lists like the one below. For each inner list, I want to divide every item after the third one by the third item. How do I do that?
So far I tried this:
def convert_data_percentages(data):
    lst = []
    for x in data:
        lst.append(int(x[2:]) // int(x[2]))
    return lst
convert_data_percentages(data, col_id=2)
[["'s-Gravenhage",
'GM0518',
'537833',
'266778',
'271055',
'92532',
'66099',
'162025',
'139157',
'78020',
'304766',
'162020',
'51430',
'19617'],
['Rotterdam',
'GM0599',
'644618',
'317935',
'326683',
'103680',
'86037',
'197424',
'159008',
'98469',
'367279',
'187835',
'62703',
'26801']]

Do you mean something like this?
data = [["'s-Gravenhage",
'GM0518',
'537833',
'266778',
'271055',
'92532',
'66099',
'162025',
'139157',
'78020',
'304766',
'162020',
'51430',
'19617'],
['Rotterdam',
'GM0599',
'644618',
'317935',
'326683',
'103680',
'86037',
'197424',
'159008',
'98469',
'367279',
'187835',
'62703',
'26801']]
# For each record, keep the first two string fields plus the third field as an int;
# the walrus expression binds the remaining fields, converted to ints, to to_int
subsets = [(lst[:2] + to_int[:1], to_int[0], to_int[1:]) for lst in data
           if (to_int := [int(i) for i in lst[2:]])]
# walk through the tuples, building a new list by dividing every item in the
# remaining subset by the third field
print([first_three + [i / third for i in subset] for first_three, third, subset in subsets])
returns:
[["'s-Gravenhage", 'GM0518', 537833, 0.4960238587070708, 0.5039761412929292, 0.1720459696597271, 0.1228987436620661, 0.30125522234596985, 0.2587364479308633, 0.14506361640137366, 0.5666554488103185, 0.3012459257799354, 0.09562447823023132, 0.03647414717951483],
['Rotterdam', 'GM0599', 644618, 0.49321458600287305, 0.506785413997127, 0.16083944289486177, 0.13346974487215685, 0.3062651058456326, 0.24667012090881732, 0.15275558547853146, 0.5697622467880202, 0.291389629206755, 0.09727156238268246, 0.04157656162254234]]
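If the walrus operator feels too dense, an equivalent plain-loop version (just a sketch, assuming as above that every field from the third onward is a numeric string) reads a bit more easily:
def convert_data_percentages(data):
    result = []
    for row in data:
        numbers = [int(v) for v in row[2:]]   # numeric fields as ints
        third = numbers[0]                    # the value to divide by
        result.append(row[:2] + [third] + [n / third for n in numbers[1:]])
    return result

print(convert_data_percentages(data))
This produces the same output as the comprehension above.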

Related

Finding all possible permutations of a hash when given list of grouped elements

Best way to show what I'm trying to do:
I have a list of different hashes that consist of ordered elements separated by an underscore. Each element may or may not have other possible replacement values. I'm trying to generate a list of all possible combinations of this hash after taking the replacement values into account.
Example:
grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]
original_hash = "1_2_3_4_5"
I want to be able to generate a list of the following hashes:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
"1a_2_3a_4_5",
"1b_2_3a_4_5",
]
The challenge is that this'll be needed on large dataframes.
So far here's what I have:
def return_all_possible_hashes(df, grouped_elements):
    rows_to_append = []
    for grouped_element in grouped_elements:
        for index, row in df[
            df["hash"].str.contains("|".join(grouped_element))
        ].iterrows():
            (element_used_in_hash,) = set(grouped_element) & set(row["hash"].split("_"))
            hash_used = row["hash"]
            replacement_elements = set(grouped_element) - set([element_used_in_hash])
            for replacement_element in replacement_elements:
                row["hash"] = hash_used.replace(
                    element_used_in_hash, replacement_element
                )
                rows_to_append.append(row)
    return df.append(rows_to_append)
But the problem is that this will only append hashes with all combinations of a given grouped_element, and not all combinations of all grouped_elements at the same time. So using the example above, my function would return:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
]
I feel like I'm not far from the solution, but I also feel stuck, so any help is much appreciated!
If you make a list of the original hash value's elements and replace each element with a list of all its possible variations, you can use itertools.product to get the Cartesian product across these sublists. Transforming each element of the result back to a string with '_'.join() will get you the list of possible hashes:
from itertools import product

def possible_hashes(original_hash, grouped_elements):
    hash_list = original_hash.split('_')
    variations = list(set().union(*grouped_elements))
    var_list = hash_list.copy()
    for i, h in enumerate(hash_list):
        if h in variations:
            for g in grouped_elements:
                if h in g:
                    var_list[i] = g
                    break
        else:
            var_list[i] = [h]
    return ['_'.join(h) for h in product(*var_list)]
possible_hashes("1_2_3_4_5", [["1", "1a", "1b"], ["3", "3a"]])
['1_2_3_4_5',
'1_2_3a_4_5',
'1a_2_3_4_5',
'1a_2_3a_4_5',
'1b_2_3_4_5',
'1b_2_3a_4_5']
To use this function on various original hash values stored in a dataframe column, you can do something like this:
df['hash'].apply(lambda x: possible_hashes(x, grouped_elements))
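If you would rather end up with one row per generated hash instead of a list per row, a small sketch (assuming a reasonably recent pandas that has DataFrame.explode, plus the grouped_elements list from above) could look like:
df = df.assign(hash=df['hash'].apply(lambda x: possible_hashes(x, grouped_elements)))
df = df.explode('hash').reset_index(drop=True)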

Find unique elements within a certain range of coordinates

I'm trying to determine all unique elements within a list based on their x-y coordinates. The list has the following structure:
List = [[[Picture1, [X-Coordinate, Y-Coordinate]], [Picture1, [X-Coordinate, Y-Coordinate]]],
        [[Picture2, [X-Coordinate, Y-Coordinate]], [Picture2, [X-Coordinate, Y-Coordinate]]],
        ...]
This is the actual list:
MyList = [[['IMG_6094.jpg', [2773.0, 240.0]], ['IMG_6094.jpg', [2773.0, 240.0]]],
          [['IMG_6096.jpg', [1464.0, 444.0]], ['IMG_6096.jpg', [3043.0, 2358.0]]],
          [['IMG_6088.jpg', [1115.5, 371.5]]],
          [['IMG_6090.jpg', [3083.0, 1982.5]], ['IMG_6090.jpg', [3083.0, 1982.5]]],
          [['IMG_6093.jpg', [477.0, 481.0]], ['IMG_6093.jpg', [450.0, 487.5]]]]
As you can see, there are sometimes elements that have the same coordinates within a picture, or that are at least very close to each other. What I need to do is throw out all non-unique or very close elements based on one of the coordinates (it doesn't matter whether it's x or y).
The list should look like this:
MyList = [[['IMG_6094.jpg', [2773.0, 240.0]], --- thrown out because copy of first element ---],
[['IMG_6096.jpg', [1464.0, 444.0]], ['IMG_6096.jpg', [3043.0, 2358.0]]],
[['IMG_6088.jpg', [1115.5, 371.5]]],
[['IMG_6090.jpg', [3083.0, 1982.5]], --- thrown out because copy of first element---],
[['IMG_6093.jpg', [477.0, 481.0]], --- thrown out because e.g. abs(x-coordinates) < 30]
Could someone provide an elegant solution?
Thanks in advance!
For each image you build a list of integer coordinates that are close to the points you keep; then, for the other points on that image, you check whether their x or y falls in that list, and if so you add them to the remove list.
By the way, when removing by index, make sure to start from the largest indices so you don't shift the lower indices before removing them.
remove = []   # [i, j] indices of points flagged as duplicates or too close
close = 30    # tolerance

for i in range(len(MyList)):
    xList = []
    yList = []
    for j in range(len(MyList[i])):
        # flag the point if its x or y lies within `close` of a point already kept
        if int(MyList[i][j][1][0]) in xList or int(MyList[i][j][1][1]) in yList:
            remove.append([i, j])
            continue
        # otherwise keep it and record the coordinate ranges it blocks
        x = int(MyList[i][j][1][0])
        y = int(MyList[i][j][1][1])
        xList += list(range(x - close, x + close))
        yList += list(range(y - close, y + close))
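To actually drop the flagged points (a sketch following the advice above about starting from the largest indices):
# delete in reverse order so the earlier indices stay valid
for i, j in sorted(remove, reverse=True):
    del MyList[i][j]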

Is it possible to store some lists with different length in one dataframe?

I have some lists of different lengths and I want to store them in one dataframe.
list1=[('G06F', 'H04L'),('H04N','G06F')]
list2=[('E06F', 'T08L'),('H05M', 'H03D'),('A05V', 'N03D')]
list3=[('M04F', 'A01B')]
I have been trying to put these lists into a dataframe with one row per list.
I used list.append(), but that just nests the new list as a single element at the end of the other one:
list2.append(list1)
>>out:
[('E06F', 'T08L'), ('H05M', 'H03D'), ('A05V', 'N03D'), [('G06F', 'H04L'), ('H04N', 'G06F')]]
You can do it this way:
import pandas as pd

list1 = [('G06F', 'H04L'), ('H04N', 'G06F')]
list2 = [('E06F', 'T08L'), ('H05M', 'H03D'), ('A05V', 'N03D')]
list3 = [('M04F', 'A01B')]

list_combined = pd.DataFrame([[list1, list2, list3]]).T
list_combined
0 [(G06F, H04L), (H04N, G06F)]
1 [(E06F, T08L), (H05M, H03D), (A05V, N03D)]
2 [(M04F, A01B)]
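Equivalently (a small sketch, still assuming pandas is imported as pd), you can build the single column directly, using whatever column name you like, here 'lists':
list_combined = pd.DataFrame({'lists': [list1, list2, list3]})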
Use + to concatenate them into one flat list:
list1=[('G06F', 'H04L'),('H04N','G06F')]
list2=[('E06F', 'T08L'),('H05M', 'H03D'),('A05V', 'N03D')]
list3=[('M04F', 'A01B')]
lists = list1 + list2 + list3
print (lists)
The result is
[('G06F', 'H04L'), ('H04N', 'G06F'), ('E06F', 'T08L'), ('H05M', 'H03D'), ('A05V', 'N03D'), ('M04F', 'A01B')]

Group By Lists python

tests = ['test-2017-09-19-12-06',
         'test-2017-09-19-12-05',
         'test-2017-09-12-12-06',
         'test-2017-09-12-12-05',
         'test-2017-09-07-12-05',
         'test-2017-09-06-12-07']
So I have the above list; how can I group it so that I get a list like the one below:
[['test-2017-09-19-12-06','test-2017-09-19-12-05'],
['test-2017-09-12-12-06','test-2017-09-12-12-05'],
['test-2017-09-07-12-05'],
['test-2017-09-06-12-07']]
I did try the following code, but I get a different result, where each string becomes its own list rather than being grouped:
from itertools import groupby
print([list(j) for i, j in groupby(tests)])
Notice the slice 0:15: it covers the 'test-YYYY-MM-DD' prefix, so you can use it as the grouping key.
import pprint
from itertools import groupby

tests = ['test-2017-09-19-12-06',
         'test-2017-09-19-12-05',
         'test-2017-09-12-12-06',
         'test-2017-09-12-12-05',
         'test-2017-09-07-12-05',
         'test-2017-09-06-12-07']

pprint.pprint([list(j[1]) for j in groupby(tests, lambda i: i[0:15])])
[['test-2017-09-19-12-06', 'test-2017-09-19-12-05'],
['test-2017-09-12-12-06', 'test-2017-09-12-12-05'],
['test-2017-09-07-12-05'],
['test-2017-09-06-12-07']]
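Note that itertools.groupby only merges consecutive items, so this works because the list is already ordered by its date prefix. If it weren't, a sketch like this would sort by the same key first:
from itertools import groupby

key = lambda s: s[0:15]  # the 'test-YYYY-MM-DD' prefix
grouped = [list(g) for _, g in groupby(sorted(tests, key=key), key)]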
You can also split it into chunks of a fixed size:
tests = ['test-2017-09-19-12-06',
         'test-2017-09-19-12-05',
         'test-2017-09-12-12-05',
         'test-2017-09-12-12-05',
         'test-2017-09-07-12-05',
         'test-2017-09-06-12-07']
size = 2
[tests[i:i+size] for i in range(0, len(tests), size)]

modify sign of numbers in two different columns of an RDD in PySpark

I am working with PySpark and I have an RDD which, when printed, looks like this:
[(-10.1571, -2.361), (-19.2108, 6.99), (10.1571, 4.47695), (22.5611, 20.360), (13.1668, -2.88), ....]
As you can see, each element in this RDD is a pair of values. What I want to do is check whether the signs of the two values differ, and if so reverse the sign of the second value so it matches the first. For example, in (-19.2108, 6.99) the signs differ, so I want to turn 6.99 into -6.99 to match the sign of the first value. The signs in (-10.1571, -2.361) and (22.5611, 20.360) already match, so no sign is reversed there.
How can I do this?
If this is essentially just a Python list of tuples, you only need to check the sign of the first element; whatever the second element is, its sign just has to match the first:
l = [(-10.1571, -2.361), (-19.2108, 6.99), (10.1571, 4.47695), (22.5611, 20.360), (13.1668, -2.88)]
l[:] = [(a, -abs(b)) if a < 0 else (a, abs(b)) for a, b in l]
print(l)
Output:
[(-10.1571, -2.361), (-19.2108, -6.99), (10.1571, 4.47695), (22.5611, 20.36), (13.1668, 2.88)]
Looking at the docs, map should do the trick:
rdd1.map(lambda tup: (tup[0], -abs(tup[1])) if tup[0] < 0 else (tup[0], abs(tup[1])))
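Another option (a sketch, assuming the pairs are plain floats) is math.copysign, which returns the magnitude of its first argument with the sign of its second:
import math

# give the second value the sign of the first
rdd1.map(lambda tup: (tup[0], math.copysign(tup[1], tup[0])))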
