Group By Lists python - python

tests= ['test-2017-09-19-12-06',
'test-2017-09-19-12-05',
'test-2017-09-12-12-06',
'test-2017-09-12-12-05',
'test-2017-09-07-12-05',
'test-2017-09-06-12-07']
Given the list above, how can I group its items by date so that I get a list like the one below?
[['test-2017-09-19-12-06','test-2017-09-19-12-05'],
['test-2017-09-12-12-06','test-2017-09-12-12-05'],
['test-2017-09-07-12-05'],
['test-2017-09-06-12-07']]
I tried the following code, but each string ends up in its own list instead of being grouped:
from itertools import groupby
print([list(j) for i, j in groupby(tests)])

Use the slice 0:15 (the 'test-YYYY-MM-DD' prefix) as the grouping key; it determines which part of each string is used for grouping.
tests= ['test-2017-09-19-12-06',
'test-2017-09-19-12-05',
'test-2017-09-12-12-06',
'test-2017-09-12-12-05',
'test-2017-09-07-12-05',
'test-2017-09-06-12-07']
from itertools import groupby
import pprint
pprint.pprint([list(group) for _, group in groupby(tests, key=lambda s: s[0:15])])
[['test-2017-09-19-12-06', 'test-2017-09-19-12-05'],
['test-2017-09-12-12-06', 'test-2017-09-12-12-05'],
['test-2017-09-07-12-05'],
['test-2017-09-06-12-07']]
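itertools.groupby only merges adjacent items that share a key, so the approach above relies on the list already being ordered by date. The following is a minimal sketch, assuming every name ends in "-HH-MM" so the date part can be taken with rsplit instead of a hard-coded slice; the helper name date_key is made up for illustration.

```python
from itertools import groupby

tests = ['test-2017-09-19-12-06',
         'test-2017-09-19-12-05',
         'test-2017-09-12-12-06',
         'test-2017-09-12-12-05',
         'test-2017-09-07-12-05',
         'test-2017-09-06-12-07']


def date_key(name):
    # Drop the trailing "-HH-MM" (assumed format), keeping "test-YYYY-MM-DD".
    return name.rsplit('-', 2)[0]


# Sort first so that equal keys are adjacent; reverse=True keeps newest first.
ordered = sorted(tests, reverse=True)
grouped = [list(group) for _, group in groupby(ordered, key=date_key)]
print(grouped)
```

If the input is guaranteed to be ordered already, the sorted() call can be dropped.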

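You can split it into fixed-size chunks. Note that this groups by position rather than by date, so with size = 2 the last two entries end up in the same chunk: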
tests = ['test-2017-09-19-12-06',
'test-2017-09-19-12-05',
'test-2017-09-12-12-06',
'test-2017-09-12-12-05',
'test-2017-09-07-12-05',
'test-2017-09-06-12-07']
size = 2
[tests[i:i+size] for i in range(0, len(tests), size)]
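If the goal is one group per date regardless of how many entries each date has, a dictionary keyed on the date prefix is one alternative to fixed-size chunks. This is only a sketch; it assumes the first 15 characters identify the date, as in the groupby answer above.

```python
from collections import defaultdict

tests = ['test-2017-09-19-12-06',
         'test-2017-09-19-12-05',
         'test-2017-09-12-12-06',
         'test-2017-09-12-12-05',
         'test-2017-09-07-12-05',
         'test-2017-09-06-12-07']

groups = defaultdict(list)
for name in tests:
    # name[:15] is the assumed "test-YYYY-MM-DD" prefix.
    groups[name[:15]].append(name)

# Dictionaries keep insertion order (Python 3.7+), so groups appear
# in the order in which each date was first seen.
result = list(groups.values())
print(result)
```

Unlike groupby, this does not need the input to be sorted, because items with the same key are collected wherever they occur.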

Related

Data transformation Python

The input to my function is a list of lists like the one below. For each inner list, I want to divide every item after the third one by the third item. How do I do that?
So far I tried this:
def convert_data_percentages(data):
    lst = []
    for x in data:
        lst.append(int(x[2:]) // int(x[2]))
    return lst
convert_data_percentages(data, col_id=2)
[["'s-Gravenhage",
'GM0518',
'537833',
'266778',
'271055',
'92532',
'66099',
'162025',
'139157',
'78020',
'304766',
'162020',
'51430',
'19617'],
['Rotterdam',
'GM0599',
'644618',
'317935',
'326683',
'103680',
'86037',
'197424',
'159008',
'98469',
'367279',
'187835',
'62703',
'26801']]
Do you mean something like this?
data = [["'s-Gravenhage",
'GM0518',
'537833',
'266778',
'271055',
'92532',
'66099',
'162025',
'139157',
'78020',
'304766',
'162020',
'51430',
'19617'],
['Rotterdam',
'GM0599',
'644618',
'317935',
'326683',
'103680',
'86037',
'197424',
'159008',
'98469',
'367279',
'187835',
'62703',
'26801']]
# reformat original data, dropping first two items and converting to ints
# then extract third item and remaining subset into tuple
subsets = [(lst[:2] + to_int[:1], to_int[0], to_int[1:]) for lst in data if (to_int := [int(i) for i in lst[2:]])]
# walk through tuples generating a new list by dividing all items in the
# remaining subset by the third item
print([first_three + [i / third for i in subset] for first_three, third, subset in subsets])
returns:
[["'s-Gravenhage", 'GM0518', 537833, 0.4960238587070708, 0.5039761412929292, 0.1720459696597271, 0.1228987436620661, 0.30125522234596985, 0.2587364479308633, 0.14506361640137366, 0.5666554488103185, 0.3012459257799354, 0.09562447823023132, 0.03647414717951483],
['Rotterdam', 'GM0599', 644618, 0.49321458600287305, 0.506785413997127, 0.16083944289486177, 0.13346974487215685, 0.3062651058456326, 0.24667012090881732, 0.15275558547853146, 0.5697622467880202, 0.291389629206755, 0.09727156238268246, 0.04157656162254234]]
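The same transformation can also be written as a plain function that matches the call in the question, convert_data_percentages(data, col_id=2). This is a sketch under two assumptions: every value from col_id onward is an integer string, and true division (/) is wanted, since floor division (//) of a smaller integer by a larger one yields 0. The rows below are shortened samples of the data above.

```python
def convert_data_percentages(data, col_id=2):
    """Divide every value after column col_id by the value in column col_id."""
    result = []
    for row in data:
        base = int(row[col_id])
        # Fractions of the base value for all columns after col_id.
        ratios = [int(value) / base for value in row[col_id + 1:]]
        # Keep the leading labels and the base value, then the fractions.
        result.append(row[:col_id] + [base] + ratios)
    return result


# Shortened sample rows taken from the data above.
data = [["'s-Gravenhage", 'GM0518', '537833', '266778', '271055'],
        ['Rotterdam', 'GM0599', '644618', '317935', '326683']]

converted = convert_data_percentages(data, col_id=2)
for row in converted:
    print(row)
```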

Finding all possible permutations of a hash when given list of grouped elements

Best way to show what I'm trying to do:
I have a list of different hashes that consist of ordered elements, separated by underscores. Each element may or may not have other possible replacement values. I'm trying to generate a list of all possible combinations of this hash, after taking the replacement values into account.
Example:
grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]
original_hash = "1_2_3_4_5"
I want to be able to generate a list of the following hashes:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
"1a_2_3a_4_5",
"1b_2_3a_4_5",
]
The challenge is that this'll be needed on large dataframes.
So far here's what I have:
def return_all_possible_hashes(df, grouped_elements):
    rows_to_append = []
    for grouped_element in grouped_elements:
        for index, row in df[
            df["hash"].str.contains("|".join(grouped_element))
        ].iterrows():
            (element_used_in_hash,) = set(grouped_element) & set(row["hash"].split("_"))
            hash_used = row["hash"]
            replacement_elements = set(grouped_element) - set([element_used_in_hash])
            for replacement_element in replacement_elements:
                row["hash"] = hash_used.replace(
                    element_used_in_hash, replacement_element
                )
                rows_to_append.append(row)
    return df.append(rows_to_append)
But the problem is that this will only append hashes with all combinations of a given grouped_element, and not all combinations of all grouped_elements at the same time. So using the example above, my function would return:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
]
I feel like I'm not far from the solution, but I also feel stuck, so any help is much appreciated!
If you make a list of the original hash value's elements and replace each element with a list of all its possible variations, you can use itertools.product to get the Cartesian product across these sublists. Transforming each element of the result back to a string with '_'.join() will get you the list of possible hashes:
from itertools import product
def possible_hashes(original_hash, grouped_elements):
    hash_list = original_hash.split('_')
    variations = list(set().union(*grouped_elements))
    var_list = hash_list.copy()
    for i, h in enumerate(hash_list):
        if h in variations:
            for g in grouped_elements:
                if h in g:
                    var_list[i] = g
                    break
        else:
            var_list[i] = [h]
    return ['_'.join(h) for h in product(*var_list)]
possible_hashes("1_2_3_4_5", [["1", "1a", "1b"], ["3", "3a"]])
['1_2_3_4_5',
'1_2_3a_4_5',
'1a_2_3_4_5',
'1a_2_3a_4_5',
'1b_2_3_4_5',
'1b_2_3a_4_5']
To use this function on various original hash values stored in a dataframe column, you can do something like this:
df['hash'].apply(lambda x: possible_hashes(x, grouped_elements))
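The lookup step can be condensed with a dictionary that maps each element to its group. This is a sketch rather than a drop-in replacement: the name possible_hashes_lookup is made up here, and it assumes each element belongs to at most one group.

```python
from itertools import product


def possible_hashes_lookup(original_hash, grouped_elements):
    # Map every element to the group of values it may be replaced with.
    group_of = {element: group
                for group in grouped_elements
                for element in group}
    # Elements without a group only have themselves as an option.
    options = [group_of.get(part, [part])
               for part in original_hash.split('_')]
    # Cartesian product across the per-position options.
    return ['_'.join(combo) for combo in product(*options)]


hashes = possible_hashes_lookup("1_2_3_4_5", [["1", "1a", "1b"], ["3", "3a"]])
print(hashes)
```

Building the dictionary once and reusing it across rows would avoid repeating that work for every hash in a large column.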

How to break protein sequence into equal size of 20 in python

I want to break a protein sequence into chunks of 20 characters each.
The sequence looks something like this:
MASTEGANNMPKQVEVRMHDSHLGSEEPKHRHLGLRLCDKLGKNLLLTLTVFGVILGAVCGGLLRLASPI
HPDVVMLIAFPGDILMRMLKMLILPLIISSLITGLSGLDAKASGRLGTRAMVYYMSTTIIAAVLGVILVL
AIHPGNPKLKKQLGPGKKNDEVSSLDAFLDLIRNLFPENLVQACFQQIQTVTKKVLVAPPPDEEANATSA
VVSLLNETVTEVPEETKMVIKKGLEFKDGMNVLGLIGFFIAFGIAMGKMGDQAKLMVDFFNILNEIVMKL
VIMIMWYSPLGIACLICGKIIAIKDLEVVARQLGMYMVTVIIGLIIHGGIFLPLIYFVVTRKNPFSFFAG
IFQAWITALGTASSAGTLPVTFRCLEENLGIDKRVTRFVLPVGATINMDGTALYEAVAAIFIAQMNGVVL
DGGQIVTVSLTATLASVGAASIPSAGLVTMLLILTAVGLPTEDISLLVAVDWLLDRMRTSVNVVGDSFGA
GIVYHLSKSELDTIDSQHRVHEDIEMTKTQSIYDDMKNHRESNSNQCVYAAHNSVIVDECKVTLAANGKS
ADCSVEEEPWKREK
I tried iterating with a loop:
x="abfgjjhuyuryitfvbkjuhhgyuumnabcdfrfhghhoiutgfctrdgfvijnk"
length=len(x)
values= [length/20+1]
a=1
for i in length(a,x)
    print(i)
But this is not working.
Try this, using the textwrap module:
import textwrap
myArray="MASTEGANNMPKQVEVRMHDSHLGSEEPKHRHLGLRLCDKLGKNLLLTLTVFGVILGAVCGGLLRLASPIHPDVVMLIAFPGDILMRMLKMLILPLIISSLITGLSGLDAKASGRLGTRAMVYYMSTTIIAAVLGVILVLAIHPGNPKLKKQLGPGKKNDEVSSLDAFLDLIRNLFPENLVQACFQQIQTVTKKVLVAPPPDEEANATSAVVSLLNETVTEVPEETKMVIKKGLEFKDGMNVLGLIGFFIAFGIAMGKMGDQAKLMVDFFNILNEIVMKLVIMIMWYSPLGIACLICGKIIAIKDLEVVARQLGMYMVTVIIGLIIHGGIFLPLIYFVVTRKNPFSFFAGIFQAWITALGTASSAGTLPVTFRCLEENLGIDKRVTRFVLPVGATINMDGTALYEAVAAIFIAQMNGVVLDGGQIVTVSLTATLASVGAASIPSAGLVTMLLILTAVGLPTEDISLLVAVDWLLDRMRTSVNVVGDSFGAGIVYHLSKSELDTIDSQHRVHEDIEMTKTQSIYDDMKNHRESNSNQCVYAAHNSVIVDECKVTLAANGKSADCSVEEEPWKREK"
list_string = str(myArray)
textwrap.wrap(list_string, 20)
The output looks like this:
['MASTEGANNMPKQVEVRMHD',
'SHLGSEEPKHRHLGLRLCDK',
'LGKNLLLTLTVFGVILGAVC',
'GGLLRLASPIHPDVVMLIAF',
'PGDILMRMLKMLILPLIISS',
'LITGLSGLDAKASGRLGTRA',
'MVYYMSTTIIAAVLGVILVL',
'AIHPGNPKLKKQLGPGKKND',
'EVSSLDAFLDLIRNLFPENL',
'VQACFQQIQTVTKKVLVAPP',
'PDEEANATSAVVSLLNETVT',
'EVPEETKMVIKKGLEFKDGM',
'NVLGLIGFFIAFGIAMGKMG',
'DQAKLMVDFFNILNEIVMKL',
'VIMIMWYSPLGIACLICGKI',
'IAIKDLEVVARQLGMYMVTV',
'IIGLIIHGGIFLPLIYFVVT',
'RKNPFSFFAGIFQAWITALG',
'TASSAGTLPVTFRCLEENLG',
'IDKRVTRFVLPVGATINMDG',
'TALYEAVAAIFIAQMNGVVL',
'DGGQIVTVSLTATLASVGAA',
'SIPSAGLVTMLLILTAVGLP',
'TEDISLLVAVDWLLDRMRTS',
'VNVVGDSFGAGIVYHLSKSE',
'LDTIDSQHRVHEDIEMTKTQ',
'SIYDDMKNHRESNSNQCVYA',
'AHNSVIVDECKVTLAANGKS',
'ADCSVEEEPWKREK']
Something like this would do the trick:
values = [x[i:i+20] for i in range(0, len(x), 20)]
Just as a reference:
x[a:b] takes a slice of the string x from the index a up to (but not including) index b, therefore x[i:i+20] takes a slice of size twenty starting from index i.
range(a, b, step) would generate a sequence of numbers from a up to (but not including) b with increments of step.
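To make the two points above concrete, here is a small sketch that wraps the comprehension in a helper (the name chunks is made up for illustration) and prints the start indices produced by range alongside the resulting slices, using the sample string from the question.

```python
def chunks(sequence, size):
    """Return consecutive slices of sequence, each at most size long."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


x = "abfgjjhuyuryitfvbkjuhhgyuumnabcdfrfhghhoiutgfctrdgfvijnk"

# The start index of every chunk: 0, 20, 40, ...
starts = list(range(0, len(x), 20))
parts = chunks(x, 20)

print(starts)
print(parts)
```

Slicing past the end of a string is not an error in Python, so the last chunk is simply shorter when the length is not a multiple of the chunk size.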
You could use re.findall, after first removing all whitespace, e.g.:
import re
inp = """MASTEGANNMPKQVEVRMHDSHLGSEEPKHRHLGLRLCDKLGKNLLLTLTVFGVILGAVCGGLLRLASPI
HPDVVMLIAFPGDILMRMLKMLILPLIISSLITGLSGLDAKASGRLGTRAMVYYMSTTIIAAVLGVILVL
AIHPGNPKLKKQLGPGKKNDEVSSLDAFLDLIRNLFPENLVQACFQQIQTVTKKVLVAPPPDEEANATSA"""
inp = re.sub(r'\s+', '', inp)
chunks = re.findall(r'.{1,20}', inp)
The chunks list then contains:
['MASTEGANNMPKQVEVRMHD',
'SHLGSEEPKHRHLGLRLCDK',
'LGKNLLLTLTVFGVILGAVC',
'GGLLRLASPIHPDVVMLIAF',
'PGDILMRMLKMLILPLIISS',
'LITGLSGLDAKASGRLGTRA',
'MVYYMSTTIIAAVLGVILVL',
'AIHPGNPKLKKQLGPGKKND',
'EVSSLDAFLDLIRNLFPENL',
'VQACFQQIQTVTKKVLVAPP',
'PDEEANATSA']
You could use something like this:
# protein is the string containing the sequence
size_of_split = 20
splited_protein = list(map(''.join, zip(*[iter(protein)]*size_of_split)))
It uses zip() in a way that is described in its documentation. Note that zip() stops at the shortest input, so a trailing chunk shorter than size_of_split is dropped.
To quote the documentation:
[...] clustering a data series into n-length groups using zip(*[iter(s)]*n).
This repeats the same iterator n times so that each output tuple has
the result of n calls to the iterator. This has the effect of dividing
the input into n-length chunks.
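Because zip() stops as soon as one of its inputs is exhausted, the idiom above silently discards an incomplete final chunk. The sketch below contrasts it with itertools.zip_longest, which pads the last group instead; the protein value is a shortened sample and the variable names full_only and with_tail are made up for illustration.

```python
from itertools import zip_longest

# Shortened sample whose length is not a multiple of 20.
protein = "MASTEGANNMPKQVEVRMHDSHLGSEEPKHRHLGLRLCDKLGKNLL"
size_of_split = 20

# Plain zip stops at the shortest input, so the incomplete tail is lost.
full_only = list(map(''.join, zip(*[iter(protein)] * size_of_split)))

# zip_longest pads the last group; an empty-string fill keeps the tail as is.
with_tail = list(map(''.join,
                     zip_longest(*[iter(protein)] * size_of_split,
                                 fillvalue='')))

print(full_only)
print(with_tail)
```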

How to sum specific elements in an array

I want to sum elements in an array. For example I have an array
[183948, 218520, 243141, 224539, 205322, 203855, 233281, 244830, 281245,
280579, 235384, 183596, 106072, 88773, 63297, 38769, 28343]
I want to sum it in three different parts which are the first three elements, the next 10 elements and the rest.
My only idea is to separate the array and use sum method. Is there better way to do it?
Thanks in advance!
Try this:
arr=[183948, 218520, 243141, 224539, 205322, 203855, 233281, 244830, 281245,
280579, 235384, 183596, 106072, 88773, 63297, 38769, 28343]
first=arr[0:3]
second=arr[3:13]
last=arr[13:]
print(sum(first))
print(sum(second))
print(sum(last))
An alternative, more extensible version is as follows:
arr=[183948, 218520, 243141, 224539, 205322, 203855, 233281, 244830, 281245,
280579, 235384, 183596, 106072, 88773, 63297, 38769, 28343]
indices=[3,13]
results=[]
prev=0
for i in indices:
    results.append(sum(arr[prev:i]))
    prev=i
results.append(sum(arr[prev:]))
for res in results:
    print(res)
Note: set prev to the index you want to start from, in this case 0.
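The loop above can also be expressed by pairing consecutive cut points with zip. This is a minimal sketch; the function name segment_sums and its parameter names are made up for illustration.

```python
def segment_sums(values, boundaries):
    """Sum values[start:stop] for each pair of consecutive cut points."""
    # Add the implicit first and last cut points.
    cuts = [0, *boundaries, len(values)]
    return [sum(values[start:stop]) for start, stop in zip(cuts, cuts[1:])]


arr = [183948, 218520, 243141, 224539, 205322, 203855, 233281, 244830, 281245,
       280579, 235384, 183596, 106072, 88773, 63297, 38769, 28343]

totals = segment_sums(arr, [3, 13])
print(totals)
```

Passing a different boundaries list changes the segments without touching the function body.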
You can use the reduceat method of np.add:
import numpy as np
data = [183948, 218520, 243141, 224539, 205322, 203855, 233281, 244830, 281245,
280579, 235384, 183596, 106072, 88773, 63297, 38769, 28343]
sizes = 3, 10
np.add.reduceat(data, np.cumsum([0, *sizes]))
# array([ 645609, 2198703, 219182])

How to filter elements of Cartesian product following specific ordering conditions

I have to generate multiple reactions with different variables. Each has three elements; let's call them B, S and H. They all start with B1. S can be appended only if there is at least one B, so B1S1, B2S2 or B2S1 are allowed, but not B1S2. The same goes for H: B1S1H1, B2S2H1 or B4S1H1, but never B2S2H3. The final variation would be B5S5H5. I tried itertools.product, but I don't know how to get rid of the elements that don't match my condition or how to add the next element. Here is my code:
import itertools
a = list(itertools.product([1, 2, 3, 4], repeat=4))
#print (a)
met = open('random_dat.dat', 'w')
met.write('Reactions')
met.write('\n')
for i in range(1,256):
    met.write('\n')
    met.write('%s: B%sS%sH%s -> B%sS%sH%s' %(i, a[i][3], a[i][2], a[i][1], a[i][3], a[i][2], a[i][1]))
    met.write('\n')
met.close()
Simple for loops will do what you want:
bsh = []
for b in range(1,6):
    for s in range(1,b+1):
        for h in range(1,b+1):
            bsh.append( f"B{b}S{s}H{h}" )
print(bsh)
Output:
['B1S1H1', 'B2S1H1', 'B2S1H2', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S1H2', 'B3S1H3',
'B3S2H1', 'B3S2H2', 'B3S2H3', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S1H2',
'B4S1H3', 'B4S1H4', 'B4S2H1', 'B4S2H2', 'B4S2H3', 'B4S2H4', 'B4S3H1', 'B4S3H2',
'B4S3H3', 'B4S3H4', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S1H2',
'B5S1H3', 'B5S1H4', 'B5S1H5', 'B5S2H1', 'B5S2H2', 'B5S2H3', 'B5S2H4', 'B5S2H5',
'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S3H4', 'B5S3H5', 'B5S4H1', 'B5S4H2', 'B5S4H3',
'B5S4H4', 'B5S4H5', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
Thanks to @mikuszefski for pointing out improvements.
Patrick's answer in list-comprehension style (note that range(1,5) stops at B4; use range(1,6) to reach B5S5H5):
bsh = [f"B{b}S{s}H{h}" for b in range(1,5) for s in range(1,b+1) for h in range(1,b+1)]
This gives:
['B1S1H1',
'B2S1H1',
'B2S1H2',
'B2S2H1',
'B2S2H2',
'B3S1H1',
'B3S1H2',
'B3S1H3',
'B3S2H1',
'B3S2H2',
'B3S2H3',
'B3S3H1',
'B3S3H2',
'B3S3H3',
'B4S1H1',
'B4S1H2',
'B4S1H3',
'B4S1H4',
'B4S2H1',
'B4S2H2',
'B4S2H3',
'B4S2H4',
'B4S3H1',
'B4S3H2',
'B4S3H3',
'B4S3H4',
'B4S4H1',
'B4S4H2',
'B4S4H3',
'B4S4H4']
I would implement your "use itertools.product and get rid of the unnecessary elements" idea the following way:
import itertools
a = list(itertools.product([1,2,3,4,5],repeat=3))
a = [i for i in a if (i[1]<=i[0] and i[2]<=i[1] and i[2]<=i[0])]
Note that I assumed each index must be less than or equal to the one before it (H <= S <= B). a is now a list of 35 tuples, each holding 3 ints, so you need to turn them into strings, for example with f-strings:
a = [f"B{i[0]}S{i[1]}H{i[2]}" for i in a]
print(a)
output:
['B1S1H1', 'B2S1H1', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S2H1', 'B3S2H2', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S2H1', 'B4S2H2', 'B4S3H1', 'B4S3H2', 'B4S3H3', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S2H1', 'B5S2H2', 'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S4H1', 'B5S4H2', 'B5S4H3', 'B5S4H4', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
You can also use another formatting method instead of f-strings if you wish.
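Under the same H <= S <= B assumption, the valid tuples can also be generated directly instead of being filtered out of the full product. The sketch below uses itertools.combinations_with_replacement, which yields non-decreasing tuples, and compares the result with the filtered product; the variable names are made up for illustration.

```python
from itertools import combinations_with_replacement, product

# Filter the full product, keeping H <= S <= B (the assumption made above).
filtered = [t for t in product(range(1, 6), repeat=3)
            if t[2] <= t[1] <= t[0]]

# combinations_with_replacement yields non-decreasing tuples (h, s, b);
# reversing each one gives (b, s, h) without generating rejected tuples.
direct = sorted(t[::-1]
                for t in combinations_with_replacement(range(1, 6), 3))

labels = [f"B{b}S{s}H{h}" for b, s, h in direct]
print(len(labels))
print(labels[:4])
```

This only applies when the condition is a simple ordering like the one assumed here; a looser rule such as "S and H may each be anything up to B" still needs the nested loops or a filter.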
