Grouping Similar Strings in a long string [Python]

Grouping Similar Strings in a long string [Python] - python

I have four or five strings that all have a sequence, i want to group them in a list.
For example:
cake/1/
cake/2/
big/1/
nice/1/
cake/3/
I need the cakes in a list, the big in a list and the nice in a list
Here's what i've tried.
res = [list(i) for j, i in groupby(y, lambda a: a.split('/')[0])]
This didn't work, I thought of using regex but i'm not sure if that's anything to move forward through.
Here's the expected output
[['cake/1/', 'cake/2/', 'cake/3/'], ['big/1/'], ['nice/1/']]

You was nearly right:
groupby is changing the group each time the key is changing:
1112111 will be grouped as: 111 - 2 - 111
So if you want to guarantee that all your groups will be joined in one, you should sort your list of strings first so all same first-words-strings will be nearby and will not be splitted by another first-words-strings:
y = [
'cake/1/',
'cake/2/',
'big/1/',
'nice/1/',
'cake/3/'
]
res = [list(i) for j, i in groupby(sorted(y), lambda a: a.split('/')[0])]
^
|
HERE --------------------------------+
[['big/1/'], ['cake/1/', 'cake/2/', 'cake/3/'], ['nice/1/']]

Related

Finding all possible permutations of a hash when given list of grouped elements

Best way to show what I'm trying to do:
I have a list of different hashes that consist of ordered elements, seperated by an underscore. Each element may or may not have other possible replacement values. I'm trying to generate a list of all possible combinations of this hash, after taking into account replacement values.
Example:
grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]
original_hash = "1_2_3_4_5"
I want to be able to generate a list of the following hashes:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
"1a_2_3a_4_5",
"1b_2_3a_4_5",
]
The challenge is that this'll be needed on large dataframes.
So far here's what I have:
def return_all_possible_hashes(df, grouped_elements)
rows_to_append = []
for grouped_element in grouped_elements:
for index, row in enriched_routes[
df["hash"].str.contains("|".join(grouped_element))
].iterrows():
(element_used_in_hash,) = set(grouped_element) & set(row["hash"].split("_"))
hash_used = row["hash"]
replacement_elements = set(grouped_element) - set([element_used_in_hash])
for replacement_element in replacement_elements:
row["hash"] = stop_hash_used.replace(
element_used_in_hash, replacement_element
)
rows_to_append.append(row)
return df.append(rows_to_append)
But the problem is that this will only append hashes with all combinations of a given grouped_element, and not all combinations of all grouped_elements at the same time. So using the example above, my function would return:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
]
I feel like I'm not far from the solution, but I also feel stuck, so any help is much appreciated!

If you make a list of the original hash value's elements and replace each element with a list of all its possible variations, you can use itertools.product to get the Cartesian product across these sublists. Transforming each element of the result back to a string with '_'.join() will get you the list of possible hashes:
from itertools import product
def possible_hashes(original_hash, grouped_elements):
hash_list = original_hash.split('_')
variations = list(set().union(*grouped_elements))
var_list = hash_list.copy()
for i, h in enumerate(hash_list):
if h in variations:
for g in grouped_elements:
if h in g:
var_list[i] = g
break
else:
var_list[i] = [h]
return ['_'.join(h) for h in product(*var_list)]
possible_hashes("1_2_3_4_5", [["1", "1a", "1b"], ["3", "3a"]])
['1_2_3_4_5',
'1_2_3a_4_5',
'1a_2_3_4_5',
'1a_2_3a_4_5',
'1b_2_3_4_5',
'1b_2_3a_4_5']
To use this function on various original hash values stored in a dataframe column, you can do something like this:
df['hash'].apply(lambda x: possible_hashes(x, grouped_elements))

How to filter elements of Cartesian product following specific ordering conditions

I have to generate multiple reactions with different variables. They have 3 elements. Let's call them B, S and H. And they all start with B1. S can be appended to the element if there is at least one B. So it can be B1S1 or B2S2 or B2S1 etc... but not B1S2. The same goes for H. B1S1H1 or B2S2H1 or B4S1H1 but never B2S2H3. The final variation would be B5S5H5. I tried with itertools.product. But I don't know how to get rid of the elements that don't match my condition and how to add the next element. Here is my code:
import itertools
a = list(itertools.product([1, 2, 3, 4], repeat=4))
#print (a)
met = open('random_dat.dat', 'w')
met.write('Reactions')
met.write('\n')
for i in range(1,256):
met.write('\n')
met.write('%s: B%sS%sH%s -> B%sS%sH%s' %(i, a[i][3], a[i][2], a[i][1], a[i][3], a[i][2], a[i][1]))
met.write('\n')
met.close()

Simple for loops will do what you want:
bsh = []
for b in range(1,6):
for s in range(1,b+1):
for h in range(1,b+1):
bsh.append( f"B{b}S{s}H{h}" )
print(bsh)
Output:
['B1S1H1', 'B2S1H1', 'B2S1H2', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S1H2', 'B3S1H3',
'B3S2H1', 'B3S2H2', 'B3S2H3', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S1H2',
'B4S1H3', 'B4S1H4', 'B4S2H1', 'B4S2H2', 'B4S2H3', 'B4S2H4', 'B4S3H1', 'B4S3H2',
'B4S3H3', 'B4S3H4', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S1H2',
'B5S1H3', 'B5S1H4', 'B5S1H5', 'B5S2H1', 'B5S2H2', 'B5S2H3', 'B5S2H4', 'B5S2H5',
'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S3H4', 'B5S3H5', 'B5S4H1', 'B5S4H2', 'B5S4H3',
'B5S4H4', 'B5S4H5', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
Thanks to #mikuszefski for pointing out improvements.

Patrick his answer in list comprehension style
bsh = [f"B{b}S{s}H{h}" for b in range(1,5) for s in range(1,b+1) for h in range(1,b+1)]
Gives
['B1S1H1',
'B2S1H1',
'B2S1H2',
'B2S2H1',
'B2S2H2',
'B3S1H1',
'B3S1H2',
'B3S1H3',
'B3S2H1',
'B3S2H2',
'B3S2H3',
'B3S3H1',
'B3S3H2',
'B3S3H3',
'B4S1H1',
'B4S1H2',
'B4S1H3',
'B4S1H4',
'B4S2H1',
'B4S2H2',
'B4S2H3',
'B4S2H4',
'B4S3H1',
'B4S3H2',
'B4S3H3',
'B4S3H4',
'B4S4H1',
'B4S4H2',
'B4S4H3',
'B4S4H4']

I would implement your "use itertools.product and get rid off unnecessary elements" solution following way:
import itertools
a = list(itertools.product([1,2,3,4,5],repeat=3))
a = [i for i in a if (i[1]<=i[0] and i[2]<=i[1] and i[2]<=i[0])]
Note that I assumed last elements needs to be smaller or equal than any other. Note that a is now list of 35 tuples each holding 3 ints. So you need to made strs of them for example using so-called f-string:
a = [f"B{i[0]}S{i[1]}H{i[2]}" for i in a]
print(a)
output:
['B1S1H1', 'B2S1H1', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S2H1', 'B3S2H2', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S2H1', 'B4S2H2', 'B4S3H1', 'B4S3H2', 'B4S3H3', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S2H1', 'B5S2H2', 'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S4H1', 'B5S4H2', 'B5S4H3', 'B5S4H4', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
However you might also use another methods of formatting instead of f-string if you wish.

modify sign of numbers in two different columns of an RDD in PySpark

I am working on PySpark and I have a RDD which when printed looks like this:
[(-10.1571, -2.361), (-19.2108, 6.99), (10.1571, 4.47695), (22.5611, 20.360), (13.1668, -2.88), ....]
As you can see each element in this RDD has two data. Now what I want to do is check if the signs of two data are different then reverse the sign of 2nd data to match the first data. For example - in (-19.2108, 6.99) the signs of two data are different so I want to reverse the sign of 6.99 to make it -6.99 so that it matches sign of 1st data. But sign of data in (-10.1571, -2.361) and in (22.5611, 20.360) are same so no sign reversal in them.
How can I do this?

If this is just essentially a python list of tuples, just check if the first element, you don't actually care what the second is just just need to match the first :
l = [(-10.1571, -2.361), (-19.2108, 6.99), (10.1571, 4.47695), (22.5611, 20.360), (13.1668, -2.88)]
l[:] = [(a, -abs(b)) if a < 0 else (a, abs(b))for a, b in l]
print(l)
Output:
[(-10.1571, -2.361), (-19.2108, -6.99), (10.1571, 4.47695), (22.5611, 20.36), (13.1668, 2.88)]
Looking at the docs map might do the trick:
rdd1.map(lambda tup: (tup[0], -abs(tup[1])) if tup[0] < 0 else (tup[0], abs(tup[1])))

Sorting with two digits in string - Python

I am new to Python and I have a hard time solving this.
I am trying to sort a list to be able to human sort it 1) by the first number and 2) the second number. I would like to have something like this:
'1-1bird'
'1-1mouse'
'1-1nmouses'
'1-2mouse'
'1-2nmouses'
'1-3bird'
'10-1birds'
(...)
Those numbers can be from 1 to 99 ex: 99-99bird is possible.
This is the code I have after a couple of headaches. Being able to then sort by the following first letter would be a bonus.
Here is what I've tried:
#!/usr/bin/python
myList = list()
myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
def mysort(x,y):
x1=""
y1=""
for myletter in x :
if myletter.isdigit() or "-" in myletter:
x1=x1+myletter
x1 = x1.split("-")
for myletter in y :
if myletter.isdigit() or "-" in myletter:
y1=y1+myletter
y1 = y1.split("-")
if x1[0]>y1[0]:
return 1
elif x1[0]==y1[0]:
if x1[1]>y1[1]:
return 1
elif x1==y1:
return 0
else :
return -1
else :
return -1
myList.sort(mysort)
print myList
Thanks !
Martin

You have some good ideas with splitting on '-' and using isalpha() and isdigit(), but then we'll use those to create a function that takes in an item and returns a "clean" version of the item, which can be easily sorted. It will create a three-digit, zero-padded representation of the first number, then a similar thing with the second number, then the "word" portion (instead of just the first character). The result looks something like "001001bird" (that won't display - it'll just be used internally). The built-in function sorted() will use this callback function as a key, taking each element, passing it to the callback, and basing the sort order on the returned value. In the test, I use the * operator and the sep argument to print it without needing to construct a loop, but looping is perfectly fine as well.
def callback(item):
phrase = item.split('-')
first = phrase[0].rjust(3, '0')
second = ''.join(filter(str.isdigit, phrase[1])).rjust(3, '0')
word = ''.join(filter(str.isalpha, phrase[1]))
return first + second + word
Test:
>>> myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
>>> print(*sorted(myList, key=callback), sep='\n')
1-1bird
1-1cat
1-1mouse
1-1nmouses
1-1person
1-2bird
1-2cat
1-2mouse
1-2nmouses
1-2person
1-3bird
1-3cat
1-3mouse
1-3nmouses
1-3person
1-10bird
1-10cat
1-10mouse
1-10nmouses
1-10person
1-11bird
1-11cat
1-11mouse
1-11nmouses
1-11person
1-12bird
1-12mouse
1-12nmouses
1-12person
1-13mouse
1-13nmouses
1-13person
1-14bird
1-14cat
1-14mouse
1-14nmouses
1-14person
1-15cat
2-1bird
2-1cat
2-1mouse
2-1nmouses
2-1person
2-2bird
2-2mouse
2-2nmouses
2-2person
2-14cat
2-15cat
2-16cat

You need leading zeros. Strings are sorted alphabetically with the order different from the one for digits. It should be
'01-1bird'
'01-1mouse'
'01-1nmouses'
'01-2mouse'
'01-2nmouses'
'01-3bird'
'10-1birds'
As you you see 1 goes after 0.

The other answers here are very respectable, I'm sure, but for full credit you should ensure that your answer fits on a single line and uses as many list comprehensions as possible:
import itertools
[''.join(r) for r in sorted([[''.join(x) for _, x in
itertools.groupby(v, key=str.isdigit)]
for v in myList], key=lambda v: (int(v[0]), int(v[2]), v[3]))]
That should do nicely:
['1-1bird',
'1-1cat',
'1-1mouse',
'1-1nmouses',
'1-1person',
'1-2bird',
'1-2cat',
'1-2mouse',
...
'2-2person',
'2-14cat',
'2-15cat',
'2-16cat']

Confused by output from simple Python loop

I have a list of items and want to remove all items containing the number 16 (here a string, ind='16'). The list has six items with '16' in and yet my loop consistently only removes five of them. Puzzled (I even ran it on two separate machines)!
lines=['18\t4', '8\t5', '16\t5', '19\t6', '15\t7', '5\t8', '16\t8', '21\t8', '20\t12', '22\t13', '7\t15', '5\t16', '8\t16', '21\t16', '4\t18', '6\t19', '12\t20', '8\t21', '16\t21', '13\t22']
ind='16'
for query in lines:
if ind in query:
lines.remove(query)
Subsequently, typing 'lines' gives me: ['18\t4', '8\t5', '19\t6', '15\t7', '5\t8', '21\t8', '20\t12', '22\t13', '7\t15', '8\t16', '4\t18', '6\t19', '12\t20', '8\t21', '13\t22']
i.e. the item '8\t16' is still in the list???
Thank you
Clive

It's a bad idea to modify the list you are iterating over, as removing an item can confuse the iterator. Your example can be handled with a simple list comprehension, which just creates a new list to assign to the original name.
lines = [query for query in lines if ind not in query]
Or, use the filter function:
# In Python 2, you can omit the list() wrapper
lines = list(filter(lambda x: ind not in x, lines))

Note: Never modify list while looping
lines=['18\t4', '8\t5', '16\t5', '19\t6', '15\t7', '5\t8', '16\t8', '21\t8', '20\t12', '22\t13', '7\t15', '5\t16', '8\t16', '21\t16', '4\t18', '6\t19', '12\t20', '8\t21', '16\t21', '13\t22']
ind = '16'
new_lines = [ x for x in lines if ind not in x ]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Grouping Similar Strings in a long string [Python] - python

Related

Finding all possible permutations of a hash when given list of grouped elements

How to filter elements of Cartesian product following specific ordering conditions

modify sign of numbers in two different columns of an RDD in PySpark

Sorting with two digits in string - Python

Confused by output from simple Python loop

Categories

Resources