Ordering a string by its substring numerical value in python - python

I have a list of strings that needs to be sorted in numerical order, using two substrings of each string as integer keys.
Calling sort() orders my strings alphabetically, so I get 1, 10, 2, ..., which is not what I'm looking for.
Searching around, I found that a key parameter can be passed to sort(), and that sort(key=int) should do the trick; but since my key is a substring and not the whole string, the conversion to int would fail.
Supposing my strings are something like:
test1txtfgf10
test1txtfgg2
test2txffdt3
test2txtsdsd1
I want my list to be ordered in numeric order on the basis of the first integer and then on the second, so I would have:
test1txtfgg2
test1txtfgf10
test2txtsdsd1
test2txffdt3
I think I could extract the integer values, sort only them while keeping track of which string they belong to, and then order the strings accordingly, but I was wondering if there's a more efficient and elegant way to do this.
Thanks in advance

Try the following
In [26]: import re
In [27]: f = lambda x: [int(x) for x in re.findall(r'\d+', x)]
In [28]: sorted(strings, key=f)
Out[28]: ['test1txtfgg2', 'test1txtfgf10', 'test2txtsdsd1', 'test2txffdt3']
This uses regex (the re module) to find all integers in each string, then compares the resulting lists. For example, f('test1txtfgg2') returns [1, 2], which is then compared against other lists.
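To make the list comparison concrete, here is a minimal sketch that prints the key computed for each string before sorting. The name numeric_key is only illustrative (it is the lambda above written as a named function), and the input is the sample list from the question.

```python
import re

def numeric_key(s):
    # All runs of digits in s, converted to ints, e.g. 'test1txtfgf10' -> [1, 10]
    return [int(n) for n in re.findall(r'\d+', s)]

strings = ['test1txtfgf10', 'test1txtfgg2', 'test2txffdt3', 'test2txtsdsd1']

for s in strings:
    # Show the list that sorted() will compare for this string
    print(s, numeric_key(s))

# Lists compare element by element, so the first integer decides
# and the second one only breaks ties
result = sorted(strings, key=numeric_key)
print(result)
```

Because the key is a list of ints, [1, 2] sorts before [1, 10], which is the numeric order the question asks for.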

Extract the numeric parts and sort using them
import re
d = """test1txtfgf10
test1txtfgg2
test2txffdt3
test2txtsdsd1"""
lines = d.split("\n")
re_numeric = re.compile(r"^\D+(\d+)\D+(\d+)$")
def key(line):
    """Returns a tuple (n1, n2) of the numeric parts of line."""
    m = re_numeric.match(line)
    if m:
        # group(n) returns the n-th captured substring
        return (int(m.group(1)), int(m.group(2)))
    # Non-matching lines sort first; None cannot be compared with tuples in Python 3
    return (-1, -1)
lines.sort(key=key)
Now lines are
['test1txtfgg2', 'test1txtfgf10', 'test2txtsdsd1', 'test2txffdt3']

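import re
k = [
    "test1txtfgf10",
    "test1txtfgg2",
    "test2txffdt3",
    "test2txtsdsd1"
]
# Pair each string with the list of integers found between its lowercase letters
tmp = [([int(e) for e in re.split("[a-z]", el) if e], el) for el in k]
# sorted() returns a new list, so the result has to be assigned back
tmp = sorted(tmp, key=lambda pair: pair[0])
tmp = [res for cm, res in tmp]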

Related

Sorting a list of strings based on numeric order of numeric part

I have a list of strings that may contain digits. I would like to sort this list alphabetically, but every time the String contains a number, I want it to be sorted by value.
For example, if the list is
['a1a','b1a','a10a','a5b','a2a'],
the sorted list should be
['a1a','a2a','a5b','a10a','b1a']
In general I want to treat each number (a sequence of digits) in the string as a special character, which is smaller than any letter and can be compared numerically to other numbers.
Is there any python function which does this compactly?
You could use the re module to split each string into a sequence of characters, grouping each run of digits into one single element, with a pattern such as r'(\d+)|(.)'. The good news with this regex is that it returns the numeric and non-numeric parts in separate groups.
As a simple key, we could use:
import re
def key(x):
    # the empty string sorts before any letter, so numbers come before letters
    return [(j, int(i)) if i != '' else (j, i)
            for i, j in re.findall(r'(\d+)|(.)', x)]
Demo:
lst = ['a1a', 'a2a', 'a5b', 'a10a', 'b1a', 'abc']
print(sorted(lst, key=key))
gives:
['a1a', 'a2a', 'a5b', 'a10a', 'abc', 'b1a']
For more efficient processing, we could compile the regex only once, in a closure:
def build_key():
    rx = re.compile(r'(\d+)|(.)')
    def key(x):
        return [(j, int(i)) if i != '' else (j, i)
                for i, j in rx.findall(x)]
    return key
and use it this way:
sorted(lst, key=build_key())
giving of course the same output.
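As an illustration of why the tuple comparison works, the sketch below prints the key produced for one string. The name natural_key is hypothetical; the body is the same expression as the key function above.

```python
import re

def natural_key(x):
    # Digit runs become ('', number); any other character becomes (char, '')
    return [(j, int(i)) if i != '' else (j, i)
            for i, j in re.findall(r'(\d+)|(.)', x)]

example = natural_key('a10a')
print(example)

# ('', 10) has '' as its first element, and '' is smaller than any letter,
# so a number always sorts before a letter at the same position
ordered = sorted(['a1a', 'b1a', 'a10a', 'a5b', 'a2a'], key=natural_key)
print(ordered)
```

Two numbers at the same position both start with '', so the comparison falls through to the integer values, and an int is never compared against a string.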

python efficient way to compare nested lists and append matches to new list

I wish to compare two nested lists. If there is a match between the first element of each sublist, I wish to add the matched element to a new list for further operations. Below is an example and what I've tried so far:
Example:
x = [['item1','somethingelse1'], ['item2', 'somethingelse2']...]
y = [['item1','somethingelse3'], ['item3','somethingelse4']...]
What I've tried so far:
match = []
for itemx in x:
    for itemy in y:
        if itemx[0] == itemy[0]:
            match.append(itemx)
This does append the matched items to the new list, but both nested lists are very long and the nested loop is very slow on them. Is there a more efficient way to get the matching items from two nested lists?
Yes, use a data structure with constant-time membership testing. So, using a set, for example:
seen = set()
for first, _ in x:
    seen.add(first)
matched = []
for first, _ in y:
    if first in seen:
        matched.append(first)
Or, more succinctly using set/list comprehensions:
seen = {first for first,_ in x}
matched = [first for first,_ in y if first in seen]
(This was before the OP changed the question from append(itemx[0]) to append(itemx)...)
>>> {a[0] for a in x} & {b[0] for b in y}
{'item1'}
Or if the inner lists are always pairs:
>>> dict(x).keys() & dict(y)
{'item1'}
IIUC using numpy:
import numpy as np
y=[l[0] for l in y]
x=np.array(x)
x[np.isin(x[:, 0], y)]
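If the whole sublists from x are wanted, as in the question's loop, a pure-Python variant of the set approach might look like the sketch below. The sample data is the question's example with the elided "..." entries left out, and the name y_keys is only illustrative.

```python
x = [['item1', 'somethingelse1'], ['item2', 'somethingelse2']]
y = [['item1', 'somethingelse3'], ['item3', 'somethingelse4']]

# First elements of y, collected once so each lookup takes constant time on average
y_keys = {item[0] for item in y}

# Keep each whole sublist of x whose first element also appears in y
match = [item for item in x if item[0] in y_keys]
print(match)
```

Unlike the nested loop, this appends each sublist of x at most once, even if its first element occurs several times in y.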

Can I include multiple statements when creating a one-line for loop?

I have an array I want to iterate through. The array consists of strings consisting of numbers and signs.
like this: €110.5M
I want to loop over it and remove all Euro sign and also the M and return that array with the strings as ints.
How would I do this knowing that the array is a column in a table?
You could just strip the characters,
>>> x = '€110.5M'
>>> x.strip('€M')
'110.5'
def sanitize_string(ss):
    ss = ss.replace('$', '').replace('€', '').lower()
    if 'm' in ss:
        res = float(ss.replace('m', '')) * 1000000
    elif 'k' in ss:
        res = float(ss.replace('k', '')) * 1000
    else:
        # no suffix: take the number as-is
        res = float(ss)
    return int(res)
This can be applied to a list as follows:
>>> ls = [sanitize_string(x) for x in ["€3.5M", "€15.7M" , "€167M"]]
>>> ls
[3500000, 15700000, 167000000]
If you want to apply it to the column of a table instead:
dataFrame['price'] = dataFrame['price'].apply(sanitize_string) # Assuming you're using a pandas DataFrame and the column is called 'price'
You can use a list comprehension:
numbers = [float(p.replace('€','').replace('M','')) for p in a]
which gives:
[110.5, 210.5, 310.5]
You can use a list comprehension to construct one list from another:
foo = ["€13.5M", "€15M" , "€167M"]
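foo_cleaned = [value.translate(str.maketrans("", "", "€M")) for value in foo]
In Python 3, str.maketrans called with three arguments builds a table that deletes every character in its third argument, and str.translate applies that table to each string.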
Try this
arr = ["€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M"]
f = [x.replace("€","").replace("M","") for x in arr]
You can call .replace() on a string as often as you like. An initial solution could be something like this:
my_array = ['€110.5M', '€111.5M', '€112.5M']
my_cleaned_array = []
for elem in my_array:
    my_cleaned_array.append(elem.replace('€', '').replace('M', ''))
At this point, you still have strings in your array. If you want to return them as ints, note that int() cannot parse a string such as '110.5' directly, so you can write int(float(elem.replace('€', '').replace('M', ''))) instead. But be aware that you will then lose everything after the decimal point, i.e. you will end up with [110, 111, 112].
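A minimal sketch of the conversion step, assuming every value has the form of a euro sign, a number, and a trailing M. The helper name to_number is hypothetical. Since int('110.5') raises a ValueError, the string is parsed with float first.

```python
my_array = ['€110.5M', '€111.5M', '€112.5M']

def to_number(text):
    # Drop the euro sign and the trailing 'M', then parse what is left
    return float(text.replace('€', '').replace('M', ''))

as_floats = [to_number(elem) for elem in my_array]

# int() on a float truncates toward zero, so the fractional part is lost
as_ints = [int(value) for value in as_floats]

print(as_floats)
print(as_ints)
```

Keeping the floats is probably the safer choice unless whole numbers are really required.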
You can use a regex to do that.
import re
s = "€110.5M"
x = re.findall(r"-?\d+\.\d+", s)
print(x)
I didn't quite understand the second part of the question.

Pair strings in list based on containing text in Python

I'm looking to take a list of strings and create a list of tuples that groups items based on whether they contain the same text.
For example, say I have the following list:
MyList=['Apple1','Pear1','Apple3','Pear2']
I want to pair them based on all but the last character of their string, so that I would get:
ListIWant=[('Apple1','Apple3'),('Pear1','Pear2')]
We can assume that only the last character of the string is used to identify. Meaning I'm looking to group the strings by the following unique values:
>>> list(set([x[:-1] for x in MyList]))
['Pear', 'Apple']
In [69]: from itertools import groupby
In [70]: MyList=['Apple1','Pear1','Apple3','Pear2']
In [71]: [tuple(v) for k, v in groupby(sorted(MyList, key=lambda x: x[:-1]), lambda x: x[:-1])]
Out[71]: [('Apple1', 'Apple3'), ('Pear1', 'Pear2')]
Consider this code:
def alphagroup(lst):
    results = {}
    for i in lst:
        letter = i[0].lower()
        if letter not in results:
            results[letter] = [i]
        else:
            results[letter].append(i)
    output = []
    for k in results:
        output.append(results[k])
    return output
arr = ["Apple1", "Pear", "Apple2", "Pack"]
print(alphagroup(arr))
This groups the strings by their first letter. If each group must be a tuple, use the tuple() builtin to convert it. Hope this helps; I tested the code.
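For grouping on everything but the last character, as the question describes, one possible sketch uses collections.defaultdict and needs no prior sorting. The function name pair_by_prefix is hypothetical.

```python
from collections import defaultdict

def pair_by_prefix(items):
    # Map each prefix (the string minus its last character) to its members
    groups = defaultdict(list)
    for item in items:
        groups[item[:-1]].append(item)
    # Dicts keep insertion order (Python 3.7+), so groups appear in order of first occurrence
    return [tuple(group) for group in groups.values()]

MyList = ['Apple1', 'Pear1', 'Apple3', 'Pear2']
ListIWant = pair_by_prefix(MyList)
print(ListIWant)
```

This makes a single pass over the list, whereas the groupby version has to sort first.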

python intersect of dict items

Suppose I have a dict like:
aDict[1] = '3,4,5,6,7,8'
aDict[5] = '5,6,7,8,9,10,11,12'
aDict[n] = '5,6,77,88'
The keys are arbitrary, and there could be any number of them. I want to consider every value in the dictionary.
I want to treat each string as comma-separated values, and find the intersection across the entire dictionary (the elements common to all dict values). So in this case the answer would be '5,6'. How can I do this?
from functools import reduce # if Python 3
reduce(lambda x, y: x.intersection(y), (set(x.split(',')) for x in aDict.values()))
First of all, you need to convert these to real lists.
l1 = '3,4,5,6,7,8'.split(',')
Then you can use sets to do the intersection.
result = set(l1) & set(l2) & set(l3)
Python Sets are ideal for that task. Consider the following (pseudo code):
intersections = None
for value in aDict.values():
    temp = set([int(num) for num in value.split(",")])
    if intersections is None:
        intersections = temp
    else:
        intersections = intersections.intersection(temp)
print(intersections)
result = None
for csv_list in aDict.values():
    aList = csv_list.split(',')
    if result is None:
        result = set(aList)
    else:
        result = result & set(aList)
print(result)
Since set.intersection() accepts any number of sets, you can make do without any use of reduce():
set.intersection(*(set(v.split(",")) for v in aDict.values()))
Note that this version won't work for an empty aDict.
If you are using Python 3, and your dictionary values are bytes objects rather than strings, just split at b"," instead of ",".
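Putting the pieces together, here is a sketch that also handles an empty dictionary and joins the result back into a comma-separated string such as '5,6'. The function name common_values is hypothetical, the key 9 stands in for the question's unspecified key n, and the sketch assumes every value is a comma-separated list of integers.

```python
aDict = {
    1: '3,4,5,6,7,8',
    5: '5,6,7,8,9,10,11,12',
    9: '5,6,77,88',  # 9 is a stand-in for the question's arbitrary key n
}

def common_values(d):
    sets = [set(v.split(',')) for v in d.values()]
    if not sets:
        # set.intersection() needs at least one set, so handle the empty dict here
        return ''
    common = set.intersection(*sets)
    # Sort numerically so that, for example, '10' does not come before '5'
    return ','.join(sorted(common, key=int))

result = common_values(aDict)
print(result)
```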
