My current Python project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:
Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5
Explanation: This particular example would come from a list where the first two items are a title and a date, while item 3 to item 5 would be invited people (the number of those can be anything from zero to n, where n is the number of registered users on the server).
From what I see, I have the following options:
Repeatedly use split()
Use a regular expression and regular expression functions
Some other Python functions I have not thought about yet (there are probably some)
Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:
((.+)|)+((.+)(<>)?)+
Okay, this regular expression is horrible, I can see that myself. It is also untested. But you get the idea.
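For concreteness, solution 1 can be sketched in a few lines (this assumes the <>-separated items always come last, as in the example; the function name is mine):

```python
def parse(line):
    # Split on '|' first, then split the last field on '<>'.
    # Assumes the '<>'-separated items always come last.
    head = line.split('|')
    tail = head.pop().split('<>')
    return head, tail
```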
Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for strings that have more items separated with | and strings that completely lack the <>. At least the regular expression-based solution would do that.
My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and the second one that splits at <>), but I don't know enough about Python's implementation of regular expressions to judge how the regular expression would perform. split() is also less dynamic than a regular expression if it comes to different numbers of items and the absence of the second separator. Still, I can't shake the impression that Python can do this better without regular expressions, and that's why I am asking.
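As an aside on option 3, str.partition can walk the string field by field without building two full lists at once; this is only a sketch of the idea, not something I have benchmarked:

```python
def iter_items(line):
    # Repeatedly cut off the first '|'-separated field; the remainder
    # after the last '|' may still hold '<>'-separated items.
    # Assumes (as in the example) that '<>' items only come last.
    rest = line
    while True:
        head, sep, rest = rest.partition('|')
        if not sep:
            # No more '|': split whatever is left on '<>'.
            for item in head.split('<>'):
                yield item
            return
        yield head
```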
Some notes:
Yes, I could just benchmark both solutions, but I'm trying to learn something about Python in general and how it works here, and if I just benchmark these two, I still don't know what Python functions I have missed.
Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about Python.
Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were separated by | from the parts with the separator <>, so a simple flat list as generated by re.split(r'\||<>', input) (as proposed by obmarg) will not work too well. Solutions fitting this criterion are much appreciated.
To sum the question up: Which solution would be the most efficient one, for what reasons?
Due to multiple requests, I have run some timeit benchmarks on the split() solution and the first regular expression proposed by obmarg, as well as the solutions by mgibsonbr and duncan:
import timeit
import re
def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split(r"\||<>", input)
def mgibsonbr(input): # Solution by mgibsonbr
    items = re.split(r'\||<>', input) # Split input in items
    offset = 0
    result = [] # The result: strings for regular items, lists for <> separated ones
    acc = None
    for i in items:
        delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
        offset += len(i) + len(delimiter)
        if delimiter == '<>': # Will always put the item in a list
            if acc is None:
                acc = [i] # Create one if it doesn't exist
                result.append(acc)
            else:
                acc.append(i)
        else:
            if acc is not None: # If there was a list, put the last item in it
                acc.append(i)
            else:
                result.append(i) # Add the regular items
            acc = None # Clear the list, since what comes next is a regular item or a new list
    return result
def split2(input): # Solution by duncan
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
The results:
mgibs: 14.7349407408
split: 6.403942732
split2: 3.68306812233
regex: 5.28414318792
mgibs: 107.046683735
split: 46.0844590775
split2: 26.5595985591
regex: 28.6513302646
At the moment, it looks like split2 by duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like mgibsonbr's solution has some performance issues (sorry about that, but thanks for the solution regardless).
I was slightly surprised that split() performed so badly in your code, so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also you're calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.
import timeit
import re
def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def split2(input):
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

def regexit(input):
    return re.split(r"\||<>", input)

rSplitter = re.compile(r"\||<>")

def regexit2(input):
    return rSplitter.split(input)
print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2( 'a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
Which gives the following result:
split: 1.8427431439631619
split2: 1.0897291360306554
regex: 1.6694280610536225
regex2: 1.2277749050408602
split: 14.356198082969058
split2: 8.009285948995966
regex: 9.526430513011292
regex2: 9.083608677960001
And of course split2() gives the nested lists that you wanted whereas the regex solution doesn't.
Compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn't worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.
I'm not sure if it's the most efficient, but certainly the easiest to code seems to be something like this:
>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> re.split(r"\||<>", input)
['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']
I would think there's a fair chance of it being more efficient than a plain old split as well (depending on the input data) since you'd need to perform the second split operation on every string output from the first split, which doesn't seem likely to be efficient for either memory or time.
Though having said that I could easily be wrong, and the only way to be sure would be to time it.
Calling split multiple times is likely to be inefficient, because it might create unneeded intermediary strings. Using a regex like you proposed won't work, since the capturing group will only get the last item, not all of them. Splitting using a regex, like obmarg suggested, seems to be the best route, assuming a "flattened" list is what you're looking for.
If you don't want a flattened list, you can split using a regex first and then iterate over the results, checking the original input to see which delimiter was used:
items = re.split(r'\||<>', input)
offset = 0
for i in items:
    delimiter = '|' if input[offset+len(i)] == '|' else '<>'
    offset += len(i) + len(delimiter)
    # Do something with i, depending on whether | or <> was the delimiter
Finally, if you don't want the substrings created at all (using only the start and end indices to save space, for instance), re.finditer might do the job. Iterate over the delimiters, and do something with the text between them depending on which delimiter (| or <>) was found. It's a more complex operation, since you'll have to handle many corner cases, but it might be worth it depending on your needs.
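A rough sketch of that finditer idea (the function and its (start, end, following_delimiter) interface are just an illustration, not a finished design):

```python
import re

DELIMS = re.compile(r'\||<>')

def item_spans(text):
    # Yield (start, end, delimiter_after) for each item without
    # creating the substrings; delimiter_after is None for the last item.
    pos = 0
    for m in DELIMS.finditer(text):
        yield pos, m.start(), m.group()
        pos = m.end()
    yield pos, len(text), None
```

The caller can then slice text[start:end] lazily, only for the items it actually needs.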
Update: for your particular case, where the input format is uniform, obmarg's solution is the best one. If you must, post-process the result to get a nested list:
split_result = re.split(r"\||<>", input)
result = [split_result[0], split_result[1], [i for i in split_result[2:] if i]]
(that last list comprehension is to ensure you'll get [] instead of [''] if there are no items after the last |)
Update 2: After reading the updated question, I finally understood what you're trying to achieve. Here's the full example, using the framework suggested earlier:
items = re.split(r'\||<>', input) # Split input in items
offset = 0
result = [] # The result: strings for regular items, lists for <> separated ones
acc = None
for i in items:
    delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
    offset += len(i) + len(delimiter)
    if delimiter == '<>': # Will always put the item in a list
        if acc is None:
            acc = [i] # Create one if it doesn't exist
            result.append(acc)
        else:
            acc.append(i)
    else:
        if acc is not None: # If there was a list, put the last item in it
            acc.append(i)
        else:
            result.append(i) # Add the regular items
        acc = None # Clear the list, since what comes next is a regular item or a new list
print result
I tested it with your example; the result was:
['a', 'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c','de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'aha'],
'b', 'c', 'de', ['f', 'ge', 'ah']]
If you know that <> is not going to appear elsewhere in the string then you could replace '<>' with '|' followed by a single split:
>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> input.replace("<>", "|").split("|")
['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']
This will almost certainly be faster than doing multiple splits. It may or may not be faster than using re.split - timeit is your friend.
On my system with the sample string you supplied, my version is more than three times faster than re.split:
>>> timeit input.replace("<>", "|").split("|")
1000000 loops, best of 3: 980 ns per loop
>>> import re
>>> timeit re.split(r"\||<>", input)
100000 loops, best of 3: 3.07 us per loop
(N.B.: This is using IPython, which has timeit as a built-in command).
You can make use of replace. First replace <> with |, and then split by |.
def replace_way(input):
    return input.replace('<>','|').split('|')
Time performance:
import timeit
import re
def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split(r"\||<>", input)

def replace_way(input):
    return input.replace('<>','|').split('|')
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "replace:",timeit.Timer("replace_way('a|b|c|de|f<>ge<>ah')","from __main__ import replace_way").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "replace:",timeit.Timer("replace_way('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import replace_way").timeit()
Results on my machine:
split: 11.8682055461
regex: 12.7430856814
replace: 2.54299265006
split: 79.2124379066
regex: 68.6917008003
replace: 10.944842347
Related
I encountered a problem while trying to solve an exercise where, given some strings and their lengths, you need to find their common substring. My code for the part that loops through the list and then through each word in it is this:
num_of_cases = int(input())
for i in range(1, num_of_cases+1):
    if __name__ == '__main__':
        len_of_str = list(map(int, input().split()))
        len_of_virus = int(input())
        strings = []
        def string(strings, len_of_str):
            len_of_list = len(len_of_str)
            for i in range(1, len_of_list+1):
                strings.append(input())
        lst_of_subs = []
        virus_index = []
        def substr(strings, len_of_virus):
            for word in strings:
                for i in range(len(len_of_str)):
                    leng = word[i:len_of_virus]
                    lst_of_subs.append(leng)
                    virus_index.append(i)
        print(string(strings, len_of_str))
        print(substr(strings, len_of_virus))
And it prints the following given the strings: ananasso, associazione, tassonomia, massone
['anan', 'nan', 'an', 'n', 'asso', 'sso', 'so', 'o', 'tass', 'ass', 'ss', 's', 'mass', 'ass', 'ss', 's']
It seems that the end index doesn't increase, although I tried fixing that by writing len_of_virus += 1 at the end of the loop.
sample input:
1
8 12 10 7
4
ananasso
associazione
tassonomia
massone
where the 1st letter is the number of cases, the second line is the name of the strings, 3rd is the length of the virus(the common substring), and then there are the given strings that i should loop through.
expected output:
Case #1: 4 0 1 1
where the four numbers are the starting indexes of the common substring. (I don't think the printing code matters for this particular problem.)
What should I do? Please help!
The problem, besides defining functions in odd places and using those functions for their side effects in ways that aren't really encouraged, is here:
for i in range(len(len_of_str)):
    leng = word[i:len_of_virus]
i constantly increases in each iteration, but len_of_virus stays the same, so you are effectively doing
word[0:4] #when len_of_virus=4
word[1:4]
word[2:4]
word[3:4]
...
That is where 'anan', 'nan', 'an', 'n' come from for the first word, "ananasso", and the same goes for the others:
>>> word="ananasso"
>>> len_of_virus = 4
>>> for i in range(len(word)):
        word[i:len_of_virus]
'anan'
'nan'
'an'
'n'
''
''
''
''
>>>
You can fix it by moving the upper end along with i, but that leaves the same problem at the other end:
>>> for i in range(len(word)):
        word[i:len_of_virus+i]
'anan'
'nana'
'anas'
'nass'
'asso'
'sso'
'so'
'o'
>>>
so a simple adjustment to the range solves the problem:
>>> for i in range(len(word)-len_of_virus+1):
        word[i:len_of_virus+i]
'anan'
'nana'
'anas'
'nass'
'asso'
>>>
Now that the substring part is done, the rest is also easy
>>> def substring(text,size):
        return [text[i:i+size] for i in range(len(text)-size+1)]
>>> def find_common(lst_text,size):
        subs = [set(substring(x,size)) for x in lst_text]
        return set.intersection(*subs)
>>> test="""ananasso
associazione
tassonomia
massone""".split()
>>> find_common(test,4)
{'asso'}
>>>
To find the part common to all the strings in our list we can use a set: first we put all the substrings of a given word into a set, and finally we intersect them all.
The rest is just printing it to your liking:
>>> virus = find_common(test,4).pop()
>>> print("case 1:",*[x.index(virus) for x in test])
case 1: 4 0 1 1
>>>
First extract all the substrings of the give size from the shortest string. Then select the first of these substrings that is present in all of the strings. Finally output the position of this common substring in each of the strings:
def commonSubs(strings,size):
    base = min(strings,key=len)                                    # shortest string
    subs = [base[i:i+size] for i in range(len(base)-size+1)]       # all substrings
    cs = next(ss for ss in subs if all(ss in s for s in strings))  # first common
    return [s.index(cs) for s in strings]                          # indexes of common substring
output:
S = ["ananasso", "associazione", "tassonomia", "massone"]
print(commonSubs(S,4))
[4, 0, 1, 1]
You could also use a recursive approach:
def commonSubs(strings,size,i=0):
    sub = strings[0][i:i+size]
    if all(sub in s for s in strings):
        return [s.index(sub) for s in strings]
    return commonSubs(strings,size,i+1)
The simplest way is to use STree from the third-party suffix-trees package:
from suffix_trees import STree
STree.STree(["come have some apple pies",
             'apple pie available',
             'i love apple pie haha']).lcs()
My question aims to use the else clause of a for loop in a list comprehension.
example:
empty_list = []
def example_func(text):
    for a in text.split():
        for b in a.split(","):
            empty_list.append(b)
        else:
            empty_list.append(" ")
I would like to make it cleaner by using a list comprehension with both for-loops.
But how can I do this while including an escape clause for one of the loops (in this case the 2nd)?
I know I can use if with and without else in a list comprehension. But what about using else without an if statement? Is there a way to make the interpreter understand it as the escape clause of a for loop?
Any help is much appreciated!
EDIT:
Thanks for the answers! In fact, I'm trying to translate Morse code.
The input is a string, containing morse codes.
Each word is separated by 3 spaces. Each letter of each word is separated by 1 space.
def decoder(code):
    str_list = []
    for i in code.split("   "):
        for e in i.split():
            str_list.append(morse_code_dic[e])
        else:
            str_list.append(" ")
    return "".join(str_list[:-1]).capitalize()

print(decoder(".. -   .-- .- ...   .-   --. --- --- -..   -.. .- -.--"))
I want to break down the whole sentence into words, then translate each word.
After the inner loop (translation of one word) is finished, it will launch its escape-clause else, adding a space, so that the structure of the whole sentence will be preserved. That way, the 3 Spaces will be translated to one space.
As noted in comments, that else does not really make all that much sense, since the purpose of an else after a for loop is actually to hold code for conditional execution if the loop terminates normally (i.e. not via break), which your loop always does, thus it is always executed.
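For anyone unfamiliar with the construct, here is a minimal, self-contained example of a for-else that can actually skip its else (unrelated to the Morse code itself):

```python
def find_first_even(nums):
    # The else clause runs only when the loop finishes without break.
    for n in nums:
        if n % 2 == 0:
            break
    else:
        return None  # no even number found (or empty input)
    return n
```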
So this is not really an answer to the question how to do that in a list comprehension, but more of an alternative. Instead of adding spaces after all words, then removing the last space and joining everything together, you could just use two nested join generator expressions, one for the sentence and one for the words:
def decoder(code):
return " ".join("".join(morse_code_dic[e] for e in i.split())
for i in code.split(" ")).capitalize()
As mentioned in the comments, the else clause in your particular example is pointless because it always runs. Let's contrive an example that would let us investigate the possibility of simulating a break and else.
Take the following string:
s = 'a,b,c b,c,d c,d,e, d,e,f'
Let's say you wanted to split the string by spaces and commas as before, but you only wanted to preserve the elements of the inner split up to the first occurrence of c:
out = []
for i in s.split():
    for e in i.split(','):
        if e == 'c':
            break
        out.append(e)
    else:
        out.append('-')
The break can be simulated using the arcane two-arg form of iter, which accepts a callable and a termination value:
>>> x = list('abcd')
>>> list(iter(iter(x).__next__, 'c'))
['a', 'b']
You can implement the else by chaining the inner iterable with ['-'].
>>> from itertools import chain
>>> x = list('abcd')
>>> list(iter(chain(x, ['-']).__next__, 'c'))
['a', 'b']
>>> y = list('def')
>>> list(iter(chain(y, ['-']).__next__, 'c'))
['d', 'e', 'f', '-']
Notice that the placement of chain is crucial here. If you were to chain the dash to the outer iterator, it would always be appended, not only when c is not encountered:
>>> list(chain(iter(iter(x).__next__, 'c'), ['-']))
['a', 'b', '-']
You can now simulate the entire nested loop with a single expression:
from itertools import chain
out = [e for i in s.split() for e in iter(chain(i.split(','), ['-']).__next__, 'c')]
I'm trying to extract numbers that are mixed in sentences. I am doing this by splitting the sentence into elements of a list, and then I will iterate through each character of each element to find the numbers. For example:
String = "is2 Thi1s T4est 3a"
LP = String.split()
result = []
for e in LP:
    for i in e:
        if i in ('123456789'):
            result += i
This can give me the result I want, which is ['2', '1', '4', '3']. Now I want to write this in list comprehension. After reading the List comprehension on a nested list?
post I understood that the right code shall be:
[i for e in LP for i in e if i in ('123456789') ]
My original code for the list comprehension approach was wrong, but I'm trying to wrap my heads around the result I get from it.
My original incorrect code, which reversed the order:
[i for i in e for e in LP if i in ('123456789') ]
The result I get from that is:
['3', '3', '3', '3']
Could anyone explain the process that leads to this result please?
Just reverse the same process you found in the other post. Nest the loops in the same order:
for i in e:
    for e in LP:
        if i in ('123456789'):
            print(i)
The code requires both e and LP to be set beforehand, so the outcome you see depends entirely on other code run before your list comprehension.
If we presume that e was set to '3a' (the last element in LP from your code that ran the full loops), then for i in e will run twice, first with i set to '3'. We then get a nested loop, for e in LP, and given your output, LP is 4 elements long. So that iterates 4 times, and on each iteration i == '3', so the if test passes and '3' is added to the output. The next iteration of for i in e: sets i = 'a', the inner loop runs 4 times again, but now the if test fails.
However, we can't know for certain, because we don't know what code was run last in your environment that set e and LP to begin with.
I'm not sure why your original code uses str.split(), then iterates over all the characters of each word. Whitespace would never pass your if filter anyway, so you could just loop directly over the full String value. The if test can be replaced with a str.isdigit() test:
digits = [char for char in String if char.isdigit()]
or even a regular expression:
digits = re.findall(r'\d', String)
and finally, if this is a reordering puzzle, you'd want to split out your strings into a number (for ordering) and the remainder (for joining); sort the words on the extracted number, and extract the remainder after sorting:
# to sort on numbers, extract the digits and turn to an integer
sortkey = lambda w: int(re.search(r'\d+', w).group())
# 'is2' -> 2, 'Th1s1' -> 1, etc.
# sort the words by sort key
reordered = sorted(String.split(), key=sortkey)
# -> ['Thi1s', 'is2', '3a', 'T4est']
# replace digits in the words and join again
rejoined = ' '.join(re.sub(r'\d+', '', w) for w in reordered)
# -> 'This is a Test'
From the question you asked in a comment ("how would you proceed to reorder the words using the list that we got as index?"):
We can use custom sorting to accomplish this. (Note that regex is not required, but makes it slightly simpler. Use any method to extract the number out of the string.)
import re
test_string = 'is2 Thi1s T4est 3a'
words = test_string.split()
words.sort(key=lambda s: int(re.search(r'\d+', s).group()))
print(words) # ['Thi1s', 'is2', '3a', 'T4est']
To remove the numbers:
words = [re.sub(r'\d', '', w) for w in words]
Final output is:
['This', 'is', 'a', 'Test']
I have a string array for example [a_text, b_text, ab_text, a_text]. I would like to get the number of objects that contain each prefix such as ['a_', 'b_', 'ab_'] so the number of 'a_' objects would be 2.
So far I've been counting each by filtering the array, e.g. num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter, since I am filtering the array for each prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop, so that may make it faster.
Also perhaps instead of the filter I could use list comprehension to make it faster?
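For what it's worth, a single pass that never materializes a filtered list can be written with sum() over a generator expression (a sketch of the loop-style alternative described above; the function name is arbitrary):

```python
def count_with_prefix(words, prefix):
    # sum() over a generator expression avoids building a filtered list.
    return sum(1 for w in words if w.startswith(prefix))
```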
You can use collections.Counter with a regular expression (if all of your strings have prefixes):
import re
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
Counter([re.match(r'^.*?_', i).group() for i in arr])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:
arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test']
matches = [re.match(r'^.*?_', i) for i in arr]
Counter([i.group() for i in matches if i])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it by splitting at the underscore. You need to check the underscore exists, else the whole word will be taken as a prefix (otherwise it wouldn't distinguish between 'a' and 'a_a').
from collections import defaultdict

array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000

def count_prefixes(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.
from timeit import timeit

setup = """
from collections import Counter, defaultdict
import re
array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000
def with_defaultdict(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
def with_counter(arr):
    matches = [re.match(r'^.*?_', i) for i in arr]
    return Counter([i.group() for i in matches if i])
"""

for method in ('with_defaultdict', 'with_counter'):
    print(timeit('{}(array)'.format(method), setup=setup, number=1))
Timing results:
0.4836089063341265
1.3238173544676142
If I'm understanding what you're asking for, it seems like you really want to use Regular Expressions (Regex). They're built for just this sort of pattern-matching use. I don't know Python, but I do see that regular expressions are supported, so it's a matter of using them. I use this tool because it makes it easy to craft and test your regex.
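A minimal sketch of that regex idea in Python (the pattern, which grabs everything up to and including the first underscore, is an assumption about the prefix format):

```python
import re
from collections import Counter

def count_prefixes(words):
    # Match everything up to and including the first underscore;
    # words without an underscore are skipped.
    matches = (re.match(r'[^_]*_', w) for w in words)
    return Counter(m.group() for m in matches if m)
```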
You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():
from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = {'a_', 'b_', 'ab_'}
counter = Counter()

for word in arr:
    before, delim, _ = word.partition('_')
    prefix = before + delim
    if prefix in prefixes:
        counter[prefix] += 1

print(counter)
Which Outputs:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
I should note that we are only allowed to use built-in Python string functions and loop constructs.
A = 'bet[bge]geee[tb]bb'
B = 'betggeeetbb'
The square brackets mean any single one of the characters inside the bracket can be used so you could have
betbgeeetbb
betggeeetbb
betegeeetbb
betbgeeebbb
betggeeebbb
betegeeebbb
How do I check whether A has a combination that can be found within B?
A can have any number of brackets, with a minimum of 2 characters and a maximum of 4 characters in each square bracket
Thank you
Read up on the regular expressions library. The solution is literally the re.match function, whose documentation includes the following bit:
[] Used to indicate a set of characters. In a set:
Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
Since regular expressions use backslashes for their own purpose (beyond Python's normal escapes, e.g. "\n" to represent a newline), raw strings are idiomatic in the matching string.
>>> import re
>>> A = r'bet[bge]geee[tb]bb'
>>> B = 'betggeeetbb'
>>> m = re.match(A, B)
>>> m
<_sre.SRE_Match object; span=(0, 11), match='betggeeetbb'>
>>> m.group(0)
'betggeeetbb'
You can also verify that it doesn't match if (say) the second bracket is not matched:
>>> C = "betggeeezbb"
>>> m = re.match(A, C)
>>> m is None
True
Before you go about adding this liberally to an existing project, make sure you understand:
What is the difference between re.search and re.match?
What is the cost of creating a regular expression? How can you avoid this cost if the regular expression is used repeatedly?
How can you extract parts of a matching expression (e.g. the character matched by [bge] in your example)?
How can you perform matches on strings that contain newlines?
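For instance, the first point boils down to anchoring, which a two-line experiment makes clear:

```python
import re

# re.match anchors at position 0; re.search scans the whole string.
print(re.match(r'\d+', 'abc123'))           # no digits at the start, so None
print(re.search(r'\d+', 'abc123').group())  # finds '123' anywhere in the string
```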
Finally, when learning regular expressions (similarly to learning class inheritance), it's tempting to use them everywhere. Meditate on this koan from Jamie Zawinski:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
It's easiest to break your problem up into simpler tasks. There are many ways to convert your pattern from just a plain string into something with more structure, but here's something that uses only plain string operations to get you started:
def parse_pattern(pattern):
    '''
    >>> parse_pattern('bet[bge]geee[tb]bb')
    ['b', 'e', 't', ['b', 'g', 'e'], 'g', 'e', 'e', 'e', ['t', 'b'], 'b', 'b']
    '''
    in_group = False
    group = []
    result = []
    # Iterate through the pattern, character by character
    for c in pattern:
        if in_group:
            # If we're currently parsing a character
            # group, we either add a char into current group
            # or we end the group and go back to looking at
            # normal characters
            if c == ']':
                in_group = False
                result.append(group)
                group = []
            else:
                group.append(c)
        elif c == '[':
            # A [ starts a character group
            in_group = True
        else:
            # Otherwise, we just handle normal characters
            result.append(c)
    return result
def check_if_matches(string, pattern):
    parsed_pattern = parse_pattern(pattern)
    # Useful thing to note: `string` and `parsed_pattern`
    # have the same number of elements once we parse the
    # `pattern`
    ...

if __name__ == '__main__':
    print(check_if_matches('betggeeetbb', 'bet[bge]geee[tb]bb'))