counting number of each substring in array python

I have a string array, for example ['a_text', 'b_text', 'ab_text', 'a_text']. I would like to get the number of objects that contain each prefix, such as ['a_', 'b_', 'ab_'], so the number of 'a_' objects would be 2.
So far I've been counting each by filtering the array, e.g. num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter, since I am filtering the array once per prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop, so that may make it faster.
Also, perhaps instead of the filter I could use a list comprehension to make it faster?
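For instance, something like this one-pass count that avoids building the filtered list is what I have in mind (a minimal sketch of the alternative I'm considering):
array = ['a_text', 'b_text', 'ab_text', 'a_text']

# Counts matches in one pass without materializing a filtered list.
num_a = sum(1 for x in array if x.startswith('a_'))
print(num_a)  # 2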

You can use collections.Counter with a regular expression (if all of your strings have prefixes):
import re
from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']
Counter([re.match(r'^.*?_', i).group() for i in arr])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:
arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test']
matches = [re.match(r'^.*?_', i) for i in arr]
Counter([i.group() for i in matches if i])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})

Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it by splitting at the underscore. You need to check the underscore exists, else the whole word will be taken as a prefix (otherwise it wouldn't distinguish between 'a' and 'a_a').
from collections import defaultdict
array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000
def count_prefixes(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.
from timeit import timeit
setup = """
from collections import Counter, defaultdict
import re

array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000

def with_defaultdict(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts

def with_counter(arr):
    matches = [re.match(r'^.*?_', i) for i in arr]
    return Counter([i.group() for i in matches if i])
"""

for method in ('with_defaultdict', 'with_counter'):
    print(timeit('{}(array)'.format(method), setup=setup, number=1))
Timing results:
0.4836089063341265
1.3238173544676142

If I'm understanding what you're asking for, it seems like you really want to use regular expressions (regex). They're built for just this sort of pattern-matching use. I don't know Python, but I do see that regular expressions are supported, so it's a matter of using them. An online regex tester makes it easy to craft and test your regex.
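For illustration, a rough sketch of what that might look like in Python (borrowing the data and prefixes from the question; treat it as illustrative, not idiomatic):
import re

arr = ['a_text', 'b_text', 'ab_text', 'a_text']

# Count how many strings begin with each candidate prefix.
for prefix in ('a_', 'b_', 'ab_'):
    count = sum(1 for s in arr if re.match(re.escape(prefix), s))
    print(prefix, count)  # a_ 2, b_ 1, ab_ 1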

You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = {'a_', 'b_', 'ab_'}
counter = Counter()
for word in arr:
    before, delim, _ = word.partition('_')
    prefix = before + delim
    if prefix in prefixes:
        counter[prefix] += 1

print(counter)
Which outputs:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})

Related

Beautify/shorten concatenated for loops

I am checking a list of strings if they contain specific substrings. Depending on the conditions, the string from the list is added to another list, or is left out respectively.
This is what I have so far. It is working, but there are quite some loops concatenated. Is there a better (or more pythonesque) way of writing this?
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
selection = []
sub_incl = ['foo']
sub_excl = ['bar']
for i in range(len(varNams)):
    for sub_in in sub_incl:
        for sub_ex in sub_excl:
            if sub_in in varNams[i] and sub_ex not in varNams[i]:
                selection.append(varNams[i])
You can use itertools.product in a list comprehension.
from itertools import product
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
sub_incl = ['foo']
sub_excl = ['bar']
res = [i for i in varNams for sub_in, sub_ex in product(sub_incl, sub_excl) if sub_in in i and sub_ex not in i]
print(res)
Output
['fee1.foo', 'fee2.foo']
My solution is using regex
import re
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
selection = []
regex1='foo'
regex2='bar'
for i in varNams:
    if re.search(regex1, i) and not re.search(regex2, i):
        selection.append(i)

How to perform in-place removal of duplicates from a string in Python?

I am trying to implement an inplace algorithm to remove duplicates from a string in Python.
str1 = "geeksforgeeks"
for i in range(len(str1)):
    for j in range(i+1, len(str1)-1):
        if str1[i] == str1[j]:  # Error line
            str1 = str1[0:j] + "" + str1[j+1:]
print str1
In the above code, I am trying to replace the duplicate character with whitespace. But I get IndexError: string index out of range at if str1[i] == str1[j]. Am I missing out on something or is it not the right way?
My expected output is: geksfor
You can do all of this with just a set and a comprehension. No need to complicate things.
str1 = "geeksforgeeks"
seen = set()
seen_add = seen.add
print(''.join(s for s in str1 if not (s in seen or seen_add(s))))
#geksfor
"Simple is better than complex."
~ See PEP20
Edit
While the above is simpler than your answer, and it is the most performant way of removing duplicates from a collection, an even simpler solution would be to use:
from collections import OrderedDict
print("".join(OrderedDict.fromkeys(str1)))
It is impossible to modify strings in-place in Python, the same way that it's impossible to modify numbers in-place in Python.
a = "something"
b = 3
b += 1 # allocates a new integer, 4, and assigns it to b
a += " else" # allocates a new string, " else", concatenates it to `a` to produce "something else"
# then assigns it to a
As already pointed out, str is immutable, so the in-place requirement makes no sense.
If you want to get desired output I would do it following way:
str1 = 'geeksforgeeks'
out = ''.join([i for inx,i in enumerate(str1) if str1.index(i)==inx])
print(out) #prints: geksfor
Here I used the enumerate function to get each letter together with its index (inx), plus the fact that the .index method of str returns the lowest possible index of the element, so str1.index('e') for the given string is 1, not 2, 9, or 10.
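To make that concrete, a quick interpreter session (just illustrating the claim above):
>>> str1 = 'geeksforgeeks'
>>> str1.index('e')  # always the lowest index, even though 'e' also occurs at 2, 9, and 10
1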
Here is a simplified version of unique_everseen from itertools recipes.
from itertools import filterfalse

def unique_everseen(iterable):
    seen = set()
    seen_add = seen.add
    for element in filterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element
You can then use this generator with str.join to get the expected output.
str1 = "geeksforgeeks"
new_str1 = ''.join(unique_everseen(str1)) # 'geksfor'

Reverse strings in a nested list without slicing or reversed

In python if my list is
TheTextImage = [["111000"],["222999"]]
How would one loop through this list creating a new one of
NewTextImage = [["000111"],["999222"]]
Can use [:] but not [::-1], and cannot use reverse()
You know how to copy a sequence to another sequence one by one, right?
new_string = ''
for ch in old_string:
    new_string = new_string + ch
If you want to copy the sequence in reverse, just add the new values onto the left instead of onto the right:
new_string = ''
for ch in old_string:
    new_string = ch + new_string
That's really the only trick you need.
Now, this isn't super-efficient, because string concatenation takes quadratic time. You could solve this by using a collections.deque (which you can append to the left of in constant time) and then calling ''.join at the end. But I doubt your teacher is expecting that from you. Just do it the simple way.
Of course you have to loop over TheTextImage, applying this to every string in every sublist in the list (see the sketch below). That's probably what they're expecting you to use [:] for. But that's easy; it's just looping over lists.
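Putting it together, a minimal sketch (the helper name reverse_string is mine, purely illustrative):
TheTextImage = [["111000"], ["222999"]]

def reverse_string(old_string):
    new_string = ''
    for ch in old_string:
        new_string = ch + new_string  # prepend to build the reverse
    return new_string

# Apply the helper to every string in every sublist.
NewTextImage = [[reverse_string(s) for s in sub] for sub in TheTextImage]
print(NewTextImage)  # [['000111'], ['999222']]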
You may not use [::-1] but you can multiply each range index by -1.
t = [["111000"],["222999"]]
def rev(x):
    return "".join(x[(i+1)*-1] for i in range(len(x)))
>>> [[rev(x) for x in z] for z in t]
[['000111'], ['999222']]
If you may use the step arg in range, you can do AChampion's suggestion:
def rev(x):
    return ''.join(x[i-1] for i in range(0, -len(x), -1))
If you can't use any standard functionality such as reversed or [::-1], you can use collections.deque and deque.appendleft in a loop. Then use a list comprehension to apply the logic to multiple items.
from collections import deque
L = [["111000"], ["222999"]]
def reverser(x):
    out = deque()
    for i in x:
        out.appendleft(i)
    return ''.join(out)
res = [[reverser(x[0])] for x in L]
print(res)
[['000111'], ['999222']]
Note you could use a list, but appending to the beginning of a list is inefficient.
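A rough sketch with timeit illustrates the gap (numbers are machine-dependent; the point is the order of magnitude):
from timeit import timeit

# Prepending 50,000 items one at a time.
list_time = timeit('l.insert(0, 1)', setup='l = []', number=50000)
deque_time = timeit('d.appendleft(1)',
                    setup='from collections import deque; d = deque()',
                    number=50000)
print(list_time, deque_time)  # the list version is typically much slower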
You can use reduce(lambda x,y: y+x, string) to reverse a string
>>> from functools import reduce
>>> TheTextImage = [["111000"],["222999"]]
>>> [[reduce(lambda x,y: y+x, b) for b in a] for a in TheTextImage]
[['000111'], ['999222']]

How can you terminate a string after k consecutive numbers have been found?

Say I have some list with files of the form *.1243.*, and I wish to obtain everything before these 4 digits. How do I do this efficiently?
An ugly, inefficient example of working code is:
names = []
for file in file_list:
    words = file.split('.')
    for i, word in enumerate(words):
        if word.isdigit():
            if int(word) > 999 and int(word) < 10000:
                names.append(' '.join(words[:i]))
                break
print(names)
Obviously though, this is far from ideal and I was wondering about better ways to do this.
You may want to use regular expressions for this.
import re
name = []
for file in file_list:
    m = re.match(r'^(.+?)\.\d{4}\.', file)
    if m:
        name.append(m.groups()[0])
Using a regular expression, this would become simpler
import re
names = ['hello.1235.sas', 'test.5678.hai']
for fn in names:
    myreg = r'(.*)\.(?:\d{4})\..*'
    output = re.findall(myreg, fn)
    print(output)
output:
['hello']
['test']
If you know that all entries have the same format, here is a list comprehension approach:
[item[0] for item in filter(lambda parts: len(parts[1]) == 4, (item.split('.') for item in file_list))]
To be fair, I also like the solution provided by @James. Note that the downside of this list comprehension is three loops:
1. Splitting all the items.
2. Filtering the items that match.
3. Building the result.
With a regular for loop it could be more efficient:
output = []
for item in file_list:
    beginning, digits, end = item.split('.')
    if len(digits) == 4:
        output.append(beginning)
It does only one loop, which is way better.
You can use Positive Lookahead (?=(\.\d{4}))
import re
pattern=r'(.*)(?=(\.\d{4}))'
text=['*hello.1243.*','*.1243.*','hello.1235.sas','test.5678.hai','a.9999']
print(list(map(lambda x:re.search(pattern,x).group(0),text)))
output:
['*hello', '*', 'hello', 'test', 'a']

What is the fastest way to generate a list of the lengths of sub-strings in a string, given a separator?

I have a string and I need to generate a list of the lengths of all the sub-strings terminating in a given separator.
For example: string = 'a0ddb0gf0', separator = '0', so I need to generate: lengths = [2, 4, 3], since len('a0') == 2, len('ddb0') == 4, and len('gf0') == 3.
I am aware that it can be done by the following (for example):
separators = [-1] + [index for index in range(len(string)) if string[index] == separator]
lengths = [separators[index+1] - separators[index] for index in range(len(separators)-1)]
But I need it to be done extremely fast (on large amounts of data). Generating an intermediate list for large amounts of data is time consuming.
Is there a solution that does this neatly and fast (py2.7)?
Fastest? Don't know. You might like to profile it.
>>> print [len(s) for s in 'a0ddb0gf0'.split('0')]
[1, 3, 2, 0]
And, if you really don't want to include zero length strings:
>>> print [len(s) for s in 'a0ddb0gf0'.split('0') if s]
[1, 3, 2]
Personally, I love itertools.groupby()
>>> from itertools import groupby
>>> sep = '0'
>>> data = 'a0ddb0gf0'
>>> [sum(1 for i in g) for (k, g) in groupby(data, sep.__ne__) if k]
[1, 3, 2]
This groups the data according to whether each element is equal to the separator, then gets the length of each group for which the element was not equal (by summing 1's for each item in the group).
itertools functions are generally quite fast, though I don't know for sure how much better than split() this is. The one point that I think is strongly in its favor is that this can seamlessly handle multiple consecutive occurrences of the separator character. It will also handle any iterable for data, not just strings.
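To illustrate that last point, the same expression works on, say, a list (a small made-up example, using a lambda as the key so it handles a non-string separator):
>>> data = ['a', 0, 'd', 'd', 'b', 0, 'g', 'f', 0]
>>> [sum(1 for i in g) for (k, g) in groupby(data, lambda x: x != 0) if k]
[1, 3, 2]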
I don't know how fast this will go, but here's another way:
def len_pieces(s, sep):
    i = 0
    while True:
        f = s.find(sep, i)
        if f == -1:
            if i < len(s):  # avoid yielding 0 when the string ends with the separator
                yield len(s) - i
            return
        yield f - i + 1
        i = f + 1
>>> import re
>>> [len(i) for i in re.findall('.+?0', 'a0ddb0gf0')]
[2, 4, 3]
You may use re.finditer to avoid an intermediary list, but it may not be much different in performance:
[len(i.group(0)) for i in re.finditer('.+?0', 'a0ddb0gf0')]
Maybe using a regex:
[len(m.group()) for m in re.finditer('(.*?)0', s)]
