I am checking whether the strings in a list contain specific substrings. Depending on the conditions, a string from the list is added to another list or left out.
This is what I have so far. It works, but there are quite a few nested loops. Is there a better (or more Pythonic) way of writing this?
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
selection = []
sub_incl = ['foo']
sub_excl = ['bar']
for i in range(len(varNams)):
    for sub_in in sub_incl:
        for sub_ex in sub_excl:
            if sub_in in varNams[i] and sub_ex not in varNams[i]:
                selection.append(varNams[i])
You can use itertools.product in a list comprehension.
from itertools import product
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
sub_incl = ['foo']
sub_excl = ['bar']
res = [i for i in varNams for sub_in, sub_ex in product(sub_incl, sub_excl) if sub_in in i and sub_ex not in i]
print(res)
Output
['fee1.foo', 'fee2.foo']
My solution is using regex
import re
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
selection = []
regex1='foo'
regex2='bar'
for i in varNams:
    if re.search(regex1, i) and not re.search(regex2, i):
        selection.append(i)
Related
I have an array I want to iterate through. The array consists of strings made up of numbers and symbols,
like this: €110.5M
I want to loop over it, remove the euro sign and the M from every entry, and return the array with the strings converted to ints.
How would I do this, knowing that the array is a column in a table?
You could just strip the characters,
>>> x = '€110.5M'
>>> x.strip('€M')
'110.5'
def sanitize_string(ss):
    ss = ss.replace('$', '').replace('€', '').lower()
    if 'm' in ss:
        res = float(ss.replace('m', '')) * 1000000
    elif 'k' in ss:
        res = float(ss.replace('k', '')) * 1000
    else:
        res = float(ss)  # no suffix: just the bare number
    return int(res)
This can be applied to a list as follows:
>>> ls = [sanitize_string(x) for x in ["€3.5M", "€15.7M" , "€167M"]]
>>> ls
[3500000, 15700000, 167000000]
If you want to apply it to the column of a table instead:
dataFrame = dataFrame.price.apply(sanitize_string) # Assuming you're using DataFrames and the column is called 'price'
You can use a list comprehension:
a = ['€110.5M', '€210.5M', '€310.5M']
numbers = [float(p.replace('€','').replace('M','')) for p in a]
which gives:
[110.5, 210.5, 310.5]
You can use a list comprehension to construct one list from another:
foo = ["€13.5M", "€15M" , "€167M"]
foo_cleaned = [value.translate(str.maketrans('', '', "€M")) for value in foo]
str.maketrans builds a translation table that maps every character in its third argument to None, and str.translate then removes all occurrences of those characters. (In Python 2 the equivalent call is value.translate(None, "€M").)
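For example, a quick interactive sketch of what that translation step does to a single value:
>>> table = str.maketrans('', '', "€M")
>>> "€13.5M".translate(table)
'13.5'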
Try this
arr = ["€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M"]
f = [x.replace("€","").replace("M","") for x in arr]
You can call .replace() on a string as often as you like. An initial solution could be something like this:
my_array = ['€110.5M', '€111.5M', '€112.5M']
my_cleaned_array = []
for elem in my_array:
    my_cleaned_array.append(elem.replace('€', '').replace('M', ''))
At this point, you still have strings in your array. If you want to return them as ints, you can write int(elem.replace('€', '').replace('M', '')) instead. But be aware that you will then lose everything after the floating point, i.e. you will end up with [110, 111, 112].
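If keeping or controlling the fractional part matters, one option (a small sketch of my own, not from the answer above) is to convert to float first and only then decide how to produce an int:
my_array = ['€110.5M', '€111.5M', '€112.5M']
as_floats = [float(elem.replace('€', '').replace('M', '')) for elem in my_array]
print(as_floats)                      # [110.5, 111.5, 112.5]
# int() truncates toward zero; round() rounds, with ties going to the nearest even number in Python 3
print([int(v) for v in as_floats])    # [110, 111, 112]
print([round(v) for v in as_floats])  # [110, 112, 112]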
You can use Regex to do that.
import re
s = "€110.5M"
x = re.findall(r'-?\d+\.\d+', s)  # use a raw string for the pattern and avoid shadowing the built-in str
print(x)
I didn't quite understand the second part of the question.
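To get the whole array back as numbers, here is a hedged sketch building on the same regex; it assumes every entry contains a decimal number like €110.5M:
import re

arr = ['€110.5M', '€111.5M', '€112.5M']
numbers = [float(re.findall(r'-?\d+\.\d+', s)[0]) for s in arr]
print(numbers)  # [110.5, 111.5, 112.5]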
In Python, if my list is
TheTextImage = [["111000"],["222999"]]
How would one loop through this list, creating a new one like
NewTextImage = [["000111"],["999222"]]
I can use [:] but not [::-1], and I cannot use reverse().
You know how to copy a sequence to another sequence one by one, right?
new_string = ''
for ch in old_string:
    new_string = new_string + ch
If you want to copy the sequence in reverse, just add the new values onto the left instead of onto the right:
new_string = ''
for ch in old_string:
    new_string = ch + new_string
That's really the only trick you need.
Now, this isn't super-efficient, because building a string by repeated concatenation takes quadratic time overall. You could solve this by using a collections.deque (which you can append to the left of in constant time) and then calling ''.join at the end. But I doubt your teacher is expecting that from you. Just do it the simple way.
Of course you have to loop over TextImage applying this to every string in every sublist in the list. That's probably what they're expecting you to use [:] for. But that's easy; it's just looping over lists.
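Putting those pieces together, a rough sketch of what that nested loop could look like (the helper name reverse_prepend is just illustrative):
TheTextImage = [["111000"], ["222999"]]

def reverse_prepend(old_string):
    # build the reversed string by adding each character onto the left
    new_string = ''
    for ch in old_string:
        new_string = ch + new_string
    return new_string

NewTextImage = [[reverse_prepend(s) for s in sublist] for sublist in TheTextImage]
print(NewTextImage)  # [['000111'], ['999222']]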
You may not use [::-1] but you can multiply each range index by -1.
t = [["111000"],["222999"]]
def rev(x):
    return "".join(x[(i+1)*-1] for i in range(len(x)))
>>> [[rev(x) for x in z] for z in t]
[['000111'], ['999222']]
If you may use the step argument in range, you can follow AChampion's suggestion:
def rev(x):
    return ''.join(x[i-1] for i in range(0, -len(x), -1))
If you can't use any standard functionality such as reversed or [::-1], you can use collections.deque and deque.appendleft in a loop. Then use a list comprehension to apply the logic to multiple items.
from collections import deque
L = [["111000"], ["222999"]]
def reverser(x):
    out = deque()
    for i in x:
        out.appendleft(i)
    return ''.join(out)
res = [[reverser(x[0])] for x in L]
print(res)
[['000111'], ['999222']]
Note you could use a list, but appending to the beginning of a list is inefficient.
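For comparison, the list-based variant that note refers to would look roughly like this; every insert(0, ...) shifts all existing elements, so reversing an n-character string costs O(n^2) overall:
def reverser_list(x):
    out = []
    for i in x:
        out.insert(0, i)  # O(n) per call, unlike deque.appendleft which is O(1)
    return ''.join(out)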
You can use reduce(lambda x,y: y+x, string) to reverse a string
>>> from functools import reduce
>>> TheTextImage = [["111000"],["222999"]]
>>> [[reduce(lambda x,y: y+x, b) for b in a] for a in TheTextImage]
[['000111'], ['999222']]
I have a string array, for example [a_text, b_text, ab_text, a_text]. I would like to get the number of objects that contain each prefix, such as ['a_', 'b_', 'ab_'], so the number of 'a_' objects would be 2.
So far I've been counting each prefix by filtering the array, e.g. num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter, since I am filtering the array once for each prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop, so that may make it faster.
Also, perhaps instead of filter() I could use a list comprehension to make it faster?
You can use collections.Counter with a regular expression (if all of your strings have prefixes):
import re
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
Counter([re.match(r'^.*?_', i).group() for i in arr])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:
arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test']
matches = [re.match(r'^.*?_', i) for i in arr]
Counter([i.group() for i in matches if i])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it by splitting at the underscore. You need to check the underscore exists, else the whole word will be taken as a prefix (otherwise it wouldn't distinguish between 'a' and 'a_a').
from collections import defaultdict
array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000
def count_prefixes(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.
from timeit import timeit
setup = """
from collections import Counter, defaultdict
import re
array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000
def with_defaultdict(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
def with_counter(arr):
    matches = [re.match(r'^.*?_', i) for i in arr]
    return Counter([i.group() for i in matches if i])
"""
for method in ('with_defaultdict', 'with_counter'):
    print(timeit('{}(array)'.format(method), setup=setup, number=1))
Timing results:
0.4836089063341265
1.3238173544676142
If I'm understanding what you're asking for, it seems like you really want to use regular expressions (regex). They're built for just this sort of pattern-matching use. I don't know Python, but I do see that regular expressions are supported, so it's just a matter of using them. An online regex tester makes it easy to craft and test your pattern.
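Since that answer doesn't include any code, here is a minimal Python sketch of the regex idea (the pattern and variable names are mine, not from the answer); it matches each string against the known prefixes:
import re
from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = ['a_', 'b_', 'ab_']
# build an alternation that matches any of the known prefixes at the start of a string
pattern = re.compile('^(' + '|'.join(map(re.escape, prefixes)) + ')')
counts = Counter(m.group(1) for m in map(pattern.match, arr) if m)
print(counts)  # Counter({'a_': 2, 'b_': 1, 'ab_': 1})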
You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = {'a_', 'b_', 'ab_'}
counter = Counter()
for word in arr:
    before, delim, _ = word.partition('_')
    prefix = before + delim
    if prefix in prefixes:
        counter[prefix] += 1
print(counter)
Which Outputs:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
Say I have a list of files of the form *.1243.*, and I wish to obtain everything before these 4 digits. How do I do this efficiently?
An ugly, inefficient example of working code is:
names = []
for file in file_list:
    words = file.split('.')
    for i, word in enumerate(words):
        if word.isdigit():
            if int(word) > 999 and int(word) < 10000:
                names.append(' '.join(words[:i]))
                break
print(names)
Obviously though, this is far from ideal and I was wondering about better ways to do this.
You may want to use regular expressions for this.
import re
name = []
for file in file_list:
    m = re.match(r'^(.+?)\.\d{4}\.', file)
    if m:
        name.append(m.groups()[0])
Using a regular expression, this would become simpler
import re
names = ['hello.1235.sas','test.5678.hai']
for fn in names:
    myreg = r'(.*)\.(?:\d{4})\..*'
    output = re.findall(myreg, fn)
    print(output)
output:
['hello']
['test']
If you know that all entries have the same format, here is a list comprehension approach:
[item[0] for item in filter(lambda parts: len(parts[1]) == 4, (item.split('.') for item in file_list))]
To be fair, I also like the solution provided by James. Note that the downside of this list comprehension is that it makes three passes:
1. splitting all the items,
2. filtering the items that match,
3. building the result.
With a regular for loop it could be more efficient:
output = []
for item in file_list:
    beginning, digits, end = item.split('.')
    if len(digits) == 4:
        output.append(beginning)
It does only one pass, which is better.
You can use Positive Lookahead (?=(\.\d{4}))
import re
pattern=r'(.*)(?=(\.\d{4}))'
text=['*hello.1243.*','*.1243.*','hello.1235.sas','test.5678.hai','a.9999']
print(list(map(lambda x:re.search(pattern,x).group(0),text)))
output:
['*hello', '*', 'hello', 'test', 'a']
I'm trying to compare multiple lists. However, the lists aren't labeled normally. I'm using a while loop to make a new list each time and label them accordingly. So for example, if the while loop runs 3 times it will make a list1, a list2, and a list3. Here is a snippet of the code that creates the lists.
for link in links:
    print('*', link.text)
    locals()['list{}'.format(str(i))].append(link.text)
So I want to compare the lists for the strings that are in them, but I want to compare all the lists at once and then print out the common strings.
I feel like I'll be using something like this, but I'm not 100% sure.
lists = [list1, list2, list3, list4, list5, list6, list7, list8, list9, list10]
common = list(set().union(*lists).intersection(Keyword))
Rather than directly modifying locals() (generally not a good idea), use a defaultdict as a container. This data structure allows you to create new key-value pairs on the fly rather than relying on a method which is sure to lead to a NameError at some point.
from collections import defaultdict
i = ...
link_lists = defaultdict(list)
for link in links:
    print('*', link.text)
    link_lists[i].append(link.text)
To find the intersection of all of the lists:
all_lists = list(link_lists.values())
common_links = set(all_lists[0]).intersection(*all_lists[1:])
In Python 2.6+, you can pass multiple iterables to set.intersection. This is what the star-args do here.
Here's an example of how the intersection will work:
>>> from collections import defaultdict
>>> c = defaultdict(list)
>>> c[9].append("a")
>>> c[0].append("b")
>>> all = list(c.values())
>>> set(all[0]).intersection(*all[1:])
set()
>>> c[0].append("a")
>>> all = list(c.values())
>>> set(all[0]).intersection(*all[1:])
{'a'}
You have several options.
Option a)
Use itertools to get a Cartesian product; this is quite nice because it is an iterator:
import itertools

a = ["A", "B", "C"]
b = ["A", "C"]
c = ["C", "D", "E"]
for aval, bval, cval in itertools.product(a, b, c):
    if aval == bval and bval == cval:
        print(aval)
Option b)
Use sets (recommended):
all_lists = []
# insert your while loop X times
for lst in lists:  # this is my guess at how your loop runs
    currentList = [link.text for link in lst]
    all_lists.append(currentList)  # O(1) operation
result_set = set()
if len(all_lists) > 1:
    result_set = set(all_lists[0]).intersection(*all_lists[1:])
else:
    result_set = set(all_lists[0])
Using sets, however, will be faster.
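As a follow-up to the set-based option, here is a compact variant (my own sketch, not part of the answer) that intersects any number of lists with functools.reduce:
from functools import reduce

list1 = ["A", "B", "C"]
list2 = ["A", "C"]
list3 = ["C", "D", "E"]
all_lists = [list1, list2, list3]

# fold set intersection across all the lists, which handles "compare all lists at once"
common = reduce(lambda acc, cur: acc & set(cur), all_lists[1:], set(all_lists[0]))
print(common)  # {'C'}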