I am currently filtering out all non-alphanumeric characters from this list.
import re

cleanlist = []
for s in dirtylist:
    s = re.sub("[^A-Za-z0-9]", "", str(s))
    cleanlist.append(s)
What would be the most efficient way to also filter out whitespaces from this list?
This will strip whitespace from the strings and won't add empty strings to your cleanlist:
import re

cleanlist = []
for s in dirtylist:
    s = re.sub("[^A-Za-z0-9]", "", str(s).strip())
    if s:
        cleanlist.append(s)
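For example, with made-up sample data (including a non-string item and an all-punctuation entry):

```python
import re

# Hypothetical sample data; str() guards against non-string items.
dirtylist = ["a-1 ", 42, " !? "]

cleanlist = []
for s in dirtylist:
    s = re.sub("[^A-Za-z0-9]", "", str(s).strip())
    if s:  # skip entries that come out empty after cleaning
        cleanlist.append(s)
print(cleanlist)  # → ['a1', '42']
```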
I'd actually go with a list comprehension for this, but your code is already reasonably efficient.
import re

pattern = re.compile("[^A-Za-z0-9]")
cleanlist = [pattern.sub('', str(s)) for s in dirtylist if str(s)]
Note the str(s) inside sub(): without it, pattern.sub() raises a TypeError for non-string items.
Also, this is a duplicate: Stripping everything but alphanumeric chars from a string in Python
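A sketch of the same comprehension that also drops entries which come out empty after cleaning (assumes Python 3.8+ for the assignment expression; the sample data is made up):

```python
import re

# Made-up sample data; str() guards against non-string items.
dirtylist = ["a-1 ", 42, "  !? ", "b_2"]

pattern = re.compile("[^A-Za-z0-9]")
# Substitute first, then keep only entries that are non-empty afterwards.
cleanlist = [cleaned for s in dirtylist
             if (cleaned := pattern.sub("", str(s)))]
print(cleanlist)  # → ['a1', '42', 'b2']
```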
The largest efficiency gain comes from using the full power of regular-expression processing: don't iterate through the list in Python, and don't process items one at a time. Note that re.sub() operates on a string, not a list, so join the items into one string first (this produces a single combined string rather than a list):
import re

cleanlist = re.sub("[^A-Za-z0-9]+", "", "".join(map(str, dirtylist)))
Just to be sure, I tested this against a couple of list-comprehension and string-replacement methods; the above was the fastest by at least 20%.
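The comparison can be reproduced roughly with timeit; a sketch (the test data and repeat count are made up, and absolute numbers will vary by machine and Python version):

```python
import re
import timeit

# Hypothetical test data for a rough comparison.
dirty = "a-1 b_2 c!3 " * 1000

regex_once = re.compile(r"[^A-Za-z0-9]+")

def with_regex():
    # One regex pass over the whole string.
    return regex_once.sub("", dirty)

def with_comprehension():
    # Character-by-character filtering in Python.
    return "".join(c for c in dirty if c.isalnum())

# Rough timings; lower is faster.
print(timeit.timeit(with_regex, number=100))
print(timeit.timeit(with_comprehension, number=100))
```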
Suppose I have an expression
exp="\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <50 and \"OLS\".\"PRODUCTS\".\"PRODUCT_NAME\" = 'Kingston' or \"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <20"
I want to split the expression on and / or, so that my result will be:
exp=['\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <50', '\"OLS\".\"PRODUCTS\".\"PRODUCT_NAME\" = \'Kingston\'', '\"OLS\".\"ORDER_ITEMS\".\"QUANTITY\" <20']
This is what I have tried:
import re
res=re.split('and|or|',exp)
but it splits at every character. How can we make it split by word?
import itertools
exp=itertools.chain(*[y.split('or') for y in exp.split('and')])
exp=[x.strip() for x in list(exp)]
Explanation: first split on 'and'. Then split each element obtained on 'or'. This creates a list of lists; using itertools.chain, flatten it into one list, and strip the extra spaces from each element of the flat list.
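Run against a simplified version of the expression (a sketch with made-up data; note it still splits on 'and'/'or' appearing inside other words or quoted text):

```python
import itertools

# Simplified stand-in for the original SQL-like expression.
exp = '"A" < 50 and "B" = \'x\' or "C" < 20'

# Split on 'and', then split each piece on 'or', then flatten and strip.
parts = itertools.chain(*[y.split('or') for y in exp.split('and')])
parts = [x.strip() for x in parts]
print(parts)  # → ['"A" < 50', '"B" = \'x\'', '"C" < 20']
```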
Your regex has three alternatives: "and", "or" or the empty string: and|or|
Omit the trailing | to split just by those two words.
import re
res = re.split('and|or', exp)
Note that this will not work reliably; it'll split on any instance of "and", even when it's in quotes or part of a word. You could make it split only on full words using \b, but that will still split on a product name like 'Black and Decker'. If you need it to be reliable and general, you'll have to parse the string using the full syntax (probably using an off-the-shelf parser, if it's standard SQL or similar).
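A sketch of the \b variant on a simplified expression, including the failure case mentioned above (sample strings are made up):

```python
import re

exp = '"QTY" < 50 and "NAME" = \'Kingston\' or "QTY" < 20'

# \b restricts the split to the whole words "and" / "or".
parts = [p.strip() for p in re.split(r'\band\b|\bor\b', exp)]
print(parts)  # → ['"QTY" < 50', '"NAME" = \'Kingston\'', '"QTY" < 20']

# ...but it still splits inside quoted text containing those words:
print(re.split(r'\band\b|\bor\b', "'Black and Decker'"))
# → ["'Black ", " Decker'"]
```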
You can do it in 2 steps: [ss for s in exp.split(" and ") for ss in s.split(' or ')]
I have a list of strings, all of which share a common property: they all look like "pp:actual_string". I do not know for sure what the substring "pp:" will be; basically, : acts as a delimiter, and everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note: some strings might not have this "pp:string" format and could already be a plain string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav", "pp17:sauarv", "pp17:there", "pp17:someone"]
res = []
for i in ll:
    for j in range(len(i)):
        if i[j] == ':':
            index = j + 1
    res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?
Whilst regex is an incredibly powerful tool with a lot of capabilities, reaching for a "clever method" is not necessarily the best idea if you are unfamiliar with its principles.
Your problem can be solved without regex by splitting on the : character using the str.split() method and returning the last part via the [-1] index, which represents the last (or only) string resulting from the split. This works even if there is no : in the string.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This list comprehension takes each of the strings in turn (x), splits it on the : character (the split returns a list containing the prefix, if present, and the suffix), and builds a new list containing only the suffix, i.e. item [-1] of the split result. In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']
Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
    s = s[len('pp:'):]  # len('pp:') == 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)
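To see how the split variants differ on edge cases (sample strings are made up):

```python
ll = ["pp17:gaurav", "no_delimiter", "a:b:c"]

# maxsplit=1 keeps everything after the *first* ':' ...
print([s.split(':', 1)[-1] for s in ll])  # → ['gaurav', 'no_delimiter', 'b:c']

# ... while a plain split with [-1] keeps only the part after the *last* ':'.
print([s.split(':')[-1] for s in ll])     # → ['gaurav', 'no_delimiter', 'c']
```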
Hopefully the same question hasn't already been answered (I looked but could not find it).
I have a list of partial strings:
date_parts = ['/Year', '/Month', '/Day',....etc. ]
and I have a string.
E.g.
string1 = "Tag01/Source 01/Start/Year"
or
string1 = "Tag01/Source 01/Volume"
What is the most efficient way, apart from using a for loop, to check if any of the date_parts strings are contained within the string?
For info, string1 in reality is actually another list of many strings and I would like to remove any of these strings that contain a string within the date_parts list.
Compile a regex from the partial strings. Use re.escape() in case they contain characters that are special in regex syntax.
import re
date_parts = ['/Year', '/Month', '/Day']
pattern = re.compile('|'.join(re.escape(s) for s in date_parts))
Then use re.search() to see if it matches.
string1 = "Tag01/Source 01/Start/Year"
re.search(pattern, string1)
The regex engine is probably faster than a native Python loop.
For your particular use case, consider concatenating all the strings, like
all_string = '\n'.join(strings+[''])
Then you can do them all at once in a single call to the regex engine.
pattern = '|'.join(f'.*{re.escape(s)}.*\n' for s in date_parts)
strings = re.sub(pattern, '', all_string).split('\n')[:-1]
Of course, this assumes that none of your strings has a '\n'. You could pick some other character that's not in your strings to join and split on if necessary. '\f', for example, should be pretty rare. Here's how you might do it with '#'.
all_string = '#'.join(strings+[''])
pattern = '|'.join(f'[^#]*{re.escape(s)}[^#]*#' for s in date_parts)
strings = re.sub(pattern, '', all_string).split('#')[:-1]
If that's still not fast enough, you could try a faster regex engine, like rure.
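If the joined-string trick is more machinery than you need, the same compiled pattern also works per string in a plain comprehension (a sketch with made-up data; one regex call per string rather than one overall, but with no delimiter assumption):

```python
import re

date_parts = ['/Year', '/Month', '/Day']
pattern = re.compile('|'.join(re.escape(s) for s in date_parts))

# Hypothetical input strings.
strings = ["Tag01/Source 01/Start/Year", "Tag01/Source 01/Volume"]

# Keep only the strings the pattern does not match anywhere.
unmatched = [s for s in strings if not pattern.search(s)]
print(unmatched)  # → ['Tag01/Source 01/Volume']
```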
You can use the any function with a generator expression. It should be a little faster than an explicit for loop.
For one string, you can test like this:
any(p in string1 for p in date_parts)
If strings is a list of many strings you want to check, you could do this:
unmatched = [s for s in strings if not any(p in s for p in date_parts)]
or
unmatched = [s for s in strings if all(p not in s for p in date_parts)]
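For example, with made-up data:

```python
date_parts = ['/Year', '/Month', '/Day']
strings = ["Tag01/Source 01/Start/Year", "Tag01/Source 01/Volume"]

# Keep the strings that contain none of the partial strings.
unmatched = [s for s in strings if not any(p in s for p in date_parts)]
print(unmatched)  # → ['Tag01/Source 01/Volume']
```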
I have a large list of strings and I would like to filter out everything inside parentheses, so I am using the following regex:
text_list = [' 1__(this_is_a_string) 74_string__(anotherString_with_underscores) question__(stringWithAlot_of_underscores) 1.0__(another_withUnderscores) 23:59:59__(get_arguments_end) 2018-05-13 00:00:00__(get_arguments_start)']
import re
r = re.compile(r'\([^)]*\)')
a_lis = list(filter(r.search, text_list))
print(a_lis)
I tested my regex here, and it is working. However, when I apply it as above I end up with an empty list:
[]
Any idea of how to filter all the tokens inside parenthesis from a list?
Your regex is OK (though perhaps you don't want to include the parentheses in the match), but search() is the wrong method to use. You want findall() to get the text of all the matches, rather than a match object for just the first one:
list(map(r.findall, text_list))
This will give you a list of lists, where each inner list contains the strings which were inside parentheses.
For example, given this input:
text_list = ['asdf (qwe) asdf (gdfd)', 'xx', 'gdfw(rgf)']
The result is:
[['(qwe)', '(gdfd)'], [], ['(rgf)']]
If you want to exclude the parentheses, change the regex slightly:
r'\(([^)]*)\)'
The unescaped parentheses within the escaped ones indicate what to capture.
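A quick sketch of the capturing version on the same sample input:

```python
import re

text_list = ['asdf (qwe) asdf (gdfd)', 'xx', 'gdfw(rgf)']

r = re.compile(r'\(([^)]*)\)')  # group 1 is the text between the parentheses

# findall() returns the captured group when the pattern has exactly one group.
print([r.findall(s) for s in text_list])  # → [['qwe', 'gdfd'], [], ['rgf']]
```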
line = "english: while \tfrench: pendant que \tspanish: mientras \tgerman: während "
words = line.split('\t')
for each in words:
    each = each.rstrip()
print(words)
the string in 'line' is tab delimited but also features a single white space character after each translated word, so while split returns the list I'm after, each word annoyingly has a whitespace character at the end of the string.
In the loop I'm trying to go through the list and remove any trailing whitespace from the strings, but it doesn't seem to work. Suggestions?
Just line.split() could give you a stripped word list (though note it splits on every run of whitespace, so a multi-word entry like "pendant que" would be split into separate words).
Updating each inside the loop does not change the words list.
It should be done like this:
for i in range(len(words)):
    words[i] = words[i].rstrip()
Or
words = list(map(str.rstrip, words))
See the map docs for details on map.
Or one liner with list comprehension
words = [x.rstrip() for x in line.split("\t")]
Or with regex .findall (the raw matches keep each trailing space, so strip them as well):
words = [w.rstrip() for w in re.findall(r"[^\t]+", line)]
words = line.split('\t')
words = [i.rstrip() for i in words]
You can use a regular expression:
import re
words = re.split(r' *\t| +$', line)[:-1]
This defines the possible delimiter sequences: any run of spaces followed by a tab, or trailing spaces at the end of the line. The * quantifier also allows more than one space (or none at all).
EDIT: Fixed after Roger Pate pointed out an error.
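A sketch of the split on a line built the way the question describes (tab-delimited, with a stray trailing space after each word; the sample data is made up):

```python
import re

line = "while \tpendant que \tmientras \twährend "

# ' *\t' eats the spaces before each tab; ' +$' eats the trailing spaces,
# leaving one empty final element that [:-1] drops.
words = re.split(r' *\t| +$', line)[:-1]
print(words)  # → ['while', 'pendant que', 'mientras', 'während']
```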