Removing empty strings from a re.findall command [duplicate] - python

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 4 years ago.
import re
name = 'propane'
a = []
Alkane = re.findall('(\d+\W+)*(methyl|ethyl|propyl|butyl)*(meth|eth|prop|but|pent|hex)(ane)', name)
if Alkane != a:
print(Alkane)
As you can see when the regular express takes in propane it will output two empty strings.
[('', '', 'prop', 'ane')]
For these types of inputs, I want to remove the empty strings from the output. I don't know what kind of form this output is in though, it doesn't look like a regular list.

You can use str.split() and str.join() to remove empty strings from your output:
>>> import re
>>> name = 'propane'
>>> Alkane = re.findall('(\d+\W+)*(methyl|ethyl|propyl|butyl)*(meth|eth|prop|but|pent|hex)(ane)', name)
>>> Alkane
[('', '', 'prop', 'ane')]
>>> [tuple(' '.join(x).split()) for x in Alkane]
[('prop', 'ane')]
Or using filter():
[tuple(filter(None, x)) for x in Alkane]

You can use filter to remove empty strings:
import re
name = 'propane'
a = []
Alkane = list(map(lambda m: tuple(filter(bool, m)), re.findall('(\d+\W+)*(methyl|ethyl|propyl|butyl)*(meth|eth|prop|but|pent|hex)(ane)', name)))
if Alkane != a:
print(Alkane)
Or you can use list/tuple comprehension:
import re
name = 'propane'
a = []
Alkane = [tuple(i for i in m if i) for m in re.findall('(\d+\W+)*(methyl|ethyl|propyl|butyl)*(meth|eth|prop|but|pent|hex)(ane)', name)]
if Alkane != a:
print(Alkane)
Both output:
[('prop', 'ane')]

It is stated in the doc that empty match are included.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
This means you will need to filter out empty compounds yourself. Use falsiness of the empty string for that.
import re
name = 'propane'
alkanes = re.findall(r'(\d+\W+)*(methyl|ethyl|propyl|butyl)*(meth|eth|prop|but|pent|hex)(ane)', name)
alkanes = [tuple(comp for comp in a if comp) for a in alkanes]
print(alkanes) # [('prop', 'ane')]
Also, avoid using capitalized variable names as those are generally reserved for class names.

Related

How to replace a character within a string in a list?

I have a list that has some elements of type string. Each item in the list has characters that are unwanted and want to be removed. For example, I have the list = ["string1.", "string2."]. The unwanted character is: ".". Therefore, I don't want that character in any element of the list. My desired list should look like list = ["string1", "string2"] Any help? I have to remove some special characters; therefore, the code must be used several times.
hola = ["holamundoh","holah","holish"]
print(hola[0])
print(hola[0][0])
for i in range(0,len(hola),1):
for j in range(0,len(hola[i]),1):
if (hola[i][j] == "h"):
hola[i] = hola[i].translate({ord('h'): None})
print(hola)
However, I have an error in the conditional if: "string index out of range". Any help? thanks
Modifying strings is not efficient in python because strings are immutable. And when you modify them, the indices may become out of range at the end of the day.
list_ = ["string1.", "string2."]
for i, s in enumerate(list_):
l[i] = s.replace('.', '')
Or, without a loop:
list_ = ["string1.", "string2."]
list_ = list(map(lambda s: s.replace('.', ''), list_))
You can define the function for removing an unwanted character.
def remove_unwanted(original, unwanted):
return [x.replace(unwanted, "") for x in original]
Then you can call this function like the following to get the result.
print(remove_unwanted(hola, "."))
Use str.replace for simple replacements:
lst = [s.replace('.', '') for s in lst]
Or use re.sub for more powerful and more complex regular expression-based replacements:
import re
lst = [re.sub(r'[.]', '', s) for s in lst]
Here are a few examples of more complex replacements that you may find useful, e.g., replace everything that is not a word character:
import re
lst = [re.sub(r'[\W]+', '', s) for s in lst]

Possible occurrences of splitting a string by delimiter

I have a string : str = "**Quote_Policy_Generalparty_NameInfo** "
I am splitting the string as str.split("_") which gives me a list in python.
Any help in getting the output as below is appreciated.
[ Quote, Quote_Policy, Quote_Policy_Generalparty, Quote_Policy_Generalparty_NameInfo ]
You can use range(len(list)) to create slices list[:1], list[:2], etc. and then "_".join(...) to concatenate every slice
text = "Quote_Policy_Generalparty_NameInfo "
data = text.split('_')
result = []
for x in range(len(data)):
part = data[:x+1]
part = "_".join(part)
result.append(part)
print(result)
input = "Quote_Policy_Generalparty_NameInfo"
tokenized = input.split("_")
combined = [
"_".join(tokenized[:i])
for i, token in enumerate(tokenized, 1)
]
The value of combined above will be
['Quote', 'Quote_Policy', 'Quote_Policy_Generalparty', 'Quote_Policy_Generalparty_NameInfo']
you could use accumulate from itertools, we basically give it one more argument, which decides how to accumulate two elements
from itertools import accumulate
input = "Quote_Policy_Generalparty_NameInfo"
output = [*accumulate(input.split('_'), lambda str1, str2 : '_'.join([str1,str2])),]
which gives :
Out[11]:
['Quote',
'Quote_Policy',
'Quote_Policy_Generalparty',
'Quote_Policy_Generalparty_NameInfo']
If you find the above answers too clean and satisfactory, you can also consider regular expressions:
>>> import regex as re # For `overlapped` support
>>> x = "Quote_Policy_Generalparty_NameInfo"
>>> list(map(lambda s: s[::-1], re.findall('(?<=_).*$', '_' + x[::-1], overlapped=True)))
['Quote_Policy_Generalparty_NameInfo', 'Quote_Policy_Generalparty', 'Quote_Policy', 'Quote']

split and flatten tuple of tuples

What is the best way to split and flatten the tuple of tuples below?
I have this tuple of tuples:
(('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',), ('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',), ('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',), ('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
I need to:
split by \n\, \r\, \n\, \r\, ",", " "
and flatten it. So the end result should look like this:
['aaaa_BBB_wacker*','cccc', 'aaaa_BBB_tttt*','aaaa_BBB2_wacker','aaaa_BBB','BBB_ffff','aaaa_BBB2MM*','naaaa_BBB_cccc2MM*','BBBMM','BBB2MM BBB','aaaa_BBB_cccc2MM_tttt','aaaa_BBB_tttt', 'aaaa_BBB']
I tried the following and it eventually completes the job but I have to repeat it multiple times for each pattern.
patterns = [[i.split('\\r') for i in patterns]]
patterns = [item for sublist in patterns for item in sublist]
patterns = [item for sublist in patterns for item in sublist]
patterns = [[i.split('\\n') for i in patterns]]
You should use a regexp to split the strings:
import re
re.split(r'[\n\r, ]+', s)
It will be easier using a loop:
patterns = []
for item in l:
patterns += re.split(r'[\n\r, ]+', s)
Given
tups = (('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',),
('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',),
('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',),
('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
Do
import re
delimiters = ('\r', '\n', ',', ' ', '\\r', '\\n')
pat = '(?:{})+'.format('|'.join(map(re.escape, delimiters)))
result = [s for tup in tups for item in tup for s in re.split(pat, item)]
Notes. Calling re.escape on your delimiters makes sure that they are properly escaped for your regular expression. | makes them alternatives. ?: makes your delimiter group non-capturing so it isn't returned by re.split. + means match the previous group one or more times.
Here is a one-liner.. but it's not simple. You can add as many items as you want in the replace portion, just keep adding them.
start = (('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',), ('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',), ('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',), ('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
output = [final_item for sublist in start for item in sublist for final_item in item.replace('\\r',' ').replace('\\n',' ').split()]

Remove a pattern in all strings in a list

For example, if I have a list of strings
alist=['a_name1_1', 'a_name1_2', 'a_name1_3']
How do I get this:
alist_changed = ['a_n1_1', 'a_n1_2', 'a_n1_3']
alist_changed = [s.replace("ame", "") for s in alist]
If you are looking for something that actually needs to be "pattern" based then you can use python's re module and sub the regular expression pattern for what you want.
import re
alist=['a_name1_1', 'a_name1_2', 'a_name1_3']
alist_changed = []
pattern = r'_\w*_'
for x in alist:
y = re.sub(pattern, '_n1_', x, 1)
#print(y)
alist_changed.append(y)
print(alist_changed)

Remove strings from a list that contains numbers in python [duplicate]

This question already has answers here:
How to remove all integer values from a list in python
(8 answers)
Closed 6 years ago.
Is there a short way to remove all strings in a list that contains numbers?
For example
my_list = [ 'hello' , 'hi', '4tim', '342' ]
would return
my_list = [ 'hello' , 'hi']
Without regex:
[x for x in my_list if not any(c.isdigit() for c in x)]
I find using isalpha() the most elegant, but it will also remove items that contain other non-alphabetic characters:
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”
my_list = [item for item in my_list if item.isalpha()]
I'd use a regex:
import re
my_list = [s for s in my_list if not re.search(r'\d',s)]
In terms of timing, using a regex is significantly faster on your sample data than the isdigit solution. Admittedly, it's slower than isalpha, but the behavior is slightly different with punctuation, whitespace, etc. Since the problem doesn't specify what should happen with those strings, it's not clear which is the best solution.
import re
my_list = [ 'hello' , 'hi', '4tim', '342' 'adn322' ]
def isalpha(mylist):
return [item for item in mylist if item.isalpha()]
def fisalpha(mylist):
return filter(str.isalpha,mylist)
def regex(mylist,myregex = re.compile(r'\d')):
return [s for s in mylist if not myregex.search(s)]
def isdigit(mylist):
return [x for x in mylist if not any(c.isdigit() for c in x)]
import timeit
for func in ('isalpha','fisalpha','regex','isdigit'):
print func,timeit.timeit(func+'(my_list)','from __main__ import my_list,'+func)
Here are my results:
isalpha 1.80665302277
fisalpha 2.09064006805
regex 2.98224401474
isdigit 8.0824341774
Try:
import re
my_list = [x for x in my_list if re.match("^[A-Za-z_-]*$", x)]
Sure, use the string builtin for digits, and test the existence of them.
We'll get a little fancy and just test for truthiness in the list comprehension; if it's returned anything there's digits in the string.
So:
out_list = []
for item in my_list:
if not [ char for char in item if char in string.digits ]:
out_list.append(item)
And yet another slight variation:
>>> import re
>>> filter(re.compile('(?i)[a-z]').match, my_list)
['hello', 'hi']
And put the characters that are valid in your re (such as spaces/punctuation/other)

Categories