find substring from list - python

I have a list with elements I would like to remove from a string:
Example
list = ['345','DEF', 'QWERTY']
my_string = '12345XYZDEFABCQWERTY'
Is there a way to iterate over the list and find where its elements occur in the string? My final objective is to remove those elements from the string (I don't know if this is the proper way, since strings are immutable).

You could use a regex union :
import re
def delete_substrings_from_string(substrings, text):
    pattern = re.compile('|'.join(map(re.escape, substrings)))
    return re.sub(pattern, '', text)
print(delete_substrings_from_string(['345', 'DEF', 'QWERTY'], '12345XYZDEFABCQWERTY'))
# 12XYZABC
print(delete_substrings_from_string(['AA', 'ZZ'], 'ZAAZ'))
# ZZ
It uses re.escape so that the substrings are matched as literal text rather than interpreted as regex syntax.
It uses only one pass so it should be reasonably fast and it ensures that the second example isn't converted to an empty string.
If you want a faster solution, you could build a Trie-based regex out of your substrings.
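To make the single-pass point concrete, here is a minimal sketch comparing the regex union with a naive loop of str.replace calls. The function names delete_sequentially and delete_in_one_pass are illustrative only and do not come from the answer above.

```python
import re

def delete_sequentially(substrings, text):
    # Naive approach: one str.replace pass per substring.
    # A later pass can match text that only exists because of an earlier removal.
    for sub in substrings:
        text = text.replace(sub, '')
    return text

def delete_in_one_pass(substrings, text):
    # Regex union: each substring is escaped and all are tried in a single scan.
    pattern = re.compile('|'.join(map(re.escape, substrings)))
    return pattern.sub('', text)

print(repr(delete_sequentially(['AA', 'ZZ'], 'ZAAZ')))  # ''
print(repr(delete_in_one_pass(['AA', 'ZZ'], 'ZAAZ')))   # 'ZZ'
# Thanks to re.escape, the dot only matches a literal dot, so 'abc' survives.
print(repr(delete_in_one_pass(['a.c'], 'abc a.c')))     # 'abc '
```

The sequential version first turns 'ZAAZ' into 'ZZ' and then deletes that too, which is exactly the behaviour the single pass avoids.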

Related

how to replace a substring in a list of strings in python?

I'm using BeautifulSoup to scrape a table on a Wikipedia page and write the extracted data to a file.
The problem is that I want to remove some substrings from the list generated for the table's columns.
Here is my code:
soup= bs(result.text,'html.parser')
country_names= soup.find('table', class_= 'wikitable sortable').tbody
rows= country_names.find_all('tr')
columns=[v.text.replace('[a][b][13]\n', '') for v in rows[0].find_all('th')]
print(columns)
All I was able to do is remove a single substring from the strings in the list using replace().
The output before the replace() call:
['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
The output after the replace() call:
['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area']
So I want to remove all substrings such as '[8][9][10]\n', ' [6][7]\n ', '[6][7][8]\n', '(2018)[11][12]\n' and so on, but I couldn't find a solution because I'm still new to Python and BeautifulSoup.
I would suggest diving deeper into regular expressions:
Use e.g. \[\d+\] as the expression for any number of digits inside brackets.
import re
org_string = 'Capital[8][9][10]\n'
pattern = r'\[\d+\]'
mod_string = re.sub(pattern, '', org_string)
# 'Capital\n' (this pattern leaves the trailing newline; call .strip() to remove it)
I think this is the solution you are looking for:
import re
columns = [re.sub(r'\[[0-9]+\]', '', th.text).replace('\n', '') for th in rows[0].find_all('th')]
You can use Python's re regular expression library for this. The re library has a function re.sub(pattern, replace_string, input_string) that replaces every substring matching the regular expression pattern.
Something like this:
# make sure to import the re module
import re
columns = [re.sub(r'(\[[a-zA-Z\d]*\])+\n', '', v.text) for v in rows[0].find_all('th')]
Your desired output requires the use of regular expressions inside a list comprehension:
import re
list_before = ['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
pattern = r'(\(\d+\))*(\[\w+\])*\n?'
list_after = [re.sub(pattern, "", elem).strip() for elem in list_before]
The pattern defines the regular expression that is substituted in each string of list_before. You may want to dig deeper into regular expressions to fully understand it, but in plain English this pattern matches:
0 or more occurrences of the group "(", 1 or more digits (indicated by the special sequence \d), ")"
0 or more occurrences of the group "[", 1 or more word characters, i.e. letters, digits or underscore (indicated by the special sequence \w), "]"
0 or 1 occurrence of the newline \n
Finally, re.sub() inside the list comprehension replaces every match with "", and strip() removes any whitespace left at the ends.
The output is:
['Flag', 'Map', 'English short nameandformal name', 'Local short name(s)andformal name(s)', 'Capital', 'Population', 'Area']
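As a reading aid, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch of the pattern described above; the names pattern_verbose and cleaned are made up for the illustration.

```python
import re

list_before = ['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n',
               'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n',
               'Population (2018)[11][12]\n', 'Area[a][b][13]\n']

# Same pattern as above, split over several lines with comments.
pattern_verbose = re.compile(r'''
    (\(\d+\))*   # zero or more parenthesised numbers, e.g. (2018)
    (\[\w+\])*   # zero or more bracketed references, e.g. [6] or [a]
    \n?          # an optional newline
''', re.VERBOSE)

# strip() removes the space left behind in 'Population '.
cleaned = [pattern_verbose.sub('', elem).strip() for elem in list_before]
print(cleaned)
```

Note that "(s)" in "name(s)" is kept because the parenthesised group only accepts digits.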

Extract number from a string using a pattern

I have strings like :
's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
From it I would like to obtain a tuple containing the year value and the month value as its first and second elements:
('2019', '5')
For now I did this :
([elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][0], [elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][1])
It isn't very elegant, how could I do better ?
Use re.findall with the following regex pattern:
import re
matches = re.findall(r'(?i)/year=(\d+)/month=(\d+)', string)
Result:
# print(matches)
[('2019', '5')]
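If a single tuple is wanted rather than a list of tuples, re.search with groups() is one option. The sketch below assumes the string always has the /year=.../month=... layout shown in the question; the variable names path, match and result are illustrative.

```python
import re

path = 's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'

# re.search stops at the first match; groups() returns the captured values as a tuple.
match = re.search(r'/year=(\d+)/month=(\d+)', path)
result = match.groups() if match else None
print(result)  # ('2019', '5')
```

Checking for None before calling groups() avoids an AttributeError when a path does not contain both parts.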
Perhaps regular expressions could do it. I would use regular expressions to capture the strings 'year=2019' and 'month=5', then split each of them on the character '=' and take the item at index [-1]:
import re
search_string = 's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
string1 = re.findall(r'year=\d+', search_string)
string2 = re.findall(r'month=\d+', search_string)
result = (string1[0].split('=')[-1], string2[0].split('=')[-1])
print(result)

find elements of string that ends with specific value

I have a list of strings
['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345']
I want to use a regular expression to get the strings that end with a specific number.
As I understand it, I first have to combine all the strings from the list into one large string, then form some kind of pattern to use in the regular expression.
I would be grateful if you could provide
regex(some_pattern, some_string)
that would return
['time_10', 'date_10']
or just
'time_10, date_10'
str.endswith is enough.
l = ['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345']
result = [s for s in l if s.endswith('10')]
print(result)
['time_10', 'date_10']
If you insist on using regex,
import re
result = [s for s in l if re.search('10$', s)]
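One caveat with both versions: a string such as 'time_110' also ends with '10'. If the number after the underscore has to be exactly 10, including the underscore in the pattern is one way to express that. This is a sketch under that assumption; the extra list item and the names loose and exact are added only for illustration.

```python
import re

# 'time_110' is an extra, hypothetical item to show the difference.
l = ['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345', 'time_110']

# Matches any string whose last two characters are '10'.
loose = [s for s in l if s.endswith('10')]

# Requires the underscore right before the number, so 'time_110' is excluded.
exact = [s for s in l if re.search(r'_10$', s)]

print(loose)  # ['time_10', 'date_10', 'time_110']
print(exact)  # ['time_10', 'date_10']
```

The same restriction works without regex as s.endswith('_10').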

How to extract just the characters "abc-3456" from the given text in python

I have this code:
import re
text = "this is my desc abc-3456"
m = re.findall(r"\w+\-\d+", text)
print(m)
This prints ['abc-3456'] but I want to get only abc-3456 (without the square brackets and the quotes).
How to do this?
import re
text = "this is my desc abc-3456"
m = re.findall(r"\w+\-\d+", text)
print(m[0])
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
findall returns a list of strings. If you want the first one, use m[0].
print(m[0]) prints the string without the [] and ''.
If you only want the first (or only) result, do this:
import re
text = "this is my desc abc-3456"
m = re.search(r"\w+\-\d+", text)
print(m.group())
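Keep in mind that re.search returns None when nothing matches, so calling group() directly would raise an AttributeError in that case. A small guard, sketched here with a hypothetical helper name first_match, avoids that:

```python
import re

def first_match(pattern, text):
    # Return the first matching substring, or None when there is no match.
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r"\w+-\d+", "this is my desc abc-3456"))  # abc-3456
print(first_match(r"\w+-\d+", "no identifier here"))        # None
```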
re.findall returns a list of matches; each item in that list is a string. You can use re.finditer if you prefer to iterate over match objects.
In python, a list's representation is in brackets: [member1, member2, ...].
A string ("somestring") representation is in quotes: 'somestring'.
This means the representation of a list of strings is:
['somestring1', 'somestring2', ...]
So you have a string in a list; the characters you want to remove are part of Python's representation and not part of the data you have.
To get the string simply take the first element from the list:
mystring = m[0]

Regex for extraction in Python

I have a string like this:
"a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more".
I would like to get this as an output:
(("bla", 123, 456), ("bli", 789, 123), ("blu", 789))
I haven't been able to find the proper python regex to achieve that.
>>> re.findall(r' {{(\w+)\|(\w+)(?:\|(\w+))?}} ', s)
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]
If you want the numbers as integers, iterate over the output and convert them with int.
You need a lot of escapes in your regular expression since {, } and | are special characters in them. A first step to extract the relevant parts of the string would be this:
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)
For the example this gives:
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]
Then you can continue with converting strings with digits into integers and removing empty strings like for the last match.
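That post-processing step could look roughly like the following sketch. It reuses the regex from this answer; the helper name convert and the variable result are illustrative, and the conversion assumes that digit-only strings should become integers.

```python
import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')

def convert(groups):
    # Drop the empty string left by the optional third group,
    # and turn digit-only strings into integers.
    return tuple(int(g) if g.isdigit() else g for g in groups if g)

result = tuple(convert(groups) for groups in regex.findall(s))
print(result)
# (('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
```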
[re.split(r'\|', i) for i in re.findall(r"{{(.*?)}}", s)]
Returns:
[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]
This method works regardless of the number of elements in the {{ }} blocks.
To get close to the output you wrote, you need a regex and a split:
import re
list(map(lambda p: p.split("|"), re.findall(r"\{\{([^}]*)\}\}", s)))
To get it with the numbers converted, do this:
toint = lambda x: int(x) if x.isdigit() else x
[tuple(map(toint, p.split("|"))) for p in re.findall(r"\{\{([^}]*)\}\}", s)]
Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.
import re
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []
for match in re.finditer(r'{{.*?}}', s):
    # Split on pipe (|) and filter out non-alphanumerics
    parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]
    # Convert to int when possible
    for index, part in enumerate(parts):
        try:
            parts[index] = int(part)
        except ValueError:
            pass
    result.append(tuple(parts))
We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.
import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]
In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.
re.findall() returns a list of groups matched from the pattern.
Finally, a list comprehension splits each string and returns the result as a tuple.
Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.
from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList
LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))
patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))
s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
print(tuple(p[0] for p in patt.searchString(s)))
Prints:
(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))
