Extract number from a string using a pattern - python

I have strings like:
's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
From it, I would like to obtain a tuple containing the year value and the month value as its first and second elements:
('2019', '5')
For now I did this:
([elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][0], [elem.split('=')[-1:][0] for elem in part[0].split('/')[-2:]][1])
It isn't very elegant. How could I do better?

Use re.findall with the following regex pattern:
import re
matches = re.findall(r'(?i)/year=(\d+)/month=(\d+)', string)
Result:
# print(matches)
[('2019', '5')]

Regular expressions can do this. I would use them to capture the strings 'year=2019' and 'month=5', then split each one on the character '=' and take the item at index [-1]:
import re
search_string = 's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
string1 = re.findall(r'year=\d+', search_string)
string2 = re.findall(r'month=\d+', search_string)
result = (string1[0].split('=')[-1], string2[0].split('=')[-1])
print(result)
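Both answers assume the year and month segments are always present. A minimal sketch, assuming the path contains /year=.../month=... as in the question, that returns the tuple directly and gives None when nothing matches (the names PARTITION_RE and extract_year_month are hypothetical):

```python
import re

# Hypothetical names; the pattern is the one from the first answer,
# with named groups added for readability.
PARTITION_RE = re.compile(r'/year=(?P<year>\d+)/month=(?P<month>\d+)')

def extract_year_month(path):
    """Return (year, month) as strings, or None if the segments are missing."""
    match = PARTITION_RE.search(path)
    if match is None:
        return None
    return match.group('year'), match.group('month')

path = 's3://bukcet_name/tables/name=moonlight/land/timestamp=2020-06-25 01:00:23.180745/year=2019/month=5'
print(extract_year_month(path))  # ('2019', '5')
```

Returning None instead of indexing into an empty list makes a malformed path easier to detect than an IndexError would.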


how to replace a substring in a list of strings in python?

I'm using BeautifulSoup to crawl a table on a Wikipedia page and extract its data to a file.
The problem is that I want to remove some of the substrings in the list generated for the table's columns.
Here is my code:
soup= bs(result.text,'html.parser')
country_names= soup.find('table', class_= 'wikitable sortable').tbody
rows= country_names.find_all('tr')
columns=[v.text.replace('[a][b][13]\n', '') for v in rows[0].find_all('th')]
print(columns)
All I was able to do was remove a single substring from the strings in the list using replace().
The output before the replace() call:
['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
The output after the replace() call:
['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area']
So I want to remove all substrings such as '[8][9][10]\n', '[6][7]\n', '[6][7][8]\n' and '(2018)[11][12]\n', but I couldn't reach a solution because I'm still new to Python and BeautifulSoup.
I would suggest diving deeper into regular expressions.
For example, \[\d+\] matches one or more digits inside square brackets.
import re
org_string = 'Capital[8][9][10]\n'
pattern = r'\[\d+\]'
mod_string = re.sub(pattern, '', org_string)
# 'Capital\n' (the trailing newline remains; call .strip() to drop it)
I think this is the solution you are looking for:
import re
columns = [re.sub(r'(\[[0-9]+])', '', v.text).replace('\n', '') for v in rows[0].find_all('th')]
You can use Python's re library for this. Its function re.sub(pattern, replace_string, input_string) replaces every substring that matches the regular expression pattern.
Something like this:
# make sure to import the re module
import re
columns = [re.sub(r'(\[[a-zA-Z\d]*\])+\n', '', v.text) for v in rows[0].find_all('th')]
Edit: Changed the regular expression pattern
Your desired output requires the use of regular expressions inside a list comprehension:
import re
list_before = ['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
pattern = r'(\(\d+\))*(\[\w+\])*\n?'
list_after = [re.sub(pattern, "", elem).strip() for elem in list_before]
The pattern defines what to remove from each string of list_before. You may want to dig deeper into regular expressions to fully understand it, but in plain English it matches:
zero or more occurrences of the group "(", one or more digits (the special sequence \d), ")"
zero or more occurrences of the group "[", one or more word characters (the special sequence \w), "]"
zero or one occurrence of the newline \n
Finally, re.sub() inside the list comprehension replaces every match with "", and strip() removes any leftover whitespace.
The output is:
['Flag', 'Map', 'English short nameandformal name', 'Local short name(s)andformal name(s)', 'Capital', 'Population', 'Area']
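If only the bracketed footnote markers should go and a parenthesised year such as (2018) should stay, a narrower pattern is enough. A minimal sketch on a shortened copy of the header list from the question (the names FOOTNOTE_RE and strip_footnotes are hypothetical):

```python
import re

# Matches one bracketed footnote marker such as [6] or [a].
FOOTNOTE_RE = re.compile(r'\[\w+\]')

def strip_footnotes(header):
    """Remove bracketed markers and surrounding whitespace from one header."""
    return FOOTNOTE_RE.sub('', header).strip()

headers = ['Flag\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
print([strip_footnotes(h) for h in headers])
# ['Flag', 'Capital', 'Population (2018)', 'Area']
```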

Filtering a list of strings using regex

I have a list of strings that looks like this,
strlist = [
'list/category/22',
'list/category/22561',
'list/category/3361b',
'list/category/22?=1512',
'list/category/216?=591jf1!',
'list/other/1671',
'list/1y9jj9/1yj32y',
'list/category/91121/91251',
'list/category/0027',
]
I want to use regex to find the strings in this list that consist of list/category/ followed by an integer of any length and nothing else: no letters or symbols may come after it.
So in my example, the output should look like this
list/category/22
list/category/22561
list/category/0027
I used the following code:
newlist = []
for i in strlist:
    if re.match('list/category/[0-9]+[0-9]',i):
        newlist.append(i)
        print(i)
but this is my output:
list/category/22
list/category/22561
list/category/3361b
list/category/22?=1512
list/category/216?=591jf1!
list/category/91121/91251
list/category/0027
How do I fix my regex? And also is there a way to do this in one line using a filter or match command instead of a for loop?
You can try the below regex:
^list\/category\/\d+$
Explanation of the above regex:
^ - Represents the start of the given test String.
\d+ - Matches digits that occur one or more times.
$ - Matches the end of the test string. This is the part your regex missed.
IMPLEMENTATION IN PYTHON
import re
pattern = re.compile(r"^list\/category\/\d+$", re.MULTILINE)
match = pattern.findall("list/category/22\n"
"list/category/22561\n"
"list/category/3361b\n"
"list/category/22?=1512\n"
"list/category/216?=591jf1!\n"
"list/other/1671\n"
"list/1y9jj9/1yj32y\n"
"list/category/91121/91251\n"
"list/category/0027")
print (match)
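For the one-line version asked about in the question, a list comprehension with re.fullmatch avoids both the loop and the explicit ^ and $ anchors, because fullmatch only succeeds when the whole string matches. The original attempt failed because re.match anchors only at the start of the string. A minimal sketch on a shortened copy of the list from the question:

```python
import re

strlist = [
    'list/category/22',
    'list/category/22561',
    'list/category/3361b',
    'list/category/22?=1512',
    'list/other/1671',
    'list/category/91121/91251',
    'list/category/0027',
]

# re.fullmatch requires the entire string to match the pattern.
newlist = [s for s in strlist if re.fullmatch(r'list/category/\d+', s)]
print(newlist)
# ['list/category/22', 'list/category/22561', 'list/category/0027']
```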

Getting word from string

How can I get the word example from a string like this:
str = "http://test-example:123/wd/hub"
I wrote something like this:
print(str[10:str.rfind(':')])
but it doesn't work correctly if the string looks like
"http://tests-example:123/wd/hub"
You can use this regex, which relies on lookarounds to capture the value preceded by - and followed by :
(?<=-).+(?=:)
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use the following non-regex approach (with s holding the string), because you know example is a 7-letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
There are many ways.
Using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
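Since the input is a URL, another option is to let the standard library isolate the host name first, so that colons and hyphens elsewhere in the URL cannot interfere. A minimal sketch, assuming the wanted word is everything after the first hyphen in the host name (the helper name word_after_hyphen is hypothetical):

```python
from urllib.parse import urlsplit

def word_after_hyphen(url):
    """Return the part of the host name after its first hyphen, or None."""
    host = urlsplit(url).hostname  # e.g. 'test-example'
    if host is None or '-' not in host:
        return None
    return host.split('-', 1)[1]

print(word_after_hyphen("http://test-example:123/wd/hub"))   # example
print(word_after_hyphen("http://tests-example:123/wd/hub"))  # example
```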
I am not sure why you want to get a particular word from a string. I guess you want to check whether this word is present in the given string.
If that is the case, the code below can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
using re
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
    print(m.group())
Python strings have a built-in find method:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
This prints:
12
-1
This is the index of the found substring. If it equals -1, the substring was not found in the string. You can also use the in keyword:
'example' in 'http://test-example:123/wd/hub'
True

find elements of string that ends with specific value

I have a list of strings
['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345']
I want to use a regular expression to get the strings that end with a specific number.
As I understand it, I first have to combine all strings from the list into one large string, then form some kind of pattern to use as the regular expression.
I would be grateful if you could provide
regex(some_pattern, some_string)
that would return
['time_10', 'date_10']
or just
'time_10, date_10'
str.endswith is enough.
l = ['time_10', 'time_23', 'time_345', 'date_10', 'date_23', 'date_345']
result = [s for s in l if s.endswith('10')]
print(result)
['time_10', 'date_10']
If you insist on using regex,
import re
result = [s for s in l if re.search('10$', s)]
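Note that both versions would also accept a string such as 'time_110', because they only check the last two characters. If the trailing number has to match exactly, a negative lookbehind can rule out a preceding digit. A minimal sketch (the helper name ends_with_number and the extra 'time_110' entry are added for illustration):

```python
import re

def ends_with_number(strings, number):
    """Keep strings whose trailing digit run is exactly `number`."""
    # (?<!\d) rejects a digit right before the number, so 110 does not count as 10.
    pattern = re.compile(rf'(?<!\d){re.escape(str(number))}$')
    return [s for s in strings if pattern.search(s)]

l = ['time_10', 'time_23', 'time_345', 'date_10', 'time_110']
print(ends_with_number(l, 10))
# ['time_10', 'date_10']
```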

python regular expression substitute

I need to find the value of "taxid" in a large number of strings similar to one given below. For this particular string, the 'taxid' value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How do I write a regular expression for this in Python?
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
for line in lines:
    # Captures everything between 'taxid:' and the next '|', e.g. '9606(Human)'
    match = re.match(r".*\|taxid:([^|]+)\|.*", line)
    if match:
        print(match.groups())
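For a large number of strings, it may help to compile the pattern once and decide what should happen when a line has no taxid. A minimal sketch, assuming the lines are held in a list and that a missing taxid should yield None (the names TAXID_RE and extract_taxid, and the second sample line, are made up for illustration):

```python
import re

TAXID_RE = re.compile(r'taxid:(\d+)')

def extract_taxid(line):
    """Return the digits following 'taxid:', or None if the line has no taxid."""
    match = TAXID_RE.search(line)
    return match.group(1) if match else None

lines = [
    'score:0.86|taxid:9606(Human)|intact:EBI-999900',
    'score:0.50|intact:EBI-000000',  # made-up line without a taxid
]
print([extract_taxid(line) for line in lines])
# ['9606', None]
```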
