Python regex extracting substrings containing numbers and letters - python

I am attempting to extract a substring that contains numbers and letters:
string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
I only want 11m56.95s and 13m31.14s
I have tried doing this:
re.findall('\d+', string)
that doesn't give me what I want, I also tried this:
re.findall('\d{2}[m]+\d[.]+\d|\+)
that did not work either, any other suggestions?

Try this:
re.findall(r"[0-9]{2}[m][0-9]{2}\.[0-9]{2}[s]", string)
Output:
['11m56.95s', '13m31.14s']

Your current regular expression does not match what you expect it to.
You could use the following regular expression to extract those substrings.
re.findall(r'\d+m\d+\.\d+s', string)
Example:
>>> import re
>>> s = 'LINE : 11m56.95s CPU 13m31.14s TODAY'
>>> for x in re.findall(r'\d+m\d+\.\d+s', s):
...     print(x)
11m56.95s
13m31.14s

Your Regex pattern is not formed correctly. It is currently matching:
\d{2} # Two digits
[m]+ # One or more m characters
\d # A digit
[.]+ # One or more . characters
\d|\+ # A digit or +
Instead, you should use:
>>> import re
>>> string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
>>> re.findall(r'\d+m\d+\.\d+s', string)
['11m56.95s', '13m31.14s']
>>>
Below is an explanation of what the new pattern matches:
\d+ # One or more digits
m # m
\d+ # One or more digits
\. # .
\d+ # One or more digits
s # s
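The same breakdown can be written into the pattern itself with re.VERBOSE, which lets a regex carry its own comments. This is only a sketch using the sample string from the question; the name DURATION is made up for illustration:

```python
import re

# Hypothetical name; the pattern is the one explained above, one piece per line.
DURATION = re.compile(r"""
    \d+   # one or more digits
    m     # literal m
    \d+   # one or more digits
    \.    # literal dot
    \d+   # one or more digits
    s     # literal s
""", re.VERBOSE)

string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
matches = DURATION.findall(string)
print(matches)
```

In verbose mode, whitespace in the pattern is ignored and # starts a comment, so the layout does not change what is matched.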

\b #word boundary
\d+ #starts with digit
.*? #anything (non-greedy so its the smallest possible match)
s #ends with s
\b #word boundary
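Put together, the pieces listed above seem to describe the pattern \b\d+.*?s\b (the assembled form is an inference, since the answer lists only the parts). A quick check against the question's string:

```python
import re

string = "LINE : 11m56.95s CPU 13m31.14s TODAY"

# Assembled from the parts above: boundary, digits, lazy "anything", s, boundary.
loose = re.findall(r"\b\d+.*?s\b", string)
print(loose)

# Because .*? accepts any characters, the pattern is fairly permissive.
permissive = re.findall(r"\b\d+.*?s\b", "3 cats")
print(permissive)
```

The second call shows the trade-off: this pattern is shorter, but it also accepts text that is not in the 11m56.95s format.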

If your lines are all like your example split will work:
s = "LINE : 11m56.95s CPU 13m31.14s TODAY"
spl = s.split()
a,b = spl[2],spl[4]
print(a, b)
11m56.95s 13m31.14s

Related

How do I remove a string that starts with '#' and ends with a blank character by using regular expressions in Python?

So I have this text:
"#Natalija What a wonderful day, isn't it #Kristina123 ?"
I tried to remove the two substrings that start with the character '#' using the re.sub function, but it didn't work.
How do I remove a substring that starts with this character?
Try this regex:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
t = re.sub('#.*? ', '', text)
print(t)
OUTPUT :
What a wonderful day, isn't it ?
This should work.
# matches the character #
\w+ matches one or more word characters, as many as possible, so the match stops at the first blank character
Code:
import re
regex = r"#\w+"
subst = "XXX"
test_str = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)
output:
XXX What a wonderful day, isn't it XXX ?
It's possible to do this with re.sub(); it would be something like this:
import re
text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
output = re.sub(r'#[a-zA-Z0-9]+\s', '', text)
print(output) # Output: What a wonderful day, isn't it ?
# matches the # character
[a-zA-Z0-9] matches alphanumerical (uppercase and lowercase)
"+" means "one or more" (otherwise it would match only one of those characters)
\s matches whitespaces
Alternatively, this can also be done without using the module re. You can first split the sentence into words. Then remove the words containing the # character and finally join the words into a new sentence.
if __name__ == '__main__':
    original_text = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
    individual_words = original_text.split(' ')
    words_without_tags = [word for word in individual_words if '#' not in word]
    new_sentence = ' '.join(words_without_tags)
    print(new_sentence)
I think this will work for you. The pattern #\w+?\s matches expressions that start with #, continue with one or more word characters, and end with a whitespace character (the whitespace is required, not optional).
import re
string = "#Natalija What a wonderful day, isn't it #Kristina123 ?"
pattern = r'#\w+?\s'
replaced = re.sub(pattern, '', string)
print(replaced)
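Because \s in the patterns above is mandatory, a tag at the very end of the text would be left in place. Here is a small sketch of one way around that, assuming a hypothetical input where the tag comes last; \s* makes the trailing whitespace optional:

```python
import re

# Hypothetical variant of the question's text, with the tag at the very end.
text = "#Natalija What a wonderful day, isn't it #Kristina123"

strict = re.sub(r"#\w+\s", "", text)    # requires whitespace after the tag
relaxed = re.sub(r"#\w+\s*", "", text)  # whitespace after the tag is optional

print(repr(strict))
print(repr(relaxed))
```

The relaxed version leaves a trailing space behind, so a final .strip() may be wanted depending on the use case.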

About how to find all desired format in a str

I have text in a format like this:
s = '[aaa]foo[bbb]bar[ccc]foobar'
The actual text is a Chinese car review like this:
【最满意】整车都很满意,最满意就是性价比,...【空间】空间真的超乎想象,毫不夸张,...【内饰】内饰还可以吧,没有多少可以说的...
Now I want to split it into these parts:
[aaa]foo
[bbb]bar
[ccc]foobar
First I tried:
>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']
That only got the first half of each part.
Then I tried:
>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']
That still only got the first half.
In the end I had to get the two halves separately and then zip them:
>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']
>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']
>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
... print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar
So I want to know whether there is a regex that could split it into these parts directly.
One of the approaches:
import re
s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)
print(result)
The output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
\[ or \] - matches the bracket literally
[^]]+ - matches one or more characters except ]
[^\[\]]+ - matches any character(s) except brackets \[\]
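The same negated-character-class idea should carry over to the full-width brackets 【 and 】 used in the actual review text. A sketch, assuming a shortened version of the question's sample (the elided parts are simply left out):

```python
import re

# Shortened sample using the full-width brackets from the question.
review = "【最满意】整车都很满意【空间】空间真的超乎想象【内饰】内饰还可以吧"

# 【, one or more non-】 characters, 】, then everything up to the next bracket.
parts = re.findall(r"【[^】]+】[^【】]+", review)
for part in parts:
    print(part)
```

Unlike the \w-based patterns, this version does not care whether the body text contains punctuation, which real review text is likely to have.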
I think this could work:
r'\[.+?\]\w+'
Here it is:
>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Explanation:
The parentheses define the group to capture. The group:
starts with an opening bracket \[ followed by some word characters \w
then the matching closing bracket \] followed by more word characters \w
Note that the brackets must be escaped with \.
If the input string format is "strict enough", it's possible to try something without regex. It may look like a micro-optimisation, but it could be interesting as a challenge.
result = map(lambda x: '[' + x, s[1:].split("["))
I checked performance over 1 million iterations; here are my results (in seconds):
result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277
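The timings above will vary with machine and Python version. One rough way to reproduce the comparison is timeit; this is only a sketch, with a smaller iteration count picked for illustration. Note that on Python 3 map returns a lazy iterator, so it is wrapped in list() here to make sure the work is actually performed when timed:

```python
import re
import timeit

s = '[aaa]foo[bbb]bar[ccc]foobar'

# Candidates from the answers above; map() is wrapped in list() because
# Python 3's map is lazy and would otherwise do no work when timed.
candidates = {
    "split": lambda: list(map(lambda x: '[' + x, s[1:].split('['))),
    "negated class": lambda: re.findall(r'\[[^]]+\][^\[\]]+', s),
    "lazy dot": lambda: re.findall(r'\[.+?\]\w+', s),
    "word chars": lambda: re.findall(r'(\[\w*\]\w+)', s),
}

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=100_000)
    print(f"{name}: {seconds:.3f}s for 100,000 runs")
```

All four candidates return the same list for this input, so the comparison is at least like for like.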
\[.*?\][a-zA-Z]*
This regex should capture anything that starts with [something here] followed by any letters from a to z, upper or lower case.
You can experiment on regex101 to try out different patterns; it's easy to build your own regex there.
All you need is findall; here is a very simple pattern that avoids unnecessary complexity:
import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))
output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Detailed solution:
import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))
explanation:
\[\w+\]\w+
\[ matches the character [ literally (case sensitive)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed

Find a match word and printing letters by their sides

I am trying to find a word in a string, match it with a query word, and then print it with some of its neighboring letters, like this:
input = aaxxYYxxaa
match = YY
requested_output = xxYYxx
So far I have tried with the re module, but I cannot get beyond the ‘match’ part:
import re
teststring = "aaxxYYxxaa"
word = re.findall (r"YY", teststring)
print(word)
output = YY
What could I do here to print the letters on each side of the ‘YY’ word?
Thank you.
It looks as if you want to match any 0 to 2 chars before and after the YY value. Add .{0,2} on both sides of the pattern:
re.findall(r".{0,2}YY.{0,2}", teststring)
Python demo:
import re
teststring = "aaxxYYxxaa"
word = re.findall (r".{0,2}YY.{0,2}", teststring)
print(word) # => ['xxYYxx']
Write your regex so that it matches arbitrary characters before and after your known search term.
. matches any character
{m,n} repeats at least m times and at most n times
so to match xxYYxx you would say .{2,2}YY.{2,2}
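The difference between {0,2} and {2,2} shows up when the search term sits near the edge of the string. A small sketch; the helper name with_context and its parameters are made up for illustration, and re.escape is used so the search term is treated literally:

```python
import re

def with_context(text, word, width=2, strict=False):
    """Find `word` with up to `width` chars per side (exactly `width` if strict)."""
    low = width if strict else 0
    side = ".{%d,%d}" % (low, width)
    return re.findall(side + re.escape(word) + side, text)

centered = with_context("aaxxYYxxaa", "YY")
near_edge = with_context("xYYxxaa", "YY")
near_edge_strict = with_context("xYYxxaa", "YY", strict=True)

print(centered)
print(near_edge)
print(near_edge_strict)
```

With the strict form, a term that has only one character to its left is not matched at all, which may or may not be what is wanted.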

regex and python

I have a string:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
I want to extract the "T" character from the Timestamp, ie change it to:
"123ABC,'2009-12-23 23:45:58.544-04:00'"
I am trying this:
newString = re.sub('(?:\-\d{2})T(?:\d{2}\:)', ' ', myString)
BUT, the returned string is:
"123ABC,'2009-12 45:58.544-04:00'"
The "non capturing groups" don't appear to be "non capturing", and it's removing everything. What am I doing wrong?
You can use lookarounds (positive lookbehind and -ahead):
(?<=\d)T(?=\d)
In Python this would be:
import re
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
rx = r'(?<=\d)T(?=\d)'
# match a T surrounded by digits
new_string = re.sub(rx, ' ', myString)
print(new_string)
# 123ABC,'2009-12-23 23:45:58.544-04:00'
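As for why the original attempt removed too much: a non-capturing group only skips creating a numbered group; whatever it matches is still part of the overall match, and re.sub replaces the whole match. Lookarounds, by contrast, check the neighboring characters without consuming them. A short sketch comparing what each pattern actually matches:

```python
import re

myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"

# Non-capturing groups: the surrounding text is consumed by the match.
consumed = re.search(r'(?:-\d{2})T(?:\d{2}:)', myString).group(0)
# Lookarounds: only the T itself is matched.
only_t = re.search(r'(?<=\d)T(?=\d)', myString).group(0)

print(repr(consumed))
print(repr(only_t))
```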
Regex seems like overkill here (note that this replaces every T in the string, which is fine for this input):
myString.replace("T", " ")
I'd use capturing groups; unanchored lookbehinds are costly in terms of regex performance:
(\d)T(\d)
Then replace with the r'\1 \2' replacement pattern, which contains backreferences to the digits before and after the T.
Python demo:
import re
s = "123ABC,'2009-12-23T23:45:58.544-04:00'"
reg = re.compile(r'(\d)T(\d)')
s = reg.sub(r'\1 \2', s)
print(s)
That T sits between digits and will always be the last T in the string. You could use rsplit and join:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ABC,'2009-12-23 23:45:58.544-04:00'"
Trying this with an additional T earlier in the string:
myString = "123ATC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ATC,'2009-12-23 23:45:58.544-04:00'"

How to replace part of string via regex with saving part of pattern?

For example, I have strings like this:
s = "chapter1 in chapters"
How can I replace it with regex to this:
s = "chapter 1 in chapters"
That is, I only need to insert whitespace between "chapter" and its number, if one exists. re.sub(r'chapter\d+', r'chapter \d+', s) doesn't work.
You can use lookarounds:
>>> s = "chapter1 in chapters"
>>> print(re.sub(r"(?<=\bchapter)(?=\d)", ' ', s))
chapter 1 in chapters
RegEx Breakdown:
(?<=\bchapter) # asserts a position where the preceding text is chapter
(?=\d) # asserts a position where the next char is a digit
You can use capture groups, something like this:
>>> s = "chapter1 in chapters"
>>> re.sub(r'chapter(\d+)',r'chapter \1',s)
'chapter 1 in chapters'
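Both approaches above give the same result on the sample. One variation worth sketching: adding \b keeps the capture-group pattern from firing inside a longer word, and \g<1> is the unambiguous spelling of a backreference. The second input string here is made up for illustration:

```python
import re

s = "chapter1 in chapters"

# Zero-width lookarounds: insert a space at the position between the two.
by_lookaround = re.sub(r"(?<=\bchapter)(?=\d)", " ", s)
# Capture group: re-emit the digits after a space.
by_group = re.sub(r"\bchapter(\d+)", r"chapter \g<1>", s)

print(by_lookaround)
print(by_group)

# Hypothetical input: "subchapter2" is left alone because of \b.
other = re.sub(r"\bchapter(\d+)", r"chapter \g<1>", "subchapter2, chapter12")
print(other)
```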
