Find and extract two substrings from string - python

I have some strings (in fact they are lines read from a file). The lines are just copied to some other file, but some of them are "special" and need a different treatment.
These lines have the following syntax:
someText[SUBSTRING1=SUBSTRING2]someMoreText
So, what I want is: When I have a line on which this "mask" can be applied, I want to store SUBSTRING1 and SUBSTRING2 into variables. The braces and the = shall be stripped.
I guess this consists of several tasks:
Decide if a line contains this mask
If yes, get the positions of the substrings
Extract the substrings
I'm sure this is a easy task with regex, however, I'm not used to it. I can write a huge monster function using string manipulation, but I guess this is not the "Python Way" to do this.
Any suggestions on this?

re.search() returns None if it doesn't find a match. \w matches an alphanumeric, + means 1 or more. Parenthesis indicate the capturing groups.
s = """
bla bla
someText[SUBSTRING1=SUBSTRING2]someMoreText"""
results = {}
for line_num, line in enumerate(s.split('\n')):
m = re.search(r'\[(\w+)=(\w+)\]', line)
if m:
results.update({line_num: {'first': m.group(0), 'second': m.group(1)}})
print(results)

^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$
You can try this.Group 1and Group 2 has the two string you want.See demo.
https://regex101.com/r/pT4tM5/26
import re
p = re.compile(r'^[^\[\]]*\[([^\]\[=]*)=([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
test_str = "someText[SUBSTRING1=SUBSTRING2]someMoreText\nsomeText[SUBSTRING1=SUBSTRING2someMoreText\nsomeText[SUBSTRING1=SUBSTRING2]someMoreText"
re.findall(p, test_str)

Related

Find something that does not match a pattern at the beginning of line

I am using regex in Python to find something at the beginning of line that does not match pattern "SCENE" and before colon. The text looks like this
SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.
I need to find AQW, CER, DYU, EOI in this case.
I have tried
findall(r"^(?!SCENE)[^:]*, text, re.M)
I get AQW and EOI, but I get ddd\nDYU instead of DYU, ddd\nd\nEOI instead of EOI.
How could I get exactly AQW,CER,DYU,EOI?
You can try this to brteak your string to substrings and try to find there:
import re
line = "SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n."
lines = re.split("\\n([A-Z])", line)
lines = [a+b for a,b in zip(lines[1::2], lines[2::2])]
for line in lines:
if re.match(r"^(?!SCENE)[^:]*", line):
print(line.split(":")[0])
result is:
AQW
CER
DYU
EOI
This answer is not the best in terms of performance I assume
This could probably be simplified more, and I'm assuming that \n in your example string is a literal newline character.
This should match all of your use cases. It starts by looking for any number of characters that aren't SCENE preceding a : then it finds any characters after the colon that don't follow a newline and precede a : then the last . there is probably a way around, but the final character wasn't being properly matched because it was directly followed by the negative lookahead.
findall( r"([A-Z]+(?<!SCENE):(?:[\s\S](?!\n[A-Z]+:))+.)", text )
https://regex101.com/r/NwdUcR/2
EDIT: I realize the above may not be exactly what you're looking for. If you're looking to match just the letters before the colon you can use this:
findall( r"([A-Z]+(?<!SCENE)):", text )
I use
findall (r"\n(?!SCENE)(.+?):")
which works. The point is I did not realize that I can use parenthesis to select what I would like to display in the result.
You don't really need regex in this case.
Here is a solution using plain and simple str.split().
s = 'SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.'
matches = [m.split('\n')[-1] for m in s.split(':') if 'SCENE' not in m]
>>> matches
['AQW', 'CER', 'DYU', 'EOI', '.']
If you want to exclude the last '.', you can use matches = [m.split('\n')[-1] for m in s.split(':') if (('SCENE' not in m) and (m[-1] != '.'))]
or simply matches = matches[:-1]

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
You need to use re.search since re.match tries to match from the beging of the string. To match until a space or period is encountered.
re.search(r'(?<=test :)[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will retrun match this.. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash aleady pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, it is awkward to design, and difficult to debug. There are definitely occassions to use it, but if you just want to extract the text between test: and ., then I don't think is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).

Removing variable length characters from a string in python

I have strings that are of the form below:
<p>The is a string.</p>
<em>This is another string.</em>
They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().
Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.
I'd like to do this in one line. What I mean is I want to pass as a parameter something of the form <*> like I would on the command line. I was thinking of using the replace() function to try to do this, but I am not sure how the replace() function parameter would look like.
For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:
x = x.replace("<..>", "")
Unfortunately, str.replace does not support Regex patterns. You need to use re.sub for this:
>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'
>>>
[^>]* matches zero or more characters that are not >.
No Need for a 2-Step Solution
You don't need to 1. Split then 2. Replace. The two solutions below show you how to do it with one single step.
Option 1: Match All Instead of Splitting
Match All and Split are Two Sides of the Same Coin, and in this case it is safer to match all:
<[^>]+>|(\w+)
The words will be in Group 1.
Use it like this:
subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
matches = [group for group in re.findall(regex, subject) if group]
print(matches)
Output
['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']
Discussion
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Option 2: One Single Split
<[^>]+>|[ .]
On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.
Output
This
is
a
string

Print the line between specific pattern

I want to print the lines between specific string, my string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string i want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can i do this?
I tried to find the lines between two ##start and search for main.
Try something like:
def get_mains(my_string):
section = ''
for line in my_string.split('\n'):
if line[0:7] == "##start":
section = line
continue
if 'main' in line:
yield '/'.join([section, line])
for main in get_mains(my_string):
print main
There is a way to do this with Python's Regular Expressions Parser called regex for short.
Basically, regex is this whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it would match the regex pattern 'llo, Wor', because it contains an ell followed by an ell followed by an o followed by a comma and a space and a capital double-you and so on. On the surface it just looks like a substring test. The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special character that stands for any letter in the alphabet (plus a few extras). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these 'special characters' and they are all very useful. To actually answer your question,
I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double backslash newlines, see the linked webpage). Then, everything in parenthesis becomes a group. Groups are peices of texts that we want to be able to recall later. There are two groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*, which means 'match any thing that is a complete line, and match any number of them'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
string = match[0][:-1] # don't want the trailing \n
string += '/'
string += match[1]
print string
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
So Regex is pretty cool and other languages have it too. It is a very powerful tool, and you should learn how to use it here.
Also just a side note, I use the .format function, because I think it looks much cleaner and easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') just becomes evaluated to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world

python regular expression replace

I'm trying to change a string that contains substrings such as
the</span></p>
<p><span class=font7>currency
to
the currency
At the line break is CRLF
The words before and after the code change. I only want to replace if the second word starts with a lower case letter. The only thing that changes in the code is the digit after 'font'
I tried:
p = re.compile('</span></p>\r\n<p><span class=font\d>([a-z])')
res = p.sub(' \1', data)
but this isn't working
How should I fix this?
Use a lookahead assertion.
p = re.compile('</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res = p.sub(' ', data)
I think you should use the flag re.DOTALL, which means it will "see" nonprintable characters, such as linebreaks, as if they were regular characters.
So, first line of your code would become :
p = re.compile('</span></p>..<p><span class=font\d>([a-z])', re.DOTALL)
(not the two unescaped dots instead of the linebreak).
Actually, there is also re.MULTILINE, everytime I have a problem like this one of those end up solving the problem.
Hope it helps.
This :
result = re.sub("(?si)(.*?)</?[A-Z][A-Z0-9]*[^>]*>.*</?[A-Z][A-Z0-9]*[^>]*>(.*)", r"\1 \2", subject)
Applied to :
the</span></p>
<p><span class=font7>currency
Produces :
the currency
Although I would strongly suggest against using regex with xml/html/xhtml. THis generic regex will remove all elements and capture any text before / after to groups 1,2.

Categories