I have a problem with case-insensitive searching using a regular expression. Here is part of the code that I wrote:
engType = 'XM665'
The value of engType was extracted from other files. Based on engType, I want to find the lines in another text file that contain it and extract the description information from each such line. The description is the part between the engType string and 'Serial'.
For instance:
lines = ['xxxxxxxxxxx','mmmmmmmmmmm','jjjjj','xM665 Module 01 Serial (10-11)']
pat = re.compile(engType+'(.*?)[Ss][Ee][Rr][Ii][Aa][Ll]')
for line in lines:
    des = pat.search(line).strip()
    if des:
        break
print(des.group(1).strip())
I know this will raise an error, because the case of my engType string differs from the case in 'xM665 Module 01 Serial (10-11)'. I understand that I can use [Ss] for case-insensitive comparison, as I did in the last part of pat. However, since engType is a variable, I cannot apply that to it. I know I could search in lowercase like this:
lines = ['xxxxxxxxxxx','mmmmmmmmmmm','jjjjj','xM665 Module 01 Serial (10-11)']
pat = re.compile(engType.lower()+'(.*?)serial')
for line in lines:
    des = pat.search(line.lower()).strip()
    if des:
        break
print(des.group(1).strip())
result:
module 01
The case is now different from the original Module 01. If I want to keep the original case, how can I do this? Thank you!
re.IGNORECASE is the flag you're looking for.
pat = re.compile(engType+'(.*?)[Ss][Ee][Rr][Ii][Aa][Ll]',re.IGNORECASE)
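Or, more simply, re.compile(engType+'(.*?)serial', re.IGNORECASE).
Also, there is a bug in this line:
des = pat.search(line.lower()).strip()
Remove the .strip(): pat.search() returns either a match object or None, and neither has a .strip() method, so this line raises an AttributeError. With re.IGNORECASE you can also drop the .lower() and search the original line, which preserves the case of the captured text.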
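A minimal sketch of the whole loop with the flag applied, using the sample data from the question. The re.escape call is an addition that is not in the original code; it is there on the assumption that engType might contain regex metacharacters.

```python
import re

engType = 'XM665'
lines = ['xxxxxxxxxxx', 'mmmmmmmmmmm', 'jjjjj', 'xM665 Module 01 Serial (10-11)']

# re.escape is an assumption added here: it keeps any special characters
# in engType from being interpreted as regex syntax.
pat = re.compile(re.escape(engType) + r'(.*?)serial', re.IGNORECASE)

description = None
for line in lines:
    match = pat.search(line)  # search the original line, not line.lower()
    if match:
        description = match.group(1).strip()
        break

print(description)  # Module 01
```

Because the original line is searched, the captured text keeps its original case.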
Check out re.IGNORECASE in http://docs.python.org/3/library/re.html
I believe it'll look like:
pat = re.compile(engType.lower()+'(.*?)serial', re.IGNORECASE)
I have a set of lines in a file whose fields are separated by semicolons, like this:
8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;
What I want is the whole message up to and including 10=000;, plus the value of 7202, which would be asdf:asdf.
I have this:
(^.*000;)
which, according to regex101.com, should get me the whole line up to 10=000;. Which is great. But if I do this:
(^.*000;)(7202=.*;)
then, according to regex101.com, I won't match anything.
I don't know why adding that second group invalidates the whole expression.
Any help on this would be great.
Thanks
Answer for first version of question
"I am trying to use regex with python to lift out my data from 7202=, so I want to get the asdf:asdf."
If I understand correctly, your goal is to find the data that is between 7202= and ;. In that case:
>>> import re
>>> line = "8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;"
>>> re.search('7202=([^;]*);', line).group(1)
'asdf:asdf'
The regex is 7202=([^;]*);. This matches:
The literal string 7202=
Any characters that follow, up to but excluding the first semicolon: ([^;]*). Because this is in parentheses, it is captured as group 1.
The literal character ;
Answer for second version of question
"What I want is the whole message up until 10=000; and the value of 7202 which would be asdf:asdf."
>>> import re
>>> line = "8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;"
>>> r = re.search('.*7202=([^;]*);.*10=000;', line)
>>> r.group(0), r.group(1)
('8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;', 'asdf:asdf')
The regex is .*7202=([^;]*);.*10=000;. This matches:
Anything up to and including 7202=: .*7202=
Any characters that follow, up to but excluding the first semicolon: ([^;]*). Because this is in parentheses, it is captured as group 1.
Any characters that follow starting with ; and ending with 10=000;: ;.*10=000;
The value of the whole match string is available as r.group(0). The value of group 1 is available as r.group(1). Thus the single match object r lets us get both strings.
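As for why the pattern from the question matches nothing, here is a small sketch using the same sample line. The variable names are illustrative only. The second pattern is a variation on the one above that also captures the message as a group of its own.

```python
import re

line = ("8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;"
        "52=20130624-04:10:00.843;43=Y;98=0;10=000;"
        "Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;")

# Pattern from the question: group 1 has to end at "000;" and "7202=" has
# to start immediately after it, but in the data "7202=" comes earlier.
failing = re.search(r'(^.*000;)(7202=.*;)', line)

# Variation on the answer: the outer group is the message, the inner group
# is the value of field 7202.
working = re.search(r'^(.*7202=([^;]*);.*10=000;)', line)
message, value = working.group(1), working.group(2)

print(failing)  # None
print(value)    # asdf:asdf
```

Groups in a regex are matched in order, one after the other, so a second group can only match text that comes after the text consumed by the first.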
I am programming a parser for an old dictionary and I'm trying to find a pattern like re.findall("{.*}", string) in a string.
A control print after the check shows that only a few strings match, although all strings contain a pattern like {...}.
Even copying the string and matching it interactively in the IDLE shell gives a match, but inside the rest of the code it simply does not.
Is it possible that this problem is caused by the Python interpreter itself?
I cannot figure out any other cause...
Thanks for your help.
The code snippet looks like this:
for aParse in chunklist:
    aSigle = aParse[1]
    aParse = aParse[0]
    print("to be parsed", aParse)
    aContext = Context()
    aContext._init_("")
    aContext.ID = contextID
    aContext.source = aSigle
    # here, aParse is the string containing {Abriss}
    # which is part of a lexicon entry
    metamatches = re.findall("\{.*\}", aParse)
    print("metamatches: ", metamatches)
    for meta in metamatches:
        aMeta = meta.replace("{", "").replace("}", "")
        aMeta = aMeta.split()
        for elem in aMeta:
            ...
Try this:
entries = {0: "{.test1}", 1: "{.test1}", 2: "{.test1}", 3: "{.test1}"}
for value in entries.values():
    if "{" in value:
        value = value.replace("{", " ")
    print(value)
Or, if you want to remove both "{" and "}":
for value in entries.values():
    if "{" in value:
        value = value.strip('{}')
    print(value)
Try this
data=re.findall(r"\{([^\}]*)}",aParse,re.I|re.S)
So, in a really simplified scenario, a lexical entry looks like this:
"headword" {meta, meaning} context [reference for context].
So, I was chunking (split()) the entry at [...] with a regex. That works fine so far. Then, after separating the headword, I tried to find the meta/meaning with a regex that finds all patterns of the form {...}. Since that regex didn't work, I replaced it with this function:
def findMeta(self, string, alist):
    opened = 0
    closed = 0
    for char in enumerate(string):
        if char[1] == "{":
            opened = char[0]
        elif char[1] == "}":
            closed = char[0]
            meta = string[opened:closed+1]
            alist.append(meta)
            string.replace(meta, "")
Now it's effectively much faster and the meaning component is correctly analysed. The remaining question is: how reliable are the regexes that I use to find other information (e.g. orthographic variants, introduced by "s.}")? Should they work, or is it possible that the IDLE shell is simply not capable of parsing a 1000-line program correctly (and compiling all the regexes)? An example of a string whose meta should actually have been found is: " {stm.} {der abbruch thut, den armen das gebührende vorenthält} [Renn.]"
The algorithm finds the first one, saying this word is a noun, but the second, its translation, is not recognized.
... This is medieval German, sorry for that! Thank you for all your help.
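One possible explanation for the behaviour described in this thread, offered as a guess since the source does not confirm it, is greediness. `\{.*\}` runs from the first `{` to the last `}` on a line, so an entry with two braced parts yields a single combined match. A minimal sketch on the example entry quoted above:

```python
import re

entry = " {stm.} {der abbruch thut, den armen das gebührende vorenthält} [Renn.]"

# Greedy: .* runs to the LAST closing brace, so both parts come back as one match.
greedy = re.findall(r"\{.*\}", entry)

# Lazy: .*? stops at the first closing brace, giving one match per {...}.
lazy = re.findall(r"\{(.*?)\}", entry)

print(len(greedy))  # 1
print(lazy)         # ['stm.', 'der abbruch thut, den armen das gebührende vorenthält']
```

The character-class form `[^\}]*` suggested earlier has the same effect as the lazy quantifier, and unlike `.` it also matches across line breaks.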
Assume I have a string which includes some data fields that are separated by "|", like
|1|2|3|4|5|6|7|8|
My purpose is to get the 8th field. This is what I'm doing:
pattern = re.compile(r'^\s+(\|.*?\|){8}')
match = pattern.match(test_line)
if match:
    print(match.group(8))
But it looks like it cannot match. I know that in this case I need to use ? for a non-greedy match, but why can't I get the 8th field?
Thanks
Regex might be complicating this problem rather than simplifying it. A simple way to get an eighth item from a | delimited string is using split():
a = '|here|is|some|data|separated|by|bars|hooray!|'
print(a.split('|')[8])
RETURNS
hooray!
Using regex, one way to get it would be:
import re
a = '|here|is|some|data|separated|by|bars|hooray!|'
pattern = re.compile(r'([^\|]+)')
match = pattern.findall(a)
print(match[7])
RETURNS
hooray!
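As for why the pattern in the question cannot work: `^\s+` requires at least one whitespace character before the first bar, each repetition of `(\|.*?\|)` consumes two bars, and the pattern contains only one group, so group(8) does not exist. A repeated group keeps only its last repetition. Below is a sketch of a single-pattern alternative, assuming the line has the same shape as the sample in the question:

```python
import re

test_line = '|1|2|3|4|5|6|7|8|'

# Skip the leading bar, then seven "field|" chunks without capturing them,
# then capture the eighth field up to its closing bar.
pattern = re.compile(r'^\s*\|(?:[^|]*\|){7}([^|]*)\|')
match = pattern.match(test_line)
eighth = match.group(1) if match else None

print(eighth)  # 8
```

The non-capturing group `(?:...)` keeps the skipped fields out of the result, so the wanted field is always group 1.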
I have text similar to the following:
==Mainsection1==
Some text here
===Subsection1.1===
Other text here
==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here.
In the above text, Mainsection1 and Mainsection2 have different names, which can be anything the user wants. The same goes for the subsections.
What I want to do with a regex is get the text of a main section including its subsections (if there are any).
Yes, this is from a wiki page. All main section names start with == and end with ==.
All subsections have more than two = signs in their name.
regex =re.compile('==(.*)==([^=]*)', re.MULTILINE)
regex.findall(text)
But the above returns each section separately.
That is, it returns a main section perfectly, but treats a subsection as a section of its own.
I hope someone can help me with this, as it's been bugging me for some time.
Edit:
The result should be:
[('Mainsection1', 'Some text here\n===Subsection1.1===
Other text here\n'), ('Mainsection2', 'Text goes here\n===Subsecttion2.1===
Other text goes here.\n')]
Edit 2:
I have rewritten my code to not use a regex. I came to the conclusion that it's easy enough to just parse it myself. Which makes it a bit more readable for me.
So here is my code:
def createTokensFromText(text):
    sections = []
    cur_section = None
    cur_lines = []
    for line in text.split('\n'):
        line = line.strip()
        if line.startswith('==') and not line.startswith('==='):
            if cur_section:
                sections.append( (cur_section, '\n'.join(cur_lines)) )
                cur_lines = []
            cur_section = line
            continue
        if cur_section:
            cur_lines.append(line)
    if cur_section:
        sections.append( (cur_section, '\n'.join(cur_lines)) )
    return sections
Thanks everyone for the help!
All the answers provided have helped me a lot!
First, it should be known, I know a little about Python, but I have never programmed formally in it... Codepad said this works, so here goes! :D -- Sorry the expression is so complex:
(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
This does what you asked for, I believe! On Codepad, this code:
import re
wikiText = """==Mainsection1==
Some text here
===Subsection1.1===
Other text here
==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here. """
outputArray = re.findall(r'(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))', wikiText)
print(outputArray)
Produces this result:
[('Mainsection1', '\nSome text here\n===Subsection1.1===\nOther text here\n\n'), ('Mainsection2', '\nText goes here\n===Subsecttion2.1===\nOther text goes here. ')]
EDIT: Broken down, the expression essentially says:
01 (?<!=) # First, look behind to assert that there is not an equals sign
02 == # Match two equals signs
03 ([^=]+) # Capture one or more characters that are not an equals sign
04 == # Match two equals signs
05 (?!=) # Then verify that there are no equals signs following this
06 ( # Start a capturing group
07 [\s\S]*? # Match zero or more of ANY character (even CrLf), but BE LAZY
08 (?= # Look ahead to verify that either...
09 $ # this is the end of the source text
10 | # -OR-
11 (?<!=) # when I look behind there is no equals sign
12 == # then there are two equals signs
13 [^=]+ # then one or more characters that are not equals signs
14 == # then two equals signs
15 (?!=) # then verify that there are no equals signs following this
16 ) # End look-ahead group
17 ) # End capturing group
Line 03 and Line 06 specify the capturing groups for the Main Section Title and the Main Section Content, respectively.
Line 07 begs for a lot of explanation if you're not pretty fluent in Regex...
The \s and \S inside a character class [] will match anything that is whitespace or is not whitespace (i.e. ANYTHING WHATSOEVER) - one alternative to this is using the . operator, but depending upon your compiler options (or ability to specify options) this might or might not match CrLf (or Carriage-Return/Line-Feed). Since you want to match multiple lines, this is the easiest way to ensure a match.
The *? at the end means that it will match zero or more instances of the "anything" character class, but BE LAZY ABOUT IT. "Lazy" quantifiers (sometimes called "reluctant") are the opposite of the default "greedy" quantifier (without the ? following it), and will not consume a source character unless the source that follows it cannot be matched by the part of the expression that follows the lazy quantifier. In other words, this will consume any characters until it finds either the end of the source text OR another main section, which is specified by exactly two equals signs on either side of one or more characters that are not an equals sign (including whitespace). Without the lazy operator, it would try to consume the entire source text, then "backtrack" until it could match one of the things after it in the expression (end of source or a section header).
Line 08 is a "look-ahead" that specifies that the expression following should be ABLE to be matched, but should not be consumed.
END EDIT
AFAIK, it has to be this complex in order to properly exclude the subsections... If you want to match the Section Name and Section Content into named groups, you can try this:
(?<!=)==(?P<SectionName>[^=]+)==(?!=)(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
If you'd like, I can break it down for you! Just ask! EDIT (see edit above) END EDIT
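A short sketch of how the named-group version might be used from Python, with re.finditer and a dict keyed by section name. The sample text is the one from the question; the names wiki_text, section_re and sections are illustrative only.

```python
import re

wiki_text = """==Mainsection1==
Some text here
===Subsection1.1===
Other text here
==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here."""

section_re = re.compile(
    r'(?<!=)==(?P<SectionName>[^=]+)==(?!=)'
    r'(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))'
)

# Map each main section title to its content (subsections included),
# with surrounding whitespace trimmed.
sections = {m.group('SectionName'): m.group('SectionContent').strip()
            for m in section_re.finditer(wiki_text)}

for name, content in sections.items():
    print(name, repr(content))
```

Named groups make the calling code independent of the group order, which helps if the expression is later extended.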
The problem here is that ==(.*)== also matches ===Subsection===, with the group capturing =Subsection=, so the first thing to do is to make sure there is no = inside the title: ==([^=]*)==([^=]*).
We then need to make sure that there is no = before the beginning of the match; otherwise the first = of the three is skipped and the subtitle is matched. This will do the trick: (?<!=)==([^=]*)==([^=]*). The (?<!=) is a negative lookbehind, meaning "match only if not preceded by =".
We can also do this at the end to make sure, which gives as a final result (?<!=)==([^=]*)==(?!=)([^=]*).
>>> re.findall('(?<!=)==([^=]*)==(?!=)([^=]*)', x,re.MULTILINE)
[('Mainsection1', '\nSome text here\n'),
('Mainsection2', '\nText goes here\n')]
You could also remove the check at the end of the title and replace it with a newline. That may be better if you are sure there is a new line at the end of each title.
>>> re.findall('(?<!=)==([^=]*)==\n([^=]*)', x,re.MULTILINE)
[('Mainsection1', 'Some text here\n'), ('Mainsection2', 'Text goes here\n')]
EDIT :
section = re.compile(r"(?<!=)==([^=]*)==(?!=)")
result = []
mo = section.search(x)
previous_end = 0
previous_section = None
while mo is not None:
    start = mo.start()
    if previous_section:
        result.append((previous_section, x[previous_end:start]))
    previous_section = mo.group(0)
    previous_end = mo.end()
    mo = section.search(x, previous_end)
result.append((previous_section, x[previous_end:]))
print(result)
It's simpler than it looks: we repeatedly search for a section title after the previous one, and add the previous title to the result together with the text between the end of the previous title and the beginning of this one. Adjust it to suit your style and your needs. The result is:
[('==Mainsection1==',
' \nSome text here \n===Subsection1.1=== \nOther text here \n\n'),
('==Mainsection2==',
' \nText goes here \n===Subsecttion2.1=== \nOther text goes here. ')]
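Another way to express the same idea, offered as a sketch and not as part of the answer above, is re.split with a capturing group. When the pattern contains a group, the captured titles are kept in the result list, alternating with the text between them. This assumes every main heading sits alone on its own line.

```python
import re

text = ("==Mainsection1==\n"
        "Some text here\n"
        "===Subsection1.1===\n"
        "Other text here\n"
        "==Mainsection2==\n"
        "Text goes here\n"
        "===Subsecttion2.1===\n"
        "Other text goes here.\n")

# A main heading: exactly two '=' at the start of a line, a title without
# '=', then two '=' at the end of the line. '===...' lines fail because
# the title must start with a non-'=' character.
heading = re.compile(r'^==([^=\n]+)==[ \t]*$', re.MULTILINE)

# parts is [text before first heading, title1, body1, title2, body2, ...]
parts = heading.split(text)
sections = [(title, body.lstrip('\n'))
            for title, body in zip(parts[1::2], parts[2::2])]

print(sections)
```

With the sample text this yields the list of (title, body) tuples requested in the question, with each subsection kept inside the body of its main section.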
I want to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and that the list of codes ends with ';'.
Afterward, I would like to have a regex parse out the codes. I tried the following Python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ;, and then use a regular expression that returns all matches, not just a single match, via re.findall('c?[0-9]+', text).
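A minimal sketch of that two-step approach, using the sample string from the question. The exact validation pattern in step 1 is an assumption about how strict the check should be; it requires comma-separated codes between start: and the first ;.

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# Step 1: validate the overall shape and isolate the comma-separated codes.
outer = re.match(r'start:\s*(c?[0-9]+(?:\s*,\s*c?[0-9]+)*)\s*;', text)

# Step 2: pull every code out of the isolated part.
codes = re.findall(r'c?[0-9]+', outer.group(1)) if outer else []

print(codes)  # ['c12354', 'c3456', '34526']
```

If the string does not have the expected shape, outer is None and codes stays an empty list.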
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # True if s starts with this string
s.endswith(";") # True if s ends with this string
s[len("start:"):-1].strip().split(', ') # list of tokens separated by ", ": ['c12354', 'c3456', '34526']
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string

# One code: optional "c" prefix, digits, optional trailing comma
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")

# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # Drop "start:" and ";", keep the digits of each code
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']