Regex to match one optional space works only half the time - Python

I am trying to write a Python 3.6.0 script to find elements in a page. It extracts the line after words that appear in two formats: "Element:" or "Element :" (with a space before the ":").
So I tried to use regular expressions. It only works half the time, and I could not figure out what is wrong in my code. Here is the code with an example:
import re

TestString = r"""Some text
Year: 2015.12.10
Some other text
"""
ListOfTags = ["Year(?= ?):", "Year(?=\s?):", "Year(?= *):"]
for i in range(0, len(ListOfTags)):
    try:
        TagsFound = re.search(str(ListOfTags[i]) + '(.+?)\n', TestString).group(1)
        print(TransformString('"' + ListOfTags[i] + '"') + " returns: " + TagsFound)
    except AttributeError:
        # TestString not found in the original string (or something else ???)
        TagsFound = ''
        print("No tag found..")
(With this code, I could test several expressions at a time.)
Here, when the line is "Year: 2015.12.10", all the regular expressions work and return " 2015.12.10".
But they don't work when it is "Year :" (with a space before the ":")...
I also tried the expressions "Year( ?):", "Year(\s?):", "Year( *):", and "Year( |:?)( |:?)", but they did not work.
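For what it's worth, the lookaheads are the culprit: (?= ?) is zero-width, so even when it sees the space it does not consume it, and the literal ":" still has to follow "Year" directly. A minimal sketch of a pattern that handles both forms by matching the optional whitespace directly:

```python
import re

# "(?= ?)" never consumes the space, so "Year :" cannot match.
# Matching the optional whitespace directly (no lookahead) fixes this:
for line in ("Year: 2015.12.10\n", "Year : 2015.12.10\n"):
    m = re.search(r"Year\s*:(.+?)\n", line)
    print(m.group(1))  # " 2015.12.10" both times
```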

I think regular expressions may be overkill here (unless you have a good reason for using them). You could try processing your text line by line. For each line you could use the partition method on the str to split it at the first colon found.
for line in TestString.splitlines():
    if ':' in line:
        tag, __, value = line.partition(':')
        # Now see if this is a tag you care about and do something with the value
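For illustration, here is what partition returns on one of the tag lines from the question:

```python
# str.partition splits at the first colon and keeps everything else intact:
line = "Year : 2015.12.10"
tag, sep, value = line.partition(':')
print(tag.strip(), '->', value.strip())  # Year -> 2015.12.10
```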


regex in python - how to understand this IP label without parentheses

I have this code to check if a string is a valid IPv4 address:
import re

def is_ip4(IP):
    label = "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
    pattern = re.compile("(" + label + r"\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
It works fine, but if I remove the parentheses from the label, like this:
import re

def is_ip4(IP):
    label = "[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]"
    pattern = re.compile("(" + label + r"\.){3}" + label + "$")
    if pattern.match(IP):
        print("matched!")
    else:
        print("No!")
it shows a valid IP for "2090.1.11.0" and "20.1.11.0", but not for "2.1.11.0". I'm actually a bit confused about the cases with vs. without parentheses. Can someone explain this for me? Thanks.
The reason you need the parentheses is the two-step process you're using. By themselves, the parentheses don't do anything (other than capturing a group). But you're also doing this:
pattern = re.compile("(" + label + "\.){3}" + label + "$")
The label regex is copied in twice: first for three repetitions, each followed by a period. That copy is fine (almost), because in the statement it is enclosed in parentheses once more. However, the second copy is outside any parentheses, so you end up with a regex like this (simplified):
pattern == '(a|ab|abc\.){3}a|ab|abc$'
This matches if either (a|ab|abc\.){3}a matches, or ab, or abc at the end of the string. With parentheses, it would be like:
pattern == '((a|ab|abc)\.){3}(a|ab|abc)$'
So, although the parentheses appear superfluous, they are not, for two reasons: they keep the period separate from the last option abc, and they keep the final choices together and apart from the first part.
However, you shouldn't be doing this in the first place. Just use:
from ipaddress import ip_address

def is_ip4(ip):
    try:
        ip_address(ip)
        return True
    except ValueError:
        return False
No installation required; it's in the standard library.
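Note that ip_address also accepts IPv6 strings, so the function above returns True for those too. If that matters, a stricter variant (my suggestion, not part of the original answer) can use IPv4Address instead:

```python
from ipaddress import IPv4Address

def is_ip4(ip):
    # IPv4Address raises ValueError for anything that is not a
    # well-formed dotted-quad IPv4 address (including IPv6 strings):
    try:
        IPv4Address(ip)
        return True
    except ValueError:
        return False

print(is_ip4('2.1.11.0'))     # True
print(is_ip4('2090.1.11.0'))  # False
print(is_ip4('::1'))          # False
```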
The reason you get a match for '2090.1.11.0' is that matching it against this:
'([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$'
Comes down to matching it to this:
'([0-9]){3}[0-9]'
since [0-9] is the first option of the 'or' expression in the parentheses (repeated three times), and the second [0-9] is the first option of the 'or' expression after the {3}.
Note that the $ you put in to ensure the entire string was matched is lumped in with the final 'or' option, so it doesn't do anything here.
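A tiny demonstration of that precedence rule (my own example, not from the question): | binds looser than anything else, so x|y$ means "x, or (y at end of string)", not "(x or y) at end of string".

```python
import re

# "a|b$" means: "a" anywhere at the match position, OR "b" at end of string.
# The "$" does not apply to the "a" branch:
print(bool(re.match(r'a|b$', 'axxx')))    # True  -- "a" alone matched
print(bool(re.match(r'(a|b)$', 'axxx')))  # False -- "$" now applies to both
```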
Try running the below and note the identical first match:
import re
print(re.findall('([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]\\.){3}[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]$', '2090.1.11.0'))
print(re.findall('([0-9]){3}[0-9]', '2090.1.11.0'))
(ignore the second match on the first line, not as relevant)

Python re.sub() optimization

I have a Python list in which each string is one of the following four possible formats (of course the names would be different):
Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n
I want these to be corrected to:
Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n
Easy enough to do with 4 re.sub():
with open("path/to/file", 'r') as fileset:
    dataset = fileset.readlines()
dataset = [item.strip() for item in dataset]  # removes some misc. white noise
for item in dataset:
    item = re.sub(r'(.*):\W(.*);\W', r'\g<1>,\g<2>,', item)
    item = re.sub(r'(.*);\W(.*)', r'title,\g<1>,\g<2>', item)
    item = re.sub(r'(.*):\W(.*)', r'\g<1>,\g<2>,fname', item)
    item = re.sub(r'(.*)', r'title,\g<1>,fname', item)
While this is fine for the dataset I'm using, I want to be more efficient.
Is there a single operation that can simplify this process?
Please pardon me if I forgot a quote or some such; I'm not at my workstation right now, and I'm aware I've stripped the newline (\n).
Thank you,
Brief
Instead of running two loops, you can reduce it to just one. Adapted from How to iterate over the file in Python (and using the code in my Code section):
f = open("path/to/file", 'r')
while True:
    x = f.readline()
    if not x:
        break
    print(re.sub(r, repl, x))
See Python - How to use regexp on file, line by line, in Python for other alternatives.
Code
For viewing sake I've changed your file to an array.
See regex in use here
^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?
Note: You don't need all that in python, I do in order to show it on regex101, so your regex would actually just be ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?
Usage
See code in use here
import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))
Explanation
^ Assert position at the start of the line
(?:([^:]+):\W*)? Optionally match the following
([^:]+) Capture any character except : one or more times into capture group 1
: Match this literally
\W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
([^;]+) Capture any character except ; one or more times into capture group 2
(?:;\W*(.+))? Optionally match the following
; Match this literally
\W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
(.+) Capture any character one or more times into capture group 3
Given the above explanation of the regex, re.sub(r, repl, s) works as follows:
repl is a callback to the repl function which returns:
group 1 if it captured anything, title otherwise
group 2 (it's supposedly always set - using OP's logic here again)
group 3 if it captured anything, fname otherwise
IMHO, regexes are just too complex here; you can use classic string functions to split your item string into chunks. For that, you can use partition (or rpartition).
First, split your item string into "records", like this:
item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']
Then, you can create a short function to normalize each "record".
Here is an example:
def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)
This function is easier to understand than a collection of regexes, and in most cases it is faster.
For a better integration, you can define another function to handle each item:
def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"
Demo:
item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)
You get:
'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'

python regex - no match in script, although it should

I am programming a parser for an old dictionary, and I'm trying to find a pattern like re.findall("{.*}", string) in a string.
A control print after the check shows that only a few strings match, although all strings contain a pattern like {...}.
Even copying the string and matching it interactively in the IDLE shell gives a match, but inside the rest of the code, it simply does not.
Is it possible that this problem is caused by the actual Python interpreter?
I cannot figure out any other problem...
Thanks for your help.
The code snippet looks like this:
for aParse in chunklist:
    aSigle = aParse[1]
    aParse = aParse[0]
    print("to be parsed", aParse)
    aContext = Context()
    aContext._init_("")
    aContext.ID = contextID
    aContext.source = aSigle
    # here, aParse is the string containing {Abriss}
    # which is part of a lexicon entry
    metamatches = re.findall("\{.*\}", aParse)
    print("metamatches: ", metamatches)
    for meta in metamatches:
        aMeta = meta.replace("{", "").replace("}", "")
        aMeta = aMeta.split()
        for elem in aMeta:
            ...
Try this:
d = {0: "{.test1}", 1: "{.test1}", 2: "{.test1}", 3: "{.test1}"}  # don't name this "re": that would shadow the re module
for value in d.values():
    if "{" in value:
        value = value.replace("{", " ")
        print(value)
Or, if you want to remove both "{" and "}":
for value in d.values():
    if "{" in value:
        value = value.strip('{}')
        print(value)
Try this:
data = re.findall(r"\{([^\}]*)\}", aParse, re.I | re.S)
So, in a really simplified scenario, a lexical entry looks like this:
"headword" {meta, meaning} context [reference for context].
So, I was chunking (split()) the entry at [...] with a regex. That works fine so far. Then, after separating the headword, I tried to find the meta/meaning with a regex that finds all patterns of the form {...}. Since that regex didn't work, I replaced it with this function:
def findMeta(self, string, alist):
    opened = 0
    closed = 0
    for char in enumerate(string):
        if char[1] == "{":
            opened = char[0]
        elif char[1] == "}":
            closed = char[0]
            meta = string[opened:closed+1]
            alist.append(meta)
            string.replace(meta, "")
Now, it's effectively much faster, and the meaning component is correctly analysed. The remaining question is: how reliable are the regexes I use to find other information (e.g. orthographic variants, introduced by "s.}")? Should they work, or is it possible that the IDLE shell is simply not capable of parsing a 1000-line program correctly (and compiling all the regexes)? An example of a string whose meta should actually have been found is: " {stm.} {der abbruch thut, den armen das gebührende vorenthält} [Renn.]"
The algorithm finds the first, saying this word is a noun, but the second, its translation, is not recognized.
... This is medieval German, sorry for that! Thank you for all your help.
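One possible explanation for the "second group not recognized" behaviour, sketched on the example string above: a greedy {.*} runs from the first "{" to the last "}" and merges neighbouring brace groups into a single match, while a lazy quantifier (or a negated character class, as in the answer above) keeps them separate:

```python
import re

s = " {stm.} {der abbruch thut, den armen das gebührende vorenthält} [Renn.]"
# Greedy ".*" merges both brace groups into ONE match:
print(re.findall(r"\{.*\}", s))
# Lazy ".*?" yields the two groups separately:
print(re.findall(r"\{(.*?)\}", s))
```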

Pyparsing: Detect tokens with a specific ending

I wonder what I am doing wrong here. Maybe someone can give me a hint on this problem.
I want to detect certain tokens using pyparsing that terminate with the string _Init.
As an example, I have the following lines stored in text
one
two_Init
threeInit
four_foo_Init
five_foo_bar_Init
I want to extract the following lines:
two_Init
four_foo_Init
five_foo_bar_Init
Currently, I have reduced my problem to the following lines:
import pyparsing as pp

ident = pp.Word(pp.alphas, pp.alphanums + "_")
ident_init = pp.Combine(ident + pp.Literal("_Init"))
for detected, s, e in ident_init.scanString(text):
    print(detected)
Using this code, there are no results. If I remove the "_" in the Word statement, then I can at least detect the lines having a _Init at their ends. But the result isn't complete:
['two_Init']
['foo_Init']
['bar_Init']
Does anyone have an idea what I am doing completely wrong here?
The problem is that you want to accept '_' as long as it is not the '_' in the terminating '_Init'. Here are two pyparsing solutions, one is more "pure" pyparsing, the other just says the heck with it and uses an embedded regex.
samples = """\
one
two_Init
threeInit
four_foo_Init
six_seven_Init_eight_Init
five_foo_bar_Init"""

from pyparsing import Combine, OneOrMore, Word, alphas, alphanums, Literal, WordEnd, Regex

# implement explicit lookahead: allow '_' as part of your Combined OneOrMore,
# as long as it is not followed by "Init" and the end of the word
option1 = Combine(OneOrMore(Word(alphas, alphanums) |
                            '_' + ~(Literal("Init") + WordEnd()))
                  + "_Init")

# sometimes regular expressions and their implicit lookahead/backtracking do
# make things easier
option2 = Regex(r'\b[a-zA-Z_][a-zA-Z0-9_]*_Init\b')

for expr in (option1, option2):
    print('\n'.join(t[0] for t in expr.searchString(samples)))
    print()
Both options print:
two_Init
four_foo_Init
six_seven_Init_eight_Init
five_foo_bar_Init
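For completeness, a plain-re sketch of the same idea (slightly looser than option2, since \w* also allows a leading digit):

```python
import re

samples = "one\ntwo_Init\nthreeInit\nfour_foo_Init\nfive_foo_bar_Init"
# \w covers [a-zA-Z0-9_]; the literal "_Init" plus word boundaries does the
# same job as option2's lookahead-free regex:
print(re.findall(r'\b\w*_Init\b', samples))  # ['two_Init', 'four_foo_Init', 'five_foo_bar_Init']
```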

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print('matched.groups()', matched.groups())
I have tried different variations, but I can either get the first or the last code, not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ;, and then use a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
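Putting those two steps together, a minimal sketch might look like this:

```python
import re

s = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Step 1: isolate the part between "start:" and the first ";".
m = re.match(r"start:(.*?);", s)
if m:
    # Step 2: findall returns every code, not just the last capture:
    print(re.findall(r'c?[0-9]+', m.group(1)))  # ['c12354', 'c3456', '34526']
```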
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:")  # returns True if it starts with this string
s.endswith(";")         # returns True if it ends with this string
s[6:-1].split(', ')     # gives a list of tokens separated by the string ", "
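Putting those pieces together (a sketch; the .strip() handles the leading spaces the plain slice leaves behind):

```python
s = "start: c12354, c3456, 34526; other stuff"
if s.startswith("start:") and ";" in s:
    # Slice out everything between the prefix and the first semicolon,
    # then split on commas and trim whitespace from each token:
    body = s[len("start:"):s.index(";")]
    print([tok.strip() for tok in body.split(',')])  # ['c12354', 'c3456', '34526']
```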
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string

code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")

# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, it returns a list of codes (no need for string.split), and it ignores any extra characters in the line, just like your example.
import re

sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
