Check strings in a for loop for multiple regexs - python

I'm tracing log files for someone and they are a complete mess (no line-breaks and separators). So I did some easy Regex to make the logs tidy. The logging #codes# are now nicely separated in a list and their string attached to it in a sub-dict. It's like this:
Dict [
0 : [LOGCODE_53 : 'The string etc etc']
]
As this was kind of easy I was purposing to directly add some log-recognition to it too. Now I can match the LOGCODE, but the problem is that the codes aren't complaint to anything and often different LOGCODE's contain the same output-strings.
So I wrote a few REGEX matches to detect what the log is about. My question now is; what is wisdom to detect a big variety of string patterns? There might be around 110 different types of strings and they are so different that it's not possible to "super-match" them. How can I run ~110 REGEXes over a string to find out the string's intend and thus index them in a logical register.
So kind of like; "take this $STRING and test all the $REGEXes in this $LIST and let me know which $REGEX(es) (indexes) matches the string".
My code:
import re
# Open, Read-out and close; Log file
f = open('000000df.log', "rb")
text = f.read()
f.close()
matches = re.findall(r'00([a-zA-Z0-9]{2})::((?:(?!00[a-zA-Z0-9]{2}::).)+)', text)
print 'Matches: ' + str(len(matches))
print '=========================================================================================='
for match in matches:
submatching = re.findall(r'(.*?)\'s (.*?) connected (.*?) with ZZZ device (.*?)\.', match[1])
print match[0] + ' >>> ' + match[1]
print match[0] + ' >>> ' + submatching[0][0] + ', ' + submatching[0][1] + ',',
print submatching[0][2] + ', ' + submatching[0][3]

re.match, re.search and re.findall return None if a particular regex doesn't match, so you could just iterate over your possible regular expressions and test them:
tests = [
re.compile(r'...'),
re.compile(r'...'),
re.compile(r'...'),
re.compile(r'...')
]
for test in tests:
matches = test.findall(your_string):
if matches:
print test, 'works'

Related

Regex to exclude words followed by space

I tried a lot of solutions but can't get this Regex to work.
The string-
"Flow Control None"
I want to exclude "Flow Control" plus the blank space, and only return whatever is on the right.
Since you have tagged your question with #python and #regex, I'll outline a simple solution to your problem using these tools. Furthermore, the other two answers don't really tackle the exact problem of matching "whatever is on the right" of your "Flow Control " prefix.
First, start by importing the re builtin module (read the docs).
import re
Define the pattern you want to match. Here, we're matching "whatever is on the right" ((?P<suffix>.+)$) of ^Flow Control .
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
Grab the match for a given string (e.g. "Flow Control None")
suffix = pattern.search("Flow Control None").group("suffix")
print(suffix) # Out: None
Hopefully, this complete working example will also help you
import re
def get_suffix(text: str):
pattern = re.compile(r"^Flow Control (?P<suffix>.+)$")
matches = pattern.search(text)
return matches.group("suffix") if matches else None
examples = [
"Flow Control None",
"Flow Control None None",
"Flow Control None",
"Flow Control ",
]
for example in examples:
suffix = get_suffix(text=example)
if suffix:
print(f"Matched: {repr(suffix)}")
else:
print(f"No matches for: {repr(example)}")
Use split like so:
my_str = 'Flow Control None'
out_str = my_str.split()[-1]
# 'None'
Or use re.findall:
import re
out_str = re.findall(r'^.*\s(\S+)$', my_str)[0]
If you really want a purely regex solution try this: (?<= )[a-zA-Z]*$
The (?<= ) matches a single ' ' but doesn't include it in the match. [a-zA-Z]* matches anything from a to z or A to Z any number of times. $ matches the end of the line.
You could also try replacing the * with a + if you want to ensure that your match has at least one letter (* will produce a 0-length match if your string ends in a space, + will match nothing).
But it may be clearer to do something like
data = "Flow Control None"
split = data.split(' ')
split[len(split) - 1] # returns "None"
EDIT data.split(' ')[-1] also returns "None"
or
data[data.rfind(' ') + 1:] # returns "None"
that don't involve regexes at all.

Python Regular Expression issue r'\/([^]]*)\ '

import re
path = "xyz pqr /all_it_abc type = cell"
s = re.findall(r'\/([^]*)\ ', path)
is giving me error "return _compile(pattern, flags).findall(string"
s= re.findall(r'\/([^]]*)\ ', path)
is giving me output as ['all_it_abc type =']
why? I have to use "]]"
print(s)
Desired output is all_it_abc
Thanks in advance to all and stackoverflow
The regex your are trying to find is is capturing a group [^]] This represents any character except ']'. That's why you are getting undesired output.
Your expected output can be obtained by using this simple regex instead.
s = re.findall(r'\/([a-z_]*)', path)
where s gives you this list as output
['all_it_abc']
Explanation - it captures everything after the slash that has an underscore or alphabet in it. You can add more characters to include them in the captured group.
r'\/([a-z_]*)'
In regex the pattern
pattern = r'[^abcd]'
means that this position can be matched by anything BUT 'a', 'b' ,'c' or 'd'.
pattern = r'[^]'
is an incomplete pattern that could match anything BUT ']' - if you complete it to a valid r[^]] pattern by closing the bracket.
You need r'/([^ ]+):
import re
path = "xyz pqr /all_it_abc type = cell"
s = re.findall(r'/([^ ]+)', path) # find all after / thats not a space
print(type(s), s) # this is a list of all matches
To get a list of all matches:
(<type 'list'>, ['all_it_abc'])

Python - remove parts of a string

I have many fill-in-the-blank sentences in strings,
e.g. "6d) We took no [pains] to hide it ."
How can I efficiently parse this string (in Python) to be
"We took no to hide it"?
I also would like to be able to store the word in brackets (e.g. "pains") in a list for use later. I think the regex module could be better than Python string operations like split().
This will give you all the words inside the brackets.
import re
s="6d) We took no [pains] to hide it ."
matches = re.findall('\[(.*?)\]', s)
Then you can run this to remove all bracketed words.
re.sub('\[(.*?)\]', '', s)
just for fun (to do the gather and substitution in one iteration)
matches = []
def subber(m):
matches.append(m.groups()[0])
return ""
new_text = re.sub("\[(.*?)\]",subber,s)
print new_text
print matches
import re
s = 'this is [test] string'
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print m.group(1)
Output
'test'
For your example you could use this regex:
(.*\))(.+)\[(.+)\](.+)
You will get four groups that you can use to create your resulting string and save the 3. group for later use:
6d)
We took no
pains
to hide it .
I used .+ here because I don't know if your strings always look like your example. You can change the .+ to alphanumeric or sth. more special to your case.
import re
s = '6d) We took no [pains] to hide it .'
m = re.search(r"(.*\))(.+)\[(.+)\](.+)", s)
print(m.group(2) + m.group(4)) # "We took no to hide it ."
print(m.group(3)) # pains
import re
m = re.search(".*\) (.*)\[.*\] (.*)","6d) We took no [pains] to hide it .")
if m:
g = m.groups()
print g[0] + g[1]
Output :
We took no to hide it .

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a Substring of a string within two characters to know the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
pat=re.compile('"([^"]*)"')
while True:
mat=pat.search(inputString)
if mat is None:
break
strings.append(mat.group(1))
inputString=inputString[mat.end():]
print strings
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print strings
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print fields[1], fields[3], fields[5]
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print input_string_split
[ 'type="NN"', 'span="123..145"', 'confidence="1.0"' ]
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
match_obj = search(pattern, substring)
print match_obj.group(1)
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.

Python Regular expression must strip whitespace except between quotes

I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This will match anything between quotes, but now it needs to ignore that match and add matches for whitespace..
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
lst = text.split('"')
for i, item in enumerate(lst):
if not i % 2:
lst[i] = re.sub("\s+", "", item)
return '"'.join(lst)
print stripwhite('This is a string with some "text in quotes."')
Here is a one-liner version, based on #kindall's idea - yet it does not use regex at all! First split on ", then split() every other item and re-join them, that takes care of whitespaces:
stripWS = lambda txt:'"'.join( it if i%2 else ''.join(it.split())
for i,it in enumerate(txt.split('"')) )
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
print " ".join(shlex.split('Hello "world this is" a test'))
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
if m.group(1):
return ""
else:
return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here little longish version with check for quote without pair. Only deals with one style of start and end string (adaptable for example for example start,end='()')
start, end = '"', '"'
for test in ('Hello "world this is" atest',
'This is a string with some " text inside in quotes."',
'This is without quote.',
'This is sentence with bad "quote'):
result = ''
while start in test :
clean, _, test = test.partition(start)
clean = clean.replace(' ','') + start
inside, tag, test = test.partition(end)
if not tag:
raise SyntaxError, 'Missing end quote %s' % end
else:
clean += inside + tag # inside not removing of white space
result += clean
result += test.replace(' ','')
print result

Categories