Regex python expression - python

I'm trying to read a binary file.
My objective is to find all matches of "10, 10, [any hex value, exactly once], [either EE or DD]".
Thought I could do it like this:
pattern = (b"\x10\x10\[0-9a-fA-F]?\[xDD|xEE]")
Clearly this isn't working. The error seems to come from the third part: I tried dissecting the pattern, and x10 and x11 work, but the rest just won't.
My understanding is that "[0-9a-fA-F]?" matches the range in the brackets 0 or 1 times, and that the last part means "xDD or xEE". Am I wrong?
Any ideas?

Use the regex
b'\x10\x10.[\xdd\xee]'
A single . matches any one byte exactly once (except the newline byte b'\n', unless the re.DOTALL flag is set), and a character class such as [ab] matches either a or b exactly once.
>>> re.match(b'\x10\x10.[\xdd\xee]', b'\x10\x10\x00\xee')
<_sre.SRE_Match object; span=(0, 4), match=b'\x10\x10\x00\xee'>
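Since the goal is to find every occurrence in a file rather than a single match at the start, a minimal sketch with findall might look like the following. The sample bytes are made up for illustration (in practice they would come from reading the file in 'rb' mode), and re.DOTALL is added on the assumption that the wildcard byte may itself be 0x0a.

```python
import re

# Made-up sample standing in for the file contents (an assumption for this sketch).
data = b"\x00\x10\x10\x0a\xdd\xff\x10\x10\x41\xee\x10\x10\x42\xcc"

# re.DOTALL lets "." match the newline byte 0x0a as well.
pattern = re.compile(rb"\x10\x10.[\xdd\xee]", re.DOTALL)

matches = pattern.findall(data)
print(matches)
```

Without re.DOTALL the first occurrence in this sample would be skipped, because its third byte is a newline.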

Related

How to create a regular expression that would find all pieces of text BETWEEN certain sets of characters?

I have a string that looks like 'E10 1/05/03 2/3211 3/AO Yuzhmor'.
The pieces that I need to extract are the ones following ' \d\/':
1) 05/03
2) 3211
3) AO Yuzhmor
My last idea was ' \d\/(.*?)(?=(( \d\/)|\Z))'
but it still wouldn't work properly on the last piece (the |\Z instruction doesn't seem to do anything).
I think you're close. This works for your example:
>>> s = 'E10 1/05/03 2/3211 3/AO Yuzhmor'
>>> re.findall(r'\s\d/(.*?)(?=\s\d/|$)', s)
['05/03', '3211', 'AO Yuzhmor']
Explanation:
Match on [space][digit]/, capturing everything that follows using a non-greedy quantifier, until the current position is immediately before either another [space][digit]/ (detected using a lookahead, matched but not consumed) or the end of the input. Use findall to return all matching instances in the input.
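One possible reason the original attempt looked broken: findall returns a tuple per match as soon as the pattern contains more than one capturing group, and the extra parentheses inside the lookahead count as groups. A small comparison, where switching the lookahead groups to non-capturing (?:...) is the only change:

```python
import re

s = 'E10 1/05/03 2/3211 3/AO Yuzhmor'

# Original attempt: three capturing groups, so findall yields 3-tuples.
original = re.findall(r' \d\/(.*?)(?=(( \d\/)|\Z))', s)

# Same pattern with the lookahead groups made non-capturing.
fixed = re.findall(r' \d\/(.*?)(?=(?: \d\/)|\Z)', s)

print(original)
print(fixed)
```

In this sketch the \Z branch does fire for the last field; the wanted text is just buried in the first element of each tuple.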
This can be tricky because we don't know all of the rules of how these strings are built. One option is to use your regex to split the string
>>> re.split(r" \d/", 'E10 1/05/03 2/3211 3/AO Yuzhmor')[1:]
['05/03', '3211', 'AO Yuzhmor']
Another is to be more specific about the fields, assuming that they are always " 1/", " 2/" and " 3/"
>>> re.match(r".*?1/(.*?) 2/(.*?) 3/(.*)", 'E10 1/05/03 2/3211 3/AO Yuzhmor').groups()
('05/03', '3211', 'AO Yuzhmor')
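Try
re.findall(r'\d/(\S+)', s)
:) Note that \S+ stops at the first whitespace, so on the example above the last piece comes back as 'AO' rather than 'AO Yuzhmor'.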

How to improve the performance of this regular expression?

Consider the regular expression
^(?:\s*(?:[\%\#].*)?\n)*\s*function\s
It is intended to match Octave/MATLAB script files that start with a function definition.
However, the performance of this regular expression is incredibly slow, and I'm not entirely sure why. For example, if I try evaluating it in Python,
>>> import re, time
>>> r = re.compile(r"^(?:\s*(?:[\%\#].*)?\n)*\s*function\s")
>>> t0=time.time(); r.match("\n"*15); print(time.time()-t0)
0.0178489685059
>>> t0=time.time(); r.match("\n"*20); print(time.time()-t0)
0.532235860825
>>> t0=time.time(); r.match("\n"*25); print(time.time()-t0)
17.1298530102
In English, that last line is saying that my regular expression takes 17 seconds to evaluate on a simple string containing 25 newline characters!
What is it about my regex that is making it so slow, and what could I do to fix it?
EDIT: To clarify, I would like my regex to match the following string containing comments:
# Hello world
function abc
including any amount of whitespace, but not
x = 10
function abc
because then the string does not start with "function". Note that comments can start with either "%" or with "#".
Replace each \s with [\t\f ] so that it no longer matches newlines. Newlines should be consumed only by the explicit \n at the end of the non-capturing group, (?:[\t\f ]*(?:[\%\#].*)?\n).
The problem is that you have three greedy consumers that can all match '\n': the inner \s*, the repeated group (...\n)*, and the final \s*.
In your last timing example there is no "function" to find, so the engine backtracks through every way of dividing the 25 newlines among those three consumers before it can report failure.
Each newline inside the repeated group can be taken either by \s* or by the group's own \n, so the number of combinations roughly doubles with every extra newline, which fits your timings. It's too late for me to do the maths right now but I bet the number is huge :D
By the way regex101 is a great site to try out regular expressions. They automatically break up expressions and explain their parts and they even provide a debugger.
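The replacement suggested above can be sketched as follows. Here both \s* occurrences are swapped for [\t\f ]* while the \s after "function" is left alone; this assumes lines end in a plain "\n" (blank lines ending in "\r\n" would need "\r" added to the class).

```python
import re

# Only horizontal whitespace is allowed inside a line; each newline is
# consumed by the explicit \n, so there is only one way to split the input.
fixed = re.compile(r"^(?:[\t\f ]*(?:[%#].*)?\n)*[\t\f ]*function\s")

with_comment = fixed.match("# Hello world\nfunction abc")
with_code = fixed.match("x = 10\nfunction abc")
only_newlines = fixed.match("\n" * 25)

print(with_comment is not None, with_code, only_newlines)
```

With this version the 25-newline input no longer triggers exponential backtracking, because no newline can be claimed by more than one part of the pattern.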
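To speed this up you can use this regex:
p = re.compile(r"^\s*function\s", re.MULTILINE)
Since you're not actually capturing lines starting with # or % anyway, you can use MULTILINE mode and start matching from the same line where the function keyword is found. Note that this only finds the function line after leading comments when used with p.search(), and in that case it also matches when other code (such as "x = 10") comes first.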

Python Regex expression

Trying to write a Regex expression in Python to match strings.
I want to match input such as first or first?21313, but not first. (with a trailing period).
So basically, I don't want to match anything that contains the period character.
I've tried word.startswith(('first[^.]?+')) but that doesn't work. I've also tried word.startswith(('first.?+')) but that hasn't worked either. Pretty stumped here.
import re
def check(word):
    regexp = re.compile(r'^first([^\..])+$')
    return regexp.match(word)
And if you don't want the dot:
^first([^..])+$
(first followed by one or more characters other than a dot, so first cannot stand alone).
You really don't need regex for this at all.
word.startswith('first') and word.find('.') == -1
But if you really want to take the regex route:
>>> import re
>>> re.match(r'first[^.]*$', 'first')
<_sre.SRE_Match object; span=(0, 5), match='first'>
>>> re.match(r'first[^.]*$', 'first.') is None
True
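Both suggestions above can be wrapped in small helpers and compared side by side. The function names and the sample words are made up for this sketch; '.' not in word is used as an equivalent spelling of word.find('.') == -1.

```python
import re

NO_DOT = re.compile(r"first[^.]*$")

def check_plain(word):
    # No regex: prefix test plus a substring test.
    return word.startswith("first") and "." not in word

def check_regex(word):
    # match() anchors at the start; $ anchors at the end.
    return NO_DOT.match(word) is not None

for word in ["first", "first?21313", "first.", "first.21313", "second"]:
    print(word, check_plain(word), check_regex(word))
```

The two helpers agree on these inputs, so the choice is mostly a matter of readability.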

Regular expression in python to capture multiple forms of badly formatted addresses

I have been tweaking a regular expression over several days to try to capture, with a single definition, several cases of inconsistent format in the address field of a database.
I am new to Python and regular expressions, and I have gotten great feedback here on Stack Overflow. With my new knowledge I built a regex that is getting close to the final result, but I still can't spot the problem.
import re
r1 = r"([\w\s+]+),?\s*\(?([\w\s+\\/]+)\)?\s*\(?([\w\s+\\/]+)\)?"
match1 = re.match(r1, 'caracas, venezuela')
match2 = re.match(r1, 'caracas (venezuela)')
match3 = re.match(r1, 'caracas, (venezuela) (df)')
group1 = match1.groups()
group2 = match2.groups()
group3 = match3.groups()
print(group1)
print(group2)
print(group3)
This should return 'caracas, venezuela' for groups 1 and 2, and 'caracas, venezuela, df' for group 3. Instead, it returns:
('caracas', 'venezuel', 'a')
('caracas ', 'venezuel', 'a')
('caracas', 'venezuela', 'df')
The only perfect match is group 3. The other 2 are isolating the 'a' at the end, and the 2nd one has an extra space at the end of 'caracas '.
Thanks in advance for any insight.
Cheers!
Regular expressions might be overkill... what exactly is your problem statement? What do you need to capture?
Some things I caught (in order of appearance in your regex; sometimes it helps to read it out, left-to-right, English-style):
([\w\s+]+)
This says, "capture one or more characters, each of which is a word character, whitespace, or a literal +".
Do you really want to capture the spaces at the end of the city name? Also, you don't need (indeed, shouldn't have) the + inside your brackets [ ]: inside a character class it is a literal plus sign rather than the 1-or-more quantifier, and the outer + already matches one or more characters. I'd rewrite this part like this:
([\w\s]*\w)
Which will match greedily up to the last alphanumeric character ("zero or more (letter or space) followed by a letter"). This does assume you have at least one character, but that is better than your original pattern, which would also accept a single space.
Next you have:
,?\s*\(?
which looks okay to me except that it doesn't guarantee that you'll see either a comma or an open paren anymore. What about:
(?:,\s*\(|,\s*|\s*\()
which says, "non-capturingly match either (a comma with maybe some spaces and then an open paren) OR (a comma with maybe some spaces) OR (maybe some spaces and then an open paren)". This enforces that you must have either a comma or a paren or both.
Next you have the capturing expression, very similar to the first:
([\w\s+\\/]+)
Again, you don't want the spaces (or slashes in this case) at the end of the captured name, and you don't want the + inside the [ ]:
([\w\s\\/]*\w)
The next expression is probably where you're getting your venezuel a problem; let's take a look:
\)?\s*\(?([\w\s+\\/]+)\)?
This is a rather long one, so let's break it down:
\)?\s*\(?
says to "maybe match a close paren, and then maybe some spaces, and then maybe an open paren". This is okay I guess, let's move on to the real problem:
([\w\s+\\/]+)
This capturing group MUST match at least one character. If the matcher sees "venezuela" at the end of your address, the second group first greedily takes the whole word, then has to give back the final a so that this last group has something to match. That leaves venezuel in one group and a in the other. Try instead:
\)?\s*
Followed by making your entire final expression optional, and the outer expression non-capturing:
(?:\(?([\w\s+\\/]+)\)?)?
The final expression would be:
([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?
Edit: fixed a problem that made the final group capture twice, once with the parens, once without. Now it should only capture the text inside the parens.
Testing it on your examples:
>>> re.match(r, 'caracas, venezuela').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas (venezuela)').groups()
('caracas', 'venezuela', None)
>>> re.match(r, 'caracas, (venezuela) (df)').groups()
('caracas', 'venezuela', 'df')
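For reference, the test above can be run as a standalone script along these lines. The name r is bound to the final expression from this answer, and the join step is only an illustration of how the 'caracas, venezuela' form asked for in the question could be rebuilt.

```python
import re

# The final expression from the answer above, bound to the name r.
r = re.compile(
    r"([\w\s]*\w)(?:,\s*\(|,\s*|\s*\()([\w\s\\/]*\w)\)?\s*(?:\(?([\w\s+\\/]+)\)?)?"
)

samples = ['caracas, venezuela', 'caracas (venezuela)', 'caracas, (venezuela) (df)']
for sample in samples:
    # Drop the unmatched optional group before joining.
    parts = [g for g in r.match(sample).groups() if g is not None]
    print(', '.join(parts))
```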
Could you not just find all the words in the text?
E.g.:
>>> import re
>>> samples = ['caracas, venezuela','caracas (venezuela)','caracas, (venezuela) (df)']
>>>
>>> def find_words(text):
...     return re.findall(r'\w+', text)
...
>>> for sample in samples:
...     print(find_words(sample))
...
['caracas', 'venezuela']
['caracas', 'venezuela']
['caracas', 'venezuela', 'df']

Python regular expression; why do the search & match appear to find alpha chars in a number string?

I'm running the search below in IDLE, with Python 2.7 in a Windows Bus. 64-bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Helpful link on repetition operators.
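To see why the * version reports a match, it can help to look at what was actually matched rather than only at the match object. A short sketch (variable names are made up for illustration):

```python
import re

digits = "3534543234543"

star = re.compile(r"[a-zA-Z]*").match(digits)
plus = re.compile(r"[a-zA-Z]+").match(digits)

# The * pattern succeeds, but only by matching the empty string at index 0.
print(repr(star.group()), star.span())

# The + pattern needs at least one letter, so there is no match at all.
print(plus)
```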
Everything eldarerathis says is true. However, with a variable named: 'patternalphaonly' I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true then I would add additional end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)
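One detail worth keeping in mind with the $ anchor is that it also matches just before a trailing newline, so 'abc\n' passes the patterns above. On Python 3.4 or later (not the 2.7 used in the question), re.fullmatch is an alternative that requires the entire string to match. A minimal sketch, with a made-up helper name:

```python
import re

ALPHA_ONLY = re.compile(r"[a-zA-Z]+")

def is_alpha_only(text):
    # fullmatch succeeds only if the pattern covers the whole string.
    return ALPHA_ONLY.fullmatch(text) is not None

for text in ["abc", "3534543234543", "abc1", "", "abc\n"]:
    print(repr(text), is_alpha_only(text))
```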
