Can't make regex work with Python - python

I need to extract the date in format of: dd Month yyyy (20 August 2013).
I tried the following regex:
\d{2} (January|February|March|April|May|June|July|August|September|October|November|December) \d{4}
It works with regex testers (chcked with several the text - Monday, 19 August 2013), but It seems that Python doesn't understand it. The output I get is:
>>>
['August']
>>>
Can somebody please understand me why is that happening ?
Thank you !

Did you use re.findall? By default, if there's at least one capture group in the pattern, re.findall will return only the captured parts of the expression.
You can avoid this by removing every capture group, causing re.findall to return the entire match:
\d{2} (?:January|February|...|December) \d{4}
or by making a single big capture group:
(\d{2} (?:January|February|...|December) \d{4})
or, possibly more conveniently, by making every component a capture group:
(\d{2}) (January|February|...|December) (\d{4})
This latter form is more useful if you will need to process the individual day/month/year components.

It looks like you are only getting the data from the capture group, try this:
(\d{2} (?:January|February|March|April|May|June|July|August|September|October|November|December) \d{4})
I put a capture group around the entire thing and made the month a non-capture group. Now whatever was giving you "August" should give you the entire thing.
I just looked at some python regex stuff here
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
Seeing this, I'm guessing (since you didn't show how you were actually using this regex) that you were doing group(1) which will now work with the regex I supplied above.
It also looks like you could have used group(0) to get the whole thing (if I am correct in the assumption that this is what you were doing). This would work in your original regex as well as my modified version.

Related

Python: Extracting URLs using regex or other means

I’m stumped on a problem. I have a large data frame where two of the columns are like this:
pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'], ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
What I’m trying to do is leave only the URL including the word “twitter” left in each cell and remove the rest. The pattern is that the URLs I want always include the word “twitter” and ends with “/” + a one-digit number. In the cases where there are two identical URLs in the same cell then only one should remain. Like this:
Test2 = pd.DataFrame([['a', 'https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
Test2
I’m new to Python and after a lot of googling I’ve started to understand that something called regex is the answer but that is as far as I come. One of the postings here at Stackoverflow led me to regex101.com and after playing around this is as far as I’ve come and it doesn't work:
r’^[https]+(:)(//)(.*?)(/)(\d)’
Can anyone tell me how to solve this problem?
Thanks in advance.
Regular expressions are certainly handy for such tasks. Refer to this question and online tools such as regex101 to learn more.
Your current pattern is incorrect because:
^ Matches the following pattern at the start of string.
[https]+ This is a character set, meaning it will match h, s, ps, therefore any combination of one or more letters present in the [] brackets, and not just the strings http and https which is what you are after.
(:) You don't need to put this : in a capturing group here.
(//) / Needs to be escaped in regex, \/. No need for capturing group here either.
(.*?) The .*? combo is often misused when a negated character set [^] could be used instead.
(/) As discussed above.
(\d) Matches and captures a digit. The capturing group here is also redundant for your task.
You may use the following expression:
https?:\/\/twitter\.com[^,]+(?<=\/\d$)
https? Matches literal substrings http or https.
:\/\/twitter\.com Matches literal substring ://twitter.com.
[^,]+ Anything that is not a comma, one or more.
(?<=\/\d$) Positive lookbehind. Assert that a / followed by a digit \d is present at the end of the string $.
Regex demo here.
Python demo:
import pandas as pd
df = pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
df['URLs'] = df['URLs'].str.findall(r"https?:\/\/twitter\.com[^,]+(?<=\/\d$)").str[0]
print(df)
Prints:
ID URLs
0 a https://twitter.com/dog_rates/status/890971913173991426/photo/1
1 b https://twitter.com/dog_rates/status/890971913173991426/photo/1
2 c https://twitter.com/dog_rates/status/890971913173991430/video/1

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should return 5. In order to remove these leading zeroes in-front of digits, I asked a question on here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In python regex, the syntax used to indicate a reference to a positional captured group (sometimes called a "backreference") is "\n" (where "n" is a digit refering to the position of the group in the "find" part of the regex).
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
Firstly, repl includes what you are about to replace.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here , since your regex splits every match it finds into groups which are 1,2... so on. This is so because of the parenthesis () you have placed in the regex.
$1 , $2 or \1,\2 can be used to refer to them.
In this case: The regex is replacing all numbers after the leading 0 (which is caught by group 2) with itself.
Note: \1 is not necessary. works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub('\d',' ',s)
'awd frr cr '
>>>
Explanation:
As it is, '\d' is for integer so removes them and replaces with repl (in this case ' ').

Using python regular expression to match times

I'm trying to parse a csv file with times in the form of 6:30pm or 7am, or midnight. I've googled around and read the docs for regular expressions in the python docs but haven't been able to implement them successfully.
My first try to match them was:
re.findall(r'^d{1,2}(:d{1,2})?$', string)
But this didn't work. I have the parenthesis and the question mark there because sometimes there isn't always anything more than the hour. Also, I haven't even begun to think about how to match the am and pm.
Any help is appreciated!
First of all, to match digits you need \d, not just d.
re.findall(r'^\d{1,2}(:\d{1,2})?$', string)
Second, as written, your regex will only match a string which is exactly a single time and nothing else, because ^ means "beginning of string" and $ means "end of string. You can omit those if you want to find all of the times throughout the string:
re.findall(r'\d{1,2}(:\d{1,2})?', string)
As far as the am/pm goes, you can just add another optional group:
re.findall(r'\d{1,2}(:\d{1,2})?(am|pm)?', string)
Of course, because everything is optional besides the first 1 or 2 digits, you're also going to match any one or two digit number. You could instead require either at least either am/pm or a colon and two more digits:
re.findall(r'\d{1,2}((am|pm)|(:\d{1,2})(am|pm)?)', string)
But, findall behaves slightly oddly: if you have matching groups in your pattern, it'll only return the groups rather than the full match. Thus, you can change them to non-matching groups:
re.findall(r'\d{1,2}(?:(?:am|pm)|(?::\d{1,2})(?:am|pm)?)', string)
If you are strictly looking for a regex solution. You can use:
re.findall(r'^\d{1,2}(:\d{1,2})?$', string)
But wait
that's not all. There is a better way to do it without regex ;). You can use python CSV parsing powers.
import csv
string = "November,Monday,6:30pm,1989"
csv_reader = csv.reader( [ string ] )
for row in csv_reader:
print row
Output
['November', 'Monday', '6:30pm', '1989']
import re
regex = r'(\d{1,2})([.:](\d{1,2}))?[ ]?(am|pm)?'
groups = re.findall(regex, value)
group1 will give hr
group3 will give min
group4 will give am/pm
Examples :
12pm
12.30pm
12:30pm
2.30 am
all these examples are working

Unexpected end of Pattern : Python Regex

When I use the following python regex to perform the functionality described below, I get the error Unexpected end of Pattern.
Regex:
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
Purpose of this regex:
INPUT:
CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
Should match:
CODE876
CODE223
CODE657
CODE697
and replace occurrences with
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
Should Not match:
code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665
FINAL OUTPUT
http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665
EDIT and UPDATE 1
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
The error is no more happening. But this does not match any of the patterns as needed. Is there a problem with matching groups or the matching itself. Because when I compile this regex as such, I get no match to my input.
EDIT AND UPDATE 2
f=open("/Users/mymac/Desktop/regex.txt")
s=f.read()
s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
print s1
INPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
OUTPUT
CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345
CODE234
CODE333
Regex works for Raw input, but not for string input from a text file.
See Input 4 and 5 for more results http://ideone.com/3w1E3
Your main problem is the (?-i) thingy which is wishful thinking as far as Python 2.7 and 3.2 are concerned. For more details, see below.
import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a complile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy
Looks like suggestions fall on deaf ears ... Here's the pattern in re.VERBOSE format:
pattern4 = r'''
^
(?i)
(
(?:
(?!http://)
(?!testing[0-9])
(?!example[0-9])
. #### what is this for?
)*?
) ##### end of capturing group 1
(CODE[0-9]{3}) #### not in capturing group 1
(?!</a>)
'''
Okay, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python it seems they always modify the whole regex, same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.
On a side note, I don't see any reason to use three separate lookaheads, as you did here:
(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?
Just OR them together:
(?:(?!http://|testing[0-9]|example[0-9]).)*?
EDIT: We seem to have moved from the question of why the regex throws exceptions, to the question of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.
s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
r'\g<1>\g<2>', s)
see it in action one ideone.com
Is that what you're after?
EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That's makes the problem much more difficult, but we also know the full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match our the target strings.
(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))
The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.
here's another demo (I know that's horrible code, but I'm too tired right now to learn a more pythonic way.)
The only problem I see is that you replace using the wrong capturing group.
modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'\g<1>',input)
^ ^ ^
first capturing group second one using the first group
Here I made the first one also a non capturing group
^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)
See it here on Regexr
For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match up correctly (i.e. by using indentation to indicate the current level of nesting).

How to use ? and ?: and : in REGEX for Python?

I understand that
* = "zero or more"
? = "zero or more" ...what's the difference?
Also, ?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
As Manu already said, ? means "zero or one time". It is the same as {0,1}.
And by ?:, you probably meant (?:X), where X is some other string. This is called a "non-capturing group".
Normally when you wrap parenthesis around something, you group what is matched by those parenthesis. For example, the regex .(.).(.) matches any 4 characters (except line breaks) and stores the second character in group 1 and the fourth character in group 2. However, when you do: .(?:.).(.) only the fourth character is stored in group 1, everything bewteen (?:.) is matched, but not "remembered".
A little demo:
import re
m = re.search('.(.).(.)', '1234')
print m.group(1)
print m.group(2)
# output:
# 2
# 4
m = re.search('.(?:.).(.)', '1234')
print m.group(1)
# output:
# 4
You might ask yourself: "why use this non-capturing group at all?". Well, sometimes, you want to make an OR between two strings, for example, you want to match the string "www.google.com" or "www.yahoo.com", you could then do: www\.google\.com|www\.yahoo\.com, but shorter would be: www\.(google|yahoo)\.com of course. But if you're not going to do something useful with what is being captured by this group (the string "google", or "yahoo"), you mind as well use a non-capturing group: www\.(?:google|yahoo)\.com. When the regex engine does not need to "remember" the substring "google" or "yahoo" then your app/script will run faster. Of course, it wouldn't make much difference with relatively small strings, but when your regex and string(s) gets larger, it probably will.
And for a better example to use non-capturing groups, see Chris Lutz's comment below.
?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
If that’s indeed what your book says, then I advise getting a better book.
Inside parentheses (more precisely: right after an opening parenthesis), ? has another meaning. It starts a group of options which count only for the scope of the parentheses. ?: is a special case of these options. To understand this special case, you must first know that parentheses create capture groups:
a(.)c
This is a regular expression that matches any three-letter string starting with a and ending with c. The middle character is (more or less) aribtrary. Since you put it in parentheses, you can capture it:
m = re.search('a(.)c', 'abcdef')
print m.group(1)
This will print b, since m.group(1) captures the content of the first parentheses (group(0) captures the whole hit, here abc).
Now, consider this regular expression:
a(?:.)c
No capture is made here – this is what ?: after an opening parenthesis means. That is, the following code will fail:
print m.group(1)
Because there is no group 1!
? = zero or one
you use (?:) for grouping w/o saving the group in a temporary variable as you would with ()
? does not mean "zero or more", it means "zero or one".

Categories