© 2014 Fairfax New Zealand Limited<br/>
Privacy<!-- |
The above is the offending section in my HTML document.
Below is my regex. It works on every other URL in my document. Except this one.
urliter = re.finditer(r'(http://|https://)([\w]+\.[\w\.]+\/?)([\w\/\.]+")',lines)
urlMatches = defaultdict(list)
for match in urliter:
    urlMatches[match.group(2)].append(match.group())
When I view the output, for some reason, www.fairfaxmedia.co.nz cuts off the z at the end, so it only shows www.fairfaxmedia.co.n for group(2)
I can't figure out why this would be?
Also, question #2: how would I search only for URLs in quotation marks, but leave the quotation marks out of the match?
Your regex uses three capturing groups:
(http://|https://) matches (and captures in group 1) the http part
([\w]+\.[\w\.]+\/?) captures in the second group
([\w\/\.]+") captures in the third group
Since you put a + in ([\w\/\.]+"), the character class [\w\/\.] must match at least one character. This means that in http://www.fairfaxmedia.co.nz" the last group has to match at least z".
Hence the second group (the one you are reading) has to give the z back, and it ends at www.fairfaxmedia.co.n.
If you want to simply separate the domain name from the rest of your URL, you can tweak your regex to:
"(https?://(\w+\.[\w.]+)(/?[\w/.-]*))"
The whole URL (without quotes) is in capturing group 1, the domain name in capturing group 2, the rest in capturing group 3: see demo here.
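As a rough check of the difference, here is a minimal sketch. The sample string is an assumption: it is the URL discussed above wrapped in quotes the way it might appear in an href attribute.

```python
import re

# Hypothetical sample: the URL from the question, quoted as in an HTML attribute.
sample = 'href="http://www.fairfaxmedia.co.nz"'

original = r'(http://|https://)([\w]+\.[\w\.]+\/?)([\w\/\.]+")'
tweaked = r'"(https?://(\w+\.[\w.]+)(/?[\w/.-]*))"'

m_old = re.search(original, sample)
m_new = re.search(tweaked, sample)

# In the original, group 3 needs at least one character before the quote,
# so group 2 has to give up its last letter.
print(m_old.group(2))
# In the tweaked version the path group may be empty, so the domain stays whole.
print(m_new.group(2))
```

With the tweaked pattern the quotes are consumed by the match but sit outside group 1, so group 1 is the bare URL.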
To search for text in quotation marks while leaving the quotation marks out of the match, you can use lookaround assertions.
For example (core regexp taken from Robin's answer):
(?<=\")(https?://(\w+\.[\w.]+)(/?[\w\/\.]*))(?=\")
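Plugged into a loop like the one in the question, the lookaround version might be used as follows. This is only a sketch: the input string and the name url_matches are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical input: two quoted URLs and one unquoted URL.
lines = ('<a href="http://www.example.com/a/b.html">x</a> '
         '"https://example.org" http://unquoted.example.net')

# The lookbehind and lookahead require the quotes but do not consume them.
pattern = r'(?<=")(https?://(\w+\.[\w.]+)(/?[\w/.]*))(?=")'

url_matches = defaultdict(list)
for match in re.finditer(pattern, lines):
    # group(2) is the domain, group() is the whole URL without quotes
    url_matches[match.group(2)].append(match.group())

print(dict(url_matches))
```

The unquoted URL is skipped because nothing satisfies the lookbehind in front of it.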
I have the following string,
"ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
In the above string, using a regex, I want to find all occurrences of 'TAG' except the first 3 occurrences.
I used this regex, '(TAG.*?){4}', but it only finds the 4th occurrence ('TAG:'), not the others ('TAG.', 'TAG ', 'TAGH').
If you want a capture group with all the remaining matches, you have to consume the first ones first:
(TAG.*?){3}(TAG.*?)*
This matches the first 3 occurrences in the first capture group and matches the rest in the second.
If you don't want the first matches to be in a capture group, you can flag it as non-capturing group:
(?:TAG.*?){3}(TAG.*?)*
Judging by your example, I think the regex inside the capture group is not correct yet. If this doesn't already give you the right idea of how to do this, please give us an example of the matches you want to see, and I'll edit my answer.
EDIT:
I get the feeling that you want to capture the 4th and following occurrences in their own capture groups while still ignoring the first 3 occurrences.
I can't properly explain why, but I think that's not possible for the following reasons:
Ignoring the first 3 occurrences in their own (non-)capturing group forces you to abandon the 'g' modifier for finding all occurrences (because that would just do 'ignore 3 TAGs, find 1' in a loop).
It is not possible to capture multiple separate results with just one capture group. A repeated capture group always keeps only its last occurrence. It is possible to capture all occurrences together in a single capture group rather than just the last one, but it seems like you want them in separate groups.
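The second point can be seen with a short sketch. The input here is a made-up string of three adjacent TAG occurrences, not the string from the question.

```python
import re

# Hypothetical short input with three adjacent TAG occurrences.
text = "TAG:TAG.TAG-"

m = re.search(r"(TAG.)+", text)
whole = m.group(0)  # everything the repeated group consumed
last = m.group(1)   # only the final repetition is retained

# Repeating a single-TAG pattern with findall returns each occurrence separately.
each = re.findall(r"TAG.", text)

print(whole, last, each)
```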
So, how to solve this?
I'd come up with a proper regex for one TAG and repeat it using find-all or the g modifier. In Python you can then simply take all findings, skipping the first 3:
import re
text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
pattern = r"(?:TAG((?:(?!TAG).)+))"
# findall returns the text captured after each TAG; [3:] skips the first three
findings = re.findall(pattern, text)[3:]
If you want to ignore the first character after TAG, just add a . after TAG:
pattern = r"(?:TAG.((?:(?!TAG).)+))"
Explanation of the regex:
- I use ?: to make some capturing groups non-capturing. I only want to deal with one capture group.
- To get rid of the non-greedy modifier and be a little more specific about what we actually want, I've introduced the negative lookahead after the TAG occurrence.
I’m stumped on a problem. I have a large data frame where two of the columns are like this:
pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'], ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
What I’m trying to do is keep only the URL that includes the word “twitter” in each cell and remove the rest. The pattern is that the URLs I want always include the word “twitter” and end with “/” + a one-digit number. In cases where there are two identical URLs in the same cell, only one should remain. Like this:
Test2 = pd.DataFrame([['a', 'https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
Test2
I’m new to Python, and after a lot of googling I’ve started to understand that something called regex is the answer, but that is as far as I have got. One of the postings here at Stack Overflow led me to regex101.com, and after playing around this is what I came up with, and it doesn't work:
r'^[https]+(:)(//)(.*?)(/)(\d)'
Can anyone tell me how to solve this problem?
Thanks in advance.
Regular expressions are certainly handy for such tasks. Refer to this question and online tools such as regex101 to learn more.
Your current pattern is incorrect because:
^ Matches the following pattern at the start of string.
[https]+ This is a character set, so it matches one or more of the letters h, t, p, s in any order and combination (h, s, ps, ...), not just the strings http and https that you are after.
(:) You don't need to put this : in a capturing group here.
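(//) In Python's re module / does not need to be escaped, although writing \/ is harmless (escaping is required in flavors where / delimits the pattern). No need for a capturing group here either.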
(.*?) The .*? combo is often misused when a negated character set [^] could be used instead.
(/) As discussed above.
(\d) Matches and captures a digit. The capturing group here is also redundant for your task.
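The difference between the character set and the literal sequence can be checked quickly. This is a small sketch; the test strings are made up for illustration.

```python
import re

# [https]+ accepts any run of the letters h, t, p, s ...
set_accepts_ssh = re.fullmatch(r"[https]+", "ssh") is not None
# ... whereas https? accepts only the literal strings http and https.
seq_accepts_ssh = re.fullmatch(r"https?", "ssh") is not None
seq_accepts_http = re.fullmatch(r"https?", "http") is not None
seq_accepts_https = re.fullmatch(r"https?", "https") is not None

print(set_accepts_ssh, seq_accepts_ssh, seq_accepts_http, seq_accepts_https)
```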
You may use the following expression:
https?:\/\/twitter\.com[^,]+(?<=\/\d$)
https? Matches literal substrings http or https.
:\/\/twitter\.com Matches literal substring ://twitter.com.
[^,]+ Anything that is not a comma, one or more.
(?<=\/\d$) Positive lookbehind. Assert that a / followed by a digit \d is present at the end of the string $.
Regex demo here.
Python demo:
import pandas as pd
df = pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
df['URLs'] = df['URLs'].str.findall(r"https?:\/\/twitter\.com[^,]+(?<=\/\d$)").str[0]
print(df)
Prints:
ID URLs
0 a https://twitter.com/dog_rates/status/890971913173991426/photo/1
1 b https://twitter.com/dog_rates/status/890971913173991426/photo/1
2 c https://twitter.com/dog_rates/status/890971913173991430/video/1
I have a character string:
temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'
And I need to get rid of the tabs only between the taxonomy terms.
I tried
re.sub(r'(?:\D{1})\t', ',', temp)
It came quite close, but also replaced the letter before tabs:
'4424396.6\t1\tk__Bacteri,p__Firmicute,c__Erysipelotrich,o__Erysipelotrichales'
I am confused, because the re documentation for (?:...) says:
...the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
The last letter was within the parenthesis, so how could it be replaced?
PS
I used re.sub(r'(?<=\D{1})(\t)', ',', temp) and it works perfectly fine, but I can't understand what's wrong with the first regexp
The text matched by (?:...) does not form a capture group, unlike (...), and therefore cannot be referred to later with a backreference such as \1. However, it is still part of the overall match, and so it is part of the text that re.sub() will replace.
The point of non-capturing groups is that they are slightly more efficient, and may be required in uses such as re.split() where the mere existence of capturing groups will affect the output.
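The re.split() case is easy to see in a short sketch. The input string here is made up for illustration.

```python
import re

# Hypothetical input with a single tab separator.
text = "a\tb"

# A capturing group makes re.split() include the separator in the result.
with_capture = re.split(r"(\t)", text)
# A non-capturing group splits on the same text but drops the separator.
without_capture = re.split(r"(?:\t)", text)

print(with_capture, without_capture)
```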
According to the documentation, (?:...) specifies a non-capturing group. It explains:
Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents.
What this means is that anything that matches the ... expression (in your case, the preceding letter) will not be captured as a group but will still be part of the match. The only special thing is that you won't be able to access the part of the input matched by this group using match.group:
Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group
In contrast, (?<=...) is a positive lookbehind assertion; the regular expression checks that any match is preceded by text matching ..., but that text is not consumed and is not part of the match, so re.sub() leaves it in place.
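The three behaviours can be compared side by side on the string from the question. This is a sketch; the variant with a capturing group and the \1 backreference is an added alternative, not something from the question.

```python
import re

temp = ('4424396.6\t1\tk__Bacteria\tp__Firmicutes'
        '\tc__Erysipelotrichi\to__Erysipelotrichales')

# Non-capturing group: the preceding character is part of the match, so it is replaced.
non_capturing = re.sub(r'(?:\D)\t', ',', temp)
# Lookbehind: the preceding character is only checked, never consumed.
lookbehind = re.sub(r'(?<=\D)\t', ',', temp)
# Capturing group: the character is consumed, then restored through \1.
backreference = re.sub(r'(\D)\t', r'\1,', temp)

print(non_capturing)
print(lookbehind)
print(backreference)
```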
The URL pattern is
http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500
This website has similar URLs. The unique identifier for this URL is -p-.
The URL always has -p- before the word at the end of the URL.
I used the following regex
(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z
It matched, but it also matches many other patterns on this website.
For example, the regex should match the URL above but it shouldn't match
http://www.hepsiburada.com/bilgisayarlar-c-2147483646
Since you are using re.match, the pattern has to match from the beginning of the string. However, the main problem is that your -p- is inside a character class, so it is treated as a set of individual characters, any one of which can match. The same goes for \w+: inside the class it means \w or a literal +.
So, use a sequence:
(.*)hepsiburada\.com/([\w.-]+)(-p-\w+)$
See this regex demo
Or
^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$
See the regex demo
Note that you most probably don't even need the capture groups, in which case the (...) parentheses can be removed from the pattern.
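A quick check of both patterns against the two URLs from the question, as a minimal sketch (the variable names are made up for illustration):

```python
import re

product = ('http://www.hepsiburada.com/philips-40pfk5500-40-102-ekran-full-hd-'
           '200-hz-uydu-alicili-cift-cekirdek-smart-android-led-tv-p-EVPHI40PFK5500')
category = 'http://www.hepsiburada.com/bilgisayarlar-c-2147483646'

# Original: the character class matches just one character, so almost any URL passes.
broken = re.compile(r'(.*)hepsiburada\.com\/([\w.-]+)([\-p\-\w+])\Z')
# Fixed: -p- is a literal sequence that must be followed by word characters.
fixed = re.compile(r'^https?://(?:www\.)?hepsiburada\.com/([\w.-]+)(-p-\w+)$')

broken_matches_category = broken.match(category) is not None
fixed_matches_category = fixed.match(category) is not None
product_id = fixed.match(product).group(2)

print(broken_matches_category, fixed_matches_category, product_id)
```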
I have some confusion regarding the pattern matching in the following expression. I tried to look up online but couldn't find an understandable solution:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing? I understood up to the first asterisk, but I can't figure out what is happening after that.
Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a node is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.
(http://i.imgur.com/(.*))(\?.*)?
Debuggex Demo
So this starts with an imgur URL http://i.imgur.com/ (mandatory), followed by any characters until an optional '?' is encountered, and then any characters after the '?'. Notice that the '?' has been escaped to remove its special meaning. The pink highlights indicate the capture groups.
(http://i.imgur.com/(.*))(\?.*)?
The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.
The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.
The last ? means that the last capturing group is optional.
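One thing worth checking when running this pattern: .* is greedy, so the second group tends to swallow the query string and leave the optional third group unmatched. The sketch below uses a made-up URL; the variant with [^?]* is one possible adjustment, not part of the original pattern.

```python
import re

# Hypothetical URL with a query string.
url = 'http://i.imgur.com/abc.jpg?1'

greedy = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
m = greedy.match(url)
greedy_groups = (m.group(2), m.group(3))

# Assumed variant: stop the second group at the first '?'.
bounded = re.compile(r'(http://i\.imgur\.com/([^?]*))(\?.*)?')
m2 = bounded.match(url)
bounded_groups = (m2.group(2), m2.group(3))

print(greedy_groups)
print(bounded_groups)
```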
EDIT:
These groups can then be used as:
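import re
p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
# match() returns None for a string that does not start with the imgur prefix,
# so use a URL that does (this one is only an example)
m = p.match('http://i.imgur.com/abc.jpg')
m.group(0)  # the whole match
m.group(2)  # the part after the last slash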
To improve the regex, limit the engine to the characters you actually need, for example:
(http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]+)?
[A-Za-z0-9\-]+ limits the match to alphanumeric characters and hyphens (A-z would also admit punctuation such as [ and _)
[^/] excludes /
The (.*) means any character repeated any number of times; the (\?.*)? is meant to match the query string of a URL, for example (an imgur search for "cat"):
http://imgur.com/search?q=cat
Taking a URL of this shape (with the i.imgur.com host that the regex expects), the part up to and including search is meant for (http://i.imgur.com/(.*)), with search itself matched by the (.*), and ?q=cat is meant for the (\?.*)? part. The ? at the end of the regex means optional, so there may or may not be a query string; there is no query string in the URL http://www.imgur.com. Note that because .* is greedy, in practice it also consumes the query string and leaves the last group empty. The parentheses are used for grouping. We want to group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it that matches the page you are requesting (this is (.*)). We want to group (\?.*)? because it matches the query string.