I have a text column that looks like:
http://start.blabla.com/landing/fb603?&mkw...
I want to extract "start.blabla.com"
which is always between:
http://
and:
/landing/
namely:
start.blabla.com
I do:
df.col.str.extract('http://*?\/landing')
But it doesn't work.
What am I doing wrong?
Your regex matches http:/, then 0+ / symbols as few as possible and then /landing.
You need to match and capture one or more characters other than / after http:// (the extract method requires a regular expression with at least one capture group). It can be done with
http://([^/]+)/landing
^^^^^^^
where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
See the regex demo
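As a minimal sketch of the difference, the two patterns can be compared with the plain re module on the sample URL from the question. The variable names are only illustrative; the fixed pattern string is the one that would be passed to df.col.str.extract.

```python
import re

url = "http://start.blabla.com/landing/fb603?&mkw..."

# Original attempt: "/*?" only repeats the slash, so nothing can consume the host name.
broken = re.search(r"http://*?\/landing", url)

# Fixed pattern: capture one or more non-slash characters after "http://".
fixed = re.search(r"http://([^/]+)/landing", url)
host = fixed.group(1)

print(broken)  # None
print(host)    # start.blabla.com
```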
Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:
df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')
You'd get something along the lines of:
Site RestUrl
0 start.blabla.com fb603?&mkw...
To understand how this regex (and any other regex for that matter) is constructed I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.
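For a quick check outside pandas, the same named groups can be inspected with re. This is only a sketch on the sample URL from the question; in pandas, the group names become the column names.

```python
import re

pattern = re.compile(r"http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)")
m = pattern.search("http://start.blabla.com/landing/fb603?&mkw...")

# groupdict() maps each named group to the text it captured.
parts = m.groupdict()
print(parts)  # {'Site': 'start.blabla.com', 'RestUrl': 'fb603?&mkw...'}
```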
I am having some trouble trying to figure out how to use regular expressions in python. Ultimately I am trying to do what sscanf does for me in C.
I am trying to match given strings that look like so:
12345_arbitrarystring_2020_05_20_10_10_10.dat
I (seem) to be able to validate this format by calling match on the following regular expression
regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')
(Note that I do allow for a few other separators than just '_')
I would like to split the given string on these separators so I do:
regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string)
This is all fine .. the problem is that I would like my 'arbitrarystring' part to include '-' and '_' and the last split currently, well, splits them.
Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?
You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.
You could for example use a character class to match 1+ word characters or a hyphen using [\w-]+
If you still want to use split, you could add capturing groups for the first and the second part, and split only those groups.
^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
^^^^^^^^
Regex demo
It seems to be possible to cut down your regex to validate the whole pattern to:
^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$
Refer to group 1 for your arbitrary string.
Online demo
Quick reminder: you don't seem to have used raw strings, escaping with a double backslash instead. Python has raw strings (r'...'), which mean you don't have to escape backslashes anymore.
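Putting the shortened pattern and the raw-string advice together, a sketch could look like the following. The helper name middle_part and the second sample file name are made up for illustration.

```python
import re

# Raw string: backslashes reach the regex engine unchanged.
FILENAME_RE = re.compile(
    r"^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$"
)

def middle_part(filename):
    """Return the arbitrary middle part, or None if the name does not validate."""
    m = FILENAME_RE.match(filename)
    return m.group(1) if m else None

print(middle_part("12345_arbitrarystring_2020_05_20_10_10_10.dat"))  # arbitrarystring
print(middle_part("12345_my-odd_name_2020-05-20_10:10:10.dat"))      # my-odd_name
```

Because group 1 is captured during validation, no separate split step is needed, and hyphens or underscores inside the middle part are preserved.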
Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that pulls out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
Regex demo
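A small sketch of how this pattern behaves on a few of the identifiers. The helper name parse_id is made up, and the sketch assumes the pattern is applied while the hyphen is still present, since the lookahead can only reject -02 if the hyphen has not been stripped yet.

```python
import re

ID_RE = re.compile(r"^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?")

def parse_id(raw):
    """Return (base, suffix) for an accepted identifier, or None if it is rejected."""
    m = ID_RE.match(raw)
    return m.groups() if m else None

print(parse_id("AB1201"))     # ('AB1201', None)
print(parse_id("AB1201L1"))   # ('AB1201', 'L1')
print(parse_id("AB1201T"))    # ('AB1201', 'T')
print(parse_id("AB1201-02"))  # None
```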
Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.
First of all, the regex needs to work in both Python and PCRE (PHP). I'm trying to skip a match if the pattern is followed by the letter 'x', to distinguish dimensions from strings like "number/number" in the example below:
dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm
From here, I'm trying to extract 222/2334 but not the 6,33/523,23 since that part is actually part of dimensions. So far I came up with this regex
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))(?=\s?x)
which can extract what I don't want it to extract and it looks like this. If I change the positive lookahead to negative it captures both of them except the last '3' from 6,33/523,23. It looks like this. How can I only capture 222/2334? What am I doing wrong here?
Desired output:
222/2334
What I got
222/2334 6,33/523,2
You may use this simplified regex with negative lookahead:
((\d*(?:,?\.?)\d*(?:,?\.?))\s?\/\s?(\d*(?:,?\.?)\d*(?:,?\.?)))\b(?![.,]?\d|\s?x)
Updated RegEx Demo
It is important to use a word boundary at the end to avoid matching partial numbers (this is why your regex matched up to the digit before the last one).
Also include [.,]?\d in the negative lookahead condition so that the match can't end just before the last comma.
This shorter (and more efficient) regex may also work for OP:
(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)
RegEx Demo 2
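To see the effect of the word boundary and the extended lookahead, here is a sketch that runs the shorter regex on the sample string next to a naive variant without them. The naive pattern and the variable names are only there for comparison.

```python
import re

text = "dummy word 222/2334; Ø14 x Ø6,33/523,23 x 2311 mm"

# Shorter regex from above: word boundary plus extended negative lookahead.
RATIO_RE = re.compile(r"(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)\b(?![.,]?\d|\s?x)")

# Naive variant: backtracking can drop the last digit to satisfy the lookahead.
NAIVE_RE = re.compile(r"(\d+(?:[,.]\d+)*)\s*\/\s*(\d+(?:[,.]\d+)*)(?!\s?x)")

matches = [m.group(0) for m in RATIO_RE.finditer(text)]
naive = [m.group(0) for m in NAIVE_RE.finditer(text)]

print(matches)  # ['222/2334']
print(naive)    # ['222/2334', '6,33/523,2']
```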
There are two easy options.
The first option is ugly and long, but basically negates a positive match on the string that is followed by x, then matches the patterns without it.
(?!PATTERN(?=x))PATTERN
See regex in use here
(?!\d+(?:[,.]\d+)?\s?\/\s?\d+(?:[,.]\d+)?(?=\s?x))(\d+(?:[,.]\d+)?)\s?\/\s?(\d+(?:[,.]\d+)?)
The second option uses possessive quantifiers. In Python, the built-in re module only supports them from version 3.11 on, so with older versions you'll have to use the regex module instead of re.
See regex in use here
(\d+(?:[,.]\d+)?+)\s?\/\s?(\d+(?:[,.]\d+)?+)(?!\s?x)
Additionally, I changed your subpattern to \d+(?:[,.]\d+)?. This will match one or more digits, then optionally match . or , followed by one or more digits.
I want a regex that doesn't match a string if it contains the word page, and matches if it doesn't.
^https?.+/(event|news)/.+(?!page).+$ is the regex I'm currently using. I want it not to match, e.g., https://www.foosite.com/news/foopath/page/10, but it does. Where did I make a mistake?
The two .+ expressions should mean that there is some string around the page string, and (?!page) should mean that there must not be a string like page between them. What's wrong with this expression? Thanks, and sorry for poor grammar.
Your problem is that .+(?!page).+ will match foopath/page/10 because the first .+ match can end at the 1 in 10, and the second can match from there until $. Instead, just assert there is no combination of characters plus the word page after (event|news)/:
^https?.+/(event|news)/(?!.*page)
Demo on regex101
If you want more than just a match/nomatch decision, you can capture the entire matching string with this regex:
^https?.+/(event|news)/(?!.*page).*$
Demo on regex101
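A quick sketch to compare the original pattern with the corrected one. The second URL, without page, is made up for illustration.

```python
import re

original = re.compile(r"^https?.+/(event|news)/.+(?!page).+$")
fixed = re.compile(r"^https?.+/(event|news)/(?!.*page).*$")

paged = "https://www.foosite.com/news/foopath/page/10"
plain = "https://www.foosite.com/news/foopath/10"  # hypothetical URL without "page"

print(bool(original.match(paged)))  # True: the lookahead is only checked at one position
print(bool(fixed.match(paged)))     # False: "page" anywhere after /news/ blocks the match
print(bool(fixed.match(plain)))     # True
```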
You might be looking for
^https?.+/(event|news)/(?:(?!page).)+$
See a demo on regex101.com.
Matching is usually way easier in regex than excluding.
I would rather match your excluded words and invert the logic on the if-clause.
if not re.match(...
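A sketch of that inverted approach, assuming it is acceptable to look for page anywhere in the URL; the helper name is_wanted and the pattern split are illustrative only.

```python
import re

SECTION_RE = re.compile(r"^https?.+/(event|news)/.+$")
EXCLUDED_RE = re.compile(r"page")

def is_wanted(url):
    """Accept URLs in the event/news sections that do not contain the excluded word."""
    # Positive match for the section, then invert a positive match for the excluded word.
    return bool(SECTION_RE.match(url)) and not EXCLUDED_RE.search(url)

print(is_wanted("https://www.foosite.com/news/foopath/page/10"))  # False
print(is_wanted("https://www.foosite.com/news/foopath/10"))       # True
```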
I have some confusion regarding the pattern matching in the following expression. I tried to look up online but couldn't find an understandable solution:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing ? I understood up until the first asterisk , but I can't figure out what is happening after that.
Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a part is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.
(http://i.imgur.com/(.*))(\?.*)?
Debuggex Demo
So this starts with an imgur URL http://i.imgur.com/(.*) (mandatory), followed by any characters until an optional '?' is encountered, and then any characters after the '?'. Notice that '?' has been escaped to remove its regular behaviour. The pink highlights indicate the capture groups.
(http://i.imgur.com/(.*))(\?.*)?
The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.
The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.
The last ? means that the last capturing group is optional.
EDIT:
These groups can then be used as:
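import re
p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
m = p.match('http://i.imgur.com/abc.jpg')  # a string like 'ab' would not match, leaving m as None
print(m.group(0))  # the whole match
print(m.group(2))  # second group: everything after http://i.imgur.com/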
To improve the regex, you must limit the engine to the characters you need, like:
(http://i.imgur.com/([A-Za-z0-9\-]+))(\?[^/]+)?
[A-Za-z0-9\-]+ limits the match to alphanumeric characters and hyphens
[^/] excludes /
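The effect of narrowing the character class can be checked with a short sketch. The URL is made up, and the narrow pattern is one possible tightened form (with the dots escaped as well), not the only way to write it.

```python
import re

greedy = re.compile(r"(http://i.imgur.com/(.*))(\?.*)?")
narrow = re.compile(r"(http://i\.imgur\.com/([A-Za-z0-9\-]+))(\?[^/]+)?")

url = "http://i.imgur.com/abc123?x=1"  # hypothetical URL

g = greedy.match(url)
n = narrow.match(url)

# The greedy (.*) swallows the query string, so the optional third group stays empty.
print(g.groups())  # ('http://i.imgur.com/abc123?x=1', 'abc123?x=1', None)

# The narrow class stops at '?', leaving the query string for the third group.
print(n.groups())  # ('http://i.imgur.com/abc123', 'abc123', '?x=1')
```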
The (.*) means any character repeated any number of times; the (\?.*)? is meant to match the query string of a URL, for example (an imgur search for "cat"):
http://imgur.com/search?q=cat
The http://imgur.com/search part corresponds to the (http://i.imgur.com/(.*)) section of the regex (search specifically corresponds to the (.*)). The ?q=cat corresponds to the (\?.*)? of the regex. The ? at the end of the regex means optional, so there might or might not be a query string. There is no query string in the URL http://www.imgur.com. The parentheses are used for grouping. We want to group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it that matches the page you are requesting (this is (.*)). We want to group (\?.*)? because it matches the query string.