Regular Expression Difficulties - Python

So this is the link I have to extract:
http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark
And this is what I have currently
.+\/article-details\/.+\-.+\-.+\-.+\-.+\-.+$
The issue, however, is that it extracts any number of words and hyphens after the "/article-details/" part, rather than specifically six-word titles with hyphens in place of spaces, as above. So it would accept a bad result like
http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark-test
when I need it to accept only links in this format:
http://www.hrmagazine.co.uk/article-details/one-two-three-four-five-six
What's the correct regular expression for this type of website? The current extractor I have in Scrapy/Spyder is the following
rules = (Rule(LinkExtractor(allow=['.+\/article-details\/.+\-.+\-.+\-.+\-.+\-.+$']), callback='parse_item', follow=True),)

Each of those .+ in your regex can match any number of ANY character, including hyphens, so your overall regex just requires a minimum of 5 hyphens, not an exact count. Use [^-]+ to match only non-hyphen characters.
Note that none of the backslashes in your regex accomplish anything: in no case is the following character one that requires escaping. Even if they were, you'd need to double the backslashes, or use a raw string (r'whatever'), so that the backslashes are interpreted by the re module rather than by Python's string-literal parsing rules.
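For example, the rule from your question could look like this with those changes applied (a sketch only: the callback name and follow flag are taken from your code, and only the pattern changes):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(
        LinkExtractor(
            # six non-hyphen chunks separated by five hyphens, then end of URL
            allow=[r'.+/article-details/[^-]+-[^-]+-[^-]+-[^-]+-[^-]+-[^-]+$'],
            # equivalent shorthand: r'.+/article-details/[^-]+(?:-[^-]+){5}$'
        ),
        callback='parse_item',
        follow=True,
    ),
)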

Try replacing the . with something like [a-z]; . will also match hyphens, which is why it's matching an unlimited number of words:
.+\/article-details\/[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+\-[a-z]+$
If you need to match things like numbers, add them to the brackets as well ([a-z0-9], etc.).
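A quick way to check that variant against the URLs from the question (plain re, outside Scrapy):
import re

# Six [a-z]+ chunks separated by hyphens; extend the class (e.g. [a-z0-9]) if needed.
pattern = re.compile(r'.+/article-details/[a-z]+-[a-z]+-[a-z]+-[a-z]+-[a-z]+-[a-z]+$')

good = 'http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark'
bad = 'http://www.hrmagazine.co.uk/article-details/finance-sector-dominates-working-families-benchmark-test'

print(bool(pattern.match(good)))  # expected: True
print(bool(pattern.match(bad)))   # expected: False (seven chunks, six hyphens)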

Related

Regex Match on String (DOI)

Hi, I'm struggling to understand why my regex isn't working.
I have URLs that have DOIs on them, like so:
https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural+Resources+Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true
And I'm using, for example, this regex, but it always returns an empty list:
print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
Where have I gone wrong?
It looks like you come from another programming language that has the notion of regex literals that are delimited with forward slashes and have the modifiers following the closing slash (hence /i).
In Python there is no such thing, and these slashes and modifier(s) are taken as literal characters. For flags like i you can use the optional flags parameter of findall.
Secondly, ^ will match the start of the input string, but evidently the URLs you have as input do not start with 10, so that has to go. Instead you could require that the 10 must follow a word boundary (\b), i.e. that it is not preceded by an alphanumerical character or underscore.
Similarly, $ will match the end of the input string, but you have URLs that continue with URL parameters, like ?nols=y, so again the part you are interested in does not go on until the end of the input. So that has to go too.
The dot has a special meaning in regex, but you clearly intended to match a literal dot, so it should be escaped.
Finally, word characters can be matched with \w, which covers digits, the underscore, and both lower-case and capital Latin letters, so you can shorten the character class a bit and do without flags such as i (re.I).
This leaves us with:
print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
                 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
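Applied to a few of the URLs listed in the question, that looks roughly like this (a sketch; the URL list is just a sample from above):
import re

urls = [
    'https://link.springer.com/10.1007/s00737-021-01116-5',
    'https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228',
    'https://dx.doi.org/10.1108/14664100110397304?nols=y',
    'https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true',
]

doi_pattern = re.compile(r'\b10\.\d{4,9}/[-.;()/:\w]+')

for url in urls:
    # expected to print the DOI-looking part, e.g. ['10.1007/s00737-021-01116-5']
    print(doi_pattern.findall(url))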

Regular expression match / split

I am having some trouble trying to figure out how to use regular expressions in Python. Ultimately I am trying to do what sscanf does for me in C.
I am trying to match given strings that look like so:
12345_arbitrarystring_2020_05_20_10_10_10.dat
I seem to be able to validate this format by calling match on the following regular expression:
regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')
(Note that I do allow for a few other separators than just '_')
I would like to split the given string on these separators so I do:
regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string)
This is all fine... the problem is that I would like my 'arbitrarystring' part to be able to include '-' and '_', and the last split currently, well, splits them.
Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?
You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.
You could for example use a character class to match 1+ word characters or a hyphen using [\w-]+
If you still want to use split, you could add capturing groups for the first and the second part, and split only those groups.
^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
          ^^^^^^^^
It seems to be possible to cut your regex down and still validate the whole pattern:
^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$
Refer to group 1 for your arbitrary string.
Quick reminder: you didn't use raw strings, but instead escaped with double backslashes. Python has raw strings, which mean you don't have to escape the backslashes at all.
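Putting it together with a raw string, a quick sketch using the filename from the question:
import re

filename = '12345_arbitrarystring_2020_05_20_10_10_10.dat'

# Raw string, so no doubled backslashes are needed.
pattern = re.compile(r'^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$')

m = pattern.match(filename)
if m:
    print(m.group(1))  # expected: 'arbitrarystring'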

How to write a regex to match characters separated by 3 slashes?

Any string could appear between the slashes, but there must be only 3 slashes dividing the string. For instance,
Values that should match:
"90/90/90/9090"
"FDSAFDSA/90/pppppaA3/9090"
Values that should not match:
"90/90/90/9090/90"
"FDSAFDSA/90/pppppaA3/9090/90"
I am using Python and the re library. I have tried lots of combinations, but none of them worked:
bool(re.match(r'^.*\/.*\/.*\/((?!\/).)*$', "90/90/90/9090/90"))
bool(re.match(r'^.*\/.*\/.*\/((?!/).)*$', "90/90/90/9090/90"))
bool(re.match(r'^.*\/.*\/.*\/(?!(/)$).*$', "90/90/90/9090/90"))
bool(re.match(r'^.*\/.*\/.*\/(/).*$', "90/90/90/90/90"))
bool(re.match(r'^.*\/.*\/.*\/.*(\/)$', "90/90/90/90/90"))
You ought to use negated character classes:
^[^/]*/[^/]*/[^/]*/[^/]*$
[^/] matches any character except /, so the pattern describes a string that contains exactly three / with anything else around them.
In your regexes, the . could match anything including /, and while you could have approximated an equivalent of a negated character class using negative lookarounds, you didn't apply them to every . you had: ^((?!\/).)*\/((?!\/).)*\/((?!\/).)*\/((?!\/).)*$ would have worked too, although it would have been less performant.
And there's no need to escape those /, they aren't regex meta-characters. You might need to escape them in languages or tools that use / as delimiters, such as JavaScript's /pattern/ syntax or sed's s/search/replace/ substitutions.
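A quick check against the values from the question:
import re

pattern = re.compile(r'^[^/]*/[^/]*/[^/]*/[^/]*$')

should_match = ['90/90/90/9090', 'FDSAFDSA/90/pppppaA3/9090']
should_not_match = ['90/90/90/9090/90', 'FDSAFDSA/90/pppppaA3/9090/90']

for s in should_match + should_not_match:
    # expected: True for the first two, False for the last two
    print(s, bool(pattern.match(s)))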

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n", text) only matches the single-line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match the 2 newlines if they are followed by a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
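For illustration, a small self-contained sketch of how that split behaves (the sample text is made up and only mimics the placeholder field layout from the question, not the actual trial document):
import re

text = (
    "Date: <date>\n\n"
    "URL: <url>\n\n"
    "Org Study ID: <id>\n\n"
    "Keywords: <keyword one>\n"
    "  <keyword two>"
)

fields = re.split(r"\n{2,}(?=\S)", text)
print(fields)
# The indented continuation line stays attached to its field,
# and every field keeps its first character.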
You want a lookahead. You might also want it to be more flexible about how many newlines there are and which newline characters are used. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
though this does seem to insert \r\n characters into the list... that happens because re.split returns the text matched by any capturing groups in the pattern; a non-capturing group (?:...) avoids it.
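For instance, the same idea with a non-capturing group (sketch, on a made-up sample string):
import re

text = "field one\r\nfield two\n\nfield three"

# (?:...) keeps the alternation but stops re.split from returning the separators.
r = re.compile(r"(?:\r\n|\r|\n)+(?=\S)")
print(r.split(text))  # expected: ['field one', 'field two', 'field three']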

Looking for a regular expression including alphanumeric + "&" and ";"

Here's the problem:
split=re.compile('\\W*')
This regular expression works fine when dealing with regular words, but there are occasions where I need the expression to include words like k&auml;ytt&auml;j&auml; (i.e. käyttäjä written with HTML entities).
What should I add to the regex to include the & and ; characters?
I would treat the entities as a unit (since they can also contain numerical character codes), resulting in the following regular expression:
(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+
This matches, at least once,
- either a word character (including “_”), or
- an HTML entity, consisting of
  - the character “&”,
  - then either
    - the character “#”, followed by
      - the character “x” and at least one hexadecimal digit, or
      - at least one decimal digit,
    - or at least one letter (= named entity),
  - and a semicolon.
/EDIT: Thanks to ΤΖΩΤΖΙΟΥ for pointing out an error.
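For a rough test of that pattern (the sample sentence is made up; note that with capturing groups in the pattern, re.findall would return the groups, so re.finditer and group(0) are used to see the whole matches):
import re

entity_word = re.compile(r'(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+')

text = 'some text with k&auml;ytt&auml;j&auml; in it'
for m in entity_word.finditer(text):
    print(m.group(0))
# expected: 'some', 'text', 'with', 'k&auml;ytt&auml;j&auml;', 'in', 'it'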
You probably want to approach the problem in reverse, i.e. find all the characters that are not whitespace:
[^ \t\n]*
Or you want to add the extra characters:
[a-zA-Z0-9&;]*
In case you want to match HTML entities, you should try something like:
(\w+|&\w+;)*
You should make a character class that includes the extra characters. For example:
split=re.compile('[\w&;]+')
This should do the trick. For your information:
\w (lower case 'w') matches word characters (alphanumeric plus underscore)
\W (capital W) is the negated class (it matches any non-word character)
* matches 0 or more times and + matches one or more times, so * will also match an empty string (even if there are no characters there).
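A quick check of that compiled pattern (sketch; the sample string is made up):
import re

split = re.compile(r'[\w&;]+')

print(split.findall('k&auml;ytt&auml;j&auml; and some plain words'))
# expected: ['k&auml;ytt&auml;j&auml;', 'and', 'some', 'plain', 'words']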
Looks like this RegEx did the trick:
split=re.compile('(\\\W+&\\\W+;)*')
Thanks for the suggestions. Most of them worked fine on Reggy, but I don't quite understand why they failed with re.compile.
