Generalize regex to search for Wikipedia Categories - python

I have the following string of text (taken from the Wikipedia dumps)
text = "[[Category:Ethnic groups| ]]\n[[Category:Ethnic groups by region|*]]\n[[Category:Society-related lists|Ethnic groups]]\n[[Category:Lists of ethnic groups]]"
and I would like to extract all the categories in the text. So basically the ideal output should be
text = "[Ethnic groups,Ethnic groups by region,Society-related lists|Ethnic groups,Lists of ethnic groups]"
This is my attempt at a solution:
import re
categories = re.findall(r'\b(Category:.*)\b', text)
categories = [category.replace("Category:", "") for category in categories]
which returns what I want. However, I'm not sure this is the best way to generalize the regular expression. In particular, I would like to search for "[[Category:" instead of just "Category:" because that's the actual Wikipedia definition for the category links. Do you have any suggestions on how I can improve my regular expression?

First, you don't need to do a search followed by a replacement; you can do it in one step using a capture group (re.findall returns only the capture groups when the pattern contains any, otherwise it returns the whole match).
Looking for [[Category: instead of \bCategory: is probably a good idea. All you have to do is to escape opening square brackets since they are special regex characters.
Instead of .*\b you should use something more restrictive like [^\]|]*(?:\|(?!\*)[^\]|]*)* that excludes the closing square bracket and the pipe followed by an asterisk. However, using .*\b is also fine if you are sure that the data you want to extract ends with a word character and that there is only one [[Category:...]] per line. A good compromise is [^\]]*\b.
So in one step:
categories = re.findall(r'\[\[Category:([^\]]*\b)', text)
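As a quick check, the one-step pattern can be run against the sample text from the question. This is only a minimal sketch; note that the sort key after the pipe is kept when it ends in a word character, as in the third link.

```python
import re

text = (
    "[[Category:Ethnic groups| ]]\n"
    "[[Category:Ethnic groups by region|*]]\n"
    "[[Category:Society-related lists|Ethnic groups]]\n"
    "[[Category:Lists of ethnic groups]]"
)

# Capture everything after "[[Category:" up to the last word boundary
# that occurs before the closing square bracket.
categories = re.findall(r'\[\[Category:([^\]]*\b)', text)
print(categories)
# ['Ethnic groups', 'Ethnic groups by region',
#  'Society-related lists|Ethnic groups', 'Lists of ethnic groups']
```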

I would go with:
re.findall(r"\bCategory:(.*)\b", text)
which should return only the values needed (thanks to the parentheses).

Related

Use regex to remove a substring that matches a beginning of a substring through the following comma

I haven't found any helpful Regex tools to help me figure this complicated pattern out.
I have the following string:
Myfirstname Mylastname, Department of Mydepartment, Mytitle, The University of Me; 4-1-1, Hong,Bunk, Tokyo 113-8655, Japan E-mail:my.email#example.jp, Tel:00-00-222-1171, Fax:00-00-225-3386
I am trying to learn enough Regex patterns to remove the substrings one at a time:
E-mail:my.email#example.jp
Tel:00-00-222-1171
Fax:00-00-225-3386
So I think the correct pattern would be to remove a given word (i.e., "E-mail", "Tel") all the way through the following comma.
Is this type of dynamic pattern possible in Regex?
I am performing the match in Python, however, I don't think that would matter too much.
Also, I know the data string looks comma separated, and it is. However there is no guarantee of preserving the order of those fields. That's why I'm trying to use a Regex match.
How about this regex:
<YOUR_WORD>.*?(?=(,|($)))
Explanation:
It looks for the word specified in <YOUR_WORD> placeholder
It looks for any kind of character afterwards
The search stops when it hits one of the two options:
It finds the character ,
It finds an end of the line
So:
E-mail.*?(?=(,|($)))
Will result in:
E-mail:my.email#example.jp
And
Fax.*?(?=(,|($)))
Will result in:
Fax:00-00-225-3386
If there are edge cases it misses, I would like to know, along with whether handling them affects performance or is even necessary.
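In Python, the pattern above could be applied roughly as follows. This is a sketch under assumptions: field_pattern is a hypothetical helper name, and re.escape is used so that the word is treated literally.

```python
import re

s = ("Myfirstname Mylastname, Department of Mydepartment, Mytitle, "
     "The University of Me; 4-1-1, Hong,Bunk, Tokyo 113-8655, Japan "
     "E-mail:my.email#example.jp, Tel:00-00-222-1171, Fax:00-00-225-3386")

def field_pattern(word):
    # Hypothetical helper: the literal word, then as few characters as
    # possible up to (but not including) the next comma or the end of the string.
    return re.compile(re.escape(word) + r".*?(?=,|$)")

email = field_pattern("E-mail").search(s).group(0)
fax = field_pattern("Fax").search(s).group(0)
without_tel = field_pattern("Tel").sub("", s)

print(email)  # E-mail:my.email#example.jp
print(fax)    # Fax:00-00-225-3386
```

Because the lookahead does not consume the comma, removing a field with sub leaves its trailing comma behind; whether to strip it as well depends on how the result is used.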

Matching data between padding

I'm trying to match some strings in a binary file and the strings appear to be padded. As an example, the word PROGRAM could be in the binary like this:
%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M
In that example, the word PROGRAM is there but it is split up and it's between random data, so I'm trying to use regex to find it.
Currently, this is what I came up with, but I don't think it is very effective:
(?<=P)(.*?)(?=R)(.*?)(?=O)(.*?)(?=G)(.*?)(?=R)(.*?)(?=A)(.*?)(?=M)
If you want to get PROGRAM from the string, one option might be to use re.sub with a negated character class to remove all that you don't want.
[^A-Z]+
For example:
import re
test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
pattern = r'[^A-Z]+'
print(re.sub(pattern, '', test_str))
Result
PROGRAM
This should work for you and is more efficient than your current solution:
P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M
Explanation:
P[^R]+ - match P, then one or more characters other than R
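The same idea can be built programmatically for any word. The sketch below assumes a hypothetical helper named padded_pattern; like the hand-written pattern, it requires at least one padding character between consecutive letters.

```python
import re

def padded_pattern(word):
    # Hypothetical helper: after each letter, allow one or more
    # characters that are not the next letter of the word.
    parts = []
    for current, following in zip(word, word[1:]):
        parts.append(re.escape(current) + "[^" + re.escape(following) + "]+")
    parts.append(re.escape(word[-1]))
    return "".join(parts)

data = '%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M'
pattern = padded_pattern("PROGRAM")
match = re.search(pattern, data)

print(pattern)            # P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M
print(match is not None)  # True
```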
I'm not quite sure what the desired output might be, I'm guessing maybe this expression,
(?=.*?P.*?R.*?O.*?G.*?R.*?A.*?M).*?(P).*?(R).*?(O).*?(G).*?(R).*?(A).*?(M)
might be a start.
The expression is explained on the top right panel of this demo, if you wish to explore further or simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n", text) only matches the single-line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
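To illustrate, here is a minimal sketch on a made-up sample (the field contents are hypothetical and only loosely modeled on the linked document). The blank line before the indented paragraph is not a split point, because the character that follows it is whitespace.

```python
import re

# Hypothetical sample text; the real document has many more fields.
text = (
    "Date: 2020-01-01\n\n"
    "URL: http://example.org/trial\n\n"
    "Summary:\n\n"
    "  An indented paragraph that belongs to the Summary field.\n\n"
    "Keywords: regex, split"
)

# Split on two or more newlines only when a non-whitespace character follows.
# The lookahead is zero-width, so that character stays in the next field.
fields = re.split(r"\n{2,}(?=\S)", text)
for field in fields:
    print(repr(field))
```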
You want a lookahead. You might also want to be more flexible about how many newlines there are and which newline characters are used. You might try this:
import re
r = re.compile(r"(?:\r\n|\r|\n)+(?=\S)")
l = r.split(text)
The group is non-capturing on purpose: with a capturing group, re.split also inserts the matched \r\n characters into the resulting list.

Using regular expressions in Python

I have a couple of huge log files which contain a list of activity names and sub-activities, with a numerical value associated with each sub-activity. I need to write a script to automate the data analysis process. I used regex to get a pattern match for my main activity by doing a word-by-word search. Now I have to find the sub-activity and get the numerical value associated with it.
For example: "Out: Packet Sizes Histogram Bucket 5=10". I need to check for the sub-activity Out: Packet Sizes and get the Histogram Bucket value 5=10. There is a list of sub-activities like this. With my word search technique I find it hard to get a pattern match for my sub-activity. What regex pattern should I use to get the 5=10 value when the pattern matches the entire text before that?
PS: All the sub-activities have the text "Histogram Bucket" repeated. I would greatly appreciate your suggestions to address this issue. I have just started learning regex and Python.
(1) If you want to use one regular expression you could use:
import re
known_activities = ['Out: Packet Sizes']
# you might have to use r'\s' or r'\ ' to protect the whitespace.
activity_exprs = [a.replace(' ', r'\s') for a in known_activities]
regexpr = r'('+'|'.join(activity_exprs)+r')\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)
match = pattern.match(input)  # input holds one line of the log
if match:
    print('Activity: '+match.group(1))
    print('Bucket: '+match.group(2))
(2) If you don't want (or need) to match the activities, you could simply use:
regexpr = r'(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)
match = pattern.match(input)
if match:
    print('Activity: '+match.group(1))
    print('Bucket: '+match.group(2))
(3) If you do want to match activities you can always do so in a separate step:
if match:
    activity = match.group(1)
    if activity in known_activities:
        print('Activity: '+activity)
        print('Bucket: '+match.group(2))
EDIT Some more details and explanations:
items = ['a','b','c']
'|'.join(items)
produces a|b|c. Used in regular expressions, | denotes alternatives, e.g. r'a(b|c)a' will match either 'aba' or 'aca'. So in (1) I basically chained all known activities together as alternatives. Each activity has to be a valid regular expression in itself (that is why any 'special' characters (e.g. whitespace) should be properly escaped).
One could simply mash together all alternatives by hand into one large regular expression, but that gets unwieldy and error prone fast, if there are more than a couple of activities.
All in all, you are probably better off using (2) and, if necessary, (3) or a separate regular expression as a secondary stage.
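As a variation on (1), the escaping can be delegated to re.escape instead of replacing spaces by hand. This is a sketch, not part of the original answer's code; the second activity name is a made-up example.

```python
import re

# Hypothetical list of sub-activity names.
known_activities = ["Out: Packet Sizes", "In: Packet Sizes"]

# re.escape makes every name safe to embed as a literal alternative.
alternatives = "|".join(re.escape(name) for name in known_activities)
pattern = re.compile(r"(" + alternatives + r")\s*Histogram\sBucket\s(\d+=\d+)")

line = "Out: Packet Sizes Histogram Bucket 5=10"
match = pattern.match(line)
if match:
    print("Activity:", match.group(1))  # Activity: Out: Packet Sizes
    print("Bucket:", match.group(2))    # Bucket: 5=10
```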
EDIT2
regarding your sample line you could also use:
regexpr = r'([^\s]*?)\s([^\s]*?)\s([^\s]*?)\s(.*?)\s*Histogram\sBucket\s(\d+=\d+)'
pattern = re.compile(regexpr)
match = pattern.match(input)
if match:
    print('Date: '+match.group(1))
    print('Time: '+match.group(2))
    print('Activity: '+match.group(3))
    print('Sub: '+match.group(4))
    print('Bucket: '+match.group(5))
EDIT3
pattern.match(input) expects to find the pattern directly at the beginning of the input string. That means 'a' will match 'a' or 'abc' but not 'ba'. If your pattern does not start at the beginning, you have to prepend '.*?' to your regular expression to consume as many arbitrary characters as necessary.
'\s' matches any whitespace character, '[^\s]' matches any character that is NOT whitespace.
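The anchoring behavior of match can be seen in a short sketch. The log line below is a made-up example, and re.search is shown as an alternative that is not mentioned above: it scans the whole string instead of anchoring at the start.

```python
import re

# Hypothetical log line: date, time, activity, sub-activity, bucket.
line = "2013-01-01 12:00:00 Main Out: Packet Sizes Histogram Bucket 5=10"
bucket = r"Histogram\sBucket\s(\d+=\d+)"

anchored = re.match(bucket, line)            # None: pattern is not at the start
prefixed = re.match(r".*?" + bucket, line)   # '.*?' consumes the leading text
searched = re.search(bucket, line)           # search looks anywhere in the string

print(anchored)            # None
print(prefixed.group(1))   # 5=10
print(searched.group(1))   # 5=10
```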
If you want to learn more about regular expressions, the Python HOWTO on that matter is quite good.

Matching an object and a specific regex with Python

Given a text, I need to check for each char whether it has exactly (edited) 3 capital letters on both sides and, if so, add it to a string of such characters that is returned.
I wrote the following: m = re.match("[A-Z]{3}.[A-Z]{3}", text)
(let's say text="AAAbAAAcAAA")
I expected to get two groups in the match object: "AAAbAAA" and "AAAcAAA"
Now, when I invoke m.group(0) I get "AAAbAAA", which is right. Yet, when invoking m.group(1), I find that there is no such group, meaning "AAAcAAA" wasn't a match. Why?
Also, when invoking m.groups(), I get an empty tuple although I should get a tuple of the matches, meaning that in my case I should have gotten a tuple with "AAAbAAA". Why doesn't that work?
You don't have any groups in your pattern. To capture something in a group, you have to surround it with parentheses:
([A-Z]{3}).[A-Z]{3}
The exception is m.group(0), which will always contain the entire match.
Looking over your question, it sounds like you aren't actually looking for capture groups, but rather overlapping matches. In regex, a group means a smaller part of the match that is set aside for later use. For example, if you're trying to match phone numbers with something like
([0-9]{3})-([0-9]{3}-[0-9]{4})
then the area code would be in group(1), the local part in group(2), and the entire thing would be in group(0).
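A short sketch of how groups behave, using a made-up phone number and the pattern from the question for comparison:

```python
import re

m = re.match(r"([0-9]{3})-([0-9]{3}-[0-9]{4})", "555-867-5309")
print(m.group(0))  # 555-867-5309 (the entire match)
print(m.group(1))  # 555 (area code)
print(m.group(2))  # 867-5309 (local part)
print(m.groups())  # ('555', '867-5309')

# The pattern from the question has no parentheses, so it has no groups:
m2 = re.match(r"[A-Z]{3}.[A-Z]{3}", "AAAbAAAcAAA")
print(m2.group(0))  # AAAbAAA
print(m2.groups())  # ()
```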
What you want is to find overlapping matches. Here's a Stack Overflow answer that explains how to do overlapping matches in Python regex, and here's my favorite reference for capture groups and regex in general.
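One common way to get overlapping matches, which may or may not be the approach in the linked answer, is to put the whole pattern in a capturing group inside a zero-width lookahead. Since the lookahead consumes no characters, the next search can start inside the previous match. A minimal sketch on the text from the question:

```python
import re

text = "AAAbAAAcAAA"

# The lookahead matches at a position without consuming anything;
# the capturing group records what was seen from that position.
overlapping = re.findall(r"(?=([A-Z]{3}.[A-Z]{3}))", text)
print(overlapping)  # ['AAAbAAA', 'AAAcAAA']

# Capturing only the middle character gives the characters themselves.
middles = "".join(re.findall(r"(?=[A-Z]{3}(.)[A-Z]{3})", text))
print(middles)  # bc
```

Note that this checks for at least three capitals on each side; enforcing exactly three would need additional assertions around the pattern.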
One, you are using match when it looks like you want findall. It won't grab the enclosing capital triplets, but re.findall('[A-Z]{3}([a-z])(?=[A-Z]{3})', search_string) will get you all single lower case characters surrounded on both sides by 3 caps.
