Removing markup links in text - python

I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this:
[the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.
Here is another example:
[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)
I'd like to keep: the podcast list.
How can I do this with Python's re library? What is the appropriate regex?

I have created an initial attempt at your requested regex:
(?<=\[.+\])\(.+\)
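The first part (?<=...) is a lookbehind: it checks that the bracketed text comes before the match without including it in the match. Note that Python's built-in re module only accepts fixed-width lookbehinds, so this pattern (which has .+ inside the lookbehind) raises an error there; it needs an engine that supports variable-length lookbehind, such as the third-party regex module, where you can use it with sub. You can look up the meaning of each regex symbol in the documentation for Python's re module.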
You can extend the above regex to look for only things that have weblinks in the brackets, like so:
(?<=\[.+\])\(https?:\/\/.+\)
The problem with this is that a link that does not start with http or https will not be matched.
After this you will need to remove the square brackets; if the text contains no other square brackets, simply removing all of them may work fine for you.
Edit 1:
Valentino pointed out that substitute accepts capturing groups, which lets you capture the text and substitute the text back in using the following regex:
\[(.+)\]\(.+\)
You can then substitute the first captured group (in the square brackets) back in using:
re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)
If you want to look at the regex in more detail (if you're new to regex or want to learn what each part means), I would recommend an online regex interpreter. These explain what each symbol does, which makes the pattern much easier to read (especially when there are lots of escaped symbols, as there are here).
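One caveat with the greedy .+ above: if a line contains two links, the pattern matches from the first [ to the last ), so the text in between is treated as one big link. Here is a minimal sketch of a stricter variant, assuming the link text never contains ] and the URL never contains ). The names LINK and strip_links are just for illustration.

```python
import re

# Assumption: link text has no ']' and the URL has no ')'.
# Negated character classes keep each match inside a single link.
LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")

def strip_links(text):
    """Replace every [text](url) with just the text."""
    return LINK.sub(r"\1", text)

sample = ("See [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) "
          "and [the text you read](https://website.com/to/go/to).")
print(strip_links(sample))
# → See the podcast list and the text you read.
```

URLs that themselves contain parentheses would need a more careful pattern or a real Markdown parser.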

Related

Generalize regex to search for Wikipedia Categories

I have the following string of text (taken from the Wikipedia dumps)
text = "[[Category:Ethnic groups| ]]\n[[Category:Ethnic groups by region|*]]\n[[Category:Society-related lists|Ethnic groups]]\n[[Category:Lists of ethnic groups]]"
and I would like to extract all the categories in the text. So basically the ideal output should be
text = "[Ethnic groups,Ethnic groups by region,Society-related lists|Ethnic groups,Lists of ethnic groups]"
This is my attempt at a solution
import re
categories = re.findall(r'\b(Category:.*)\b', text)
categories = [category.replace("Category:", "") for category in categories]
which returns what I want. However, I'm not sure this is the best way to generalize the regular expression. In particular, I would like to search for "[[Category:" instead of just "Category:" because that's the actual Wikipedia definition for the category links. Do you have any suggestions on how I can improve my regular expression?
First, you don't need to do a search followed by a replacement; you can do it in one step using a capture group (re.findall returns only the capture groups when the pattern contains any, otherwise it returns the whole match).
Looking for [[Category: instead of \bCategory: is probably a good idea. All you have to do is to escape opening square brackets since they are special regex characters.
Instead of .*\b you should use something more restrictive like ([^\]|]*(?:\|(?!\*)[^\]|]*)*) that excludes the closing square bracket and the pipe followed by an asterisk. However, using .*\b is also a good idea if you are sure that the data you want to extract ends with a word character and if there is only one [[Category:...]] per line. A good compromise would be [^\]]*\b
So in one step:
categories = re.findall(r'\[\[Category:([^\]]*\b)', text)
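For reference, here is that one-step pattern run against the sample text from the question. Nothing new is assumed beyond the string given there. The trailing \b is what trims the "| " and "|*" sort keys, because it makes the match backtrack to the last word character.

```python
import re

text = ("[[Category:Ethnic groups| ]]\n"
        "[[Category:Ethnic groups by region|*]]\n"
        "[[Category:Society-related lists|Ethnic groups]]\n"
        "[[Category:Lists of ethnic groups]]")

# [^\]]* cannot run past the closing brackets; \b forces the capture
# to end on a word character.
categories = re.findall(r"\[\[Category:([^\]]*\b)", text)
print(categories)
# → ['Ethnic groups', 'Ethnic groups by region',
#    'Society-related lists|Ethnic groups', 'Lists of ethnic groups']
```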
I would go with:
re.findall(r"\bCategory:(.*)\b", text)
which should return only the values needed (thanks to the parentheses).

regex to extract mentions in Twitter

I need to write a regex in python to extract mentions from Tweets.
My attempt:
regex=re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9]+)")
It works fine for any mention like #mickey
However, in mentions with underscores like #mickey_mouse, it only extracts #mickey.
How can I modify the regex for it to work in both cases?
Thank you
Add an underscore to the last set like this:
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)
On a side note, Twitter Handle rules allow you to have usernames starting with numbers & underscores as well. So to extract twitter handles a regex could be as simple as: #\w{1,15} (allows characters, numbers and underscores and includes the 15 character limit). Will need some additional lookaheads/lookbehinds based on where the regex might be used.
A shorter version, including the negative cases from #degant:
(?<=#)\w+
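As a rough sketch of the "additional lookbehinds" idea, the snippet below combines the \w{1,15} suggestion with a negative lookbehind so that a # in the middle of a word is skipped. The pattern, the names MENTION and extract_mentions, and the sample text are my own assumptions for illustration, not something taken from Twitter's rules.

```python
import re

# Hypothetical pattern: '#' must not be preceded by a word character,
# '.' or '-', and is followed by 1 to 15 word characters
# (letters, digits, underscore).
MENTION = re.compile(r"(?<![\w.-])#(\w{1,15})")

def extract_mentions(text):
    """Return the names that follow each standalone '#'."""
    return MENTION.findall(text)

sample = "#mickey_mouse says hi to #donald99, but not to foo#bar"
print(extract_mentions(sample))
# → ['mickey_mouse', 'donald99']
```

Note that \w{1,15} does not reject longer names; it just captures their first 15 characters, so add a trailing \b or lookahead if that matters.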

Extracting parenthesis with a specific format with Python

I am fairly new to Python, so I apologize if this is quite a novice question, but I am trying to extract text in parentheses that has a specific format from a raw text file.
I have tried this with regular expressions, but please let me know if there is a better method.
To show what I want to do by example:
s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
From this string I want a result something like:
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
The regular expression I have tried so far is
"(\(.+[,] [0-9]{4}\))"
in conjunction with re.findall(), however this only gives me the result:
['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']
So, as you may have guessed, I am trying to extract the bibliographic references from a .txt file. But I don't want to extract anything that happens to be in parentheses that is not a bibliographic reference.
Again, I apologize if this is a novice question, and again if there is a question like this out there already. I have searched, but no luck as yet.
Use [^()] instead of the dot (.). This makes sure there are no nested () inside the match.
>>> re.findall(r"(\([^()]+[,] [0-9]{4}\))", s)
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
Assuming that you will have no nested brackets, you could use something like so: (\([^()]+?, [0-9]{4}\)). This will match any non-bracket characters within a set of parentheses, followed by a comma, a white space, four digits and a closing parenthesis.
I would suggest something like \(\w+,\s+[0-9]{4}\). A couple changes from your original:
Match word characters (letters/numbers/underscores) instead of any character in the source name.
Match one or more space characters after the comma, instead of limiting yourself to a single literal space.
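To see how the suggestions differ, here is a small comparison. The multi-word author "Smith and Jones" is a made-up example that is not in the question; the \w+ version skips it, while the [^()]+ version keeps it.

```python
import re

# Hypothetical input: the second reference has a multi-word author.
s = "Testing (Stackoverflow, 2013). Testing (again) (Smith and Jones, 1999)"

# Single-word source name only.
strict = re.findall(r"\(\w+,\s+[0-9]{4}\)", s)
# Any run of non-parenthesis characters before the comma.
loose = re.findall(r"\([^()]+,\s+[0-9]{4}\)", s)

print(strict)  # → ['(Stackoverflow, 2013)']
print(loose)   # → ['(Stackoverflow, 2013)', '(Smith and Jones, 1999)']
```

Which one is better depends on what the references in your file actually look like.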

Comments in string and strings in comments

I am trying to count the characters in comments in C code using Python and regex, but with no success. I can erase strings first to get rid of comments inside strings, but this will also erase strings inside comments, so the result will be wrong. Is there a way to make a regex skip strings inside comments, or vice versa?
No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
Regular expressions are not always a replacement for a real parser.
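A minimal sketch of such a state machine is below. It is not a full C lexer: I split COMMENT into line and block comments (so four states rather than three), I count only the characters between the comment markers, and I ignore character literals such as '"', line continuations and trigraphs. The function name count_comment_chars is mine.

```python
def count_comment_chars(source):
    """Count characters inside C comments, skipping comment markers in strings."""
    CODE, STRING, LINE_COMMENT, BLOCK_COMMENT = range(4)
    state = CODE
    count = 0
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if state == CODE:
            if ch == '"':
                state = STRING
            elif ch == "/" and nxt == "/":
                state = LINE_COMMENT
                i += 1  # skip the second '/'
            elif ch == "/" and nxt == "*":
                state = BLOCK_COMMENT
                i += 1  # skip the '*'
        elif state == STRING:
            if ch == "\\":
                i += 1  # skip the escaped character, e.g. \"
            elif ch == '"':
                state = CODE
        elif state == LINE_COMMENT:
            if ch == "\n":
                state = CODE
            else:
                count += 1
        else:  # BLOCK_COMMENT
            if ch == "*" and nxt == "/":
                state = CODE
                i += 1  # skip the closing '/'
            else:
                count += 1
        i += 1
    return count

print(count_comment_chars('s = "// no"; // yes'))  # → 4
```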
You can strip out all strings that aren't in comments by searching for the regular expression:
'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)
and replacing with (written as \1 or \g<1> in Python's re.sub):
$1
Essentially, this searches for the regex string|(comment) which matches a string or a comment, capturing the comment. The replacement is either nothing if a string was matched or the comment if a comment was matched.
Though regular expressions are not a replacement for a real parser you can quickly build a rudimentary parser by creating a giant regex that alternates all of the tokens you're interested in (comments and strings in this case). If you're writing a bit of code to handle comments, but not those in strings, iterate over all the matches of the above regex, and count the characters in the first capturing group if it participated in the match.
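In Python, the alternation idea might look like the sketch below. Two things here are my assumptions and not part of the regex above: I match double-quoted C strings with backslash escapes instead of single-quoted ones, and the counted length includes the //, /* and */ markers because they are part of the capturing group. TOKEN, comment_chars and strip_strings are hypothetical names.

```python
import re

# string | (comment): only the comment branch is captured.
TOKEN = re.compile(r'"(?:\\.|[^"\\\n])*"|(//[^\n]*|/\*.*?\*/)', re.DOTALL)

def comment_chars(source):
    """Sum the lengths of all comments, markers included."""
    total = 0
    for m in TOKEN.finditer(source):
        comment = m.group(1)
        if comment is not None:  # None means a string literal matched
            total += len(comment)
    return total

def strip_strings(source):
    """Remove string literals that are not inside comments."""
    return TOKEN.sub(lambda m: m.group(1) or "", source)

print(comment_chars('s = "// no"; // yes'))   # → 6
print(strip_strings('a "x" /* c */ b'))       # → a  /* c */ b
```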

Regex matching very slow

I am trying to parse a PDF to extract the text from it (please don't suggest any libraries to do this, as this is part of learning the format).
I have already handled deflating it to put it in the alphanumeric format. I now need to extract the text from the text blocks.
So, my current pattern is BT.*?\((.*?)\).*?ET (with DOTMATCHALL set) to match something like:
BT
/F13 12 Tf
288 720 Td
(ABC) Tj
ET
The only bit I want is the text ABC in the brackets.
The above is only formatted like that to make it clear to see. In the deflated text it may be all in one line, it may not be. There is no guarantee that the BT/ET will be at the start of a line. There may be spaces and text before/after the bracketed section, there may not be. There will, however, be only one bracketed section per BT/ET block.
The above pattern works, but is really slow, I assume it is because the regex library is failing to match the pattern that matches the text between BT and the (ABC) many times.
The regex is pre-compiled in an attempt to speed it up, but it seems negligible.
How may I speed this up?
How many of these blocks might appear in a document?
Often slow Regex execution is the result of catastrophic backtracking, as described here: http://www.regular-expressions.info/catastrophic.html
I don't know what regex technology you're using, but you could try to use lookaround assertions, as described here:
http://www.regular-expressions.info/lookaround.html
These allow you to first just match what you want, ABC inside parentheses, and then validate that it is preceded by some value and followed by some other value.
Are you sure the regex is correct and pulls out ABC as a match? What language's regex engine is this? Using my regular expression debugger shows that:
"BT.*?((.*?)).*?ET" doesn't pull out ABC and in fact must find the string 'ET' then backtrack back to find everything else.
"BT.*?\\((.*?)\\).*?ET" works as expected with a single pass left to right.
Here's one without regex: simple string parsing using Python built-ins.
>>> xtract = """
... BT
... /F13 12 Tf
... 288 720 Td
... (ABC) Tj
... ET
...
... """
>>> for chunk in xtract.split("ET"):
...     if "BT" in chunk:
...         # look at each piece that ends just before a ")"
...         for brace in chunk.split(")"):
...             if "(" in brace:
...                 # print everything after the opening "("
...                 print(brace[brace.find("(")+1:])
...
ABC
You can't just parse the PDF with a regex to extract the text. In most cases the text is inside compressed binary blobs or is encoded. A PDF with the text shown like this is very much the exception.
There's not really enough info for a definite answer--or maybe you're assuming we know more about PDF than you do. Are there always parenthesized chunks inside these BT...ET sections? Is there always only one of them? Is the BT or ET always at the beginning of a line? If so, I would suggest
(?m)^BT[^()]*\((.*?)\)[^()]*?^ET
If I knew how PDF represented literal parentheses, I could probably come up with something more efficient.
EDIT: According to the PDF spec, literal parentheses have to be escaped with a backslash, and there are a bunch of other backslash-escape sequences. So try this:
(?s)\bBT\b[^()]*\(((?:[^()\\]*(?:\\.[^()\\]*)*))\)
This part--[^()\\]*(?:\\.[^()\\]*)*--matches a block of text which may contain escaped characters (including parens), but not unescaped parens. I know it looks ugly, but it's the most efficient way in Python versions before 3.11, which don't support atomic groups or possessive quantifiers.
(?s) allows . to match newlines, and \bBT\b makes sure the BT isn't part of a longer "word". I'm reasonably confident that this is all I need to match all of the actual text content, so I don't bother matching the stuff after the closing paren.
Since there will be only one bracketed expression between a BT and an ET, you could try the following regular expression for speed:
r"(?s)\bBT\b[^(]*\(([^)]*)\).*?\bET\b"
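Here is a short sketch that runs the last two patterns on the block from the question, plus a made-up block containing escaped parentheses. The sample strings and the names SIMPLE, ESCAPED and extract_text are mine. Both patterns avoid the leading .*? by using negated character classes, which is where most of the backtracking in the original came from.

```python
import re

# Pattern from the last answer: assumes no ')' inside the text.
SIMPLE = re.compile(r"(?s)\bBT\b[^(]*\(([^)]*)\).*?\bET\b")
# Escape-aware pattern from the earlier answer: allows \( and \) in the text.
ESCAPED = re.compile(r"(?s)\bBT\b[^()]*\(((?:[^()\\]*(?:\\.[^()\\]*)*))\)")

def extract_text(stream, pattern=ESCAPED):
    """Return the parenthesized text of every BT block found."""
    return pattern.findall(stream)

block = "BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\nET"
escaped_block = r"BT /F13 12 Tf (a\(b\)c) Tj ET"  # hypothetical input

print(extract_text(block, SIMPLE))    # → ['ABC']
print(extract_text(block))            # → ['ABC']
print(extract_text(escaped_block))    # → ['a\\(b\\)c']
```

On escaped_block the SIMPLE pattern stops at the first ), so it only returns part of the text; that is the case the escape-aware version is meant to handle.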
