regex to extract mentions in Twitter

I need to write a regex in Python to extract mentions from tweets.
My attempt:
regex=re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9]+)")
It works fine for any mention like #mickey
However, in mentions with underscores like #mickey_mouse, it only extracts #mickey.
How can I modify the regex for it to work in both cases?
Thank you

Add an underscore to the last set like this:
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)
On a side note, Twitter's handle rules also allow usernames that start with digits or underscores. So a regex to extract Twitter handles could be as simple as: #\w{1,15} (matches letters, digits and underscores, and enforces the 15-character limit). You will need some additional lookaheads/lookbehinds depending on where the regex is used.
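As a rough sketch of how the \w{1,15} idea might look in Python with the extra lookarounds: the name extract_handles, the sample tweet, and the exact lookbehind/lookahead choices are illustrative assumptions, not part of the answer above. The prefix character is kept exactly as written in the posts.

```python
import re

# Assumption: a handle is 1-15 word characters after the prefix.
# (?<![\w.-]) rejects a prefix glued to a preceding word (e.g. inside an address),
# (?!\w) rejects handles longer than 15 characters instead of truncating them.
HANDLE = re.compile(r"(?<![\w.-])#(\w{1,15})(?!\w)")

def extract_handles(text):
    """Return the handles found in text, without the leading prefix."""
    return HANDLE.findall(text)

tweet = "thanks #mickey_mouse and #donald! mail: foo#bar.com"
print(extract_handles(tweet))  # ['mickey_mouse', 'donald']
```

A negative lookbehind is used here because it has a fixed width of one character, which Python's re module accepts.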

A shorter version, including the negative cases from #degant:
(?<=#)\w+

Related

Removing markup links in text

I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this:
[the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.
Here is another example:
[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)
I'd like to keep: the podcast list.
How can I do this with Python's re library? What is the appropriate regex?

I have created an initial attempt at your requested regex:
(?<=\[.+\])\(.+\)
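The first part, (?<=...), is a lookbehind: it checks that the text is there but does not include it in the match. You can use this regex along with re's sub method. Be aware that Python's built-in re module only accepts fixed-width lookbehinds, so a lookbehind containing .+ raises an error there; it works in engines that allow variable-width lookbehinds, and Edit 1 below avoids the lookbehind altogether.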
You can extend the above regex to look for only things that have weblinks in the brackets, like so:
(?<=\[.+\])\(https?:\/\/.+\)
The problem with this is that it fails if the link does not start with http or https.
After this you will still need to remove the square brackets; simply removing all square brackets may work fine for you.
Edit 1:
Valentino pointed out that substitute accepts capturing groups, which lets you capture the text and substitute the text back in using the following regex:
\[(.+)\]\(.+\)
You can then substitute the first captured group (in the square brackets) back in using:
re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)
If you want to look at the regex in more detail (if you're new to regex or want to learn what the symbols mean), I would recommend an online regex interpreter. These tools explain what each symbol does, which makes the pattern much easier to read (especially when there are lots of escaped symbols, as there are here).
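One caveat with the capturing-group pattern: .+ is greedy, so when a line contains several links it can run from the first [ to the last ). A minimal sketch of a stricter variant follows; the negated character classes and the name strip_links are my own assumptions, not part of the answer above.

```python
import re

# [^\]]+ cannot cross a closing bracket and [^)]+ cannot cross a closing
# parenthesis, so each link is matched on its own.
LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")

def strip_links(text):
    """Replace [label](url) with just label."""
    return LINK.sub(r"\1", text)

text = ("See [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) "
        "and [the text you read](https://website.com/to/go/to).")
print(strip_links(text))  # See the podcast list and the text you read.
```

This sketch does not handle URLs that themselves contain a closing parenthesis.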

Stripping non printable characters from a string in python?

So currently I am trying to find out how many times a specific word appears on a page.
My Python code has this:
print(len(re.findall(secondAnswer, page)))
0
Upon careful analysis, I noticed that
print(secondAnswer) is giving me a different answer "Pacific"
from print(ascii(secondAnswer)) 'Paci\ufb01c'
I have a feeling that my secondAnswer value in len(re.findall(secondAnswer, page)) is using 'Paci\ufb01c' instead and thus not finding any matches on the page.
Can someone give me any tips on how to solve this?
Thanks, Nick

Unicode character U+FB01 is the ﬁ ligature. That is, it's a single character as far as Python is concerned, but appears as two (tied) characters when displayed.
To decompose ligatures into their separate characters, you can use unicodedata.normalize. For example:
page = unicodedata.normalize("NFKD", page)
Or in this specific case, you could write your regex to accept the ligature as an alternate for the fi character sequence, for example by using alternation with a non-capturing group: paci(?:fi|ﬁ)c.
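A small sketch of the normalization approach, assuming both the search word and the page text may contain ligatures. The helper name count_word and the sample strings are made up for illustration.

```python
import re
import unicodedata

def count_word(word, page):
    """Count occurrences of word in page after decomposing ligatures in both."""
    word = unicodedata.normalize("NFKD", word)
    page = unicodedata.normalize("NFKD", page)
    # re.escape keeps any regex metacharacters in the word literal.
    return len(re.findall(re.escape(word), page))

second_answer = "Paci\ufb01c"  # contains the U+FB01 ligature
page = "The Pacific Ocean and the Paci\ufb01c Rim"
print(count_word(second_answer, page))  # 2
```

Note that NFKD also decomposes accented letters into a base letter plus combining marks, which is harmless here because both strings are normalized the same way.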

Regex to identify Reddit usernames

I am making a bot with an option to post only when the username is a certain user.
Reddit usernames can contain letters in both cases, and have numbers.
Which regex can be used to identify such a username? The format is /u/USERNAME where username can have letters of both cases and numbers, such as ExaMp13.
I have tried /u/[A-Z][a-z][0-9]

Reddit usernames are preceded by /u/ and may contain the following characters:
UPPERCASE
lowercase
Digits
Underscore
Hyphen
This regex meets those criteria:
/u/[A-Za-z0-9_-]+

Brief
Thanks for updating your post with something you've tried as this gives us an idea of what you may not be understanding (and helps us explain where you went wrong and how to fix it).
Your regex doesn't work because it checks for exactly one character from [A-Z], followed by one from [a-z], then one from [0-9]. So it will only match something like Be1.
Answer
What you should instead try for is [a-zA-Z0-9] or \w and specifying a quantifier such as + (one or more).
For your specific problem, you should use \/u\/(\w+) (or /u/(\w+), since Python doesn't require the forward slash to be escaped). This will allow you to then check the first capture group against a list of users you want to not post for.
These regular expressions match /u/ followed by one or more word characters. In ASCII terms \w is [a-zA-Z0-9_]; for Python 3 str patterns it also covers other Unicode word characters unless the re.ASCII flag is set.
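As a hedged sketch of checking the captured name against a set of users: the function names and the sample user set below are hypothetical, and the character class is the one from the earlier answer so that hyphens are accepted too.

```python
import re

# Capture the name after /u/; the class allows letters, digits, "_" and "-".
USER = re.compile(r"/u/([A-Za-z0-9_-]+)")

def mentioned_users(text):
    """Return every username that appears as /u/NAME in text."""
    return USER.findall(text)

def mentions_any(text, users):
    """True if text mentions at least one of the given usernames."""
    return any(name in users for name in mentioned_users(text))

watched = {"ExaMp13"}  # hypothetical set of usernames
comment = "thanks /u/ExaMp13 and /u/some-one_2"
print(mentioned_users(comment))        # ['ExaMp13', 'some-one_2']
print(mentions_any(comment, watched))  # True
```

Whether a True result means "post" or "skip" depends on how the bot's option is defined.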

You can use a regex like this:
/u/\w+

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I've found so far is that the definitions generally sit between <dd><p> and </dd></dl>. When I try
import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.

Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
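The difference is easy to see on a small input. The HTML snippet below is invented for illustration and is much simpler than the real documentation page.

```python
import re

# Two made-up definition blocks on one line, containing no digits.
html = ("<dd><p>Return the absolute value.</dd></dl>"
        "<dd><p>Return True if all elements are true.</dd></dl>")

greedy = re.findall(r'<dd><p>([\D3]+)</dd></dl>', html)
lazy = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', html)

print(len(greedy))  # 1: one match running up to the last </dd></dl>
print(lazy)         # ['Return the absolute value.', 'Return True if all elements are true.']
```

The greedy version swallows the first closing tags and the second opening tags into its single capture; the lazy version stops at the first </dd></dl> each time.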
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.

By default all quantifiers are greedy, which means they match as many characters as possible. You can put ? after a quantifier to make it lazy, so that it matches as few characters as possible. For example, \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'

Comments in string and strings in comments

I am trying to count the characters in comments in C code using Python and regex, but without success. I can erase strings first to get rid of comments inside strings, but this also erases strings inside comments, so the result is wrong. Is there a way, using regex, to avoid matching strings inside comments, or vice versa?

No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
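A minimal sketch of such a state machine, under stated assumptions: the COMMENT state is split into line and block comments for convenience, comment delimiters are counted as comment characters, and character literals, line continuations and trigraphs are ignored. The function name and sample input are illustrative only.

```python
def count_comment_chars(source):
    """Count characters inside C comments, ignoring comment markers in strings."""
    CODE, STRING, LINE_COMMENT, BLOCK_COMMENT = range(4)
    state = CODE
    count = 0
    i = 0
    while i < len(source):
        ch = source[i]
        pair = source[i:i + 2]
        if state == CODE:
            if pair in ("//", "/*"):
                state = LINE_COMMENT if pair == "//" else BLOCK_COMMENT
                count += 2          # the opening delimiter counts as comment text
                i += 2
                continue
            if ch == '"':
                state = STRING
        elif state == STRING:
            if ch == "\\":          # skip the escaped character, e.g. \"
                i += 2
                continue
            if ch == '"':
                state = CODE
        elif state == LINE_COMMENT:
            if ch == "\n":
                state = CODE        # the newline itself is not counted
            else:
                count += 1
        else:                       # BLOCK_COMMENT
            if pair == "*/":
                state = CODE
                count += 2
                i += 2
                continue
            count += 1
        i += 1
    return count

sample = 'char *s = "/* not a comment */"; // real "comment"\n/* block\n "x" */'
print(count_comment_chars(sample))  # 33
```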

Regular expressions are not always a replacement for a real parser.

You can strip out all strings that aren't in comments by searching for the regular expression:
'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)
and replacing with:
$1
Essentially, this searches for the regex string|(comment) which matches a string or a comment, capturing the comment. The replacement is either nothing if a string was matched or the comment if a comment was matched.
Though regular expressions are not a replacement for a real parser you can quickly build a rudimentary parser by creating a giant regex that alternates all of the tokens you're interested in (comments and strings in this case). If you're writing a bit of code to handle comments, but not those in strings, iterate over all the matches of the above regex, and count the characters in the first capturing group if it participated in the match.
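In Python this might look like the sketch below. Two things differ from the pattern above and are my own assumptions: the string alternative is adapted to double-quoted C strings with backslash escapes, and the matches are iterated with finditer rather than substituted (with re.sub, the replacement would be written \1 instead of $1). Character literals such as '"' are not handled.

```python
import re

# string | (comment): the string alternative consumes string contents so that
# comment markers inside strings are never seen; only comments are captured.
TOKEN = re.compile(r'"(?:\\.|[^"\\\r\n])*"|(//.*|/\*(?s:.*?)\*/)')

def find_comments(source):
    """Return the text of each comment, skipping anything inside strings."""
    return [m.group(1) for m in TOKEN.finditer(source) if m.group(1) is not None]

sample = 'char *s = "/* not a comment */"; // real "comment"\n/* block\n "x" */'
comments = find_comments(sample)
print(comments)                       # ['// real "comment"', '/* block\n "x" */']
print(sum(len(c) for c in comments))  # 33
```

The scoped inline flag (?s:...) requires Python 3.6 or later.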
