Twitter data analysis - python

I have a question for my thesis project.
In order to do a sentiment analysis, I would like to eliminate all hashtags, but with this Python code I remove only the "#". I would also like to remove the word attached to the "#".
Thanks, everyone.
df['text']=df['text'].apply(lambda x:' '.join(re.findall(r'\w+', x)))

Assuming you want the rest of the words after the hashtag to remain intact, try this:
import re
df['text'] = df['text'].apply(lambda x: re.sub(r"#\S+", '', x))
It removes the # and everything after it up to the next whitespace.

You can use the re.sub method. Something like this:
df["text"] = df["text"].apply(lambda x: re.sub(r"#\S+\s?", "", x))
This replaces everything that matches the pattern "#\S+\s?" (a hashtag followed by one or more non-whitespace characters and an optional trailing whitespace character) with an empty string. Avoid "#.*\s": the .* is greedy, so it would delete everything from the first hashtag up to the last whitespace in the text. You may need to tweak the regex a bit depending on your data.
Check the documentation about the re module here: https://docs.python.org/3/library/re.html
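As a rough illustration of why the choice of pattern matters, here is a minimal sketch on a made-up tweet (the sample text and variable names are assumptions, not from the question). It compares a greedy "#.*\s" with a whitespace-bounded "#\S+" and then collapses the leftover double spaces:

```python
import re

# Made-up sample text, only for illustration
tweet = "Loving the #sunny weather in #Rome today"

# Greedy: .* runs to the end of the line, then backtracks to the last whitespace
greedy = re.sub(r"#.*\s", "", tweet)

# Bounded: each hashtag ends at the next whitespace character
bounded = re.sub(r"#\S+", "", tweet)

# Removing hashtags leaves double spaces behind; split/join collapses them
cleaned = " ".join(bounded.split())

print(repr(greedy))   # 'Loving the today'
print(repr(cleaned))  # 'Loving the weather in today'
```

The same bounded pattern can be passed to df['text'].apply in place of the one-off string used here.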

Related

Extracting a word between two path separators that comes after a specific word

I have the following path stored as a python string 'C:\ABC\DEF\GHI\App\Module\feature\src' and I would like to extract the word Module that is located between words \App\ and \feature\ in the path name. Note that there are file separators '\' in between which ought not to be extracted, but only the string Module has to be extracted.
I had a few ideas on how to do it:
Write a RegEx that matches a string between \App\ and \feature\
Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.
I think the 1st solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.
I would much appreciate any help.
Thank you in advance!
The regex you want is:
(?<=\\App\\).*?(?=\\feature\\)
Explanation of the regex:
(?<=behind)rest matches every instance of rest that has behind immediately before it. This is called a positive lookbehind.
rest(?=ahead) matches every instance of rest that has ahead immediately after it. This is a positive lookahead.
\ is a reserved character in regex patterns, so to use it as part of the pattern itself, we have to escape it; hence, \\
.* matches any character, zero or more times.
? makes the preceding .* non-greedy, so the match stops at the first \feature\ that follows \App\.
Because . also matches \, the pattern returns everything between \App\ and \feature\; it yields a single name only if no other \ characters lie between them.
The full code would be something like:
import re
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# re.escape turns each literal backslash into an escaped one for the regex
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
print(pattern) # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0]) # Module
A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
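To see how the lookaround pattern behaves when more than one folder sits between the two markers, here is a small sketch. The two sample paths and the stricter variant with [^\\]+ are my own assumptions for illustration, not part of the original answer:

```python
import re

start, end = "\\App\\", "\\feature\\"

# Lazy version: accepts anything (including backslashes) between the markers
lazy = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"
# Stricter variant: only a single path component, i.e. no backslash inside
strict = rf"(?<={re.escape(start)})[^\\]+(?={re.escape(end)})"

single = r"C:\ABC\App\Module\feature\src"      # hypothetical path
nested = r"C:\ABC\App\Module\sub\feature\src"  # hypothetical path

lazy_single = re.search(lazy, single)[0]
lazy_nested = re.search(lazy, nested)[0]
strict_single = re.search(strict, single)[0]
strict_nested = re.search(strict, nested)

print(lazy_single)    # Module
print(lazy_nested)    # Module\sub
print(strict_single)  # Module
print(strict_nested)  # None
```

Which variant is appropriate depends on whether nested folders between the markers should count as a match.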
We can also do this with str.find, something like:
path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'
# slice from just after the start marker up to the last occurrence of the end marker
print(path[path.find(start) + len(start):path.rfind(end)])
Output:
Module
You are looking for groups. With some small modifications you can extract only the part between App and feature.
(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)
The parentheses ( ) define a capture group, which you can get with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Try the expression here: https://regex101.com/r/24mkLO/1
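A minimal sketch of the capture-group idea in Python, assuming the path from the question and writing the pattern as a raw string (so each literal backslash is written as \\ in the regex):

```python
import re

path = r"C:\ABC\DEF\GHI\App\Module\feature\src"

# The parentheses capture only the folder name between App\ and \feature
pattern = r"App\\([A-Za-z0-9]*)\\feature"

match = re.search(pattern, path)
# re.search returns None when nothing matches, so guard before calling group()
name = match.group(1) if match else None
print(name)  # Module
```

Here match.group(0) would be the whole matched text, App\Module\feature, while group(1) is just the captured part.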

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping the parentheses with \ like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to do this using regex. I thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to skip lines made up only of digits;
You want to match a substring made of word characters (thus including possible digits);
Try escaping the variable and using it in the regular expression through an f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
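To make the effect of re.escape() visible, here is a short sketch using the string from the question. The variable names and the final digit filter are assumptions added for illustration:

```python
import re

s = "\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop"
num = "(941)"

# re.escape backslash-escapes the parentheses so they match literally
print(re.escape(num))  # \(941\)

# Unescaped: ( ) form a group, so the pattern looks for "941" directly followed by a newline
unescaped = re.findall(rf"{num}\n(\w+)", s)

# Escaped: matches the literal text "(941)" and captures the word on the next line
escaped = re.findall(rf"{re.escape(num)}\n(\w+)", s)

# Drop purely numeric hits to keep only the words
words = [w for w in escaped if not w.isdigit()]

print(unescaped)  # []
print(escaped)    # ['364', 'Rivet']
print(words)      # ['Rivet']
```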

Matching data between padding

I'm trying to match some strings in a binary file and the strings appear to be padded. As an example, the word PROGRAM could be in the binary like this:
%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M
In that example, the word PROGRAM is there but it is split up and it's between random data, so I'm trying to use regex to find it.
Currently, this is what I came up with, but I don't think it is very effective:
(?<=P)(.*?)(?=R)(.*?)(?=O)(.*?)(?=G)(.*?)(?=R)(.*?)(?=A)(.*?)(?=M)
If you want to get PROGRAM from the string, one option might be to use re.sub with a negated character class to remove all that you don't want.
[^A-Z]+
For example:
import re
test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
pattern = r'[^A-Z]+'
print(re.sub(pattern, '', test_str))
Result
PROGRAM
This should work for you and is more efficient than your current solution:
P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M
Explanation:
P[^R]+ - match P, then one or more characters other than R
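The same kind of pattern can be generated for any word. Below is a sketch with a hypothetical helper, padded_pattern (my own name, not from the answer), that builds the expression letter by letter and tests it against the sample from the question:

```python
import re

def padded_pattern(word):
    """Build 'P[^R]+R[^O]+O...' style regex text for the given word (hypothetical helper)."""
    parts = [re.escape(word[0])]
    for ch in word[1:]:
        # one or more characters that are not the next letter, then that letter
        parts.append(f"[^{re.escape(ch)}]+{re.escape(ch)}")
    return "".join(parts)

test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
pattern = padded_pattern("PROGRAM")

print(pattern)  # P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M
found = re.search(pattern, test_str) is not None
print(found)    # True
```

Note that the + quantifier requires at least one padding character between letters, so an unpadded "PROGRAM" would not match; using * instead would allow zero padding characters.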
I'm not quite sure what the desired output might be, I'm guessing maybe this expression,
(?=.*?P.*?R.*?O.*?G.*?R.*?A.*?M).*?(P).*?(R).*?(O).*?(G).*?(R).*?(A).*?(M)
might be a start.
The expression is explained on the top right panel of this demo, if you wish to explore further or simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.

get substring between character and whitespace

I am trying to get a specific substring from a text file that is always located between the word "in" and an open parenthesis, e.g. in TEXT (blah). I am trying to get at TEXT.
Currently I am using this:
m = text[text.find("in")+1:text.find("(")]
This isn't working because other sections of the larger string sometimes contain the letters i and n. So I am thinking I should change it so it is specifically looking for instances of "in" followed by whitespace.
I cannot figure out how to incorporate \s to accomplish this. How would I do this?
Use a regular expression for this:
import re
preg = re.compile(r'(?<=in\s)(.*?)(?=\s\()')
for match in preg.finditer(text):
    print(match.group(0))
I am using positive lookbehinds and lookaheads to check for "in " and " (".
Take a look here, it might help understanding the regular expression better.
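One thing to watch for: "in" followed by whitespace also occurs at the end of words such as "main" or "again". A possible refinement, sketched here on a made-up sample string (the text and the \b variant are my assumptions), is to add a word boundary and use a capture group:

```python
import re

# Made-up sample; "main" and "again" both end in "in"
text = "main entry: stored in TEXT (blah), again in OTHER (more)"

# Lookbehind version: "in " at the end of "main " and "again " also qualifies
loose = re.findall(r"(?<=in\s)(.*?)(?=\s\()", text)

# \b requires "in" to be a whole word; the group captures what follows it
strict = re.findall(r"\bin\s+(.*?)\s*\(", text)

print(loose)   # ['entry: stored in TEXT', 'in OTHER']
print(strict)  # ['TEXT', 'OTHER']
```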
Try this:
if text.find("in ") != -1:
    m = text[text.find("in ")+3:text.find("(")]

regex, how to exclude search in match

Might be a bit messy title, but the question is simple.
I got this in Python:
string = "start;some;text;goes;here;end"
The start; and end; words are always at the same position in the string.
I want the second word which is some in this case. This is what I did:
import re
string = "start;some;text;goes;here;end"
word = re.findall("start;.+?;", string)
In this example, there might be a few things to modify to make it more appropriate, but in my actual code, this is the best way.
However, the string I get back is start;some;, where the search characters themselves are included in the output. I could index both ; and extract the middle part, but there has to be a way to get only the actual word, and not the extra junk too?
No need for regex in my opinion, but all you need is a capture group here.
word = re.findall("start;(.+?);", string)
Another improvement I'd like to suggest is not using . (the dot). Rather, be more specific: what you are looking for is simply anything other than ;, the delimiter.
So I'd do this:
word = re.findall("start;([^;]+);", string)
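Putting the pieces together, a small sketch comparing the original call, the capture-group version, and a plain str.split alternative (the split approach is my own addition, in the spirit of "no need for regex"):

```python
import re

string = "start;some;text;goes;here;end"

# Without a group, findall returns the whole match, delimiters included
whole = re.findall(r"start;.+?;", string)

# With a group, findall returns only the captured part
grouped = re.findall(r"start;([^;]+);", string)

# No regex: split on the delimiter and take the second field
by_split = string.split(";")[1]

print(whole)     # ['start;some;']
print(grouped)   # ['some']
print(by_split)  # some
```

Note that findall returns a list, so the word itself is grouped[0].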
