I am trying to remove the mentions and special characters such as "!?$..." from this dataframe, and especially the character "#", while keeping the text of the hashtag.
Something like this is what I would like to have:
tweet                                        | clean_tweet
---------------------------------------------|----------------------------------
"This is an example @user2 #Science ! #Tech" | "This is an example Science Tech"
"Hi How are you @user45 #USA"                | "Hi How are you USA"
I am not sure how to iterate over the tweet column of my dataframe to do this.
I tried this for the special characters:
df["clean_tweet"] = df.columns.str.replace('[#,#,&]', '')
But I get this error:
ValueError: Length of values (38) does not match length of index (82702)
You are trying to process the column names. Apply the replacement to the tweet column instead (note that commas inside [...] are matched literally, so they are dropped here, and recent pandas versions need regex=True for a pattern):
df["clean_tweet"] = df["tweet"].str.replace('[@#&]', '', regex=True)
I see you want to remove the @user mentions entirely, so I used a regex here:
df['clean_tweet'] = df['tweet'].replace(regex=r'(@\w+)|#|&|!', value='')
tweet                                          clean_tweet
0  This is an example @user2 #Science ! #Tech  This is an example Science Tech
1  Hi How are you @user45 #USA                 Hi How are you USA
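For reference, here is a self-contained sketch of that cleaning step (the sample data and column names are assumed from the examples above):

import pandas as pd

df = pd.DataFrame({'tweet': ['This is an example @user2 #Science ! #Tech',
                             'Hi How are you @user45 #USA']})
# Drop @mentions entirely, then strip the remaining special characters
# while keeping the hashtag text.
df['clean_tweet'] = df['tweet'].replace(regex=r'(@\w+)|[#&!?]', value='')
print(df)

You may also want a final .str.replace(r'\s+', ' ', regex=True).str.strip() to collapse the double spaces the removals leave behind.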
I am trying to do a multiple-keyword search in a csv file, just in the column comments. For some reason, when I try to search I get this error message: 'DataFrame' object has no attribute 'description'
for example
table1.csv
id_Acco, user_name, post_time, comments
1543603, SameDavie, "2020/09/06", The car in the house
1543595, Johntim, "2020/09/11", You can filter the data
1558245, ACAtesdfgsf, "2020/09/19", if you're looking at a ship
1558245, TDRtesdfgsf, "2020/09/19", you can filter the table to show
Output
id_Acco, user_name, post_time, comments
1543603, SameDavie, "2020/09/06", The car in the house
1543595, Johntim, "2020/09/11", You can filter the data
1558245, TDRtesdfgsf, "2020/09/19", you can filter the table to show
Code:
df = pd.read_csv('table1.csv')
df[df.description.str.contains('house| filter | table | car')]
df.to_csv('forum_fraud_date_keyword.csv')
You can use the following code to filter rows with a regex via .str.contains():
df = df.loc[df.comments.str.contains(r'\b(?:house|filter|table|car)\b')]
Here, we use an r-string so that backslash sequences like \b reach the regex engine intact.
We use \b around the 4 target words so that the pattern only matches whole words instead of partial strings: e.g. 'carmen' won't match 'car', and 'tablespoon' won't match 'table'. If you do want to match partial strings, you can remove the pair of \b in the regex above.
You can look at this Regex Demo for a demonstration of the matching.
Result:
print(df)
   id_Acco, user_name, post_time, comments
0  1543603, SameDavie, "2020/09/06", The car in the house
1  1543595, Johntim, "2020/09/11", You can filter the data
3  1558245, TDRtesdfgsf, "2020/09/19", you can filter the table to show
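To see the whole-word behaviour concretely, here is a small sketch on made-up strings (the sample data is assumed, not from the question):

import pandas as pd

s = pd.Series(['the carmen show', 'a tablespoon of sugar', 'park the car'])
print(s.str.contains(r'\b(?:house|filter|table|car)\b'))
# 0    False  ('carmen' is not the whole word 'car')
# 1    False  ('tablespoon' is not the whole word 'table')
# 2    True   ('car' matches as a whole word)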
I have a pandas dataframe which contains a column of twitter profile descriptions. In some of these descriptions there are strings like 'insta: profile_name'.
How can I write a line of code which searches for a string (e.g. 'insta:' or 'instagram:') and returns the rest of the string that follows it?
1252: 'lad who loves to cook 🥘 • insta: xxx',
1254: 'founder and head chef | insta: xxx |',
1992: '🇬🇧 |bakery instagram - xxx',
2291: 'insta: #xxx for enquiries',
2336: 'self taught baker. ig:// xxxx 🍥🧆',
You can use regex to match each of the keywords, such as insta. The code should be something like this:

import re

container = []
for word in ["insta", "face"]:  # your list of keywords
    _tag = re.findall(word + 'Regex Syntax', the_string_to_find_from)  # placeholder pattern
    container.append([word, _tag])

Then you can unpack the resulting container variable when you want to get the result. I can help you write the regex syntax, but I need more information on the way your required information is wrapped in the text.
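For instance, here is a hedged sketch with an assumed pattern that captures the token after each keyword and a colon (the real pattern depends on how the handles are written in your data):

import re

the_string_to_find_from = 'founder and head chef | insta: xxx |'
container = []
for word in ['insta', 'ig']:
    _tag = re.findall(word + r':\s*(\S+)', the_string_to_find_from)
    container.append([word, _tag])
print(container)  # [['insta', ['xxx']], ['ig', []]]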
Answer provided by Nk03 in the comments:
df['name'].str.extract(pat = r'(insta:|ig:)(.*)')[1].str.strip('\',')
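A minimal usage sketch on data shaped like the examples above ('name' is assumed to be the profile-description column):

import pandas as pd

df = pd.DataFrame({'name': ['lad who loves to cook insta: xxx',
                            'founder and head chef | insta: xxx |',
                            'self taught baker. ig: xxxx']})
print(df['name'].str.extract(pat=r'(insta:|ig:)(.*)')[1].str.strip('\','))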
I have a dataframe which contains a lot of different emojis and I want to remove them. I looked at answers to similar questions but they didn't work for me.
index | messages
------|----------------
1     | Hello! 👋
2     | Good Morning 😃
3     | How are you ?
4     | Good 👍
5     | Ländern
Now I want to remove all these emojis from the DataFrame so it looks like this:
index | messages
------|----------------
1     | Hello!
2     | Good Morning
3     | How are you ?
4     | Good
5     | Ländern
I tried the solution here but unfortunately it also removes all non-English letters like "ä"
How can I remove emojis from a dataframe?
This solution will keep all ASCII and latin-1 characters, i.e. characters between U+0000 and U+00FF in this list. For extended Latin plus Greek, use < 1024:
import pandas as pd

df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})
filter_char = lambda c: ord(c) < 256  # True only for ASCII and latin-1 code points
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))
Result:
messages
0 Länder
1 Hello!
Note this does not work for Japanese text, for example. Another problem is that the heart "emoji" is actually a Dingbat, so I can't simply filter out everything beyond the Basic Multilingual Plane, oh well.
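If you need to keep non-Latin scripts (Japanese, for example) and still drop emoji, an alternative is to blacklist the emoji blocks explicitly. This is only a sketch; the ranges below are an assumption covering the most common blocks, not an exhaustive list:

import re
import pandas as pd

emoji_pattern = re.compile(
    '['
    '\U0001F300-\U0001F5FF'  # symbols & pictographs
    '\U0001F600-\U0001F64F'  # emoticons
    '\U0001F680-\U0001F6FF'  # transport & map symbols
    '\U0001F900-\U0001FAFF'  # supplemental symbols & pictographs
    '\U0001F1E6-\U0001F1FF'  # regional indicator (flag) letters
    '\u2600-\u27BF'          # misc symbols and Dingbats (includes the heart)
    '\uFE0F'                 # variation selector used after some emoji
    ']+'
)
df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})
df['messages'] = df['messages'].str.replace(emoji_pattern, '', regex=True)
print(df)  # non-Latin letters survive; only the blocks above are removed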
I think the following answers your question. I added some other characters for verification.
import pandas as pd
df = pd.DataFrame({'messages':['Hello! 👋', 'Good-Morning 😃', 'How are you ?', ' Goodé 👍', 'Ländern' ]})
df['messages'] = df['messages'].astype(str).apply(lambda x: x.encode('latin-1', 'ignore').decode('latin-1'))
For all intents and purposes, I am a Python user and use the Pandas library on a daily basis. Named capture groups in regex are extremely useful. So, for example, it is relatively trivial to extract occurrences of specific words or phrases and to produce concatenated strings of the results in new columns of a dataframe. An example of how this might be achieved is given below:
import numpy as np
import pandas as pd
import re
myDF = pd.DataFrame(['Here is some text',
                     'We all love TEXT',
                     'Where is the TXT or txt textfile',
                     'Words and words',
                     'Just a few works',
                     'See the text',
                     'both words and text'], columns=['origText'])
print("Original dataframe\n------------------")
print(myDF)
# Define regex to find occurrences of 'text' or 'word' as separate named groups
myRegex = """(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"""
myCompiledRegex = re.compile(myRegex,flags=re.I|re.X)
# Extract all occurrences of 'text' or 'word'
myMatchesDF = myDF['origText'].str.extractall(myCompiledRegex)
print("\nDataframe of matches (with multi-index)\n--------------------")
print(myMatchesDF)
# Collapse resulting multi-index dataframe into single rows with concatenated fields
myConcatMatchesDF = myMatchesDF.groupby(level = 0).agg(lambda x: '///'.join(x.fillna('')))
myConcatMatchesDF = myConcatMatchesDF.replace(to_replace = "^/+|/+$",value = "",regex = True) # Remove '///' at start and end of strings
print("\nCollapsed and concatenated matches\n----------------------------------")
print(myConcatMatchesDF)
myDF = myDF.join(myConcatMatchesDF)
print("\nFinal joined dataframe\n----------------------")
print(myDF)
This produces the following output:
Original dataframe
------------------
origText
0 Here is some text
1 We all love TEXT
2 Where is the TXT or txt textfile
3 Words and words
4 Just a few works
5 See the text
6 both words and text
Dataframe of matches (with multi-index)
--------------------
textOcc wordOcc
match
0 0 text NaN
1 0 TEXT NaN
2 0 TXT NaN
1 txt NaN
2 text NaN
3 0 NaN Word
1 NaN word
5 0 text NaN
6 0 NaN word
1 text NaN
Collapsed and concatenated matches
----------------------------------
textOcc wordOcc
0 text
1 TEXT
2 TXT///txt///text
3 Word///word
5 text
6 text word
Final joined dataframe
----------------------
origText textOcc wordOcc
0 Here is some text text
1 We all love TEXT TEXT
2 Where is the TXT or txt textfile TXT///txt///text
3 Words and words Word///word
4 Just a few works NaN NaN
5 See the text text
6 both words and text text word
I've printed each stage to try to make it easy to follow.
The question is: can I do something similar in R? I've searched the web but can't find anything that describes the use of named groups (although I'm an R newcomer and so might be searching for the wrong libraries or descriptive terms).
I've been able to identify those items that contain one or more matches but I cannot see how to extract specific matches or how to make use of the named groups. The code I have so far (using the same dataframe and regex as in the Python example above) is:
origText = c('Here is some text','We all love TEXT','Where is the TXT or txt textfile','Words and words','Just a few works','See the text','both words and text')
myDF = data.frame(origText)
myRegex = "(?P<textOcc>t[e]?xt)|(?P<wordOcc>word)"
myMatches = grep(myRegex,myDF$origText,perl=TRUE,value=TRUE,ignore.case=TRUE)
myMatches
[1] "Here is some text" "We all love TEXT" "Where is the TXT or txt textfile" "Words and words"
[5] "See the text" "both words and text"
myMatchesRow = grep(myRegex,myDF$origText,perl=TRUE,value=FALSE,ignore.case=TRUE)
myMatchesRow
[1] 1 2 3 4 6 7
The regex seems to be working and the correct rows are identified as containing a match (i.e. all except row 5 in the above example). However, my question is, can I produce an output that is similar to that produced by Python where the specific matches are extracted and listed in new columns in the dataframe that are named using the group names contained in the regex?
Base R does capture the information about the names, but it doesn't have a good helper to extract them by name. I wrote a wrapper to help, called regcapturedmatches. You can use it with:
myRegex = "(?<textOcc>t[e]?xt)|(?<wordOcc>word)"
m<-regexpr(myRegex, origText, perl=T, ignore.case=T)
regcapturedmatches(origText,m)
Which returns
textOcc wordOcc
[1,] "text" ""
[2,] "TEXT" ""
[3,] "TXT" ""
[4,] "" "Word"
[5,] "" ""
[6,] "text" ""
[7,] "" "word"
There are two sentences in "test_tweet1.txt":
@francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
@mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt":
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:

import re
popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print "\n", count1, line
    ltext1 = line.split(" ")
    for i, text in enumerate(ltext1):
        if text in rpopular_person:
            print text
    text2 = ' '.join(ltext1)
Results from the code showed:
1 @francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 @mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I tried to match the words in "test_tweet1.txt" against the names in "Personal.txt".
Expected result:
Tony
Romo
Any suggestions?
Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get a value of True, because the characters 't', 'o' appear in sequence. I am assuming that the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noting that I'm calling split() with no arguments. This splits on all whitespace: newlines, tabs, and so on. This way you get every word separately, as you intend. Or, if you don't actually want that (since it will match "The" all the time, based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet; the word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue): rather than looking at each word in your tweet, look at each name in your Personal.txt file and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    people = p.read().split('\n')  # Get full names rather than partial names

with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print person
You need to split rpopular_person to get it to match words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
Tony
The
The reason Romo isn't showing up is that after your line split you have "Romo." with the period. Maybe you should look for the popular_person names in the lines, instead of the other way around. Maybe something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
array = []
for count1, line in enumerate(file1):
    print "\n", count1, line
    for person in popular_person:
        if person in line:
            print person
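With the sample files above, the inner loop finds "Tony Romo" as a whole-name substring of the first tweet, which matches the expected result.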