Python pandas: Remove emojis from DataFrame - python

I have a dataframe which contains a lot of different emojis and I want to remove them. I looked at answers to similar questions but they didn't work for me.
index| messages
----------------
1 |Hello! 👋
2 |Good Morning 😃
3 |How are you ?
4 | Good 👍
5 | Ländern
Now I want to remove all these emojis from the DataFrame so it looks like this
index| messages
----------------
1 |Hello!
2 |Good Morning
3 |How are you ?
4 | Good
5 |Ländern
I tried the solution here but unfortunately it also removes all non-English letters like "ä"
How can I remove emojis from a dataframe?

This solution keeps all ASCII and Latin-1 characters, i.e. characters between U+0000 and U+00FF. For extended Latin plus Greek, use ord(c) < 1024 instead:
df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})
filter_char = lambda c: ord(c) < 256
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))
Result:
messages
0 Länder
1 Hello!
Note this does not work for Japanese text for example. Another problem is that the heart "emoji" is actually a Dingbat so I can't simply filter for the Basic Multilingual Plane of Unicode, oh well.
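If keeping non-Latin scripts matters, a codepoint cutoff won't do. A sketch of an alternative using the standard library's unicodedata module: drop characters whose Unicode category is "Symbol, other" (So), which covers emoji and dingbats while leaving letters from any script alone. The extra check for U+FE0F handles the invisible variation selector that often trails hearts like "❤️"; the Japanese sample string is my own addition for verification.

```python
import unicodedata
import pandas as pd

def strip_symbols(s):
    # Keep every character except category "So" (Symbol, other) and the
    # invisible variation selector U+FE0F that often follows emoji.
    return ''.join(
        c for c in s
        if unicodedata.category(c) != 'So' and c != '\ufe0f'
    ).strip()

df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋', 'こんにちは 😃']})
df['messages'] = df['messages'].apply(strip_symbols)
```

Unlike the ord(c) < 256 filter, this keeps Japanese and other non-Latin text intact.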

I think the following answers your question. I added some other characters for verification.
import pandas as pd
df = pd.DataFrame({'messages':['Hello! 👋', 'Good-Morning 😃', 'How are you ?', ' Goodé 👍', 'Ländern' ]})
df['messages'] = df['messages'].astype(str).apply(lambda x: x.encode('latin-1', 'ignore').decode('latin-1'))

Related

How to select only rows containing emojis and emoticons in Python?

I have DataFrame in Python Pandas like below:
sentence
------------
😎🤘🏾
I like it
+1😍😘
One :-) :)
hah
I need to select only rows containing emoticons or emojis, so as a result I need something like below:
sentence
------------
😎🤘🏾
+1😍😘
One :-) :)
How can I do that in Python ?
You can select the unicode emojis with a regex range:
df2 = df[df['sentence'].str.contains(r'[\u263a-\U0001f645]')]
output:
sentence
0 😎🤘🏾
2 +1😍😘
This is however much more ambiguous for the ASCII "emojis" as there is no standard definition and probably endless combinations. If you limit it to the smiley faces that contain eyes ';:' and a mouth ')(' you could use:
df[df['sentence'].str.contains(r'[\u263a-\U0001f645]|(?:[:;]\S?[\)\(])')]
output:
sentence
0 😎🤘🏾
2 +1😍😘
3 One :-) :)
But you would be missing plenty of potential ASCII possibilities: :O, :P, 8D, etc.
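On the Unicode side, the single range above also misses some blocks (skin-tone modifiers live at U+1F3FB–U+1F3FF, and newer emoji extend past U+1F645). A sketch with a somewhat broader, still non-exhaustive character class; a fully robust check would use the Unicode emoji-data property lists, so treat these ranges as an approximation:

```python
import re
import pandas as pd

# Miscellaneous Symbols + Dingbats (U+2600–U+27BF), regional indicators
# (flags), and the main emoji planes from U+1F300 through U+1FAFF.
emoji_re = re.compile(
    '[\u2600-\u27bf'
    '\U0001f1e6-\U0001f1ff'
    '\U0001f300-\U0001faff]'
)

df = pd.DataFrame({'sentence': ['😎🤘🏾', 'I like it', '+1😍😘', 'One :-) :)', 'hah']})
mask = df['sentence'].str.contains(emoji_re)
df2 = df[mask]
```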

Remove mentions and special characters for twitter dataset

I am trying to remove from this dataframe the mentions and special characters such as "!?$..." and especially the character "#", while keeping the text of the hashtag.
Something like this is what I would like to have:
tweet clean_tweet
---------------------------------------------|-----------
"This is an example @user2 #Science ! #Tech" | "This is an example Science Tech"
"Hi How are you @user45 #USA" | "Hi How are you USA"
I am not sure how to iterate and do this in my dataframe in the column tweet
I tried this for the special characters
df["clean_tweet"] = df.columns.str.replace('[@,#,&]', '')
But I get this error
ValueError: Length of values (38) does not match length of index (82702)
You are trying to process the column names, not the column itself.
Try this
df["clean_tweet"] = df["tweet"].str.replace('[@,#,&]', '', regex=True)
I see you also want to remove the @user mentions entirely, so I used a regex here
df['clean_tweet'] = df['tweet'].replace(regex=r'(@\w+)|#|&|!', value='')
tweet clean_tweet
0 This is an example @user2 #Science ! #Tech This is an example Science Tech
1 Hi How are you @user45 #USA Hi How are you USA
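Putting it together, a sketch that also collapses the double spaces the removals leave behind, assuming the mentions are @-prefixed as on Twitter (note the explicit regex=True, which recent pandas versions require for pattern replacement):

```python
import pandas as pd

df = pd.DataFrame({'tweet': [
    'This is an example @user2 #Science ! #Tech',
    'Hi How are you @user45 #USA',
]})

df['clean_tweet'] = (
    df['tweet']
    .str.replace(r'@\w+|[#&!]', '', regex=True)  # drop @mentions, keep hashtag text
    .str.replace(r'\s+', ' ', regex=True)        # collapse leftover double spaces
    .str.strip()
)
```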

Cleaning a dataset and removing special characters in python

I am fairly new to all of this so apologies in advance.
I've got a dataset (csv). One column contains strings with whole sentences. These sentences contain misinterpreted UTF-8 characters like ’ and emojis like 🥳.
So the dataframe (df) looks kind of like this:
date text
0 Jul 31 2020 it’s crazy. i hope post-covid we can get it doneðŸ¥³
1 Jul 31 2020 just sayin’ ...
2 Jul 31 2020 nba to hold first games in 'bubble' amid pandemic
The goal is to do a sentiment analysis on the texts.
Would it be best to remove ALL special characters like , . ( ) [ ] + | - to do the sentiment analysis?
How do I do that, and how do I also remove the misinterpreted UTF-8 characters like ’?
I've tried it myself by using some code I found and changing that to my problem.
This resulted in this piece of code, which seems to do absolutely nothing. The characters like ’ are still in the text.
spec_chars = ["…","🥳"]
for char in spec_chars:
    df['text'] = df['text'].str.replace(char, ' ')
I'm a bit lost here.
I appreciate any help!
You can change the character encoding like this. x is one of the sentences in the original post.
x = 'it’s crazy. i hope post-covid we can get it doneðŸ¥³'
x.encode('windows-1252').decode('utf8')
The result is 'it’s crazy. i hope post-covid we can get it done🥳'
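To see why the round-trip works: the curly apostrophe U+2019 was stored as its three UTF-8 bytes, but those bytes were then decoded as windows-1252, yielding the three characters â, €, ™. Re-encoding with windows-1252 recovers the original bytes, and decoding them as UTF-8 recovers the apostrophe:

```python
mojibake = 'â€™'

# windows-1252 maps â→0xE2, €→0x80, ™→0x99: the UTF-8 bytes of U+2019.
raw = mojibake.encode('windows-1252')
assert raw == b'\xe2\x80\x99'
assert raw.decode('utf8') == '\u2019'  # RIGHT SINGLE QUOTATION MARK, '’'
```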
As jsmart stated, use .encode / .decode. Since the column is a Series, you'd use .str to access its values as strings and apply the methods.
As far as the text sentiment goes, look at NLTK, and take a look at its examples of sentiment analysis.
import pandas as pd
df = pd.DataFrame([['Jul 31 2020','it’s crazy. i hope post-covid we can get it doneðŸ¥³'],
['Jul 31 2020','just sayin’ ...'],
['Jul 31 2020',"nba to hold first games in 'bubble' amid pandemic"]],
columns = ['date','text'])
df['text'] = df['text'].str.encode('windows-1252').str.decode('utf8')
Try this. It's quite helpful for me.
df['clean_text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word.isalnum()]))

regex as a separator to read tables in python (Pandas)

I would like to ask for some help reading a text file (Python 2.7, pandas library) that uses "|" as a separator, but the same character can also appear inside a record, surrounded by spaces. The first two rows don't have the problem, but the third one has the separator inside its 6th field, TAT Fans | Southern:
4_230_0415_99312||||9500|Gedung|||||||||15000|6.11403|102.23061
4_230_0415_99313||||9500|Pakatan|||||||||50450|3.15908|101.71431
4_230_0117_12377||||9990|TAT Fans | Southern||||||||||3.141033333|101.727125
I have been trying to use a regex as the separator, but I haven't been able to make it work:
pd.read_table("text_file.txt", sep = "\S+\|\S+")
Can Anyone help me find a solution to my problem?
Many thanks in advance!
You can use r"\s?[|]+\s?"
import pandas as pd
pd.read_table("text_file.txt", sep=r"\s?[|]+\s?") # or r"\s?\|+\s?"
Out[18]:
4_230_0415_99312 9500 Gedung 15000 6.11403 102.23061
0 4_230_0415_99313 9500 Pakatan 50450 3.159080 101.714310
1 4_230_0117_12377 9990 TAT Fans Southern 3.141033 101.727125
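A self-contained version of the above, with hypothetical inline data standing in for text_file.txt (engine="python" is required for regex separators, and header=None stops the first data row being used as column names):

```python
import io
import pandas as pd

data = (
    "4_230_0415_99312||||9500|Gedung|||||||||15000|6.11403|102.23061\n"
    "4_230_0117_12377||||9990|TAT Fans | Southern||||||||||3.141033333|101.727125\n"
)

# [|]+ collapses runs of pipes (empty fields); \s? eats the spaces around
# the "|" inside "TAT Fans | Southern" — note that field still splits in two.
df = pd.read_csv(io.StringIO(data), sep=r"\s?[|]+\s?",
                 header=None, engine="python")
```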

Search strings with python

I would like to know how I can search for specific strings with Python. I opened a markdown file which contains a table like the one below:
| --------- | -------- | --------- |
|**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. |
|**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|
|**unscrewed**| - | -Slowly and very carefully he unscrewed the ink bottle, dipped his quill into it, and began to write,|
|**downtrodden**| - | -For years, Aunt Petunia and Uncle Vernon had hoped that if they kept Harry as downtrodden as possible, they would be able to squash the magic out of him.|
|**sheets,**| - | -As long as he didn’t leave spots of ink on the sheets, the Dursleys need never know that he was studying magic by night.|
|**flinch**| - | -But he hoped she’d be back soon — she was the only living creature in this house who didn’t flinch at the sight of him.|
And I have to get the string from each line that is decorated with |** **|, like:
propped
Pointless
unscrewed
downtrodden
sheets
flinch
I tried to use a regular expression but failed to extract them.
import re
y = r'(?<=\|\*{2}).+?(?=,{0,1}\*{2}\|)'
reg = re.compile(y)
a = '| --------- | -------- | --------- | |**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. | |**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|'
reg.findall(a)
Regex(y) above explained:
(?<=\|\*{2}) - Matches if the current position in the string is preceded by a match for \|\*{2} i.e. |**
.+? - Will try to find anything(except for new line) repeated 1 or more times. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
(?=,{0,1}\*{2}\|) - (?=...) is a lookahead: it asserts that the match is immediately followed by ,{0,1}\*{2}\|, i.e. an optional comma, then two * and a closing |.
Try using the following regex :
(?<=\|)(?!\s).*?(?!\s)(?=\|)
If the asterisks are in the text you are searching and you do not want the comma after sheets, the pattern would be: a pipe, followed by two asterisks, then a capture of anything that is not an asterisk or a comma.
\|\*{2}([^*,]+)
If you can live with the comma or if there might be commas you want to catch
\|\*{2}([^*]+)
Use either pattern with re.findall or re.finditer to capture the text you want.
If using the second pattern, you would need to run through the groups and strip any unwanted commas.
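A minimal runnable sketch of the first pattern, with a few truncated sample rows standing in for the real markdown file:

```python
import re

# Hypothetical shortened rows from the markdown table in the question.
lines = [
    '|**propped**| - | -a flashlight in one hand ... |',
    '|**Pointless**| - | -Witch Burning in the Fourteenth Century ... |',
    '|**sheets,**| - | -spots of ink on the sheets ... |',
]

# A pipe, two asterisks, then capture everything up to the next '*' or ','
pattern = re.compile(r'\|\*{2}([^*,]+)')
words = [m.group(1) for line in lines for m in pattern.finditer(line)]
```

Note that the trailing comma in "sheets," is excluded automatically, since the capture stops at the first comma.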
I wrote the program below to achieve the required output. I created a file string_test into which I copied all the raw strings:
import re

a = re.compile(r"^\|\*\*([^*,]+)")
with open("string_test", "r") as file1:
    for i in file1.readlines():
        match = a.search(i)
        if match:
            print(match.group(1))
