How to select only rows containing emojis and emoticons in Python?

How to select only rows containing emojis and emoticons in Python? - python

I have DataFrame in Python Pandas like below:
sentence
------------
😎🤘🏾
I like it
+1😍😘
One :-) :)
hah
I need to select only rows containing emoticons or emojis, so as a result I need something like below:
sentence
------------
😎🤘🏾
+1😍😘
One :-) :)
How can I do that in Python ?

You can select the unicode emojis with a regex range:
df2 = df[df['sentence'].str.contains(r'[\u263a-\U0001f645]')]
output:
sentence
0 😎🤘🏾
2 +1😍😘
This is however much more ambiguous for the ASCII "emojis" as there is no standard definition and probably endless combinations. If you limit it to the smiley faces that contain eyes ';:' and a mouth ')(' you could use:
df[df['sentence'].str.contains(r'[\u263a-\U0001f645]|(?:[:;]\S?[\)\(])')]
output:
sentence
0 😎🤘🏾
2 +1😍😘
3 One :-) :)
But you would be missing plenty of potential ASCII possibilities: :O, :P, 8D, etc.

Related

Python pandas: Remove emojis from DataFrame

I have a dataframe which contains a lot of different emojis and I want to remove them. I looked at answers to similar questions but they didn't work for me.
index| messages
----------------
1 |Hello! 👋
2 |Good Morning 😃
3 |How are you ?
4 | Good 👍
5 | Ländern
Now I want to remove all these emojis from the DataFrame so it looks like this
index| messages
----------------
1 |Hello!
2 |Good Morning
3 |How are you ?
4 | Good
5 |Ländern
I tried the solution here but unfortunately it also removes all non-English letters like "ä"
How can I remove emojis from a dataframe?

This solution that will keep all ASCII and latin-1 characters, i.e. characters between U+0000 and U+00FF in this list. For extended Latin plus Greek, use < 1024:
df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})
filter_char = lambda c: ord(c) < 256
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))
Result:
messages
0 Länder
1 Hello!
Note this does not work for Japanese text for example. Another problem is that the heart "emoji" is actually a Dingbat so I can't simply filter for the Basic Multilingual Plane of Unicode, oh well.

I think the following is answering your question. I added some other characters for verification.
import pandas as pd
df = pd.DataFrame({'messages':['Hello! 👋', 'Good-Morning 😃', 'How are you ?', ' Goodé 👍', 'Ländern' ]})
df['messages'].astype(str).apply(lambda x: x.encode('latin-1', 'ignore').decode('latin-1'))

Cleaning a dataset and removing special characters in python

I am fairly new to all of this so apologies in advance.
I've got a dataset (csv). One column contains strings with whole sentences. These sentences contain missinterpreted utf-8 charactes like â€™ and emojis like ðŸ¥³.
So the dataframe (df) looks kind of like this:
date text
0 Jul 31 2020 itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³
1 Jul 31 2020 just sayinâ€™ ...
2 Jul 31 2020 nba to hold first games in 'bubble' amid pandemic
The goal is to do a sentiment analysis on the texts.
Would it be best to remove ALL special characters like , . ( ) [ ] + | - to do the sentiment analysis?
How do I do that and how do I also remove the missinterpreted utf-8 charactes like â€™?
I've tried it myself by using some code I found and changing that to my problem.
This resulted in this piece of code which seems to do absolutly nothing. The charactes like â€™ are still in the text.
spec_chars = ["â€¦","ðŸ¥³"]
for char in spec_chars:
df['text'] = df['text'].str.replace(char, ' ')
I'm a bit lost here.
I appreciate any help!

You can change the character encoding like this. x is one of the sentences in the original post.
x = 'itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³'
x.encode('windows-1252').decode('utf8')
The result is 'it’s crazy. i hope post-covid we can get it done🥳'

As jsmart stated, use the .encode .decode. Since the column is a series, you's be using .str to access the values of the series as strings and apply the methods.
As far as the text sentiment, look at NLTK. And take a look at it's examples of sentiment analysis
import pandas as pd
df = pd.DataFrame([['Jul 31 2020','itâ€™s crazy. i hope post-covid we can get it doneðŸ¥³'],
['Jul 31 2020','just sayinâ€™ ...'],
['Jul 31 2020',"nba to hold first games in 'bubble' amid pandemic"]],
columns = ['date','text'])
df['text'] = df['text'].str.encode('windows-1252').str.decode('utf8')

Try this. It's quite helpful for me.
df['clean_text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word.isalnum()])

Extracting #mentions from tweets using findall python (Giving incorrect results)

I have a csv file something like this
text
RT #CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in #CellCellPress htp://.co/HrjDwbm7NN
RT #gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT #sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT #MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via #nucAmbiguous htp://…
I want to extract all the mentions (starting with '#') from the tweet text. So far I have done this
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
result = re.findall("(^|[^#\w])#(\w{1,25})", str(X.iloc[:i,:]))
print(result);
There are two problems here:
First: at str(X.iloc[:1,:]) it gives me ['CritCareMed'] which is not ok as it should give me ['CellCellPress'], and at str(X.iloc[:2,:]) it again gives me ['CritCareMed'] which is of course not fine again. The final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mentions in 2nd row and both two mentions in last row.
What I want should look something like this:
How can I achieve these results? this is just a sample data my original data has lots of tweets so is the approach ok?

You can use str.findall method to avoid the for loop, use negative look behind to replace (^|[^#\w]) which forms another capture group you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![#\w])#(\w{1,25})').apply(','.join)
df
# text mention
#0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
#1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
#2 RT #gvwilson: Where's the theory for software ... gvwilson
#3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
#4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
Also X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell, to extract the actual string from the text column, you can use X.text.iloc[0], or a better way to iterate through a column, use iteritems:
import re
for index, s in df.text.iteritems():
result = re.findall("(?<![#\w])#(\w{1,25})", s)
print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous

While you already have your answer, you could even try to optimize the whole import process like so:
import re, pandas as pd
rx = re.compile(r'#([^:\s]+)')
with open("test.txt") as fp:
dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
df = pd.DataFrame(dft, columns = ['text', 'mention'])
print(df)
Which yields:
text mention
0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
2 RT #gvwilson: Where's the theory for software ... gvwilson
3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
This might be a bit faster as you don't need to change the df once it's already constructed.

mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
#.*? carries out a non-greedy match for a word starting
with a hashtag
(?=\s|$) look-ahead for the end of the word or end of the sentence
(?:(?<=\s)|(?<=^)) look-behind to ensure there are no false positives if a # is used in the middle of a word
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.

Identifying list of regex expressions in Pandas column

I have a large pandas dataframe. A column contains text broken down into sentences, one sentence per row. I need to check the sentences for the presence of terms used in various ontologies. Some of the ontologies are fairly large and have more than 100.000 entries. In addition some of the ontologies contain molecule names with hyphens, commas, and other characters that may or may not be present in the text to be examined, hence, the need for regular expressions.
I came up with the code below, but it's not fast enough to deal with my data. Any suggestions are welcome.
Thank you!
import pandas as pd
import re
sentences = ["""There is no point in driving yourself mad trying to stop
yourself going mad""",
"The ships hung in the sky in much the same way that bricks don’t"]
sentence_number = list(range(0, len(sentences)))
d = {'sentence' : sentences, 'number' : sentence_number}
df = pd.DataFrame(d)
regexes = ['\\bt\\w+', '\\bs\\w+']
big_regex = '|'.join(regexes)
compiled_regex = re.compile(big_regex, re.I)
df['found_regexes'] = df.sentence.str.findall(compiled_regex)

Match all characters except a few using .(dot) in multiline string using Regex

My input string is as follows :
The dog is black
and beautiful
The dog and the cat
is black and beautiful
I want to replace 'black' to 'dark' only when the cat is not described .
So my output should be
The dog is dark
and beautiful
The dog and the cat
is black and beautiful
pRegex = re.compile(r'(The.*?(?!cat)ful)', re.DOTALL)
for i in pRegex.finditer(asm_file):
res = i.groups()
print res
With this , the 'black' is replaced in both the cases.
Is there anything wrong with the regex .
I am using python 2.7
Thanks

Regexp cannot describe a string based on a general negative expression ("not containing Z"). In your case you tried to express sth like "A string starting with X and ending with Y but NOT containing Z." That NOT containing is not possible in regexp. What your pattern resulted in to express was: "A string starting with X and ending with Y and containing at least one place which is not Z." Which does not help.
I suggest to search for the more general expression and then test it using sth like if 'cat' is in i:. That is straight forward and everybody is able to understand that.
A more sophisticated way could be to search for an alternative (OR) of two regexps, the first being one matching such expressions with cat inside, the other matching all expressions with that start and end part. If you then capture both alternatives in different groups, you can easily decide on the group filled which alternative you've got (with or without cat). But this only works if you can specify true separators between the groups which I figure you can't ;-) Anyway here's an example of what I mean:
r = re.compile(r'(The[^|]*?cat[^|]*?ful)|(The[^|]*?ful)')
text = 'The dog is black and beautiful | The dog and the cat is black and beautiful'
for i in r.finditer(text):
print i.groups()
prints:
(None, 'The dog is black and beautiful')
('The dog and the cat is black and beautiful', None)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to select only rows containing emojis and emoticons in Python? - python

Related

Python pandas: Remove emojis from DataFrame

Cleaning a dataset and removing special characters in python

Extracting #mentions from tweets using findall python (Giving incorrect results)

Identifying list of regex expressions in Pandas column

Match all characters except a few using .(dot) in multiline string using Regex

Categories

Resources