I am new to text analysis with Python and am struggling to clean my data of special characters.
I have survey data where one of the columns contains comments. I want to analyse these comments and find the most frequent words. I try to exclude the stop words using pandas.Series.str.replace.
Here is my code:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# get the relevant column from the dataset:
df_comments = df.iloc[:, [-3]].dropna()
# clean it from stop words
pat = r'\b(?:{})\b'.format('|'.join(stop_words))
df['comment_without_stopwords'] = df["comment"].str.replace(pat, '', regex=True)
# collapse the whitespace left behind by the removed words
df['comment_without_stopwords'] = df['comment_without_stopwords'].str.replace(r'\s+', ' ', regex=True)
# get the 20 most frequent words:
result = df['comment_without_stopwords'].str.split(expand=True).stack().value_counts(normalize=True, ascending=False).head(20)
But in the result the characters . and - show up in my top list, as can be seen below. How can I get rid of them?
staff 0.015001
need 0.009265
work 0.007942
- 0.007059
action 0.006618
project 0.005074
contract 0.005074
. 0.004853
field 0.004412
support 0.004412
employees 0.004191
projects 0.004191
HR 0.003971
time 0.003971
HQ 0.003971
needs 0.003750
field 0.003530
training 0.003530
capacity 0.003530
good 0.003530
dtype: float64
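One way to get rid of them is to strip punctuation before splitting and counting, for example with one more str.replace; this is only a sketch, and the [^\w\s] pattern is an assumption about what should count as punctuation here:

# drop anything that is neither a word character nor whitespace (., -, etc.),
# then collapse the extra spaces that this leaves behind
df['comment_without_stopwords'] = (
    df['comment_without_stopwords']
    .str.replace(r'[^\w\s]', ' ', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)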
Related
I have a script to cluster keywords, using pandas and polyfuzz. With English it works as expected. When I use the script with keywords in German, it recognizes many keywords wrongly.
What "wrongly recognized" means: the clustering extracts the first and second word of the keyword, and as you can see in the screenshot, columns G and H (First Word and Second Word) contain different words than the corresponding keywords in column B (Keyword):
The script does not always fail with German, and many keywords are clustered correctly. But the share of wrongly recognized keywords is very high, up to 20%.
Could somebody explain to me why the script fails with German keywords and, ideally, improve the script so that it works with German?
Here is the part of the script, which does clustering:
# find keywords from one column in another in any order and count the frequency
df_matched['Cluster Name'] = df_matched['Cluster Name'].str.strip()
df_matched['Keyword'] = df_matched['Keyword'].str.strip()
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[0]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[1]
df_matched['Total Keywords'] = df_matched['First Word'].str.count(' ') + 1

def ismatch(s):
    A = set(s["First Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found'] = df_matched.apply(ismatch, axis=1)
df_matched = df_matched.fillna('')

def ismatch(s):
    A = set(s["Second Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found 2'] = df_matched.apply(ismatch, axis=1)

# todo - document this algo. Essentially if it matches on the second word only, it renames the cluster to the second word
# clean up code and variable names
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == True), "Cluster Name"] = df_matched["Second Word"]
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == False), "Cluster Name"] = "zzz_no_cluster_available"

# count cluster size
df_matched['Cluster Size'] = df_matched['Cluster Name'].map(df_matched.groupby('Cluster Name')['Cluster Name'].count())
df_matched.loc[df_matched["Cluster Size"] == 1, "Cluster Name"] = "zzz_no_cluster_available"
df_matched = df_matched.sort_values(by="Cluster Name", ascending=True)
Here are two datasets:
Working dataset in English: http://dl.dropboxusercontent.com/s/zrobh2x4bs3ztlf/working-dataset-english.txt
Badly working dataset in German: http://dl.dropboxusercontent.com/s/i1p3j3zi1t0cev3/badly-working-dataset-german.txt
And here, the working Colab with the whole script.
I opened the full code to understand where df_matched came from.
I'm not 100% sure of what you are trying to do, but I think that the problem comes from before the snippet you shared here.
It comes from the way that df_matched is created. It uses fuzzy matching to create clusters. So the words of "Cluster Name" are not all guaranteed to be present in "Keyword".
If you run the code for the English data, and check the words in position -1 and -2 (last two words of the Cluster Name) instead of 0 and 1...
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[-1]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[-2]
...then calculate how many of them are not found...
print((~df_matched["Found"]).sum())
print((~df_matched["Found 2"]).sum())
# 140
# 10
...you can see that for 104 out of 158 rows, the last word is not part of the keywords.
(I don't know if you care about the first two words more than the last two... but this looks worse than the 20% you noticed in the German data.)
For the German data the problem is more visible because the language uses a lot of compound words and many frequent suffixes (e.g., "ung"), so they will fuzzy-match a lot.
Example of df_matched for German: the "From" words are not present in "To"... but there are large overlaps.
This is df_matched for English: some words of "From" are not even close to the words in "To"... and the similarity score can be worse than in the German dataset.
Possible improvements
I think that the part where you could improve the clustering is this (from the colab notebook)
df_1_list = df_1.Keyword.tolist() # create list from df
model = PolyFuzz("TF-IDF")
cluster_tags = df_1_list[::]
cluster_tags = set(cluster_tags)
cluster_tags = list(cluster_tags)
print("Cleaning up the cluster tags.. Please be patient!")
substrings = {w1 for w1 in tqdm(cluster_tags) for w2 in cluster_tags if w1 in w2 and w1 != w2}
longest_word = set(cluster_tags) - substrings
longest_word = list(longest_word)
shortest_word_list = list(set(cluster_tags) - set(longest_word))
try:
    model.match(df_1_list, shortest_word_list)
except ValueError:
    print("Empty Dataframe, Can't Match - Check the URL Filter!")
    sys.exit()

model.group(link_min_similarity=sim_match_percent)
df_matched = model.get_matches()
Here you compute the similarity between df_1_list and shortest_word_list.
shortest_word_list is created by looking for substrings, which might lead to weird clusters in German because of compound words.
You could try to normalize the text with (language-specific) lemmatization before / instead of checking for substrings and creating clusters. This should transform each word into its "root form" while retaining its meaning.
You can use the spaCy library, which provides language-specific pretrained models for lemmatization, embeddings and other language operations.
You can select the correct model for each language and use the lemmatization function to replace each word of df_1_list with its "base form" before trying to cluster.
Lemmatization example
import spacy
nlp = spacy.load("en_core_web_sm") # load English or German model
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
Link to spaCy German model: https://spacy.io/models/de
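Applied to df_1_list from the notebook, a minimal sketch could look like this (assuming the German pipeline de_core_news_sm is installed; the variable names follow the snippet above):

import spacy

# German pipeline; for English use "en_core_web_sm" instead
nlp = spacy.load("de_core_news_sm")

def lemmatize(text):
    # replace every token with its lemma ("base form")
    return " ".join(token.lemma_ for token in nlp(text))

# normalize the keywords before building the clusters
df_1_list = [lemmatize(kw) for kw in df_1_list]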
Ok, so the title might sound a bit confusing, but here's an analogy of what I'm trying to achieve. Let's imagine that we have the following dataset:
Brand name    | Product type | Product_Description
Nike          | Shoes        | These black shoes are wonderful. They are elegant, and really comfortable
BMW           | Car          | This car goes fast. If you like speed, you'll like it.
Suzuki        | Car          | A family car, elegant and made for long journeys.
Call of Duty  | VideoGame    | A nervous shooter, putting you in the shoes of a desperate soldier, who has nothing left to lose.
Adidas        | Shoes        | Sneakers made for men, and women, who always want to go out with style.
This is just a made-up sample, but let's imagine this list goes on for a lot of other products.
What I'm trying to achieve here is to cluster the elements (whether they are shoes, cars, or videogames) based on the words used in their respective descriptions. I would thus obtain brands that are clustered together according to their description, but perhaps not belonging to the same type (e.g. Suzuki + Adidas), and get the names of the brands that are clustered together.
To do so, I relied on a word embedding method. After cleaning the descriptions (stop words, non-alphanumeric characters) and tokenizing them, I used a FastText model (the Wikipedia one) to compute embeddings for the product descriptions.
import re
import string
from string import punctuation

import numpy as np
import pandas as pd
import spacy
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# spaCy model used for lemmatization (the exact model name here is an assumption)
nlp_model = spacy.load("en_core_web_sm")

def clean_text(text, tokenizer, stopwords):
    text = str(text).lower()  # Lowercase words
    text = re.sub(r"\[(.*?)\]", "", text)  # Remove [+XYZ chars] in content
    text = re.sub(r"\s+", " ", text)  # Remove multiple spaces in content
    text = re.sub(r"\w+…|…", "", text)  # Remove ellipsis (and last word)
    text = re.sub(r"<a[^>]*>(.*?)</a>", r"\1", text)  # Remove html tags
    text = re.sub(f"[{re.escape(punctuation)}]", "", text)
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # Replace dash between words
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", "", text
    )  # Remove punctuation
    doc = nlp_model(text)
    tokens = [token.lemma_ for token in doc]
    # tokens = tokenizer(text)  # Get tokens from text
    tokens = [t for t in tokens if not t in stopwords]  # Remove stopwords
    tokens = ["" if t.isdigit() else t for t in tokens]  # Remove digits
    tokens = [t for t in tokens if len(t) > 1]  # Remove short tokens
    return tokens  # Clean the Text
def sent_vectorizer(sent, model):
    # average the word vectors of all tokens that are present in the embedding model
    sent_vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                sent_vec = model[w]
            else:
                sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:  # skip out-of-vocabulary words
            pass
    return np.asarray(sent_vec) / numw
df = pd.read_csv("./mockup.csv")
custom_stopwords = set(stopwords.words("english"))
df["Product_Description"] = df["Product_Description"].fillna("")
df["tokens"] = df["Product_Description"].map(lambda x: clean_text(x, word_tokenize, custom_stopwords))
model = KeyedVectors.load_word2vec_format('./wiki-news-300d-1M.vec')
The problem is that I'm a bit of a beginner in word embeddings and clustering. As I said, my goal would be to cluster brands according to the words used in their descriptions (the hypothesis being that some brands may be linked together through the words used in their descriptions), thus forgoing the old classification (shoes, cars, videogames...). I would also like to get the key brands of each cluster (so cluster 1 = Suzuki + Adidas, cluster 2 = Call of Duty + Nike, cluster 3 = BMW + ..., etc.).
Does anyone have any ideas on how to tackle this problem? I read several tutorials online on word embeddings and clustering, and to be completely honest, I am a bit lost.
Thank you for your help.
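One way to finish this pipeline is to turn each description into a single vector with sent_vectorizer and feed the resulting matrix to a standard clustering algorithm such as scikit-learn's KMeans. The sketch below assumes every description contains at least one in-vocabulary word, that the brand column is called "Brand name" as in the table above, and that n_clusters=3 is an arbitrary choice:

from sklearn.cluster import KMeans

# one vector per product description (average of its word vectors)
df["vector"] = df["tokens"].map(lambda toks: sent_vectorizer(toks, model))
X = np.vstack(df["vector"].values)

# cluster the description vectors; n_clusters is an assumption
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
df["cluster"] = kmeans.fit_predict(X)

# list the brands that end up in each cluster
for cluster_id, group in df.groupby("cluster"):
    print(cluster_id, group["Brand name"].tolist())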
I have a dataset containing 9000 sentences from which I need 20/20 statements based upon some conditions. However, when I try to match those conditions either the sentence is outputted or the conditions are not met. The first 20 sentences should contain one verb.
For the second part I would like to have sentences that contain more than 2 verbs.
Right now I have the following code for checking whether the number of verbs is less than 2:
import re
import spacy
import en_core_web_md

nlp = en_core_web_md.load()

test = "This sentence has just 1 verb"
test2 = "I have put multiple verbs in this sentence because it is possible and I want it"

doc1 = nlp(test)
doc2 = nlp(test2)

empt = []
for item in doc1.sents:
    verbs = 0
    for token in item:
        if token.pos_ == "VERB":
            verbs += 1
            if verbs < 2:
                empt.append(item)
However, I end up with an empty list.
Can someone tell me what I am doing wrong so I can adjust this code for every additional condition?
You just need to pull the last two lines back two indentation levels. You only want to check the number of verbs in the entire sentence after all the tokens have been considered.
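Concretely, a sketch of what that change looks like in your loop:

empt = []
for item in doc1.sents:
    verbs = 0
    for token in item:
        if token.pos_ == "VERB":
            verbs += 1
    # check the total only after every token in the sentence has been counted
    if verbs < 2:
        empt.append(item)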
I have multiple emails with a list of stocks, prices, and quantities. Each day the list is formatted a little differently, and I was hoping to use NLP to read in the data and reformat it to show the information in a consistent format.
Here is a sample of the emails I receive:
Symbol Quantity Rate
AAPL 16 104
MSFT 8.3k 56.24
GS 34 103.1
RM 3,400 -10
APRN 6k 11
NP 14,000 -44
As we can see, the quantity comes in varying formats, the ticker is always standard, and the rate can be positive or negative and may have decimals. Another issue is that the headers are not always the same, so they are not an identifier I can rely on.
So far I've seen some examples online where this works for names, but I am unable to implement it for stock tickers, quantities and prices. The code I've tried so far is below:
import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
string = """
To: "Anna Jones" <anna.jones#mm.com>
From: James B.
Hey,
This week has been crazy. Attached is my report on IBM. Can you give it a quick read and provide some feedback.
Also, make sure you reach out to Claire (claire#xyz.com).
You're the best.
Cheers,
George W.
212-555-1234
"""
def extract_phone_numbers(string):
    r = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})')
    phone_numbers = r.findall(string)
    return [re.sub(r'\D', '', number) for number in phone_numbers]

def extract_email_addresses(string):
    r = re.compile(r'[\w\.-]+#[\w\.-]+')
    return r.findall(string)

def ie_preprocess(document):
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            if type(chunk) == nltk.tree.Tree:
                if chunk.label() == 'PERSON':
                    names.append(' '.join([c[0] for c in chunk]))
    return names

if __name__ == '__main__':
    numbers = extract_phone_numbers(string)
    emails = extract_email_addresses(string)
    names = extract_names(string)
    print(numbers)
    print(emails)
    print(names)
This code does a good job with numbers, emails and names but I am unable to replicate this for the example I have and do not really know how to go about it. Any tips will be more than helpful.
You can construct regexes that will check for the numbers and amounts.
For the tickers, however, you will have to do something different. I suspect that the stock names are not always written in uppercase letters in the emails. If they are, just write a script that uses an API from one of the stock exchanges and runs only the words that are entirely uppercase against it. But if the stock names are not written in uppercase letters in the emails, you can do several things. You can check every word from the email against that stock exchange API to see whether it is a stock name. If you want to speed up that process, you can try part-of-speech tagging and run only the nouns and proper nouns against the API.
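For example, a rough regex-based sketch for the rows shown in the question (the pattern and the quantity normalization are assumptions about the formats you listed, not a general solution):

import re

line_pattern = re.compile(
    r'^(?P<symbol>[A-Z]{1,5})\s+'        # ticker: 1-5 uppercase letters
    r'(?P<quantity>[\d.,]+k?)\s+'        # quantity: 16, 8.3k, 3,400, 14,000 ...
    r'(?P<rate>-?\d+(?:\.\d+)?)\s*$'     # rate: positive/negative, optional decimals
)

def parse_quantity(q):
    # normalize "8.3k" -> 8300.0 and "3,400" -> 3400.0
    q = q.replace(',', '')
    if q.endswith('k'):
        return float(q[:-1]) * 1000
    return float(q)

email_body = """Symbol Quantity Rate
AAPL 16 104
MSFT 8.3k 56.24
RM 3,400 -10"""

for line in email_body.splitlines():
    m = line_pattern.match(line.strip())
    if m:  # header lines simply fail to match and are skipped
        print(m.group('symbol'), parse_quantity(m.group('quantity')), float(m.group('rate')))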
I have a large pandas dataframe. A column contains text broken down into sentences, one sentence per row. I need to check the sentences for the presence of terms used in various ontologies. Some of the ontologies are fairly large and have more than 100,000 entries. In addition, some of the ontologies contain molecule names with hyphens, commas, and other characters that may or may not be present in the text to be examined; hence the need for regular expressions.
I came up with the code below, but it's not fast enough to deal with my data. Any suggestions are welcome.
Thank you!
import pandas as pd
import re
sentences = ["""There is no point in driving yourself mad trying to stop
yourself going mad""",
"The ships hung in the sky in much the same way that bricks don’t"]
sentence_number = list(range(0, len(sentences)))
d = {'sentence' : sentences, 'number' : sentence_number}
df = pd.DataFrame(d)
regexes = ['\\bt\\w+', '\\bs\\w+']
big_regex = '|'.join(regexes)
compiled_regex = re.compile(big_regex, re.I)
df['found_regexes'] = df.sentence.str.findall(compiled_regex)
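For reference, with real ontology entries the combined pattern can be built from the (escaped) term list itself; this is only a sketch, and ontology_terms is a hypothetical list standing in for the actual ontology entries (longer terms are tried first so that, for example, a hyphenated name wins over its prefix):

ontology_terms = ['alpha-ketoglutarate', 'alpha-ketoglutaric acid', 'glucose']  # hypothetical entries

# escape special characters (hyphens, commas, parentheses, ...) and try longer terms first
escaped = sorted((re.escape(t) for t in ontology_terms), key=len, reverse=True)
big_regex = r'\b(?:' + '|'.join(escaped) + r')\b'
compiled_regex = re.compile(big_regex, re.I)

df['found_terms'] = df.sentence.str.findall(compiled_regex)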