How to add lemmatization and tokenization to scattertext

I am using scattertext to visualize documents stored in an xlsx file. The text is not in English (it is Polish), so I would like to add lemmatization and tokenization. I have tried both in spaCy on its own and they work, but I have no clue how to integrate them into my scattertext plot.
import pandas as pd
import spacy
import pl_core_news_sm
nlp = spacy.load("pl_core_news_sm")
#nlp = pl_core_news_sm.load()
import scattertext as st
from pprint import pprint
from spacy.lang.pl.stop_words import STOP_WORDS
df = pd.read_excel("/home/poodle/Desktop/myfile.xlsx", sheet_name = 'Arkusz1', error_bad_lines = False)
corpus = st.CorpusFromPandas(
    df,
    category_col = 'Evaluation',
    text_col = 'Opis',
    nlp = st.whitespace_nlp_with_sentences).build().remove_terms(STOP_WORDS, ignore_absences=True)
html = st.produce_scattertext_explorer(corpus,
    category = 'Nonsense',
    category_name = 'Nonsense',
    not_category_name = 'Correct',
    minimum_term_frequency = 0,
    width_in_pixels = 800,
    metadata = corpus.get_df()['Autor'],
    save_svg_button = True)
open('./Convention-Visualization6.html', 'wb').write(html.encode('utf-8'))
Is my code overall ok?

Scattertext has a specific pipeline for displaying lemmas instead of tokens.
To start, please use spaCy to parse your data frame of documents instead of scattertext's whitespace tokenizer.
I'm using spaCy's English model here; for your data, load the Polish model instead (pl_core_news_sm, as in your question).
import scattertext as st
import spacy
# spaCy 3 removed the 'en' shortcut, so load the model by its full name
nlp = spacy.load('en_core_web_sm')
df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(nlp)
)
Next, I create a Scattertext corpus from the data frame, using the column containing the spaCy Doc objects we created in the previous step.
Also, we use the st.FeatsFromSpacyDoc(use_lemmas=True) feature extractor to extract lemmas instead of tokens.
corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse',
    feats_from_spacy_doc=st.FeatsFromSpacyDoc(use_lemmas=True)
)
I like to use only unigrams (unilemmas in this case) and isolate the 2,000 most informative lemmas to display.
corpus = corpus.build().get_unigram_corpus().compact(st.AssociationCompactor(2000))
Finally, I create an html object which makes each axis in the plot the dense rank of a lemma's frequency.
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank,
    max_overlapping=3
)
open('./demo_lemmas.html', 'w').write(html)
print('open ./demo_lemmas.html in Chrome')
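For the data frame in the question, the same steps might look like the sketch below. It assumes the column names from the question ('Opis' for the text, 'Evaluation' for the category) and that the Polish model pl_core_news_sm is installed. build_lemma_corpus is a hypothetical helper name, not part of scattertext.

```python
def build_lemma_corpus(df, nlp, text_col="Opis", category_col="Evaluation"):
    """Parse text_col with spaCy and build a lemma-based scattertext corpus.

    Hypothetical helper: the column defaults are taken from the question.
    """
    # Fail early, before any parsing, if an expected column is absent.
    missing = [col for col in (text_col, category_col) if col not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")

    # Imported here so the column check above needs no extra packages.
    import scattertext as st

    # Store one spaCy Doc per row, as in the answer above.
    parsed = df.assign(parse=df[text_col].astype(str).apply(nlp))
    return st.CorpusFromParsedDocuments(
        parsed,
        category_col=category_col,
        parsed_col="parse",
        feats_from_spacy_doc=st.FeatsFromSpacyDoc(use_lemmas=True),
    ).build()
```

With nlp = spacy.load("pl_core_news_sm"), the result of build_lemma_corpus(df, nlp) can be passed to st.produce_scattertext_explorer with category='Nonsense', as in the question.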

Related

Use Spacy with Pandas

I'm trying to build a multi-class text classifier using Spacy and I have built the model, but facing a problem applying it to my full dataset. The model I have built so far is in the screenshot:
Screenshot
Below is the code I used to apply to my full dataset using Pandas:
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = nlp(Messages['Body'])._.cats
But it gives me the error:
ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>
The reason I wanted to use Pandas in this case is the dataset has 2 columns: ID and Body. I want to apply the NLP model only to the Body column, but I want the final dataset to have 3 columns: ID, Body and the NLP result like in the screenshot above.
Thanks so much
I tried Pandas apply method too, but had no luck. Code used:
Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats
The error I got: AttributeError: 'Series' object has no attribute '_'
Expectation is to generate 3 columns as described above

You should pass a callable to the Series.apply call:
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
Here, each value in the Body column is passed to the lambda as x.
nlp(x) creates a spaCy Doc that carries the properties you'd like to access, and nlp(x)._.cats returns the expected value for that row.
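import spacy
import classy_classification  # importing it registers the pipeline component with spaCy
import csv
import pandas as pd
# Load the training examples for each category, one example per line
with open('Deliveries.txt', 'r') as d:
    Deliveries = d.read().splitlines()
with open("Not Spam.txt", "r") as n:
    Not_Spam = n.read().splitlines()
data = {}
data["Deliveries"] = Deliveries
data["Not_Spam"] = Not_Spam
# NLP model
nlp = spacy.blank("en")
nlp.add_pipe("text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)
# Load the messages as in the question, then classify each Body value
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)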
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
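For a large frame, spaCy's nlp.pipe processes texts in batches, which is usually faster than calling nlp once per row. A minimal sketch, assuming the component sets doc._.cats when run through nlp.pipe just as it does for single calls. categorize is a hypothetical helper name.

```python
def categorize(texts, nlp, batch_size=64):
    """Return one doc._.cats value per input text, in input order."""
    # nlp.pipe yields Doc objects in the same order as the input texts.
    return [doc._.cats for doc in nlp.pipe(texts, batch_size=batch_size)]
```

It could then be used as Messages['NLP_Result'] = categorize(Messages['Body'], nlp), since the returned list has one entry per row.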

spaCy: spacy.tokens.doc.Doc to dataframe

I have a Spacy model for text generation, and I want to create a pandas data frame with all the texts that my Spacy model produces in each iteration. How can I save the spacy.tokens.doc.Doc output into a pandas dataframe?
nlp = spacy.load('en_core_web_sm')
newDataSet = pd.DataFrame()
docs = nlp.pipe(df['Text'])
syn_augmenter = augmenty.load('random_synonym_insertion.v1', level=0.1)
for doc in augmenty.docs(docs, augmenter=syn_augmenter, nlp=nlp):
    newDataSet = newDataSet.add(doc) # this produces an error

You probably want to use the DframCy library to make that happen. It is listed in the spaCy Universe: https://spacy.io/universe/project/dframcy. A snippet I use is:
import pandas as pd
import spacy
from dframcy import DframCy
from tqdm import tqdm
nlp = spacy.load('en_core_web_trf')
dframcy = DframCy(nlp)
# Token attributes to return, one column per attribute
columns = ["id", "text", "start", "end", "pos_", "tag_", "dep_",
           "head", "ent_type_", "lemma_", "lower_", "is_punct", "is_quote", "is_digit"]
def get_features(item):
    # item is an (index, row) pair from df.iterrows()
    doc = dframcy.nlp(item[1]["discourse_text"])
    annotation_dataframe = dframcy.to_dataframe(doc, columns=columns)
    annotation_dataframe['index'] = item[0]
    return annotation_dataframe
results = []
for item in tqdm(df.iterrows(), total=df.shape[0]):
    results.append(get_features(item))
features = pd.concat(results)
features
The columns list names the token attributes you want returned. It is passed to dframcy, which extracts those features and returns a dataframe per document. If you have a table of strings that you want to tokenize and get features from, you need to iterate over it. tqdm tracks the overall progress of your for-loop. Concatenating the list of per-document dataframes gives you a complete overview.
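If adding a dependency is not an option, the same kind of table can be assembled by reading token attributes directly. A minimal sketch, assuming the requested attributes are plain values such as strings or booleans. doc_to_rows is a hypothetical helper name.

```python
def doc_to_rows(doc, doc_id, attrs=("text", "lemma_", "pos_", "is_punct")):
    """Return one dict per token, holding doc_id plus the requested attributes."""
    rows = []
    for token in doc:
        row = {"index": doc_id}
        for attr in attrs:
            # Read the attribute by name, e.g. token.lemma_
            row[attr] = getattr(token, attr)
        rows.append(row)
    return rows
```

Collecting the rows for every text, for example with rows = [r for i, text in df['Text'].items() for r in doc_to_rows(nlp(text), i)], and then calling pd.DataFrame(rows) gives one row per token.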

Topic model for each row in dataframe

I have a subset of a dataframe that looks like (note, the new_tags are not exhaustively illustrated here):
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']})
<OUT>
PageNumber  new_tags
175         flower architecture people...
162         hair red bobbles...
576         sweets chocolate shop...
I am hoping to iterate through each row (also termed a document) and conduct a topic model then extract the top 20 words from each topic into a csv. I am using Gensim.
I have the code that works for conducting the topic model, but I am unsure how to do this by row. The issue I think I am having is that when converting the df into a dictionary it doesn't allow me to subset it for the loop.
Here is my progress at the moment:
First, I want to tokenize and lemmatize the tags.
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
def lemmatization(texts,allowed_postags=['NOUN', 'ADJ']):
    output = []
    for sent in texts:
        doc = nlp(sent)
        output.append([token.lemma_ for token in doc if token.pos_ in allowed_postags ])
    return output
#convert column to list
text_list=df['new_tags'].tolist()
#lemmatisation and tokenisation
tokenized_tags = lemmatization(text_list)
Next, I define a function to conduct a topic model and then write that to the csv.
i = 1
def topic_model(tokenized_tags):
    ''' this function is used to conduct topic modelling for each grid/document '''
    for row in tokenized_tags:
        #convert tokenized lists into dictionary
        dictionary = corpora.Dictionary(row)
        #create document term matrix
        doc_term_matrix = [dictionary.doc2bow(tag) for tag in row]
        #initialise topic model from gensim
        LDA = gensim.models.ldamodel.LdaModel
        #build and train topic model
        lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=40, random_state=100, chunksize=400, passes=50,iterations=100)
        #write top 20 words from each document as csv
        top_words_per_topic = []
        for t in range(lda_model.num_topics):
            top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn = 20)])
        #return csv - write first row then append subsequent rows
        return pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv", mode='a', index = False, header=False)
        i+=1
topic_model(tokenized_tags)
As a side note, is there a way to work out the optimal parameters e.g. coherence value for each document after running the topic model and somehow adjust the model to take in the best value?
Any help is very much appreciated! Thanks!
UPDATED CODE:
I've updated the function so I'm passing the tokenized version of the df and wanting to apply a topic model to each row and append that onto the df as a new column. How will I be able to do this?
tokens = central_edi_posts_grouped['new_tags'].astype(str).apply(nltk.word_tokenize)
def topic_model(central_edi_posts_grouped):
    ''' this function is used to conduct topic modelling for each grid/document '''
    #convert tokenized lists into dictionary
    dictionary = corpora.Dictionary(tokens)
    #create document term matrix
    doc_term_matrix = [dictionary.doc2bow(tag) for tag in tokens]
    #initialise topic model from gensim
    LDA = gensim.models.ldamodel.LdaModel
    #build and train topic model
    lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=8, random_state=100,
                    chunksize=400, passes=50,iterations=100)
    #let's check out the coherence number
    from gensim.models.coherencemodel import CoherenceModel
    coherence_model_lda = CoherenceModel(model=lda_model, texts=tokens, dictionary=dictionary , coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    #write top 20 words from each document as csv
    top_words_per_topic = []
    for t in range(lda_model.num_topics):
        top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn = 20)])
    #return csv - write first row then append subsequent rows
    pd.DataFrame(top_words_per_topic, coherence_lda, columns=['Topic', 'Word', 'P', 'Coherence_value']).to_csv("top_words_loop_test.csv", mode='a', index = False, header=False)
    return coherence_lda
df['new_col'] = df['new_tags'].apply(lambda tokens: topic_model((tokens)))

You can use the apply() function in pandas to iterate over the rows.
df['new_col'] = df['new_tags'].apply(lambda text_list: topic_model(lemmatization(text_list)))
You may have to modify your topic_model() function a bit so that it returns just the values you need rather than a pd.DataFrame.
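One way to make the function apply-friendly is to have it take the tokens of a single row and return plain Python values. A minimal sketch, assuming gensim is installed and each row has already been reduced to a flat list of tokens. row_top_words and its parameter defaults are hypothetical. A single short row gives LDA very little to learn from, so treat the output as illustrative.

```python
def row_top_words(tokens, num_topics=2, topn=5):
    """Fit a small LDA model on one row's tokens and return the top words per topic."""
    if not tokens:
        return []  # nothing to model for an empty row

    # Imported here so the empty-row case needs no extra packages.
    from gensim import corpora
    from gensim.models.ldamodel import LdaModel

    # Treat the row as a one-document corpus.
    dictionary = corpora.Dictionary([tokens])
    bow_corpus = [dictionary.doc2bow(tokens)]
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=100, passes=10)
    # show_topic returns (word, probability) pairs; keep only the words.
    return [[word for word, _ in lda.show_topic(t, topn=topn)]
            for t in range(num_topics)]
```

Assuming each new_tags value holds a single string as in the question, this could be applied as df['new_col'] = df['new_tags'].apply(lambda tags: row_top_words(lemmatization(tags)[0])).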

pandas performance:text column replacement is slow

I have a large dataset with 250,000 entries, and the text column that I am processing contains one sentence in each row.
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')
from faker import Faker
fake = Faker()
df = pd.read_csv('my/huge/dataset.csv')
# e.g. df = pd.DataFrame({'text':['Michael Jackson was a famous singer and songwriter.']})
From the text column, I am trying to find the names of people, replace them with fake names from the faker library, and add the result to a new column, as follows.
person_list = [[n.text for n in doc.ents] for doc in nlp_news_sm.pipe(df.text.values) if [n.label_ == 'PER' for n in doc.ents]]
flat_person_list = list(set([item for sublist in person_list for item in sublist]))
fake_person_name = [fake.name() for n in range(len(flat_person_list))]
name_dict = dict(zip(flat_person_list, fake_person_name))
df.name = df.text.replace(name_dict, regex=True)
The problem is that it is taking forever to run and I am not sure how to enhance the performance of the code, so it can run faster.

OK, I think I found a better way of doing text replacement in pandas, thanks to Florian C's comment.
The spaCy model still takes a lot of time, and that part I cannot change. However, instead of replace with regex=True, I decided to use map with a lambda, so the last line is now as follows:
df.name = df.text.map(lambda x:name_dict.get(x,x))
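Note that map with name_dict.get only substitutes when an entire cell equals a key, so a name inside a longer sentence is left as it is. A possible alternative, sketched under the assumption that names should be replaced wherever they occur, compiles all names into one regular expression so each text is scanned only once. make_replacer is a hypothetical helper name.

```python
import re

def make_replacer(name_dict):
    """Build a function that swaps every key of name_dict found in a text."""
    if not name_dict:
        return lambda text: text  # nothing to replace
    # Longest names first, so "Michael Jackson" wins over "Michael".
    names = sorted(name_dict, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    # Look up the replacement for whichever name matched.
    return lambda text: pattern.sub(lambda m: name_dict[m.group(0)], text)
```

It could then be used as replace_names = make_replacer(name_dict) followed by df['name'] = df['text'].map(replace_names).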

How gather lists and load into dataframe

The following code creates a dataframe, tokenizes, and filters stopwords. However, I am stuck trying to properly gather the results to load back into a column of the dataframe. Trying to put the results back into the dataframe (using the commented code) produces the following error: ValueError: Length of values does not match length of index. It seems like the issue is with how I'm loading the lists back into the df. I think it is treating them one at a time. I'm not clear how to form a list of lists, which is what I think is needed. Neither append() nor extend() seems appropriate, or if they are, I'm not using them properly. Any insight would be greatly appreciated.
Minimal example
# Load libraries
import numpy as np
import pandas as pd
import spacy
# Create dataframe and tokenize
df = pd.DataFrame({'Text': ['This is the first text. It is two sentences',
'This is the second text, with one sentence']})
nlp = spacy.load("en_core_web_sm")
df['Tokens'] = ''
doc = df['Text']
doc = doc.apply(lambda x: nlp(x))
df['Tokens'] = doc
# df # check dataframe
# Filter stopwords
df['No Stop'] = ''
def test_loc(df):
    for i in df.index:
        doc = df.loc[i,'Tokens']
        tokens_no_stop = [token.text for token in doc if not token.is_stop]
        print(tokens_no_stop)
        # df['No Stop'] = tokens_no_stop # THIS PRODUCES AN ERROR
test_loc(df)
Result
['text', '.', 'sentences']
['second', 'text', ',', 'sentence']

As you mentioned, you need a list of lists for the assignment to work.
Another solution is to use pandas apply, as you did at the beginning of your code.
import numpy as np
import pandas as pd
import spacy
df = pd.DataFrame({'Text': ['This is the first text. It is two sentences',
'This is the second text, with one sentence']})
nlp = spacy.load("en_core_web_sm")
df['Tokens'] = df['Text'].apply(lambda x: nlp(x))
def remove_stop_words(tokens):
    return [token.text for token in tokens if not token.is_stop]
df['No Stop'] = df['Tokens'].apply(remove_stop_words)
Notice you don't have to create the column before assigning to it.
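The list-of-lists route mentioned at the start of this answer also works: collect one list per row inside the loop and assign the column once, after the loop. A minimal sketch; collect_non_stop is a hypothetical helper name.

```python
def collect_non_stop(docs):
    """Return one list of non-stopword token texts per document."""
    all_tokens = []
    for doc in docs:
        # append (not extend) keeps each row's tokens in its own inner list
        all_tokens.append([token.text for token in doc if not token.is_stop])
    return all_tokens
```

It could then be used as df['No Stop'] = collect_non_stop(df['Tokens']). The outer list has one entry per row, so its length matches the index.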
