Error in Data Processing in Gensim LDA using Pandas Dataframe - python

I am using Gensim LDA for topic modelling and pandas DataFrames for the processing, but I am getting an error:
TypeError: decoding to str: need a bytes-like object, Series found
I need to process the data using pandas only. The input data looks like this (one row per document):
PMID Text
12755608 The DNA complexation and condensation properties
12755609 Three proteins namely protective antigen PA edition
12755610 Lecithin retinol acyltransferase LRAT catalyze
My code is:
data = pd.read_csv("h1.csv", delimiter="\t")
data = data.dropna(axis=0, subset=['Text'])
data['Index'] = data.index
data["Text"] = data['Text'].str.replace('[^\w\s]', '')
data.head()

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token):
            result.append(lemmatize_stemming(token))
    return result

input_data = data.Text.str.strip().str.split('[\W_]+')
print('\n\n tokenized and lemmatized document: ')
print(preprocess(input_data))

Try this: map preprocess() over each document in the Text column instead of passing the whole Series to it.
import gensim
from gensim import corpora

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2:
            result.append(token)
    return result

# apply preprocess() to each raw text string (not the pre-split input_data Series)
doc_processed = data['Text'].map(preprocess)
dictionary = corpora.Dictionary(doc_processed)
# prepare a document-term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_processed]
# LDA model
Lda = gensim.models.ldamodel.LdaModel
# num_topics is the number of topics required,
# passes is the number of training passes to perform
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=dictionary, passes=2)
result = ldamodel.print_topics(num_topics=5, num_words=15)
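For reference, here is a minimal sketch (assuming ldamodel and doc_term_matrix were built as above) of how the learned topics and a document's topic mixture can be inspected:
# list each topic with its top weighted words
for topic_id, words in ldamodel.print_topics(num_topics=2, num_words=15):
    print(topic_id, words)

# topic distribution of the first document in the corpus
print(ldamodel.get_document_topics(doc_term_matrix[0]))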

Related

spaCy: spacy.tokens.doc.Doc to dataframe

I have a Spacy model for text generation, and I want to create a pandas data frame with all the texts that my Spacy model produces in each iteration. How can I save the spacy.tokens.doc.Doc output into a pandas dataframe?
nlp = spacy.load('en_core_web_sm')
newDataSet = pd.dataframe()
docs = nlp.pipe(df['Text'])
syn_augmenter = augmenty.load('random_synonym_insertion.v1', level=0.1)
for doc in augmenty.docs(docs, augmenter=syn_augmenter, nlp=nlp):
    newDataSet = newDataSet.add(doc)  # this produces an error
So you probably want to use the DframCy library to make that happen. It is also recommended by spaCy: https://spacy.io/universe/project/dframcy. A snippet I use is:
import pandas as pd
import spacy
from dframcy import DframCy
from tqdm import tqdm

nlp = spacy.load('en_core_web_trf')
dframcy = DframCy(nlp)
columns = ["id", "text", "start", "end", "pos_", "tag_", "dep_",
           "head", "ent_type_", "lemma_", "lower_", "is_punct", "is_quote", "is_digit"]

def get_features(item):
    doc = dframcy.nlp(item[1]["discourse_text"])
    annotation_dataframe = dframcy.to_dataframe(doc, columns=columns)
    annotation_dataframe['index'] = item[0]
    return annotation_dataframe

results = []
for item in tqdm(df.iterrows(), total=df.shape[0]):
    results.append(get_features(item))
features = pd.concat(results)
features
So the columns object denotes which attributes you want to have returned. This is passed to DframCy, which extracts the features and returns a nice dataframe per document. If you have a table of strings that you want to tokenize and extract features from, you need to iterate over it. tqdm tracks the overall progress of the for-loop. Concatenating the list of dataframes (one per doc) gives you a complete overview.
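As a small usage sketch (assuming the features dataframe built above, with the is_punct, lemma_ and index columns requested in the columns list), you can regroup the token-level rows back into one lemma list per original document:
# drop punctuation tokens and collect the lemmas of each source row
lemmas_per_doc = (
    features[~features["is_punct"]]
    .groupby("index")["lemma_"]
    .apply(list)
)
print(lemmas_per_doc.head())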

Python Error in NLP Function: "String indices must be integers"

I have individual CSVs and I am hoping to apply a topic model to each of them.
A CSV dataframe looks like:
<OUT>
PageNumber English_tags_only
59 people, trees, lego, water
The function I have defined is as:
def topic_model(grid_document):
    ''' this function is used to conduct topic modelling for each grid/document '''
    #text_list = grid_document['english_only_tags'].tolist()
    tokens = grid_document['english_only_tags'].astype(str).apply(nltk.word_tokenize)
    #tokens = map(nltk.word_tokenize, grid_document)
    #tokens = nltk.word_tokenize(grid_document)
    #convert tokenized lists into dictionary
    dictionary = corpora.Dictionary(tokens)
    #create document term matrix
    doc_term_matrix = [dictionary.doc2bow(tag) for tag in tokens]
    #initialise topic model from gensim
    LDA = gensim.models.ldamodel.LdaModel
    #build and train topic model
    lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=15, random_state=100,
                    chunksize=400, passes=50, iterations=100)
    #write top 20 words from each document as csv
    #top_words_per_topic = []
    #for t in range(lda_model.num_topics):
    #    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=20)])
    # create dataframe to capture main topic and perc contribution for document
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row_list in enumerate(lda_model[doc_term_matrix]):
        row = row_list[0] if lda_model.per_word_topics else row_list
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = lda_model.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    # Add original text to the end of the output
    contents = pd.Series(tokens)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    #add column names
    sent_topics_df.reset_index()
    sent_topics_df.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Text']
    #create a new dataframe to capture most representative/highest probability keywords in dominant topic per document
    sent_topics_sorteddf_mallet = pd.DataFrame()
    sent_topics_outdf_grpd = sent_topics_df.groupby('Dominant_Topic')
    for i, grp in sent_topics_outdf_grpd:
        sent_topics_sorteddf_mallet = pd.concat(
            [sent_topics_sorteddf_mallet, grp.sort_values(['Topic_Perc_Contrib'], ascending=False).head(1)], axis=0)
    # Reset Index
    sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)
    # Format
    sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Representative Text"]
    return sent_topics_sorteddf_mallet.to_csv("top_words_loop_dominant_topic.csv", mode="a", index=False, header=False)
And to iterate through each CSV and apply the function, I have:
from glob import glob
filenames = glob("Grid_Documents/grid*.csv")
print(filenames)

for f in filenames:
    topic_model(f)
I am getting the error from the title, "String indices must be integers".
The function works if I manually load each individual CSV, but when looped it fails with this error.
How would I be able to solve this? Thanks!
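This error usually means topic_model() is receiving the filename string f rather than a DataFrame, so grid_document['english_only_tags'] ends up indexing a string. A minimal sketch of the likely fix (assuming each CSV has the english_only_tags column used inside the function) is to read the file before calling it:
import pandas as pd
from glob import glob

for f in glob("Grid_Documents/grid*.csv"):
    grid_df = pd.read_csv(f)   # load the CSV into a DataFrame first
    topic_model(grid_df)       # pass the DataFrame, not the path string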

Topic model for each row in dataframe

I have a subset of a dataframe that looks like (note, the new_tags are not exhaustively illustrated here):
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})
<OUT>
PageNumber new_tags
175 flower architecture people...
162 hair red bobbles...
576 sweets chocolate shop...
I am hoping to iterate through each row (also termed a document), conduct a topic model, then extract the top 20 words from each topic into a CSV. I am using Gensim.
I have the code that works for conducting the topic model, but I am unsure how to do this by row. The issue I think I am having is that when converting the df into a dictionary it doesn't allow me to subset it for the loop.
Here is my progress at the moment:
First, I want to tokenize and lemmatize the tags.
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ']):
    output = []
    for sent in texts:
        doc = nlp(sent)
        output.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return output

#convert column to list
text_list = df['new_tags'].tolist()
#lemmatisation and tokenisation
tokenized_tags = lemmatization(text_list)
Next, I define a function to conduct a topic model and then write that to the csv.
i = 1

def topic_model(tokenized_tags):
    ''' this function is used to conduct topic modelling for each grid/document '''
    for row in tokenized_tags:
        #convert tokenized lists into dictionary
        dictionary = corpora.Dictionary(row)
        #create document term matrix
        doc_term_matrix = [dictionary.doc2bow(tag) for tag in row]
        #initialise topic model from gensim
        LDA = gensim.models.ldamodel.LdaModel
        #build and train topic model
        lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=40, random_state=100,
                        chunksize=400, passes=50, iterations=100)
        #write top 20 words from each document as csv
        top_words_per_topic = []
        for t in range(lda_model.num_topics):
            top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=20)])
        #return csv - write first row then append subsequent rows
        return pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv", mode='a', index=False, header=False)
        i += 1

topic_model(tokenized_tags)
As a side note, is there a way to work out the optimal parameters, e.g. the coherence value, for each document after running the topic model, and somehow adjust the model to take in the best value?
Any help is very much appreciated! Thanks!
UPDATED CODE:
I've updated the function so that I'm passing the tokenized version of the df, and I want to apply a topic model to each row and append the result to the df as a new column. How can I do this?
tokens = central_edi_posts_grouped['new_tags'].astype(str).apply(nltk.word_tokenize)

def topic_model(central_edi_posts_grouped):
    ''' this function is used to conduct topic modelling for each grid/document '''
    #convert tokenized lists into dictionary
    dictionary = corpora.Dictionary(tokens)
    #create document term matrix
    doc_term_matrix = [dictionary.doc2bow(tag) for tag in tokens]
    #initialise topic model from gensim
    LDA = gensim.models.ldamodel.LdaModel
    #build and train topic model
    lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=8, random_state=100,
                    chunksize=400, passes=50, iterations=100)
    #let's check out the coherence number
    from gensim.models.coherencemodel import CoherenceModel
    coherence_model_lda = CoherenceModel(model=lda_model, texts=tokens, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    #write top 20 words from each document as csv
    top_words_per_topic = []
    for t in range(lda_model.num_topics):
        top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=20)])
    #return csv - write first row then append subsequent rows
    pd.DataFrame(top_words_per_topic, coherence_lda, columns=['Topic', 'Word', 'P', 'Coherence_value']).to_csv("top_words_loop_test.csv", mode='a', index=False, header=False)
    return coherence_lda

df['new_col'] = df['new_tags'].apply(lambda tokens: topic_model((tokens)))
You can use the apply() function in pandas to do the row-wise iteration.
df['new_col'] = df['new_tags'].apply(lambda text_list: topic_model(lemmatization(text_list)))
You may have to modify your topic_model() function a bit, so that it returns just the values you need rather than a pd.DataFrame.
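For example, here is a minimal sketch of such a modified function (assuming the lemmatization() helper and df from the question; topic_model_row is a hypothetical name, and returning the per-row coherence value is just one possible choice):
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def topic_model_row(token_lists):
    # fit a small LDA model on one row's token lists and return a plain value
    dictionary = corpora.Dictionary(token_lists)
    doc_term_matrix = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda_model = LdaModel(corpus=doc_term_matrix, id2word=dictionary,
                         num_topics=2, random_state=100, passes=10)
    coherence = CoherenceModel(model=lda_model, texts=token_lists,
                               dictionary=dictionary, coherence='c_v').get_coherence()
    return coherence   # a scalar is safe to store in a DataFrame column

# lemmatization() returns a list of token lists for the tag strings in each row
df['new_col'] = df['new_tags'].apply(lambda tags: topic_model_row(lemmatization(tags)))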

How to add lemmatization and tokenization to scattertext

I am using scattertext to parse an xlsx document, but I am working in a non-English language and would very much like to add lemmatization and tokenization. I've checked these with spaCy alone and they work, but I have no clue how to integrate them into my scattertext plot.
import pandas as pd
import spacy
import pl_core_news_sm
nlp = spacy.load("pl_core_news_sm")
#nlp = pl_core_news_sm.load()
import scattertext as st
from pprint import pprint
from spacy.lang.pl.stop_words import STOP_WORDS

df = pd.read_excel("/home/poodle/Desktop/myfile.xlsx", sheet_name='Arkusz1', error_bad_lines=False)
corpus = st.CorpusFromPandas(
    df,
    category_col='Evaluation',
    text_col='Opis',
    nlp=st.whitespace_nlp_with_sentences).build().remove_terms(STOP_WORDS, ignore_absences=True)
html = st.produce_scattertext_explorer(corpus,
    category='Nonsense',
    category_name='Nonsense',
    not_category_name='Correct',
    minimum_term_frequency=0,
    width_in_pixels=800,
    metadata=corpus.get_df()['Autor'],
    save_svg_button=True)
open('./Convention-Visualization6.html', 'wb').write(html.encode('utf-8'))
Is my code overall ok?
Scattertext has a specific pipeline for displaying lemmas instead of tokens.
To start, please use spaCy to parse your data frame of documents instead of scattertext's whitespace tokenizer.
I'm using spaCy's English parser here, but you should be sure to use a Polish version, if available.
import scattertext as st
import spacy

nlp = spacy.load('en')
df = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(nlp)
)
Next, I create a Scattertext corpus from the data frame, using the column containing the spaCy Doc objects we created in the previous step.
Also, we use the st.FeatsFromSpacyDoc(use_lemmas=True) feature extractor to extract lemmas instead of tokens.
corpus = st.CorpusFromParsedDocuments(
    df, category_col='party', parsed_col='parse',
    feats_from_spacy_doc=st.FeatsFromSpacyDoc(use_lemmas=True)
)
I like to use only unigrams (unilemmas in this case) and isolate the 2,000 most informative lemmas to display.
corpus = corpus.build().get_unigram_corpus().compact(st.AssociationCompactor(2000))
Finally, I create an html object which makes each axis in the plot the dense rank of a lemma's frequency.
html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0, pmi_threshold_coefficient=0,
    width_in_pixels=1000, metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank,
    max_overlapping=3
)
open('./demo_lemmas.html', 'w').write(html)
print('open ./demo_lemmas.html in Chrome')
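Adapted to the question's own data, a rough sketch might look like the following (assuming the pl_core_news_sm model and the Evaluation/Opis/Autor columns from the question; adjust names as needed):
import pandas as pd
import spacy
import scattertext as st

# the Polish pipeline the question already loads; its tagger provides lemmas
nlp = spacy.load("pl_core_news_sm")

df = pd.read_excel("/home/poodle/Desktop/myfile.xlsx", sheet_name='Arkusz1')
df = df.assign(parse=lambda d: d['Opis'].apply(nlp))  # parse each document with spaCy

corpus = st.CorpusFromParsedDocuments(
    df, category_col='Evaluation', parsed_col='parse',
    feats_from_spacy_doc=st.FeatsFromSpacyDoc(use_lemmas=True)
).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))

html = st.produce_scattertext_explorer(
    corpus,
    category='Nonsense', category_name='Nonsense', not_category_name='Correct',
    minimum_term_frequency=0, width_in_pixels=800,
    metadata=corpus.get_df()['Autor'], save_svg_button=True
)
open('./Convention-Visualization6.html', 'wb').write(html.encode('utf-8'))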

Pandas AssertionError: 1 columns passed, passed data had 2 columns

I am working on an Azure ML implementation of text analytics with NLTK, and the following execution is throwing:
AssertionError: 1 columns passed, passed data had 2 columns\r\nProcess returned with non-zero exit code 1
Below is the code:
# The script MUST include the following function,
# which is the entry point for this module:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1=None, dataframe2=None):
    # import required packages
    import pandas as pd
    import nltk
    import numpy as np
    # tokenize the review text and store the word corpus
    word_dict = {}
    token_list = []
    nltk.download(info_or_id='punkt', download_dir='C:/users/client/nltk_data')
    nltk.download(info_or_id='maxent_treebank_pos_tagger', download_dir='C:/users/client/nltk_data')
    for text in dataframe1["tweet_text"]:
        tokens = nltk.word_tokenize(text.decode('utf8'))
        tagged = nltk.pos_tag(tokens)
    # convert feature vector to dataframe object
    dataframe_output = pd.DataFrame(tagged, columns=['Output'])
    return [dataframe_output]
The error is thrown here:
dataframe_output = pd.DataFrame(tagged, columns=['Output'])
I suspect the problem is the type of the tagged data passed to the DataFrame. Can someone let me know the right approach to add this to a DataFrame?
Try this: nltk.pos_tag() returns (token, tag) pairs, so the data has two columns and needs two column names:
dataframe_output = pd.DataFrame(tagged, columns=['Output', 'temp'])
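If you want more descriptive headers, a small variation of the entry point (the column names here are only illustrative, and the list accumulates tags from all tweets rather than only the last one processed in the loop):
def azureml_main(dataframe1=None, dataframe2=None):
    import pandas as pd
    import nltk
    rows = []
    for text in dataframe1["tweet_text"]:
        tokens = nltk.word_tokenize(text)
        rows.extend(nltk.pos_tag(tokens))   # each item is a (token, tag) pair
    # two values per row, so two column names
    dataframe_output = pd.DataFrame(rows, columns=['Token', 'POS'])
    return [dataframe_output]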
