I implemented LDA Mallet topic modeling on 37,000 article texts. Each article was categorized and its dominant topic was determined.
Now I want to create a DataFrame, in Python, that shows each topic's percentage for each article.
I want the data frame to look like this:
no | Text | Topic_Num_1 | Topic_Num_2 | .... | Topic_Num_25
01 | article text1 | 0.7529 | 0.0034 | .... | 0.0011
02 | article text2 | 0.3529 | 0.0124 | .... | 0.0001
....
(37,000 rows x 27 columns)
How would I do this?
Additionally: all the code I've written so far is based on the following site.
http://machinelearningplus.com/nlp/topic-modeling-gensim-python
How can I see the full list of topic probabilities for every single article?
Here's some example code for anyone who has just discovered this question, assuming that you have built an LDA model and want to concatenate the topic scores to a dataframe df.
import gensim
import numpy as np
import pandas as pd

# Train the model (keyword arguments avoid mixing up the positional order).
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word, num_topics=num_topics)
# Topic distribution for every document in the corpus.
lda_scores = lda_model[corpus]
# Convert the stream of (topic_id, score) pairs into a dense documents-by-topics array.
all_topics_csr = gensim.matutils.corpus2csc(lda_scores)
all_topics_numpy = all_topics_csr.T.toarray()
# Align with df's index; rows missing after reindexing are filled with 0.
all_topics_pandas = pd.DataFrame(all_topics_numpy).reindex(df.index).fillna(0)
df = pd.concat([df, all_topics_pandas], axis=1, join="inner")
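To match the layout in the question, the integer topic columns can then be renamed; a minimal sketch, assuming 25 topics and that df already contains the article text (the names Topic_Num_1 ... Topic_Num_25 follow the layout asked for above):
# Rename the integer topic columns 0..24 produced by the concat above
# (assumes num_topics == 25; adjust the range otherwise).
df = df.rename(columns={i: f"Topic_Num_{i + 1}" for i in range(25)})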
Related
I have a data frame with long comments and I want to split them into individual sentences using the spaCy sentencizer.
Comments = pd.read_excel('Comments.xlsx', sheet_name = 'Sheet1')
Comments
>>>
reviews
0 One of the rare films where every discussion leaving the theater is about how much you
just had, instead of an analysis of its quotients.
1 Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving,
and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that
re-watchability factor.
I loaded the model like this:
import spacy
nlp = spacy.load("en_core_web_sm")
And I used the sentencizer:
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x: list(nlp(x).sents))
But when I check, all of a comment's sentences end up in a single row, like this:
[One of the rare films where every discussion leaving the theater is about how much you just had.,
Instead of an analysis of its quotients.]
Thanks a lot for any help. I'm new to using NLP tools with DataFrames.
Currently, Data is a Series whose rows are lists of sentences, or more precisely, lists of spaCy Span objects. You probably want to obtain the text of these sentences and put each sentence on a separate row.
import pandas as pd

comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
            {'reviews': 'This is the first sentence of the second review. And this is the second.'}]
comments = pd.DataFrame(comments)  # building your input DataFrame
+----+--------------------------------------------------------------------------+
| | reviews |
|----+--------------------------------------------------------------------------|
| 0 | This is the first sentence of the first review. And this is the second. |
| 1 | This is the first sentence of the second review. And this is the second. |
+----+--------------------------------------------------------------------------+
Now let's define a function which, given a string, returns the list of its sentences as texts (strings).
def obtain_sentences(s):
    doc = nlp(s)
    sents = [sent.text for sent in doc.sents]
    return sents
The function can be applied to the comments DataFrame to produce a new DataFrame containing sentences.
data = comments.copy()
data['reviews'] = comments.apply(lambda x: obtain_sentences(x['reviews']), axis=1)
data = data.explode('reviews').reset_index(drop=True)
data
I used explode to transform the elements of the lists of sentences into rows.
And this is the obtained output!
+----+--------------------------------------------------+
| | reviews |
|----+--------------------------------------------------|
| 0 | This is the first sentence of the first review. |
| 1 | And this is the second. |
| 2 | This is the first sentence of the second review. |
| 3 | And this is the second. |
+----+--------------------------------------------------+
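For reference, the same result can be obtained without a named helper by building the sentence lists inline; a minimal sketch, assuming nlp already has the sentencizer in its pipeline as above:
# Split each review into sentence texts, then give each sentence its own row.
data = comments.copy()
data['reviews'] = data['reviews'].apply(lambda s: [sent.text for sent in nlp(s).sents])
data = data.explode('reviews').reset_index(drop=True)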
I am working on optimizing the operation below, whose execution time is relatively high on the actual (large) dataset. I tried the following on two PySpark datasets, 1 and 2, to arrive at the "page_category" column of dataset-2.
pyspark dataset-1 :
page_click | page_category
---------------------------
facebook | Social_network
insta    | Social_network
coursera | educational
Another dataset, on which I am applying the create_map operation, looks like:
pyspark dataset-2 :
id | page_click
---------------
1 | facebook
2  | coursera
I am creating a dictionary from dataset-1 and applying the following:
from itertools import chain
from pyspark.sql.functions import create_map, lit
page_map = create_map([lit(x) for x in chain(*dict_dataset_1.items())])
dataset_2 = dataset_2.withColumn('page_category', page_map[dataset_2['page_click']])
and then performing withColumn on the 'page_click' column of dataset-2 to arrive at another column called 'page_category'.
final dataset:
id | page_click | page_category
-------------------------------
1  | facebook   | Social_network
2  | coursera   | educational
But this operation is taking too much time to complete, more than 4-5 minutes. Is there another way to speed up the operation?
Thank you
Implement a simple broadcast join:
from pyspark.sql.functions import broadcast
# df1 = dataset-1 (the small lookup table), df2 = dataset-2
df2.join(broadcast(df1), df2.page_click == df1.page_click, 'left').\
    select(df2.id, df2.page_click, df1.page_category).show()
+---+----------+--------------+
| id|page_click| page_category|
+---+----------+--------------+
| 1| facebook|Social_network|
| 2| coursera| educational|
+---+----------+--------------+
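As a side note, the broadcast join ships the small lookup table to every executor instead of building one large literal map expression on the driver, which is usually why it is faster here. If the duplicate page_click column from the two-sided condition is a concern, the join can also be written on a shared key name; a minimal sketch, assuming df1 (dataset-1) is small enough to broadcast:
from pyspark.sql.functions import broadcast

# Joining on the column name keeps a single page_click column in the result.
df2.join(broadcast(df1), on='page_click', how='left') \
   .select('id', 'page_click', 'page_category') \
   .show()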
I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that the imdbRating field contains mixed data such as random strings, movie titles, movie URLs, and actual ratings. The dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF:
from pyspark.sql.functions import udf
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the UDF is called on each row, each row of the dataframe is mapped to a Row type, and hence None is returned for all the values.
Is there any straightforward way to filter out this data?
Any help will be much appreciated. Thank you
Finally, I was able to resolve it. The problem was that some corrupt rows did not have all fields present. First, I tried using pandas by reading the CSV file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had fewer columns than expected. I then tried to read the above pandas DataFrame, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out Spark's CSV reader has something similar which drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
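Another option for the original filtering question (a sketch, not part of the solution above) is to cast the column to double and drop the rows where the cast fails, assuming non-numeric strings should simply be discarded:
from pyspark.sql.functions import col

# Non-numeric strings become NULL after the cast and are then filtered out.
ratings = (imdb_data
           .withColumn('imdbRating', col('imdbRating').cast('double'))
           .filter(col('imdbRating').isNotNull())
           .sort('imdbRating'))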
from textblob import TextBlob
def sentiment_calc(text):
    try:
        return TextBlob(text).sentiment
    except:
        return None
test_df['sentiment score'] = test_df['text'].apply(sentiment_calc)
test_df
I recently ran the code above on my dataset to implement sentiment analysis using the TextBlob package. After running it, my sentiment column has the output below (an example table with dummy numbers).
text | sentiment score
------------------------
nice | (0.45, 0.433)
good | (0.45, 0.433)
ok   | (0.45, 0.433)
And the output I would like to get is this, where the sentiment column is split into two columns that are added onto the current dataframe:
text | polarity | subjectivity
------------------------------
nice | 0.45     | 0.433
good | 0.45     | 0.433
ok   | 0.45     | 0.433
Is there a way to do this in Python 2.7?
This is what you want to do with pandas:
import pandas as pd

sentiment_series = df['sentiment score'].tolist()
columns = ['polarity', 'subjectivity']
# Attach the two new columns to the existing dataframe instead of replacing it.
df[columns] = pd.DataFrame(sentiment_series, columns=columns, index=df.index)
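Since TextBlob's sentiment is a named tuple, the fields can also be pulled out directly; a small sketch, assuming the column holds the Sentiment objects (or None) produced by sentiment_calc above:
# Sentiment is a namedtuple exposing .polarity and .subjectivity.
df['polarity'] = df['sentiment score'].apply(lambda s: s.polarity if s is not None else None)
df['subjectivity'] = df['sentiment score'].apply(lambda s: s.subjectivity if s is not None else None)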
I've generated a dataframe data from a crosstab of a Spark DataFrame and want to perform a chi-squared test.
It seems that Statistics.chiSqTest can only be applied to a matrix. My DataFrame looks as below, and I want to see whether the level distribution is the same across the three groups: true, false, and Undefined.
from pyspark.mllib.stat import Statistics
+------+-----+------+---------+
|levels| true| false|Undefined|
+------+-----+------+---------+
|     1|32783|634460|  2732340|
|     2| 2139| 41248|    54855|
|     3|28837|573746|  5632147|
|     4|16473|320529|  8852552|
+------+-----+------+---------+
Is there any easy way to transform this in order to be used for chi-squared test?
One way to handle this without using mllib.Statistics:
import scipy.stats

crosstab = ...
scipy.stats.chi2_contingency(
    crosstab.drop(crosstab.columns[0]).toPandas().to_numpy()
)
If you really want Spark statistics:
from itertools import chain
from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.stat import Statistics

# DenseMatrix expects the values in column-major order, hence the zip/transpose.
Statistics.chiSqTest(DenseMatrix(
    numRows=crosstab.count(), numCols=len(crosstab.columns) - 1,
    values=list(chain(*zip(*crosstab.drop(crosstab.columns[0]).collect())))
))
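Either route yields a test statistic, p-value, and degrees of freedom; a brief sketch of reading them, where result stands for the value returned by Statistics.chiSqTest above and the SciPy call is the one shown earlier:
# SciPy: chi2_contingency returns (statistic, p-value, dof, expected counts).
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(
    crosstab.drop(crosstab.columns[0]).toPandas().to_numpy()
)
# Spark: the ChiSqTestResult object exposes the same quantities as attributes.
print(result.statistic, result.pValue, result.degreesOfFreedom)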