Split data frame of comments into multiple rows - python

I have a data frame with long comments, and I want to split them into individual sentences using spaCy's sentencizer.
Comments = pd.read_excel('Comments.xlsx', sheet_name = 'Sheet1')
Comments
>>>
reviews
0 One of the rare films where every discussion leaving the theater is about how much you
just had, instead of an analysis of its quotients.
1 Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving,
and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that
re-watchability factor.
I loaded the model like this:
import spacy
nlp = spacy.load("en_core_web_sm")
And I set up the sentencizer like this:
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x: list(nlp(x).sents))
But when I check the result, all the sentences are still in a single row, like this:
[One of the rare films where every discussion leaving the theater is about how much you just had.,
Instead of an analysis of its quotients.]
Thanks a lot for any help; I'm new to using NLP tools with DataFrames.

Currently, Data is a Series whose rows are lists of sentences, or more precisely, lists of spaCy Span objects. You probably want to obtain the text of these sentences and put each sentence on its own row.
import pandas as pd

comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
            {'reviews': 'This is the first sentence of the second review. And this is the second.'}]
comments = pd.DataFrame(comments)  # building your input DataFrame
+----+--------------------------------------------------------------------------+
| | reviews |
|----+--------------------------------------------------------------------------|
| 0 | This is the first sentence of the first review. And this is the second. |
| 1 | This is the first sentence of the second review. And this is the second. |
+----+--------------------------------------------------------------------------+
Now let's define a function which, given a string, returns the list of its sentences as texts (strings).
def obtain_sentences(s):
    doc = nlp(s)
    sents = [sent.text for sent in doc.sents]
    return sents
The function can be applied to the comments DataFrame to produce a new DataFrame containing sentences.
data = comments.copy()
data['reviews'] = comments.apply(lambda x: obtain_sentences(x['reviews']), axis=1)
data = data.explode('reviews').reset_index(drop=True)
data
I used explode to transform the elements of the lists of sentences into rows.
And this is the obtained output!
+----+--------------------------------------------------+
| | reviews |
|----+--------------------------------------------------|
| 0 | This is the first sentence of the first review. |
| 1 | And this is the second. |
| 2 | This is the first sentence of the second review. |
| 3 | And this is the second. |
+----+--------------------------------------------------+
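If the real DataFrame is large, here is a minimal sketch of the same idea using nlp.pipe, which batches the texts instead of calling nlp() once per row (assuming the sentencizer pipeline nlp defined in the question):

# Batch-process all reviews with nlp.pipe, then explode as before
sentences = [[sent.text for sent in doc.sents]
             for doc in nlp.pipe(comments['reviews'])]
data = comments.assign(reviews=sentences).explode('reviews').reset_index(drop=True)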

Related

Apyori, outputting a rule with more than two items isn't working

I'm a total Python noob and programming beginner, but I'm trying to analyse some fictional data using apyori in Python for school.
I didn't write most of this program myself; I got it from a Jupyter Notebook my teacher gave me, and I understand most of it, except for the actual creation of the data frame we output at the end.
My biggest issue is this, though:
If I output the entire rule for one of the rows in my data frame, with
print(association_results[0])
I get this output:
RelationRecord(items=frozenset({'Spiderman 3', 'Moana', 'Tomb Rider'}), support=0.007199040127982935, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Moana', 'Tomb Rider'}), items_add=frozenset({'Spiderman 3'}), confidence=0.20300751879699247, lift=3.0825089038385434)])
If I understand it correctly, that should essentially mean "when someone buys Moana and Tomb Rider, they're likely to also buy Spiderman 3". However, in my data frame at the end I only get the output "when someone buys Moana, they're likely to also buy Spiderman 3".
That happens for multiple rows in my data frame, and I couldn't find an example online of anyone having a rule with two items as the "when", so I don't understand how I can output both movies into the data frame.
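For reference, a quick editorial sketch (based only on the fields visible in the printed RelationRecord above, and using association_results as defined in the code below) of how to reach both antecedent movies by attribute rather than by position:

stat = association_results[0].ordered_statistics[0]
print(stat.items_base)  # frozenset({'Moana', 'Tomb Rider'}), the "when bought" side
print(stat.items_add)   # frozenset({'Spiderman 3'}), the "likely bought as well" side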
import pandas as pd
from apyori import apriori

movie_data = pd.read_csv(r"C:\Users\XY\Desktop\movie_dataset.csv", header=None)  # import data
num_records = len(movie_data)
#########
records = []
for i in range(0, num_records):
    records.append([str(movie_data.values[i, j]) for j in range(0, 20)])
for i, j in enumerate(records):  # deletes empty items out of the data frame (this was an issue before,
    while 'nan' in records[i]:   # because it output "when Green Lantern is bought, nothing is bought as well")
        records[i].remove('nan')
association_rules = apriori(records, min_support=0.0053, min_confidence=0.20, min_lift=3, min_length=2)
association_results = list(association_rules)
results = []
for item in association_results:
    # first index of the inner list
    # contains base item and add item
    pair = item[0]
    items = [x for x in pair]
    value0 = str(items[0])
    value1 = str(items[1])
    # second index of the inner list
    value2 = str(item[1])[:7]
    # third index of the list located at 0th
    # of the third index of the inner list
    value3 = str(item[2][0][2])[:7]
    value4 = str(item[2][0][3])[:7]
    rows = (value0, value1, value2, value3, value4)
    results.append(rows)
labels = ['When bought', 'Likely bought as well', 'Support', 'Confidence', 'Lift']
movie_suggestion = pd.DataFrame.from_records(results, columns=labels)
print(movie_suggestion)
The output looks like this:
|    | When bought   | Likely bought as well | Support | Confidence | Lift    |
|----|---------------|-----------------------|---------|------------|---------|
| 0  | Red Sparrow   | Green Lantern         | 0.00573 | 0.30069    | 0.30069 |
| 1  | Green Lantern | Star Wars             | 0.00586 | 0.37288    | 0.37288 |
| 2  | Kung Fu Panda | Jumanji               | 0.01599 | 0.32345    | 0.32345 |
| 3  | Wonder Woman  | Jumanji               | 0.00533 | 0.37735    | 0.37735 |
| 4  | Spiderman 3   | The Spy Who Dumped Me | 0.00799 | 0.27149    | 0.27149 |
| 5  | Moana         | Spiderman 3           | 0.00533 | 0.23255    | 0.23255 |
etc.
Instead of:
|    | When bought       | Likely bought as well | Support | Confidence | Lift    |
|----|-------------------|-----------------------|---------|------------|---------|
| 0  | Red Sparrow       | Green Lantern         | 0.00573 | 0.30069    | 0.30069 |
| 1  | Green Lantern     | Star Wars             | 0.00586 | 0.37288    | 0.37288 |
| 2  | Kung Fu Panda     | Jumanji               | 0.01599 | 0.32345    | 0.32345 |
| 3  | Wonder Woman      | Jumanji               | 0.00533 | 0.37735    | 0.37735 |
| 4  | Spiderman 3       | The Spy Who Dumped Me | 0.00799 | 0.27149    | 0.27149 |
| 5  | Moana, Tomb Rider | Spiderman 3           | 0.00533 | 0.23255    | 0.23255 |
I tried looking at all the variables to understand the data frame creation and figure out how to get the output I want, but I don't understand it, and as I said, I didn't find anything matching my issue.
I figured out a semi-acceptable answer for now.
Using string slicing for value0 and value1 like this:
value0 = str(item[2][0][0])[11:-2]
value1 = str(item[2][0][1])[11:-2]
And the output looks like this (didn't include confidence, support and lift for this):
| When bought            | Likely bought as well |
|------------------------|-----------------------|
| 'Intern', 'Tomb Rider' | 'World War Z'         |
and so on.
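A sturdier alternative to slicing the string representation, as a sketch (untested against this exact data): items_base and items_add are frozensets of strings, so they can be joined directly:

value0 = ', '.join(item[2][0][0])  # items_base, e.g. "Moana, Tomb Rider"
value1 = ', '.join(item[2][0][1])  # items_add, e.g. "Spiderman 3"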

How to create a new column in Spark using Python, based on another column?

My database contains one column of strings. I want to create a new column based on part of the string in another column. For example:
"content" "other column"
The father has two dogs father
One cat stay at home of my mother mother
etc. etc.
I thought of creating a list of the words I'm interested in. For example:
people = ['mother', 'father', ...]
Then I iterate over the column "content" and extract the matching word to insert into the new column:
def extract_people(df):
    column = []
    people = ['mother', 'father', ...]
    for row in df.select("content").collect():
        for word in people:
            if str(row).find(word):
                column.append(word)
                break
    return pd.Series(column)

f_pyspark = df_pyspark.withColumn('people', extract_people(df_pyspark))
This code doesn't work and gives me this error on the collect():
22/01/26 11:34:04 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 36)
java.lang.OutOfMemoryError: Java heap space
Maybe it's because my file is too large; it has 15 million rows.
How can I create the new column in a different way?
Using the following dataframe as an example:
+---------------------------------+
|content |
+---------------------------------+
|Thefatherhas two dogs |
|The fatherhas two dogs |
|Thefather has two dogs |
|Thefatherhastwodogs |
|One cat stay at home of my mother|
|One cat stay at home of mymother |
|Onecatstayathomeofmymother |
|etc. |
|my feet smell |
+---------------------------------+
You can do the following
from pyspark.sql import functions
arr = ["father", "mother", "etc."]
expression = (
    "CASE " +
    "".join(["WHEN content LIKE '%{}%' THEN '{}' ".format(val, val) for val in arr]) +
    "ELSE 'None' END")
df = df.withColumn("other_column", functions.expr(expression))
df.show()
+---------------------------------+------------+
|content |other_column|
+---------------------------------+------------+
|Thefatherhas two dogs |father |
|The fatherhas two dogs |father |
|Thefather has two dogs |father |
|Thefatherhastwodogs |father |
|One cat stay at home of my mother|mother |
|One cat stay at home of mymother |mother |
|Onecatstayathomeofmymother |mother |
|etc. |etc. |
|my feet smell |None |
+---------------------------------+------------+
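Equivalently, the same CASE logic can be written with the DataFrame API instead of a SQL string; a minimal sketch, assuming the same df and arr as above:

from functools import reduce
from pyspark.sql import functions as F

# Chain WHEN ... LIKE clauses over arr, falling back to 'None'
case_col = reduce(
    lambda acc, val: acc.when(F.col("content").like("%{}%".format(val)), val),
    arr[1:],
    F.when(F.col("content").like("%{}%".format(arr[0])), arr[0]),
).otherwise("None")
df = df.withColumn("other_column", case_col)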

How to construct a dataframe with LDA in Python

Based on 37,000 article texts, I implemented LDA Mallet topic modeling. Each article was properly categorized, and the dominant topic of each was determined.
Now I want to create a dataframe, in Python, that shows each topic's percentage for each article.
I want the data frame to look like this:
no | Text          | Topic_Num_1 | Topic_Num_2 | .... | Topic_Num_25
01 | article text1 | 0.7529      | 0.0034      | .... | 0.0011
02 | article text2 | 0.3529      | 0.0124      | .... | 0.0001
....
(a 37,000 x 27 table)
How would I do this?
All the code I've written so far is based on the following site:
http://machinelearningplus.com/nlp/topic-modeling-gensim-python
How can I see the full list of topic probabilities for every single article?
Here's a useful link for anyone that has just discovered this question.
I'm also pasting some example code, assuming that you have built an LDA model and that you want to concatenate the topic scores to a dataframe df.
import gensim
import pandas as pd

lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word, num_topics=num_topics)
lda_scores = lda_model[corpus]
# Sparse topics-x-documents matrix, transposed into documents-x-topics
all_topics_csr = gensim.matutils.corpus2csc(lda_scores)
all_topics_numpy = all_topics_csr.T.toarray()
all_topics_pandas = pd.DataFrame(all_topics_numpy).reindex(df.index).fillna(0)
df = pd.concat([df, all_topics_pandas], axis=1, join="inner")
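One caveat: by default gensim omits topics whose probability falls below a small threshold, so to see every topic's score for every article the model can be built with minimum_probability=0; the columns can then be renamed to the desired layout. A sketch, with the column names assumed:

lda_model = gensim.models.LdaMulticore(
    corpus=corpus, id2word=id2word, num_topics=25, minimum_probability=0)
# Rename the topic columns to Topic_Num_1 ... Topic_Num_25
all_topics_pandas.columns = ['Topic_Num_{}'.format(i + 1)
                             for i in range(all_topics_pandas.shape[1])]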

Extracting the text after a certain value in pandas

I am trying to extract the values in a column which has text data as below:
create date:1953/01/01 | first author:REAGAN RL
How can I extract the author name from the column and store it in a new column?
I tried the following ways:
df.str.extract("first author:(.*?)")
and
authorname = df['EntrezUID'].apply(lambda x: x.split("first author:")). The second one worked.
How can I use regular expressions to achieve the same thing?
You can do:
## sample data
df = pd.DataFrame({'dd':['create date:1953/01/01 | first author:REAGAN RL','create date:1953/01/01 | first author:MEGAN RL']})
## output
df['names'] = df['dd'].str.extract(r'author\:(.*)')
print(df)
dd names
0 create date:1953/01/01 | first author:REAGAN RL REAGAN RL
1 create date:1953/01/01 | first author:MEGAN RL MEGAN RL
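As an aside (not from the original answer): the first attempt, "first author:(.*?)", likely returned empty matches because a trailing lazy quantifier is satisfied by the empty string. Anchoring the pattern to the end of the string also works; a small sketch against the same sample data:

df['names'] = df['dd'].str.extract(r'first author:(.*)$')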

Python: How to collapse the contents of several rows into one cell when importing CSV in pandas

I am trying to import a txt file containing radiology reports from patients. Each row is supposed to be a radiology exam (MRI/CT/etc). The original txt file looks something like this:
Name | MRN | DOB | Type_Imaging | Report_Status | Report_Text
John Doe | 1234 | 01/01/1995 | MRI |Complete | Exam Number: A5678
Report status: final
Type: MRI of brain
-----------
REPORT:
HISTORY: History of meningioma, surveillance
FINDINGS: Again demonstrated is a small left frontal parasaggital meningioma, not interval growth. Evidence of cerebrovascular disease unchanged from prior.
Again demonstrated are post-surgical changes associated with prior craniotomy.
[report_end]
James Smith | 5678 | 05/05/1987 |CT | Complete |Exam Number: A8623
Report status: final
Type: CT of chest
-----------
REPORT:
HISTORY: Admitted patient with new fever, concern for pneumonia
FINDINGS: A CT of the chest demostrates bla bla bla
bla bla bla
[report_end]
When I import this into pandas using pd.read_csv('filename', sep='|', header=0), the df I get has only "Exam Number: A5678" as the report text in the first row. The next row then has "Report status: final" in the first cell and NaN in the rest; the third row starts with "Type: MRI of brain" in the first cell and NaN in the rest, and so on.
It seems like the import is treating both my defined delimiter ('|') and the line breaks in the original txt as row separators when reading the file. There are no '|' characters within the text of the reports.
Is there a way to import this file so that all the information between "Exam Number: A5678" and "[report end]" is collapsed into one cell (the last cell in each row)?
Alternatively, I was considering pre-processing this as a text file to extract all the report texts iteratively and append them to a list that I could eventually add to a df as a column. Looking online and on SO, I haven't found a way to do this when I need unique start ("Exam Number:") and end ("[report end]") delimiters for the string of interest, nor a way to have the script continue reading where it left off (as opposed to just extracting the first report text).
Any thoughts?
Thanks!
Maya
Please be careful that your [report_end] is consistent. You gave both [report_end] and [report end]. I'm assuming that is a typo.
Assuming your file name is test.txt
import pandas as pd

txt = open('test.txt').read()
names, txt_ = txt.split('\n', 1)  # first line holds the column names
names = names.split('|')
pd.DataFrame(
    [t.strip().split('|') for t in txt_.split('[report_end]') if t.strip()],
    columns=names)
Name MRN DOB Type_Imaging Report_Status Report_Text
0 John Doe 1234 01/01/1995 MRI Complete Exam Number: A5678\nReport status: final\nTyp...
1 James Smith 5678 05/05/1987 CT Complete Exam Number: A8623\nReport status: final\nType...
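Since the '|'-split can leave padding spaces around the header fields, an optional cleanup sketch:

df.columns = [c.strip() for c in df.columns]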
I ended up doing this, which worked:
import re
import pandas as pd

f = open("filename.txt", "r")
data = f.read().replace("\n", "")
matches = re.findall(r"\|Exam Number:(.*?)\[report_end\]", data, re.DOTALL)
df = pd.read_csv("filename.txt", sep="|", parse_dates=[5]).dropna(axis=0, how="any")
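A hedged follow-up sketch (not from the original post): the matches list is built but never attached to df; assuming the extracted reports line up one-to-one with the surviving rows, they could replace the truncated last column like this:

df['Report_Text'] = matches  # assumes len(matches) == len(df), in file order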
