I am trying to get the output as a string using LexRankSummarizer in the sumy library.
I am using the following code (pretty straightforward):
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# text holds the input document as a single string
parser = PlaintextParser.from_string(text, Tokenizer('english'))
summarizer = LexRankSummarizer()
sum_1 = summarizer(parser.document, 10)
sum_lex = []
for sent in sum_1:
    sum_lex.append(sent)
Using the above code I am getting output in the form of Sentence objects rather than plain strings. Consider the summary given below, produced from an input text:
The Mahājanapadas were sixteen kingdoms or oligarchic republics that existed in ancient India from the sixth to fourth centuries BCE.
Two of them were most probably ganatantras (republics) and others had forms of monarchy.
Using the above code I get the output as:
sum_lex = [<Sentence: The Mahājanapadas were sixteen kingdoms or oligarchic republics that existed in ancient India from the sixth to fourth centuries BCE.>,
<Sentence: Two of them were most probably ganatantras (republics) and others had forms of monarchy.>]
However, if I use print(sent) I get the proper output as given above.
How do I tackle this issue?
Replacing sum_lex.append(sent) with sum_lex.append(str(sent)) should do it: str() converts each Sentence object to its plain text.
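For example (a minimal sketch reusing the variables from the question):

sum_lex = []
for sent in sum_1:
    sum_lex.append(str(sent))   # plain string instead of a Sentence object

# or, more compactly:
sum_lex = [str(sent) for sent in summarizer(parser.document, 10)]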
for doc in sample['documents']:
The error is that 'sample' is undefined (I was trying to reproduce a natural language processing model).
I think you are searching for ways to work with natural language processing, and I think this will be helpful to you. I mention the link below, so please check it out:
https://www.tableau.com/learn/articles/natural-language-processing-examples
In this case, your problem is the way you are reading the input. It's no big deal, no worries!
In the loop:
for doc in sample['documents']:
sample is the DataFrame (or a dictionary) holding your input, and 'documents' is the name of the column.
Let's suppose I have a csv of input like the following:
documents,label
Being offensive isnt illegal you idiot, negative
Loving the first days of summer! <3, positive
I hate when people put lol when we are having a serious talk ., negative
In Python you read the csv into a pandas DataFrame, for example:
import pandas as pd
sample = pd.read_csv('inputdata.csv', header=0)
and sample['documents'] is the first column of the input file. header=0 means that the names of your columns are given in the first line of the csv.
for doc in sample['documents'] will iterate over the first column, like this:
Being offensive isnt illegal you idiot
Loving the first days of summer! <3
I hate when people put lol when we are having a serious talk
This means that the origin of your error is probably that you named your input data something other than sample, or that the header of the csv input is not being read.
If the csv doesn't have documents as a header name, you can specify it like this:
columns = ['documents', 'labels']
sample = pd.read_csv('inputdata.csv', header=None, names=columns)
sample
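Putting it together, a minimal end-to-end sketch (assuming the file really is called inputdata.csv and is laid out like the sample above):

import pandas as pd

sample = pd.read_csv('inputdata.csv', header=0)   # first line holds the column names

for doc in sample['documents']:
    print(doc)   # each text of the 'documents' column, one per iteration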
Hope it helps!
I have a question about input().
description = input('add description: ')
I'm pasting in a block of text using Ctrl+C and Ctrl+V.
For example:
"The short story is a crafted form in its own right. Short stories
make use of plot, resonance, and other dynamic components as in a
novel, but typically to a lesser degree. While the short story is
largely distinct from the novel or novella/short novel, authors
generally draw from a common pool of literary techniques.
Determining what exactly separates a short story from longer fictional
formats is problematic. A classic definition of a short story is that
one should be able to read it in one sitting, a point most notably
made in Edgar Allan Poe's essay "The Philosophy of Composition"
(1846)"
Result is:
description = "The short story is a crafted form in its own right. Short stories make use of plot, resonance, and other dynamic components as in a novel, but typically to a lesser degree. While the short story is largely distinct from the novel or novella/short novel, authors generally draw from a common pool of literary techniques."
However, I want description to hold the entire text I copied.
Normally the input() function terminates on an End Of Line or \n. I would suggest using a setup like this:
lines = []
while True:
    line = input()
    if line == "EOF":
        break
    else:
        lines.append(line)
text = ' '.join(lines)
What this does is read input and add it to a list until you type "EOF" on its own line and hit Enter. This should solve the multi-line problem.
The problem you're facing here is that an input ends as soon as Enter is hit or (in this case) the next line is started. The only way to use Enter (I'm just going to call it that, hope you know what I mean) is, instead of actually starting a new paragraph, to write \n, since that is the representation of Enter in a string. If you want to work around this issue, I highly recommend you learn how to use the Tkinter module: if you want to create any kind of frontend app, it is one of the best modules. Here is a link to get you started: https://www.tutorialspoint.com/python/python_gui_programming.htm
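If you go the Tkinter route, here is a minimal sketch to start from (the widget layout and function name are just an example, not from your code):

import tkinter as tk

def grab_description():
    # "1.0" means line 1, character 0; tk.END is the end of the widget's contents
    description = text_box.get("1.0", tk.END).strip()
    print(description)

root = tk.Tk()
text_box = tk.Text(root, width=60, height=10)   # multi-line box you can paste into
text_box.pack()
tk.Button(root, text="Done", command=grab_description).pack()
root.mainloop()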
I am trying to fix older code someone wrote years ago in Python. I believe the "\d\d\d\d" refers to the number of text characters, and [0-9A-Z] limits the type of input, but I can't find any documentation on this.
idTypes = {"PFI":"\d\d\d\d",
"VA HOSPITAL ID":"V\d\d\d",
"CERTIFICATION NUMBER":"\d\d\d-[A-Z]-\d\d\d",
"MORTUARY FIRM ID":"[0-9]",
"HEALTH DEPARTMENT ID":"[0-9]",
"NYSDOH OFFICE ID":"[0-9]",
"ACF ID":"AF\d\d\d\d",
"GENERIC NUMBER ID":"[0-9]",
"GENERIC ID":"[A-Za-z0-9]",
"OASAS FAC":"[0-9]",
"OMH PSYCH CTR":"[0-9A-Z]"}
Like the PFI values seem to be limited to 4 numeric digits in a string field, so 12345 doesn't work later in the code but 1234 does. Adding another \d doesn't appear to be the answer.
These are, apparently, regular expressions used to validate inputs. See https://docs.python.org/2/library/re.html
Without seeing the code that uses these values it is impossible to say more.
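For example, such a table of patterns is typically used something like this to validate an ID string (this is a sketch of the general idea, not necessarily how your code does it):

import re

# a small subset of the idTypes dict from the question
idTypes = {"PFI": r"\d\d\d\d",
           "ACF ID": r"AF\d\d\d\d"}

def is_valid(id_type, value):
    # anchoring the pattern at the end makes extra characters fail the match
    return re.match(idTypes[id_type] + r"$", value) is not None

print(is_valid("PFI", "1234"))    # True: exactly four digits
print(is_valid("PFI", "12345"))   # False: a fifth digit does not fit \d\d\d\d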
I have a task to search for a group of specific terms (around 138,000 terms) in a table made of 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column might contain more than one term.
I should end up with a csv table with the id where a term has been found and the term itself. What could be the best and the fastest way to do so?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive: it will only work when the terms in tumorabs.txt and neocl.xml are in exactly the same case. If you can't change your data, then modify the code as follows.
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
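To illustrate the point with a tiny self-contained example (the term set is made up; the title is shortened from your example):

literalhash = {"heart disease", "diabetes mellitus"}   # terms stored in lowercase

title = "Efficacy of Sarpogrelate on Ischemic Heart Disease"
words = title.lower().split()

found = []
for start in range(len(words)):
    for end in range(start + 1, len(words) + 1):
        phrase = ' '.join(words[start:end])
        if phrase in literalhash:
            found.append(phrase)

print(found)   # ['heart disease'] -- found only because both sides are lowercased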
To clarify the problem: we are running a small scientific project where we need to extract all text parts containing particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor, with a little bit of knowledge of Python!
If you need some more code, I can post it online.
I'm building a de-identification tool. It replaces all names with other names.
Input:
We got a report that <name>Peter</name> met <name>Jane</name> yesterday. <name>Peter</name> is suspicious.
Output:
We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.
It can be done on multiple documents, and one name is always replaced by the same counterpart, so you can still understand who the text is talking about. BUT all documents have an ID referring to the person the file is about (I'm working with files in a public service), and only documents with the same person ID will be de-identified the same way, with the same names (the goal is to follow a person's evolution and history). This is a security measure, so that when I hand over the tool to a third party, I don't hand over the key to my own documents with it.
So the same input, with a different ID, produces :
We got a report that <name>Henry</name> met <name>Alicia</name> yesterday. <name>Henry</name> is suspicious.
Right now, I'm hashing each name with the document ID as a salt, converting the hash to an integer, then subtracting the length of the name list until I can use that integer as an index to pick a name. But I feel like there should be a quicker/more straightforward approach?
It's really more of an algorithmic question, but in case it's relevant, I'm working with Python 2.7. Please ask for more explanation if needed. Thank you!
I hope it's clearer this way ô_o Sorry, when you are neck-deep in your code you forget that others need a bigger picture to understand how you got there.
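For reference, here is a minimal, simplified sketch of what I'm doing now (the name pool and the exact hashing are illustrative, not my real code):

import hashlib

NAME_POOL = ["Billy", "Elsa", "Henry", "Alicia", "Marc", "Nadia"]

def pseudonym(real_name, person_id):
    # Salt the name with the document's person ID so the same real name maps to
    # the same pseudonym only across files about the same person.
    digest = hashlib.sha256((person_id + ":" + real_name).encode("utf-8")).hexdigest()
    return NAME_POOL[int(digest, 16) % len(NAME_POOL)]

print(pseudonym("Peter", "42"))   # always the same pseudonym for ("Peter", "42")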
As #LutzHorn pointed out, you could just use a dict to map real names to false ones.
You could also just do something like:
existing_names = []
for nameoccurence in original_text:
    if nameoccurence.name not in existing_names:
        nameoccurence.id = len(existing_names)
        existing_names.append(nameoccurence.name)
    else:
        nameoccurence.id = existing_names.index(nameoccurence.name)

for idx, _ in enumerate(existing_names):
    existing_names[idx] = gimme_random_name()
Try using a dictionary of names.
import re

# s holds the input text from the question
s = "We got a report that <name>Peter</name> met <name>Jane</name> yesterday. <name>Peter</name> is suspicious."
names = {"Peter": "Billy", "Jane": "Elsa"}

for name in re.findall("<name>([a-zA-Z]+)</name>", s):
    s = re.sub("<name>" + name + "</name>", "<name>" + names[name] + "</name>", s)

print(s)
Output:
We got a report that <name>Billy</name> met <name>Elsa</name> yesterday. <name>Billy</name> is suspicious.