Forgive me if I phrase anything badly.
We maintain a project developed in Python that performs sentiment analysis on customer comments. The steps are simple:
1. Extract the source comments and dump them into a .csv
2. Clean the comments and dump the clean comments into another .csv
3. Analyze the comments
4. Upload the sentiment results to the database
The problem is at step 3 and we don't understand what's going on. The situation is this:
The clean_data_path variable receives the file with the clean comments:
def run(clean_data_path, predictions_data_path):
    logger.info("Data analysis started!")
    logger.info("Loading clean data...")
    df = pd.read_csv(clean_data_path)
    logger.info("Loading comments to analyze...")
    review_id, _, comments = load(df)
    logger.info("Sentiment analysis...")
    domain_dict = load_domain_dict()
    nlp = NLP(
        domain_dict=domain_dict,
        domain_dict_cap=conf.DOMAIN_DICT_CAP,
        max_review_len=conf.MAX_REVIEW_LEN,
    )
    words = list(map(nlp.phrase_to_words, comments))
When executing the statement:
words = list(map(nlp.phrase_to_words, comments))
it simply fails to continue, and we don't know how to capture the error. We tried wrapping it in a try/except, but nothing comes out.
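For illustration, one way to wrap each call individually so the failing comment at least gets logged (a sketch using the same variable names as above, not the actual project code):

words = []
for comment in comments:
    try:
        words.append(nlp.phrase_to_words(comment))
    except BaseException as exc:  # BaseException also catches SystemExit / KeyboardInterrupt
        logger.error("phrase_to_words failed on %r: %r", comment, exc)
        raise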
If we look at the nlp.phrase_to_words we see the following code:
class NLP():
    """
    Class designed to carry out all the actions related with text
    manipulation and basic NLP tasks.
    """
    def __init__(self, domain_dict, domain_dict_cap, max_review_len):
        """
        TODO
        """
        self.domain_dict = domain_dict
        self.domain_dict_cap = domain_dict_cap
        self.max_review_len = max_review_len
        self.spacy_nlp = spacy.load('es_core_news_sm')

    ...

    def phrase_to_words(self, phrase):
        """
        Given a phrase as a string, it returns a list with the words
        in that phrase.

        Args
        ----
        phrase : Phrase to tokenize

        Returns
        -------
        List with the words of the phrase
        """
        print("before")
        doc = self.spacy_nlp(phrase)
        print("after", doc)
        return list(map(str, doc))
The failure happens when self.spacy_nlp(phrase) is called: it doesn't run, the process simply exits.
But the strangest thing is that if I execute the process passing a complete (absolute) path as clean_data_path, it works correctly and is able to execute doc = self.spacy_nlp(phrase), on the very same file the full process uses.
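To illustrate that comparison, a small sketch (standard library only, not part of the project code) that logs how the path resolves in each case:

import os

# Sketch: compare how the relative clean_data_path resolves at runtime
# versus the full path that works when passed explicitly.
logger.info("cwd: %s", os.getcwd())
logger.info("clean_data_path resolves to: %s", os.path.abspath(clean_data_path))
logger.info("file exists: %s", os.path.exists(clean_data_path))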
The process worked perfectly until a week ago.
Could someone give me some guidance, or has anyone come across something like this?
Greetings,
Related
I'm using read_csv from pandas to read in quite a large file, which some may be familiar with: the Urban Dictionary words and definitions. I have no problem doing this in the REPL, but when I put the code into a class in an actual file, iterate over the "list of dictionaries" and access the key by name, it returns NaN of type float (see the debug info).
I have to assume it's something to do with putting it in a class? Maybe I should stick to functions? Can anyone help an amateur out and clue me in as to what is going on here? P.S. Rather than take a parameter and read the file in every time, I'm going to just return the dictionary, but this would remain an issue. Here's the code:
def slang(self, term):
    """Attempts to define any given slang words or terms"""
    urban = {}
    doc = self.folder + '\\urbandict.csv'
    data = read_csv(doc, on_bad_lines='skip')
    recs = data.to_dict('records')
    for rec in recs:
        urban[rec['word'].lower()] = rec['definition']
    if urban.get(term.lower()) is None:
        messagebox.showinfo(
            title='No Results', message='Search could not find %s' % term)
    else:
        meaning = urban.get(term.lower())
        messagebox.showinfo(title='Definition', message=meaning)
P.P.S. The folder 📂 it's mapped to is correct and the file contents read in fine. The .lower() call isn't the cause, but it is referenced in the traceback; normally the value would be a string and it would work fine. It's just weird how it works in the REPL but not in the file.
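For reference, a standalone sketch of the same read_csv → to_dict('records') → lookup-dict flow (hypothetical file name, not the class method above). Note that empty cells come back from pandas as float NaN rather than strings, so the sketch guards before calling .lower():

import pandas as pd

# Standalone sketch (hypothetical file name, run from the folder containing the CSV).
data = pd.read_csv('urbandict.csv', on_bad_lines='skip')
urban = {}
for rec in data.to_dict('records'):
    word = rec['word']
    if pd.isna(word):
        continue  # skip records whose 'word' cell was empty (NaN, a float)
    urban[str(word).lower()] = rec['definition']
print(len(urban), 'terms loaded')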
I have been trying to use Stanford Core NLP over a data set but it stops at certain indexes which I am unable to find.
The data set is available on Kaggle: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/data
This is a function that outputs the sentiment of a paragraph by taking the mean sentiment value of individual sentences.
import json

def funcSENT(paragraph):
    all_scores = []
    output = nlp.annotate(paragraph, properties={
        "annotators": "tokenize,ssplit,parse,sentiment",
        "outputFormat": "json",
        # Only split the sentence at End Of Line. We assume that this method only takes in one single sentence.
        # "ssplit.eolonly": "true",
        # Setting enforceRequirements to skip some annotators and make the process faster
        "enforceRequirements": "false"
    })
    all_scores = []
    for i in range(0, len(output['sentences'])):
        all_scores.append(int(json.loads(output['sentences'][i]['sentimentValue'])) + 1)
    final_score = sum(all_scores) / len(all_scores)
    return round(final_score)
Now I run this function for every review in the 'Reviews' column, using the code below.
import pandas as pd

data_file = 'C:\\Users\\SONY\\Downloads\\Amazon_Unlocked_Mobile.csv'
data = pd.read_csv(data_file)

from pandas import *

i = 0
my_reviews = data['Reviews'].tolist()
senti = []
while i < data.shape[0]:
    senti.append(funcSENT(my_reviews[i]))
    i = i + 1
But somehow I get this error and I am not able to find the problem. It's been many hours now; kindly help. Here is the error screenshot: https://i.stack.imgur.com/qFbCl.jpg
How to avoid this error?
As I understand it, you're using pycorenlp with nlp = StanfordCoreNLP(...) and a running Stanford CoreNLP server. I won't check the data you are using, since it appears to require a Kaggle account.
Running the same setup with a different paragraph and printing "output" by itself shows an error from the Java server; in my case:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Input word not tagged
I think that because there is no part-of-speech annotator, the server cannot perform the parsing. Whenever you use parse or depparse, I believe you also need the "pos" annotator.
I am not sure what the sentiment annotator needs, but you may need other annotators such as "lemma" to get good sentiment results.
Print output by itself. If you get the same Java error, try adding the "pos" annotator and see whether you get the expected JSON. Otherwise, try to give a simpler example, using a small dataset of your own, and comment on or adjust your question.
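For instance, a sketch of the annotate call with "pos" (and "lemma") added, plus a type check on the output before indexing into it (as far as I can tell, pycorenlp hands back the server's error text as a plain string when the request fails; the annotator names themselves are standard CoreNLP ones):

# Sketch, assuming nlp = StanfordCoreNLP('http://localhost:9000') as with pycorenlp
output = nlp.annotate(paragraph, properties={
    "annotators": "tokenize,ssplit,pos,lemma,parse,sentiment",
    "outputFormat": "json",
    "enforceRequirements": "false",
})
if not isinstance(output, dict):
    print("CoreNLP server error:", output)   # e.g. the Java exception text
else:
    print([s["sentimentValue"] for s in output["sentences"]])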
I'm working on a GUI editor for a proprietary config format. Basically the editor will parse the config file, display the object properties so that users can edit them from the GUI, and then write the objects back to the file.
I've got the parse - edit - write part done, except for:
The parsed data structure only includes object property information, so comments and whitespace are lost on write
If there is any syntax error, the rest of the file is skipped
How would you address these issues? What is the usual approach to this problem? I'm using Python and the Parsec module (https://pythonhosted.org/parsec/documentation.html), but any help and general direction is appreciated.
I've also tried Pylens (https://pythonhosted.org/pylens/), which is really close to what I need, except it cannot skip syntax errors.
You asked about typical approaches to this problem. Here are two projects which tackle similar challenges to the one you describe:
sketch-n-sketch: "Direct manipulation" interface for vector images, where you can either edit the image-describing source language, or edit the image it represents directly and see those changes reflected in the source code. Check out the video presentation, it's super cool.
Boomerang: Using lenses to "focus" on the abstract meaning of some concrete syntax, alter that abstract model, and then reflect those changes in the original source.
Both projects have yielded several papers describing the approaches their authors took. As far as I can tell, the lens approach is popular, where parsing and printing become the get and put functions of a lens that takes some source code and focuses on the abstract concept that code describes.
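To make the lens idea concrete, here is a toy sketch (mine, not code from either project): get parses a "key = value" line into its value, and put writes a new value back while reusing the original text's spacing, so formatting survives the round trip.

import re

class KeyValueLens:
    """Toy lens over a single 'key = value' line."""
    _pattern = re.compile(r"^(\s*\w+\s*=\s*)(.*?)(\s*)$")

    def get(self, source):
        # parse: extract the abstract value from the concrete text
        return self._pattern.match(source).group(2)

    def put(self, source, new_value):
        # print: write the new value back, keeping the original spacing
        prefix, _, suffix = self._pattern.match(source).groups()
        return prefix + new_value + suffix

lens = KeyValueLens()
line = "  timeout =  30   "
print(lens.get(line))        # '30'
print(lens.put(line, "60"))  # '  timeout =  60   '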
Eventually I ran out of research time and had to settle for rather manual skipping. Basically, each time the parser fails we advance the cursor one character and try again. Anything skipped in the process, whether whitespace, comment, or syntax error, is dumped into a Text structure. The code is quite reusable, except that you have to wire it into every place where results may repeat and the original parser may fail.
Here's the code, in case it helps anyone. It is written for Parsy.
from parsy import Parser, Result


class Text(object):
    '''Structure to contain all the parts that the parser does not understand.
    A better name would be Whitespace.
    '''
    def __init__(self, text=''):
        self.text = text

    def __repr__(self):
        return "Text(text='{}')".format(self.text)

    def __eq__(self, other):
        return self.text.strip() == getattr(other, 'text', '').strip()


def many_skip_error(parser, skip=lambda t, i: i + 1, until=None):
    '''Repeat the original `parser`, aggregating results into `values`
    and errors into `Text`.
    '''
    @Parser  # wrap the function into a parsy Parser
    def _parser(stream, index):
        values, result = [], None
        while index < len(stream):
            result = parser(stream, index)
            # Original parser succeeded
            if result.status:
                values.append(result.value)
                index = result.index
            # Check for end condition, effectively `manyTill` in Parsec
            elif until is not None and until(stream, index).status:
                break
            # Aggregate skipped text into the last `Text` value, or create a new one
            else:
                if len(values) > 0 and isinstance(values[-1], Text):
                    values[-1].text += stream[index]
                else:
                    values.append(Text(stream[index]))
                index = skip(stream, index)
        return Result.success(index, values).aggregate(result)
    return _parser


# Example usage: original_parser is whatever parser you want to wrap
skip_error_parser = many_skip_error(original_parser)
On another note, I guess the real issue here is that I'm using a parser combinator library instead of a proper two-stage parsing process. In traditional parsing, the tokenizer handles retaining or skipping any whitespace/comments/syntax errors, making them all effectively whitespace that is invisible to the parser.
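As a rough illustration of that two-stage idea (a sketch, not Parsy code): the tokenizer attaches skipped whitespace and comments to tokens as "trivia", so the parser never sees them but a printer can still round-trip the original text.

import re
from typing import NamedTuple

class Token(NamedTuple):
    value: str     # the token text the parser cares about
    trivia: str    # whitespace/comments collected before this token

# Trivia = runs of whitespace or '#' comments; tokens = words, '=', or any other symbol
TOKEN_RE = re.compile(r"(?P<trivia>(?:\s+|#[^\n]*)*)(?P<tok>\w+|=|\S)")

def tokenize(source):
    """Yield tokens, each carrying the trivia that preceded it."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:          # only trailing trivia left
            break
        yield Token(m.group('tok'), m.group('trivia'))
        pos = m.end()

print(list(tokenize("a = 1  # comment\nb = 2")))
# [Token(value='a', trivia=''), Token(value='=', trivia=' '), Token(value='1', trivia=' '),
#  Token(value='b', trivia='  # comment\n'), Token(value='=', trivia=' '), Token(value='2', trivia=' ')]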
So as a bit of a thought experiment I coded up a function in Python that uses spaCy to find the subject of a news article title and then replace it with a noun of choice. The problem is, it doesn't work very well, and I was hoping it could be improved. I don't understand spaCy all that well, and the documentation is a bit hard to follow.
First, the code:
doc = nlp(thetitle)
for text in doc:
    # subject would be
    if text.dep_ == "nsubj":
        subject = text.orth_
    # iobj for indirect object
    if text.dep_ == "iobj":
        indirect_object = text.orth_
    # dobj for direct object
    if text.dep_ == "dobj":
        direct_object = text.orth_
try:
    subject
except NameError:
    if not thetitle:  # if empty title
        thetitle = "cat"
        subject = "cat"
    else:  # if unknown subject
        try:  # do we have a direct object?
            direct_object
        except NameError:
            try:  # do we have an indirect object?
                indirect_object
            except NameError:  # still no??
                subject = random.choice(thetitle.split())
            else:
                subject = indirect_object
        else:
            subject = direct_object
else:
    thecat = "cat"  # do nothing here, everything went okay
newtitle = re.sub(r"\b%s\b" % subject, toreplace, thetitle)
if newtitle == thetitle:  # if no replacement happened due to regex
    newtitle = thetitle.replace(subject, toreplace)
return newtitle
the "cat" lines are filler lines that don't do anything. "thetitle" is a variable for a random news article title I'm pulling in from RSS feeds. "toreplace" is the variable that holds the string to replace whatever the found subject is.
Let's use an example:
"Video Games that Should Be Animated TV Shows - Screen Rant" And here's the displaCy breakdown of that: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows%20-%20Screen%20Rant&model=en&cpu=1&cph=1
The word the code decided to replace ended up being "that", which isn't even a noun in this sentence; it seems to have fallen through to the random word choice, since it couldn't find a subject, indirect object, or direct object. My hope is that it would find something more like "Video Games" in this example.
I should note that if I take out the last bit (which appears to be the news source) in displaCy: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows&model=en&cpu=1&cph=1 it seems to think "that" is the subject, which is incorrect.
What is a better way to parse this? Should I look for proper nouns first?
Not directly answering your question, but I think the code below is far more readable because the conditions are explicit and what happens when a condition holds is not buried in an else clause far away. This code also handles the case of multiple objects.
As for your problem: any natural language processing tool will have a hard time finding the subject (or rather the topic) of a sentence fragment, since such tools are trained on complete sentences. I'm not even sure such fragments technically have subjects (I'm not an expert, though). You could try to train your own model, but then you would have to provide labelled sentences, and I don't know whether such a dataset already exists for sentence fragments.
I am not fully sure what you want to achieve, but looking at the common nouns and pronouns will likely surface the word you want to replace, and the first one to appear is probably the most important.
import spacy
import random
import re
from collections import defaultdict


def replace_subj(sentence, nlp):
    doc = nlp(sentence)
    tokens = defaultdict(list)
    for text in doc:
        tokens[text.dep_].append(text.orth_)
    if not sentence:
        return "cat"
    if "nsubj" in tokens:
        subject = tokens["nsubj"][0]
    elif "dobj" in tokens:
        subject = tokens["dobj"][0]
    elif "iobj" in tokens:
        subject = tokens["iobj"][0]
    else:
        subject = random.choice(sentence.split())
    return re.sub(r"\b{}\b".format(subject), "cat", sentence)


if __name__ == "__main__":
    sentence = """Video Games that Should Be Animated TV Shows - Screen Rant"""
    nlp = spacy.load("en")
    print(replace_subj(sentence, nlp))
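Regarding the noun fallback mentioned above, a small sketch of how that could look with spaCy part-of-speech tags (the NOUN/PROPN tag set is my choice here, not something from the code above):

def first_noun(doc):
    """Return the first common or proper noun in the doc, or None."""
    for token in doc:
        if token.pos_ in ("NOUN", "PROPN"):
            return token.text
    return None

# e.g. first_noun(nlp("Video Games that Should Be Animated TV Shows"))
# could serve as another fallback before picking a random word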
I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some Python code which names a sequence of lists. They look something like this:
#PHASE = 0
x = np.array(1,2,...)
y = np.array(3,4,...)
z = np.array(5,6,...)
#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)
#PHASE = 40
...
And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file which can then be read by ascii.read() as a Table object for manipulation in a different section of code.
My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function
def makeTable(a, b, c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output
Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code
fileName_phase = makeTable(a,b,c)
Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.
Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?
If efficiency and code reuse instead of copying is the goal, I think classes might provide a good way. I'm going to sleep now, but I'll edit later. Here are my thoughts: create a class called FileWithArrays and use a parser to read the lines and store them inside a FileWithArrays object. Once that's done, you can create a method that turns the object into a table.
P.S. A good idea for the parser is to store all the lines in a list and parse them one by one, using list.pop() to shrink the list as you go. Hope it helps; tomorrow I'll look at it more if this doesn't help much. Try to rewrite/reformat the question if I misunderstood anything, it's not very easy to read.
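A rough sketch of the idea (hypothetical code, just to illustrate; it assumes the real files contain full np.array([...]) literals as in the question's layout, and uses astropy's Table for the output):

import numpy as np
from astropy.table import Table

class FileWithArrays:
    """Parses one input file and keeps the x/y/z arrays of every phase."""

    def __init__(self, path):
        self.phases = {}                     # phase value -> {'x': arr, 'y': arr, 'z': arr}
        with open(path) as f:
            lines = f.read().splitlines()
        current = None
        while lines:                         # consume the list line by line
            line = lines.pop(0).strip()
            if not line:
                continue
            if line.startswith('#PHASE'):
                current = int(line.split('=')[1])
                self.phases[current] = {}
            elif current is not None and '=' in line:
                name, expr = (s.strip() for s in line.split('=', 1))
                # pull the numbers out of 'np.array([ ... ])'
                numbers = expr[expr.index('(') + 1:expr.rindex(')')].strip('[]')
                self.phases[current][name] = np.fromstring(numbers, sep=',')

    def to_table(self, phase):
        """Build an astropy Table from one phase's arrays."""
        output = Table()
        for name in ('x', 'y', 'z'):
            output[name] = self.phases[phase][name]
        return output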
I will suggest a way which will be scorned by many, but it will get your work done.
So apologies to everyone.
The prerequisite for this method is that you absolutely trust the correctness of the input files, which I guess you do (after all, he is your collaborator).
So the key point here is that the text in the files is code, which means it can be executed.
So you can do something like this
import re
import numpy as np # this is for the actual code in the files. You might have to install numpy library for this to work.
file = open("xyz.txt")
content = file.read()
Now that you have all the content, you have to separate it by phase.
For this we will use the re.split function.
phase_data = re.split("#PHASE = .*\n", content)
Now we have the content of each phase as a separate element of a list.
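For example, on a hypothetical two-phase string, the "#PHASE = ..." lines themselves are discarded and you get an empty leading element plus one chunk of code per phase:

import re

sample = "#PHASE = 0\nx = 1\n#PHASE = 30\nx = 2\n"
print(re.split("#PHASE = .*\n", sample))
# ['', 'x = 1\n', 'x = 2\n']  -> the empty first element is why we skip blank chunks below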
Now comes the part where we execute it.
for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z)  # the x, y and z are defined by the exec
    # do whatever you want with the table
I will reiterate that you have to absolutely trust the contents of the file, since you are executing it as code.
But your work seems like a scripting task and I believe this will get it done.
PS: The other, "safer" alternative to exec is to use a sandboxing library which takes the string and executes it without affecting the parent scope.
To avoid the safety issue of using exec as suggested by @Ajay Brahmakshatriya, while keeping his first processing step, you can create your own minimal 'phase parser', something like:
VARS = 'xyz'

def makeTable(phase):
    assert len(phase) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in phase[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        assert arr[:10] == 'np.array([' and arr[-2:] == '])'
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output
and then call
table = makeTable(phase)
instead of
exec(phase)
table = makeTable(x, y, z)
You could also drop all these assert statements without compromising safety; if the file is corrupted or not formatted as expected, the error that gets thrown might just be harder to understand...
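Putting it together with the re.split step from the previous answer (a sketch; here makeTable is given each phase as a list of its non-empty lines, and content is the text read from the file as before):

import re

tables = []
for chunk in re.split("#PHASE = .*\n", content):
    lines = [line for line in chunk.splitlines() if line.strip()]
    if not lines:
        continue  # skip the empty piece before the first '#PHASE'
    tables.append(makeTable(lines))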