Text processing and detection from a specific dictionary in python


I have a text in English that I want to process to detect specific entries that I keep in a separate Python dictionary (example entry: mass spectroscopy). These entries are important because they need to be matched for later annotation. To do that I need to either add many forms of each entry (plurals, acronyms, etc.) or find a way to do the processing intelligently. Not only does the brute-force approach take much more time (for me), but I might not be able to resolve all the situations (I want mass spectroscopy, possibly spectroscopy, but not mass). I am not looking for a complete solution; I just need guidelines on how to approach the problem and which toolkit to use. The dictionary is growing, so an intelligent approach would be preferred.
I have found NLTK in Python, but I am not sure how to use my dict in addition to or instead of the built-in corpora.
Example - I have a sentence:
[u'Liquid', u'biopsies', u'based', u'on', u'circulating', u'cell-free', u'DNA', u'(cfDNA)', u'analysis', u'are', u'described', u'as', u'surrogate', u'samples', u'for', u'molecular', u'analysis.']
I have a dict with {'Liquid biopsy':['Blood for analysis'],'cfDNA':['Blood for analysis']}. The values are lists on purpose, so that both keys can refer to the same object, as a way of creating aliases in a dict.
How do I match my entries to the text?
Thanks in advance!

If I didn't misunderstand you, you want to check the dictionary keys against the list items and then print the results to the console.
dict_1 = {"Liquid Biopsy": "Blood for analysis", "cfDNA": "Blood for analysis", "Liquid Biopsies": "Blood for analysis"}
list_1 = [u'Liquid', u'biopsies', u'based', u'on', u'circulating', u'cell-free', u'DNA', u'(cfDNA)', u'analysis', u'are', u'described', u'as', u'surrogate', u'samples', u'for', u'molecular', u'analysis.']
# join the tokens into one lowercase string so multi-word keys can match
string_1 = " ".join(list_1).lower()
for i in dict_1:
    # case-insensitive substring test
    if i.lower() in string_1:
        print("Key: {}\nValue: {}\n".format(i, dict_1[i]))
I used the code above and the console printed the results below.
Key: Liquid Biopsies
Value: Blood for analysis
Key: cfDNA
Value: Blood for analysis
Process finished with exit code 0
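One caveat with a plain substring test is that a key such as mass would also match inside a longer word. A minimal sketch of one way around that, using only the standard re module: compile the known forms into a single pattern with word boundaries, longest form first. The aliases table and the find_entries helper are hypothetical names made up for this illustration, and every surface form (plural, acronym) still has to be listed by hand.

```python
import re

# Hypothetical alias table: each lowercase surface form maps to one annotation.
aliases = {
    "liquid biopsy": "Blood for analysis",
    "liquid biopsies": "Blood for analysis",
    "cfdna": "Blood for analysis",
}

def find_entries(text, aliases):
    """Return (matched text, annotation) pairs in order of appearance."""
    # Longest forms first, so a multi-word entry wins over a shorter one.
    forms = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), aliases[m.group(0).lower()]) for m in pattern.finditer(text)]

tokens = ['Liquid', 'biopsies', 'based', 'on', 'circulating', 'cell-free',
          'DNA', '(cfDNA)', 'analysis', 'are', 'described', 'as', 'surrogate',
          'samples', 'for', 'molecular', 'analysis.']
matches = find_entries(" ".join(tokens), aliases)
print(matches)
```

The word boundaries are what keep an entry like mass from matching inside massive. Handling plurals automatically would need a stemmer or lemmatizer on top of this.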

Related

How to find the root of a word from its present participle or other variations in Python?

I'm working on an NLP project, and right now I'm stuck on detecting antonyms for certain words that aren't in their "standard" forms (as verbs, adjectives, or nouns) but appear instead as present participles, past tense, or something to that effect. For instance, if I have the word "arriving" or "arrived", I need to convert it to "arrive". Similarly, "came" should be "come". Lastly, “dissatisfied” should be “dissatisfy”. Can anyone help me out with this? I have tried several stemmers and lemmatizers in NLTK with Python, to no avail; most of them don’t produce the correct root. I’ve also thought about the ConceptNet semantic network and other dictionary APIs, but they seem far too complicated for what I need. Any advice is helpful. Thanks!

If you know you'll be working with a limited set, you could create a dictionary.
Example:
look_up = {'arriving': 'arrive',
           'arrived': 'arrive',
           'came': 'come',
           'dissatisfied': 'dissatisfy'}
test = 'arrived'
print(look_up[test])
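Indexing the dictionary directly raises a KeyError for any word that is not in the table. A small sketch of a more forgiving lookup, assuming it is acceptable to return unknown words unchanged; the to_root helper is a hypothetical name used only for this illustration.

```python
look_up = {'arriving': 'arrive',
           'arrived': 'arrive',
           'came': 'come',
           'dissatisfied': 'dissatisfy'}

def to_root(word, table=look_up):
    """Return the root form if the word is known, otherwise the word itself."""
    return table.get(word.lower(), word)

roots = [to_root(w) for w in ["Arrived", "came", "running"]]
print(roots)
```

Here "running" comes back untouched because it is not in the table, which makes the gaps in the dictionary easy to spot.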

Extract keywords/phrases from a given short text using python and its libraries

From a user-supplied job description, I need to extract the keywords or phrases using Python and its libraries. I am open to suggestions and guidance from the community on which libraries work best, and if it is simple, please guide me through it.
Example of user input:
user_input = "i want a full stack developer. Specialization in python is a must"
Expected output:
keywords = ['full stack developer', 'python']

Well, a good keyword set is a good method, but the key is how to build it. There are many ways to do it.
The first and simplest is to search for an open keyword set on the web. It depends on your luck and your knowledge. Your keywords (like "python, java, machine learning") are common tags on Stack Overflow and recruitment websites. Don't break the law!
The second is information extraction (IE), which is more complex than the first. There are many algorithms, such as TextRank, entropy, Apriori, HMM, TF-IDF, conditional random fields, and so on.
Good luck.
For matching keywords/phrases, a trie is faster.
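To make the trie suggestion concrete, here is a minimal sketch of a word-level trie built from nested dicts. It assumes the keyword list is already known and that splitting on word characters is a good enough tokenizer; build_trie, find_phrases, and the END sentinel are hypothetical names chosen for this example.

```python
import re

END = None  # sentinel key marking the end of a complete phrase in the trie

def build_trie(phrases):
    """Build a nested-dict trie keyed on lowercase words."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[END] = phrase
    return root

def find_phrases(text, trie):
    """Scan the text once, keeping the longest phrase starting at each word."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        node, match, match_end = trie, None, i
        j = i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                match, match_end = node[END], j
        if match is not None:
            found.append(match)
            i = match_end  # continue after the matched phrase
        else:
            i += 1
    return found

keys = ['python', 'full stack developer', 'java', 'machine learning']
user_input = "i want a full stack developer. Specialization in python is a must"
keywords = find_phrases(user_input, build_trie(keys))
print(keywords)
```

Because punctuation is dropped by the tokenizer, a phrase could in principle match across a sentence boundary; splitting the text into sentences first would avoid that.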

Well, I answered my own question. Thanks anyway to those who replied.
user_input = "i want a full stack developer. Specialization in python is a must"
keys = ['python', 'full stack developer', 'java', 'machine learning']
keywords = []
for word in keys:
    # keep only the known keywords that actually occur in the input
    if word in user_input.lower():
        keywords.append(word)
print(keywords)
Output was as expected!

What's the equivalent of this Ruby code in Python?

#Ruby Code
si=1
gw=0
linkedNodes=[{1=>2},{1=>0}]
puts "found node" if linkedNodes.include?({si=>gw})
I'm trying to figure out if there's a way to do something like this in Python. I'm searching an array of hashes for a match on the complete hash, which is incredibly easy to do in Ruby using the include?() method. I found a lot of information about searching lists for a hash by key or by value, but I'm trying to match the entire hash (key and value). I read about a filter option using a lambda, but that quickly turned into a hot mess when I started getting exceptions and playing around with try: except: blocks.

Assuming you meant linkedNodes = [{1 => 2}, {1 => 0}], this is a literal translation to python:
>>> si=1
>>> gw=0
>>> linkedNodes = [{1:2},{1:0}]
>>> if {si:gw} in linkedNodes:
...     print("found node")
#⇒ found node

Common or garden variety Python doesn't have anything quite like that. Treating linkedNodes as a single dict rather than a list of dicts, you'd do it in two steps: first check the key, then the value:
if si in linkedNodes and linkedNodes[si] == gw:
    pass  # do whatever
The and operator short-circuits, so if si is not a key in linkedNodes, linkedNodes[si] == gw is not evaluated; you can't get a KeyError from accessing that element in that case.
If you want to, you could create a dict subclass where in behaves that way (or does so optionally). This I'll leave as an exercise.
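One possible take on that exercise, offered as a sketch and not as the answerer's own solution: a dict subclass whose in operator also accepts a small dict and checks that every key/value pair in it is present. PairDict is a hypothetical name invented for this example.

```python
class PairDict(dict):
    """Hypothetical dict subclass: `in` also accepts a dict of key/value pairs."""

    def __contains__(self, item):
        if isinstance(item, dict):
            # every key/value pair of `item` must be present in self
            return all(
                dict.__contains__(self, key) and self[key] == value
                for key, value in item.items()
            )
        # anything else falls back to the normal key lookup
        return dict.__contains__(self, item)

si = 1
gw = 0
linked_nodes = PairDict({1: 0, 2: 3})
print({si: gw} in linked_nodes)
print({si: 2} in linked_nodes)
print(si in linked_nodes)
```

With a plain dict the first two tests would raise a TypeError, because a dict is not hashable and cannot be looked up as a key.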

Generating search term suggestions with Whoosh?

I've got a set of documents in a Whoosh index, and I want to provide a search term suggestion feature. So if you type "pop", some suggestions that could come up might be:
popcorn
popular
pope
Poplar Film
pop culture
I've got the terms that should be coming up as suggestions going into an NGRAMWORDS field in my index, but when I do a query on that field I get autocompleted results rather than the expanded suggestions - so I get documents tagged with "pop culture", but no way to show that term to the user.
(For comparison, I'd do this in ElasticSearch using a completion mapping on that field and then use the _suggest endpoint to get the suggestions.)
I can only find examples for autocomplete or spelling correction in the documentation or elsewhere on the web. Is there any way I can get search term suggestions from my index with Whoosh?
Edit:
expand_prefix was a much-needed pointer in the right direction. I've ended up using a KEYWORD(commas=True, lowercase=True) for my suggest field, and code like this to get suggestions in most-common-first order (expand_prefix and iter_prefix will yield them in alphabetical order):
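from operator import itemgetter

def get_suggestions(term):
    with ix.reader() as r:
        # (term text, document frequency) for every term starting with the prefix
        suggestions = [(s[0], s[1].doc_frequency()) for s in r.iter_prefix('suggest', term)]
    return sorted(suggestions, key=itemgetter(1), reverse=True)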

Term Frequency Functions
I want to add to the answers here that there is actually a builtin function in whoosh that returns the top 'number' terms by term frequency. It is in the whoosh docs.
whoosh.reading.IndexReader.most_frequent_terms(fieldname, number=5, prefix='')
tf-idf vs. frequency
Also, on the same page of the Whoosh docs, right above the previous function, is a function that returns the most distinctive terms rather than the most frequent. It uses the tf-idf score, which is effective at eliminating common but insignificant words like 'the'. This could be more or less useful depending on what you are looking for. It is appropriately named:
whoosh.reading.IndexReader.most_distinctive_terms(fieldname, number=5, prefix='')
Each of these would be used in this fashion:
with ix.reader() as r:
    print(r.most_frequent_terms('suggestions', number=5, prefix='pop'))
    print(r.most_distinctive_terms('suggestions', number=5, prefix='pop'))
Multi-Word Suggestions
As well, I have had problems with multi-word suggestions. My solution was to create a schema in the following way:
fields.Schema(suggestions=fields.TEXT(),
              suggestion_phrases=fields.KEYWORD(commas=True, lowercase=True))
In the suggestion_phrases field, commas=True allows keywords to be stored with spaces and therefore have multiple words, and lowercase=True ignores capitalization (This can be removed if it is necessary to distinguish between capitalized and non-capitalized terms). Then, in order to get both single and multi-word suggestions, you would run either most_frequent_terms() or most_distinctive_terms() on both fields. Then combine the results.
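The "combine the results" step could look roughly like the sketch below. It assumes each call hands back a list of (frequency, term) tuples, which is how I understand the Whoosh docs describe most_frequent_terms(); the sample numbers are made up for illustration and merge_suggestions is a hypothetical helper, not part of Whoosh.

```python
def merge_suggestions(single_word, phrases, limit=5):
    """Merge two lists of (frequency, term) tuples, most frequent first."""
    best = {}
    for freq, term in list(single_word) + list(phrases):
        # keep the highest frequency seen for each term
        best[term] = max(freq, best.get(term, 0))
    # sort by descending frequency, then alphabetically to break ties
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:limit]]

# Made-up results standing in for the two reader calls.
single = [(12, "popcorn"), (9, "popular"), (4, "pope")]
multi = [(7, "pop culture"), (3, "popular film")]
merged = merge_suggestions(single, multi)
print(merged)
```

Scores from most_distinctive_terms() are on a different scale than raw frequencies, so mixing the two kinds of result in one ranking would probably need some normalization first.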

This is not what you are looking for exactly, but probably can help you:
r = index.reader()
for x in r.expand_prefix('title', 'pop'):
    print(x)
Output example:
pop
popcorn
popular
Update
Another workaround is to build another index with keywords as TEXT only. And play with search language. What I could achieve:
In [12]: list(ix.searcher().search(qp.parse('pop*')))
Out[12]:
[<Hit {'keywords': u'popcorn'}>,
<Hit {'keywords': u'popular'}>,
<Hit {'keywords': u'pope'}>,
<Hit {'keywords': u'Popular Film'}>,
<Hit {'keywords': u'pop culture'}>]

Using Strings to Name Hash Keys?

I'm working through a book called "Head First Programming," and there's a particular part where I'm confused as to why they're doing this.
There doesn't appear to be any reasoning for it, nor any explanation anywhere in the text.
The issue in question is in using multiple-assignment to assign split data from a string into a hash (which doesn't make sense as to why they're using a hash, if you ask me, but that's a separate issue). Here's the example code:
line = "101;Johnny 'wave-boy' Jones;USA;8.32;Fish;21"
s = {}
(s['id'], s['name'], s['country'], s['average'], s['board'], s['age']) = line.split(";")
I understand that this will take the string line and split it up into each named part, but I don't understand why what I think are keys are being named by using a string, when just a few pages prior, they were named like any other variable, without single quotes.
The purpose of the individual parts is to be searched based on an individual element and then printed on screen. For example, being able to search by ID number and then return the entire thing.
The language in question is Python, if that makes any difference. This is rather confusing for me, since I'm trying to learn this stuff on my own.
My personal best guess is that it doesn't make any difference and that it was personal preference on part of the authors, but it bewilders me that they would suddenly change form like that without it having any meaning, and further bothers me that they don't explain it.
EDIT: So I tried printing the id key both with and without single quotes around the name, and it worked perfectly fine, either way. Therefore, I'd have to assume it's a matter of personal preference, but I still would like some info from someone who actually knows what they're doing as to whether it actually makes a difference, in the long run.
EDIT 2: Apparently, it doesn't make any sense as to how my Python interpreter is actually working with what I've given it, so I made a screen capture of it working https://www.youtube.com/watch?v=52GQJEeSwUA

I don't understand why what I think are keys are being named by using a string, when just a few pages prior, they were named like any other variable, without single quotes
The answer is right there. If there's no quote, mydict[s], then s is a variable, and you look up the key in the dict based on what the value of s is.
If it's a string, then you look up literally that key.
So, in your example s[name] won't work as that would try to access the variable name, which is probably not set.
EDIT: So I tried printing the id key both with and without single
quotes around the name, and it worked perfectly fine, either way.
That's just pure luck... There's a built-in function called id:
>>> id
<built-in function id>
Try another name, and you'll see that it won't work.

Actually, as it turns out, for dictionaries (Python's term for hashes) there is a semantic difference between having the quotes there and not.
For example:
s = {}
s['test'] = 1
s['othertest'] = 2
defines a dictionary called s with two keys, 'test' and 'othertest'. However, if I tried to do this instead:
s = {}
s[test] = 1
I'd get a NameError exception, because this would be looking for an undefined variable called test whose value would be used as the key.
If, then, I were to type this into the Python interpreter:
>>> s = {}
>>> s['test'] = 1
>>> s['othertest'] = 2
>>> test = 'othertest'
>>> print(s[test])
2
>>> print(s['test'])
1
you'll see that using test as a key with no quotes uses the value of that variable to look up the associated entry in the dictionary s.
Edit: Now, the REALLY interesting question is why using s[id] gave you what you expected. The name "id" is not a keyword; it refers to a built-in function in Python that gives you a unique id for an object passed as its argument. What in the world the Python interpreter is doing with the expression s[id] is a total mystery to me.
Edit 2: Watching the OP's Youtube video, it's clear that he's staying consistent when assigning and reading the hash about using id or 'id', so there's no issue with the function id as a hash key somehow magically lining up with 'id' as a hash key. That had me kind of worried for a while.
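A short sketch of what is probably going on with s[id]: function objects are hashable, so the built-in id can itself be used as a dictionary key, and it is a different key from the string 'id'. The values stored here are placeholders for illustration.

```python
s = {}
s['id'] = 'stored under the string key'
s[id] = 'stored under the built-in function'  # the function object is the key

print(len(s))    # two separate entries
print(s['id'])
print(s[id])

# An undefined name, by contrast, fails before the dict is even consulted.
try:
    s[undefined_name]
except NameError as exc:
    error = type(exc).__name__
print(error)
```

So writing s[id] consistently for both assignment and lookup works, but only by accident of the name; any field name that is not already defined would raise NameError.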
