I want to use Stanford CoreNLP to extract coreferences and start working on the dependencies of pre-labeled text. I eventually hope to build graph nodes and edges between related named entities. I am working in Python, but using nltk's Java helper functions to call "edu.stanford.nlp.pipeline.StanfordCoreNLP" from the CoreNLP jar directly (which is what nltk does behind the scenes anyway).
My pre-labeled text is in this format:
PRE-LABELED: During his youth, [PERSON: Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16. Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela]. He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety.
What I tried to do is tokenize my sentences myself, building a list of tuples in IOB format: [ ("During","O"), ("his","O"), ("youth","O"), ("Alexander","B-PERSON"), ("III","I-PERSON"), ...]
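For reference, the IOB conversion described above can be sketched roughly as follows. This is only an illustration: the regex-based tokenizer and the names LABEL_RE, TOKEN_RE and to_iob are assumptions made for this sketch, and CoreNLP's own tokenizer may split the text differently.

```python
import re

# Hypothetical helpers for illustration; not part of CoreNLP or nltk.
LABEL_RE = re.compile(r"\[(?P<tag>[A-Za-z]+):\s*(?P<ner>[^\]]+?)\s*\]")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks


def to_iob(text):
    """Turn '[TAG: some name]' markup into a list of (token, IOB-label) tuples."""
    pairs = []
    pos = 0
    for m in LABEL_RE.finditer(text):
        # Unlabeled text before this entity is tagged "O".
        pairs.extend((tok, "O") for tok in TOKEN_RE.findall(text[pos:m.start()]))
        # First token of the entity gets B-, the rest get I-.
        for i, tok in enumerate(TOKEN_RE.findall(m.group("ner"))):
            prefix = "B" if i == 0 else "I"
            pairs.append((tok, "%s-%s" % (prefix, m.group("tag"))))
        pos = m.end()
    # Remaining unlabeled text after the last entity.
    pairs.extend((tok, "O") for tok in TOKEN_RE.findall(text[pos:]))
    return pairs


sample = ("During his youth, [PERSON: Alexander III of Macedon] was tutored by "
          "[PERSON: Aristotle] until age 16.")
iob = to_iob(sample)
print(iob[:8])
```

Note that punctuation becomes its own token here, so the comma after "youth" shows up as (",", "O").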
However, I can't figure out how to tell CoreNLP to take this tuple list as a starting point, building additional Named Entities that weren't initially labeled and finding coreferences on these new, higher-quality tokenized sentences. I obviously tried simply stripping out my labels and letting CoreNLP do this by itself, but CoreNLP is just not as good at finding the Named Entities as the human-tagged pre-labeled text.
I need an output as below. I understand that it will be difficult to use Dependencies to get Edges in this way, but I need to see how far I can get.
DESIRED OUTPUT:
[Person 1]:
Name: Alexander III of Macedon
Mentions:
* "Alexander III of Macedon"; Sent1 [4,5,6,7] # List of tokens
* "Alexander"; Sent2 [6]
* "He"; Sent3 [1]
Edges:
* "Person 2"; "tutored by"; "Aristotle"
[Person 2]:
Name: Aristotle
[....]
How can I feed CoreNLP some pre-identified Named Entities, and still get help with additional Named Entities, with Coreference, and with Basic Dependencies?
P.S. Note that this is not a duplicate of NLTK Named Entity Recognition with Custom Data. I'm not trying to train a new classifier with my pre-labeled NER, I'm only trying to add CoreNLP's to my own when running coreference (including mentions) and dependencies on a given sentence.
The answer is to make a Rules file with Additional TokensRegexNER Rules.
I used a regex to group out the labeled names. From this I built a rules tempfile which I passed to the corenlp jar with -ner.additional.regexner.mapping mytemprulesfile.
Alexander III of Macedon PERSON PERSON,LOCATION,ORGANIZATION,MISC
Aristotle PERSON PERSON,LOCATION,ORGANIZATION,MISC
Anatolia LOCATION PERSON,LOCATION,ORGANIZATION,MISC
Alexander PERSON PERSON,LOCATION,ORGANIZATION,MISC
Persia LOCATION PERSON,LOCATION,ORGANIZATION,MISC
Issus LOCATION PERSON,LOCATION,ORGANIZATION,MISC
Gaugamela LOCATION PERSON,LOCATION,ORGANIZATION,MISC
Persian King Darius III PERSON PERSON,LOCATION,ORGANIZATION,MISC
Achaemenid Empire ORGANIZATION PERSON,LOCATION,ORGANIZATION,MISC
I have aligned this list for readability, but these are tab-separated values.
An interesting finding is that some multi-word pre-labeled entities stay multi-word as originally labeled, whereas running corenlp without the rules files will sometimes split these tokens into separate entities.
I had wanted to specifically identify the named-entity tokens, figuring it would make coreferences easier, but I guess this will do for now. How often are entity names identical but unrelated within one document, anyway?
Example (execution takes ~70secs)
import os, re, tempfile, json, nltk, pprint
from subprocess import PIPE
from nltk.internals import (
    find_jar_iter,
    config_java,
    java,
    _java_options,
    find_jars_within_path,
)
def ExtractLabeledEntitiesByRegex( text, regex ):
    # Collect (entity, tag, matched text) triples and strip the [TAG: ...] markup.
    rgx = re.compile(regex)
    nelist = []
    for mobj in rgx.finditer( text ):
        ne = mobj.group('ner')
        try:
            tag = mobj.group('tag')
        except IndexError:
            tag = 'PERSON'
        mstr = text[mobj.start():mobj.end()]
        nelist.append( (ne,tag,mstr) )
    # Raw string, so that \g<ner> is not read as a string escape.
    cleantext = rgx.sub(r"\g<ner>", text)
    return (nelist, cleantext)
def GenerateTokensNERRules( nelist ):
    # One tab-separated rule per entity: text, tag, overwritable tags.
    rules = ""
    for ne in nelist:
        rules += ne[0]+'\t'+ne[1]+'\tPERSON,LOCATION,ORGANIZATION,MISC\n'
    return rules
def GetEntities( origtext ):
    nelist, cleantext = ExtractLabeledEntitiesByRegex( origtext, r'(\[(?P<tag>[a-zA-Z]+)\:\s*)(?P<ner>(\s*\w)+)(\s*\])' )
    # Write the cleaned text and the generated rules to temp files for the jar.
    origfile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    origfile.write( cleantext.encode('utf-8') )
    origfile.flush()
    origfile.seek(0)
    nerrulefile = tempfile.NamedTemporaryFile(mode='r+b', delete=False)
    nerrulefile.write( GenerateTokensNERRules(nelist).encode('utf-8') )
    nerrulefile.flush()
    nerrulefile.seek(0)
    java_options='-mx4g'
    config_java(options=java_options, verbose=True)
    stanford_jar = '../stanford-corenlp-full-2018-10-05/stanford-corenlp-3.9.2.jar'
    stanford_dir = os.path.split(stanford_jar)[0]
    _classpath = tuple(find_jars_within_path(stanford_dir))
    cmd = ['edu.stanford.nlp.pipeline.StanfordCoreNLP',
        '-annotators','tokenize,ssplit,pos,lemma,ner,parse,coref,coref.mention,depparse,natlog,openie,relation',
        '-ner.combinationMode','HIGH_RECALL',
        '-ner.additional.regexner.mapping',nerrulefile.name,
        '-coref.algorithm','neural',
        '-outputFormat','json',
        '-file',origfile.name
        ]
    stdout, stderr = java( cmd, classpath=_classpath, stdout=PIPE, stderr=PIPE ) # Couldn't get working- stdin=textfile
    # Show what the pipeline printed (my PrintJavaOutput helper is not included in this snippet).
    print( stdout.decode('utf-8', errors='replace') )
    print( stderr.decode('utf-8', errors='replace') )
    # CoreNLP writes <input file name>.json into the current directory.
    jsonfilename = os.path.basename( origfile.name ) + '.json'
    origfile.close()
    nerrulefile.close()
    os.unlink( origfile.name )
    os.unlink( nerrulefile.name )
    with open( jsonfilename ) as jsonfile:
        jsondata = json.load(jsonfile)
    currentid = 0
    entities = []
    for sent in jsondata['sentences']:
        for thisentity in sent['entitymentions']:
            tag = thisentity['ner']
            if tag == 'PERSON' or tag == 'LOCATION' or tag == 'ORGANIZATION':
                entity = {
                    'id':currentid,
                    'label':thisentity['text'],
                    'tag':tag
                }
                entities.append( entity )
                currentid += 1
    return entities
#### RUN ####
corpustext = "During his youth, [PERSON:Alexander III of Macedon] was tutored by [PERSON: Aristotle] until age 16. Following the conquest of [LOCATION: Anatolia], [PERSON: Alexander] broke the power of [LOCATION: Persia] in a series of decisive battles, most notably the battles of [LOCATION: Issus] and [LOCATION: Gaugamela]. He subsequently overthrew [PERSON: Persian King Darius III] and conquered the [ORGANIZATION: Achaemenid Empire] in its entirety."
entities = GetEntities( corpustext )
for thisent in entities:
    pprint.pprint( thisent )
Output
{'id': 0, 'label': 'Alexander III of Macedon', 'tag': 'PERSON'}
{'id': 1, 'label': 'Aristotle', 'tag': 'PERSON'}
{'id': 2, 'label': 'his', 'tag': 'PERSON'}
{'id': 3, 'label': 'Anatolia', 'tag': 'LOCATION'}
{'id': 4, 'label': 'Alexander', 'tag': 'PERSON'}
{'id': 5, 'label': 'Persia', 'tag': 'LOCATION'}
{'id': 6, 'label': 'Issus', 'tag': 'LOCATION'}
{'id': 7, 'label': 'Gaugamela', 'tag': 'LOCATION'}
{'id': 8, 'label': 'Persian King Darius III', 'tag': 'PERSON'}
{'id': 9, 'label': 'Achaemenid Empire', 'tag': 'ORGANIZATION'}
{'id': 10, 'label': 'He', 'tag': 'PERSON'}
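To get closer to the "Mentions" part of the desired output, the coreference chains in the same JSON file could be grouped along these lines. This is a rough sketch and not something I have run against the real output: it assumes the JSON has a top-level "corefs" section whose mentions carry "text", "sentNum", "startIndex", "endIndex" and "isRepresentativeMention" fields, and the sample dictionary below is hand-written for illustration, not produced by CoreNLP.

```python
def group_coref_mentions(jsondata):
    """Return one dict per coreference chain: its name and its mentions."""
    chains = []
    for chain_id, mentions in jsondata.get("corefs", {}).items():
        # Use the representative mention as the chain's name, if one is flagged.
        representative = next(
            (m for m in mentions if m.get("isRepresentativeMention")), mentions[0]
        )
        chains.append({
            "chain": chain_id,
            "name": representative["text"],
            # (text, sentence number, token indices); endIndex is assumed exclusive.
            "mentions": [
                (m["text"], m["sentNum"], list(range(m["startIndex"], m["endIndex"])))
                for m in mentions
            ],
        })
    return chains


# Hand-written sample in the assumed shape; NOT real CoreNLP output.
sample = {"corefs": {"1": [
    {"text": "Alexander III of Macedon", "sentNum": 1, "startIndex": 5,
     "endIndex": 9, "isRepresentativeMention": True},
    {"text": "Alexander", "sentNum": 2, "startIndex": 7,
     "endIndex": 8, "isRepresentativeMention": False},
    {"text": "He", "sentNum": 3, "startIndex": 1,
     "endIndex": 2, "isRepresentativeMention": False},
]}}

chains = group_coref_mentions(sample)
for chain in chains:
    print(chain["name"], chain["mentions"])
```

Inside GetEntities, this would be called on jsondata right after json.load, and the chain names could then be matched against the entity labels collected above.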
I am trying out a simple text-matching task: I scraped the titles of blog posts and want to match each one to my pre-defined categories whenever it contains specific keywords.
So for example, the title of the blog post is
"Capture Perfect Night Shots with the Oppo Reno8 Series"
Once I ensure that "Oppo" is included in my categories, "Oppo" should match with my "phone" category like so:
categories = {"phone" : ['apple', 'oppo', 'xiaomi', 'samsung', 'huawei', 'nokia'],
"postpaid" : ['signature', 'postpaid'],
"prepaid" : ['power all', 'giga'],
"sku" : ['data', 'smart bro'],
"ewallet" : ['gigapay'],
"event" : ['gigafest'],
"software" : ['ios', 'android', 'macos', 'windows'],
"subculture" : ['anime', 'korean', 'kpop', 'gaming', 'pop', 'culture', 'lgbtq', 'binge', 'netflix', 'games', 'ml', 'apple music'],
"health" : ['workout', 'workouts', 'exercise', 'exercises'],
"crypto" : ['axie', 'bitcoin', 'coin', 'crypto', 'cryptocurrency', 'nft'],
"virtual" : ['metaverse', 'virtual']}
Then my dataframe would look like this
Fortunately, I found a reference on how to use regex to map into nested dictionaries, but it doesn't seem to work past the first couple of words.
So once I use the code
def put_category(cats, text):
    regex = re.compile("(%s)" % "|".join(map(re.escape, categories.keys())))
    if regex.search(text):
        ret = regex.search(text)
        return ret[0]
    else:
        return 'general'
It usually falls back to "general" as the category, even when the title is lowercased first.
I'd prefer to use the current method of inputting values inside the dictionary for this matching activity instead of running pure regex patterns and then putting it through fuzzy matching for the result.
You can create a reverse mapping that maps keywords to categories instead, so that you can efficiently return the corresponding category when a match is found:
mapping = {keyword: category for category, keywords in categories.items() for keyword in keywords}
def put_category(mapping, text):
    match = re.search(rf'\b(?:{"|".join(map(re.escape, mapping))})\b', text, re.I)
    if match:
        return mapping[match[0].lower()]
    return 'general'
print(put_category(mapping, "Capture Perfect Night Shots with the Oppo Reno8 Series"))
This outputs:
phone
Demo: https://replit.com/#blhsing/BlandAdoredParser
In this case, you are matching exact words, and not patterns. You can do it without regular expressions.
Going back to your example:
import pandas as pd
CAT_DICT = {"phone" : ['apple', 'oppo', 'xiaomi', 'samsung', 'huawei', 'nokia'],
"postpaid" : ['signature', 'postpaid'],
"prepaid" : ['power all', 'giga'],
"sku" : ['data', 'smart bro'],
"ewallet" : ['gigapay'],
"event" : ['gigafest'],
"software" : ['ios', 'android', 'macos', 'windows'],
"subculture" : ['anime', 'korean', 'kpop', 'gaming', 'pop', 'culture', 'lgbtq', 'binge', 'netflix', 'games', 'ml', 'apple music'],
"health" : ['workout', 'workouts', 'exercise', 'exercises'],
"crypto" : ['axie', 'bitcoin', 'coin', 'crypto', 'cryptocurrency', 'nft'],
"virtual" : ['metaverse', 'virtual']}
df = pd.DataFrame({"title": [
"Capture Perfect Night Shots with the Oppo Reno8 Series",
"Personal is Powerful: Why Apple's iOS 16 is the Smartest update"
]})
You can define this function to assign categories to each title:
def assign_cat(title: str, cat_dict: dict[str, list[str]]) -> list[str]:
    title_low = title.lower()
    categories = list()
    for c,words in cat_dict.items():
        if any([w in title_low for w in words]):
            categories.append(c)
    if len(categories) == 0:
        categories.append("general")
    return categories
The key part is here: any([w in title_low for w in words]). For each word in your category, you are checking if it is present in the title (lowercase). And if ANY of the words is present, you associate the category to it.
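Applying assign_cat to the two titles above gives ['phone'] for the first and ['phone', 'software'] for the second.
The advantage of this approach is that a title can have multiple categories assigned to it (see the 2nd title).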
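One caveat with plain substring checks is that short keywords can match inside longer words ("ios" inside "curiosity", "pop" inside "popular"). If that matters for your titles, a variant with word boundaries could look like the sketch below. The trimmed-down CAT_DICT and the names build_patterns and assign_cat_whole_words are made up for this illustration.

```python
import re

# Trimmed-down category dictionary, for illustration only.
CAT_DICT = {
    "phone": ["apple", "oppo"],
    "software": ["ios", "android"],
    "subculture": ["pop", "apple music"],
}


def build_patterns(cat_dict):
    """Compile one case-insensitive whole-word pattern per category."""
    return {
        cat: re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, words)), re.I)
        for cat, words in cat_dict.items()
    }


def assign_cat_whole_words(title, patterns):
    """Return every category whose keywords occur as whole words in the title."""
    cats = [cat for cat, rx in patterns.items() if rx.search(title)]
    return cats or ["general"]


patterns = build_patterns(CAT_DICT)
title_a = "Personal is Powerful: Why Apple's iOS 16 is the Smartest update"
title_b = "Curiosity about popular gadgets"
print(assign_cat_whole_words(title_a, patterns))  # ['phone', 'software']
print(assign_cat_whole_words(title_b, patterns))  # ['general']
```

The second title would be tagged "software" and "subculture" by the substring version, which is the kind of false positive the word boundaries avoid.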
I am trying to extract a specific field, "Engineering Lead", and its corresponding value from the JSON text. However, when I try to extract it directly from the JSON, it throws the KeyError shown under code 1. Since that did not work, I decided to loop over the items to fetch the key Engineering Lead and its value, but it still throws the same error. Any help would be appreciated.
json text:
{'expand': 'renderedFields,names,schema,operations,editmeta,changelog,versionedRepresentations', 'id': '11659640', 'self': '/rest/api/2/issue/11659640', 'key': 'TOOLSTEST-2651', 'fields': {'description': 'h2. Main\r\n * *:*\r\n * *Application ISO:*\xa0Tony Zeinoun\r\n * *Engineering Lead:*\xa0Peter james\r\n * *Application Architect:*\xa0John david\r\n * *Divisional Architect:*\xa0Robert denuvit'}}
code 1:
engLeadDetails = data_load['fields']['* \*Engineering Lead']
Code 2:
engLeadDetails = data_load['fields']
for k,v in engLeadDetails.items():
    if (k == '* \*Engineering Lead'):
        print (v)
Error:
Traceback (most recent call last):
File "/Users/peter/abc.py", line 32, in <module>
engLeadDetails = data_load['fields']['* *Engineering Lead']
KeyError: '* *Engineering Lead'
I think python can't find such key because of some missing quotes. Please check the json text once more. It seems like * *Engineering Lead is currently a part of a bigger string, but not a key (due to missing quotes).
KeyError means that the key * \*Engineering Lead doesn't exist in the dictionary.
It appears the delimiter of the description (where your EngLead is stored) is \r\n.
Using this we can split the description to get each role.
job_details = data_load["fields"]["description"].split("\r\n")
Removing the lines that don't describe a role (the "h2. Main" header and the empty "* *:*" entry), this leaves us with
job_details = [
"* *Application ISO:* Tony Zeinoun",
"* *Engineering Lead:* Peter james",
"* *Application Architect:* John david",
"* *Divisional Architect:* Robert denuvit",
]
I am assuming you want the name of the person in each position.
Now we remove arbitrary characters from each string.
job_dict = {}
for s in job_details:
    s = s.replace("*","").strip()
    job, person = s.split(":")
    job_dict[job] = person.strip()
job_dict is now clean, with easy key access to each job.
Resultant Dict:
{
'Application Architect': 'John david',
'Application ISO': 'Tony Zeinoun',
'Divisional Architect': 'Robert denuvit',
'Engineering Lead': 'Peter james'
}
print(job_dict["Engineering Lead"]) # Peter james
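If you'd rather not remove the non-role lines by hand, the same idea can skip them automatically. Here is a small sketch that runs on the description string from the question; parse_roles is just a name I picked for the illustration.

```python
# The description value copied from the question's JSON text.
description = ('h2. Main\r\n * *:*\r\n * *Application ISO:*\xa0Tony Zeinoun\r\n'
               ' * *Engineering Lead:*\xa0Peter james\r\n'
               ' * *Application Architect:*\xa0John david\r\n'
               ' * *Divisional Architect:*\xa0Robert denuvit')


def parse_roles(description):
    """Map each 'role: person' line to a dict entry, skipping other lines."""
    roles = {}
    for line in description.split("\r\n"):
        # partition never raises; sep is '' when the line has no colon.
        label, sep, value = line.partition(":")
        label = label.replace("*", "").strip()
        value = value.replace("*", "").strip()  # strip() also removes \xa0
        if sep and label and value:
            roles[label] = value
    return roles


roles = parse_roles(description)
print(roles["Engineering Lead"])  # Peter james
```

Using partition instead of split(":") means the "h2. Main" header (no colon) and the empty "* *:*" entry are simply ignored rather than raising an error.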
You can convert the JSON string description into a dictionary by splitting on \r\n sequence then break the role and name into key/values and add to a dictionary.
The expression \W* in regexp below will strip off the non-alphanumeric prefix off the roles and names; e.g., "** Application ISO" => "Application ISO", etc.
Try something like this:
import re
data = {}
for s in data_load['fields']['description'].split('\r\n'):
    if m := re.search(r'^\W*(.*?):\W*(.+)', s):
        if label := m.group(1):
            data[label] = m.group(2)
print(data)
Output:
{'Application ISO': 'Tony Zeinoun', 'Engineering Lead': 'Peter james', 'Application Architect': 'John david', 'Divisional Architect': 'Robert denuvit'}
Then you can grab a particular role/person out:
print(">>", data.get("Engineering Lead"))
Outputs:
>> Peter james
I got a list in Python with Twitter user information and exported it with Pandas to an Excel file.
One row is one Twitter user with nearly all information of the user (name, #-tag, location etc.)
Here is my code to create the list and fill it with the user data:
def get_usernames(userids, api):
    fullusers = []
    u_count = len(userids)
    try:
        for i in range(int(u_count/100) + 1):
            end_loc = min((i + 1) * 100, u_count)
            fullusers.extend(
                api.lookup_users(user_ids=userids[i * 100:end_loc])
            )
        print('\n' + 'Done! We found ' + str(len(fullusers)) + ' follower in total for this account.' + '\n')
        return fullusers
    except:
        import traceback
        traceback.print_exc()
        print ('Something went wrong, quitting...')
The only problem is that every row is a JSON object and therefore one long comma-separated string. I would like to create headers (no problem with Pandas) and only write parts of the string (i.e. ID or name) to columns.
Here is an example of a row from my output.xlsx:
User(_api=<tweepy.api.API object at 0x16898928>, _json={'id': 12345, 'id_str': '12345', 'name': 'Jane Doe', 'screen_name': 'jdoe', 'location': 'Nirvana, NI', 'description': 'Just some random descrition')
I have two ideas, but I don't know how to realize them due to my lack of skills and experience with Python.
Create a loop which saves certain parts ('id', 'name' etc.) from the JSON string in columns.
Cut off the User(_api=<tweepy.api.API object at 0x16898928>, _json={ at the beginning and ) at the end, so that I may export the file as CSV.
Could anyone help me out with one of my two solutions or suggest a "simple" way to do this?
fyi: I want to do this to gather data for my thesis.
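Try the Python json library. Note that json.loads needs valid JSON (double-quoted strings and a closing brace); the single-quoted text in your spreadsheet is a Python repr, for which ast.literal_eval would be the matching parser:
import json
jsonstring = '{"id": 12345, "id_str": "12345", "name": "Jane Doe", "screen_name": "jdoe", "location": "Nirvana, NI", "description": "Just some random descrition"}'
jsondict = json.loads(jsonstring)
# type(jsondict) == dict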
Now you can just extract the data you want from it:
id = jsondict["id"]
name = jsondict["name"]
newdict = {"id":id,"name":name}
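Since each row prints as User(..., _json={...}), the objects in your list already seem to carry the parsed data as a dictionary in their _json attribute, so it may be simpler to pick the fields before exporting instead of parsing strings afterwards. A minimal sketch of that idea follows; the SimpleNamespace objects only stand in for the real tweepy users, and users_to_rows is a name invented for this example.

```python
import csv
import io
from types import SimpleNamespace


def users_to_rows(users, fields=("id", "name", "screen_name", "location")):
    """Pick the wanted fields out of each user's _json dictionary."""
    return [{f: u._json.get(f, "") for f in fields} for u in users]


# Stand-in for the list returned by get_usernames(); not a real tweepy object.
fake_users = [SimpleNamespace(_json={
    'id': 12345, 'id_str': '12345', 'name': 'Jane Doe', 'screen_name': 'jdoe',
    'location': 'Nirvana, NI', 'description': 'Just some random descrition',
})]

rows = users_to_rows(fake_users)

# Write the rows as CSV with a header line (to a string buffer here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same list of dictionaries can also be passed to pandas.DataFrame(rows) if you want to keep exporting to Excel, with the dictionary keys becoming the column headers.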
As an alternative to accomplishing this: Patterns with multi-terms entries in the IN attribute
I wrote the following code to match phrases, label them, and then use them in EntityRuler patterns:
# %%
import spacy
from spacy.matcher import PhraseMatcher
from spacy.pipeline import EntityRuler
from spacy.tokens import Span
class PhraseRuler(object):
    name = 'phrase_ruler'
    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)
    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc
nlp = spacy.load("en_core_web_lg")
entity_matcher = PhraseRuler(nlp, ["Best Wishes", "Warm Welcome"], "GREETING")
nlp.add_pipe(entity_matcher, before="ner")
ruler = EntityRuler(nlp)
patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"ENT_TYPE": "GREETING"}]}]
ruler.add_patterns(patterns)
#ruler.to_disk("./data/patterns.jsonl")
nlp.add_pipe(ruler)
print(nlp.pipe_names)
doc = nlp("Mary said Best Wishes and I said super Warm Welcome.")
print(doc.to_json())
Unfortunately this does not work as it does not return my SUPER_GREETING:
'ents': [
{'start': 0, 'end': 4, 'label': 'PERSON'},
{'start': 10, 'end': 21, 'label': 'GREETING'},
{'start': 39, 'end': 51, 'label': 'GREETING'}
]
What am I doing wrong? How do I fix it?
You have the right idea, but the problem here is an intrinsic design choice in spaCy that any token can only be part of one named entity. So you can't have "Warm Welcome" being both a "GREETING" as well as part of a "SUPER_GREETING".
One way you could work around this is by using custom extensions. For instance, one solution would be to store the GREETING bit on the token level:
Token.set_extension("mylabel", default="")
And then we adjust the PhraseRuler.__call__ so that it doesn't write to doc.ents but instead does this:
for token in span:
token._.mylabel = "MY_GREETING"
Now, we can rewrite the SUPER_GREETING pattern to:
patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"_": {"mylabel": "MY_GREETING"}, "OP": "+"}]}]
which will match "super" followed by one or more "MY_GREETING" tokens. It will match greedily and output "super Warm Welcome" as hit.
Here's the resulting code snippet, starting from your code and making the adjustments as described:
from spacy.tokens import Token  # needed in addition to the imports in your code
Token.set_extension("mylabel", default="")
class PhraseRuler(object):
    name = 'phrase_ruler'
    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)
    def __call__(self, doc):
        matches = self.matcher(doc)
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            # Mark each token instead of writing the span to doc.ents
            for token in span:
                token._.mylabel = "MY_GREETING"
        return doc
nlp = spacy.load("en_core_web_lg")
entity_matcher = PhraseRuler(nlp, ["Best Wishes", "Warm Welcome"], "GREETING")
nlp.add_pipe(entity_matcher, name="entity_matcher", before="ner")
ruler = EntityRuler(nlp)
patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"_": {"mylabel": "MY_GREETING"}, "OP": "+"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, after="entity_matcher")
print(nlp.pipe_names)
doc = nlp("Mary said Best Wishes and I said super Warm Welcome.")
print("TOKENS:")
for token in doc:
    print(token.text, token._.mylabel)
print()
print("ENTITIES:")
for ent in doc.ents:
    print(ent.text, ent.label_)
Which outputs
TOKENS:
Mary
said
Best MY_GREETING
Wishes MY_GREETING
and
I
said
super
Warm MY_GREETING
Welcome MY_GREETING
.
ENTITIES:
Mary PERSON
super Warm Welcome SUPER_GREETING
This may not be exactly what you need/want - but I hope it helps you move forward with an alternative solution for your specific use-case. If you do want the normal "GREETING" spans in the final doc.ents, maybe you can reassemble them in post-processing, after the EntityRuler has run, e.g. by moving the custom attributes to doc.ents if they don't overlap, or by keeping a cache of the spans somewhere.
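To make the post-processing idea a bit more concrete, here is a library-free sketch of the merge step. It works on plain (start, end, label) token-offset tuples rather than spaCy Span objects, the function name merge_spans is invented for this example, and the offsets below are written by hand from the example sentence. The cached GREETING spans are only added where they don't overlap an entity that is already there.

```python
def merge_spans(primary, secondary):
    """Keep all primary spans, plus secondary spans that overlap none of them.

    Spans are (start, end, label) tuples with token offsets, end exclusive.
    """
    taken = set()
    for start, end, _ in primary:
        taken.update(range(start, end))
    merged = list(primary)
    for start, end, label in secondary:
        # Only add the span if none of its tokens are already covered.
        if taken.isdisjoint(range(start, end)):
            merged.append((start, end, label))
            taken.update(range(start, end))
    return sorted(merged)


# Token offsets for "Mary said Best Wishes and I said super Warm Welcome."
final_ents = [(0, 1, "PERSON"), (7, 10, "SUPER_GREETING")]
cached_greetings = [(2, 4, "GREETING"), (8, 10, "GREETING")]

merged = merge_spans(final_ents, cached_greetings)
print(merged)
```

"Best Wishes" is added back as a GREETING, while "Warm Welcome" is dropped because its tokens already belong to the SUPER_GREETING. In spaCy itself, the resulting tuples would then have to be turned back into Span objects before assigning them to doc.ents.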
I have created a script which scrapes many pdfs for abstract and keywords. I also have a collection of bibtex-files in which I want to place the texts I've extracted. What I'm looking for is a way of adding elements to the bibtex files.
I have written a short parser:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from pybtex.database.input import bibtex
dir_path = "nime_archive/nime/bibtex/"
num_texts = 0
class Bibfile:
    def __init__(self,bibs):
        global num_texts
        self.bibs = bibs
        for a in self.bibs.entries.keys():
            num_texts += 1
            print(bibs.entries[a].fields['title'])
            #Need to implement a way of getting just the nime-identificator
            try:
                print(bibs.entries[a].fields['url'])
            except KeyError:
                print("couldn't find URL for text: %s " % a)
        print("creating new bibfile")
bibfiles = []
parser = bibtex.Parser()
for infile in os.listdir(dir_path):
    if infile.endswith(".bib"):
        print(infile)
        bibfiles.append(Bibfile(parser.parse_file(dir_path+infile)))
My question is whether it is possible to use Pybtex to add elements to the existing bibtex files (or create a copy) so I can merge my extractions with what is already available. If this is not possible in Pybtex, what other bibtex parser can I use?
I've never used pybtex, but from a quick glance, you can add entries. Since self.bibs.entries appears to be a dict, you can come up with a unique key, and add more entries to it. Example:
key = "some_unique_string"
new_entry = Entry('article',
fields={
'language': u'english',
'title': u'Predicting the Diffusion Coefficient in Supercritical Fluids',
'journal': u'Ind. Eng. Chem. Res.',
'volume': u'36',
'year': u'1997',
'pages': u'888-895',
},
persons={'author': [Person(u'Liu, Hongquin'), Person(u'Ruckenstein, Eli')]},
)
self.bibs.entries[key] = new_entry
(caveat: untested)
If you wonder where I got this example from: have a look in the tests/ subdirectory of the source of pybtex. I got the above code example mainly from tests/database_test/data.py. Tests can be a good source of documentation if the actual documentation is lacking.
.data.add_entry(key, entry) works for me. Here I used an entry manually created (taken from Evert's example) but you can copy an existing entry from another bib that you're also parsing.
from pybtex.database.input.bibtex import Parser
from pybtex.database import Entry, Person  # older releases exposed these as pybtex.core
key = "some_unique_string"
new_entry = Entry('article',
fields={
'language': u'english',
'title': u'Predicting the Diffusion Coefficient in Supercritical Fluids',
'journal': u'Ind. Eng. Chem. Res.',
'volume': u'36',
'year': u'1997',
'pages': u'888-895',
},
persons={'author': [Person(u'Liu, Hongquin'), Person(u'Ruckenstein, Eli')]},
)
newbib_parser = Parser()
newbib_parser.data.add_entry(key, new_entry)
print(newbib_parser.data)
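For the merging part of the question (attaching the scraped abstract and keywords to entries that already exist, rather than adding whole new entries), the logic could look something like the sketch below. It uses plain dictionaries as a stand-in for the parsed entries and their fields, so nothing here is pybtex API; the keys, field values and the function name merge_extracted_fields are all hypothetical. With pybtex, the inner dictionary would presumably correspond to bibs.entries[key].fields, which the question's code already reads from.

```python
def merge_extracted_fields(entries, extracted, overwrite=False):
    """Add extracted fields to matching entries; return keys with no match."""
    unmatched = []
    for key, new_fields in extracted.items():
        fields = entries.get(key)
        if fields is None:
            unmatched.append(key)
            continue
        for name, value in new_fields.items():
            # Keep existing values unless overwriting was requested.
            if overwrite or name not in fields:
                fields[name] = value
    return unmatched


# Hypothetical data: entry key -> fields, standing in for parsed .bib entries.
entries = {
    "nime2010_001": {"title": "A Hypothetical Title",
                     "url": "http://example.org/001.pdf"},
}
# Hypothetical scraper output, keyed by the same identifiers.
extracted = {
    "nime2010_001": {"abstract": "Text scraped from the PDF.",
                     "keywords": "sensors, mapping"},
    "nime2010_999": {"abstract": "No matching entry."},
}

unmatched = merge_extracted_fields(entries, extracted)
print(entries["nime2010_001"])
print(unmatched)
```

Returning the unmatched keys makes it easy to see which scraped PDFs could not be tied to a bibtex entry, which is where the "nime-identificator" mentioned in the question would come in.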