I created a csv file like this:
"CAMERA", "Camera", "kamera", "cam", "Kamera"
"PICTURE", "Picture", "bild", "photograph"
and used it somewhat like this:
import de_core_news_sm
from spacy.kb import KnowledgeBase

nlp = de_core_news_sm.load()
text = "Cam is not good"
doc = nlp(text)
name_dict, desc_dict = load_entities()
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=96)
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)  # 342 is an arbitrary value here
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1])  # 100% prior probability P(entity|alias)
Printing values like this:
print(f"Entities in the KB: {kb.get_entity_strings()}")
print(f"Aliases in the KB: {kb.get_alias_strings()}")
gives me:
Entities in the KB: ['PICTURE', 'CAMERA']
Aliases in the KB: [' "Camera"', ' "Picture"']
However, if I try to check for candidates, I only get an empty list:
candidates = kb.get_candidates("Camera")
print(candidates)
for c in candidates:
    print(" ", c.entity_, c.prior_prob, c.entity_vector)
Aliases in the KB: [' "Camera"', ' "Picture"']
It looks to me as if your parsing script added the literal string "Camera", with spaces and quotes and all, to the KB, instead of just the raw string Camera?
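One way to avoid this, as a minimal sketch: read the file with Python's csv module and skipinitialspace=True, so the space after each comma is dropped and the double quotes are treated as field quoting rather than data. The function name load_aliases and the inline csv_text are placeholders for illustration; they are not the load_entities() from the question.

```python
import csv
import io

# Same two rows as in the question; in practice this would be an open file.
csv_text = (
    '"CAMERA", "Camera", "kamera", "cam", "Kamera"\n'
    '"PICTURE", "Picture", "bild", "photograph"\n'
)

def load_aliases(handle):
    """Map each entity ID (first column) to its list of alias strings."""
    aliases = {}
    # skipinitialspace=True ignores the blank after each comma, so the
    # following quote is parsed as a quote character, not as field content.
    for row in csv.reader(handle, skipinitialspace=True):
        if not row:
            continue
        qid, *names = [cell.strip() for cell in row]
        aliases[qid] = names
    return aliases

aliases = load_aliases(io.StringIO(csv_text))
print(aliases["CAMERA"])
```

Each cleaned alias could then be passed to kb.add_alias as in the question, and kb.get_candidates("Camera") would be looking up the same raw string that was stored.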
I wrote this search code and want everything between double quotes to be stored as a single item in the list. How can I do that? I get three lists, but the second one (should) is not what I want.
import re
message='read read read'
others = ' '.join(re.split('\(.*\)', message))
others_split = others.split()
to_compile = re.compile('.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
should = ors_string.split(' ')
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
Output:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly', 'needed"', 'empty']
must_not: ['russia', '"destination good"']
Wanted result:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly needed"', 'empty'] <---
must_not: ['russia', '"destination good"']
I also get an error when I edit the message. How can I handle it?
Traceback (most recent call last):
ors_string = to_match.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Your should list is split on whitespace (should = ors_string.split(' ')), which is why the quoted phrase is broken up in the list. The following code gives you the output you requested, but I'm not sure that it solves your problem for future inputs.
import re
message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
others = ' '.join(re.split('\(.*\)', message))
others_split = others.split()
to_compile = re.compile('.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
# Split on OR instead of whitespace.
should = ors_string.split('OR')
to_remove_or = "OR"
while to_remove_or in should:
    should.remove(to_remove_or)
# Remove the leading and trailing whitespace that is left after the split.
should = [word.strip() for word in should]
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
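The AttributeError from the question appears when the message contains no parentheses, because match() then returns None. A minimal sketch of a guard, assuming an empty should list is the desired result in that case (the helper name parse_should is made up for illustration):

```python
import re

def parse_should(message):
    """Return the OR-separated terms inside the (...) group, or [] if there is none."""
    match = re.search(r'\((.*)\)', message)
    if match is None:
        # No parenthesised group in the message, so there are no "should" terms.
        return []
    # Split on the standalone word OR, then drop surrounding whitespace and empties.
    parts = re.split(r'\bOR\b', match.group(1))
    return [part.strip() for part in parts if part.strip()]

print(parse_should('read "find find":within("exactly needed" OR empty) "plane" -russia'))
print(parse_should('read read read'))
```

Splitting on the word boundary pattern rather than the bare substring 'OR' avoids cutting words that merely contain those letters.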
import urllib.request

url = "https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt"
data = urllib.request.urlopen(url)
list_line = [str(x) for x in data]
for line in list_line:
    line.replace("b'", "")
    line.replace("\\n", "")
    line.replace("\\t", "")
print(list_line)
It is generating list like this:
["b'-----BEGIN PRIVACY-ENHANCED MESSAGE-----\n'", "b'Proc-Type: 2001,MIC-CLEAR\n'", "b'Originator-Name: webmaster@www.sec.gov\n'", "b'Originator-Key-Asymmetric:\n'", "b' MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen\n'", "b' TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB\n'", "b'MIC-Info: RSA-MD5,RSA,\n'", "b' EvPdKfnjzBIjWkEk2RgNCk1/52qXomHpN+LDwL/XTT/XBuAzk70AYYrsxlQbyiqr\n'", "b' V5559QRyTgPe9PfVt0db9Q==\n'", "b'\n'", "b'0000950170-98-000413.txt : 19980309\n'", "b'0000950170-98-000413.hdr.sgml : 19980309\n'"] <----sample
I want to remove the b' prefix, \n and \t. String split and replace are not working. How can I do it?
Rather than trying to replace things, decode the data as utf-8 to get the resulting text:
import urllib.request
url = "https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt"
data = urllib.request.urlopen(url).read()
text = data.decode('utf-8')
text = text.replace('\t', '') # Remove tabs if still needed
print(text)
This would show the start of the text as:
-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
EvPdKfnjzBIjWkEk2RgNCk1/52qXomHpN+LDwL/XTT/XBuAzk70AYYrsxlQbyiqr
V5559QRyTgPe9PfVt0db9Q==
<SEC-DOCUMENT>0000950170-98-000413.txt : 19980309
<SEC-HEADER>0000950170-98-000413.hdr.sgml : 19980309
ACCESSION NUMBER: 0000950170-98-000413
CONFORMED SUBMISSION TYPE: 10-K405
PUBLIC DOCUMENT COUNT:
If you want a list of lines, add:
lines = text.splitlines()
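As for why the original attempt had no effect: str() on a bytes object returns its repr, including the b'...' wrapper, and str.replace returns a new string instead of changing the existing one. A small offline sketch, using two made-up sample lines in place of the downloaded data:

```python
# Made-up sample bytes standing in for lines read from the URL.
raw_lines = [b"Proc-Type: 2001,MIC-CLEAR\n", b"\tMIC-Info: RSA-MD5,RSA,\n"]

# str() gives the repr of each bytes object, so the b'...' wrapper becomes text.
as_repr = [str(x) for x in raw_lines]
print(as_repr[0])

# replace() returns a new string; without assigning it, nothing changes.
unchanged = as_repr[0]
unchanged.replace("b'", "")
print(unchanged == as_repr[0])

# Decoding turns bytes into real text; strip() drops the tab and the newline.
cleaned = [x.decode("utf-8").strip() for x in raw_lines]
print(cleaned)
```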
I am trying to parse a structure like this with pyparsing:
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
I need something like:
import pyparsing as pp
colon = pp.Suppress(':')
term = pp.Word(pp.alphanums + "_")
description = pp.SkipTo(next_identifier)
definition = term + colon + description
grammar = pp.OneOrMore(definition)
But I am struggling to define the next_identifier of the SkipTo clause since the identifiers may appear freely in the description text.
It seems that I need to include the indentation in the grammar, so that I can SkipTo the next non-indented line.
I tried:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) +
    pp.indentedBlock(
        pp.ZeroOrMore(
            pp.SkipTo(pp.LineEnd())
        ),
        indent_stack
    )
)
But I get the error:
ParseException: not a subentry (at char 55), (line:2, col:1)
Char 55 is at the very beginning of the run-on line:
...will wrap\n on to the next line...
^
Which seems a bit odd, because that char position is clearly followed by the whitespace which makes it an indented subentry.
My traceback in ipdb looks like:
5311 def checkSubIndent(s,l,t):
5312 curCol = col(l,s)
5313 if curCol > indentStack[-1]:
5314 indentStack.append( curCol )
5315 else:
-> 5316 raise ParseException(s,l,"not a subentry")
5317
ipdb> indentStack
[1]
ipdb> curCol
1
I should add that the whole structure above that I'm matching may also be indented (by an unknown amount), so a solution like:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) + pp.LineEnd() +
    pp.ZeroOrMore(
        pp.White(' ') + pp.SkipTo(pp.LineEnd()) + pp.LineEnd()
    )
)
...which works for the example as presented, will not work in my case, as it will consume the subsequent definitions.
When you use indentedBlock, the argument you pass in is the expression for each line in the block, so it shouldn't be indentedBlock(ZeroOrMore(line_expression), stack), just indentedBlock(line_expression, stack). Pyparsing includes a builtin expression for "everything from here to the end of the line", named restOfLine, so we will just use that as the expression for each line in the indented block:
import pyparsing as pp
NL = pp.LineEnd().suppress()
label = pp.ungroup(pp.Word(pp.alphas, pp.alphanums+'_') + pp.Suppress(":"))
indent_stack = [1]
# see corrected version below
#description = pp.Group((pp.Empty()
# + pp.restOfLine + NL
# + pp.ungroup(pp.indentedBlock(pp.restOfLine, indent_stack))))
description = pp.Group(pp.restOfLine + NL
+ pp.Optional(pp.ungroup(~pp.StringEnd()
+ pp.indentedBlock(pp.restOfLine,
indent_stack))))
labeled_text = pp.Group(label("label") + pp.Empty() + description("description"))
We use ungroup to remove the extra level of nesting created by indentedBlock but we also need to remove the per-line nesting that is created internally in indentedBlock. We do this with a parse action:
def combine_parts(tokens):
    # recombine description parts into a single list
    tt = tokens[0]
    new_desc = [tt.description[0]]
    new_desc.extend(t[0] for t in tt.description[1:])
    # reassign the rebuilt description into the parsed token structure
    tt['description'] = new_desc
    tt[1][:] = new_desc

labeled_text.addParseAction(combine_parts)
At this point, we are pretty much done. Here is your sample text parsed and dumped:
parsed_data = (pp.OneOrMore(labeled_text)).parseString(sample)
print(parsed_data[0].dump())
['identifier', ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']]
- description: ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']
- label: 'identifier'
Or this code to pull out the label and description fields:
for item in parsed_data:
    print(item.label)
    print('..' + '\n..'.join(item.description))
    print()
identifier
..some description text here which will wrap
..on to the next line. the follow-on text should be
..indented. it may contain identifier: and any text
..at all is allowed
next_identifier
..more description, short this time
last_identifier
..blah blah
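For comparison, here is a plain-Python sketch of the same idea that does not use pyparsing at all. It is a different technique from the grammar above, not a replacement for it: a line starts a new entry only if it looks like "label:" and is indented no deeper than the first label seen, so the whole structure may itself be indented. The names LABEL and parse_definitions are made up for this illustration.

```python
import re

# A candidate label line: optional indent, a word, a colon, then the text.
LABEL = re.compile(r'^(\s*)(\w+):\s*(.*)$')

def parse_definitions(text):
    """Return a list of (label, [description lines]) tuples."""
    entries = []
    base_indent = None  # indentation of the label lines, learned from the first one
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        m = LABEL.match(line)
        if m and (base_indent is None or indent <= base_indent):
            # A label at the base indentation starts a new entry.
            base_indent = indent
            entries.append((m.group(2), [m.group(3)]))
        elif entries:
            # Anything else, including "identifier:" on a deeper line,
            # is a continuation of the current description.
            entries[-1][1].append(line.strip())
    return entries

sample = (
    "identifier: some description text here which will wrap\n"
    "    on to the next line. the follow-on text should be\n"
    "    indented. it may contain identifier: and any text\n"
    "    at all is allowed\n"
    "next_identifier: more description, short this time\n"
    "last_identifier: blah blah\n"
)

for label, description in parse_definitions(sample):
    print(label)
    print('..' + '\n..'.join(description))
```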
I am trying to extract MAC addresses for each NIC from Dell's RACADM output such that my output should be like below:
NIC.Slot.2-2-1 --> 24:84:09:3E:2E:1B
I have used the following to extract the output
output = subprocess.check_output("sshpass -p {} ssh {}@{} racadm {}".format(args.password, args.username, args.hostname, args.command), shell=True).decode()
Part of output
https://pastebin.com/cz6LbcxU
Each component's details are displayed between ------ lines.
I want to search Device Type = NIC and then print Instance ID and Permanent MAC.
regex = r'Device Type = NIC'
match = re.findall(regex, output, flags=re.MULTILINE|re.DOTALL)
match = re.finditer(regex, output, flags=re.S)
I tried both of the above functions to find the match, but how do I print the InstanceID (e.g. [InstanceID: NIC.Slot.2-2-1]) and the PermanentMACAddress of the matched record?
Can anyone help?
If I understood correctly,
you can search for the pattern [InstanceID: ...] to get the instance id,
and PermanentMACAddress = ... to get the MAC address.
Here's one way to do it:
import re
match_inst = re.search(r'\[InstanceID: (?P<inst>[^]]*)', output)
match_mac = re.search(r'PermanentMACAddress = (?P<mac>.*)', output)
inst = match_inst.groupdict()['inst']
mac = match_mac.groupdict()['mac']
print('{} --> {}'.format(inst, mac))
# prints: NIC.Slot.2-2-1 --> 24:84:09:3E:2E:1B
If you have multiple records like this and want to map NIC to MAC, you can get a list of each and zip them together to create a dictionary:
inst = re.findall(r'\[InstanceID: (?P<inst>[^]]*)', output)
mac = re.findall(r'PermanentMACAddress = (?P<mac>.*)', output)
mapping = dict(zip(inst, mac))
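The zip approach pairs every InstanceID with every MAC regardless of device type. If only Device Type = NIC records are wanted, one option is to split the output on the dashed separator lines and inspect each record separately. This is a sketch under assumptions: the sample_output below is invented to mimic the described layout (the second record in particular is made up), and the real RACADM output may differ.

```python
import re

# Invented sample in the layout described in the question.
sample_output = """\
-------------------------------------------------------------------
[InstanceID: NIC.Slot.2-2-1]
Device Type = NIC
PermanentMACAddress = 24:84:09:3E:2E:1B
-------------------------------------------------------------------
[InstanceID: Example.Other.1]
Device Type = Other
PermanentMACAddress = 00:00:00:00:00:00
-------------------------------------------------------------------
"""

def nic_macs(output):
    """Map InstanceID to PermanentMACAddress for records whose Device Type is NIC."""
    mapping = {}
    # Each record sits between lines made up of dashes.
    for block in re.split(r'^-{3,}[ \t]*$', output, flags=re.MULTILINE):
        if not re.search(r'^Device Type[ \t]*=[ \t]*NIC[ \t]*$', block, flags=re.MULTILINE):
            continue
        inst = re.search(r'\[InstanceID:\s*([^\]]+)\]', block)
        mac = re.search(r'PermanentMACAddress\s*=\s*(\S+)', block)
        if inst and mac:
            mapping[inst.group(1)] = mac.group(1)
    return mapping

for inst, mac in nic_macs(sample_output).items():
    print('{} --> {}'.format(inst, mac))
```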
Your output looks like INI file content, so you could try to parse it using configparser.
>>> import configparser
>>> config = configparser.ConfigParser()
>>> config.read_string(output)
>>> for section in config.sections():
...     print(section)
...     print(config[section]['Device Type'])
...
InstanceID: NIC.Slot.2-2-1
NIC
>>>
I have an input text as follows:
SAVE_TIMECARD = "insert into sh_user_timecard (instance_id, user_id, in_time, in_time_activity_log_aid, in_time_activity_log_instance_id, " +"out_time, out_time_activity_log_aid, out_time_activity_log_instance_id, parent_aid, parent_instance_id)" + " values (:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId)";
The output I need is:
SAVE_TIMECARD =:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId
I've tried achieving this using:
result = re.findall(r'[A-z]+(:?=)',inputfile)
I need to extract the upper-case word, that is SAVE_TIMECARD, and all the words that start with a colon.
I found the solution:
import re
regex = re.compile(r"^[^=]{0,}|:(\w{1,})")
# Single quotes delimit the literal because the Java source itself contains double quotes.
testString = 'private static final String SAVE_TIMECARD = "insert into sh_user_timecard (instance_id, user_id, in_time, in_time_activity_log_aid, in_time_activity_log_instance_id, " +"out_time, out_time_activity_log_aid, out_time_activity_log_instance_id, parent_aid, parent_instance_id)" + " values (:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId)";'
matchArray = regex.findall(testString)
The matchArray variable contains the list of matches.
:\w+
Will identify the 'words starting with a colon'. You'll need to loop through the original text to find all instances.
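A minimal sketch combining both parts, assuming the constant name is the upper-case identifier directly before the first = sign. Note that re.findall returns only the group contents when a pattern has a capturing group, so the patterns here are kept either group-free or read through search(). The shortened java_line is a stand-in for the full statement in the question.

```python
import re

# Shortened stand-in for the Java statement from the question.
java_line = (
    'private static final String SAVE_TIMECARD = "insert into t (a, b)" '
    '+ " values (:instanceId, :userId)";'
)

# Upper-case identifier (letters, digits, underscores) right before "=".
name_match = re.search(r'\b([A-Z][A-Z0-9_]*)\s*=', java_line)
# Every word that starts with a colon; no capturing group, so findall
# returns the full matches including the colon.
params = re.findall(r':\w+', java_line)

if name_match:
    result = '{} = {}'.format(name_match.group(1), ', '.join(params))
    print(result)
```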