Python Pattern Regex

I have an input text as follows:
SAVE_TIMECARD = "insert into sh_user_timecard (instance_id, user_id, in_time, in_time_activity_log_aid, in_time_activity_log_instance_id, " +"out_time, out_time_activity_log_aid, out_time_activity_log_instance_id, parent_aid, parent_instance_id)" + " values (:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId)";
The output I need is:
SAVE_TIMECARD =:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId
I've tried achieving this using:
result = re.findall(r'[A-z]+(:?=)',inputfile)
I need to extract the upper-case word, that is SAVE_TIMECARD, and all the words that start with a colon.

I found the solution:
import re
regex = re.compile(r"^[^=]*|:(\w+)")
testString = 'private static final String SAVE_TIMECARD = "insert into sh_user_timecard (instance_id, user_id, in_time, in_time_activity_log_aid, in_time_activity_log_instance_id, " +"out_time, out_time_activity_log_aid, out_time_activity_log_instance_id, parent_aid, parent_instance_id)" + " values (:instanceId, :userId, :inTime, :inTimeActivityAid, :inTimeActivityInstanceId, :outTime, :outTimeActivityAid, " +":outTimeActivityInstanceId, :parentAid, :parentInstanceId)";'
matchArray = regex.findall(testString)
The matchArray variable contains the list of matches.

:\w+
Will identify the 'words starting with a colon'. You'll need to loop through the original text to find all instances.
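Putting the two pieces together, a minimal sketch could look like this (the `line` variable is a shortened stand-in for the question's input; the name-matching regex is one possible choice, not the only one):

```python
import re

# Shortened stand-in for the input line from the question
line = 'SAVE_TIMECARD = "insert into sh_user_timecard (...)" + " values (:instanceId, :userId, :inTime)";'

# The upper-case constant name: capitals/underscores followed by '='
name = re.search(r'\b[A-Z][A-Z_]*\b(?=\s*=)', line).group()

# Every word that starts with a colon
params = re.findall(r':\w+', line)

print(name + " = " + ", ".join(params))
```

`re.findall` already returns all occurrences, so no explicit loop over the text is needed.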


unable to retrieve any candidates with kb.get_candidates

I created a csv file like this:
"CAMERA", "Camera", "kamera", "cam", "Kamera"
"PICTURE", "Picture", "bild", "photograph"
and used it somewhat like this:
nlp = de_core_news_sm.load()
text = "Cam is not good"
doc = nlp(text)
name_dict, desc_dict = load_entities()

kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=96)
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)  # 342 is an arbitrary value here
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1])  # 100% prior probability P(entity|alias)
Printing values like this:
print(f"Entities in the KB: {kb.get_entity_strings()}")
print(f"Aliases in the KB: {kb.get_alias_strings()}")
gives me:
Entities in the KB: ['PICTURE', 'CAMERA']
Aliases in the KB: [' "Camera"', ' "Picture"']
However, if I try to check for candidates, I only get an empty list:
candidates = kb.get_candidates("Camera")
print(candidates)
for c in candidates:
    print(" ", c.entity_, c.prior_prob, c.entity_vector)
Aliases in the KB: [' "Camera"', ' "Picture"']
It looks to me as if your parsing script added the literal string "Camera", with spaces and quotes and all, to the KB, instead of just the raw string Camera?
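As a sketch of the fix, assuming the CSV layout from the question: the csv module with skipinitialspace=True handles both the surrounding quotes and the space after each comma, so clean alias strings like 'Camera' can be passed to add_alias (the `raw` string stands in for the actual file):

```python
import csv
import io

# Simulated file content from the question; normally this would come from open(path)
raw = '"CAMERA", "Camera", "kamera", "cam", "Kamera"\n' \
      '"PICTURE", "Picture", "bild", "photograph"\n'

aliases = {}
# skipinitialspace=True makes the reader skip the blank after each comma,
# so the quoted fields are parsed as plain strings instead of ' "Camera"'
for row in csv.reader(io.StringIO(raw), skipinitialspace=True):
    qid, names = row[0], row[1:]
    aliases[qid] = names

print(aliases["CAMERA"])  # ['Camera', 'kamera', 'cam', 'Kamera']
```

With the aliases cleaned this way, kb.get_candidates("Camera") should no longer come back empty.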

Replace double quotes with single quotes inside a certain text field in a file (Python)

So I have some JSON files which contain in certain sentences this:
"message": "Merge branch " master " of example-week-18"
And because there are double quotes inside the message items, the JSON gets destroyed.
So basically I want to use the string replace() method, but to replace the double quotes with single quotes only inside the double quotes of the message item. I guess I have to use a regular expression + replace()?
A desired outcome would be this:
INPUT:
"message": "Merge branch " master " of example-week-18"
"message": "Don"t do it"
OUTPUT:
"message": "Merge branch ' master ' of example-week-18"
"message": "Don't do it"
You're right, you can combine a regular expression with the replace method.
Here I use the re module to find all the message content (the block after "message":). Then I replace the double quotes with single quotes. Finally, I rebuild the original whole message.
Here is the code:
# Import module
import re

# Your text
message = """
"message": "Merge branch " master " of example-week-18"
"message": "Don"t do it"
"""

new_text = ""
# Select all the data after: "message":
list_message = re.findall(r"\"message\"\s*:\s*?(\".*)", message)
# Replace the " by ' in the message content + rebuild the original row
for message in list_message:
    new_text += '"message": "' + message[1:-1].replace('"', "'") + '"\n'
print(new_text)
# "message": "Merge branch ' master ' of example-week-18"
# "message": "Don't do it"
Hope that helps!
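The same idea can be sketched as a single re.sub call with a replacement function, so the outermost quotes stay intact and only the inner ones are converted (fix_inner_quotes is an illustrative name, not a library function):

```python
import re

text = '''"message": "Merge branch " master " of example-week-18"
"message": "Don"t do it"'''

def fix_inner_quotes(m):
    # keep the outermost pair of quotes, convert any quotes between them
    inner = m.group(1)[1:-1].replace('"', "'")
    return '"message": "' + inner + '"'

# the greedy ".*" runs to the last double quote on each line
fixed = re.sub(r'"message"\s*:\s*(".*")', fix_inner_quotes, text)
print(fixed)
```

This relies on each "message" entry sitting on its own line, since `.` does not match newlines by default.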

pyparsing how to SkipTo end of indented block?

I am trying to parse a structure like this with pyparsing:
identifier: some description text here which will wrap
on to the next line. the follow-on text should be
indented. it may contain identifier: and any text
at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
I need something like:
import pyparsing as pp
colon = pp.Suppress(':')
term = pp.Word(pp.alphanums + "_")
description = pp.SkipTo(next_identifier)
definition = term + colon + description
grammar = pp.OneOrMore(definition)
But I am struggling to define the next_identifier of the SkipTo clause since the identifiers may appear freely in the description text.
It seems that I need to include the indentation in the grammar, so that I can SkipTo the next non-indented line.
I tried:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) +
    pp.indentedBlock(
        pp.ZeroOrMore(
            pp.SkipTo(pp.LineEnd())
        ),
        indent_stack
    )
)
But I get the error:
ParseException: not a subentry (at char 55), (line:2, col:1)
Char 55 is at the very beginning of the run-on line:
...will wrap\n on to the next line...
^
Which seems a bit odd, because that char position is clearly followed by the whitespace which makes it an indented subentry.
My traceback in ipdb looks like:
   5311     def checkSubIndent(s,l,t):
   5312         curCol = col(l,s)
   5313         if curCol > indentStack[-1]:
   5314             indentStack.append( curCol )
   5315         else:
-> 5316             raise ParseException(s,l,"not a subentry")
   5317
ipdb> indentStack
[1]
ipdb> curCol
1
I should add that the whole structure above that I'm matching may also be indented (by an unknown amount), so a solution like:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) + pp.LineEnd() +
    pp.ZeroOrMore(
        pp.White(' ') + pp.SkipTo(pp.LineEnd()) + pp.LineEnd()
    )
)
...which works for the example as presented will not work in my case as it will consume the subsequent definitions.
When you use indentedBlock, the argument you pass in is the expression for each line in the block, so it shouldn't be indentedBlock(ZeroOrMore(line_expression), stack), just indentedBlock(line_expression, stack). Pyparsing includes a built-in expression for "everything from here to the end of the line", called restOfLine, so we will just use that as the expression for each line in the indented block:
import pyparsing as pp

NL = pp.LineEnd().suppress()
label = pp.ungroup(pp.Word(pp.alphas, pp.alphanums+'_') + pp.Suppress(":"))
indent_stack = [1]

# see corrected version below
#description = pp.Group((pp.Empty()
#                        + pp.restOfLine + NL
#                        + pp.ungroup(pp.indentedBlock(pp.restOfLine, indent_stack))))

description = pp.Group(pp.restOfLine + NL
                       + pp.Optional(pp.ungroup(~pp.StringEnd()
                                                + pp.indentedBlock(pp.restOfLine,
                                                                   indent_stack))))

labeled_text = pp.Group(label("label") + pp.Empty() + description("description"))
We use ungroup to remove the extra level of nesting created by indentedBlock but we also need to remove the per-line nesting that is created internally in indentedBlock. We do this with a parse action:
def combine_parts(tokens):
    # recombine description parts into a single list
    tt = tokens[0]
    new_desc = [tt.description[0]]
    new_desc.extend(t[0] for t in tt.description[1:])
    # reassign the rebuilt description into the parsed token structure
    tt['description'] = new_desc
    tt[1][:] = new_desc

labeled_text.addParseAction(combine_parts)
At this point, we are pretty much done. Here is your sample text parsed and dumped:
parsed_data = (pp.OneOrMore(labeled_text)).parseString(sample)
print(parsed_data[0].dump())
['identifier', ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']]
- description: ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']
- label: 'identifier'
Or use this code to pull out the label and description fields:
for item in parsed_data:
    print(item.label)
    print('..' + '\n..'.join(item.description))
    print()
identifier
..some description text here which will wrap
..on to the next line. the follow-on text should be
..indented. it may contain identifier: and any text
..at all is allowed
next_identifier
..more description, short this time
last_identifier
..blah blah

How can I tokenize this text into sentences with regex?

"You could not possibly have come at a better time, my dear Watson,"
he said cordially. 'It is not worth your while to wait,' she went
on."You can pass through the door; no one hinders." And then, seeing that I smiled and shook my head, she suddenly threw aside her
constraint and made a step forward, with her hands wrung together.
Look at the highlighted area. How can I possibly distinguish a case where '"' is followed by a period (.) to end a sentence from a case where a period (.) is followed by a '"'?
I have tried this piece for the tokenizer. It works well except for just that one part.
(([^।\.?!]|[।\.?!](?=[\"\']))+\s*[।\.?!]\s*)
Edit: I am not planning to use any NLP toolkit to solve this problem.
Use NLTK instead of regular expressions here:
from nltk import sent_tokenize
parts = sent_tokenize(your_string)
# ['"You could not possibly have come at a better time, my dear Watson," he said cordially.', "'It is not worth your while to wait,' she went on.", '"You can pass through the door; no one hinders."', 'And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.']
I found this function a while ago (note that it is written for Python 2, hence the unicode and decode calls):
def split_into_sentences(text):
    caps = u"([A-Z])"
    prefixes = u"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = u"(Inc|Ltd|Jr|Sr|Co)"
    starters = u"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = u"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = u"[.](com|net|org|io|gov|mobi|info|edu)"
    if not isinstance(text, unicode):
        text = text.decode('utf-8')
    text = u" {0} ".format(text)
    text = text.replace(u"\n", u" ")
    text = re.sub(prefixes, u"\\1<prd>", text)
    text = re.sub(websites, u"<prd>\\1", text)
    if u"Ph.D" in text:
        text = text.replace(u"Ph.D.", u"Ph<prd>D<prd>")
    text = re.sub(u"\s" + caps + u"[.] ", u" \\1<prd> ", text)
    text = re.sub(acronyms + u" " + starters, u"\\1<stop> \\2", text)
    text = re.sub(caps + u"[.]" + caps + u"[.]" + caps + u"[.]", u"\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(caps + u"[.]" + caps + u"[.]", u"\\1<prd>\\2<prd>", text)
    text = re.sub(u" " + suffixes + u"[.] " + starters, u" \\1<stop> \\2", text)
    text = re.sub(u" " + suffixes + u"[.]", u" \\1<prd>", text)
    text = re.sub(u" " + caps + u"[.]", u" \\1<prd>", text)
    if u"\"" in text:
        text = text.replace(u".\"", u"\".")
    if u"!" in text:
        text = text.replace(u"!\"", u"\"!")
    if u"?" in text:
        text = text.replace(u"?\"", u"\"?")
    text = text.replace(u".", u".<stop>")
    text = text.replace(u"?", u"?<stop>")
    text = text.replace(u"!", u"!<stop>")
    text = text.replace(u"<prd>", u".")
    sentences = text.split(u"<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

How to get the info which contains 'sss', not "="

This is my code:
class Marker_latlng(db.Model):
    geo_pt = db.GeoPtProperty()

class Marker_info(db.Model):
    info = db.StringProperty()
    marker_latlng = db.ReferenceProperty(Marker_latlng)

q = Marker_info.all()
q.filter("info =", "sss")
But how do I get the info which contains 'sss', not an exact "=" match?
Is there a method like "contains"?
q = Marker_info.all()
q.filter("info contains", "sss")
Instead of using a StringProperty, you could use a StringListProperty. Before saving the info string, split it into a list of strings containing each word.
Then, when you use q.filter("info =", "sss"), it will match any item which contains a word equal to "sss".
For something more general, you could look into app engine full text search.
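A sketch of the idea, assuming the legacy App Engine db API: an equality filter on a list property matches when any element of the list equals the value, so splitting info into words before saving gives a word-level "contains". The splitting helper below is the only plain-Python part (index_words is an illustrative name):

```python
def index_words(info):
    # words to store alongside the original string, lower-cased for matching
    return sorted(set(info.lower().split()))

print(index_words("sss is in Here"))  # ['here', 'in', 'is', 'sss']
```

With a hypothetical model field info_words = db.StringListProperty() populated via entity.info_words = index_words(entity.info) before put(), the query q.filter("info_words =", "sss") would return entities whose info contains the word sss.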
