Using pyparsing, is there a way to extract the context you are in during recursive descent? Let me explain what I mean. I have the following code:
import pyparsing as pp

openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}"))
ident = pp.Word(pp.alphanums + "_" + ".")
comment = pp.Literal("//") + pp.restOfLine
messageName = ident
messageKw = pp.Suppress(pp.Keyword("msg"))
text = pp.Word(pp.alphanums + "_" + "." + "-" + "+")
otherText = ~messageKw + pp.Suppress(text)
messageExpr = pp.Forward()
messageExpr << (messageKw + messageName + openBrace +
                pp.ZeroOrMore(otherText) + pp.ZeroOrMore(messageExpr) +
                pp.ZeroOrMore(otherText) + closeBrace).ignore(comment)

testStr = "msg msgName1 { some text msg msgName2 { some text } some text }"
print messageExpr.parseString(testStr)
which produces this output: ['msgName1', 'msgName2']
In the output, I would like to keep track of the structure of embedded matches. For example, with the test string above I would like the output ['msgName1', 'msgName1.msgName2'], which keeps track of the hierarchy in the text. However, I am new to pyparsing and have not yet found a way to extract the fact that "msgName2" is embedded in the structure of "msgName1."
Is there a way to use the setParseAction() method of ParserElement to do this, or maybe to use named results?
Helpful advice would be appreciated.
Thanks to Paul McGuire for his sage advice. Here are the additions/changes I made, which solved the problem:
msgNameStack = []

def pushMsgName(str, loc, tokens):
    msgNameStack.append(tokens[0])
    tokens[0] = '.'.join(msgNameStack)

def popMsgName(str, loc, tokens):
    msgNameStack.pop()

closeBrace = pp.Suppress(pp.Literal("}")).setParseAction(popMsgName)
messageName = ident.setParseAction(pushMsgName)
And here is the complete code:
import pyparsing as pp

msgNameStack = []

def pushMsgName(str, loc, tokens):
    msgNameStack.append(tokens[0])
    tokens[0] = '.'.join(msgNameStack)

def popMsgName(str, loc, tokens):
    msgNameStack.pop()

openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}")).setParseAction(popMsgName)
ident = pp.Word(pp.alphanums + "_" + ".")
comment = pp.Literal("//") + pp.restOfLine
messageName = ident.setParseAction(pushMsgName)
messageKw = pp.Suppress(pp.Keyword("msg"))
text = pp.Word(pp.alphanums + "_" + "." + "-" + "+")
otherText = ~messageKw + pp.Suppress(text)
messageExpr = pp.Forward()
messageExpr << (messageKw + messageName + openBrace +
                pp.ZeroOrMore(otherText) + pp.ZeroOrMore(messageExpr) +
                pp.ZeroOrMore(otherText) + closeBrace).ignore(comment)

testStr = "msg msgName1 { some text msg msgName2 { some text } some text }"
print messageExpr.parseString(testStr)
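With the parse actions in place, each messageName is pushed onto the stack as it is matched and rewritten as the dot-joined path, and each closing brace pops the stack, so the same test string now prints the hierarchy I was after:
['msgName1', 'msgName1.msgName2']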
I'm editing a PDF template using pdftk:
command = ("pdftk " + '"' +
template + '"' +
" fill_form " + '"' +
pathUser + user['mail'] + ".xfdf" + '"' +
" output " + '"' +
pathUser + user['mail'] + ".pdf" + '"' +
" need_appearances")
command = command.replace('/', '\\')
os.system(command)
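As a side note, a list-based subprocess call avoids the manual quoting entirely; a minimal sketch, assuming pdftk is on the PATH and using the same template and pathUser variables as above:
import subprocess

# Each argument is passed as-is, so paths containing spaces need no
# extra quote characters; apply the path-separator replacement to the
# individual arguments if pdftk needs backslashes on your system.
subprocess.run([
    "pdftk", template,
    "fill_form", pathUser + user['mail'] + ".xfdf",
    "output", pathUser + user['mail'] + ".pdf",
    "need_appearances",
], check=True)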
First I'm writing my data to an .xfdf file:
fields = []
for key, value in user.items():
    print(key, value)
    fields.append(u"""<field name="%s"><value>%s</value></field>""" % (key, value))

tpl = u"""<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
%s
</fields>
</xfdf>""" % "\n".join(fields)

f = open(pathUser + user['mail'] + '.xfdf', 'wb')
f.write(tpl.encode("utf-8"))
f.close()
I fetch the template and, as shown above, write the data from the .xfdf to the PDF, but for some reason only the "ime" field gets written.
Templates get fetched using some basic conditional logic as shown below:
for item in user['predavanja']:
    user[acthead + str(actn)] = item
    actn += 1
for item in user['radionice']:
    user[acthead + str(actn)] = item
    actn += 1
for item in user['izlet']:
    user[acthead + str(actn)] = item
    actn += 1
print(actn)
templates = {}
templates['0'] = "Template/2019/certificate_2019.pdf"
templates['5'] = "Template/2019/certificate_2019_5.pdf"
templates['10'] = "Template/2019/certificate_2019_10.pdf"
templates['15'] = "Template/2019/certificate_2019_15.pdf"
templates['20'] = "Template/2019/certificate_2019_20.pdf"
templates['25'] = "Template/2019/certificate_2019_25.pdf"
templates['30'] = "Template/2019/certificate_2019_30.pdf"
templates['35'] = "Template/2019/certificate_2019_35.pdf"
templates['40'] = "Template/2019/certificate_2019_40.pdf"
templates['45'] = "Template/2019/certificate_2019_45.pdf"
templates['50'] = "Template/2019/certificate_2019_50.pdf"
I'm writing this data:
user['id'] = data['recommendations'][0]['role_in_team']['user']['id']
user['ime'] = data['recommendations'][0]['role_in_team']['user']['first_name']
user['prezime'] = data['recommendations'][0]['role_in_team']['user']['last_name']
user['tim'] = data['recommendations'][0]['role_in_team']['team']['short_name']
user['mail'] = data['recommendations'][0]['role_in_team']['user']['estudent_email']
user['puno_ime'] = (data['recommendations'][0]['role_in_team']['user']['first_name'] + ' ' +
                    data['recommendations'][0]['role_in_team']['user']['last_name'])
user['predavanja'] = predavanja
user['radionice'] = radionice
user['izlet'] = izlet
One note: predavanja, radionice and izlet are lists.
I've tried printing tpl which shows all the data being properly added to the scheme.
Turns out the issue was the naming of the variables: they didn't match the field names in the AcroForm PDF. The solution was to rename the variables in the code to match the field names.
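For anyone hitting the same mismatch: pdftk's dump_data_fields operation lists the form field names in a template, so you can check your dict keys against them before filling. A minimal sketch, using one of the template paths from above:
import subprocess

# Print every AcroForm field name found in the template.
output = subprocess.check_output(
    ["pdftk", "Template/2019/certificate_2019.pdf", "dump_data_fields"])
for line in output.decode("utf-8").splitlines():
    if line.startswith("FieldName:"):
        print(line.split(":", 1)[1].strip())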
This is my code to create a hashtag file. The issue is that it does not put the # in front of the first hashtag, and at the end it puts a double hashtag, like below:
passiveincome, #onlinemarketing, #wahmlife, #cash, #entrepreneurlifestyle, #makemoneyonline, #makemoneyfast, #entrepreneurlifestyle, #mlm, #mlm
How do I get the code to remove the double output and put the # at the beginning?
import random, os, sys

basepath = os.path.dirname(sys.argv[0]) + "/"
outputpath = "C:/Users/matth/OneDrive/Desktop/Create hashtags/"

paragraphsmin = 9
paragraphsmax = 9
sentencemin = 1
sentencemax = 1

keywords = []
for line in open(basepath + "/base.txt", "r"):
    keywords.append(line.replace("\n", ""))

keywordlist = []
keyword = open(basepath + "/text-original.txt", "r")
for line in keyword:
    keywordlist.append(line.replace("\n", "\n"))

def type(name):
    value = name[random.randint(0, len(name) - 1)]
    return value

"""
def xyz(num):
    s1 = '' + type(keywordlist).strip()
    return eval('s' + str(num))
"""

def s1():
    return '' + type(keywordlist).strip()

def randomSentence():
    sent = eval("s" + str(random.randint(1, 1)) + "()")
    return sent

for keyword in keywords:
    outputfile = open(outputpath + keyword.replace(" ", " ") + ".txt", "w")
    outputfile.write('')
    for p in range(1, random.randint(paragraphsmin, paragraphsmax) + 1):
        outputfile.write('')
        for s in range(1, random.randint(sentencemin, sentencemax) + 1):
            sentence = randomSentence()
            if str(sentence)[0] == "\"":
                outputfile.write("" + str(sentence)[0] + str(sentence)[1] + str(sentence)[2:] + " ")
            else:
                outputfile.write("" + str(sentence)[0] + str(sentence)[1:] + ", #")
        outputfile.write('')
    outputfile.write(sentence.replace("", "") + "")
    outputfile.close()
Try replacing
outputfile.write("" + str(sentence)[0] + str(sentence)[1:] + ", #")
with
outputfile.write("#" + str(sentence)[0] + str(sentence)[1:] + ", ")
I am trying to call several API datasets within a for loop, changing the call each time and appending the datasets together into a larger dataframe.
I have written this code, which works for the first call but then returns this error on the next one:
url = base + "max=" + maxrec + "&" "type=" + item + "&" + "freq=" + freq + "&" + "px=" + px + "&" + "ps=" + str(ps) + "&" + "r=" + r + "&" + "p=" + p + "&" + "rg=" + rg + "&" + "cc=" + cc + "&" + "fmt=" + fmt
TypeError: must be str, not Response
Here is my current code:
import requests
import pandas as pd

base = "http://comtrade.un.org/api/get?"
maxrec = "50000"
item = "C"
freq = "A"
px = "H0"
ps = "all"
r = "all"
p = "0"
rg = "2"
cc = "AG2"
fmt = "json"

comtrade = pd.DataFrame(columns=[])
for year in range(1991, 2018):
    ps = "{}".format(year)
    url = base + "max=" + maxrec + "&" "type=" + item + "&" + "freq=" + freq + "&" + "px=" + px + "&" + "ps=" + str(ps) + "&" + "r=" + r + "&" + "p=" + p + "&" + "rg=" + rg + "&" + "cc=" + cc + "&" + "fmt=" + fmt
    r = requests.get(url)
    x = r.json()
    new = pd.DataFrame(x["dataset"])
    comtrade = comtrade.append(new)
Let requests assemble the URL for you.
common_params = {
    "max": maxrec,
    "type": item,
    "freq": freq,
    # etc
}

for year in range(1991, 2018):
    response = requests.get(base, params=dict(common_params, ps=str(year)))
    response_data = response.json()
    new = pd.DataFrame(response_data["dataset"])
    comtrade = comtrade.append(new)
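A side note on the dataframe part: appending inside the loop copies the frame on every iteration. Collecting the pieces in a list and concatenating once at the end is the usual pandas idiom; a sketch under the same assumptions as above:
frames = []
for year in range(1991, 2018):
    response = requests.get(base, params=dict(common_params, ps=str(year)))
    frames.append(pd.DataFrame(response.json()["dataset"]))

# One concatenation at the end instead of repeated appends.
comtrade = pd.concat(frames, ignore_index=True)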
Disclaimer: the other answer is correct and you should use it.
However, your actual problem stems from the fact that you are overwriting r here:
r = requests.get(url)
x = r.json()
On the next iteration, r is still that Response object rather than the "all" string you initialized it with, so concatenating it into the URL raises the TypeError. You could simply rename it to result to avoid the problem. Better to let the requests library do the work, though.
I'm trying to categorize customer feedback and I ran an LDA in python and got the following output for 10 topics:
(0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" + 0.011*"thanks" + 0.010*"city" + 0.009*"way"')
(1, u'0.397*"package" + 0.073*"address" + 0.055*"time" + 0.047*"customer" + 0.045*"apartment" + 0.037*"delivery" + 0.031*"number" + 0.026*"item" + 0.021*"support" + 0.018*"door"')
(2, u'0.190*"time" + 0.127*"order" + 0.113*"minute" + 0.075*"pickup" + 0.074*"restaurant" + 0.031*"food" + 0.027*"support" + 0.027*"delivery" + 0.026*"pick" + 0.018*"min"')
(3, u'0.072*"code" + 0.067*"gps" + 0.053*"map" + 0.050*"street" + 0.047*"building" + 0.043*"address" + 0.042*"navigation" + 0.039*"access" + 0.035*"point" + 0.028*"gate"')
(4, u'0.434*"hour" + 0.068*"time" + 0.034*"min" + 0.032*"amount" + 0.024*"pay" + 0.019*"gas" + 0.018*"road" + 0.017*"today" + 0.016*"traffic" + 0.014*"load"')
(5, u'0.245*"route" + 0.154*"warehouse" + 0.043*"minute" + 0.039*"need" + 0.039*"today" + 0.026*"box" + 0.025*"facility" + 0.025*"bag" + 0.022*"end" + 0.020*"manager"')
(6, u'0.371*"location" + 0.110*"pick" + 0.097*"system" + 0.040*"im" + 0.038*"employee" + 0.022*"evening" + 0.018*"issue" + 0.015*"request" + 0.014*"while" + 0.013*"delivers"')
(7, u'0.182*"schedule" + 0.181*"please" + 0.059*"morning" + 0.050*"application" + 0.040*"payment" + 0.026*"change" + 0.025*"advance" + 0.025*"slot" + 0.020*"date" + 0.020*"tomorrow"')
(8, u'0.138*"stop" + 0.110*"work" + 0.062*"name" + 0.055*"account" + 0.046*"home" + 0.043*"guy" + 0.030*"address" + 0.026*"city" + 0.025*"everything" + 0.025*"feature"')
Is there a way to automatically label them? I do have a CSV file with manually labeled feedback, but I do not want to supply these labels myself. I want the model to create the labels. Is it possible?
The comments here link to another SO answer that links to a paper. Let's say you wanted to do the minimum to try to make this work. Here is an MVP-style solution that has worked for me: search Google for the terms, then look for keywords in the response.
Here is some working, though hacky, code:
pip install cssselect
then
import re
from lxml.html import fromstring
from requests import get
from collections import Counter

def get_srp_text(search_term):
    raw = get(f"https://www.google.com/search?q={search_term}").text
    page = fromstring(raw)
    blob = ""
    for result in page.cssselect("a"):
        for res in result.findall("div"):
            blob += ' '
            blob += res.text if res.text else " "
            blob += ' '
    return blob

def blob_cleaner(blob):
    # Replace separator punctuation with spaces, then keep only
    # alphanumerics and whitespace.
    clean_blob = re.sub(r'[/,():_\-]', ' ', blob)
    return ''.join(e for e in clean_blob if e.isalnum() or e.isspace())

def get_name_from_srp_blob(clean_blob):
    blob_tokens = list(filter(bool, map(lambda x: x if len(x) > 2 else '', clean_blob.split(' '))))
    c = Counter(blob_tokens)
    most_common = c.most_common(10)
    name = f"{most_common[0][0]}-{most_common[1][0]}"
    return name

pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))
Then you can just get the topic words from your model, e.g.
topic_terms = "delivery area mile option partner traffic hub thanks city way"
name = pipeline(topic_terms)
print(name)
>>> City-Transportation
and
topic_terms = "package address time customer apartment delivery number item support door"
name = pipeline(topic_terms)
print(name)
>>> Parcel-Package
You could improve this a lot. For example, you could use POS tags to find only the most common nouns, then use those for the name (see the sketch below). Or find the most common adjective and noun, and make the name "Adjective Noun". Even better, you could get the text from the linked sites, then run YAKE to extract keywords.
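Here is a minimal sketch of that noun filter using NLTK (my assumption; any POS tagger would do, and NLTK's averaged_perceptron_tagger data must be downloaded first):
import nltk
from collections import Counter

def most_common_nouns(blob_tokens, n=2):
    # Keep only tokens tagged as nouns (NN, NNS, NNP, NNPS).
    tagged = nltk.pos_tag(blob_tokens)
    nouns = [word for word, tag in tagged if tag.startswith('NN')]
    return Counter(nouns).most_common(n)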
Regardless, this demonstrates a simple way to automatically name clusters, without directly using machine learning (though, Google is most certainly using it to generate the search results, so you are benefitting from it).
This is my code. However, it can only parse exactly four Arabic characters (paws). I want it to parse dynamically, so that the number of characters does not matter: it should handle one character, two characters, or more, based on how many exist.
import xml.etree.ElementTree as ET
import os, glob
from time import time

# read xml paths (raw string so the backslashes are not treated as escapes)
xml_path = glob.glob(r'D:\1. Thesis FINISH!!!\*.xml')

# open the file for saving the result
file = open("parsing.csv", "w")

# record the starting time
t0 = time()

# create file header
file.write('wordImage_id'+'|'+'paw1'+'|'+'paw2'+'|'+'paw3'+'|'+'paw4'+'|'+'font_size'+'|'+'font_style'+
           '|'+'font_name'+'|'+'specs_effect'+'|'+'specs_height'+
           '|'+'specs_width'+'|'+'specs_encoding'+'|'+'generation_filtering'+
           '|'+'generation_renderer'+'|'+'generation_type'+'\n')

for doc in xml_path:
    print 'Reading file - ', os.path.basename(doc)
    tree = ET.parse(doc)
    root = tree.getroot()

    # get wordimage id
    image_id = root.attrib['id']

    # get paws 1 through 4 (hard-coded: this is the part that should be dynamic)
    paw1 = root[0][0].text
    paw2 = root[0][1].text
    paw3 = root[0][2].text
    paw4 = root[0][3].text

    # get properties of font
    for font in root.findall('font'):
        size = font.get('size')
        style = font.get('fontStyle')
        name = font.get('name')

    # get properties of specs
    for specs in root.findall('specs'):
        effect = specs.get('effect')
        height = specs.get('height')
        width = specs.get('width')
        encoding = specs.get('encoding')

    # get properties of generation
    for generation in root.findall('generation'):
        filtering = generation.get('filtering')
        renderer = generation.get('renderer')
        types = generation.get('type')

    # save the result in csv
    file.write(image_id + '|' + paw1 + '|' + paw2 + '|' + paw3 + '|' + paw4 + '|' + size + '|' +
               style + '|' + name + '|' + effect + '|' + height + '|' +
               width + '|' + encoding + '|' + filtering + '|' + renderer + '|' + types + '\n')

# close the file
file.close()

# print time execution
print("process done in %0.3fs." % (time() - t0))