Parsing dynamic arabic character xml in python

Parsing dynamic arabic character xml in python - python

This is the code. However, this code can only parse 4 characters of Arabian only. I want it to parse dynamically. So, the number of characters does not matter. Therefore, it can parse 1 character, 2 character or more based on the number of existing characters.
import xml.etree.ElementTree as ET
import os, glob
import csv
from time import time
#read xml path
xml_path = glob.glob('D:\1. Thesis FINISH!!!\*.xml')
#create file declaration for saving the result
file = open("parsing.csv","w")
#file = open("./%s" % ('parsing.csv'), 'w')
#create variable of starting time
t0=time()
#create file header
file.write('wordImage_id'+'|'+'paw1'+'|'+'paw2'+'|' + 'paw3' + '|' + 'paw4' + '|'+'font_size'+'|'+'font_style'+
'|'+'font_name'+'|'+'specs_effect'+'|'+'specs_height'+'|'+'specs_height'
+'|'+'specs_width'+'|'+'specs_encoding'+'|'+'generation_filtering'+
'|'+'generation_renderer'+'|'+'generation_type' + '\n')
for doc in xml_path:
print 'Reading file - ', os.path.basename(doc)
tree = ET.parse(doc)
#tree = ET.parse('D:\1. Thesis FINISH!!!\Image_14_AdvertisingBold_13.xml')
root = tree.getroot()
#get wordimage id
image_id = root.attrib['id']
#get paw 1 and paw 2
paw1 = root[0][0].text
paw2 = root[0][1].text
paw3 = root[0][2].text
paw4 = root[0][3].text
#get properties of font
for font in root.findall('font'):
size = font.get('size')
style = font.get('fontStyle')
name = font.get('name')
#get properties of specs
for specs in root.findall('specs'):
effect = specs.get('effect')
height = specs.get('height')
width = specs.get('width')
encoding = specs.get('encoding')
#get properties for generation
for generation in root.findall('generation'):
filtering = generation.get('filtering')
renderer = generation.get('renderer')
types = generation.get('type')
#save the result in csv
file.write(image_id + '|' + paw1 + '|' + paw2 + '|' + paw3 + '|' + paw4 + '|' + size + '|' +
style + '|' + name + '|' + effect + '|' + height + '|'
+ width + '|' + encoding + '|' + filtering + '|' + renderer + '|' + types + '\n')
#close the file
file.close()
#print time execution
print("process done in %0.3fs." % (time() - t0))

Related

PDF template not merging data properly with pdftk

I'm editing a PDF template with using pdftk
command = ("pdftk " + '"' +
template + '"' +
" fill_form " + '"' +
pathUser + user['mail'] + ".xfdf" + '"' +
" output " + '"' +
pathUser + user['mail'] + ".pdf" + '"' +
" need_appearances")
command = command.replace('/', '\\')
os.system(command)
First I'm writing my data in a .xfdf file
for key, value in user.items():
print(key, value)
fields.append(u"""<field name="%s"><value>%s</value></field>""" % (key, value))
tpl = u"""<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<fields>
%s
</fields>
</xfdf>""" % "\n".join(fields)
f = open(pathUser + user['mail'] + '.xfdf', 'wb')
f.write(tpl.encode("utf-8"))
f.close()
I fetch the template and as shown above, write the data from the xfdf to pdf but for some reason, only the ime gets written.
Templates get fetched using some basic conditional logic as shown below:
for item in user['predavanja']:
user[acthead + str(actn)] = item
actn += 1
for item in user['radionice']:
user[acthead + str(actn)] = item
actn += 1
for item in user['izlet']:
user[acthead + str(actn)] = item
actn += 1
print(actn)
templates = {}
templates['0'] = "Template/2019/certificate_2019.pdf"
templates['5'] = "Template/2019/certificate_2019_5.pdf"
templates['10'] = "Template/2019/certificate_2019_10.pdf"
templates['15'] = "Template/2019/certificate_2019_15.pdf"
templates['20'] = "Template/2019/certificate_2019_20.pdf"
templates['25'] = "Template/2019/certificate_2019_25.pdf"
templates['30'] = "Template/2019/certificate_2019_30.pdf"
templates['35'] = "Template/2019/certificate_2019_35.pdf"
templates['40'] = "Template/2019/certificate_2019_40.pdf"
templates['45'] = "Template/2019/certificate_2019_45.pdf"
templates['50'] = "Template/2019/certificate_2019_50.pdf"
I'm writing this data
user['id'] = data['recommendations'][0]['role_in_team']['user']['id']
user['ime'] = data['recommendations'][0]['role_in_team']['user']['first_name']
user['prezime'] = data['recommendations'][0]['role_in_team']['user']['last_name']
user['tim'] = data['recommendations'][0]['role_in_team']['team']['short_name']
user['mail'] = data['recommendations'][0]['role_in_team']['user']['estudent_email']
user['puno_ime'] = (data['recommendations'][0]['role_in_team']['user']['first_name'] + ' ' +
data['recommendations'][0]['role_in_team']['user']['last_name'])
user['predavanja'] = predavanja
user['radionice'] = radionice
user['izlet'] = izlet
One note. predavanja, radionice and izlet are lists.
I've tried printing tpl which shows all the data being properly added to the scheme.

Turns out the issue was the naming of the variables since they didn't match the field names in the acroform PDF. So the solution was to rename the variables in the code to match the field names.

Why does my Python XML parser break after the first file?

I am working on a Python (3) XML parser that should extract the text content of specific nodes from every xml file within a folder. Then, the script should write the collected data into a tab-separated text file. So far, all the functions seem to be working. The script returns all the information that I want from the first file, but it always breaks, I believe, when it starts to parse the second file.
When it breaks, it returns "TypeError: 'str' object is not callable." I've checked the second file and found that the functions work just as well on that as the first file when I remove the first file from the folder. I'm very new to Python/XML. Any advice, help, or useful links would be greatly appreciated. Thanks!
import xml.etree.ElementTree as ET
import re
import glob
import csv
import sys
content_file = open('WWP Project/WWP_texts.txt','wt')
quotes_file = open('WWP Project/WWP_quotes.txt', 'wt')
list_of_files = glob.glob("../../../Documents/WWPtextbase/distribution/*.xml")
ns = {'wwp':'http://www.wwp.northeastern.edu/ns/textbase'}
def content(tree):
lines = ''.join(ET.tostring(tree.getroot(),encoding='unicode',method='text')).replace('\n',' ').replace('\t',' ').strip()
clean_lines = re.sub(' +',' ', lines)
return clean_lines.lower()
def quotes(tree):
quotes_list = []
for node in tree.findall('.//wwp:quote', namespaces=ns):
quote = ET.tostring(node,encoding='unicode',method='text')
clean_quote = re.sub(' +',' ', quote)
quotes_list.append(clean_quote)
return ' '.join(str(v) for v in quotes_list).replace('\t','').replace('\n','').lower()
def pid(tree):
for node in tree.findall('.//wwp:sourceDesc//wwp:author/wwp:persName[1]', namespaces=ns):
pid = node.attrib.get('ref')
return pid.replace('personography.xml#','') # will need to replace 'p:'
def trid(tree): # this function will eventually need to call OT (.//wwp:publicationStmt//wwp:idno)
for node in tree.findall('.//wwp:sourceDesc',namespaces=ns):
trid = node.attrib.get('n')
return trid
content_file.write('pid' + '\t' + 'trid' + '\t' +'text' + '\n')
quotes_file.write('pid' + '\t' + 'trid' + '\t' + 'quotes' + '\n')
for file_name in list_of_files:
file = open(file_name, 'rt')
tree = ET.parse(file)
file.close()
pid = pid(tree)
trid = trid(tree)
content = content(tree)
quotes = quotes(tree)
content_file.write(pid + '\t' + trid + '\t' + content + '\n')
quotes_file.write(pid + '\t' + trid + '\t' + quotes + '\n')
content_file.close()
quotes_file.close()

You are overwriting your function calls with the values they returned. changing the function names should fix it.
import xml.etree.ElementTree as ET
import re
import glob
import csv
import sys
content_file = open('WWP Project/WWP_texts.txt','wt')
quotes_file = open('WWP Project/WWP_quotes.txt', 'wt')
list_of_files = glob.glob("../../../Documents/WWPtextbase/distribution/*.xml")
ns = {'wwp':'http://www.wwp.northeastern.edu/ns/textbase'}
def get_content(tree):
lines = ''.join(ET.tostring(tree.getroot(),encoding='unicode',method='text')).replace('\n',' ').replace('\t',' ').strip()
clean_lines = re.sub(' +',' ', lines)
return clean_lines.lower()
def get_quotes(tree):
quotes_list = []
for node in tree.findall('.//wwp:quote', namespaces=ns):
quote = ET.tostring(node,encoding='unicode',method='text')
clean_quote = re.sub(' +',' ', quote)
quotes_list.append(clean_quote)
return ' '.join(str(v) for v in quotes_list).replace('\t','').replace('\n','').lower()
def get_pid(tree):
for node in tree.findall('.//wwp:sourceDesc//wwp:author/wwp:persName[1]', namespaces=ns):
pid = node.attrib.get('ref')
return pid.replace('personography.xml#','') # will need to replace 'p:'
def get_trid(tree): # this function will eventually need to call OT (.//wwp:publicationStmt//wwp:idno)
for node in tree.findall('.//wwp:sourceDesc',namespaces=ns):
trid = node.attrib.get('n')
return trid
content_file.write('pid' + '\t' + 'trid' + '\t' +'text' + '\n')
quotes_file.write('pid' + '\t' + 'trid' + '\t' + 'quotes' + '\n')
for file_name in list_of_files:
file = open(file_name, 'rt')
tree = ET.parse(file)
file.close()
pid = get_pid(tree)
trid = get_trid(tree)
content = get_content(tree)
quotes = get_quotes(tree)
content_file.write(pid + '\t' + trid + '\t' + content + '\n')
quotes_file.write(pid + '\t' + trid + '\t' + quotes + '\n')
content_file.close()
quotes_file.close()

Adding the values of two strings using Python and XML path

It generates an output with wallTime and setupwalltime into a dat file, which has the following format:
24000 4 0
81000 17 0
192000 59 0
648000 250 0
1536000 807 0
3000000 2144 0
6591000 5699 0
I would like to know how to add the two values i.e.(wallTime and setupwalltime) together. Can someone give me a hint? I tried converting to float, but it doesn’t seem to work.
import libxml2
import os.path
from numpy import *
from cfs_utils import *
np=[1,2,3,4,5,6,7,8]
n=[20,30,40,60,80,100,130]
solver=["BiCGSTABL_iluk", "BiCGSTABL_saamg", "BiCGSTABL_ssor" , "CG_iluk", "CG_saamg", "CG_ssor" ]# ,"cholmod", "ilu" ]
file_list=["eval_BiCGSTABL_iluk_default", "eval_BiCGSTABL_saamg_default" , "eval_BiCGSTABL_ssor_default" , "eval_CG_iluk_default","eval_CG_saamg_default", "eval_CG_ssor_default" ] # "simp_cholmod_solver_3D_evaluate", "simp_ilu_solver_3D_evaluate" ]
for cnt_np in np:
i=0
for sol in solver:
#open write_file= "Graphs/" + "Np"+ cnt_np + "/CG_iluk.dat"
#"Graphs/Np1/CG_iluk.dat"
write_file = open("Graphs/"+ "Np"+ str(cnt_np) + "/" + sol + ".dat", "w")
print("Reading " + "Graphs/"+ "Np"+ str(cnt_np) + "/" + sol + ".dat"+ "\n")
#loop through different unknowns
for cnt_n in n:
#open file "cfs_calculations_" + cnt_n +"np"+ cnt_np+ "/" + file_list(i) + "_default.info.xml"
read_file = "cfs_calculations_" +str(cnt_n) +"np"+ str(cnt_np) + "/" + file_list[i] + ".info.xml"
print("File list" + file_list[i] + "vlaue of i " + str(i) + "\n")
print("Reading " + " cfs_calculations_" +str(cnt_n) +"np"+ str(cnt_np) + "/" + file_list[i] + ".info.xml" )
#read wall and cpu time and write
if os.path.exists(read_file):
doc = libxml2.parseFile(read_file)
xml = doc.xpathNewContext()
walltime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/solve/timer/#wall")
setupwalltime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/setup/timer/#wall")
# cputime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/solve/timer/#cpu")
# setupcputime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/solve/timer/#cpu")
unknowns = 3*cnt_n*cnt_n*cnt_n
write_file.write(str(unknowns) + "\t" + walltime + "\t" + setupwalltime + "\n")
print("Writing_point" + str(unknowns) + "%f" ,float(setupwalltime ) )
doc.freeDoc()
xml.xpathFreeContext()
write_file.close()
i=i+1

In java you can add strings and floats. What I understand is that you need to add the values and then display them. That would work (stringing the sum)
write_file.write(str(unknowns) + "\f" + str(float(walltime) + float(setupwalltime)) + "\n")

You are trying to add a str to a float. That doesn't work. If you want to use string concatenation, first coerce all of the values to str. Try this:
write_file.write(str(unknowns) + "\t" + str(float(walltime) + float(setupwalltime)) + "\n")
Or, perhaps more readably:
totalwalltime = float(walltime) + float(setupwalltime)
write_file.write("{}\t{}\n".format(unknowns, totalwalltime))

How to encrypt a .bin file

I need to encrypt 3 .bin files which contain 2 keys for Diffie-Hellman. I have no clue how to do that, all I could think of was what I did in the following Python file. I have an example what the output should look like but my code doesn't seem to produce the right keys. The output file server.ini is used by a client to connect to a server.
import base64
fileList = [['game_key.bin', 'Game'], ['gate_key.bin', 'Gate'], ['auth_key.bin', 'Auth']]
iniList = []
for i in fileList:
file = open(i[0], 'rb')
n = list(file.read(64))
x = list(file.read(64))
file.close()
n.reverse()
x.reverse()
iniList.append(['Server.' + i[1] + '.N "' + base64.b64encode("".join(n)) + '"\n', 'Server.' + i[1] + '.X "' + base64.b64encode("".join(x)) + '"\n'])
iniList[0].append('\n')
#time for user Input
ip = '"' + raw_input('Hostname: ') + '"'
dispName = 'Server.DispName ' + '"' + raw_input('DispName: ') + '"' + '\n'
statusUrl = 'Server.Status ' + '"' + raw_input('Status URL: ') + '"' + '\n'
signupUrl = 'Server.Signup ' + '"' + raw_input('Signup URL: ') + '"' + '\n'
for l in range(1, 3):
iniList[l].append('Server.' + fileList[l][1] + '.Host ' + ip + '\n\n')
for l in [[dispName], [statusUrl], [signupUrl]]:
iniList.append(l)
outFile = open('server.ini', 'w')
for l in iniList:
for i in l:
outFile.write(i)
outFile.close()
The following was in my example file:
# Keys are Base64-encoded 512 bit RC4 keys, as generated by DirtSand's keygen
# command. Note that they MUST be quoted in the commands below, or the client
# won't parse them correctly!
I also tried it without inverting n and x

Trouble with apostrophe in arcpy search cursor where clause

I've put together a tkinter form and python script for downloading files from an ftp site. The filenames are in the attribute table of a shapefile, as well as an overall Name that the filenames correspond too. In other words I look up a Name such as "CABOT" and download the filename 34092_18.tif. However, if a Name has an apostrophe, such as "O'KEAN", it's giving me trouble. I try to replace the apostrophe, like I've done in previous scripts, but it doesn't download anything....
whereExp = quadField + " = " + "'" + quadName.replace("'", '"') + "'"
quadFields = ["FILENAME"]
c = arcpy.da.SearchCursor(collarlessQuad, quadFields, whereExp)
for row in c:
tifFile = row[0]
tifName = quadName.replace("'", '') + '_' + tifFile
#fullUrl = ftpUrl + tifFile
local_filename = os.path.join(downloadDir, tifName)
lf = open(local_filename, "wb")
ftp.retrbinary('RETR ' + tifFile, lf.write)
lf.close()
Here is an example of a portion of a script that works fine by replacing the apostrophe....
where_clause = quadField + " = " + "'" + quad.replace("'", '"') + "'"
#out_quad = quad.replace("'", "") + ".shp"
arcpy.MakeFeatureLayer_management(quadTable, "quadLayer")
select_out_feature_class = arcpy.SelectLayerByAttribute_management("quadLayer", "NEW_SELECTION", where_clause)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing dynamic arabic character xml in python - python

Related

PDF template not merging data properly with pdftk

Why does my Python XML parser break after the first file?

Adding the values of two strings using Python and XML path

How to encrypt a .bin file

Trouble with apostrophe in arcpy search cursor where clause

Categories

Resources