I have this script:
SELECT = """
select
    coalesce(p.ID, '') as id,
    coalesce(p.name, '') as name
from TABLE as p
"""
self.cur.execute(SELECT)
for row in self.cur.itermap():
    id = '%(id)s' % row
    name = '%(name)s' % row
    xml += "  <item>\n"
    xml += "    <id>" + id + "</id>\n"
    xml += "    <name>" + name + "</name>\n"
    xml += "  </item>\n\n"
# save xml to file here
f = open...
and I need to save data from a huge database to a file. There are tens of thousands of items (up to 40,000) in my database, and the script takes a very long time to finish (an hour or more).
How can I take the data I need from the database and save it to a file "at once", as quickly as possible? (I don't need XML output, because I can process the data from the output on my server later. I just need it done as quickly as possible. Any ideas?)
Many thanks!
P.S.
I found out this interesting thing: when I use the following code to "erase" the xml variable every 2000 records and save it into another variable, it runs pretty fast! So there must be something "wrong" with the way my original code keeps filling the xml variable.
result = float(id) / 2000
if result == int(result):
    xml_whole += xml
    xml = ""
Wow, after testing with the code above, my script is up to 50x faster!
I would like to know: why is Python so slow with xml += ...?
You're doing a lot of unnecessary work (and note that once you "erase" the xml variable, you're no longer writing the same data as before...). The reason xml += ... is so slow is that Python strings are immutable: every += copies the entire string built so far into a new one, so building a big document this way takes time quadratic in its length. Your chunking trick keeps each individual string small, which is why it helps so much.
Why don't you just write the XML as it goes? You could also avoid the two COALESCEs and do that check in Python instead (if ID is null, then make id '', etc.).
SELECT = """
select
    coalesce(p.ID, '') as id,
    coalesce(p.name, '') as name
from TABLE as p
"""
self.cur.execute(SELECT)
# Open XML file
f = open("file.xml", "w")
f.write('<?xml version="1.0" encoding="utf-8"?>\n')  # (what encoding do you need?)
for row in self.cur.itermap():
    f.write("<item>\n  <id>%(id)s</id>\n  <name>%(name)s</name>\n</item>\n" % row)
    # Other f.write() calls if necessary
f.close()
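If you do want the whole document in memory anyway, the usual cure for slow repeated concatenation is to collect the pieces in a list and join them once at the end. A minimal sketch of that idiom, reusing the cursor loop from above:
parts = []
for row in self.cur.itermap():
    parts.append("  <item>\n")
    parts.append("    <id>%(id)s</id>\n" % row)
    parts.append("    <name>%(name)s</name>\n" % row)
    parts.append("  </item>\n\n")
xml = "".join(parts)  # one linear-time join instead of a quadratic series of copies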
I have a large .txt file that is the result of parsing a C file; it contains various blocks of data, but about 90% of them are useless to me. I'm trying to get rid of them and then save the result to another file, but I'm having a hard time doing so. At first I tried to delete all the useless information in the unparsed file, but then it won't parse. My .txt file is built like this:
//Update: The files I'm trying to work on come from the pycparser module, which I found on GitHub.
File before being parsed looks like this:
And after using pycparser
file_to_parse = pycparser.parse_file(current_directory + r"\D_Out_Clean\file.d_prec")
I want to delete all blocks that start with the word Typedef. The module stores these in one big list that I can access via its ext attribute.
Currently my code looks like this:
len_of_ext_list = len(file_to_parse.ext)
i = 0
while i < len_of_ext_list:
    if 'TypeDecl' not in file_to_parse.ext[i]:
        print("NOT A TYPEDECL")
        print(file_to_parse.ext[i], type(file_to_parse.ext[i]))
        parsed_file_2 = open(current_directory + r"\Zadanie\D_Out_Clean_Parsed\clean_file.d_prec", "w+")
        parsed_file_2.write("%s%s\n" % ("", file_to_parse.ext[i]))
        parsed_file_2.close
        #file_to_parse_2 = file_to_parse.ext[i]
    i += 1
But the above code only saves the last FuncDef from the unparsed file, and I don't know how to change that.
So, now I'm trying to get rid of all typedefs in the parsed file, as they don't hold any valuable information for me. I want to know what function definitions and declarations are in the file, and what types of global variables are stored in the parsed file. I hope this is clearer now.
I suggest reading the entire input file into a string, and then doing a regex replacement:
import re

with open(current_directory + r"\D_Out\file.txt", "r+") as file:
    with open(current_directory + r"\D_Out_Clean\clean_file.txt", "w+") as output:
        data = file.read()
        data = re.sub(r'type(?:\n\{.*?\}|[^;]*?;)\n?', '', data, flags=re.S)
        output.write(data)
Here is a regex demo showing that the replacement logic is working.
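Alternatively, since pycparser has already handed you the top-level nodes in file_to_parse.ext, you can filter on each node's class name and open the output file once, outside the loop (opening it with "w+" on every iteration is what left you with only the last FuncDef). This is just a sketch, assuming pycparser's documented FileAST.ext list and its CGenerator for turning nodes back into C source:
from pycparser import c_generator

generator = c_generator.CGenerator()
out_path = current_directory + r"\Zadanie\D_Out_Clean_Parsed\clean_file.d_prec"
with open(out_path, "w") as out:                     # opened once, not per node
    for node in file_to_parse.ext:                   # top-level declarations
        if type(node).__name__ != 'Typedef':         # skip typedef blocks
            out.write(generator.visit(node) + '\n')  # emit the node as C source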
I've been using Python to implement a custom parser and to use the parsed data to format a Word document for internal distribution. All of the formatting has been straightforward and easy so far, but I'm completely stumped on how to insert a checkbox into individual table cells.
I've tried using the Python object functions within python-docx (using get_or_add_tcPr(), etc.), which causes MS Word to throw the following error when I try to open the file: "The file xxxx cannot be opened because there are problems with the contents. Details: The file is corrupt and cannot be opened."
After struggling with this for a while, I moved to a second approach: manipulating the word/document.xml file of the output doc. I've retrieved what I believe to be the correct XML for a checkbox, saved as replacementXML, and have inserted filler text into the cells to act as a tag that can be searched and replaced, searchXML. The following runs under Python on Linux (Fedora 25), but the Word document shows the same errors when I try to open it; this time, however, the document is recoverable and reverts back to the filler text. I've been able to get this to work with a manually made document and an empty table cell, so I believe this should be possible. NOTE: I've included the whole XML element for the table cell in the searchXML variable, but I've also tried regular expressions and shortened strings rather than an exact match, since the XML can differ from cell to cell.
searchXML = r'<w:tc><w:tcPr><w:tcW w:type="dxa" w:w="4320"/><w:gridSpan w:val="2"/></w:tcPr><w:p><w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:t>IN_CHECKB</w:t></w:r></w:p></w:tc>'
def addCheckboxes():
    os.system("mkdir unzipped")
    os.system("unzip tempdoc.docx -d unzipped/")
    with open('unzipped/word/document.xml', encoding="ISO-8859-1") as file:
        filedata = file.read()
    rep_count = 0
    while re.search(searchXML, filedata):
        filedata = replaceXML(filedata, rep_count)
        rep_count += 1
    with open('unzipped/word/document.xml', 'w') as file:
        file.write(filedata)
    os.system("zip -r ../buildcfg/tempdoc.docx unzipped/*")
    os.system("rm -rf unzipped")
def replaceXML(filedata, rep_count):
    # rep_count is an int, so it has to go through str() before concatenation
    replacementXML = (
        r'<w:tc><w:tcPr><w:tcW w:w="4320" w:type="dxa"/><w:gridSpan w:val="2"/></w:tcPr><w:p w:rsidR="00D2569D" w:rsidRDefault="00FD6FDF"><w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:rPr><w:sz w:val="16"/>'
        r'</w:rPr><w:fldChar w:fldCharType="begin"><w:ffData><w:name w:val="Check1"/><w:enabled/><w:calcOnExit w:val="0"/><w:checkBox><w:sizeAuto/><w:default w:val="0"/></w:checkBox></w:ffData></w:fldChar>'
        r'</w:r><w:bookmarkStart w:id="' + str(rep_count) + r'" w:name="Check' + str(rep_count) + r'"/><w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:instrText xml:space="preserve"> FORMCHECKBOX </w:instrText></w:r><w:r>'
        r'<w:rPr><w:sz w:val="16"/></w:rPr></w:r><w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:fldChar w:fldCharType="end"/></w:r><w:bookmarkEnd w:id="' + str(rep_count) + r'"/></w:p></w:tc>')
    filedata = re.sub(searchXML, replacementXML, filedata, 1)
    return filedata
I have a strong feeling that there is a much simpler (and correct!) way of doing this through the python-docx library but for some reason I can't seem to get it right.
Is there a way to easily insert checkbox fields into a table cell in an MS Word doc? And if yes, how would I do that? If no, is there a better approach than manipulating the .xml file?
UPDATE: I've been able to inject XML into the document successfully using python-docx, but the checkbox and added XML are not appearing.
I've added the following XML into a table cell:
<w:tc>
<w:tcPr>
<w:tcW w:type="dxa" w:w="4320"/>
<w:gridSpan w:val="2"/>
</w:tcPr>
<w:p>
<w:r>
<w:bookmarkStart w:id="0" w:name="testName">
<w:complexType w:name="CT_FFCheckBox">
<w:sequence>
<w:choice>
<w:element w:name="size" w:type="CT_HpsMeasure"/>
<w:element w:name="sizeAuto" w:type="CT_OnOff"/>
</w:choice>
<w:element w:name="default" w:type="CT_OnOff" w:minOccurs="0"/>
<w:element w:name="checked" w:type="CT_OnOff" w:minOccurs="0"/>
</w:sequence>
</w:complexType>
</w:bookmarkStart>
<w:bookmarkEnd w:id="0" w:name="testName"/>
</w:r>
</w:p>
</w:tc>
by using the following python-docx code:
run = p.add_run()
tag = run._r
start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
start.set(docx.oxml.ns.qn('w:id'), '0')
start.set(docx.oxml.ns.qn('w:name'), n)
tag.append(start)
ctype = docx.oxml.OxmlElement('w:complexType')
ctype.set(docx.oxml.ns.qn('w:name'), 'CT_FFCheckBox')
seq = docx.oxml.OxmlElement('w:sequence')
choice = docx.oxml.OxmlElement('w:choice')
el = docx.oxml.OxmlElement('w:element')
el.set(docx.oxml.ns.qn('w:name'), 'size')
el.set(docx.oxml.ns.qn('w:type'), 'CT_HpsMeasure')
el2 = docx.oxml.OxmlElement('w:element')
el2.set(docx.oxml.ns.qn('w:name'), 'sizeAuto')
el2.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')
choice.append(el)
choice.append(el2)
el3 = docx.oxml.OxmlElement('w:element')
el3.set(docx.oxml.ns.qn('w:name'), 'default')
el3.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')
el3.set(docx.oxml.ns.qn('w:minOccurs'), '0')
el4 = docx.oxml.OxmlElement('w:element')
el4.set(docx.oxml.ns.qn('w:name'), 'checked')
el4.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')
el4.set(docx.oxml.ns.qn('w:minOccurs'), '0')
seq.append(choice)
seq.append(el3)
seq.append(el4)
ctype.append(seq)
start.append(ctype)
end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
end.set(docx.oxml.ns.qn('w:id'), '0')
end.set(docx.oxml.ns.qn('w:name'), n)
tag.append(end)
I can't seem to find a reason for the XML not being reflected in the output document, but I will update this with whatever I find.
I've finally been able to accomplish this after lots of digging and help from @scanny.
Checkboxes can be inserted into any paragraph in python-docx using the following function. I am inserting a checkbox into specific cells in a table.
def addCheckbox(para, box_id, name, checked):
    run = para.add_run()
    tag = run._r
    fldchar = docx.oxml.shared.OxmlElement('w:fldChar')
    fldchar.set(docx.oxml.ns.qn('w:fldCharType'), 'begin')
    ffdata = docx.oxml.shared.OxmlElement('w:ffData')
    # a separate variable for the <w:name> element, so it doesn't shadow the name argument
    name_el = docx.oxml.shared.OxmlElement('w:name')
    name_el.set(docx.oxml.ns.qn('w:val'), name)
    enabled = docx.oxml.shared.OxmlElement('w:enabled')
    calconexit = docx.oxml.shared.OxmlElement('w:calcOnExit')
    calconexit.set(docx.oxml.ns.qn('w:val'), '0')
    checkbox = docx.oxml.shared.OxmlElement('w:checkBox')
    sizeauto = docx.oxml.shared.OxmlElement('w:sizeAuto')
    default = docx.oxml.shared.OxmlElement('w:default')
    if checked:
        default.set(docx.oxml.ns.qn('w:val'), '1')
    else:
        default.set(docx.oxml.ns.qn('w:val'), '0')
    checkbox.append(sizeauto)
    checkbox.append(default)
    ffdata.append(name_el)
    ffdata.append(enabled)
    ffdata.append(calconexit)
    ffdata.append(checkbox)
    fldchar.append(ffdata)
    tag.append(fldchar)
    run2 = para.add_run()
    tag2 = run2._r
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), str(box_id))
    start.set(docx.oxml.ns.qn('w:name'), name)
    tag2.append(start)
    run3 = para.add_run()
    tag3 = run3._r
    instr = docx.oxml.shared.OxmlElement('w:instrText')
    instr.text = 'FORMCHECKBOX'
    tag3.append(instr)
    run4 = para.add_run()
    tag4 = run4._r
    fld2 = docx.oxml.shared.OxmlElement('w:fldChar')
    fld2.set(docx.oxml.ns.qn('w:fldCharType'), 'end')
    tag4.append(fld2)
    run5 = para.add_run()
    tag5 = run5._r
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), str(box_id))
    end.set(docx.oxml.ns.qn('w:name'), name)
    tag5.append(end)
    return
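For instance (a hypothetical usage, not from the original post), you could drop a checkbox into the first paragraph of a table cell like this:
import docx

doc = docx.Document()
table = doc.add_table(rows=1, cols=2)
cell_para = table.cell(0, 0).paragraphs[0]  # every new cell starts with one empty paragraph
addCheckbox(cell_para, box_id=1, name='Check1', checked=False)
doc.save('checkbox_demo.docx')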
The instrText value ('FORMCHECKBOX') looks arbitrary, but it was taken from the XML a Word document generated for an existing checkbox, and the function fails without it. I haven't confirmed this, but I've heard of one scenario where a developer arbitrarily changed the string and, once the document was saved, it reverted back to the original generated value.
The key thing with these workaround functions is to have an example of XML that works, and to be able to compare the XML you generate. If you generate XML that matches the working example, it will work every time. opc-diag is handy for inspecting the XML in a Word document. Working with really small documents (like single paragraph or two-row table, for analysis purposes) makes it a lot easier to work out how Word is structuring the XML.
An important thing to note is that the XML elements in a Word document are sequence sensitive, meaning the child elements within any other element generally have a set order in which they must appear. If you get this swapped around, you get the "repair" error you mentioned.
I find it much easier to manipulate the XML from within python-docx, as it takes care of all the unzipping and rezipping for you, along with a lot of the other details.
To get the sequencing right, you'll need to be familiar with the XML Schema specifications for the elements you're working with. There is an example here:
http://python-docx.readthedocs.io/en/latest/dev/analysis/features/text/paragraph-format.html
The full schema is in the code tree under ref/xsd/. Most of the elements for text are in the wml.xsd file (wml stands for WordProcessing Markup Language).
You can find examples of other so-called "workaround functions" by searching on "python-docx" workaround function. Pay particular attention to the parse_xml() function and the OxmlElement objects which will allow you to create new XML subtrees and individual elements respectively. XML elements can be positioned using regular lxml._Element methods; all XML elements in python-docx are based on lxml. http://lxml.de/api/lxml.etree._Element-class.html
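For example, here is a minimal sketch of parse_xml() in action (the paragraph content is made up; nsdecls supplies the namespace declarations the parser requires):
from docx.oxml import parse_xml
from docx.oxml.ns import nsdecls

# build a <w:p> subtree from a string; nsdecls('w') expands to the xmlns:w="..." declaration
p_element = parse_xml('<w:p %s><w:r><w:t>checked?</w:t></w:r></w:p>' % nsdecls('w'))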
OK, so I've been playing with Python and SPSS to achieve almost what I want. I am able to open the file and make the changes; however, I am having trouble saving the files (and those changes). What I have (using only one school in the schoollist):
begin program.
import spss, spssaux
import os

schoollist = ['brow']
for x in schoollist:
    school = 'brow'
    school2 = school + '06.sav'
    filename = os.path.join("Y:\...\Data", school2) #In this instance, Y:\...\Data\brow06.sav
    spssaux.OpenDataFile(filename)

    #--This block makes the changes and is not particularly relevant to the question--#
    cur = spss.Cursor(accessType='w')
    cur.SetVarNameAndType(['name'], [8])
    cur.CommitDictionary()
    for i in range(cur.GetCaseCount()):
        cur.fetchone()
        cur.SetValueChar('name', school)
        cur.CommitCase()
    cur.close()

    #-- What am I doing wrong here? --#
    spss.Submit("save outfile = filename.")
end program.
Any suggestions on how to get the save outfile to work with the loop? Thanks. Cheers
In your save call, you are not resolving filename to its actual value. It should be something like this:
spss.Submit("""save outfile="%s".""" % filename)
I'm unfamiliar with spssaux.OpenDataFile and can't find any documentation on it (besides references to working with SPSS data files in Unicode mode). But my guess at the problem is that it grabs the SPSS data file for use in the Python program block, without actually opening it for further submitted commands.
Here I make a test case that, instead of using spssaux.OpenDataFile to grab the file, does it all with SPSS commands and just inserts the necessary parts via Python. So first let's create some fake data to work with.
*Prepping the example data files.
FILE HANDLE save /NAME = 'C:\Users\andrew.wheeler\Desktop\TestPython'.
DATA LIST FREE / A .
BEGIN DATA
1
2
3
END DATA.
SAVE OUTFILE = "save\Test1.sav".
SAVE OUTFILE = "save\Test2.sav".
SAVE OUTFILE = "save\Test3.sav".
DATASET CLOSE ALL.
Now here is a pared-down version of what your code is doing. I have inserted the LIST ALL. command so you can check in the output that the variable of interest is being added to each file.
*Sequentially open the data files and append the school name.
BEGIN PROGRAM.
import spss
import os

schoollist = ['1', '2', '3']
for x in schoollist:
    school2 = 'Test' + x + '.sav'
    filename = os.path.join("C:\\Users\\andrew.wheeler\\Desktop\\TestPython", school2)
    #opens the SPSS file and makes a new variable for the school name
    spss.Submit("""
GET FILE = "%s".
STRING Name (A20).
COMPUTE Name = "%s".
LIST ALL.
SAVE OUTFILE = "%s".
""" % (filename, x, filename))
END PROGRAM.
I'm using minidom (among others) in Python to pull a list of files from a directory, get their modified times, other misc. data and then write that data to an XML file. The data prints just fine, but when I try to write the data to a file, I only get the XML for one of the files in the directory. Here is my code (I've removed a good amount of createElement and appendChild methods as well as any non-relevant variables for the sake of readability/space):
for filename in os.listdir(os.path.join('\\\\10.10.10.80\Jobs\success')):
    doc = Document()
    modTime = datetime.datetime.fromtimestamp(os.path.getmtime('\\\\10.10.10.80\Jobs\success\\' + filename)).strftime('%I:%M:%S %p')
    done = doc.createElement('Printed Orders')
    doc.appendChild(done)
    ordernum = doc.createElement(filename)
    done.appendChild(ordernum)
    #This is where other child elements have been removed
    print doc.toprettyxml(indent='  ')
    xmlData = open(day_path, 'w')
    xmlData.write(doc.toprettyxml(indent='  '))
Hopefully this is enough to see what's going on. Since print shows the values I'm expecting, I think the write step is where I'm going wrong.
If I understood your intent, you mustn't create a different document for each file, so you have to move the creation of the document and the writing of the XML file outside the loop:
from xml.dom.minidom import Document
import os, datetime

path = "/tmp/"
day_path = "today.xml"

doc = Document()
done = doc.createElement('Printed Orders')
for filename in os.listdir(os.path.join(path)):
    print "here"
    modTime = datetime.datetime.fromtimestamp(os.path.getmtime(path + filename)).strftime('%I:%M:%S %p')
    doc.appendChild(done)
    ordernum = doc.createElement(filename)
    done.appendChild(ordernum)
    #This is where other child elements have been removed
    print doc.toprettyxml(indent='  ')

xmlData = open(day_path, 'w')
xmlData.write(doc.toprettyxml(indent='  '))
EDIT:
For the HierarchyRequestErr error, you also have to move the doc.appendChild(done) call that attaches the root element outside the loop.
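Putting both fixes together, a minimal sketch of the corrected version (same paths and names as above; note that XML element names cannot contain spaces, hence the underscore):
from xml.dom.minidom import Document
import os, datetime

path = "/tmp/"
day_path = "today.xml"

doc = Document()
done = doc.createElement('Printed_Orders')  # created and attached exactly once
doc.appendChild(done)

for filename in os.listdir(path):
    ordernum = doc.createElement(filename)
    done.appendChild(ordernum)

with open(day_path, 'w') as xmlData:  # opened once, after the loop
    xmlData.write(doc.toprettyxml(indent='  '))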
I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.
Are there any tricks to search the entire file? Should you read the whole thing into a dictionary or should you perform a search every time? I tried loading it into a dictionary but that took ages so I'm currently searching through the whole file every time which seems wasteful.
Could I possibly utilize that the list is alphabetically ordered? (e.g. if the search word starts with "b" I only search from the line that includes the first word beginning with "b" to the line that includes the last word beginning with "b")
I'm using import csv.
(A side question: is it possible to make csv go to a specific line in the file? I want to make the program start at a random line.)
Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?
If the csv file isn't changing, load in it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that though.
Here is a rough example of inserting from a csv into a sqlite table. Example csv is ';' delimited, and has 2 columns.
import csv
import sqlite3
con = sqlite3.Connection('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')
f = open('stuff.csv')
csv_reader = csv.reader(f, delimiter=';')
cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)
cur.close()
con.commit()
con.close()
f.close()
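Once the data is loaded, a lookup is a single query, and you can index the search column to keep it fast. Here is a sketch (table and column names from the example above; the search term is made up). And since you already have an .sql dump, you could load it with con.executescript(open('thefile.sql').read()) instead of re-importing the CSV.
import sqlite3

con = sqlite3.connect('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE INDEX IF NOT EXISTS idx_one ON stuff (one);')  # speeds up lookups on "one"
cur.execute('SELECT one, two FROM stuff WHERE one = ?', ('bacon',))
print(cur.fetchall())
con.close()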
you can use memory mapping for really big files
import mmap, os, re

reportFile = open("big_file")
length = os.fstat(reportFile.fileno()).st_size
try:
    # Unix
    mapping = mmap.mmap(reportFile.fileno(), length, mmap.MAP_PRIVATE, mmap.PROT_READ)
except AttributeError:
    # Windows
    mapping = mmap.mmap(reportFile.fileno(), 0, None, mmap.ACCESS_READ)
data = mapping.read(length)
pat = re.compile("b.+", re.M | re.DOTALL)  # compile your pattern here.
print pat.findall(data)
Well, if your words aren't too big (meaning they'll fit in memory), then here is a simple way to do this (I'm assuming that they are all words).
from bisect import bisect_left

f = open('myfile.csv')
words = []
for line in f:
    words.extend(line.strip().split(','))

wordtofind = 'bacon'
ind = bisect_left(words, wordtofind)
if ind < len(words) and words[ind] == wordtofind:  # bounds check avoids an IndexError
    print '%s was found!' % wordtofind
It might take a minute to load all of the values from the file. Then binary search finds your word; this relies on the list being sorted, which your alphabetically ordered file gives you. In this case I was looking for bacon (who wouldn't look for bacon?). If there are repeated values, you might instead want bisect_right, which finds the index one beyond the rightmost element equal to the value you are searching for. You can still use this if you have key:value pairs; you'll just have to make each object in your words list be a list of [key, value].
Side Note
I don't think that you can really go from line to line in a csv file very easily. You see, these files are basically just long strings with \n characters that indicate new lines.
You can't go directly to a specific line in the file because lines are variable-length, so the only way to know when line #n starts is to search for the first n newlines. And it's not enough to just look for '\n' characters because CSV allows newlines in table cells, so you really do have to parse the file anyway.
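That said, for the side question about starting at a random line, a common approximation (with the caveat just mentioned: quoted fields containing newlines will confuse it) is to seek to a random byte offset and throw away the partial line. A sketch, assuming the file name from the example above:
import csv, os, random

f = open('myfile.csv')
f.seek(random.randrange(os.path.getsize('myfile.csv')))  # jump to a random byte offset
f.readline()                 # discard the (probably partial) current line
reader = csv.reader(f)       # continue parsing from the next full line
row = next(reader)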
My idea is to use the Python ZODB module to store the dictionary-type data, then create the new csv file from that data structure and do all your operations on it at that point.
There is a fairly simple way to do this. Depending on how many columns you want Python to print, you may need to add or remove some of the print lines.
import csv

search = input('Enter string to search: ')
stock = open('FileName.csv')       # open for reading (not 'wb')
reader = csv.reader(stock)         # pass the file object to the reader
for row in reader:
    for field in row:
        if field == search:        # compare against the entered search string
            print('Record found! \n')
            print(row[0])
            print(row[1])
            print(row[2])
I hope this manages to help.