I have a file that contains multiple XML declarations such as:
<?xml version="1.0" encoding="UTF-8"?>
I am currently reading the file as a .txt file and rewriting each line that is not an XML declaration into a new .txt file. As I have many such document files, this method is taking a long time (around 20 minutes per file). I wanted to know if there is an easier way to do this.
I am using Python to do this. The files are sitting on my laptop, and each file is around 11 million lines (about 450 MB).
My code for iterating through the file and removing the declarations is below.
month_file = "2015-01.nml.txt"
delete_lines = [
    '<?xml version="1.0" encoding="ISO-8859-1" ?>',
    '<?xml version="1.0" encoding="iso-8859-1" ?>',
    '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">',
]

with open(month_file, encoding="ISO-8859-1") as in_fh:
    while True:
        line = in_fh.readline()
        if not line: break
        if any(x in line for x in delete_lines):
            continue
        else:
            out_fh = open('myfile_faster.xml', "a")
            out_fh.write(line)
            out_fh.close()
This is essentially the same as your version, but it opens the input and output files just once, uses a single if condition, and writes to the output as it iterates through the input (a bit like sed).
with open(in_file, mode="rt") as f_in, open(out_file, mode="wt") as f_out:
    for line in f_in:
        if (
            not line
            or line.startswith("<?xml")
            or line.startswith("<!DOCTYPE")
        ):
            continue
        f_out.write(line)
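One detail worth checking when adapting this: the question opens its input with encoding="ISO-8859-1", while the snippet above relies on the default encoding. A minimal sketch with explicit encodings on both files, assuming the file names from the question:

in_file = "2015-01.nml.txt"        # input path from the question
out_file = "myfile_faster.xml"     # output path from the question

# Same filtering loop, but with the encoding spelled out on both sides
# so non-ASCII characters round-trip unchanged.
with open(in_file, mode="rt", encoding="ISO-8859-1") as f_in, \
        open(out_file, mode="wt", encoding="ISO-8859-1") as f_out:
    for line in f_in:
        # Drop XML declarations and DOCTYPE lines, copy everything else.
        if line.startswith("<?xml") or line.startswith("<!DOCTYPE"):
            continue
        f_out.write(line)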
Related
I am using Python to work with an XML file. I need to insert one line into the XML file, and my code looks like this:
xobj = ET.parse('/src/xxx.xml')
xroot = xobj.getroot()
filename = ET.Element("filename")
filename.text = xmlname
xroot.insert(0, filename)
tree = ET.ElementTree(xroot)
tree.write('/dst/xxx.xml')
It did insert the content into the original XML file, but not on its own line. My XML file becomes:
<filename>004228.xml</filename><object>
....
</object>
There should be a \n between </filename> and <object>, but this method does not add that line break. How can I make the formatting look nice?
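In xml.etree.ElementTree, the text that follows an element's closing tag is held in its tail attribute, so one way to get the missing line break is to set it explicitly. A small sketch reusing the paths and the xmlname variable from the snippet above:

import xml.etree.ElementTree as ET

xobj = ET.parse('/src/xxx.xml')
xroot = xobj.getroot()

filename = ET.Element("filename")
filename.text = xmlname      # xmlname as defined elsewhere in the original code
filename.tail = "\n"         # the text written after </filename>, i.e. the missing line break
xroot.insert(0, filename)

ET.ElementTree(xroot).write('/dst/xxx.xml')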
I have an XML file that looks like this:
<?xml version="1.0"?>
<root>
<child>
<add key="setid" value=".\print\data1" />
<add key="getid" value=".\print\data2" />
<add key="holdingid" value=".\print\data3" />
</child>
</root>
I want to read a line in the XML, search for a key match, and replace the value in that line with .\donotprint\data1.
I can do this in NANT using XMLPOKE and xpath.
I have tried this with a dict, a list (replacing by positionID), and split, but the code is not good enough to show here.
How can I solve this?
Assuming your XML code is in test.xml in the working directory.
new_content = []
with open("test.xml") as f:
    content = f.readlines()

for line in content:
    newline = line
    if 'key="setid"' in line:
        newline = line[:line.find("value=")] + 'value=".\\donotprint\\data1" />\n'
    new_content += [newline]

with open('test.xml', 'w') as f:
    for line in new_content:
        f.write(line)
It's not the nicest way, but it works for the example you provided, and it should also be extendable since it iterates over the file and looks at each line. Further improvements could probably be made using regex, etc.
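If you would rather treat the file as XML than as plain text, a rough equivalent with xml.etree.ElementTree might look like the sketch below; note that rewriting the tree will not preserve the original formatting or comments exactly.

import xml.etree.ElementTree as ET

tree = ET.parse("test.xml")
for add in tree.getroot().iter("add"):           # every <add .../> element
    if add.get("key") == "setid":
        add.set("value", r".\donotprint\data1")  # raw string keeps the backslashes literal
# xml_declaration=True keeps the <?xml ...?> line from the original file
tree.write("test.xml", xml_declaration=True, encoding="UTF-8")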
So I have patent data in an XML file that I wish to store in a CSV file. I've been able to run my code through each iteration of the invention name, date, country, and patent number, but when I try to write the results into a CSV file, something goes wrong.
The XML data looks like this (for one section of many):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0584026-20090106.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20081222" date-publ="20090106">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0584026</doc-number>
<kind>S1</kind>
<date>20090106</date>
</document-id>
</publication-reference>
My code for running through and writing these lines one-by-one is:
for xml_string in separated_xml(infile):  # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml")  # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference")  # Beginning parsing at every instance of a publication
    lst = []  # Creating empty list to append into

    for info in pub_ref:  # Looping over all instances of publication
        # The final loop finds every instance of invention name, patent number, date, and country to print and append into
        with open('./output.csv', 'wb') as f:
            writer = csv.writer(f, dialect = 'excel')
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                #print(inv_name.text, pat_num.text, date_num.text, country.text)
                #lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
And lastly, the output in my .csv file is this:
"Content addressable information encapsulation, representation, and transfer",07475432,20090106,US
I'm unsure where the issue lies, and I know I'm still quite a newbie at Python, but can anyone find the problem?
You open the file in overwrite mode ('wb') inside a loop, so on each iteration you erase whatever was previously written. The correct way is to open the file outside the loop:
...
with open('./output.csv', 'wb') as f:
    writer = csv.writer(f, dialect = 'excel')
    for info in pub_ref:  # Looping over all instances of publication
        # The final loop finds every instance of invention name, patent number, date, and country to print and append into
        for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
            ...
The problem lies in this line: with open('./output.csv', 'wb') as f:
If you want to write all rows into a single file, use mode 'a'. Using 'wb' overwrites the file on every iteration, so you only get the last row.
Read more about the file mode here: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
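One extra note, as an assumption about the environment: the 'wb' mode in these snippets only works with the csv module on Python 2. On Python 3, csv.writer expects a text-mode file, so the usual form is open(..., 'w', newline=''), roughly like this:

import csv

# Python 3 variant of the corrected snippet: text mode plus newline=""
# so the csv module controls the line endings itself.
with open('./output.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(['invention-title', 'doc-number', 'date', 'country'])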
I need to generate XML files populated with data from a CSV file in Python.
I have two input files:
one CSV file named data.csv containing data like this:
ID YEAR PASS LOGIN HEX_LOGIN
14Z 2013 (3e?k<.P#H}l hex0914Z F303935303031345A
14Z 2014 EAeW+ZM..--r hex0914Z F303935303031345A
.......
one template file named template.xml:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year></year>
<security>
<ID></ID>
<login></login>
<hex_login></hex_login>
<pass></pass>
</security>
</SecurityProfile>
I want to get as many output files as there are lines in the CSV data file, each output file named YEAR_ID, with the data from the CSV file in the XML fields:
Output file contents:
Content of output file #1 named 2013_0950014z:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>2013</year>
<security>
<ID>14Z</ID>
<login>hex0914</login>
<hex_login>F303935303031345A</hex_login>
<pass>(3e?k<.P#H}l</pass>
</security>
</SecurityProfile>
Content of output file #2 named 2014_0950014z:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>2014</year>
<security>
<ID>14Z</ID>
<login>hex0914</login>
<hex_login>F303935303031345A</hex_login>
<pass>EAeW+ZM..--r</pass>
</security>
</SecurityProfile>
Thank you for your suggestions.
Can you make changes to the template? If so, I would do the following to make this a bit simpler:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>{year}</year>
<security>
<ID>{id}</ID>
<login>{login}</login>
<hex_login>{hex_login}</hex_login>
<pass>{pass}</pass>
</security>
</SecurityProfile>
Then, something like this would work:
import csv

input_file_name = "some_file.csv"  # name/path of your csv file
template_file_name = "some_file.xml"  # name/path of your xml template
output_file_name = "{}_09500{}.xml"

with open(template_file_name, "r") as template_file:
    template = template_file.read()

with open(input_file_name, "r") as csv_file:
    my_reader = csv.DictReader(csv_file)
    for row in my_reader:
        with open(output_file_name.format(row["YEAR"], row["ID"]), "w") as current_out:
            # "pass" is a Python keyword, so it cannot be used as a normal
            # keyword argument; unpack it from a dict instead.
            current_out.write(template.format(year=row["YEAR"],
                                              id=row["ID"],
                                              login=row["LOGIN"],
                                              hex_login=row["HEX_LOGIN"],
                                              **{"pass": row["PASS"]}))
If you can't modify the template, or want to process it as XML instead of basic string manipulation, then it's a bit more involved.
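For completeness, here is a rough sketch of that more involved route, filling the untouched template with xml.etree.ElementTree. The tab delimiter and the 09500 filename prefix are assumptions carried over from the question and the snippet above:

import csv
import xml.etree.ElementTree as ET

NS = {"sp": "security_profile_v1"}
ET.register_namespace("", "security_profile_v1")   # keep the default namespace on output

with open("data.csv", newline="") as csv_file:
    for row in csv.DictReader(csv_file, delimiter="\t"):
        tree = ET.parse("template.xml")             # re-parse the untouched template each time
        root = tree.getroot()
        root.find("sp:year", NS).text = row["YEAR"]
        security = root.find("sp:security", NS)
        security.find("sp:ID", NS).text = row["ID"]
        security.find("sp:login", NS).text = row["LOGIN"]
        security.find("sp:hex_login", NS).text = row["HEX_LOGIN"]
        security.find("sp:pass", NS).text = row["PASS"]
        out_name = "{}_09500{}.xml".format(row["YEAR"], row["ID"])
        tree.write(out_name, xml_declaration=True, encoding="UTF-8")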
EDIT:
Modified answer to use csv.DictReader rather than csv.reader.
Fixed variable names when opening the input CSV file and writing the output. Removed 'binary' mode file operations.
import csv
from collections import defaultdict

header = '<?xml version="1.0"?><SecurityProfile xmlns="security_profile_v1">\n'
footer = '\n</SecurityProfile>'
entry = '''<security>
<ID>{0[ID]}</ID>
<login>{0[LOGIN]}</login>
<hex_login>{0[HEX_LOGIN]}</hex_login>
<pass>{0[PASS]}</pass>
</security>'''

rows = defaultdict(list)
with open('infile.csv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for item in reader:
        rows[item['YEAR']].append(item)  # group records by year (was: rows[reader['YEAR']], which fails)

for year, data in rows.items():
    with open('{}.xml'.format(year), 'w') as f:
        f.write(header)
        f.write('<year>{}</year>\n'.format(year))
        for record in data:
            f.write(entry.format(record))
            f.write('\n')
        f.write(footer)
I am generating XML files using xml.dom.minidom. Every time I generate a file, the very first row contains <?xml version="1.0" ?>, and the generated file looks like this:
<?xml version="1.0" ?>
<Root>
data
</Root>
Is there any way to get output without it? My output should look like this:
<Root>
data
</Root>
The best solution I found was to write out .childNodes[0], i.e. write out:
doc.childNodes[0].toprettyxml()
to the file, which will omit the XML declaration.
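For context, a small self-contained example of that approach; the document here is just an illustrative stand-in:

from xml.dom import minidom

doc = minidom.parseString("<Root>data</Root>")
with open("out.xml", "w") as f:
    # Serialising only the root element skips the <?xml ...?> declaration.
    f.write(doc.childNodes[0].toprettyxml())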
If you are happy just to trim the first line from the file, use this code:
f = open('file.txt', 'r')
lines = f.readlines()
f.close()

f = open('file.txt', 'w')
f.write(''.join(lines[1:]))  # readlines() keeps the trailing newlines, so join with ''
f.close()
This does the job, where old_data is the XML string to strip:
new_data = old_data[old_data.find("?>")+2:]
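For illustration, a quick self-contained check of what that slice does (the document here is just a stand-in):

from xml.dom import minidom

doc = minidom.parseString("<Root>data</Root>")
old_data = doc.toxml()                          # '<?xml version="1.0" ?><Root>data</Root>'
new_data = old_data[old_data.find("?>") + 2:]   # everything after the declaration
print(new_data)                                 # <Root>data</Root>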