How to generate sequential numbers in xml? (Python 2.7) - python

I have a Python script (2.7) that is generating xml data. However, for each object in my array I need it to generate a new sequential number beginning with 1.
Example below (see sequenceOrder tag):
<objectArray>
<object>
<title>Title</title>
<sequenceOrder>1</sequenceOrder>
</object>
<object>
<title>Title</title>
<sequenceOrder>2</sequenceOrder>
</object>
<object>
<title>Title</title>
<sequenceOrder>3</sequenceOrder>
</object>
</objectArray>
With Python 2.7, how can I have my script generate a new number (+1 of the number preceding it) for the sequenceOrder part of each object in my xml array?
Please note that there will be hundreds of thousands of objects in my array, so something to consider.
I'm totally new to Python/coding in general, so any help is appreciated! Glad to provide additional information as necessary.

If you are creating objects to be serialized yourself, you may use itertools.count to get consecutive unique integers.
In a very abstract way it'll look like:
import itertools
# count(1) starts at 1; a bare count() would start at 0
counter = itertools.count(1)
o = create_object()
o.sequentialNumber = next(counter)
o2 = create_another_object()
o2.sequentialNumber = next(counter)
create_xml_doc(my_objects)
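If the objects already sit in a list, `enumerate` with `start=1` gives 1-based numbering without a separate counter. The following is only a minimal sketch using the standard-library `xml.etree.ElementTree`; the `titles` list and the `build_object_array` name are hypothetical stand-ins for your own data and code.

```python
import xml.etree.ElementTree as ET

def build_object_array(titles):
    """Return an <objectArray> element with 1-based <sequenceOrder> values."""
    root = ET.Element('objectArray')
    for number, title in enumerate(titles, start=1):
        obj = ET.SubElement(root, 'object')
        ET.SubElement(obj, 'title').text = title
        ET.SubElement(obj, 'sequenceOrder').text = str(number)
    return root

# Hypothetical input: three objects that all carry the same title.
root = build_object_array(['Title', 'Title', 'Title'])
orders = [element.text for element in root.iter('sequenceOrder')]
xml_bytes = ET.tostring(root)
```

`ET.tostring(root)` returns the serialized document, which can then be written to a file in one call.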

Yes, you can generate a new sequence number for each XML element. Here is how to produce your sample output using lxml:
import lxml.etree as et
root = et.Element('objectArray')
for i in range(1, 4):
    obj = et.Element('object')
    title = et.Element('title')
    title.text = 'Title'
    obj.append(title)
    sequenceOrder = et.Element('sequenceOrder')
    sequenceOrder.text = str(i)
    obj.append(sequenceOrder)
    root.append(obj)
print et.tostring(root, pretty_print=True)

A colleague provided a solution:
set
sequence_order = 1
then in xml
<sequenceOrder>""" + str(sequence_order) + """</sequenceOrder>
then later
if test is False:
with open('uniquetest.csv', 'a') as write:
writelog = csv.writer(write, delimiter= '\t', quoting=csv.QUOTE_ALL)
writelog.writerow( (title, ) )
try:
f = open(file + '_output.xml', 'r')
f = open(file + '_output.xml', 'a')
f.write(DASxml_bottom + DASxml_top + digital_objects)
f.close()
sequence_order = sequence_order + 1
# f = open('log.txt', 'a')
# f.write(title + str(roll) + label_flag + str(id) + str(file_size) + file_path + """
# """)
# f.close()
That might make no sense to some since I didn't provide the whole script, but it worked for my purposes! Thanks all for the suggestions
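With hundreds of thousands of objects, it may help to keep the file open once and write each element as it is produced, rather than reopening the file per object. This is a sketch under assumptions, not the script above: `write_object_array` and the sample titles are hypothetical, and `xml.sax.saxutils.escape` is used so that characters such as `&` in a title do not break the XML.

```python
import io
from xml.sax.saxutils import escape

def write_object_array(titles, out):
    """Stream <object> elements to the file-like `out`, numbering from 1."""
    out.write(u'<objectArray>\n')
    sequence_order = 0
    for sequence_order, title in enumerate(titles, start=1):
        out.write(u'<object>\n')
        out.write(u'<title>%s</title>\n' % escape(title))
        out.write(u'<sequenceOrder>%d</sequenceOrder>\n' % sequence_order)
        out.write(u'</object>\n')
    out.write(u'</objectArray>\n')
    return sequence_order  # number of objects written

# An in-memory buffer stands in for a real output file here.
buffer = io.StringIO()
count = write_object_array([u'Fish & Chips', u'Title'], buffer)
result = buffer.getvalue()
```

For a real file, something like `io.open('output.xml', 'w', encoding='utf-8')` can be passed as `out`; only one element is held in memory at a time.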

Related

how to convert from csv to xml with python

How can I convert a CSV file with the columns ['x','y','width','height','tag']
to this XML format using a Python script?
<annotation>
<object>
<tag>figure</tag>
<bndbox>
<x>1.0</x>
<y>100.0</y>
<width>303.0</width>
<height>619.0</height>
</bndbox>
</object>
<object>
<tag>text</tag>
<bndbox>
<x>338.0</x>
<y>162.0</y>
<width>143.0</width>
<height>423.0</height>
</bndbox>
</object>
<object>
<tag>text</tag>
<bndbox>
<x>85.0</x>
<y>768.0</y>
<width>554.0</width>
<height>39.0</height>
</bndbox>
</object>
</annotation>
Note: this is the output for the first row of the CSV file; I want to convert every row.
Good morning.
I was interested in reading and writing CSV files with Python, because I had not been able to read a file on my own computer.
I won't say more about other file types, because I have no experience with them.
Personally, I tried the IDLE Python editor, which can also run and debug Python applications.
If you know another Python editor, that is also an option.
You don't need a separate application for this task.
The Python program reads your CSV file and writes an XML file. I didn't consider what happens if the CSV file has a header row.
I created a file named Py_2_xml.py.
It can be placed in the same folder as your movies2.csv.
The first time, you can create the file as a text file with Notepad, rename the extension to .py, and then edit it with the installed IDLE editor.
Here is my code:
import csv
csvfile=open('./movies2.csv')
reader=csv.reader(csvfile)
data="<annotation>\n"
for line in reader:
    #print(line)
    data = data + ' <object>\n'
    data = data + ' <tag>' + line[4] + '</tag>\n'
    data = data + ' <bndbox>\n'
    i = 0
    for item in line:
        if i == 0:
            data = data + ' <x>' + item + '</x>\n'
        elif i == 1:
            data = data + ' <y>' + item + '</y>\n'
        elif i == 2:
            data = data + ' <width>' + item + '</width>\n'
        elif i == 3:
            data = data + ' <height>' + item + '</height>\n'
        i = i + 1
        #print(item)
    data = data + ' </bndbox>\n'
    data = data + ' </object>\n'
csvfile.close()
data = data + '</annotation>\n'
xmlfile=open('outputf.xml','w')
xmlfile.write(data)
xmlfile.close()
print(data)
data = ''
#import subprocess
#subprocess.Popen(["notepad","outputf.xml"])
I'm sorry: in my first attempt I forgot to write the object tag, and the last column was not read properly.
The items in each line returned by the csv reader can also be referenced by index, like items of an array, so the columns can be read in any order.
This simplifies the code further.
Instead of looping over all items in the line, do this:
import csv
csvfile=open('./movies2.csv')
reader=csv.reader(csvfile)
data="<annotation>\n"
for line in reader:
    #print(line)
    data = data + ' <object>\n'
    data = data + ' <tag>' + line[4] + '</tag>\n'
    data = data + ' <bndbox>\n'
    data = data + ' <x>' + line[0] + '</x>\n'
    data = data + ' <y>' + line[1] + '</y>\n'
    data = data + ' <width>' + line[2] + '</width>\n'
    data = data + ' <height>' + line[3] + '</height>\n'
    data = data + ' </bndbox>\n'
    data = data + ' </object>\n'
csvfile.close()
data = data + '</annotation>\n'
xmlfile=open('outputf.xml','w')
xmlfile.write(data)
xmlfile.close()
print(data)
data = ''
Preferably the file won't have to be opened in an application.
To copy the file to another location, use the standard-library shutil module (for example shutil.copy); the os module helps with building paths.
First, though, you need the destination path.
Some people use the Tkinter library to show a save dialog.
Be careful with the indentation of each line in the Python code, because a wrong tab will result in an error.
This really works for me.
Best regards, Adrian Brinas.
You can use pandas for this:
https://pypi.org/project/pandas/
import pandas as pd
df = pd.read_csv('movies2.csv')
with open('outputf.xml', 'w') as myfile:
    myfile.write(df.to_xml())
Make sure your CSV has a header row. Note that DataFrame.to_xml requires pandas 1.3 or later.
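import pandas as pd
def convert_row(row):
    # Build one <object> element from a row with columns x, y, width, height, tag.
    return """<object>
<tag>%s</tag>
<bndbox>
<x>%s</x>
<y>%s</y>
<width>%s</width>
<height>%s</height>
</bndbox>
</object>""" % (
        row.tag, row.x, row.y, row.width, row.height)
df = pd.read_csv('untitled.txt', sep=',')
# Wrap all objects in a single <annotation> root element.
print('<annotation>\n' + '\n'.join(df.apply(convert_row, axis=1)) + '\n</annotation>')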
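The conversion can also be done with the standard library alone, letting `xml.etree.ElementTree` build the elements so that special characters are escaped for you. This is a sketch assuming one `<object>` per CSV row and no header row; `rows_to_annotation` is a hypothetical name, and the in-memory sample reuses values from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

def rows_to_annotation(rows):
    """Build an <annotation> element from rows of [x, y, width, height, tag]."""
    annotation = ET.Element('annotation')
    for x, y, width, height, tag in rows:
        obj = ET.SubElement(annotation, 'object')
        ET.SubElement(obj, 'tag').text = tag
        bndbox = ET.SubElement(obj, 'bndbox')
        for name, value in (('x', x), ('y', y),
                            ('width', width), ('height', height)):
            ET.SubElement(bndbox, name).text = value
    return annotation

# In-memory stand-in for a CSV file without a header row.
sample = io.StringIO(u"1.0,100.0,303.0,619.0,figure\n"
                     u"338.0,162.0,143.0,423.0,text\n")
annotation = rows_to_annotation(csv.reader(sample))
xml_bytes = ET.tostring(annotation)
```

To write the result to disk, `ET.ElementTree(annotation).write('outputf.xml')` should be enough.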

Need help figuring out why my csv output is blank?

I wrote this code to create a .csv report from an .xml file, but when I open the .csv that's generated it's blank. Feel free to rip my code apart, by the way, I'm super new to this and want to learn!
There are multiple "Subjectkeys" in the xml, but only some have an "AuditRecord". I only want to pull ones with an audit record, and then for those, I want to pull their info from "SubjectData", "FormData" and "AuditRecord"
import csv
import xml.etree.cElementTree as ET
tree = ET.parse("response.xml")
root = tree.getroot()
xml_data_to_csv =open("query.csv", 'w')
AuditRecord_head = []
SubjectData_head = []
FormData_head = []
csvwriter=csv.writer(xml_data_to_csv)
count=0
for member in root.findall("AuditRecord"):
    AuditRecord = []
    Subjectdata = []
    FormData = []
    if count == 0:
        Subject = member.find("SubjectKey").tag
        Subjectdata_head.append(Subject)
        Form = member.find("p1Name").tag
        FormData_head.append(Form)
        Action = member.find("Action").tag
        AuditRecord_head.append(Action)
        csvwriter.writerow(Auditrecord_head)
        count = count + 1
    Subject = member.find('SubjectKey').text
    Subjectdata.append(Subject)
    Form = member.find('p1Name').text
    FormData.append(Form)
    Action = member.find("Action").text
    AuditRecord.append(Action)
    csvwriter.writerow(Subjectdata)
xml_data_to_csv.close()
I expect the output to be a table with column headings: Subject, Form, Action.
Here is sample .xml:
</ClinicalData>
<ClinicalData StudyOID="SMK-869-002" MetaDataVersionOID="2.0">
<SubjectData SubjectKey="865-015">
</AuditRecord>
</FormData>
<FormData p1:Name="Medical History" p1:Started="Y" FormOID="mh" FormRepeatKey="0"/>
<FormData p1:Name="Medical History" p1:Started="Y" FormOID="mh" FormRepeatKey="1">
<p1:QueryAction InitialComment="Please enter start date for condition" UserType="User" UserOID="bailey#protocolfirst.com" Action="query" DateTimeStamp="2019-07-12T14:08:43.893Z"/>
</AuditRecord>
First of all, your XML file has a lot of errors. I think it should look like this:
<?xml version="1.0"?>
<root xmlns:p1="http://some-url.com">
<ClinicalData StudyOID="SMK-869-002" MetaDataVersionOID="2.0"></ClinicalData>
<SubjectData SubjectKey="865-015"></SubjectData>
<AuditRecord>
<FormData p1:Name="Medical History" p1:Started="Y" FormOID="mh" FormRepeatKey="0"/>
<FormData p1:Name="Medical History" p1:Started="Y" FormOID="mh" FormRepeatKey="1"/>
<p1:QueryAction InitialComment="Please enter start date for condition" UserType="User" UserOID="bailey#protocolfirst.com" Action="query" DateTimeStamp="2019-07-12T14:08:43.893Z"/>
</AuditRecord>
</root>
ElementTree always expects only a single root node, and a well-formed document.
I do not fully understand what you're trying to do, but I hope this helps:
# xml.etree.cElementTree was removed in Python 3.9; ElementTree works on all versions
import xml.etree.ElementTree as ET
tree = ET.parse("response.xml")
root = tree.getroot()
xml_data_to_csv = open("query.csv", 'w')
list_head=[]
count=0
for member in root.findall("AuditRecord"):
    AuditRecord = []
    Subjectdata = []
    FormData = []
    if count == 0:
        Subjectdata.append(root.find('./SubjectData').attrib['SubjectKey'])
        for formData in root.findall('./AuditRecord/FormData'):
            #print(formData.attrib['{http://some-url.com}Name'])
            FormData.append(formData.attrib['{http://some-url.com}Name'])
        AuditRecord.append(root.find('./AuditRecord/{http://some-url.com}QueryAction').attrib['Action'])
        xml_data_to_csv.write(Subjectdata[0] + "," + FormData[0] + "," + FormData[1] + "," + AuditRecord[0])
        count = count + 1
xml_data_to_csv.close()
This will produce a csv file with the following content:
865-015,Medical History,Medical History,query
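To get one CSV row per audited form with the headings Subject, Form, Action, one option is to walk the nesting explicitly and let `csv.writer` handle the output. The sketch below rests on assumptions: the layout SubjectData > FormData > AuditRecord > p1:QueryAction is a guess at the real file, the namespace URI is the same placeholder used above, and `audit_rows` is a hypothetical helper.

```python
import csv
import io
import xml.etree.ElementTree as ET

P1 = 'http://some-url.com'  # placeholder namespace URI, not the real one

# Hypothetical well-formed layout, trimmed to the attributes used below.
SAMPLE = """<ClinicalData xmlns:p1="http://some-url.com" StudyOID="SMK-869-002">
  <SubjectData SubjectKey="865-015">
    <FormData p1:Name="Medical History" FormOID="mh" FormRepeatKey="0"/>
    <FormData p1:Name="Medical History" FormOID="mh" FormRepeatKey="1">
      <AuditRecord>
        <p1:QueryAction Action="query"/>
      </AuditRecord>
    </FormData>
  </SubjectData>
  <SubjectData SubjectKey="865-016"/>
</ClinicalData>"""

def audit_rows(root):
    """Yield (subject, form, action) for every form that has an AuditRecord."""
    for subject in root.iter('SubjectData'):
        for form in subject.findall('FormData'):
            for action in form.findall('AuditRecord/{%s}QueryAction' % P1):
                yield (subject.get('SubjectKey'),
                       form.get('{%s}Name' % P1),
                       action.get('Action'))

root = ET.fromstring(SAMPLE)
rows = list(audit_rows(root))

# An in-memory buffer stands in for query.csv.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['Subject', 'Form', 'Action'])
writer.writerows(rows)
```

Subjects and forms without an `AuditRecord` produce no rows, which matches the stated goal of pulling only audited entries.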

How to write CSV into the next column

I have output that I can write into a CSV. However, because of how I set up my XML-to-text conversion, the output is laid out incorrectly. I've tried a lot to fix my XML output, but I don't see any way to fix it.
I've tried many things, from modifying my XML statements to writing the CSV in different ways, but I can't get the rows to line up the way I need, because the for ... in loops have different depths.
I don't really care how it's done, so long as it matches up, because the data is ultimately fed into my SQL database.
Below is my code,
import os
import sys
import glob
import xml.etree.ElementTree as ET
firstFile = open("myfile.csv", "a")
firstFile.write("V-ID,")
firstFile.write("HostName,")
firstFile.write("Status,")
firstFile.write("Comments,")
firstFile.write("Finding Details,")
firstFile.write("STIG Name,")
basePath = os.path.dirname(os.path.realpath(__file__))
xmlFile = os.path.join(basePath, "C:\\Users\\myUserName\\Desktop\\Scripts\\Python\\XMLtest.xml")
tree = ET.parse(xmlFile)
root = tree.getroot()
for child in root.findall('{http://checklists.nist.gov/xccdf/1.2}title'):
    d = child.text
for child in root:
    for children in child.findall('{http://checklists.nist.gov/xccdf/1.2}target'):
        b = children.text
for child in root.findall('{http://checklists.nist.gov/xccdf/1.2}Group'):
    x = (str(child.attrib))
    x = (x.split('_')[6])
    a = x[:-2]
    firstFile.write("\n" + a + ',')
for child in root:
    for children in child:
        for childrens in children.findall('{http://checklists.nist.gov/xccdf/1.2}result'):
            x = childrens.text
            if ('pass' in x):
                c = 'Completed'
            else:
                c = 'Ongoing'
            firstFile.write('\t' + '\n' + ',' + b + ',' + c + ',' + ',' + ',' + d)
firstFile.close()
below is my CSV current output,
below is the output I need,
try to change this
x = (x.split('_')[0])
Think of what your CSV output looks like, then think of what it SHOULD look like.
Your CSV is generated by these two loops:
for child in root.findall('{http://checklists.nist.gov/xccdf/1.2}Group'):
    x = (str(child.attrib))
    x = (x.split('_')[6])
    a = x[:-2]
    firstFile.write("\n" + a + ',')
for child in root:
    for children in child:
        for childrens in children.findall('{http://checklists.nist.gov/xccdf/1.2}result'):
            x = childrens.text
            if ('pass' in x):
                c = 'Completed'
            else:
                c = 'Ongoing'
            firstFile.write('\t' + '\n' + ',' + b + ',' + c + ',' + ',' + ',' + d)
That means that anything you add to your CSV file in the second loop will be written AFTER what was added in the first loop.
I can think of two ideas to solve this off the top of my head:
1. Somehow fuse the loops into one, so that each row is generated in a single loop iteration. I don't know if that works with your parameters, though.
2. Don't append to the end of the CSV file; instead, add to the specific row you want by telling the "write" function where to write.
Disclaimer: I'm not familiar with Python at all, this is just the logical source of the problem and two solutions that should work in theory. I do not know if either of them is practicable.
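The first idea, fusing the loops, could look roughly like this in Python: collect the IDs and the results into two lists, pair them by position with `zip`, and write each complete row once. This is a sketch under assumptions: the sample document is a made-up, heavily trimmed XCCDF-style file with simplified IDs, the Group order is assumed to match the result order, and `build_rows` is a hypothetical helper.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = '{http://checklists.nist.gov/xccdf/1.2}'

# Hypothetical, heavily trimmed XCCDF-style document for illustration only.
SAMPLE = """<Benchmark xmlns="http://checklists.nist.gov/xccdf/1.2">
  <title>Example STIG</title>
  <Group id="V-1001"/>
  <Group id="V-1002"/>
  <TestResult>
    <target>host01</target>
    <rule-result><result>pass</result></rule-result>
    <rule-result><result>fail</result></rule-result>
  </TestResult>
</Benchmark>"""

def build_rows(root):
    """Pair each Group with the result at the same position; return full rows."""
    title = root.findtext(NS + 'title')
    host = root.findtext('%sTestResult/%starget' % (NS, NS))
    group_ids = [group.get('id') for group in root.findall(NS + 'Group')]
    results = [result.text for result in root.iter(NS + 'result')]
    rows = []
    for group_id, result in zip(group_ids, results):
        status = 'Completed' if 'pass' in result else 'Ongoing'
        rows.append([group_id, host, status, '', '', title])
    return rows

root = ET.fromstring(SAMPLE)
rows = build_rows(root)

# An in-memory buffer stands in for myfile.csv.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['V-ID', 'HostName', 'Status', 'Comments',
                 'Finding Details', 'STIG Name'])
writer.writerows(rows)
```

Because every row is written in one `writerow` call, the columns cannot drift apart the way they do when two separate loops append to the same file.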
This fixed it for me. I basically left it as is, put it into a CSV, then read the CSV, stored the columns as a list, removed blank spaces and exported it back out.
import os
import sys
import glob
import csv
import xml.etree.ElementTree as ET
firstFile = open("myfile.csv", "a")
path = 'C:\\Users\\JT\\Desktop\\Scripts\\Python\\xccdf\\'
for fileName in glob.glob(os.path.join(path, '*.xml')):
    with open('C:\\Users\\JT\\Desktop\\Scripts\\Python\\myfile1.csv', 'w', newline='') as csvFile1:
        csvWriter = csv.writer(csvFile1, delimiter=',')
        # do your stuff
        tree = ET.parse(fileName)
        root = tree.getroot()
        # Stig Title
        for child in root.findall('{http://checklists.nist.gov/xccdf/1.2}title'):
            d = child.text
        # hostName
        for child in root:
            for children in child.findall('{http://checklists.nist.gov/xccdf/1.2}target'):
                b = children.text
        # V-ID
        for child in root.findall('{http://checklists.nist.gov/xccdf/1.2}Group'):
            x = (str(child.attrib))
            x = (x.split('_')[6])
            a = x[:-2]
            firstFile.write(a + '\n')
        # Status
        for child in root:
            for children in child:
                for childrens in children.findall('{http://checklists.nist.gov/xccdf/1.2}result'):
                    x = childrens.text
                    firstFile.write(',' + b + ',' + x + ',' + ',' + ',' + d + '\n')
    with open('C:\\Users\\JT\\Desktop\\Scripts\\Python\\myfile.csv', 'r') as csvFile:
        csvReader = csv.reader(csvFile, delimiter=',')
        vIDs = []
        hostNames = []
        status = []
        stigTitles = []
        for line in csvReader:
            vID = line[0]
            vIDs.append(vID)
            try:
                hostName = line[1]
                hostNames.append(hostName)
            except:
                pass
            try:
                state = line[2]
                status.append(state)
            except:
                pass
            try:
                stigTitle = line[5]
                stigTitles.append(stigTitle)
            except:
                pass
    with open('C:\\Users\\JT\\Desktop\\Scripts\\Python\\myfile1.csv', 'a', newline='') as csvFile1:
        csvWriter = csv.writer(csvFile1, delimiter=',')
        vIDMod = list(filter(None, vIDs))
        hostNameMod = list(filter(None, hostNames))
        statusMod = list(filter(None, status))
        stigTitlesMod = list(filter(None, stigTitles))
        csvWriter.writerows(zip(vIDMod, hostNameMod, statusMod, stigTitlesMod))
firstFile.close()

python, fetch sequence from DAS by coordinates

The UCSC DAS server returns DNA sequences by coordinate.
URL: http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060
sample file:
<DASDNA>
<SEQUENCE id="chr20" start="30037832" stop="30038060" version="1.00">
<DNA length="229">
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
</DNA>
</SEQUENCE>
</DASDNA>
what I want is this part:
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
tccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg
cgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc
tttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc
acactctatcaataaacacctctggctga
I want to get the sequence part from thousands of URLs like this one. How should I do it?
I tried writing the data to a file and parsing the file, which worked OK, but is there any way to parse the XML-like string directly? I tried some examples from other posts, but they didn't work.
Here, I added my solution. Thanks to the 2 answers below.
Solution 1:
from lxml import etree

def getSequence2(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    doc = etree.parse(url,parser=etree.XMLParser())
    if doc != '':
        sequence = doc.xpath('SEQUENCE/DNA/text()')[0].replace('\n','')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Solution 2:
import urllib2  # Python 2 module
# Assumed import: the DOM calls below match xml.dom.minidom's parse
from xml.dom.minidom import parse

def getSequence1(chromosome, start, end):
    base = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
    url = base + chromosome + ':' + str(start) + ',' + str(end)
    xml = urllib2.urlopen(url).read()
    if xml != '':
        w = open('temp.xml', 'w')
        w.write(xml)
        w.close()
        dom = parse('temp.xml')
        data = dom.getElementsByTagName('DNA')
        sequence = data[0].firstChild.nodeValue.replace('\n','')
    else:
        sequence = 'THE SEQUENCE DOES NOT EXIST FOR GIVEN COORDINATES'
    return sequence
Each solution needs the imports shown at the top of its block.
>>> from lxml import etree
>>> doc = etree.parse("http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:30037432,30038060",parser=etree.XMLParser())
>>> doc.xpath('SEQUENCE/DNA/text()')
['\natagtggcacatgtctgttgtcctagctcctcggggaaactcaggtggga\ngagtcccttgaactgggaggaggaggtttgcagtgagccagaatcattcc\nactgtactccagcctaggtgacagagcaagactcatctcaaaaaaaaaaa\naaaaaaaaaaaaaagacaatccgcacacataaaggctttattcagctgat\ngtaccaaggtcactctctcagtcaaaggtgggaagcaaaaaaacagagta\naaggaaaaacagtgatagatgaaaagagtcaaaggcaagggaaacaaggg\naccttctatctcatctgtttccattcttttacagacctttcaaatccgga\ngcctacttgttaggactgatactgtctcccttctttctgctttgtgtcag\ngtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc\ntccaagaaggaaagagtctatgtttactgcataaataataaaatgtgctg\ncgtgaagcccaagtaccagccaaaagaaaggtggtggccattttaactgc\ntttgaagcctgaagccatgaaaatgcagatgaagctcccagtggattccc\nacactctatcaataaacacctctggctga\n']
Use a Python XML parsing library like lxml, load the XML file with that parser, and then use a selector (e.g. using XPath) to grab the node/element that you need.
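If the response is already in memory as a string, the standard-library `xml.etree.ElementTree.fromstring` can parse it directly, with no temporary file. A minimal sketch: `extract_sequence` is a hypothetical helper, and the sample is a shortened version of the response shown in the question (two of the five sequence lines, with the `length` attribute adjusted to match).

```python
import xml.etree.ElementTree as ET

def extract_sequence(xml_text):
    """Return the DNA sequence from a DASDNA response as one string, or None."""
    root = ET.fromstring(xml_text)
    dna = root.find('SEQUENCE/DNA')
    if dna is None or dna.text is None:
        return None
    # Splitting on whitespace and rejoining drops the newlines between lines.
    return ''.join(dna.text.split())

# Shortened sample response; the real one has more sequence lines.
sample = """<DASDNA>
<SEQUENCE id="chr20" start="30037832" stop="30038060" version="1.00">
<DNA length="79">
gtggcacccaaagatgctggaatctttatggcaaatgccgttacagatgc
acactctatcaataaacacctctggctga
</DNA>
</SEQUENCE>
</DASDNA>"""
sequence = extract_sequence(sample)
```

The string passed in could be whatever the HTTP request returned, so the download and the parsing stay separate steps.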

How to Modify Python Code in Order to Print Multiple Adjacent "Location" Tokens to Single Line of Output

I am new to python, and I am trying to print all of the tokens that are identified as locations in an .xml file to a .txt file using the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('exercise-ner.xml', 'r'))
tokenlist = soup.find_all('token')
output = ''
for x in tokenlist:
    readeachtoken = x.ner.encode_contents()
    checktoseeifthetokenisalocation = x.ner.encode_contents().find("LOCATION")
    if checktoseeifthetokenisalocation != -1:
        output += "\n%s" % x.word.encode_contents()
z = open('exercise-places.txt','w')
z.write(output)
z.close()
The program works, and spits out a list of all of the tokens that are locations, each of which is printed on its own line in the output file. What I would like to do, however, is to modify my program so that any time beautiful soup finds two or more adjacent tokens that are identified as locations, it can print those tokens to the same line in the output file. Does anyone know how I might modify my code to accomplish this? I would be entirely grateful for any suggestions you might be able to offer.
This question is very old, but I just got your note @Amanda and I thought I'd post my approach to the task in case it might help others:
import glob, codecs
from bs4 import BeautifulSoup
inside_location = 0
location_string = ''
with codecs.open("washington_locations.txt","w","utf-8") as out:
    for i in glob.glob("/afs/crc.nd.edu/user/d/dduhaime/java/stanford-corenlp-full-2015-01-29/processed_washington_correspondence/*.xml"):
        locations = []
        with codecs.open(i,'r','utf-8') as f:
            soup = BeautifulSoup(f.read())
            tokens = soup.findAll('token')
            for token in tokens:
                if token.ner.string == "LOCATION":
                    inside_location = 1
                    location_string += token.word.string + u" "
                else:
                    if location_string:
                        locations.append( location_string )
                        location_string = ''
            # flush a location that ends on the file's last token
            if location_string:
                locations.append( location_string )
                location_string = ''
            out.write( i + "\t" + "\t".join(l for l in locations) + "\n" )
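The grouping step can also be expressed with `itertools.groupby`, which splits a sequence into runs of consecutive items sharing the same key. The sketch below works on plain `(word, ner_tag)` pairs rather than BeautifulSoup objects; `group_locations` and the sample tokens are hypothetical.

```python
from itertools import groupby

def group_locations(tokens):
    """Join runs of adjacent LOCATION tokens into single strings.

    `tokens` is an iterable of (word, ner_tag) pairs.
    """
    phrases = []
    for is_location, run in groupby(tokens,
                                    key=lambda pair: pair[1] == 'LOCATION'):
        if is_location:
            phrases.append(' '.join(word for word, _ in run))
    return phrases

tokens = [('He', 'O'), ('left', 'O'), ('New', 'LOCATION'),
          ('York', 'LOCATION'), ('for', 'O'), ('Boston', 'LOCATION')]
phrases = group_locations(tokens)
```

With BeautifulSoup, the pairs could be built as `(token.word.string, token.ner.string)` for each token; a run that ends on the very last token is handled the same way as any other.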
