I'm trying to read an XML file into Python, pull certain elements out of it, and then write the result back to an XML file (so basically it's the original XML file without several elements). When I use .removeChild(source) it removes the elements I want to remove, but leaves whitespace in their place, making the file very hard to read. I know I can still parse the file with all of the whitespace, but there are times when I need to manually alter the values of certain elements' attributes, and the leftover whitespace makes this difficult (and annoying). I could certainly remove the whitespace by hand, but with dozens of these XML files that's not really feasible.
Is there a way to do .removeChild and have it remove the whitespace as well?
Here's what my code looks like:
dom = parse(filename)
main = dom.childNodes[0]
sources = main.getElementsByTagName("source")
for source in sources:
    name = source.getAttribute("name")
    spatialModel = source.getElementsByTagName("spatialModel")
    val1 = float(spatialModel[0].getElementsByTagName("parameter")[0].getAttribute("value"))
    val2 = float(spatialModel[0].getElementsByTagName("parameter")[1].getAttribute("value"))
    if angsep(val1, val2, X, Y) >= ROI:
        main.removeChild(source)
    else:
        print name, val1, val2, angsep(val1, val2, X, Y)
f = open(outfile, "w")  # mode must be "w", not "write"
f.write("<?xml version=\"1.0\" ?>\n")
f.write(dom.saveXML(main))
f.close()
Thanks much for the help.
If you have PyXML installed, you can use xml.dom.ext.PrettyPrint().
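For what it's worth, here is a rough sketch of how that call was typically used; it assumes the third-party PyXML package (long unmaintained, not part of the standard library) is actually importable, and the file names are just examples:

import sys
from xml.dom.ext import PrettyPrint   # provided by PyXML, not the stdlib
from xml.dom.minidom import parse

doc = parse("model.xml")              # example input file
PrettyPrint(doc, sys.stdout)          # re-indents the whole document as it writes
# or write straight to a file:
# PrettyPrint(doc, open("model_pretty.xml", "w"))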
I couldn't figure out how to do this using xml.dom.minidom, so I just wrote a quick function to read in the output file and remove all blank lines and then rewrite to a new file:
import re

w = open('src_model.xml', 'w')
empty = re.compile('^$')
for line in open(xmlfile).readlines():
    if empty.match(line):
        continue
    else:
        w.write(line)
w.close()
This works well enough for me :)
… for anyone who finds this by searching:
This funny snippet
skey = lambda x: getattr(x, "tagName", None)
mainnode.childNodes = sorted(
    [n for n in mainnode.childNodes if n.nodeType != n.TEXT_NODE],
    cmp=lambda x, y: cmp(skey(y), skey(x)))
removes all text nodes (and also reverse-sorts the remaining element nodes by tag name).
I.e. you can (recursively) do tr.childNodes = [recurseclean(n) for n in tr.childNodes if n.nodeType != n.TEXT_NODE] to remove all text nodes
Or you might want to do something like … if n.nodeType != n.TEXT_NODE or not re.match(r'^\s*$', n.data, re.MULTILINE) (didn't try that one myself) if you need to keep text nodes that contain some data. Or something more complex to leave text inside specific tags.
After that tree.toprettyxml(…) will return well-formatted XML text.
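Putting those ideas together, here is a minimal sketch of the recursive cleanup described above (the function name strip_text_nodes and the file name are my own; it assumes a minidom document and keeps text nodes that contain real data, as the previous paragraph suggests):

import xml.dom.minidom

def strip_text_nodes(node):
    # recursively drop whitespace-only text nodes so toprettyxml() stays clean
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and child.data.isspace():
            node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_text_nodes(child)
    return node

dom = xml.dom.minidom.parse("model.xml")   # example file name
strip_text_nodes(dom.documentElement)
print(dom.documentElement.toprettyxml(indent="  "))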
I know this question is quite dated, but since it took me a while to figure out the different approaches to the problem, here are my solutions:
The best way I found is indeed using lxml:
from lxml import etree

root = etree.fromstring(data)
# for tag in root.iter('tag') doesn't cope with namespaces...
for tag in root.xpath('//*[local-name() = "tag"]'):
    tag.getparent().remove(tag)
data = etree.tostring(root, encoding='utf-8', pretty_print=True)
With minidom it's a bit more convoluted, due to the fact that every node is accompanied by a trailing whitespace text node:
import xml.dom.minidom

dom = xml.dom.minidom.parseString(data)
for tag in dom.getElementsByTagName('tag'):
    if tag.nextSibling \
            and tag.nextSibling.nodeType == tag.TEXT_NODE \
            and tag.nextSibling.data.isspace():
        tag.parentNode.removeChild(tag.nextSibling)
    tag.parentNode.removeChild(tag)
data = dom.documentElement.toxml(encoding='utf-8')
Related
As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table (the columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.).
I think regex is what I need, but my class did not cover it, so I am confused about how to parse the file in order to extract and output the correct data into an organized table...
I am supposed to turn my text file from this
https://pastebin.com/ZM8EPu0p
and export it into a more readable format like this- example output is below
Here is what I have so far.
def readFile(court):
    csv_rows = []
    # read and split txt file into pages & chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")
        for chunk in data_chunks:
            chunk = chunk.strip()  # .strip() removes useless spaces
            if str(data_chunks[:4]).isnumeric():  # if first 4 characters are digits
                entry = None  # initialize an empty dictionary
            elif (
                str(data_chunks).isspace() and entry
            ):  # if we're on an empty line and the entry dict is not empty
                csv_rows.DictWriter(dialect="excel")  # turn csv_rows into needed output
                entry = {}
            else:
                # parse here?
                print(data_chunks)
    return csv_rows

readFile("/Users/mia/Desktop/School/programming/court.txt")
It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks.
First, your input looks like a plain text file, so you could parse it line by line -- see https://www.w3schools.com/python/ref_file_readlines.asp
Then, I noticed that your data can be split in pages. You would need to prepare a lot of regular expressions, but you can start with one for identifying where each page starts. -- you may want to read this as your expression might get quite complicated: https://www.w3schools.com/python/python_regex.asp
The goal of this step is to collect all lines from a page in some container (might be a list, dict, whatever you find it suitable).
And afterwards, write some code that parses the information page by page. For simplicity I suggest starting with something easy, like the columns for "no, file number and defendant".
And when you got some data in a reliable manner, you can address the export part, using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
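A very rough sketch of those steps might look like the following; the page-header and case-line regexes are hypothetical placeholders (the real patterns have to be read off the actual court file), and the file names are just examples:

import re
import pandas as pd

# Hypothetical patterns -- adjust them to the real file's layout.
PAGE_HEADER = re.compile(r"^\s*Run Date", re.IGNORECASE)   # assumed page-start marker
CASE_LINE = re.compile(r"^(?P<no>\d+)\s+(?P<file_number>\S+)\s+(?P<defendant>.+)$")

with open("court.txt") as f:   # example path
    lines = f.readlines()

# Step 1: group lines into pages.
pages, current = [], []
for line in lines:
    if PAGE_HEADER.match(line) and current:
        pages.append(current)
        current = []
    current.append(line.rstrip("\n"))
if current:
    pages.append(current)

# Step 2: pull a few easy columns out of each page.
rows = []
for page in pages:
    for line in page:
        m = CASE_LINE.match(line)
        if m:
            rows.append(m.groupdict())

# Step 3: export with pandas.
pd.DataFrame(rows).to_excel("court.xlsx", index=False)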
How to get/extract number of lines added and deleted?
(Just like we do using git diff --numstat).
from git import Repo  # GitPython

repo_ = Repo('git-repo-path')
git_ = repo_.git
log_ = git_.diff('--numstat', 'HEAD~1')
print(log_)
This prints the entire output (lines added/deleted and file names) as a single string. Can this output format be modified or parsed so as to extract the useful information?
Output format: num(added) num(deleted) file-name
For all files modified.
If I understand you correctly, you want to extract data from your log_ variable and then re-format it and print it? If that's the case, then I think the simplest way to do it is with a regular expression:
import re

for line in log_.split('\n'):
    m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
    if m:
        print("{}: rows added {}, rows deleted {}".format(m[3], m[1], m[2]))
You can of course modify the exact output any way you want once you have the data in the match m. Getting the hang of regular expressions may take a while, but it can be very helpful for small scripts.
Be advised, though: regexes tend to be write-only code and can be very hard to debug. For extracting small parts like this, however, they are very helpful.
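If you would rather collect the parsed values in a data structure than print them, a small variation on the same idea (the variable names are my own) could look like this; note that git prints "-" instead of numbers for binary files, and those lines simply won't match the pattern:

import re

stats = []  # list of (added, deleted, filename) tuples
for line in log_.split('\n'):
    m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)  # binary files show "-" and are skipped
    if m:
        stats.append((int(m.group(1)), int(m.group(2)), m.group(3)))

total_added = sum(a for a, _, _ in stats)
total_deleted = sum(d for _, d, _ in stats)
print("{} added, {} deleted across {} files".format(total_added, total_deleted, len(stats)))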
I've been learning Python for the last couple of days in order to complete a data extraction task. I'm not getting anywhere and hope one of you lovely people can advise.
I need to extract data that follows: RESP, CRESP, RTTime and RT.
Here's a snippet as an example of the mess I have to deal with.
Thoughts?
Level: 4
*** LogFrame Start ***
Procedure: ActProcScenarios
No: 1
Line1: It is almost time for your town's spring festival. A friend of yours is
Line2: on the committee and asks if you would be prepared to help out with the
Line3: barbecue in the park. There is a large barn for use if it rains.
Line4: You hope that on that day it will be
pfrag: s-n-y
pword: sunny
pletter: u
Quest: Does the town have an autumn festival?
Correct: {LEFTARROW}
ScenarioListPract: 1
Topic: practice
Subtheme: practice
ActPracScenarios: 1
Running: ActPracScenarios
ActPracScenarios.Cycle: 1
ActPracScenarios.Sample: 1
DisplayFragInstr.OnsetDelay: 17
DisplayFragInstr.OnsetTime: 98031
DisplayFragInstr.DurationError: -999999
DisplayFragInstr.RTTime: 103886
DisplayFragInstr.ACC: 0
DisplayFragInstr.RT: 5855
DisplayFragInstr.RESP: {DOWNARROW}
DisplayFragInstr.CRESP:
FragInput.OnsetDelay: 13
FragInput.OnsetTime: 103899
FragInput.DurationError: -999999
FragInput.RTTime: 104998
I think regular expressions would be the right tool here because the \b word boundary anchors allow you to make sure that RESP only matches a whole word RESP and not just part of a longer word (like CRESP).
Something like this should get you started:
>>> import re
>>> for line in myfile:
... match = re.search(r"\b(RT|RTTime|RESP|CRESP): (.*)", line)
... if match:
... print("Matched {0} with value {1}".format(match.group(1),
... match.group(2)))
Output:
Matched RTTime with value 103886
Matched RT with value 5855
Matched RESP with value {DOWNARROW}
Matched CRESP with value
Matched RTTime with value 104998
Transform it into a dict first, then just get items from the dict as you wish:
d = {k.strip(): v.strip() for (k, v) in
     [line.split(':', 1)  # split only on the first ':' in case a value contains one
      for line in s.split('\n') if line.find(':') != -1]}
print (d['DisplayFragInstr.RESP'], d['DisplayFragInstr.CRESP'],
       d['DisplayFragInstr.RTTime'], d['DisplayFragInstr.RT'])
# ('{DOWNARROW}', '', '103886', '5855')
I think you may be making things harder for yourself than needed. E-Prime has a file format called .edat that is designed for the purpose you are describing. An .edat file is another format that contains the same information as the .txt file, but in a way that makes extracting variables easier. I personally only use the type of text file you have posted here as a form of data storage redundancy.
If you are doing things this way because you do not have a software key, it might help to know that the E-Merge and E-DataAid programs for E-Prime don't require a key. You only need the key for editing build files. Whoever provided you with the .txt files should probably have an install disk for these programs. If not, it is available on the PST website (I believe you need a serial code to create an account, but I'm not certain).
Eprime generally creates a .edat file that matches the content of the text file you have posted an example of. Sometimes though if eprime crashes you don't get the edat file and only have the .txt. Luckily you can generate the edat file from the .txt file.
Here's how I would approach this issue: If you do not have the edat files available first use E-DataAid to recover the files.
Then presuming you have multiple participants you can use e-merge to merge all of the edat files together for all participants in who completed this task.
Open the merged file. It might look a little chaotic depending on how much you have in the file. Go to Tools -> Arrange Columns; this will show a list of all your variables. Adjust it so that only the desired variables are in the right-hand box, and hit OK.
Looking at the file you posted, it says Level: 4 at the top, so I'm guessing there are a lot of procedures in this experiment. If you have many procedures in the program, you might at this point have lines that just have startup info and NULL in the locations where your variables of interest are. You can fix this by going to Tools -> Filter and creating a filter to eliminate those lines. Sometimes, depending on file structure, you might also end up with duplicate lines of the same data. You can also fix this with filtering.
You can then export this file as a csv.
import re
import pprint

def parse_logs(file_name):
    with open(file_name, "r") as f:
        lines = [line.strip() for line in f.readlines()]

    base_regex = r'^.*{0}: (.*)$'
    match_terms = ["RESP", "CRESP", "RTTime", "RT"]
    regexes = {term: base_regex.format(term) for term in match_terms}

    output_list = []
    for line in lines:
        for key, regex in regexes.items():
            match = re.match(regex, line)
            if match:
                match_tuple = (key, match.groups()[0])
                output_list.append(match_tuple)
    return output_list

pprint.pprint(parse_logs("respregex"))
Edit: Tim and Guy's answers are both better. I was in a hurry to write something and missed two much more elegant solutions.
This is my text file:
TestCases-2
Input-5
Output-1,1,2,3,5
Input-7
Ouput-1,1,2,3,5,8,13
What I want is this:-
A variable test_no = 2 (No. of testcases)
A list testCaseInput = [5,7]
A list testCaseOutput = [[1,1,2,3,5],[1,1,2,3,5,8,13]]
I've tried doing it in this way:
testInput = testCase.readline(-10)
for i in range(0, int(testInput)):
    testCaseInput = testCase.readline(-6)
    testCaseOutput = testCase.readline(-7)
The next step would be to split the numbers on ',' and then put them in a list.
Weirdly, the readline(-6) is not giving desired results.
Is there a better way to do this that I'm obviously missing out on?
I don't mind using serialization here but I want to make it very simple for someone to write a text file as the one I have shown and then take the data out of it. How to do that?
The argument to the readline method specifies a maximum number of bytes to read, and a negative value means no limit, so readline(-6) is not doing what you want it to do.
Instead, it is simpler to pull everything into a list all at once with readlines():
with open('data.txt') as f:
    full_lines = f.readlines()

# parse full lines to get the text to right of "-"
lines = [line.partition('-')[2].rstrip() for line in full_lines]

numcases = int(lines[0])
for i in range(1, len(lines), 2):
    caseinput = lines[i]
    caseoutput = lines[i+1]
    ...
The idea here is to separate concerns (the source of the data, the parsing of '-', and the business logic of what to do with the cases). That is better than having a readline() and redundant parsing logic at every step.
I'm not sure if I follow exactly what you're trying to do, but I guess I'd try something like this:
testCaseIn = []
testCaseOut = []
for line in testInput:
    if line.startswith("Input"):
        testCaseIn.append(giveMeAList(line.split("-")[1]))
    elif line.startswith("Output"):
        testCaseOut.append(giveMeAList(line.split("-")[1]))
where giveMeAList() is a function that takes a comma-separated list of numbers and builds a Python list from it (a possible version is sketched after the next paragraph).
I didn't test this code, but I've written stuff that uses this kind of structure when I've wanted to make configuration files in the past.
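The answer leaves giveMeAList() undefined; a minimal version, assuming the values should end up as integers, could be:

def giveMeAList(csv_text):
    # "1,1,2,3,5" -> [1, 1, 2, 3, 5]; assumes the values are integers
    return [int(x) for x in csv_text.strip().split(",") if x.strip()]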
You can use regex for this and it makes it much easier. See question: python: multiline regular expression
For your case, try this:
import re
s = open("input.txt","r").read()
(inputs,outputs) = zip(*re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n",s))
and then split(",") each output element as required
If you do it this way you get the benefit that you don't need the first line in your input file so you don't need to specify how many entries you have in advance.
You can also take away the unzip (that's the zip(*...) ) from the code above, and then you can deal with each input and output a pair at a time. My guess is that is in fact exactly what you are trying to do.
EDIT: I wanted to give you the full example of what I meant just then. I'm assuming this is for a testing script, so I would use the power of the pattern-matching iterator to help keep your code shorter and simpler:
for (input, output) in re.findall(r"Input-(?P<input>.*)\nOutput-(?P<output>.*)\n", s):
    expectedResults = output.split(",")
    testResults = runTest(input)
    # compare testResults and expectedResults ...
This line has an error:
Ouput-1,1,2,3,5,8,13  # it should be 'Output', not 'Ouput'
This should work:
testCase = open('in.txt', 'r')
testInput = int(testCase.readline().replace("TestCases-", ""))
for i in range(testInput):
    testCaseInput = testCase.readline().replace("Input-", "").strip()
    testCaseOutput = testCase.readline().replace("Output-", "").strip().split(",")
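If you want exactly the structures from the question (test_no, testCaseInput, testCaseOutput), a small extension of that idea might look like this; it assumes the 'Ouput' typo in the sample file has been fixed and that the values should be ints:

with open('in.txt') as testCase:
    test_no = int(testCase.readline().replace("TestCases-", ""))
    testCaseInput = []
    testCaseOutput = []
    for _ in range(test_no):
        testCaseInput.append(int(testCase.readline().replace("Input-", "").strip()))
        testCaseOutput.append(
            [int(n) for n in testCase.readline().replace("Output-", "").strip().split(",")])

print(test_no)         # 2
print(testCaseInput)   # [5, 7]
print(testCaseOutput)  # [[1, 1, 2, 3, 5], [1, 1, 2, 3, 5, 8, 13]]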
How to check if two XML files are equivalent?
For example, the two XML files below are the same even though the ordering is different. I need to check whether the two XML files contain the same textual info, disregarding the order.
<a>
<b>hello</b>
<c><d>world</d></c>
</a>
<a>
<c><d>world</d></c>
<b>hello</b>
</a>
Are there tools for this out there?
It all depends on your definition of "equivalent".
Assuming you really only care about the text nodes (for example: the d tag in your example does not even matter, you only care about its content, "world"), you can just make a set of the text nodes of each document, and compare the sets. Using lxml, this could look like:
from lxml import etree
tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')
print set(tree1.getroot().itertext()) == set(tree2.getroot().itertext())
You might even want to ignore whitespace nodes, doing something like:
set(i for i in tree.getroot().itertext() if i.strip())
Note that using sets means you will NOT take into account how many times certain pieces of text occur in the document (this might be what you want, it might not). If the order is not important but the number of times something occurs is, you could use a dictionary instead of a set and keep track of the number of occurrences (e.g. with collections.defaultdict() or collections.Counter in Python 2.7).
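As an illustration, a small sketch of the Counter variant (my own wording, using the same two example files as above; it also ignores whitespace-only nodes) might be:

from collections import Counter
from lxml import etree

tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')

# count non-whitespace text nodes, so repeated text has to match in number too
count1 = Counter(t.strip() for t in tree1.getroot().itertext() if t.strip())
count2 = Counter(t.strip() for t in tree2.getroot().itertext() if t.strip())
print(count1 == count2)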
But if it is only the order of the direct child elements of the root element (in your case, children of the a element) that may be ignored, and everything inside those elements really counts, you would need another approach. You could for example do xml canonicalization on each child element to get a normalized version of each child (again, I don't know if this is normalized enough for your needs).
from lxml import etree
tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')
set1 = set(etree.tostring(i, method='c14n') for i in tree1.getroot())
set2 = set(etree.tostring(i, method='c14n') for i in tree2.getroot())
print set1 == set2
Note: to keep the example simpler, I've used the development version of lxml. In older versions there is no method='c14n' for etree.tostring(), only a c14n() method on the ElementTree that writes to a file-like object. So to get it working there, you'd have to copy each element to a tree of its own and use a StringIO() object as a dummy file.
Also, this way of doing it is probably not recommended with very large files.
But again: a BIG WARNING: you really have to know what you need as "equivalent", and create your own solution based on that knowledge!
Ordering is important in XML, so the two files you provided are different. Normally you could normalize the XML and then simply compare the files as text, but if you want order-insensitive comparison, you will probably have to implement it yourself using one of the bazillion XML parsers out there (I would recommend lxml, by the way).
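To illustrate the "normalize, then compare as text" idea, here is a small sketch with lxml (note: this checks strict equivalence, including order, so it would report the two files above as different; it also assumes a version of lxml where etree.tostring(..., method='c14n') is available):

from lxml import etree

def c14n_bytes(path):
    # canonicalize (C14N) so equivalent formatting/attribute ordering compares equal
    return etree.tostring(etree.parse(path), method='c14n')

print(c14n_bytes('example1.xml') == c14n_bytes('example2.xml'))  # False for the files above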
My solution is below. It compares all attributes and tags, iterating over both trees.
Some of the code is adapted from: Testing Equivalence of xml.etree.ElementTree
import xml.etree.ElementTree as ET

def elements_equal(e1, e2):
    if e1.tag != e2.tag:
        return False
    if e1.text != e2.text:
        if e1.text is not None and e2.text is not None:
            return False
    if e1.tail != e2.tail:
        if e1.tail is not None and e2.tail is not None:
            return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))

def is_two_xml_equal(f1, f2):
    tree1 = ET.parse(f1)
    root1 = tree1.getroot()
    tree2 = ET.parse(f2)
    root2 = tree2.getroot()
    return elements_equal(root1, root2)

f1 = '2.xml'
f2 = '1.xml'
print(is_two_xml_equal(f1, f2))