Python/Programming beginner. Working on parsing/extracting data from XML.
Goal: Take malformed xml (multiple xml files in one .data file) and write it to individual xml files. FYI - each xml begins with the same declaration in the file, and there are a total of 4.
Approach (1) I use readlines() on the file (2) Find the index of each xml's declaration (3) loop through xml list slices, writing each line to file. Code below, apologies if it sucks :)
for i, x in enumerate(decl_indxs):
    xml_file = open(file, 'w')
    if i == 4:
        for line in file_lines[x:]:
            xml_file.write(line)
    else:
        for line in file_lines[x:decl_indxs[i+1]]:
            xml_file.write(line)
Problem The first 3 xmls are created without issue. The 4th xml only writes the first 238 of 396 lines.
Troubleshooting I modified the code to print out the list slice used for the final loop and it was good. I also looped through just the 4th list slice and it outputs correctly.
Help Can anyone explain why this would happen? Would also be great to get advice on improving my approach. The more info the better. Thanks
I don't think finding indexes is a good approach; most probably the indexes got mixed up somewhere. The bad news is that this is not easy to debug, because there are a lot of meaningless integer values. I'll try to suggest some more useful approaches here.
As far as I understand your issue, you need to:
Use a with context manager to open your original file containing multiple XMLs.
Split the content of the original file into several strings, based on the known declaration header string <?xml.
Write each valid XML string to its own file, again using a with context manager.
If you need to work further with these XMLs, you should definitely look at specialized XML parsers (xml.etree, lxml) rather than handling them as strings.
Code example:
def split_to_several_xmls(original_file_path):
    # assuming that original source is correctly formatted, i.e. each line starts with "<" (omitting spaces)
    # otherwise you need to parse by chars not by lines
    with open(original_file_path) as f:
        data = f.read()
    resulting_lists = []
    for line in data.split('\n'):
        if not line or not line.strip():  # ignore empty lines
            continue
        if line.strip().startswith('<?xml '):
            resulting_lists.append([])  # create new list to write lines to new xml chunk
        if not resulting_lists:  # ignore everything before first xml declaration
            continue
        resulting_lists[-1].append(line)  # write current line to current xml chunk
    resulting_strings = ['\n'.join(e) for e in resulting_lists]
    # i.e. convert the lists of lines back to strings - each string is one valid xml chunk in the result
    return resulting_strings

def save_xmls(xml_strings, filename_base):
    for i, xml_string in enumerate(xml_strings):
        filename = '{base}{i}.xml'.format(base=filename_base, i=i)
        with open(filename, mode='w') as f:
            f.write(xml_string)

def main():
    xml_strings = split_to_several_xmls('original.txt')  # file with multiple xmls in one file
    save_xmls(xml_strings, 'result')

if __name__ == '__main__':
    main()
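One possible reason for the truncated fourth file in the question is that the output files are opened but never closed, so the last buffered chunk may never reach disk. That is only a guess, since the full script is not shown. If you want to keep the index-based approach, here is a minimal sketch of it with a with block; the function name write_slices, the partN.xml file names and the demo data are made up for illustration.

```python
import os
import tempfile


def write_slices(file_lines, decl_indxs, out_dir):
    """Write each declaration-to-declaration slice of file_lines to its own file."""
    # Pair every start index with the next one; the last slice runs to the end.
    ends = decl_indxs[1:] + [len(file_lines)]
    paths = []
    for i, (start, end) in enumerate(zip(decl_indxs, ends)):
        path = os.path.join(out_dir, "part{}.xml".format(i))
        # The with block closes (and therefore flushes) the file on exit.
        with open(path, "w") as xml_file:
            xml_file.writelines(file_lines[start:end])
        paths.append(path)
    return paths


# Demo with two tiny documents written to a temporary directory.
demo_lines = [
    '<?xml version="1.0"?>\n', '<a/>\n',
    '<?xml version="1.0"?>\n', '<b/>\n', '<c/>\n',
]
demo_idx = [i for i, line in enumerate(demo_lines) if line.startswith("<?xml")]
demo_paths = write_slices(demo_lines, demo_idx, tempfile.mkdtemp())
```

Pairing each start index with the next one also removes the need for a special case on the last slice.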
I am writing a Streamlit app that takes in tensor output data from a .txt file, formats it, and both shows information on the data and prints the formatted data back to a new .txt file for later use.
After uploading the txt file to Streamlit and decoding it to a single long string, I alter the string and write it to a new txt file. When I open the txt file, the line spacings are huge, it looks like extra newlines have been put in but when you highlight the text, it is just large line spacings.
As well as this, when I use splitlines() on the string, the array that is returned is empty. This is the case even though the string is not empty and does contain newlines - I think it is to do with the large line spacings, but I am not sure.
The program is split into modules, but the code that is meant to format the file is in just two functions. One adds delimiters and works like this (with Streamlit as st):
def delim(file):
    #read the selected file and write it to variable elems as a string
    elems = file.decode('utf-8')
    #replace the applicable parts of variable elems with the delimiters
    elems = elems.replace('e+002', 'e+002, ')
    elems = elems.replace('e+003', 'e+003, ')
    elems = elems.replace('e+004', 'e+004, ')
    elems = elems.replace('e+005', 'e+005, ')
    elems = elems.replace('e+006', 'e+006, ')
    elems = elems.replace('e+007', 'e+007, ')
    elems = elems.replace('e+008', 'e+008, ')
    elems = elems.replace('e+009', 'e+009, ')
    with open('final_file.txt', 'w') as magma_file:
        #write a txt file with the stored, altered text in variable elems
        magma_file.write(elems)
        #close the writeable file to be safe
        magma_file.close()
    st.success('Delimiters successfully added')
The second part, where I am getting the empty array, is in a second function. The whole function is not necessary to see the issue, but the part that is not working is here:
def addElem(file):
    #create counting variables
    counter = 0
    linecount = 1
    #put file as string in variable checks
    checks = file.decode('utf-8')
    checks.splitlines()
    #check to see if the start of the file is formatted correctly. This is the part giving me strife
    if checks[0].rstrip().endswith('5'):
        with open('final_file.txt', 'w') as ff:
            #iterate through the lines in the file
            for line in checks:
                counter+=1
                # and so on, not relevant to the problem
The variable checks does contain a string after decoding the file, but when I use splitlines() then look inside checks[0], checks[1] etc., they are all empty. I tried commenting out other code, the conditional statement, removing the rstrip() and just seeing what was in the checks array after splitting the string, but it was still nothing. I tried changing splitlines() to split() using various delimiters including \n, but the array remained empty.
This program logic worked perfectly when I was running it locally using a console application interacting directly with the file system, so probably the problem is something to do with how a Streamlit "file like object" works. I read through the docs at Streamlit, but it doesn't give much detail on this.
This program is not for my use, so I can't keep it as a console app. I did ask about this on the Streamlit community a month ago, but so far no one has answered and I am not sure whether it is an unusual problem or just a terrible question.
I am wondering if there is a better way to decode the file to a string, but decoding to unicode doesn't explain the line spacings so I think something else is going on.
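Two things in the code above may be worth checking; both are guesses from the snippets shown, not a confirmed diagnosis. First, str.splitlines() returns a new list and leaves the string unchanged, so checks[0] is still a single character unless the result is assigned. Second, if the upload has Windows line endings (\r\n), writing the decoded text in text mode on Windows turns each \n into \r\n again, giving \r\r\n, which some editors show as extra-tall line spacing. A minimal sketch, with add_delimiters and first_line_ends_with_5 as made-up helper names:

```python
import re


def add_delimiters(raw_bytes):
    """Decode uploaded bytes and add ', ' after exponents e+002 to e+009."""
    text = raw_bytes.decode("utf-8")
    # Normalise Windows line endings so a later text-mode write
    # does not produce "\r\r\n".
    text = text.replace("\r\n", "\n")
    # One substitution instead of eight separate replace() calls.
    return re.sub(r"(e\+00[2-9])", r"\1, ", text)


def first_line_ends_with_5(raw_bytes):
    """Check the first line, keeping the list that splitlines() returns."""
    lines = raw_bytes.decode("utf-8").splitlines()
    return bool(lines) and lines[0].rstrip().endswith("5")


sample = b"1.0e+0022.5e+003\r\n3.0e+001\r\n"
delimited = add_delimiters(sample)
```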
I have a large .txt file that is the result of parsing a C file. It contains various blocks of data, but about 90% of them are useless to me. I'm trying to get rid of them and save the result to another file, but I'm having a hard time doing so. At first I tried to delete all the useless information in the unparsed file, but then it won't parse. My .txt file is built like this:
//Update: The files I'm trying to work on come from the pycparser module, which I found on GitHub.
File before being parsed looks like this:
And after using pycparser
file_to_parse = pycparser.parse_file(current_directory + r"\D_Out_Clean\file.d_prec")
I want to delete all blocks that start with the word Typedef. The module stores them in one big list that I can access via its ext attribute.
Currently my code looks like this:
len_of_ext_list = len(file_to_parse.ext)
i = 0
while i < len_of_ext_list:
    if 'TypeDecl' not in file_to_parse.ext[i]:
        print("NOT A TYPEDECL")
        print(file_to_parse.ext[i], type(file_to_parse.ext[i]))
        parsed_file_2 = open(current_directory + r"\Zadanie\D_Out_Clean_Parsed\clean_file.d_prec", "w+")
        parsed_file_2.write("%s%s\n" %("", file_to_parse.ext[i]))
        parsed_file_2.close
        #file_to_parse_2 = file_to_parse.ext[i]
    i+=1
But the above code only saves the last FuncDef from the unparsed file, and I don't know how to change it.
So now I'm trying to get rid of all typedefs in the parsed file, as they don't hold any valuable information for me. I want to know which function definitions and declarations are in the file, and what types of global variables are stored in it. I hope this is clearer now.
I suggest reading the entire input file into a string, and then doing a regex replacement:
import re

with open(current_directory + r"\D_Out\file.txt", "r+") as file:
    with open(current_directory + r"\D_Out_Clean\clean_file.txt", "w+") as output:
        data = file.read()
        # strip each "type" block: either a braced body or a statement ending in ";"
        data = re.sub(r'type(?:\n\{.*?\}|[^;]*?;)\n?', '', data, flags=re.S)
        output.write(data)  # write the cleaned text
Here is a regex demo showing that the replacement logic is working.
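As for why only the last FuncDef was saved: the loop in the question opens the output file with "w+" on every iteration, and that mode truncates the file each time, so only the final write survives. Here is a minimal sketch that opens the file once and filters as it writes. The Typedef and FuncDef classes are simple stand-ins invented for this example (they are not the pycparser classes), and write_without_typedefs is a made-up name.

```python
import os
import tempfile


class Typedef(object):
    """Stand-in for a typedef node (hypothetical, for illustration only)."""
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "Typedef " + self.name


class FuncDef(object):
    """Stand-in for a function definition node (hypothetical)."""
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "FuncDef " + self.name


def write_without_typedefs(nodes, out_path):
    """Write every node that is not a Typedef to out_path, one per line."""
    kept = [node for node in nodes if not isinstance(node, Typedef)]
    # Open once, outside the loop, so earlier writes are not truncated.
    with open(out_path, "w") as out:
        for node in kept:
            out.write("%s\n" % node)
    return len(kept)


demo_nodes = [Typedef("u8"), FuncDef("main"), Typedef("u16"), FuncDef("helper")]
demo_path = os.path.join(tempfile.mkdtemp(), "clean_file.txt")
demo_count = write_without_typedefs(demo_nodes, demo_path)
```

Testing the node's class with isinstance is an assumption about how you would tell the blocks apart; the substring test in the question may not behave the way it looks.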
In Think Python by Allen Downey, exercise 13-2 asks you to process any .txt file from gutenberg.org and skip the header information, which ends with something like "Produced by". This is the solution that the author gives:
import string

def process_file(filename, skip_header):
    """Makes a dict that contains the words from a file.
    box = temp storage unit to combine two following word in one string
    res = dict
    filename: string
    skip_header: boolean, whether to skip the Gutenberg header
    returns: map from string of two word from file to list of words that comes
    after them
    Last two word in text maps to None"""
    res = {}
    fp = open(filename)
    if skip_header:
        skip_gutenberg_header(fp)
    for line in fp:
        process_line(line, res)
    return res

def process_line(line, res):
    for word in line.split():
        word = word.lower().strip(string.punctuation)
        if word.isalpha():
            res[word] = res.get(word, 0) + 1

def skip_gutenberg_header(fp):
    """Reads from fp until it finds the line that ends the header.
    fp: open file object
    """
    for line in fp:
        if line.startswith('Produced by'):
            break
I really don't understand the flow of execution in this code. The code starts reading the file in skip_gutenberg_header(fp), which contains "for line in fp:"; it finds the needed line and breaks. However, the next loop picks up right where the break statement left off. But why? My vision of it is that there are two independent iterations here, both containing "for line in fp:", so shouldn't the second one start from the beginning?
No, it shouldn't re-start from the beginning. An open file object maintains a file position indicator, which gets moved as you read (or write) the file. You can also move the position indicator via the file's .seek method, and query it via the .tell method.
So if you break out of a for line in fp: loop you can continue reading where you left off with another for line in fp: loop.
BTW, this behaviour of files isn't specific to Python: all modern languages that inherit C's notion of streams and files work like this.
The .seek and .tell methods are mentioned briefly in the tutorial.
For a more in-depth treatment of file / stream handling in Python, please see the docs for the io module. There's a lot of info in that document, and some of that information is mainly intended for advanced coders. You will probably need to read it several times and write a few test programs to absorb what it says, so feel free to skim through it the first time you try to read... or the first few times. ;)
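Here is a small test program along those lines, as a sketch only: lines_after_header is a name invented for this example and the demo text is made up. It breaks out of one loop, lets a second loop carry on from the same position, and then uses .seek(0) to go back to the start.

```python
import os
import tempfile


def lines_after_header(path, marker):
    """Return (lines after the marker line, first line of the file)."""
    with open(path) as fp:
        for line in fp:
            if line.startswith(marker):
                break  # the file position now sits just past the marker line
        # A second loop over the same object continues from that position.
        rest = [line.rstrip("\n") for line in fp]
        # seek(0) moves the position back to the start of the file.
        fp.seek(0)
        first = fp.readline().rstrip("\n")
    return rest, first


# Demo file with a short made-up header.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(demo_path, "w") as demo_file:
    demo_file.write("Title\nProduced by X\nchapter one\nchapter two\n")

demo_rest, demo_first = lines_after_header(demo_path, "Produced by")
```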
My vision of it is that there are two independent iterations here both containing "for line in fp:", so shouldn't second one start form the beginning?
If fp were a list, then of course they would. However it's not -- it's a file-like object that has methods like seek, tell, and read, and it acts as its own iterator. File-like objects keep state: when you read a line from one, the read position in the file moves, so the next read starts at the following line.
This is commonly used to skip the header of tabular data (when you're not using a csv.reader, at least)
with open("/path/to/file") as f:
    headers = next(f).strip()  # first line
    for line in f:
        # iterate by-line for the rest of the file
        ...
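To see the difference between a list and a file-like object directly, here is a small sketch that uses io.StringIO as a stand-in for an open text file; is_own_iterator is a helper name made up for this example.

```python
import io


def is_own_iterator(obj):
    """True when iterating obj reuses obj itself, so its position is shared."""
    return iter(obj) is obj


words = ["a", "b", "c"]
stream = io.StringIO(u"a\nb\nc\n")  # behaves like an open text file

# Each iter(words) is a fresh iterator, so both calls start at "a".
first_from_list = [next(iter(words)), next(iter(words))]
# iter(stream) is the stream itself, so the second call continues at "b".
first_from_stream = [next(iter(stream)), next(iter(stream))]
```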
I'm new to Python. My second time coding in it. The main point of this script is to take a text file that contains thousands of lines of file names (sNotUsed file) and match it against about 50 XML files. The XML files may contain up to thousands of lines each and are formatted as most XML's are. I'm not sure what the problem with the code so far is. The code is not fully complete as I have not added the part where it writes the output back to an XML file, but the current last line should be printing at least once. It is not, though.
Examples of the two file formats are as follows:
TEXT FILE:
fileNameWithoutExtension1
fileNameWithoutExtension2
fileNameWithoutExtension3
etc.
XML FILE:
<blocks>
<more stuff="name">
<Tag2>
<Tag3 name="Tag3">
<!--COMMENT-->
<fileType>../../dir/fileNameWithoutExtension1</fileType>
<fileType>../../dir/fileNameWithoutExtension4</fileType>
</blocks>
MY CODE SO FAR:
import os
import re
sNotUsed=list()
sFile = open("C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open snotused txt file
for lines in sFile:
    sNotUsed.append(lines)
#sNotUsed = sFile.readlines() # read all lines and assign to list
sFile.close() # close file
xmlFiles=list() # list of xmlFiles in directory
usedS=list() # list of S files that do not match against sFile txt
search = "\w/([\w\-]+)"
# getting the list of xmlFiles
filelist=os.listdir('C:\Users\xxx\Desktop\dir')
for files in filelist:
    if files.endswith('.xml'):
        xmlFile = open(files, "r+") # open first file with read + write access
        xmlComp = xmlFile.readlines() # read lines and assign to list
        for lines in xmlComp: # iterate by line in list of lines
            temp = re.findall(search, lines)
            #print temp
            if temp:
                if temp[0] in sNotUsed:
                    print "yes" # debugging. I know there is at least one match for sure, but this is not being printed.
TO HELP CLEAR THINGS UP:
Sorry, I guess my question wasn't very clear. I would like the script to go through each XML line by line and see if the FILENAME part of that line matches with the exact line of the sNotUsed.txt file. If there is match then I want to delete it from the XML. If the line doesn't match any of the lines in the sNotUsed.txt then I would like it be part of the output of the new modified XML file (which will overwrite the old one). Please let me know if still not clear.
EDITED, WORKING CODE
import os
import re
import codecs
sFile = open(r"C:\Users\xxx\Desktop\sNotUsed.txt", "r") # open sNotUsed txt file
sNotUsed=sFile.readlines() # read all lines and assign to list
sFile.close() # close file
search = re.compile(r"\w/([\w\-]+)")
sNotUsed=[x.strip().replace(',','') for x in sNotUsed]
directory=r'C:\Users\xxx\Desktop\dir'
filelist=os.listdir(directory) # getting the list of xmlFiles
# for each file in the list
for files in filelist:
    if files.endswith('.xml'): # make sure it is an XML file
        xmlFile = codecs.open(os.path.join(directory, files), "r", encoding="UTF-8") # open first file with read
        xmlComp = xmlFile.readlines() # read lines and assign to list
        print xmlComp
        xmlFile.close() # closing the file since the lines have already been read and assigned to a variable
        xmlEdit = codecs.open(os.path.join(directory, files), "w", encoding="UTF-8") # opening the same file again and overwriting all existing lines
        for lines in xmlComp: # iterate by line in list of lines
            #headerInd = re.search(search, lines) # used to get the headers, comments, and ending blocks
            temp = re.findall(search, lines) # finds all strings that match the regular expression compiled above and makes a list for each
            if temp: # if the list is not empty
                if temp[0] not in sNotUsed: # if the first (and only) value in each list is not in the sNotUsed list
                    xmlEdit.write(lines) # write it in the file
            else: # if the list is empty
                xmlEdit.write(lines) # write it (used to preserve the beginning and ending blocks of the XML, as well as comments)
        xmlEdit.close() # close the rewritten file so everything is flushed to disk
There are a lot of things to say, but I'll try to stay concise.
PEP8: Style Guide for Python Code
You should use lower case with underscores for local variables.
Take a look at PEP8: Style Guide for Python Code.
File objects and with statement
Use the with statement to open a file, see: File Objects: http://docs.python.org/2/library/stdtypes.html#bltin-file-objects
Escape Windows filenames
Backslashes in Windows filenames can cause problems in Python programs. You must escape the string using double backslashes or use raw strings.
For example: if your Windows filename is "dir\notUsed.txt", you should escape it like this: "dir\\notUsed.txt" or use a raw string r"dir\notUsed.txt". If you don't do that, the "\n" will be interpreted as a newline!
Note: in Python 2, if you need to support Unicode filenames, you can use Unicode raw strings: ur"dir\notUsed.txt" (the ur prefix is not valid syntax in Python 3).
See also question 19065115 on Stack Overflow.
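A quick way to convince yourself of the escaping rules above; the file name is just the example from the text, and the variable names are made up:

```python
escaped = "dir\\notUsed.txt"   # backslash escaped by doubling it
raw = r"dir\notUsed.txt"       # raw string: backslashes are kept as-is
broken = "dir\notUsed.txt"     # "\n" here is a newline, not backslash + n

same_path = (escaped == raw)
has_newline = "\n" in broken
```

The broken variant is one character shorter than the other two, because the two characters \ and n have collapsed into a single newline.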
Store the filenames in a set: it is an optimized collection without duplicates
not_used_path = ur"dir\sNotUsed.txt"
with open(not_used_path) as not_used_file:
    not_used_set = set([line.strip() for line in not_used_file])
Compile your regex
It is more efficient to compile a regex when it is used numerous times. Again, you should use raw strings to avoid backslash interpretation.
pattern = re.compile(r"\w/([\w\-]+)")
Warning: the os.listdir() function returns a list of filenames, not a list of full paths. See this function in the Python documentation.
In your example, you list the directory 'C:\Users\xxx\Desktop\dir' with os.listdir() and then open each XML file in it with open(files, "r+"). This fails unless your current working directory happens to be that directory. The usual approach is to use the os.path.join() function like this:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    desktop_path = os.path.join(desktop_dir, filename)
If you want to extract the filename's extension, you can use the os.path.splitext() function.
desktop_dir = r'C:\Users\xxx\Desktop\dir'
for filename in os.listdir(desktop_dir):
    if os.path.splitext(filename)[1].lower() != '.xml':
        continue
    desktop_path = os.path.join(desktop_dir, filename)
You can simplify this with a list comprehension:
desktop_dir = r'C:\Users\xxx\Desktop\dir'
xml_list = [os.path.join(desktop_dir, filename)
            for filename in os.listdir(desktop_dir)
            if os.path.splitext(filename)[1].lower() == '.xml']
Parse an XML file
How to parse an XML file? This is a great question!
There are several possibilities:
- use regex, efficient but dangerous;
- use a SAX parser, efficient too but confusing and difficult to maintain;
- use a DOM parser, less efficient but clearer...
Consider using the lxml package (see: http://lxml.de/)
The regex approach is dangerous because, the way you read the file, you ignore the XML encoding. And that is bad! Very bad indeed! XML files are usually encoded in UTF-8, so you should first decode the UTF-8 byte stream. A simple way to do that is to use codecs.open() to open an encoded file.
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
With this solution, the full XML content is stored in the content variable as a Unicode string. You can then use a Unicode regex to parse the content.
Finally, you can use a set intersection to find out whether a given XML file has names in common with the text file.
for xml_path in xml_list:
    with codecs.open(xml_path, "r", encoding="UTF-8") as xml_file:
        content = xml_file.read()
    actual_set = set(pattern.findall(content))
    print(not_used_set & actual_set)
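For the filtering you describe in the question (dropping the lines whose file name is in sNotUsed.txt), a line-based sketch could look like the following. One reason the original test never printed "yes" is probably that the lines read from the text file still carried their trailing newline, which is why the working version strips them. The function name filter_xml_lines and the sample data are made up; the regex is the one from the question.

```python
import re

PATTERN = re.compile(r"\w/([\w\-]+)")


def filter_xml_lines(lines, not_used):
    """Keep every line except those whose first captured name is in not_used."""
    kept = []
    for line in lines:
        names = PATTERN.findall(line)
        if names and names[0] in not_used:
            continue  # this line references an unused file: drop it
        kept.append(line)  # headers, comments and all other lines are kept
    return kept


# Strip the names so a trailing "\n" cannot prevent a match.
not_used_set = set(name.strip() for name in ["fileNameWithoutExtension1\n"])
sample_lines = [
    "<blocks>\n",
    "<fileType>../../dir/fileNameWithoutExtension1</fileType>\n",
    "<fileType>../../dir/fileNameWithoutExtension4</fileType>\n",
    "</blocks>\n",
]
filtered = filter_xml_lines(sample_lines, not_used_set)
```

Using a set for the names keeps each membership test cheap even with thousands of entries.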
The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at python. I'm having a lot of trouble manipulating files and for me every example I look at seems difficult.
Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand maybe your file is really huge - multiple GBs at least. In which case: the last line is probably terminated with a new line character, if you seek to that position you can overwrite it with the new text at the end of the last line.
So perhaps:
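import os

# "r+b" opens the file for update without truncating it ("wb" would erase the contents)
f = open("foo.file", "r+b")
f.seek(-len(os.linesep), os.SEEK_END)
f.write(b"new text at end of last line" + os.linesep.encode())
f.close()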
(Modulo line endings on different platforms)
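To deal with the line-ending caveat, a slightly more defensive sketch of the same idea is below. It checks which newline, if any, actually ends the file instead of relying on os.linesep. The function name append_to_last_line is made up, and the sample data is modelled on the line from the question.

```python
import os
import tempfile


def append_to_last_line(path, suffix):
    """Append suffix (bytes) to the last line of path, in place."""
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Look at the final two bytes to see how the file ends.
        f.seek(max(size - 2, 0))
        tail = f.read()
        if tail.endswith(b"\r\n"):
            newline = b"\r\n"
        elif tail.endswith(b"\n"):
            newline = b"\n"
        else:
            newline = b""  # no trailing newline (or an empty file)
        # Overwrite the trailing newline with the suffix, then restore it.
        f.seek(size - len(newline))
        f.write(suffix + newline)


fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as demo_file:
    demo_file.write(b"28-dez,10,\n29-dez,40,\n")

append_to_last_line(demo_path, b"90,100,50")
with open(demo_path, "rb") as demo_file:
    demo_result = demo_file.read()
```

Only the tail of the file is read and rewritten, so the cost does not depend on the file size.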
To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python
MYFILE="file.txt"
# read the file into a list of lines
with open(MYFILE, 'r') as f:
    lines = f.readlines()
# now edit the last line of the list of lines
new_last_line = (lines[-1].rstrip() + ",90,100,50")
lines[-1] = new_last_line
# now write the modified list back out to the file
with open(MYFILE, 'w') as f:
    f.writelines(lines)
If the file is very large then this approach will not work well, because this reads all the file lines into memory each time and writes them back out to the file, which is very inefficient. For a small file however this will work fine.
Don't work with files directly, make a data structure that fits your needs in form of a class and make read from/write to file methods.
I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap
def insert_import(filename, text):
    # text must be a byte string, since mmap works on bytes
    if len(text) < 1:
        return
    f = open(filename, 'r+b')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    m.resize(origSize + len(text))
    pos = 0
    while True:
        l = m.readline()
        if l.startswith((b'import', b'from')):
            continue
        else:
            # first line that is not an import: insert just before it
            pos = m.tell() - len(l)
            break
    # shift the rest of the file down, then copy the new text into the gap
    m[pos+len(text):] = m[pos:origSize]
    m[pos:pos+len(text)] = text
    m.close()
    f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.