How to parse XML data manually (without xml.etree) - python

I've been playing with XML data in a text file - just some general stuff.
I played around with xml.etree and its commands, but now I am wondering how to get rid of the tags manually and write all that data into a new file.
I figure it would take a lot of str.split calls or a loop to get rid of the tags.
Right now I have this to start (not working - it just copies the data):
def summarizeData(fileName):
    readFile = open(fileName, "r").read()
    newFile = input("")
    writeFile = open(newFile, "w")
    with open(fileName, "r") as file:
        for tags in file:
            Xtags = tags.split('>')[1].split('<')[0]
    writeFile.write(readFile)
    writeFile.close
So far it just copies the data, including the tags. I figured splitting out the tags would do the trick, but it seems like it doesn't do anything. Would it be possible to do this manually, or do I have to use xml.etree?

The reason you don't see any changes is that you're just writing the data you read from fileName into readFile in this line:
readFile = open(fileName, "r").read()
... straight back to writeFile in this line:
writeFile.write(readFile)
Nothing you do in between (with Xtags etc.) has any effect on readFile, so it's getting written back as-is.
Apart from that issue, which you could fix with a little work ... parsing XML is nowhere near as straightforward as you think it is. You have to think about tags which span multiple lines, angle brackets that can appear inside attribute values, comments and CDATA sections, and a host of other subtle issues.
In summary: use a real XML parser like xml.etree.
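As a point of comparison, here is roughly what that task looks like with xml.etree - a minimal sketch (summarize_data and the sample document are invented for illustration):

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def summarize_data(in_path, out_path):
    # Let the parser handle tags, attributes, comments and CDATA;
    # itertext() yields just the text content, in document order.
    tree = ET.parse(in_path)
    with open(out_path, "w") as out:
        for text in tree.getroot().itertext():
            if text.strip():
                out.write(text.strip() + "\n")

# Demo with throwaway files
src = tempfile.mktemp(suffix=".xml")
dst = tempfile.mktemp(suffix=".txt")
with open(src, "w") as f:
    f.write("<data><name>Alice</name><name>Bob</name></data>")
summarize_data(src, dst)
with open(dst) as f:
    result = f.read()
os.remove(src)
os.remove(dst)
print(result)  # Alice, then Bob, one per line
```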

Related

Removing BOM characters after adding <?xml version="1.0" encoding="UTF-8"?> to SQL-generated XML file with Python

I'm using Microsoft SQL Server Management Studio to create an XML file. This file needs <?xml version="1.0" encoding="UTF-8"?> at the top to be uploaded properly. I understand that this is fairly normal and that I need to figure out how to add that line myself.
To add the line, I'm calling each of my files and modifying them with the following function:
from datetime import datetime

def append_prologue(file, orgID, schema):
    timestamp = datetime.today().strftime('%Y%m%d')
    new_name = f'{orgID}_000_2022TSDS_{timestamp}1500_' + schema
    new_file = file.parent.parent / 'results/with_prologue' / new_name
    if new_file.exists():
        print(f'{new_file.name} already exists')
    with open(file, 'r') as original:
        data = original.read()
        data = data[3:]  # how the original writer dealt with the issue
    with open(new_file, 'w+') as modified:
        modified.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + data)
    return
However, this creates a problem: the file gets written, but it contains "\ufeff", which I understand to be a BOM, and then the XML file can't be read properly. I took over this project from a coworker who left my company, and they wrote this code. They addressed the issue by slicing off the first characters (data[3:]), but that doesn't seem to work for me. I also suspect there's probably a more systematic way of doing it.
What am I doing wrong? Is there a way to remove these characters when I write the file? Should I be approaching this differently?
The codecs package should do the trick:
import codecs

StreamReader = codecs.getreader('utf-8-sig')
with StreamReader(open(file, 'rb')) as original:
    ...
Or a much shorter version:
with codecs.open(file, 'r', 'utf-8-sig') as original:
    ...
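For what it's worth, the built-in open also accepts the utf-8-sig codec directly, which strips a leading BOM on read - a minimal sketch with a throwaway file:

```python
import os
import tempfile

# Simulate a BOM-prefixed export, then read it back with utf-8-sig
path = tempfile.mktemp(suffix=".xml")
with open(path, "w", encoding="utf-8") as f:
    f.write("\ufeff<root/>")          # "\ufeff" is the BOM

with open(path, "r", encoding="utf-8-sig") as f:
    data = f.read()                   # the BOM is stripped automatically

os.remove(path)
print(data)  # <root/>
```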

Adding text at the beginning of multiple txt files in a folder - problem of overwriting the text inside

I'm trying to add the same text at the beginning of all the txt files in a folder.
With this code I can do it, but there is a problem: I don't know why it overwrites part of the text at the beginning of each txt file.
import glob
import io
import ntpath
import os

output_dir = "output"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

for f in glob.glob("*.txt"):
    with open(f, 'r', encoding="utf8") as inputfile:
        with open('%s/%s' % (output_dir, ntpath.basename(f)), 'w', encoding="utf8") as outputfile:
            for line in inputfile:
                outputfile.write(line.replace(line, "more_text" + line + "text_that_is_overwrited"))
            outputfile.seek(0, io.SEEK_SET)
            outputfile.write('text_that_overwrite')
            outputfile.seek(0, io.SEEK_END)
            outputfile.write("more_text")
The content of the txt files I'm trying to edit starts like this (each line indented with 4 spaces):
    text_line_1
    text_line_2
The result is:
On file1.txt: text_that_overwriteited
On file1.txt: text_that_overwriterited
Your mental model of how writing a file works seems to be at odds with what's actually happening here.
If you seek back to the beginning of the file, you will start overwriting all of the file. There is no such thing as writing into the middle of a file. A file - at the level of abstraction where you have open and write calls - is just a stream; seeking back to the beginning of the stream (or generally, seeking to a specific position in the stream) and writing replaces everything which was at that place in the stream before.
Granted, there is a lower level where you could actually write new bytes into a block on the disk whilst that block still remains the storage for a file which can then be read as a stream. With most modern file systems, the only way to make this work is to replace that block with exactly the same amount of data, which is very rarely feasible. In other words, you can't replace a block containing 1024 bytes with data which isn't also exactly 1024 bytes. This is so marginally useful that it's simply not an operation which is exposed to the higher level of the file system.
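This is easy to demonstrate with a throwaway file - seeking back to the start and writing two characters replaces the first two characters, it doesn't insert:

```python
import os
import tempfile

path = tempfile.mktemp()
with open(path, "w") as f:
    f.write("abcdefgh")

with open(path, "r+") as f:
    f.seek(0)          # back to the beginning of the stream
    f.write("XY")      # overwrites in place; nothing is inserted

with open(path) as f:
    data = f.read()
os.remove(path)
print(data)  # XYcdefgh
```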
With that out of the way, the proper way to "replace lines" is to not write those lines at all. Instead, write the replacement, followed by whichever lines were in the original file.
It's not clear from your question what exactly you want overwritten, so this is just a sketch with some guesses around that part.
import glob
import os

output_dir = "output"
# prefer exist_ok=True over if not os.path.exists()
os.makedirs(output_dir, exist_ok=True)

for f in glob.glob("*.txt"):
    # use a single with statement
    # prefer os.path.basename over ntpath.basename; use os.path.join
    with open(f, 'r', encoding="utf8") as inputfile, \
            open(os.path.join(output_dir, os.path.basename(f)), 'w', encoding="utf8") as outputfile:
        for idx, line in enumerate(inputfile):
            if idx == 0:
                outputfile.write("more text")
                outputfile.write(line.rstrip('\n'))
                outputfile.write("text that is overwritten\n")
                continue
            # else:
            outputfile.write(line)
        outputfile.write("more_text\n")
Given an input file like
here is some text
here is some more text
this will create an output file like
more texthere is some texttext that is overwritten
here is some more text
more_text
where the first line is a modified version of the original first line, and a new line is appended after the original file's contents.
I found this elsewhere on Stack Overflow: Why does my text file keep overwriting the data on it?
Essentially, the w mode truncates the file before writing, which is why the existing text gets overwritten.
Also, you seem to be writing a sitemap manually. If you are using a web framework like Flask or Django, they have plugin or built-in support for auto-generated sitemaps — you should use that instead. Alternatively, you could create an XML template for the sitemap using Jinja or DTL. Templates are not just for HTML files.

Python not printing out special characters (extracted from an html file) when I write to another html file

I am extracting data from an html file and outputting it to another html file template using .replace. I wrote it so that on double clicking my script, the page opens up in a browser, ready to be printed.
Everything worked fine until I ran into an extracted string that had a special character in it. On double click, nothing would happen (the web browser would not open). However, it does work when I run it straight from IDLE, with one issue: the special character comes out as a weird combination of characters.
I haven't tested this with other special characters, but my problem right now is happening with Nyström, which comes up as NystrÃ¶m in my outputted file.
I figure this has something to do with encoding/decoding in 'utf-8', however I do not know enough about the subject to solve this issue myself post research.
When I open the read and write files, I make sure they have encoding='utf-8' as the third argument.
Finally, when I print the string I'm having trouble with in IDLE, it comes out fine. The issue only pops up when I write it to my file.
Below are my file read and write calls if that helps
import os

path = os.path.dirname(os.path.realpath(__file__))
htmlFile = open(path + input_filename, "r", encoding="utf-8")
htmlString = htmlFile.read()

infile = open(template_path, 'r', encoding='utf-8')
contents = infile.read()
After this I .replace certain parts of content with my extracted strings put into a dictionary named data.
eg:
(please ignore inconsistent naming conventions)
data = dict()
data['name_email'] = email
contents = contents.replace('_name_email', data['name_email'])
then:
outfile = open(output_filename, 'w', encoding='utf-8')
outfile.write(contents)
I am running this on python 3.6
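For reference, ö turning into Ã¶ is the classic signature of UTF-8 bytes being decoded as Latin-1 somewhere along the pipeline - a minimal sketch of that round trip (nothing here is from the original post):

```python
# "ö" (U+00F6) encodes to two UTF-8 bytes (0xC3 0xB6); decoding those
# bytes as Latin-1 yields the two characters "Ã¶" -- classic mojibake.
name = "Nyström"
mojibake = name.encode("utf-8").decode("latin-1")
print(mojibake)  # NystrÃ¶m

# Reversing the mistaken step recovers the original string
roundtrip = mojibake.encode("latin-1").decode("utf-8")
print(roundtrip)  # Nyström
```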

Replace string in specific line of nonstandard text file

Similar to this posting: Replace string in a specific line using python; however, results were not forthcoming in my slightly different case.
I am working with Python 3 on Windows 7, attempting to batch edit some files in a directory. They are basically text files with a .LIC extension; I'm not sure if that is relevant to my issue here. I am able to read the file into Python without issue.
My aim is to replace a specific string on a specific line in this file.
import os
import re

groupname = 'Oldtext'
aliasname = 'Newtext'
with open('filename') as f:
    data = f.readlines()
    data[1] = re.sub(groupname, aliasname, data[1])
    f.writelines(data[1])
    print(data[1])
print('done')
When running the above code I get an UnsupportedOperation: not writable error, so I am having some issue writing the changes back to the file. Based on suggestions in other posts, I added the w option to the open call (open('filename', "w")). This causes all text in the file to be deleted.
Based on another suggestion, the r+ option was tried. This leads to successful editing of the file; however, instead of editing the correct line, the edited line is appended to the end of the file, leaving the original intact.
Writing a changed line into the middle of a text file is not going to work unless it's exactly the same length as the original - which is the case in your example, but you've got some obvious placeholder text there so I have no idea if the same is true of your actual application code. Here's an approach that doesn't make any such assumption:
with open('filename', 'r') as f:
    data = f.readlines()

data[1] = re.sub(groupname, aliasname, data[1])

with open('filename', 'w') as f:
    f.writelines(data)
EDIT: If you really wanted to write only the single line back into the file, you'd need to use f.tell() BEFORE reading the line, to remember its position within the file, and then f.seek() to go back to that position before writing.
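That tell/seek variant might look like the sketch below (the file contents are made up; it only works because the replacement is exactly the same length as the original line):

```python
import os
import re
import tempfile

groupname, aliasname = "Oldtext", "Newtext"   # same length on purpose

path = tempfile.mktemp(suffix=".LIC")
with open(path, "w") as f:
    f.write("header\nsome Oldtext here\nfooter\n")

with open(path, "r+") as f:
    f.readline()                 # skip line 0
    pos = f.tell()               # remember where line 1 starts
    line = f.readline()
    f.seek(pos)                  # go back to that position before writing
    f.write(re.sub(groupname, aliasname, line))

with open(path) as f:
    result = f.read()
os.remove(path)
print(result)
```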

Writing to XML - Missing lines

Python/programming beginner here, working on parsing/extracting data from XML.
Goal: take malformed XML (multiple XML documents in one .data file) and write each to an individual xml file. FYI - each xml begins with the same declaration, and there are a total of 4 in the file.
Approach: (1) use readlines() on the file, (2) find the index of each xml declaration, (3) loop through slices of the list, writing each line to a file. Code below; apologies if it sucks :)
for i, x in enumerate(decl_indxs):
    xml_file = open(file, 'w')
    if i == 4:
        for line in file_lines[x:]:
            xml_file.write(line)
    else:
        for line in file_lines[x:decl_indxs[i+1]]:
            xml_file.write(line)
Problem: The first 3 xmls are created without issue. The 4th xml only writes the first 238 of 396 lines.
Troubleshooting: I modified the code to print out the list slice used for the final loop and it was good. I also looped through just the 4th list slice and it output correctly.
Help: Can anyone explain why this would happen? It would also be great to get advice on improving my approach. The more info the better. Thanks
I don't think your approach of finding indexes is a good one - most probably you messed up an index somewhere, and the bad news is that this is not easy to debug, because there are a lot of meaningless integer values floating around. (Note also that you never close xml_file, so the final file's buffered writes may never be flushed to disk - which would explain a truncated fourth file.) Let me suggest some more useful approaches.
As far as I understand your issue, you need to:
1. Use a with context manager to open the original file containing multiple XMLs.
2. Split the content of the original file into several strings, based on finding the known declaration header string <?xml.
3. Write each individual valid XML string to its own file, again using a with context manager.
If you need to do further work with these XMLs, you should definitely look at the specialized XML parsers (xml.etree, lxml), and never work with them as strings.
Code example:
def split_to_several_xmls(original_file_path):
    # assuming that the original source is correctly formatted, i.e. each line starts with "<" (omitting spaces)
    # otherwise you need to parse by chars, not by lines
    with open(original_file_path) as f:
        data = f.read()
    resulting_lists = []
    for line in data.split('\n'):
        if not line or not line.strip():  # ignore empty lines
            continue
        if line.strip().startswith('<?xml '):
            resulting_lists.append([])  # create a new list for the new xml chunk
        if not resulting_lists:  # ignore everything before the first xml declaration
            continue
        resulting_lists[-1].append(line)  # add the current line to the current xml chunk
    resulting_strings = ['\n'.join(e) for e in resulting_lists]
    # i.e. convert the lines back into strings - each string is one valid xml chunk
    return resulting_strings

def save_xmls(xml_strings, filename_base):
    for i, xml_string in enumerate(xml_strings):
        filename = '{base}{i}.xml'.format(base=filename_base, i=i)
        with open(filename, mode='w') as f:
            f.write(xml_string)

def main():
    xml_strings = split_to_several_xmls('original.txt')  # file with multiple xmls in one file
    save_xmls(xml_strings, 'result')

if __name__ == '__main__':
    main()
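To sanity-check the splitting idea without touching the filesystem, the same line-by-line logic can be run on an in-memory string (the two tiny documents here are made up):

```python
# Same splitting logic as split_to_several_xmls, applied to a string
data = '<?xml version="1.0"?>\n<a/>\n<?xml version="1.0"?>\n<b/>\n'

chunks = []
for line in data.split('\n'):
    if not line.strip():              # ignore empty lines
        continue
    if line.strip().startswith('<?xml'):
        chunks.append([])             # start a new chunk at each declaration
    if not chunks:                    # ignore anything before the first declaration
        continue
    chunks[-1].append(line)

result = ['\n'.join(c) for c in chunks]
print(result)  # two strings, one per XML document
```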
