Below is the sample XML file consisting of 3 data-sources. In each data-source there is a validation block that may contain the setting in question.
Out of the 3 data-sources, 2 of them do not have the setting, and the one that does have it has the value false.
I want to add the setting to the data-sources where it is missing and change its value to true in the data-source where it is present.
SAMPLE XML snippets:
Using DOM
# import minidom
import xml.dom.minidom as mdom

# open with minidom parser
DOMtree = mdom.parse("Input.xml")
data_set = DOMtree.documentElement

# get all validation elements from data_set
validations = data_set.getElementsByTagName("validation")

# go through every validation element
for validation in validations:
    # get the element by tag name
    use_fast_fail = validation.getElementsByTagName('use-fast-fail')
    # if the tag exists
    if use_fast_fail:
        if use_fast_fail[0].firstChild.nodeValue == "false":
            # if the tag value is false, replace it with true
            use_fast_fail[0].firstChild.nodeValue = "true"
    # if the tag does not exist, add it
    else:
        newTag = DOMtree.createElement("use-fast-fail")
        newTag.appendChild(DOMtree.createTextNode("true"))
        validation.appendChild(newTag)

# write into the output file
with open("Output.xml", 'w') as output_xml:
    output_xml.write(DOMtree.toprettyxml())
Using simple file read and string search with regex
# import regex
import re

# open the input file with "read" option
input_file = open("Input.xml", "r")
# put the content into a list
contents = input_file.readlines()
# close the file
input_file.close()

# loop to check the file line by line
# for every entry - get the index and value
for index, value in enumerate(contents):
    # check whether "value" contains the tag with a false value
    if re.search('<background-validation>false</background-validation>', value):
        # if the condition is True - change to the desired value
        contents[index] = "<background-validation>true</background-validation>\n"
    # check whether "value" contains the tag which always comes just before the desired one
    if re.search('validate-on-match', value):
        # check whether the next entry in the list contains the desired tag
        if not re.search('<background-validation>', contents[index + 1]):
            # if not, add the tag
            contents.insert(index + 1, "<background-validation>true</background-validation>\n")

# open the file with "write" option
output_file = open("Output.xml", "w")
# join all contents
contents = "".join(contents)
# write into the output file
output_file.write(contents)
output_file.close()
Note: in the second option, inserting the missing line assumes that every data-source block has the same structure and order; otherwise we may need to check multiple conditions.
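If that assumption does not hold, a structure-independent variant is possible with xml.etree.ElementTree. This is only a sketch, reusing the background-validation tag from the second snippet; the file names are illustrative:
import xml.etree.ElementTree as ET

tree = ET.parse("Input.xml")

# visit every validation block, no matter where it sits in the document
for validation in tree.getroot().iter("validation"):
    bg = validation.find("background-validation")
    if bg is None:
        # tag is missing - create it
        bg = ET.SubElement(validation, "background-validation")
    # whether it existed before or not, force the value to "true"
    bg.text = "true"

tree.write("Output.xml")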
My program parses values from an XML file and then puts them into a dictionary.
Here I've used a for loop to iterate over all the tags in the file, along with their attributes and text.
But when there is a sub-tag with the same name, such as the nested <name>, it overwrites the earlier <name>.
All of this runs inside the for loop.
Now, I want to stop the loop from overwriting a value once it has been entered into the dictionary.
import pprint as p  # Importing pprint/pretty print for formatting dict
import xml.etree.ElementTree as ETT  # Importing xml.etree for xml parsing
import csv  # Importing csv to write in CSV file


def fetch_data():
    # Asking input from user for the file path
    xml_file_path = input(
        "Enter the path to the file. \n Note* use 'Double back slash' instead of 'single back slash': \n \t \t \t \t \t")
    # Initiating variable parser which will parse from the file
    parser = ETT.parse(xml_file_path)
    # Initiating variable root to get root element which will be useful in further queries
    root = parser.getroot()
    # Initiating variable main_d which is the dictionary in which the parsed data will be stored
    main_d = {}
    for w in root.iter():  # Making a for loop which will iter all tags out of the file
        value = w.attrib  # Initiating variable value for storing attributes where attributes are in the form of dictionaries
        value['value'] = w.text  # Hence, appending the text/value of the tag in the value dict
        if w not in main_d:
            main_d[w.tag] = value  # Writing all the keys and values in main_d
        else:
            main_d.pop(w)
    p.pprint(main_d, sort_dicts=False, width=200, depth=100)


fetch_data()
This is what the XML looks like:
<?xml version="1.0" encoding="UTF-8"?>
<Data data_version="1">
    <modified_on_date>some_time</modified_on_date>
    <file_version>some version</file_version>
    <name>h</name>
    <class>Hot</class>
    <fct>
        <fc_tem di="value1" un="value2" unn="value3">some integer</fc_tem>
        <fc_str di="value1" un="value2" unn="value3">some integer</fc_str>
        <DataTable name="namee" type="0" columns="2" rows="2" version="some version">
            <name>this will overwrite the first one up there</name>
            <type>0</type>
        </DataTable>
    </fct>
</Data>
This is my progress so far.
Taking into account the confidentiality of the program, that's all I can share.
First of all, thanks to @PatrickArtner, his way worked:
you just have to use w.tag instead of w.
The full snippet is:
# This program is to fetch/parse data (tags, attribs, text) from the XML/XMT
# file provided

# Importing required libraries
import pprint as p  # Importing pprint/pretty print for formatting dict
import xml.etree.ElementTree as ETT  # Importing xml.etree for xml parsing
import csv  # Importing csv to write in CSV file


# Creating a method/function fetch_data() to fetch/parse data from the given XML/XMT file
def fetch_data():
    # Asking input from user for the file path
    xml_file_path = input(
        "Enter the path to the file \n \t \t :")
    # Asking input from user for the name of the csv file which will be created
    file_name = input(str("Enter the file name with extension you want as output \n \t \t : "))
    # Initiating variable parser which will parse from the file
    parser = ETT.parse(xml_file_path)
    # Initiating variable root to get root element which will be useful in further queries
    root = parser.getroot()
    # Initiating variable main_d which is the dictionary in which the parsed data will be stored
    main_d = {}
    for w in root.iter():  # Making a for loop which will iter all tags out of the file
        value = w.attrib  # Initiating variable value for storing attributes where attributes are in the form of dictionaries
        value['value'] = w.text  # Hence, appending the text/value of the tag in the value dict
        if w.tag not in main_d:  # Checking if the tag exists or not, this will help to avoid overwriting of tag values
            main_d[w.tag] = value  # Writing all the keys and values in main_d
        else:
            pass
    p.pprint(main_d, sort_dicts=False, width=200)  # This is just to check the output
    with open(file_name, 'w+', buffering=True) as file:  # Opening a file with the filename provided by the user
        csvwriter = csv.writer(file, quoting=csv.QUOTE_ALL)  # Initiating a variable csvwriter for the file and passing the QUOTE_ALL arg.
        for x in main_d.keys():  # Creating a loop to write the tags
            csvwriter.writerow({x})  # Writing the tags


fetch_data()
What I'm looking to achieve:
The code below filters through a parsed HTML page looking for specific values. Each value is then added to its own list in the form of a dictionary. Once all the values have been added to the lists, the dictionaries within them are combined into a JSON blob that I can then export.
Note - This is part of a quick PoC, so it was written quick and dirty. Forgive me.
My problem:
When the dictionaries from the following lists are combined, I do not encounter any issues when exporting the blob:
jobs
names
dates
summaries
However, when the locations list is added to the combination, an IndexError exception is raised, as shown in the screenshot below:
(Screenshot: IndexError encountered)
My Analysis:
I've found that sometimes the value is not present in the parsed HTML for reasons I cannot control, i.e. it was not added by the user when the posting was created. The issue in this case is that the locations list has a length of 14 while the other lists all have a length of 15, which causes the IndexError exception when I combine the lists using a for loop.
My Question:
As shown in my code below, I'm trying to handle the issue by assigning a placeholder value, "null", when the scraped value is not found, but for some reason the placeholder is not applied and I still encounter the IndexError exception. Any help would be appreciated, thank you in advance.
My Code:
import ast
import sys

# Create empty lists [Global]
jobs = []
names = []
dates = []
summaries = []
locations = []


# Function - Ingest parsed HTML data | Filter out required values
def getJobs(parsedHTML):
    # Loop - Get job title
    for div in parsedHTML.find_all(name='h2', attrs={'class': 'title'}):
        for a in div.find_all(name='a', attrs={'data-tn-element': 'jobTitle'}):
            val = str(a.getText().strip())
            if val is None:
                locations.append({"job-title": "null"})
            else:
                dictItem = {"job-title": f"{val}"}
                jobs.append(dictItem)
    # Loop - Get job poster's name
    for div in parsedHTML.find_all(name='div', attrs={'class': 'sjcl'}):
        for span in div.find_all(name='span', attrs={'class': 'company'}):
            val = str(span.getText().strip())
            if val is None:
                locations.append({"company-name": "null"})
            else:
                dictItem = {"company-name": f"{val}"}
                names.append(dictItem)
    # Loop - Get the date the job post was created
    for div in parsedHTML.find_all(name='div', attrs={'class': 'result-link-bar'}):
        for span in div.find_all(name='span', attrs={'class': 'date date-a11y'}):
            val = str(span.getText().strip())
            if val is None:
                locations.append({"date-created": "null"})
            else:
                dictItem = {"date-created": f"{val}"}
                dates.append(dictItem)
    # Loop - Get short job description
    for divParent in parsedHTML.find_all(name='div', attrs={'class': 'result'}):
        for divChild in divParent.find_all(name='div', attrs={'class': 'summary'}):
            val = str(divChild.getText().strip())
            if val is None:
                locations.append({"short-description": "null"})
            else:
                dictItem = {"short-description": f"{val}"}
                summaries.append(dictItem)
    # Loop - Get job location
    for div in parsedHTML.find_all(name='div', attrs={'class': 'sjcl'}):
        for span in div.find_all(name='span', attrs={'class': 'location'}):
            val = str(span.getText().strip())
            if val is None:
                locations.append({"location": "null"})
            else:
                dictItem = {"location": f"{val}"}
                locations.append(dictItem)
# Function - Generate test data
def testData(parsedHTML, typeProc):
    # typeProc == True | Export data to text files
    if typeProc:
        #getJobs(parsedHTML)
        with open("jobs.txt", "w") as file:
            for line in jobs:
                file.write(str(line))
                file.write("\n")
            file.close()
        with open("names.txt", "w") as file:
            for line in names:
                file.write(str(line))
                file.write("\n")
            file.close()
        with open("dates.txt", "w") as file:
            for line in dates:
                file.write(str(line))
                file.write("\n")
            file.close()
        with open("summaries.txt", "w") as file:
            for line in summaries:
                file.write(str(line))
                file.write("\n")
            file.close()
        with open("locations.txt", "w") as file:
            for line in locations:
                file.write(str(line))
                file.write("\n")
            file.close()
    # typeProc == False | Import data from txt files, convert to dictionary and append to list
    elif typeProc == False:
        with open("jobs.txt", "r") as file:
            content = file.readlines()
            for i in range(len(content)):
                content[i] = content[i].replace("\n", "")
                content[i] = ast.literal_eval(content[i])
                jobs.append(content[i])
            file.close()
        with open("names.txt", "r") as file:
            content = file.readlines()
            for i in range(len(content)):
                content[i] = content[i].replace("\n", "")
                content[i] = ast.literal_eval(content[i])
                names.append(content[i])
            file.close()
        with open("dates.txt", "r") as file:
            content = file.readlines()
            for i in range(len(content)):
                content[i] = content[i].replace("\n", "")
                content[i] = ast.literal_eval(content[i])
                dates.append(content[i])
            file.close()
        with open("summaries.txt", "r") as file:
            content = file.readlines()
            for i in range(len(content)):
                content[i] = content[i].replace("\n", "")
                content[i] = ast.literal_eval(content[i])
                summaries.append(content[i])
            file.close()
        with open("locations.txt", "r") as file:
            content = file.readlines()
            for i in range(len(content)):
                content[i] = content[i].replace("\n", "")
                content[i] = ast.literal_eval(content[i])
                locations.append(content[i])
            file.close()
    # Else | If this else is hit, something is greatly fvcked
    else:
        print("Function: testData | Error: if statement else output")
        sys.exit(1)
# Function - Remove items from all lists
def wipeLists():
    jobs.clear()
    names.clear()
    dates.clear()
    summaries.clear()
    locations.clear()


# Function - JSON Blob Generator
def genJSON(parsedHTML):
    # Testing with cached local IRL data
    #testData(parsedHTML, False)
    getJobs(parsedHTML)
    jsonBlob = []
    # Merge dictionaries | Combining dictionaries into single object + Append to jsonBlob list
    for i in range(len(jobs)):
        sumObj = {**jobs[i], **names[i], **dates[i], **summaries[i], **locations[i]}
        #sumObj = {**jobs[i], **names[i], **dates[i], **summaries[i]}
        jsonBlob.append(sumObj)
    return jsonBlob
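As an aside, one way to make that final merge loop tolerant of lists with different lengths is to pad the shorter ones, for example with itertools.zip_longest. This is only a sketch of that idea (genJSON_padded is a hypothetical helper, and the fill value assumes it is the locations list that comes up short, as in the question); it is not the fix that was eventually used:
from itertools import zip_longest

def genJSON_padded():
    jsonBlob = []
    # zip_longest runs to the end of the longest list and fills gaps with the given placeholder
    for job, name, date, summary, location in zip_longest(
            jobs, names, dates, summaries, locations,
            fillvalue={"location": "null"}):
        jsonBlob.append({**job, **name, **date, **summary, **location})
    return jsonBlob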
Thank you @pavel for your notes on how to approach the issue. I found that the value I was looking for was actually a required field when the posting was created, and for some reason I was just not getting the correct number of values when filtering the parsed data.
I reviewed the source code of the page(s) again and found that there was another field with the exact value I was looking for. So now, instead of getting the text of a span element inside the parent div, I am getting the custom data-* attribute value of the parent div element. I have not encountered a single error whilst testing.
Updated Code:
# Loop - Get job location
for div in parsedHTML.find_all(name='div', attrs={'class': 'sjcl'}):
    for divChild in div.find_all(name='div', attrs={'class': 'recJobLoc'}):
        dictItem = {"location": f"{divChild['data-rc-loc']}"}
        locations.append(dictItem)
Thank You to everyone who tried to help. This has been resolved.
So basically I want a variable whose name changes after every iteration of a for loop, so that it matches the search term used in that iteration. Is that possible? I think I explained it better in the code below.
with open('lista1.txt', 'r') as file_1:
    reader_0 = file_1.readlines()  # Reads a list of search terms;
                                   # the first search term of this list is "gt-710".
    for search in reader_0:
        file_0 = search.replace("\n", "") + ".txt"
        file_1 = str(file_0.strip())
        try:  # if a file named the same as the search term exists, read its contents
            file = open(file_1, "r")
            search = file.readlines()  # How do I create a variable that
                                       # changes names? For example, I want the
                                       # content of file.readlines() to be saved in
                                       # a variable named the same as the
                                       # search term; in this case I want it to
                                       # be gt-710 = file.readlines()... In the
                                       # next iteration I want it to be
                                       # next_search_term_in_the_list =
                                       # file.readlines()... and so on...
            print(str(search) + "I actually tried")
        except:  # if not, create it
            file = open(file_1, "w")
            file.write("hello")
            print("I didnt")
        file.close()
Creating dynamically named variables like that is not really something you can do in Python, but you can do something very similar. Enter stage left, the DICTIONARY! A dictionary is like a list, but you set your own keys. Make it like this:
my_dict = {}
You can add to the dictionary like so:
my_dict["key"] = "value"
A way you could implement this into your code could be as follows:
the_dict = {}
with open('lista1.txt', 'r') as file_1:
    [...]
            file = open(file_1, "r")
            file_contents = file.readlines()
            the_dict[search] = file_contents
            print(str(file_contents) + "I actually tried")
    [...]
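Each result can then be looked up by its search term afterwards. A tiny self-contained sketch of that lookup (the second search term here is made up, and since readlines() keeps the trailing newline on each search term, the key is stripped first):
the_dict = {}
for search in ["gt-710\n", "some-other-term\n"]:  # stand-ins for the lines read from lista1.txt
    key = search.strip()                          # drop the newline that readlines() keeps
    the_dict[key] = ["hello\n"]                   # stand-in for file.readlines()

print(the_dict["gt-710"])  # -> ['hello\n']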
I am trying to find product names in an XML file I downloaded. I have figured out how to display every result using a while loop. My problem is that I want to display only the first 10 results. Also, I need to be able to call each result individually.
For example: print(read_xml_code.start_tag_5) would print the 5th product in the XML file,
and print(read_xml_code.start_tag_10) would print the 10th.
Here is my code so far:
# Define the Static webpage XML file
static_webpage_1 = 'StaticStock/acoustic_guitar.html'


def Find_static_webpage_product_name():
    # Open and read the contents of the first XML file
    read_xml_code = open(static_webpage_1, encoding="utf8").read()
    # Find and print the static page title.
    start_tag = '<title><![CDATA['
    end_tag = ']]></title>'
    end_position = 0
    starting_position = read_xml_code.find(start_tag, end_position)
    end_position = read_xml_code.find(end_tag, starting_position)
    while starting_position != -1 and end_position != -1:
        print(read_xml_code[starting_position + len(start_tag): end_position] + '\n')
        starting_position = read_xml_code.find(start_tag, end_position)
        end_position = read_xml_code.find(end_tag, starting_position)


# call the function
Find_static_webpage_product_name()
There is an HTML parser in the python standard library (python 3):
https://docs.python.org/3/library/html.parser.html
You can subclass it, react to the tag events, and do the counting with a member variable, for instance.
Also, do not forget to close your resources (with open(static_webpage_1, encoding="utf8") as f:...)
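A rough sketch of that idea, assuming (as in the code above) that each product name sits in a <title><![CDATA[...]]></title> block; the class name and the limit handling are only illustrative:
from html.parser import HTMLParser


class ProductTitleParser(HTMLParser):
    """Collects the text of the first `limit` <title> elements."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # stop collecting once the limit has been reached
        if tag == 'title' and len(self.titles) < self.limit:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

    def unknown_decl(self, decl):
        # CDATA sections such as <![CDATA[...]]> are reported here by html.parser,
        # with the content prefixed by "CDATA["
        if self._in_title and decl.startswith('CDATA['):
            self.titles.append(decl[len('CDATA['):].strip())


parser = ProductTitleParser(limit=10)
with open('StaticStock/acoustic_guitar.html', encoding="utf8") as f:
    parser.feed(f.read())

# assuming at least 10 titles were found
print(parser.titles[4])  # 5th product
print(parser.titles[9])  # 10th product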
1. I am trying to read an XML file between the <Sanity_Results> tags (see the input at http://pastebin.com/p9H8GQt4) and print the output.
2. For any line, or part of a line, that contains http:// or //, I want to wrap the link in an "a href" hyperlink tag, so that when I post the output to email the links appear as hyperlinks.
Input file(results.xml)
http://pastebin.com/p9H8GQt4
def getsanityresults(xmlfile):
    srstart = xmlfile.find('<Sanity_Results>')
    srend = xmlfile.find('</Sanity_Results>')
    sanity_results = xmlfile[srstart + 16:srend].strip()
    sanity_results = sanity_results.replace('\n', '<br>\n')
    return sanity_results

def main():
    xmlfile = open('results.xml', 'r')
    contents = xmlfile.read()
    testresults = getsanityresults(contents)
    print testresults
    for line in testresults:
        line = line.strip()
        # How do I find whether the line contains "http", "\\" or "//" and wrap it in an "a href" tag?
        resultslis.append(link)

if __name__ == '__main__':
    main()
Have a look at your error message:
AttributeError: 'file' object has no attribute 'find'
And then have a look at main(): you're feeding the result of open('results.xml', 'r') into getsanityresults. But open(...) returns a file object, whereas getsanityresults expects xmlfile to be a string.
You need to extract the contents of xmlfile and feed that into getsanityresults.
To get the contents of a file, read [this bit of the Python documentation](http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects).
In particular, try:
xmlfile = open('results.xml', 'r')
contents = xmlfile.read() # <-- this is a string
testresults = getsanityresults(contents) # <-- feed a string into getsanityresults
# ... rest of code
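For the second part of the question (wrapping the links in "a href" tags), a possible sketch using re.sub; the pattern below is deliberately simple and only a starting point, not a robust URL matcher:
import re

def linkify(text):
    # wrap anything that looks like http(s)://... or //... in an <a href="..."> tag
    url_pattern = re.compile(r'(https?://\S+|//\S+)')
    return url_pattern.sub(r'<a href="\1">\1</a>', text)

# e.g. testresults = linkify(getsanityresults(contents)) before printing or emailing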