Parsing the xml file to create hyperlinks - python

1. I am trying to read the part of an XML file between the "Sanity_Results" tags (see http://pastebin.com/p9H8GQt4 for the input) and print it.
2. For any line, or part of a line, that contains http:// or //, I want to wrap the link in an "a href" hyperlink tag so that it appears as a hyperlink when I post the output to email.
Input file(results.xml)
http://pastebin.com/p9H8GQt4
def getsanityresults(xmlfile):
    srstart = xmlfile.find('<Sanity_Results>')
    srend = xmlfile.find('</Sanity_Results>')
    sanity_results = xmlfile[srstart+16:srend].strip()
    sanity_results = sanity_results.replace('\n', '<br>\n')
    return sanity_results
def main():
    xmlfile = open('results.xml', 'r')
    contents = xmlfile.read()
    testresults = getsanityresults(contents)
    print testresults
    for line in testresults:
        line = line.strip()
        # How to find if the line contains "http" or "\\" or "//" and append the "a href" attribute?
        resultslis.append(link)
if __name__ == '__main__':
    main()

Have a look at your error message:
AttributeError: 'file' object has no attribute 'find'
And then have a look at main(): you're feeding the result of open('results.xml', 'r') into getsanityresults. But open(...) returns a file object, whereas getsanityresults expects xmlfile to be a string.
You need to extract the contents of xmlfile and feed that into getsanityresults.
To get the contents of a file, read [this bit of the python documentation](http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects).
In particular, try:
xmlfile = open('results.xml', 'r')
contents = xmlfile.read() # <-- this is a string
testresults = getsanityresults(contents) # <-- feed a string into getsanityresults
# ... rest of code
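For the second part of the question (turning links into hyperlinks), one possible approach is a regular-expression substitution over the whole string instead of a line-by-line loop. This is only a sketch: the pattern, the linkify name, and the sample text are assumptions made for illustration, and it handles http://, https:// and bare // prefixes but not backslash-style \\ paths.

```python
import re

# Assumed pattern: an optional http:/https: scheme, then "//", then everything
# up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r'(?:https?:)?//[^\s<>"]+')

def linkify(text):
    """Wrap every URL-like token in text in an <a href> tag."""
    def wrap(match):
        url = match.group(0)
        return '<a href="{0}">{0}</a>'.format(url)
    return URL_PATTERN.sub(wrap, text)

# Hypothetical sample in the shape getsanityresults returns (lines end in <br>).
sample = "Build log: http://example.com/log.txt<br>\nShare: //server/path/file"
linked = linkify(sample)
print(linked)
```

Because the pattern stops at "<", the "<br>" markers added by getsanityresults stay outside the generated links.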

Related

Python - FileNotFoundError, parameter appears to pull wrong path?

I'm trying to update a program to pull/read 10-K html and am getting a FileNotFoundError. The error is thrown in the readHTML function. It looks like the FileName parameter is looking for a path to the Form10KName column, when it should be looking at the FileName column. I have no idea why this is happening. Any help?
Here is the error message:
File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 105, in <module>
main()
File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 92, in main
match=readHTML(FileName)
File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 18, in readHTML
input_file = open(input_path,'r+')
FileNotFoundError: [Errno 2] No such file or directory: './HTML/a10-k20189292018.htm'
And here is what I'm running.
import csv
import os
import re
from bs4 import BeautifulSoup #<---- Need to install this package manually using pip
from urllib.request import urlopen
os.chdir('C:/Users/crabtreec/Downloads/') # The location of the file "CompanyList.csv"
htmlSubPath = "./HTML/" #<===The subfolder with the 10-K files in HTML format
txtSubPath = "./txt/" #<===The subfolder with the extracted text files
DownloadLogFile = "10kDownloadLog.csv" #a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = "10kReadlog.csv" #a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms
def readHTML(FileName):
    input_path = htmlSubPath+FileName
    output_path = txtSubPath+FileName.replace(".htm",".txt")
    input_file = open(input_path,'r+')
    page = input_file.read() #<===Read the HTML file into Python
    #Pre-processing the html content by removing extra white space and combining them into one line.
    page = page.strip() #<=== remove white space at the beginning and end
    page = page.replace('\n', ' ') #<===replace the \n (new line) character with space
    page = page.replace('\r', '') #<===remove the \r (carriage returns -if you're on windows)
    page = page.replace('&nbsp;', ' ') #<===replace "&nbsp;" (a special character for space in HTML) with space.
    page = page.replace('&#160;', ' ') #<===replace "&#160;" (a special character for space in HTML) with space.
    while '  ' in page:
        page = page.replace('  ', ' ') #<===remove extra space
    #Using regular expression to extract texts that match a pattern
    #Define pattern for regular expression.
    #The following patterns find ITEM 1 and ITEM 1A as displayed as subtitles
    #(.+?) represents everything between the two subtitles
    #If you want to extract something else, here is what you should change
    #Define a list of potential patterns to find ITEM 1 and ITEM 1A as subtitles
    regexs = (r'bold;">\s*Item 1\.(.+?)bold;">\s*Item 1A\.', #<===pattern 1: with an attribute bold before the item subtitle
              r'b>\s*Item 1\.(.+?)b>\s*Item 1A\.', #<===pattern 2: with a tag <b> before the item subtitle
              r'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>', #<===pattern 3: with a tag </b> after the item subtitle
              r'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b') #<===pattern 4: with a tag </b> after the item+description subtitle
    #Now we try to see if a match can be found...
    for regex in regexs:
        match = re.search(regex, page, flags=re.IGNORECASE) #<===search for the pattern in HTML using re.search from the re package. Ignore cases.
        #If a match exists....
        if match:
            #Now we have the extracted content still in an HTML format
            #We now turn it into a beautiful soup object
            #so that we can remove the html tags and only keep the texts
            soup = BeautifulSoup(match.group(1), "html.parser") #<=== match.group(1) returns the texts inside the parentheses (.+?)
            #soup.text removes the html tags and only keeps the texts
            rawText = soup.text #<=== already a str in Python 3; the encoding is applied when the file is written
            #remove space at the beginning and end and the subtitle "business" at the beginning
            #^ matches the beginning of the text
            outText = re.sub(r"^business\s*","",rawText.strip(),flags=re.IGNORECASE)
            output_file = open(output_path, "w", encoding="utf8")
            output_file.write(outText)
            output_file.close()
            break #<=== if a match is found, we break the for loop. Otherwise the for loop continues
    input_file.close()
    return match
def main():
    if not os.path.isdir(txtSubPath): ### <=== keep all texts files in this subfolder
        os.makedirs(txtSubPath)
    csvFile = open(DownloadLogFile, "r") #<===A csv file with the list of 10k file names (the file should have no header)
    csvReader = csv.reader(csvFile, delimiter=",")
    csvData = list(csvReader)
    logFile = open(ReadLogFile, "a+") #<===A log file to track which file is successfully extracted
    logWriter = csv.writer(logFile, quoting = csv.QUOTE_NONNUMERIC)
    logWriter.writerow(["filename","extracted"])
    i=1
    for rowData in csvData[1:]:
        if len(rowData):
            FileName = rowData[7]
            if ".htm" in FileName:
                match=readHTML(FileName)
                if match:
                    logWriter.writerow([FileName,"yes"])
                else:
                    logWriter.writerow([FileName,"no"])
        i=i+1
    csvFile.close()
    logFile.close()
    print("done!")
if __name__ == "__main__":
    main()
[Image: CSV of file info]
Your error message explains it is not looking inside the "HTML" directory for the file.
I would avoid using os.chdir to change the working directory - it is likely to complicate things. Instead, use pathlib and join paths correctly to ensure file paths are less error prone.
Try with this:
from pathlib import Path
base_dir = Path('C:/Users/crabtreec/Downloads/') # The location of the file "CompanyList.csv"
htmlSubPath = base_dir.joinpath("HTML") #<===The subfolder with the 10-K files in HTML format
txtSubPath = base_dir.joinpath("txt") #<===The subfolder with the extracted text files
DownloadLogFile = "10kDownloadLog.csv" #a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = "10kReadlog.csv" #a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms
def readHTML(FileName):
    input_path = htmlSubPath.joinpath(FileName)
    output_path = txtSubPath.joinpath(FileName.replace(".htm",".txt"))
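As a rough illustration of the pathlib approach, the sketch below joins the folder and file name with the / operator and checks that the file exists before reading it, printing the absolute path when it does not. The read_html helper, the temporary folder, and the sample file names are all hypothetical stand-ins for the real Downloads/HTML layout.

```python
from pathlib import Path
import tempfile

def read_html(html_dir, file_name):
    """Return the text of html_dir/file_name, or None if the file is missing."""
    input_path = Path(html_dir) / file_name
    if not input_path.is_file():
        # Printing the absolute path shows exactly where Python looked.
        print("missing:", input_path.resolve())
        return None
    return input_path.read_text(encoding="utf-8", errors="replace")

# Hypothetical layout built in a temporary folder for the demonstration.
base_dir = Path(tempfile.mkdtemp())
html_dir = base_dir / "HTML"
html_dir.mkdir()
(html_dir / "sample-10k.htm").write_text("<b>Item 1.</b>", encoding="utf-8")

found = read_html(html_dir, "sample-10k.htm")
missing = read_html(html_dir, "not-downloaded.htm")
```

Returning None for a missing file lets the calling loop log the file as "no" and continue with the next row instead of stopping on the first FileNotFoundError.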

Edit attribute value and add new attribute in xml using python

Below is the sample XML file consisting of 3 data-sources. In each data-source there is a tag having an attribute .
Of the 3 data-sources, 2 do not have the attribute, and one has it but with the value false.
I want to add the attribute where it is missing and change its value to true in the data-source where it is present.
SAMPLE XML snippets:
Using DOM
# import minidom
import xml.dom.minidom as mdom
# open with minidom parser
DOMtree = mdom.parse("Input.xml")
data_set = DOMtree.documentElement
# get all validation elements from data_set
validations = data_set.getElementsByTagName("validation")
# read all validation from validations
for validation in validations:
    # get the element by tag name (hyphens are not allowed in Python names, so use underscores)
    use_fast_fail = validation.getElementsByTagName('use-fast-fail')
    # if the tag exists
    if use_fast_fail:
        if use_fast_fail[0].firstChild.nodeValue == "false":
            # if tag value is false replace with true
            use_fast_fail[0].firstChild.nodeValue = "true"
    # if tag does not exist add tag
    else:
        newTag = DOMtree.createElement("use-fast-fail")
        newTag.appendChild(DOMtree.createTextNode("true"))
        validation.appendChild(newTag)
# write into output file
with open("Output.xml", 'w') as output_xml:
    output_xml.write(DOMtree.toprettyxml())
Using simple file read and string search with regex
# import regex
import re
# open the input file with "read" option
input_file = open("Input.xml", "r")
# put content into a list
contents = input_file.readlines()
# close the file
input_file.close()
# loop to check the file line by line
# for every entry - get the index and value
for index, value in enumerate(contents):
    # check whether the line contains the element with a false value
    if re.search('<background-validation>false</background-validation>', value):
        # if so, change it to the desired value
        contents[index] = "<background-validation>true</background-validation>\n"
    # check whether the line contains the element that always comes just before the desired one
    if re.search('validate-on-match', value):
        # check whether the next line in the list contains the desired element
        if not re.search('<background-validation>', contents[index + 1]):
            # if not, add the element
            contents.insert(index + 1, "<background-validation>true</background-validation>\n")
# open the file with "write" option
output_file = open("Output.xml", "w")
# join all contents
contents = "".join(contents)
# write into output file
output_file.write(contents)
output_file.close()
Note: in the second option, adding the line when it does not exist assumes that every data-source block has the same structure and order; otherwise we may need to check multiple conditions.
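A third possibility, not part of the answer above, is xml.etree.ElementTree, which does not depend on line order. The sketch below is hedged: since the sample XML is not shown, the layout (a "validation" parent holding a "use-fast-fail" child element) is assumed from the DOM example, and the force_true helper and SAMPLE document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed structure, modelled on the DOM example; adjust the tag names to the real file.
SAMPLE = """<datasources>
  <data-source name="a"><validation><validate-on-match>true</validate-on-match></validation></data-source>
  <data-source name="b"><validation><use-fast-fail>false</use-fast-fail></validation></data-source>
</datasources>"""

def force_true(root, parent_tag="validation", child_tag="use-fast-fail"):
    """Ensure every parent_tag element has a child_tag child whose text is 'true'."""
    for parent in root.iter(parent_tag):
        child = parent.find(child_tag)
        if child is None:
            # The element is missing, so create it under the parent.
            child = ET.SubElement(parent, child_tag)
        child.text = "true"
    return root

root = force_true(ET.fromstring(SAMPLE))
result = ET.tostring(root, encoding="unicode")
print(result)
```

For a file on disk, ET.parse("Input.xml") followed by tree.write("Output.xml") would replace the in-memory string used here.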

Reads And Updates XML in pycharm but not command line

I am very new to Python and SO. The script opens XML files inside a folder. Using os.walk, I iterate over the directory, open each file, and call a function that iterates over the XML and updates it, writing the updated document over the original with .writexml. The script reads and updates the XML when run from PyCharm, but when I run it from the command line it fails with this error:
Traceback (most recent call last):
File "./XMLParser.py", line 67, in <module>
xmldoc = minidom.parse(xml)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 5614: ordinal not in range(128)
CODE:
from xml.dom import minidom
import os
import codecs
'''
Function to iterate over the directory that contains the work items
params:
CoreID of new author,
x is the parsed workItem.xml document,
p is the path to the workItem.xml that will be overwritten with new data
'''
def changeauthor(coreid, x, p):
    # Gets the content of the xml based within the work item tag.
    testcase = x.getElementsByTagName("work-item")[0]
    # All fields are stored as a <field> tag with the id attribute being the
    # differentiator between them. Fields is a list of all the field tags in the
    # document.
    fields = testcase.getElementsByTagName("field")
    # Loop iterates over the field tags and looks for the one tag where the id
    # attribute has a value of author. When this tag is found the tag's value is
    # updated to the core id passed to the function.
    for field in fields:
        attribute = field.attributes['id'].value
        if attribute == "author":
            # Print the current author.
            print("Previous Author: ", field.firstChild.data)
            # Set the author to the core id entered into the script
            field.firstChild.data = coreid
            # Print the updated author field
            print("New Author: ", field.firstChild.data)
    # Create a temp file with the same path as the source
    tmp_config = p
    # Open the new temp file with the write mode set.
    with codecs.open(tmp_config, 'w', "utf-8") as f:
        # f = open(tmp_config, 'w')
        # Write the xml into the file at the same location as the original
        x.writexml(f)
        # Close the file
        # f.close()
    return
while True:
    core = str(input("Enter Core ID of the new author: "))
    core = core.upper()
    spath = str(input("Please enter the full path to the directory of test cases: "))
    count = 0
    confirm = str(input("Confirm path and core id (Y/N or Exit to leave script): "))
    confirm = confirm.upper()
    if confirm == "Y":
        '''Hard code path here and comment out line above asking for input either will work.'''
        # spath = "/Users/Evan/Desktop/workitems-r583233"
        # Loop iterates over the directory. Whenever a workitem.xml file is found the path is stored and the file is
        # parsed. The core ID entered and the path as well as the parsed xml doc are passed to the change author
        # function.
        for roots, dirs, files in os.walk(spath):
            for file in files:
                title = file.title()
                if title == "Workitem.Xml":
                    path = os.path.join(roots, file)
                    with codecs.open(path, 'r+', "utf-8") as xml:
                        xmldoc = minidom.parse(xml)
                        lst = path.split('/')
                        wi = lst[5]
                        print("Updating: ", wi)
                        changeauthor(core, xmldoc, path)
                        count += 1
                        print(wi, "updated successfully.")
                        print("-------------------------------")
        if count > 0:
            # Print how many test cases were updated.
            print("All Done", count, "workItems updated!")
        else:
            print("Please double check path and try again no workItems found to update!")
    elif confirm == "N":
        continue
    elif confirm == "EXIT":
        break

Python - How to check if the text is in a file txt?

I have a function that checks if the text is in file.txt or not.
The function works like this: If the text is contained in the file, the file is closed. If the text is not contained in the file, it is added.
But it doesn't work.
import urllib2, re
from bs4 import BeautifulSoup as BS
def SaveToFile(fileToSave, textToSave):
    datafile = file(fileToSave)
    for line in datafile:
        if textToSave in line:
            datafile.close()
        else:
            datafile.write(textToSave + '\n')
            datafile.close()
urls = ['url1', 'url2'] # I don't want to publish the links.
patGetTitle = re.compile(r'<title>(.*)</title>')
for url in urls:
    u = urllib2.urlopen(url)
    webpage = u.read()
    title = re.findall(patGetTitle, webpage)
    SaveToFile('articles.txt', title)
    # so here. If the title of the website is already in articles.txt
    # the function should close the file.
    # But if the title is not found in articles.txt the function should add it.
You can change the SaveToFile function like this.
Your title is a list and not a string, so you should call it like this: SaveToFile('articles.txt', title[0]) to get the first element of the list.
def SaveToFile(fileToSave, textToSave):
    with open(fileToSave, "r+") as datafile:
        for line in datafile:
            if textToSave in line:
                break
        else:
            datafile.write(textToSave + '\n')
Notes:
Since you were looping over an empty file, the loop did not even run once.
i.e.)
for i in []:
    print i # This will print nothing since it is iterating over an empty list, same as yours
You passed a list and not a string: re.findall returns a list object, so you have to pass the first element of the list to the function.
I have used for..else here: the else branch runs only when the loop finishes without reaching break.
i.e.)
for i in []:
    print i
else:
    print "Nooooo"
Output:
Nooooo
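The for..else behaviour can be checked with a small Python 3 sketch. The contains_line helper and the sample lists are made up for the demonstration; the point is only that the else branch is skipped when break runs and taken otherwise, including for an empty sequence.

```python
def contains_line(lines, text):
    """Report whether text occurs in any line, using for..else."""
    for line in lines:
        if text in line:
            result = "found"
            break  # leaving via break skips the else branch
    else:
        # Runs only when the loop ends without break, including when
        # lines is empty and the loop body never executes.
        result = "not found"
    return result

outcomes = [
    contains_line(["alpha", "beta"], "beta"),
    contains_line(["alpha"], "beta"),
    contains_line([], "beta"),
]
print(outcomes)
```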
Just use r+ mode like this:
def SaveToFile(fileToSave, textToSave):
    with open(fileToSave, 'r+') as datafile:
        if textToSave not in datafile.read():
            datafile.write(textToSave + '\n')
About that file mode, from this answer:
``r+'' Open for reading and writing. The stream is positioned at the
beginning of the file.
And re.findall() always returns a list, so if you try to write a list instead of a string you'll get an error.
So you could use:
def SaveToFile(fileToSave, textToSave):
    if len(textToSave) >= 1:
        textToSave = textToSave[0]
    else:
        return
    with open(fileToSave, 'r+') as datafile:
        if textToSave not in datafile.read():
            datafile.write(textToSave + '\n')
You should refactor your SaveToFile function like this.
def SaveToFile(fileToSave, titleList):
    with open(fileToSave, 'a+') as f:
        f.seek(0) # 'a+' may start positioned at the end of the file, so rewind before reading
        data = f.read()
        for titleText in titleList:
            if titleText not in data:
                f.write(titleText + '\n')
This function reads the content of the file (creating it if it does not exist) and checks whether each title in titleList is already in the file contents. Titles that are found are skipped; the others are written to the file.
This seems closer to your problem.
This checks if the text is in the file:
def is_text_in_file(file_name, text):
    with open(file_name) as fobj:
        for line in fobj:
            if text in line:
                return True
    return False
This uses the function above to check, and writes the text to the end of the file if it is not in the file yet.
def save_to_file(file_name, text):
    if not is_text_in_file(file_name, text):
        with open(file_name, 'a') as fobj:
            fobj.write(text + '\n')
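The answers above use substring checks and mostly assume the file already exists. A minimal variation, assuming exact whole-line matching is what is wanted, is sketched below; the append_line_if_missing name and the temporary demo file are hypothetical.

```python
import os
import tempfile

def append_line_if_missing(file_name, text):
    """Append text as a new line unless an identical line is already present.

    Returns True if the line was written, False if it was already there.
    """
    # 'a+' creates the file when it does not exist; writes always go to the end.
    with open(file_name, "a+", encoding="utf-8") as fobj:
        fobj.seek(0)  # rewind, because append mode starts at the end of the file
        existing = fobj.read().splitlines()
        if text in existing:
            return False
        fobj.write(text + "\n")
        return True

# Demonstration on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "articles.txt")
first = append_line_if_missing(demo_path, "Some title")
second = append_line_if_missing(demo_path, "Some title")
with open(demo_path, encoding="utf-8") as fobj:
    demo_contents = fobj.read()
```

Comparing whole lines avoids treating "Some" as already saved just because "Some title" is in the file, which a plain substring test would do.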

Cant get text to append to file in python

OK, so this snippet of code is an HTTP response handler inside a Flask server. I don't think this information will be of any use, but it's there if you need to know it.
This code is supposed to read in the name from the POST request and write to a file.
Then it checks a file called saved.txt, whose path is stored in the FILES dictionary.
If we do not find our filename in the saved.txt file, we append the filename to the saved file.
The APIResponse function is just a JSON dump.
At the moment it doesn't seem to be appending at all. The file is written just fine, but the append doesn't go through.
Also, this is being run on Linino, which is just a distribution of Linux.
def post(self):
    try:
        ## Create the filepath so we can use this for multiple schedules
        filename = request.form["name"] + ".txt"
        path = "/mnt/sda1/arduino/www/"
        filename_path = path + filename
        #Get the data from the request
        schedule = request.form["schedule"]
        replacement_value = schedule
        #write the schedule to the file
        writefile(filename_path,replacement_value)
        #append the file base name to the saved file
        append = True
        schedule_names = readfile(FILES['saved']).split(" ")
        for item in schedule_names:
            if item == filename:
                append = False
        if append:
            append_to = FILES['saved']
            filename_with_space =filename + " "
            append(append_to,filename_with_space)
        return APIResponse({
            'success': "Successfully modified the mode."
        })
    except:
        return APIResponse({
            'error': "Failed to modify the mode"
        })
Here are the requested functions
def writefile(filename, data):
    #Opens a file.
    sdwrite = open(filename, 'w')
    #Writes to the file.
    sdwrite.write(data)
    #Close the file.
    sdwrite.close()
    return
def readfile(filename):
    #Opens a file.
    sdread = open(filename, 'r')
    #Reads the file's contents.
    blah = sdread.readline()
    #Close the file.
    sdread.close()
    return blah
def append(filename,data):
    ## use mode a for appending
    sdwrite = open(filename, 'a')
    ## append the data to the file
    sdwrite.write(data)
    sdwrite.close()
Could it be that the bool object append and the function name append are the same? When I tried it, Python complained with "TypeError: 'bool' object is not callable"
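The name clash can be reproduced outside Flask with a small sketch. The broken and fixed functions, the should_append flag name, and the temporary saved.txt path are illustrative assumptions; only the append helper mirrors the one in the question. In the original post method the bare except would also swallow this TypeError, which may be why only the generic failure response is visible.

```python
import os
import tempfile

def append(filename, data):
    """Append data to filename (same role as the helper in the question)."""
    with open(filename, "a") as fobj:
        fobj.write(data)

def broken(saved_path, filename):
    append = True  # this local name now hides the append() function
    if append:
        append(saved_path, filename + " ")  # raises TypeError: 'bool' object is not callable

def fixed(saved_path, filename):
    should_append = True  # a distinct flag name keeps append() reachable
    if should_append:
        append(saved_path, filename + " ")

saved = os.path.join(tempfile.mkdtemp(), "saved.txt")
try:
    broken(saved, "a.txt")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

fixed(saved, "a.txt")
with open(saved) as fobj:
    saved_contents = fobj.read()
print(error_message, repr(saved_contents))
```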
