I am trying to extract links from many text files. To do that, I first want to build a list of the text file names and store it in a Python list or in a separate file. When I print each name, I get the output I want, but when I store the names in a list or write them to a new file, they show up in an encoded, unreadable form.
I marked the desired output with a green circle and the undesired output with a red circle.
Can you guide me toward getting the desired output?
docs = []
with open('files.txt','r') as f:
    content = f.readlines()
    for doc in content:
        if '-' in doc:
            print(doc[101:])
            docs.append(doc[101:])
            #print(doc[101:])
            #print(type(doc))
print(docs)
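This snippet might help you fetch the file names from the text file. It strips trailing whitespace (including the newline) from each line and keeps the last space-separated field: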
docs = []
with open('files.txt', 'r') as f:
    content = f.readlines()
    for doc in content:
        file_name = doc.rstrip().split(" ")[-1]
        docs.append(file_name)
print(docs)
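The unreadable output most likely comes from how Python prints a list: `print(docs)` shows the repr of each string, so a trailing newline appears as a literal `\n`, while `print(doc)` prints the characters themselves. Here is a minimal sketch of the difference, assuming made-up sample lines in place of the real files.txt (whose content is not shown in the question):

```python
# Hypothetical sample lines; the real files.txt content is not shown in the question.
lines = ["2021-01-01 report one.txt\n", "2021-01-02 notes.txt\n"]

# Slicing alone keeps the trailing newline of each line.
raw = [line[11:] for line in lines]

# Stripping the newline before storing gives clean names.
clean = [line[11:].rstrip("\n") for line in lines]

print(raw)    # list printing uses repr(), so the newline shows up as an escape
print(clean)  # no escapes, because the newline was removed
```

To store the names in a new file, writing `"\n".join(clean)` gives one name per line, rather than writing the list itself.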
I'm pretty new to XML parsing in Python with minidom.
I have got this XML:
<filelist>
    <file id="1.jpg"></file>
</filelist>
I would like to add a row such as the following and then save the result to the same file:
<file id="2.jpg"></file>
I am doing the parsing using:
from xml.dom import minidom

doc = minidom.parse('filelist.xml')
files = doc.getElementsByTagName('file')
for file in files:
    idFile = file.getAttribute("id")
    print(idFile)
How can I add that element and then save it to the same file?
Starting from your original code, the following steps add an element and save it back to the original file:
1. Create a new 'file' element
2. Set the 'id' attribute of the new 'file' element
3. Retrieve the document root ('filelist') node
4. Append the new 'file' element to the 'filelist' node
5. Write the updated XML back to the original file
The updated code follows, with comments numbered to match this list.
from xml.dom import minidom
doc = minidom.parse('filelist.xml')
# 1. Create a new 'file' element
new_file_element = doc.createElement('file')
# 2. Set the new 'file' element 'id' attribute
new_file_element.setAttribute('id', '2.jpg')
# 3. Retrieve the document root ('filelist') node
filelist_element = doc.documentElement
# 4. Append the new 'file' element to the 'filelist' node
filelist_element.appendChild(new_file_element)
files = doc.getElementsByTagName('file')
for file in files:
    idFile = file.getAttribute('id')
    print(idFile)
# 5. Write updated XML back to original file
# Open the file as UTF-8 so the bytes on disk match the declared encoding.
with open('filelist.xml', 'w', encoding='utf-8') as xml_file:
    doc.writexml(xml_file, encoding='utf-8')
I am using Python to work with an XML file. I need to insert one line into it, and my code looks like this:
import xml.etree.ElementTree as ET

xobj = ET.parse('/src/xxx.xml')
xroot = xobj.getroot()
filename = ET.Element("filename")
filename.text = xmlname
xroot.insert(0, filename)
tree = ET.ElementTree(xroot)
tree.write('/dst/xxx.xml')
It did insert the content into the original XML file, but not on its own line. My XML file becomes:
<filename>004228.xml</filename><object>
....
</object>
There should be a \n between </filename> and <object>, but this method does not add that line separator. How can I make the output nicely formatted?
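One possible approach, sketched under assumptions: ElementTree writes the tree exactly as it is stored, and the whitespace that follows an element's closing tag lives in that element's `tail` attribute. Setting `tail` on the inserted element should therefore produce the line break. The sample XML and the file name value below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the document parsed from /src/xxx.xml.
xroot = ET.fromstring("<annotation><object><name>cat</name></object></annotation>")

filename = ET.Element("filename")
filename.text = "004228.xml"
filename.tail = "\n"  # whitespace written right after </filename>
xroot.insert(0, filename)

# Serialize to a string to inspect the result; tree.write() emits the same markup.
result = ET.tostring(xroot, encoding="unicode")
print(result)
```

On Python 3.9 or later, calling `ET.indent(xroot)` before writing is an alternative that re-indents the whole tree instead of a single element.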
I have patent data in XML that I want to store in a CSV file. My code iterates correctly over the invention name, date, country, and patent number, but when I try to write the results to a CSV file, something goes wrong.
The XML data looks like this (for one section of many):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0584026-20090106.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20081222" date-publ="20090106">
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id>
        <country>US</country>
        <doc-number>D0584026</doc-number>
        <kind>S1</kind>
        <date>20090106</date>
      </document-id>
    </publication-reference>
My code for running through and writing these lines one-by-one is:
for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml") # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = [] # Creating empty list to append into
    for info in pub_ref: # Looping over all instances of publication
        # The final loop finds every instance of invention name, patent number, date, and country to print and append into
        with open('./output.csv', 'wb') as f:
            writer = csv.writer(f, dialect = 'excel')
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                #print(inv_name.text, pat_num.text, date_num.text, country.text)
                #lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
And lastly, the output in my .csv file is this:
"Content addressable information encapsulation, representation, and transfer",07475432,20090106,US
I'm unsure where the issue lies. I know I'm still quite a newbie at Python, but can anyone find the problem?
You open the file in overwrite mode ('wb') inside a loop. On each iteration you erase what could have been previously written. The correct way is to open the file outside the loop:
...
with open('./output.csv', 'wb') as f:
    writer = csv.writer(f, dialect = 'excel')
    for info in pub_ref: # Looping over all instances of publication
        # The final loop finds every instance of invention name, patent number, date, and country to print and append into
        for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
            ...
The problem lies in this line: with open('./output.csv', 'wb') as f:
If you want to write all rows into a single file, use append mode ('a', or 'ab' for a binary file as here). Using 'wb' truncates the file every time it is opened, so you only get the last line.
Read more about the file mode here: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
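For readers on Python 3, here is a minimal sketch of the fix described above: create the writer once, before any loop, and send every row through it. The `records` list is made-up stand-in data for the parsed values, and `write_rows` is a hypothetical helper. In Python 3 the csv module expects a text-mode file opened with `newline=''` rather than `'wb'`.

```python
import csv
import io

# Hypothetical stand-in for the values extracted from each patent record.
records = [
    ("Example title A", "D0584026", "20090106", "US"),
    ("Example title B", "07475432", "20090106", "US"),
]

def write_rows(f, rows):
    """Write all rows through a single csv.writer that is created once."""
    writer = csv.writer(f, dialect="excel")
    for row in rows:
        writer.writerow(row)

# With a real file this would be:
#     with open("output.csv", "w", newline="", encoding="utf-8") as f:
#         write_rows(f, records)
# An in-memory buffer is used here so the sketch leaves no file behind.
buffer = io.StringIO(newline="")
write_rows(buffer, records)
print(buffer.getvalue())
```

Opening in `'w'` mode once replaces the file on each run of the script, whereas append mode keeps adding rows across runs, so the choice depends on whether old rows should survive.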
I have a file with a bunch of information with this following format:
<name>New York Jets</name>
I am trying to turn each line of the file into its own string. For example, I want this line to say "This is the roster for the New York Jets." I have this so far, but it prints "This is the roster for the" for every single line. I think I have to use:
inputString.split('\n')
But I'm not sure where to put it. This is what I have so far:
def summarizeData(filename):
    with open(filename,"r") as fo:
        for rec in fo:
            name=rec.split('>')[1].split('<')[0]
            print("Here is the roster for the %s." % (name))
I call it with summarizeData("NewYorkJets.txt"). Basically, I am trying to split each line of the file into its own string.
from xml.dom import minidom
xmldoc = minidom.parse('filename.txt')
itemlist = xmldoc.getElementsByTagName('item')
for s in itemlist:
    print(s.attributes['name'].value)
You can read a file that contains tags like this and retrieve the values this way.
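Note that the minidom snippet reads a `name` attribute from `item` elements, while the file in the question stores the team name as element text (`<name>New York Jets</name>`). Here is a sketch for that shape, assuming every non-empty line is one complete element and using `xml.etree.ElementTree` instead of minidom. `summarize_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def summarize_lines(lines):
    """Return one sentence per non-empty line of the form <name>...</name>."""
    sentences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        element = ET.fromstring(line)  # parse the single element on this line
        sentences.append("Here is the roster for the %s." % element.text)
    return sentences

# Sample input mirroring the line shown in the question, plus a blank line.
sample = ["<name>New York Jets</name>\n", "\n"]
print(summarize_lines(sample))
```

With a real file, passing the open file object (for example `summarize_lines(fo)`) should work the same way, since iterating over a file yields its lines.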