CSV Writer only writing first line in file - python

So I have patent data I wish to store from an XML to a CSV file. I've been able to run my code through each iteration of the invention name, date, country, and patent number, but when I try to write the results into a CSV file something goes wrong.
The XML data looks like this (for one section of many):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0584026-20090106.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20081222" date-publ="20090106">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0584026</doc-number>
<kind>S1</kind>
<date>20090106</date>
</document-id>
</publication-reference>
My code for running through and writing these lines one-by-one is:
for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml") # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = [] # Creating empty list to append into
    for info in pub_ref: # Looping over all instances of publication
        # The final loop finds every instance of invention name, patent number, date, and country to print and append into
        with open('./output.csv', 'wb') as f:
            writer = csv.writer(f, dialect = 'excel')
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                #print(inv_name.text, pat_num.text, date_num.text, country.text)
                #lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
And lastly, the output in my .csv file is this:
"Content addressable information encapsulation, representation, and transfer",07475432,20090106,US
I'm unsure where the issue lies and I know I'm still quite a newbie at Python but can anyone find the problem?

You open the file in overwrite mode ('wb') inside a loop, so each iteration erases what was previously written. The correct way is to open the file outside the loop:
...
with open('./output.csv', 'wb') as f:
    writer = csv.writer(f, dialect = 'excel')
    for info in pub_ref: # Looping over all instances of publication
        # The final loop finds every instance of invention name, patent number, date, and country to print and append into
        for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
            ...

The problem lies in this line: with open('./output.csv', 'wb') as f:
If you want to write all rows into a single file, use mode a (append). Mode wb truncates the file on every iteration, which is why you are only getting the last line.
Read more about file modes here: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
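Putting both answers together, a minimal self-contained sketch of the fix (Python 3, with the BeautifulSoup parsing replaced by a hypothetical list of already-extracted rows) shows why the file must be opened once, outside the loop:

```python
import csv

# Hypothetical stand-in for the parsed records; in the real script these
# come from BeautifulSoup inside the separated_xml() loop.
records = [
    ("Invention A", "D0584026", "20090106", "US"),
    ("Invention B", "07475432", "20090106", "US"),
]

# Open the file once, before the loop: every writerow() then appends to the
# same handle instead of truncating the file on each iteration.
# (Python 3 note: csv files are opened in text mode with newline='', not 'wb'.)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    for inv_name, pat_num, date_num, country in records:
        writer.writerow([inv_name, pat_num, date_num, country])
```

Both rows end up in the file, whereas opening inside the loop with 'wb' would leave only the last one.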

Related

Removing multiple XML declaration from document

I have a file that has multiple XML declarations.
<?xml version="1.0" encoding="UTF-8"?>
I am currently reading the file as a .txt file and rewriting each line that is not a XML declaration into a new .txt file. As I have many such document files, this method is taking time (around 20mins per file). I wanted to know if there was an easier way to do this.
I am using Python to do this. The files are sitting on my laptop and each file is around 11 Million lines (450mb size).
My code for iterating through the file and removing the declarations is below.
month_file = "2015-01.nml.txt"
delete_lines = [
    '<?xml version="1.0" encoding="ISO-8859-1" ?>',
    '<?xml version="1.0" encoding="iso-8859-1" ?>',
    '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">',
]

with open(month_file, encoding="ISO-8859-1") as in_fh:
    while True:
        line = in_fh.readline()
        if not line:
            break
        if any(x in line for x in delete_lines):
            continue
        else:
            out_fh = open('myfile_faster.xml', "a")
            out_fh.write(line)
            out_fh.close()
This is essentially the same as your version, but it opens the input and output files just once, uses a single if condition, and writes to the output as it iterates through the input (much like sed). Reopening the output file for every written line, as your version does, is what makes it slow.
with open(in_file, mode="rt") as f_in, open(out_file, mode="wt") as f_out:
    for line in f_in:
        if (
            not line
            or line.startswith("<?xml")
            or line.startswith("<!DOCTYPE")
        ):
            continue
        f_out.write(line)
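A quick way to sanity-check the filter logic without a 450 MB file is to run it over a few in-memory lines (the sample content below is made up):

```python
import io

# Made-up sample: two XML declarations, one DOCTYPE, two content lines.
sample_in = io.StringIO(
    '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">\n'
    '<doc><headline>kept line</headline></doc>\n'
    '<?xml version="1.0" encoding="iso-8859-1" ?>\n'
    '<body>also kept</body>\n'
)
sample_out = io.StringIO()

# Same filter as above: skip declarations and DOCTYPE lines, copy the rest.
for line in sample_in:
    if line.startswith("<?xml") or line.startswith("<!DOCTYPE"):
        continue
    sample_out.write(line)
```

Only the two content lines survive in sample_out.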

python cant read csv file downloaded from azure dev ops (utf-8)

I created an azure dev ops query, and chose 'download results as csv' which gave me a csv file. If I open this csv in vscode, I can see in the bottom right corner it says UTF-8 with BOM
I am trying to write some python function that will read in each value of this csv file. I can't rely on parsing the text myself and splitting values on the comma character, because I will have values that include commas inside them.
If I open my csv in excel, everything is organized perfectly. But if I try to parse the file in python, it reads in every row as a single string separated by commas (bad)
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                print('row=',row)
            print('done')
            return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
console output:
filename: csv_migration\ads-test-direct-download.csv, id_format: ID, encoding: utf-8-sig
row= ['Title,State,Work Item Type,ID,12NC']
row= ['TITLE,WITH COMMAS,To Do,NAME,6034,"value,with,commas"']
done
How can I read this file in python so it separates each value into a list? Instead of this single string
I get the same result with encodingVar='utf-8'. Should I open my csv in some app like Notepad++ and convert it to utf-16? My code works great for .csv files with utf-16 encoding; it can parse each individual value into a list no problem. Why won't this work with a utf-8 BOM csv, even when Excel can parse the individual values perfectly fine?
csv file: https://file.io/TXh6uyXKZaug
from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            # The file is comma-delimited, so use the default delimiter;
            # delimiter='\t' was making each whole line a single field.
            csv_reader = reader(read_obj)
            csv_cols = None
            for row in csv_reader:
                print('row=', row)  # <-- row is now a list of values
            print('done')
            return dict
    except Exception as e:
        print('err=', e)
        return {}

ads_dict = read_csv_as_map(
    csv_filename="csv_migration\\ads-test-direct-download.csv",
    id_format='ID',
    encodingVar='utf-8-sig'
)
This is your code with one change: the delimiter. With the default comma delimiter, csv.reader splits each line into a list you can index to get the information out, and quoted values that contain commas stay intact.
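A self-contained check (sample data modeled on the console output above) shows that the default comma delimiter keeps quoted, comma-containing values intact, and that utf-8-sig strips the BOM:

```python
import csv

# Bytes as they would appear in a UTF-8-with-BOM download (sample data).
raw_bytes = ('Title,State,ID\n'
             '"TITLE, WITH COMMAS",To Do,6034\n').encode('utf-8-sig')

# utf-8-sig strips the BOM on decode; csv.reader's default comma delimiter
# then splits each line while honoring the quotes around embedded commas.
text = raw_bytes.decode('utf-8-sig')
rows = list(csv.reader(text.splitlines()))
print(rows)
```

The quoted title comes back as one field, commas and all, and the BOM does not leak into the first header name.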

How to write attributes from an xml file to a text file

I am using ElementTree to get the attributes and elements I need from an xml file. The xml file is queried from mySQL
I want to write out all the attributes and elements into a new text file using python
root = tree.getroot()
name = root.attrib['name']
country = root.find("country").text
I can see the results when I print them out
I want to write to a file the list of all the names and countries in the xml file
So if you build a list with all the names from your XML, you can use these few lines of code to create a .txt file and write each name on a new line.
list_names = ["OLIVIA", "RUBY", "EMILY", "GRACE", "JESSICA"]

with open('listName.txt', 'w') as filehandle:
    # filehandle.writelines("%s\n" % name for name in list_names)
    filehandle.writelines("".join("{0}\n".format(name) for name in list_names))
As #Parfait suggested, here is a solution without % to concatenate string.
Source : https://stackabuse.com/reading-and-writing-lists-to-a-file-in-python/
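Since the question asks for both names and countries, here is a sketch that walks the tree with ElementTree and writes one "name, country" line per record. The XML shape and tag names are assumptions based on the question's root.attrib['name'] and root.find("country") accesses:

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped the way the question's accesses suggest:
# a name attribute on each record and a <country> child element.
xml_data = """
<people>
    <person name="OLIVIA"><country>UK</country></person>
    <person name="RUBY"><country>US</country></person>
</people>
"""

root = ET.fromstring(xml_data)

with open('names_countries.txt', 'w') as out:
    for person in root.findall('person'):
        name = person.attrib['name']
        country = person.find('country').text
        out.write('{0}, {1}\n'.format(name, country))
```

For an XML file queried from MySQL you would use ET.parse(filename).getroot() instead of ET.fromstring, but the loop is the same.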

Empty CSV file when writing lots of data

I am currently conducting a data scraping project with Python 3 and am attempting to write the scraped data to a CSV file. My current process to do it is this:
import csv
outputFile = csv.writer(open('myFilepath', 'w'))
outputFile.writerow(['header1', 'header2'...])
for each in data:
    scrapedData = scrap(each)
    outputFile.writerow([scrapedData.get('header1', 'header 1 NA'), ...])
Once this script is finished, however, the CSV file is blank. If I just run:
import csv
outputFile = csv.writer(open('myFilepath', 'w'))
outputFile.writerow(['header1', 'header2'...])
a CSV file is produced containing the headers:
header1,header2,..
If I just scrape one item in data, for example:
outputFile.writerow(['header1', 'header2'...])
scrapedData = scrap(data[0])
outputFile.writerow([scrapedData.get('header1', 'header 1 NA'), ...])
a CSV file will be created including both the headers and the data for data[0]:
header1,header2,..
header1 data for data[0], header1 data for data[0]
Why is this the case?
When you open a file with w, it erases the previous data
From the docs
w: open for writing, truncating the file first
So when you open the file again with w after writing the scraped data, you get a blank file, and then you write the header to it, so you only see the header. Try replacing w with a. The new call to open the file would then look like
outputFile = csv.writer(open('myFilepath', 'a'))
You can find more information about the modes for opening a file here
Ref: How do you append to a file?
Edit after DYZ's comment:
You should also be closing the file after you are done appending. I would suggest using the file like the:
with open('path/to/file', 'a') as file:
    outputFile = csv.writer(file)
    # Do your work with the file
This way you don't have to worry about remembering to close it. Once the code exits the with block, the file will be closed.
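A minimal self-contained version of that pattern (dummy records standing in for the scraper's output) looks like this:

```python
import csv

# Dummy records standing in for scrap(each); the real ones come from scraping.
data = [
    {'header1': 'a1', 'header2': 'a2'},
    {'header1': 'b1'},  # missing header2 -> the .get() default is used
]

with open('scraped.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['header1', 'header2'])
    for each in data:
        writer.writerow([each.get('header1', 'header 1 NA'),
                         each.get('header2', 'header 2 NA')])
# Leaving the with block closes and flushes the file, so no rows are lost.
```

The close at the end of the with block is also what guarantees buffered rows actually reach the disk.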
I would use Pandas for this:
import pandas as pd
headers = ['header1', 'header2', ...]
scraped_df = pd.DataFrame(data, columns=headers)
scraped_df.to_csv('filepath.csv')
Here I'm assuming your data object is a list of lists.

Generate output files from template file and csv data in python

I need to generate XML files populated with data from a CSV file in Python.
I have two input files:
one CSV file named data.csv containing data like this:
ID YEAR PASS LOGIN HEX_LOGIN
14Z 2013 (3e?k<.P#H}l hex0914Z F303935303031345A
14Z 2014 EAeW+ZM..--r hex0914Z F303935303031345A
.......
One Template file named template.xml
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year></year>
<security>
<ID></ID>
<login></login>
<hex_login></hex_login>
<pass></pass>
</security>
</SecurityProfile>
I want to get as many output files as there are lines in the csv data file, each output file named YEAR_ID, with the data from the csv file in the xml fields:
Output file contents:
Content of output file #1 named 2013_0950014z:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>2013</year>
<security>
<ID>14Z</ID>
<login>hex0914</login>
<hex_login>F303935303031345A</hex_login>
<pass>(3e?k<.P#H}l</pass>
</security>
</SecurityProfile>
Content of output file #2 named 2014_0950014z:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>2014</year>
<security>
<ID>14Z</ID>
<login>hex0914</login>
<hex_login>F303935303031345A</hex_login>
<pass>EAeW+ZM..--r</pass>
</security>
</SecurityProfile>
Thank you for your suggestions.
Can you make changes to the template? If so, I would do the following to make this a bit simpler:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>{year}</year>
<security>
<ID>{id}</ID>
<login>{login}</login>
<hex_login>{hex_login}</hex_login>
<pass>{pass}</pass>
</security>
</SecurityProfile>
Then, something like this would work:
import csv

input_file_name = "some_file.csv"     # name/path of your csv file
template_file_name = "some_file.xml"  # name/path of your xml template
output_file_name = "{}_09500{}.xml"

with open(template_file_name, "r") as template_file:
    template = template_file.read()

with open(input_file_name, "r") as csv_file:
    my_reader = csv.DictReader(csv_file)
    for row in my_reader:
        with open(output_file_name.format(row["YEAR"], row["ID"]), "w") as current_out:
            # 'pass' is a Python keyword, so it can't be used as a keyword
            # argument; unpack a mapping into format() instead.
            current_out.write(template.format(**{
                "year": row["YEAR"],
                "id": row["ID"],
                "login": row["LOGIN"],
                "hex_login": row["HEX_LOGIN"],
                "pass": row["PASS"],
            }))
If you can't modify the template, or want to process it as XML instead of basic string manipulation, then it's a bit more involved.
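The {pass} placeholder is the one subtlety here: because pass is a Python keyword, template.format(pass=...) is a syntax error, while unpacking a dict works fine. A toy template (not the real one) demonstrates:

```python
# Toy template standing in for the real XML one.
template = "<year>{year}</year><pass>{pass}</pass>"

# Dict unpacking sidesteps the keyword restriction; braces inside the
# substituted *values* need no escaping, only braces in the template do.
values = {"year": "2013", "pass": "(3e?k<.P#H}l"}
result = template.format(**values)
print(result)
```
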
EDIT:
Modified answer to use csv.DictReader rather than csv.reader.
Fixed variable names opening input CSV file and writing the output. Removed 'binary' mode file operations.
import csv
from collections import defaultdict

header = '<?xml version="1.0"?><SecurityProfile xmlns="security_profile_v1">\n'
footer = '\n</SecurityProfile>'
entry = '''<security>
<ID>{0[ID]}</ID>
<login>{0[LOGIN]}</login>
<hex_login>{0[HEX_LOGIN]}</hex_login>
<pass>{0[PASS]}</pass>
</security>'''

rows = defaultdict(list)

with open('infile.csv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for item in reader:
        rows[item['YEAR']].append(item)  # index the row, not the reader

for year, data in rows.items():  # items() in Python 3
    with open('{}.xml'.format(year), 'w') as f:
        f.write(header)
        f.write('<year>{}</year>\n'.format(year))
        for record in data:
            f.write(entry.format(record))
            f.write('\n')
        f.write(footer)
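A quick self-contained check of the two pieces this answer relies on, defaultdict grouping and {0[KEY]} item-access formatting, using made-up rows matching the CSV sample:

```python
from collections import defaultdict

# Entry template using item access on a dict, as in the answer above.
entry = '<ID>{0[ID]}</ID><year>{0[YEAR]}</year>'

# Made-up rows shaped like csv.DictReader output for the sample CSV.
items = [{'ID': '14Z', 'YEAR': '2013'}, {'ID': '14Z', 'YEAR': '2014'}]

rows = defaultdict(list)
for item in items:
    rows[item['YEAR']].append(item)  # one bucket per year

lines = []
for year in sorted(rows):
    for record in rows[year]:
        lines.append(entry.format(record))
print(lines)
```
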
