I receive an XML document with many child elements. I need to extract some of the information and export it to a CSV or text file so I can import it into QuickBooks. The XML tree looks like the following:
<MODocuments>
<MODocument>
<Document>TX1126348</Document>
<DocStatus>P</DocStatus>
<DateIssued>20180510</DateIssued>
<ApplicantName>COMPANY FRUIT &amp; VEGETABLE</ApplicantName>
<MOLots>
<MOLot>
<LotID>A</LotID>
<ProductVariety>Yellow</ProductVariety>
<TotalPounds>15500</TotalPounds>
</MOLot>
<MOLot>
<LotID>B</LotID>
<ProductVariety>Yellow</ProductVariety>
<TotalPounds>175</TotalPounds>
</MOLot>
<MOLot>
<LotID>C</LotID>
<ProductVariety>Yellow</ProductVariety>
<TotalPounds>7500</TotalPounds>
</MOLot>
<MOLot>
<LotID>D</LotID>
<ProductVariety>Yellow</ProductVariety>
<TotalPounds>300</TotalPounds>
</MOLot>
</MOLots>
</MODocument>
<MODocument>
<Document>TX1126349</Document>
<DocStatus>P</DocStatus>
<DateIssued>20180511</DateIssued>
<ApplicantName>COMPANY FRUIT &amp; VEGETABLE</ApplicantName>
<MOLots>
<MOLot>
<LotID>A</LotID>
<ProductVariety>Yellow</ProductVariety>
<TotalPounds>25200</TotalPounds>
</MOLot>
<MOLot>
<LotID>B</LotID>
<ProductVariety>Yellow</ProductVariety>
<TotalPounds>16800</TotalPounds>
</MOLot>
</MOLots>
</MODocument>
<MODocument>
<Document>TX1126350</Document>
<DateIssued>20180511</DateIssued>
<ApplicantName>COMPANY FRUIT &amp; VEGETABLE</ApplicantName>
<MOLots>
<MOLot>
<LotID>A</LotID>
<ProductVariety>Yellow</ProductVariety>
<TotalPounds>14100</TotalPounds>
</MOLot>
</MOLots>
</MODocument>
</MODocuments>
I need to add up the TotalPounds values within each MODocument, so that each output row holds the document number, the applicant name, and the total pounds summed over all the MOLot entries in that one document:
TX1126348 COMPANY FRUIT & VEGETABLE 23475
TX1126349 COMPANY FRUIT & VEGETABLE 42000
TX1126350 COMPANY FRUIT & VEGETABLE 14100
Here's the code I'm working with:
import xml.etree.ElementTree as ET
tree = ET.parse('TX_959_20180514131311.xml')
root = tree.getroot()
docCert = []
docComp = []
totalPounds = []
for MODocuments in root:
    for MODocument in MODocuments:
        docCert.append(MODocument.find('Document').text)
        docComp.append(MODocument.find('ApplicantName').text)
        for MOLots in MODocument:
            for MOLot in MOLots:
                totalPounds.append(int(MOLot.find('TotalPounds').text))
for i in range(len(docCert)):
    print(i, docCert[i], ' ', docComp[i], totalPounds[i])
This is my output. I don't know how to add up the totals for each document. Please help.
0 TX1126348 COMPANY FRUIT & VEGETABLE 15500
1 TX1126349 COMPANY FRUIT & VEGETABLE 175
2 TX1126350 COMPANY FRUIT & VEGETABLE 7500
If you can use lxml, you can have the XPath sum() function sum all of the TotalPounds for you.
Example...
from lxml import etree
import csv
tree = etree.parse("TX_959_20180514131311.xml")
with open("output.csv", "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    for mo_doc in tree.xpath("/MODocuments/MODocument"):
        csvwriter.writerow([mo_doc.xpath("Document")[0].text,
                            mo_doc.xpath("ApplicantName")[0].text,
                            int(mo_doc.xpath("sum(MOLots/MOLot/TotalPounds)"))])
contents of "output.csv"...
TX1126348,COMPANY FRUIT & VEGETABLE,23475
TX1126349,COMPANY FRUIT & VEGETABLE,42000
TX1126350,COMPANY FRUIT & VEGETABLE,14100
Also, you have lots of control over quoting, delimiters, etc. by writing the output with csv.
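As a small illustration of that control, the sketch below writes the same three rows as tab-separated text with every field quoted. It writes to an in-memory buffer so it runs without the XML file. The rows list is sample data copied from the expected output, and the tab delimiter with QUOTE_ALL is only an assumption about what the importing program might want.

```python
import csv
import io

# Sample rows copied from the expected output shown earlier.
rows = [
    ["TX1126348", "COMPANY FRUIT & VEGETABLE", 23475],
    ["TX1126349", "COMPANY FRUIT & VEGETABLE", 42000],
    ["TX1126350", "COMPANY FRUIT & VEGETABLE", 14100],
]

buffer = io.StringIO()
# delimiter and quoting are the two settings most often changed for imports.
writer = csv.writer(buffer, delimiter="\t", quoting=csv.QUOTE_ALL,
                    lineterminator="\n")
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Swapping io.StringIO() for open("output.txt", "w", newline="") would write the same text to a file.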
It looks like there will be more items in totalPounds than in docCert or docComp. I think you need to do something like this:
for MODocuments in root:
    for MODocument in MODocuments:
        docCert.append(MODocument.find('Document').text)
        docComp.append(MODocument.find('ApplicantName').text)
        sub_total = 0
        for MOLots in MODocument:
            for MOLot in MOLots:
                sub_total += int(MOLot.find('TotalPounds').text)
        totalPounds.append(sub_total)
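The same per-document subtotal can probably be written more compactly with the standard library's path support and the built-in sum(). This is a minimal sketch, not the asker's exact file: SAMPLE_XML is a shortened, made-up version of the data (the applicant name is simplified so the sample needs no XML escaping), and document_totals is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up sample in the same shape as the question's XML.
SAMPLE_XML = """<MODocuments>
  <MODocument>
    <Document>TX1126348</Document>
    <ApplicantName>SAMPLE COMPANY</ApplicantName>
    <MOLots>
      <MOLot><LotID>A</LotID><TotalPounds>15500</TotalPounds></MOLot>
      <MOLot><LotID>B</LotID><TotalPounds>175</TotalPounds></MOLot>
      <MOLot><LotID>C</LotID><TotalPounds>7500</TotalPounds></MOLot>
      <MOLot><LotID>D</LotID><TotalPounds>300</TotalPounds></MOLot>
    </MOLots>
  </MODocument>
  <MODocument>
    <Document>TX1126350</Document>
    <ApplicantName>SAMPLE COMPANY</ApplicantName>
    <MOLots>
      <MOLot><LotID>A</LotID><TotalPounds>14100</TotalPounds></MOLot>
    </MOLots>
  </MODocument>
</MODocuments>"""


def document_totals(root):
    """Return (document, applicant, total pounds) for every MODocument."""
    rows = []
    # iter() finds MODocument elements at any depth below root.
    for mo_doc in root.iter("MODocument"):
        pounds = mo_doc.findall("MOLots/MOLot/TotalPounds")
        total = sum(int(el.text) for el in pounds)
        rows.append((mo_doc.findtext("Document"),
                     mo_doc.findtext("ApplicantName"),
                     total))
    return rows


rows = document_totals(ET.fromstring(SAMPLE_XML))
for row in rows:
    print(*row)
```

With a real file, ET.parse(filename).getroot() would take the place of ET.fromstring(SAMPLE_XML).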
I have an .xlsx file like this:
item price
foo 5$
poo 3$
woo 7$
moo 2$
I want to use openpyxl to open the file and add a new column to it, like this:
item price owner
foo 5$ Jim owns foo
poo 3$ Jack owns poo
woo 7$ John owns woo
moo 2$ Jay owns moo
Can anyone help me with how to do this?
My code:
file_location = 'excel_name.xlsx'
df = pd.read_excel(file_errors_location, engine='openpyxl')
for item in df['item']:
    df['sub'].append(f'bla bla owms {item}')
df.to_excel('excel_me.xlsx', engine='openpyxl')
You don't need pandas for something this simple.
Working with the 'item' column as column 'A':
import openpyxl as op
wb = op.load_workbook(r'foo.xlsx')
ws = wb["Sheet1"]
owner_list = ['owner', 'Jim', 'Jack', 'John', 'Jay']
for enum, cell in enumerate(ws['A']):
    row = enum+1
    if enum == 0:
        ws.cell(row=row, column=3).value = owner_list[enum]
    else:
        ws.cell(row=row, column=3).value = owner_list[enum] + " owns " + cell.value
wb.save('foo.xlsx')
The question I am trying to answer is
Query if a book title is available and present option of (a) increasing stock level or (b) decreasing the stock level, due to a sale. If the stock level is decreased to zero indicate to the user that the book is currently out of stock.
This is the text file
#Listing showing sample book details
#AUTHOR, TITLE, FORMAT, PUBLISHER, COST?, STOCK, GENRE
P.G. Wodehouse, Right Ho Jeeves, hb, Penguin, 10.99, 5, fiction
A. Pais, Subtle is the Lord, pb, OUP, 12.99, 2, biography
A. Calaprice, The Quotable Einstein, pb, PUP, 7.99, 6, science
M. Faraday, The Chemical History of a Candle, pb, Cherokee, 5.99, 1, science
C. Smith, Energy and Empire, hb, CUP, 60, 1, science
J. Herschel, Popular Lectures, hb, CUP, 25, 1, science
C.S. Lewis, The Screwtape Letters, pb, Fount, 6.99, 16, religion
J.R.R. Tolkein, The Hobbit, pb, Harper Collins, 7.99, 12, fiction
C.S. Lewis, The Four Loves, pb, Fount, 6.99, 7, religion
E. Heisenberg, Inner Exile, hb, Birkhauser, 24.95, 1, biography
G.G. Stokes, Natural Theology, hb, Black, 30, 1, religion
And this is the code I have so far:
def Task5():
    again = 'y'
    while again == 'y':
        desc = input('Enter the title of the book you would like to search for: ')
        for bookrecord in book_list:
            if desc in book_list:
                print('Book found')
            else:
                print('Book not found')
                break
        again = input('\nWould you like to search again(press y for yes)').lower()
I already have a function which reads from the text file:
book_list = []
def readbook():
    infile = open('book_data_file.txt')
    for row in infile:
        start = 0 # used to start at the beginning of each line
        string_builder = []
        if not(row.startswith('#')):
            for index in range(len(row)):
                if row[index] ==',' or index ==len(row)-1:
                    string_builder.append(row[start:index])
                    start = index+1
            book_list.append(string_builder)
    infile.close()
Does anyone have an idea how I can complete this task? :)
Get the titles from the book_list variable.
titles = [data[1].strip() for data in book_list]
Remove any leading or trailing whitespace from the desc variable.
desc = desc.strip()
For instance, if I'm searching for the book Popular Lectures but type the title with extra spaces around it, it won't be found in book_list. Therefore you should strip the whitespace from the input.
If the book is available, get the book name and stock value from book_list:
info = [(title, int(book_list[idx][5].strip())) for idx, title in enumerate(titles) if desc in title][0]
bk_nm, stock = info
Print the current situation
if stock == 0:
    print("{} is currently not avail".format(bk_nm))
else:
    print("{} is avail w/ stock {}".format(bk_nm, stock))
Example:
Enter the title of the book you would like to search for: Popular Lectures
Popular Lectures is avail w/ stock 1
Would you like to search again(press y for yes)y
Enter the title of the book you would like to search for: Chemical
The Chemical History of a Candle is avail w/ stock 1
Would you like to search again(press y for yes)n
Code:
book_list = []
def task5():
    titles = [data[1].strip() for data in book_list]
    again = 'y'
    while again == 'y':
        desc = input('Enter the title of the book you would like to search for: ')
        desc = desc.strip() # Remove any white space
        # Collect (title, stock) for every title that contains the search text
        matches = [(title, int(book_list[idx][5].strip())) for idx, title in enumerate(titles) if desc in title]
        if not matches:
            # Avoid an IndexError when nothing matches
            print("{} was not found".format(desc))
        else:
            bk_nm, stock = matches[0]
            if stock == 0:
                print("{} is currently not avail".format(bk_nm))
            else:
                print("{} is avail w/ stock {}".format(bk_nm, stock))
        again = input('\nWould you like to search again(press y for yes)').lower()
def read_txt():
    infile = open('book_data_file.txt')
    for row in infile:
        start = 0 # used to start at the beginning of each line
        string_builder = []
        if not (row.startswith('#')):
            for index in range(len(row)):
                if row[index] == ',' or index == len(row) - 1:
                    string_builder.append(row[start:index])
                    start = index + 1
            book_list.append(string_builder)
    infile.close()
if __name__ == '__main__':
    read_txt()
    task5()
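The code above covers the lookup but not the stock adjustment that the task also asks for. A rough sketch of that part follows, assuming each record in book_list is a list of strings in the column order of the data file (title at index 1, stock at index 5). find_book and adjust_stock are hypothetical helper names, not part of the original code.

```python
def find_book(book_list, desc):
    """Return the first record whose title contains desc, or None."""
    desc = desc.strip()
    for record in book_list:
        if desc in record[1].strip():
            return record
    return None


def adjust_stock(record, change):
    """Add change (positive or negative) to the stock, never below zero."""
    new_stock = max(0, int(record[5].strip()) + change)
    record[5] = str(new_stock)
    if new_stock == 0:
        print("{} is currently out of stock".format(record[1].strip()))
    return new_stock


# Two sample records in the layout that the file reader produces.
book_list = [
    ["J. Herschel", " Popular Lectures", " hb", " CUP", " 25", " 1", " science"],
    ["A. Pais", " Subtle is the Lord", " pb", " OUP", " 12.99", " 2", " biography"],
]

book = find_book(book_list, "Popular Lectures")
after_restock = adjust_stock(book, +3)  # stock goes from 1 to 4
after_sales = adjust_stock(book, -4)    # stock drops to 0, message is printed
```

In the interactive loop, the user's choice of (a) or (b) would decide whether change is positive or negative. Writing the updated book_list back to the file is left out here.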
I've been working on a function which will update two dictionaries (similar authors, and awards they've won) from an open text file. The text file looks something like this:
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And so on. The first line of each entry is an author's name (last name first, first name last), followed by awards they may have won, and then authors who are similar to them. This is what I've got so far:
def load_author_dicts(text_file, similar_authors, awards_authors):
    name_of_author = True
    awards = False
    similar = False
    for line in text_file:
        if name_of_author:
            author = line.split(', ')
            nameA = author[1].strip() + ' ' + author[0].strip()
            name_of_author = False
            awards = True
            continue
        if awards:
            if ',' in line:
                awards = False
                similar = True
            else:
                if nameA in awards_authors:
                    listawards = awards_authors[nameA]
                    listawards.append(line.strip())
                else:
                    listawards = []
                    listawards.append(line.strip())
                    awards_authors[nameA] = listawards
        if similar:
            if line == '\n':
                similar = False
                name_of_author = True
            else:
                sim_author = line.split(', ')
                nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
                if nameA in similar_authors:
                    similar_list = similar_authors[nameA]
                    similar_list.append(nameS)
                else:
                    similar_list = []
                    similar_list.append(nameS)
                    similar_authors[nameA] = similar_list
                continue
This works great! However, if the text file contains an entry with just a name (i.e. no awards and no similar authors), it breaks the whole thing, raising IndexError: list index out of range at this line: nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
How can I fix this? Maybe with a try/except block in that area?
Also, I wouldn't mind getting rid of those continue statements; I wasn't sure how else to keep it going. I'm still pretty new to this, so any help would be much appreciated! I keep trying things and they change another section I didn't want changed, so I figured I'd ask the experts.
How about doing it this way, just to get the data in? Then you can manipulate the dictionary any way you want.
test.txt contains your data
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And my code to parse it.
award_parse.py
data = {}
name = ""
awards = []
f = open("test.txt")
for l in f:
    # make sure the line is not blank don't process blank lines
    if not l.strip() == "":
        # if this is a name and we're not already working on an author then set the author
        # otherwise treat this as a new author and set the existing author to a key in the dictionary
        if "," in l and len(name) == 0:
            name = l.strip()
        elif "," in l and len(name) > 0:
            # check to see if recipient is already in list, add to end of existing list if he/she already
            # exists.
            if not name.strip() in data:
                data[name] = awards
            else:
                data[name].extend(awards)
            name = l.strip()
            awards = []
        # process any lines that are not blank, and do not have a ,
        else:
            awards.append(l.strip())
f.close()
for k, v in data.items():
    print("%s got the following awards: %s" % (k,v))
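For the two-dictionary version from the question, one possible approach is to track only the current author and let a blank line end the entry. The sketch below needs neither continue nor try/except. It assumes that entries are separated by blank lines (as the question's check for '\n' suggests), that name lines always contain a comma, and that award lines never do. flip_name is a hypothetical helper, and the sample list is made up for the demo.

```python
def flip_name(line):
    """Turn 'Last, First' into 'First Last'."""
    last, first = line.split(',', 1)
    return first.strip() + ' ' + last.strip()


def load_author_dicts(lines, similar_authors, awards_authors):
    """Fill both dictionaries; blank lines separate the entries."""
    author = None
    for raw in lines:
        line = raw.strip()
        if not line:
            author = None  # a blank line ends the current entry
        elif author is None:
            author = flip_name(line)  # first line of an entry is the author
            awards_authors.setdefault(author, [])
            similar_authors.setdefault(author, [])
        elif ',' in line:
            similar_authors[author].append(flip_name(line))
        else:
            awards_authors[author].append(line)


sample = [
    "Brabudy, Ray\n", "Hugo Award\n", "Ellison, Harlan\n", "\n",
    "Adams, Douglas\n", "\n",  # an entry with only a name
    "Ellison, Harlan\n", "Nebula Award\n", "Vonnegut, Kurt\n",
]
similar, awards = {}, {}
load_author_dicts(sample, similar, awards)
print(awards)
print(similar)
```

Unlike the original function, this version stores an empty list for an author who has no awards or no similar authors, which may or may not be what is wanted.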
My python 3 script takes an xml file and creates a csv file.
Small excerpt of xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<metadata>
<dc>
<title>Golden days for boys and girls, 1895-03-16, v. XVI #17</title>
<subject>Children's literature--Children's periodicals</subject>
<description>Archives &amp; Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries</description>
<publisher>James Elverson, 1880-</publisher>
<date>1895-06-15</date>
<type>Text | periodicals</type>
<format>image/jp2</format>
<handle>http://hdl.handle.net/11134/20002:860074494</handle>
<accessionNumber/>
<barcode/>
<identifier>20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870 | hdl: | http://hdl.handle.net/11134/20002:860074494</identifier>
<rights>These Materials are provided for educational and research purposes only. The University of Connecticut Libraries hold the copyright except where noted. Permission must be obtained in writing from the University of Connecticut Libraries and/or theowner(s) of the copyright to publish reproductions or quotations beyond "fair use." | The collection is open and available for research.</rights>
<creator/>
<relation/>
<coverage/>
<language/>
</dc>
</metadata>
Python3 code:
import csv
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='ctda_set1_uniqueTags.xml')
doc = ET.parse("ctda_set1_uniqueTags.xml")
root = tree.getroot()
oaidc_data = open('ctda_set1_uniqueTags.csv', 'w', encoding='utf-8')
titles = 'dc/title'
subjects = 'dc/subject'
csvwriter = csv.writer(oaidc_data)
oaidc_head = ['Title', 'Subject', 'Description', 'Publisher', 'Date', 'Type', 'Format', 'Handle', 'Accession Number', 'Barcode', 'Identifiers', 'Rights', 'Creator', 'Relation', 'Coverage', 'Language']
count = 0
for member in root.findall('dc'):
    if count == 0:
        csvwriter.writerow(oaidc_head)
        count = count + 1
    dcdata = []
    titles = member.find('title').text
    dcdata.append(titles)
    subjects = member.find('subject').text
    dcdata.append(subjects)
    descriptions = member.find('description').text
    dcdata.append(descriptions)
    publishers = member.find('publisher').text
    dcdata.append(publishers)
    dates = member.find('date').text
    dcdata.append(dates)
    types = member.find('type').text
    dcdata.append(types)
    formats = member.find('format').text
    dcdata.append(formats)
    handle = member.find('handle').text
    dcdata.append(handle)
    accessionNo = member.find('accessionNumber').text
    dcdata.append(accessionNo)
    barcodes = member.find('barcode').text
    dcdata.append(barcodes)
    identifiers = member.find('identifier').text
    dcdata.append(identifiers)
    rt = member.find('rights').text
    print(member.find('rights').text)
    dcdata.append('rt')
    ct = member.find('creator').text
    dcdata.append('ct')
    rt = member.find('relation').text
    dcdata.append('rt')
    ce = member.find('coverage').text
    dcdata.append('ce')
    lang = member.find('language').text
    dcdata.append('lang')
    csvwriter.writerow(dcdata)
oaidc_data.close()
Everything works as expected except for rt, ce, and lang. What happens is that in the csv, all the data is written with the comma delimiter. For rt, the value is always rt, for ce, ce, lang, lang, etc.
Here's a snippet of the output:
Title,Subject,Description,Publisher,Date,Type,Format,Handle,Accession Number,Barcode,Identifiers,Rights,Creator,Relation,Coverage,Language
"Golden days for boys and girls, 1895-03-16, v. XVI #17",Children's literature--Children's periodicals,"Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries","James Elverson, 1880-",1895-06-15,Text | periodicals,image/jp2,hdl.handle.net/11134/20002:860074494,,,20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870,**rt,ct,rt,ce,lang**
Some of the rights statements get very long - perhaps that's the issue. That's why I added the print(member.find('rights')) to see the output. The text is printed just fine. The text just isn't written to the csv. What I'd like is to have the value or text written for these xml tags. Any help would be appreciated.
Thanks.
Jennifer
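In the line dcdata.append('rt') there is no need for the quotes: 'rt' is a string literal, so the text rt is written instead of the variable's value. Try dcdata.append(rt). Similarly, there are unnecessary quotes in the ct, ce, and lang lines, and in the second rt line.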
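One way to make this kind of slip less likely is to loop over a list of tag names instead of writing one variable per field. The following is only a sketch under assumptions: SAMPLE_XML is a cut-down version of the excerpt, FIELDS is a hypothetical shortened tag/header list, and the output goes to an in-memory buffer rather than the real CSV file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Cut-down sample in the same shape as the excerpt in the question.
SAMPLE_XML = """<metadata>
  <dc>
    <title>Golden days for boys and girls, 1895-03-16, v. XVI #17</title>
    <publisher>James Elverson, 1880-</publisher>
    <rights>The collection is open and available for research.</rights>
    <creator/>
  </dc>
</metadata>"""

# (tag in the XML, column header in the CSV); shortened for the demo.
FIELDS = [("title", "Title"), ("publisher", "Publisher"),
          ("rights", "Rights"), ("creator", "Creator"),
          ("language", "Language")]

root = ET.fromstring(SAMPLE_XML)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow([header for _, header in FIELDS])
for member in root.findall("dc"):
    # findtext gives the element's text, or the default if the tag is missing.
    writer.writerow([member.findtext(tag, default="") for tag, _ in FIELDS])

csv_text = out.getvalue()
print(csv_text)
```

Because each value comes straight from findtext, there is no intermediate variable name to quote by mistake, and a missing or empty tag simply produces an empty cell.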
I am trying to write a function in python that opens a file and parses it into a dictionary. I am trying to make the first item in the list block the key for each item in the dictionary data. Then each item is supposed to be the rest of the list block less the first item. For some reason though, when I run the following function, it parses it incorrectly. I have provided the output below. How would I be able to parse it like I stated above? Any help would be greatly appreciated.
Function:
def parseData() :
    filename="testdata.txt"
    file=open(filename,"r+")
    block=[]
    for line in file:
        block.append(line)
        if line in ('\n', '\r\n'):
            album=block.pop(1)
            data[block[1]]=album
            block=[]
    print data
Input:
Bob Dylan
1966 Blonde on Blonde
-Rainy Day Women #12 & 35
-Pledging My Time
-Visions of Johanna
-One of Us Must Know (Sooner or Later)
-I Want You
-Stuck Inside of Mobile with the Memphis Blues Again
-Leopard-Skin Pill-Box Hat
-Just Like a Woman
-Most Likely You Go Your Way (And I'll Go Mine)
-Temporary Like Achilles
-Absolutely Sweet Marie
-4th Time Around
-Obviously 5 Believers
-Sad Eyed Lady of the Lowlands
Output:
{'-Rainy Day Women #12 & 35\n': '1966 Blonde on Blonde\n',
'-Whole Lotta Love\n': '1969 II\n', '-In the Evening\n': '1979 In Through the Outdoor\n'}
You can use groupby to group the lines, using the empty lines as delimiters. Use a defaultdict to handle repeated keys: for each section that groupby returns, take the first element as the key and extend that key's list with the remaining values.
from itertools import groupby
from collections import defaultdict
d = defaultdict(list)
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        # if k is True we have a section
        if k:
            # get key "k" which is the first line
            # from each section, val will be the remaining lines
            k,*v = val
            # add or add to the existing key/value pairing
            d[k].extend(map(str.rstrip,v))
from pprint import pprint as pp
pp(d)
Output:
{'Bob Dylan\n': ['1966 Blonde on Blonde',
'-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands'],
'Led Zeppelin\n': ['1979 In Through the Outdoor',
'-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl",
'1969 II',
'-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home']}
Python 2 does not support the k, *v unpacking syntax, so take the first line with next instead:
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, v = next(val), val
            d[k].extend(map(str.rstrip, v))
If you want to keep the newlines, remove the map(str.rstrip, ...) call.
If you want the album and songs separately for each artist:
from itertools import groupby
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, alb, songs = next(val),next(val), val
            d[k.rstrip()][alb.rstrip()] = list(map(str.rstrip, songs))
from pprint import pprint as pp
pp(d)
{'Bob Dylan': {'1966 Blonde on Blonde': ['-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or '
'Later)',
'-I Want You',
'-Stuck Inside of Mobile with the '
'Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
'-Most Likely You Go Your Way '
"(And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': {'1969 II': ['-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': ['-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}
I guess this is what you want?
Even if this is not the format you wanted, there are a few things you might learn from the answer:
use with for file handling
nice to have:
PEP8 compliant code, see http://pep8online.com/
a shebang
numpydoc
if __name__ == '__main__'
And SE does not like a list being continued by code...
#!/usr/bin/env python
"""Parse text files with songs, grouped by album and artist."""
def add_to_data(data, block):
    """
    Parameters
    ----------
    data : dict
    block : list
    Returns
    -------
    dict
    """
    artist = block[0]
    album = block[1]
    songs = block[2:]
    if artist in data:
        data[artist][album] = songs
    else:
        data[artist] = {album: songs}
    return data
def parseData(filename='testdata.txt'):
    """
    Parameters
    ----------
    filename : string
        Path to a text file.
    Returns
    -------
    dict
    """
    data = {}
    with open(filename) as f:
        block = []
        for line in f:
            line = line.strip()
            if line == '':
                data = add_to_data(data, block)
                block = []
            else:
                block.append(line)
        data = add_to_data(data, block)
    return data
if __name__ == '__main__':
    data = parseData()
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(data)
which gives:
{ 'Bob Dylan': { '1966 Blonde on Blonde': [ '-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': { '1969 II': [ '-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': [ '-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}