Getting values from an XML file that has deep keys and values - python

I have a very large XML file produced by an application; part of its tree is shown below.
There are several items under 'item', numbered 0 to 7. These names are always numbers and can range from 0 up to any number.
Each of these items contains multiple child items, all with the same structure as in the tree above. Only the item numbers 0 to 7 vary; all the other structure stays the same.
Under each item I have a value <bbmds_questiontype>, which can be Multiple Choice, Matching, or Essay.
What I need is a list of the values of <mat_formattedtext>, i.e. the output is supposed to be:
<0>
<bbmds_questiontype>Multiple Choice</bbmds_questiontype>
<mat_formattedtext>This is first question </mat_formattedtext></0>
<1>
<bbmds_questiontype>Multiple Choice</bbmds_questiontype>
<mat_formattedtext>This is second question </mat_formattedtext> </1>
<2>
<bbmds_questiontype>Essay</bbmds_questiontype>
<mat_formattedtext>This is first question </mat_formattedtext> </2>
....
I have tried several solutions, including ElementTree and xmltodict, but they all get complicated because filters have to be applied across different branches of children:
import xmltodict

with open("C:/Users/SS/Desktop/moodlexml/00001_questions.dat") as fd:
    doc = xmltodict.parse(fd.read())

shortened = doc['questestinterop']['assessment']['section']['item']
Any advice on how to proceed further would be appreciated.
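Since the tree between each item and the two tags was not shown, one hedged way to continue from shortened without hard-coding the intermediate keys is a small recursive lookup over the nested dicts/lists that xmltodict produces (a sketch, assuming shortened is the list of items from the code above):

def find_tag(node, tag):
    # Depth-first search through the nested dict/list structure xmltodict builds
    if isinstance(node, dict):
        if tag in node:
            return node[tag]
        for value in node.values():
            found = find_tag(value, tag)
            if found is not None:
                return found
    elif isinstance(node, list):
        for value in node:
            found = find_tag(value, tag)
            if found is not None:
                return found
    return None

for i, question in enumerate(shortened):
    print(i, find_tag(question, 'bbmds_questiontype'), find_tag(question, 'mat_formattedtext'))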

Have you tried using bs4 for parsing? It's simple. Check it out:
https://linuxhint.com/parse_xml_python_beautifulsoup/
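For example, a minimal bs4 sketch along those lines, assuming the tag names from the question (item, bbmds_questiontype, mat_formattedtext) and using the asker's file path:

from bs4 import BeautifulSoup

with open("C:/Users/SS/Desktop/moodlexml/00001_questions.dat") as fd:
    soup = BeautifulSoup(fd.read(), "xml")   # the "xml" parser requires lxml to be installed

# Walk every <item> and pull the question type and the formatted text, if present
for i, item in enumerate(soup.find_all("item")):
    qtype = item.find("bbmds_questiontype")
    text = item.find("mat_formattedtext")
    print(i,
          qtype.get_text() if qtype else None,
          text.get_text() if text else None)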

Related

Find multiple tags' values with lxml

I am using lxml to parse an XML document like this sample:
<compounddef xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="d2/db7/class_foo" kind="class">
  <compoundname>FooClass</compoundname>
  <sectiondef kind="public-type">
    <memberdef kind="typedef" id="d2/db7/class_bar">
      <type><ref refid="d3/d73/struct_foo" kindref="compound">StructFoo</ref></type>
      <definition>StructFooDefinition</definition>
    </memberdef>
  </sectiondef>
</compounddef>
I'm trying to get the element with the refid attribute "d3/d73/struct_foo" and with the <definition> containing the text "Foo".
There could be many refid with that value and many definitions containing Foo, but only one has this combination.
I am able to first find all the elements with that refid and then filter this list by checking which of them contains "Foo" in the <definition>, but since I'm working with a really big XML file (~1GB) and the application is time-sensitive, I wanted to avoid this.
I tried combining the various etree paths using the keyword 'and' or '//precede:...', but without success.
My last try was:
self.dox_tree_root_.xpath(".//compounddef[@kind = 'class']//memberdef[@kind='typedef'][/type/ref[@refid='%s'] and contains(definition, 'name')]" % (independent_type_refid, name)))
but it is giving me an error.
Is there a way to combine the two filters inside one command?
You can use XPath:
//a[.//ref[@refid="12345"] and contains(c, "Good")]
If I understand you correctly, this should get you close enough:
.//compounddef[@kind = 'class']//memberdef[@kind='typedef'][./type/ref[@refid='d3/d73/struct_foo']][contains(.//definition, 'Foo')]//definition
Output:
StructFooDefinition
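For reference, a runnable sketch of that query with lxml against the sample XML above; note the path here is anchored at the document root with // rather than at self.dox_tree_root_, which is an assumption about where the query is run from:

from io import BytesIO
from lxml import etree

sample = b"""<compounddef xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="d2/db7/class_foo" kind="class">
  <compoundname>FooClass</compoundname>
  <sectiondef kind="public-type">
    <memberdef kind="typedef" id="d2/db7/class_bar">
      <type><ref refid="d3/d73/struct_foo" kindref="compound">StructFoo</ref></type>
      <definition>StructFooDefinition</definition>
    </memberdef>
  </sectiondef>
</compounddef>"""

tree = etree.parse(BytesIO(sample))
hits = tree.xpath(
    "//compounddef[@kind='class']//memberdef[@kind='typedef']"
    "[./type/ref[@refid='d3/d73/struct_foo']]"
    "[contains(.//definition, 'Foo')]//definition")
print([d.text for d in hits])   # ['StructFooDefinition']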

Cannot pull data from XML files due to differences in format

I have a script that takes a bunch of XML files, all in the form of: HMDB61152.xml and pulls them all in using glob. For each file I need to pull some details about each, such as accession, name, and a list of diseases. To parse through each XML I used xmltodict because I traditionally like working with lists instead of XML files, although I may need to change my strategy due to the issues I am facing.
I am able to pull name and acc easily since all XML files have them at the same first level of the tree:
import glob
import os
import xmltodict

path = '/Users/me/Downloads/hmdb_metabolites'
for data_file in glob.glob(os.path.join(path, '*.xml')):
    diseases = []
    with open(data_file) as fd:
        doc = xmltodict.parse(fd.read())
        name = doc['metabolite']['name']
        acc = doc['metabolite']['accession']
So basically, at this point there are three options for the disease information:
There are multiple disease tags within the diseases tree, i.e. there are two or more diseases for the given accession.
There is one disease within the diseases tree, meaning the accession has only one disease.
Or there are no diseases in the diseases tree at all.
I need to write a loop that can handle any of the three cases, and that's where I am failing. Here is my approach so far:
# I get the disease root, which returns True if it has lower-level items (one or more disease within diseases)
# or False if there are no disease within diseases.
dis_root = doc['metabolite']['diseases']
if bool(dis_root) == True:
    dis_init = doc['metabolite']['diseases']['disease']
    if bool(doc['metabolite']['diseases']['disease'][0]) == True:
        for x in range(0, len(dis_init)):
            diseases.append(doc['metabolite']['diseases']['disease'][x]['name'])
    else:
        diseases.append(doc['metabolite']['diseases']['disease']['name'])
else:
    diseases = ['None']
So the problem is, for the case where there are multiple diseases, I need to pull their names in the following format: doc['metabolite']['diseases']['disease'][x]['name'] for each x in diseases. But for the ones that have only one disease, they have no index at all, so the only way I can pull the name of that one disease is by doing doc['metabolite']['diseases']['disease']['name'].
The script is failing because, as soon as we encounter a case with only one disease, it raises a KeyError when it tries to test whether doc['metabolite']['diseases']['disease'][0] == True. If anyone can help me figure this out that'd be great, or direct me to a more appropriate strategy.
Try something like:
if 0 in doc['metabolite']['diseases']['disease']:
    pass  # if 0 is a key in the array, we have multiple entries
else:
    pass  # only a single item
Found a relatively easy workaround; I simply use try in the following way:
try:
    for x in range(0, len(dis_init)):
        diseases.append(doc['metabolite']['diseases']['disease'][x]['name'])
except KeyError:
    diseases.append(doc['metabolite']['diseases']['disease']['name'])
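As an alternative to try/except, a short sketch that normalises the three cases up front, assuming the same doc produced by xmltodict (a single <disease> parses to a dict, several parse to a list, and a missing or empty <diseases> gives None):

disease_node = doc['metabolite'].get('diseases') or {}
entries = disease_node.get('disease') or []
if isinstance(entries, dict):          # exactly one disease: wrap it in a list
    entries = [entries]
diseases = [d['name'] for d in entries] or ['None']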

Parsing multiple occurrences of an item into a dictionary

Attempting to parse several separate image links out of JSON data with Python, but having some issues drilling down to the right level, which I believe comes from having a list of strings.
For the majority of the items, I've had success with the below example, pulling back everything I need. Outside of this instance, everything is a 1:1 ratio of keys:values, but for this one, there are multiple values associated with one key.
resultsdict['item_name'] = item['attribute_key']
I've been adding it all to a resultsdict={}, but am only able to get to the below sample string when I print.
INPUT:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
OUTPUT (only relevant section):
'images': [{u'VariationSpecificPictureSet': [{u'PictureURL': [u'http//imagelink1'], u'VariationSpecificValue': u'color1'}, {u'PictureURL': [u'http//imagelink2'], u'VariationSpecificValue': u'color2'}, {u'PictureURL': [u'http//imagelink3'], u'VariationSpecificValue': u'color3'}, {u'PictureURL': [u'http//imagelink4'], u'VariationSpecificValue': u'color4'}]}]
I feel like I could add ['VariationPictureSet']['PictureURL'] at the end of my initial input, but that throws an error due to the indices not being integers, but strings.
Ideally, I would like to see the output as a simple comma-separated list of just the URLs, as follows:
OUTPUT:
'images': http//imagelink1, http//imagelink2, http//imagelink3, http//imagelink4
An answer to your comment that requires a bit of code:
When using
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
you get a list with one element, so I recommend using this:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures'][0]
Now you can use:
for image in resultsdict['images']['VariationSpecificPictureSet']:
    print(image['PictureURL'])
Thanks for the help, @uzzee, it's appreciated. I kept tinkering with it and was able to pull the continuous string of all the image URLs with the following code.
resultsdict['images'] = sum([x['PictureURL'] for x in item['Variations']['Pictures'][0]['VariationSpecificPictureSet']], [])
Without the sum it looks like this and pulls in the whole list of lists...
resultsdict['images'] = [x['PictureURL'] for x in item['Variations']['Pictures'][0]['VariationSpecificPictureSet']]
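For completeness, here is the same flattening run on a tiny hand-made sample shaped like the OUTPUT snippet above (the URLs are placeholders), just to show what the sum(..., []) step is doing:

pictures = [{
    'VariationSpecificPictureSet': [
        {'PictureURL': ['http//imagelink1'], 'VariationSpecificValue': 'color1'},
        {'PictureURL': ['http//imagelink2'], 'VariationSpecificValue': 'color2'},
    ]
}]

# Each PictureURL is itself a one-element list, so sum(..., []) concatenates them
urls = sum((entry['PictureURL'] for entry in pictures[0]['VariationSpecificPictureSet']), [])
print(', '.join(urls))   # http//imagelink1, http//imagelink2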

Search a single column for a particular value in a CSV file and return an entire row

Issue
The code does not correctly identify the input (item). It simply dumps to my failure message even if such a value exists in the CSV file. Can anyone help me determine what I am doing wrong?
Background
I am working on a small program that asks for user input (function not given here), searches a specific column in a CSV file (Item) and returns the entire row. The CSV data format is shown below. I have shortened the data from the actual amount (49 field names, 18000+ rows).
Code
import csv
from collections import namedtuple
from contextlib import closing
def search():
    item = 1000001
    raw_data = 'active_sanitized.csv'
    failure = 'No matching item could be found with that item code. Please try again.'
    check = False
    with closing(open(raw_data, newline='')) as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        while check == False:
            for row in map(item_data._make, read_data):
                if row.Item == item:
                    return row
                else:
                    return failure
CSV structure
active_sanitized.csv
Item;Name;Cost;Qty;Price;Description
1000001;Name here:1;1001;1;11;Item description here:1
1000002;Name here:2;1002;2;22;Item description here:2
1000003;Name here:3;1003;3;33;Item description here:3
1000004;Name here:4;1004;4;44;Item description here:4
1000005;Name here:5;1005;5;55;Item description here:5
1000006;Name here:6;1006;6;66;Item description here:6
1000007;Name here:7;1007;7;77;Item description here:7
1000008;Name here:8;1008;8;88;Item description here:8
1000009;Name here:9;1009;9;99;Item description here:9
Notes
My experience with Python is relatively little, but I thought this would be a good problem to start with in order to learn more.
I determined the methods to open (and wrap in a close function) the CSV file, read the data via DictReader (to get the field names), and then create a named tuple to be able to quickly select the desired columns for the output (Item, Cost, Price, Name). Column order is important, hence the use of DictReader and namedtuple.
While there is the possibility of hard-coding each of the field names, I felt that if the program can read them on file open, it would be much more helpful when working on similar files that have the same column names but different column organization.
Research
CSV Header and named tuple:
What is the pythonic way to read CSV file data as rows of namedtuples?
Converting CSV data to tuple: How to split a CSV row so row[0] is the name and any remaining items are a tuple?
There were additional links of research, but I cannot post more than two.
You have three problems with this:
You return on the first failure, so it will never get past the first line.
You are reading strings from the file, and comparing to an int.
_make iterates over the dictionary keys, not the values, producing the wrong result (item_data(Item='Name', Name='Price', Cost='Qty', Qty='Item', Price='Cost', Description='Description')).
for row in (item_data(**data) for data in read_data):
    if row.Item == str(item):
        return row
return failure
This fixes the issues at hand - we check against a string, and we only return if none of the items matched (although you might want to begin converting the strings to ints in the data rather than this hackish fix for the string/int issue).
I have also changed the way you are looping - using a generator expression makes for a more natural syntax, using the normal construction syntax for named attributes from a dict. This is cleaner and more readable than using _make and map(). It also fixes problem 3.
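Folding those fixes back into the original function, a sketch of what the corrected version could look like (the file name, delimiter, and hard-coded item number are taken from the question):

import csv
from collections import namedtuple

def search(item=1000001, raw_data='active_sanitized.csv'):
    failure = 'No matching item could be found with that item code. Please try again.'
    with open(raw_data, newline='') as open_data:
        read_data = csv.DictReader(open_data, delimiter=';')
        item_data = namedtuple('item_data', read_data.fieldnames)
        for row in (item_data(**data) for data in read_data):
            if row.Item == str(item):   # DictReader yields strings, so compare as strings
                return row
    return failure

print(search(1000003))   # returns the full row for item 1000003 from the sample data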

Search for specific XML element Attribute values

Using Python ElementTree to construct and edit test messages:
Part of XML as follows:
<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201" MtchTyp="4" LastMkt="ABCD" LastPx="104.11">
The key TrdID contains a value beginning with $$ to indicate that the value is variable data and needs to be amended once the message is constructed from a template, in this case to the next sequential number stored in a dictionary. (The overall idea is to load a dictionary from a file that lists each attribute key and its associated value, such as the next sequential number; e.g. the dictionary file contains "$$+TrdID# 12345", using a space as the delimiter.)
So far my script iterates over the parsed XML and examines each indexed element in turn. There will be several fields in the XML file that require updating, so I need to avoid hard-coded references to element tags.
How can I search the element/attribute to identify if the attribute contains a key where the corresponding value starts with or contains the specific string $$?
And for reasons unknown to me we cannot use lxml!
You can use XPath.
import lxml.etree as etree
from StringIO import StringIO
xml = """<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#"
RptTyp="0"
TrdDt="20120201"
MtchTyp="4"
LastMkt="ABCD"
LastPx="104.11"/>
</FIXML>"""
tree = etree.parse(StringIO(xml))
To find elements TrdMtchRpt where the attribute TrdID starts with $$:
r = tree.xpath("//TrdMtchRpt[starts-with(@TrdID, '$$')]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
If you want to find any element where at least one attribute starts with $$ you can do this:
r = tree.xpath("//*[@*[starts-with(., '$$')]]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
Look at the documentation:
http://lxml.de/xpathxslt.html#the-xpath-method
http://www.w3schools.com/xpath/xpath_functions.asp#string
http://www.w3schools.com/xpath/xpath_syntax.asp
You can use the ElementTree package. It gives you an object with a hierarchical data structure built from the XML document.
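Since lxml is off the table for the asker, a minimal ElementTree-only sketch of the same idea, using the FIXML snippet from the question (self-closed here so it parses standalone); it walks every element and reports attributes whose values start with $$:

import xml.etree.ElementTree as ET

xml = """<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201" MtchTyp="4" LastMkt="ABCD" LastPx="104.11"/>
</FIXML>"""

root = ET.fromstring(xml)
for elem in root.iter():                      # every element in the tree
    for key, value in elem.attrib.items():    # every attribute on that element
        if value.startswith('$$'):
            print(elem.tag, key, value)       # TrdMtchRpt TrdID $$+TrdID#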
