Python development - elementtree XML and string operations - python

I am using ElementTree to load a series of XML files and parse them. As each file is parsed, I grab a few bits of data from it (a headline and a paragraph of text). I then need to grab some file names that are stored in the XML. They are contained in an element called ContentItem.
My code looks a bit like this:
for item in dirlist:
    newsML = ET.parse(item)
    NewsLines = newsML.getroot()
    HeadLine = NewsLines.getiterator("HeadLine")
    result.append(HeadLine)
    p = NewsLines.getiterator("p")
    result.append(p)
    ci = NewsLines.getiterator("ContentItem")
    for i in ci:
        result.append(i.attrib)
Now, if there were only one type of file, this would be fine, but the XML references three types (jpg, flv and mp4). As I loop through them in the view, it outputs all of them. How do I grab just the flv, or just the mp4? They don't always appear in the same order in the list either.
Is there a way to say "if it ends in .mp4, then do this action", or is there a way to do that in the template?
If I try this:
url = i.attrib
if url.get("Href", () ).endswith('jpg'):
    result.append(i.attrib)
I get the error 'tuple' object has no attribute 'endswith'. Why is this a tuple? I thought it was a dict.

You get a tuple because you supply a tuple (the parentheses) as the default return value for url.get(). Supply an empty string, and you can use its .endswith() method. Also note that the element itself has a get() method to retrieve attribute values (you do not have to go via .attrib). Example:
if i.get('Href', '').endswith('.jpg'):
    result.append(i.attrib)
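To pick out only one file type, the same check can be wrapped in a small helper. This is a minimal sketch, not the asker's actual code: the sample XML and the helper name hrefs_with_extension are made up for illustration, and it uses iter() because getiterator() was removed in Python 3.9.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real NewsML files will have more structure.
SAMPLE = """<NewsItem>
  <ContentItem Href="clip.flv"/>
  <ContentItem Href="photo.jpg"/>
  <ContentItem Href="clip.mp4"/>
  <ContentItem/>
</NewsItem>"""


def hrefs_with_extension(root, extension):
    """Return the Href of every ContentItem whose Href ends with `extension`."""
    matches = []
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for item in root.iter("ContentItem"):
        href = item.get("Href", "")  # empty string when the attribute is missing
        if href.lower().endswith(extension):
            matches.append(href)
    return matches


root = ET.fromstring(SAMPLE)
mp4_files = hrefs_with_extension(root, ".mp4")
flv_files = hrefs_with_extension(root, ".flv")
```

Because the filter looks at the attribute value rather than the position in the list, the order of the ContentItem elements does not matter.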

Related

Select Substring from Larger String and Append to List

I'm currently doing some API work with Tenable.io, and I'm having some trouble selecting substrings. I'm sending requests for scan histories, and the API responds with a continuous string of all scans in JSON format. The response I get is a very large continuous string of data, and I need to select some substrings (a few values), and copy that to a list (just for now). Getting data into a list isn't where I'm stuck - I require some serious assistance with selecting the substrings I need. Each scan has the following attributes:
id
status
is_archived
targets
scan_uuid
reindexing
time_start (unix format)
time_end (unix format)
Each of these is followed by a value (see below). I need a way to extract the values following "id":, "scan_uuid":, and "time_start": from the string (and put them in a list, just for now).
I'd like to do this without string.index, as that may break the script if the response length changes. There is also a new scan every day, so the overall length of the response will change. Given the nature of the data, I'd imagine the ideal solution would be to specify a condition that selects a number of characters after "id":, "scan_uuid":, and "time_start": and appends them to a list, with the output looking something like:
scan_id_10_response = ["12345678", "15b6e7cd-447b-84ab-84d3-48a62b18fe6c", "1639111111", etc, etc]
String is below - I've only included the data for 4 scans for simplicity's sake. I've also changed the values for security reasons, but the length & format of the values are the same.
scan_id_10_response = '{"pagination":{"offset":0,"total":119,"sort":[{"order":"DESC","name":"start_date"}],"limit":100},"history":[\
{"id":12345678,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"15b6e7cd-447b-84ab-84d3-48a62b18fe6c","reindexing":null,"time_start":1639111111,"time_end":1639111166},\
{"id":23456789,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"8a468cff-c64f-668a-3015-101c218b68ae","reindexing":null,"time_start":1632222222,"time_end":1632222255},\
{"id":34567890,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"84ea995a-584a-cc48-e352-8742a38c12ff","reindexing":null,"time_start":1639333333,"time_end":1639333344},\
{"id":45678901,"status":"completed","is_archived":false,"targets":{"custom":false,"default":null},"visibility":"public","scan_uuid":"48a95366-48a5-e468-a444-a4486cdd61a2","reindexing":null,"time_start":1639444444,"time_end":1639444455}\
]}'
You can use the standard json module to parse the JSON string.
Parsing it gives you a dict you can then work with:
import json
c = json.loads(scan_id_10_response)
Now you can, for example, create a list of lists with the desired attributes:
extracted_data = [[d['id'], d['scan_uuid'], d['time_start']] for d in c['history']]
This returns for this particular example:
[[12345678, '15b6e7cd-447b-84ab-84d3-48a62b18fe6c', 1639111111],
[23456789, '8a468cff-c64f-668a-3015-101c218b68ae', 1632222222],
[34567890, '84ea995a-584a-cc48-e352-8742a38c12ff', 1639333333],
[45678901, '48a95366-48a5-e468-a444-a4486cdd61a2', 1639444444]]
If you only want one result at a time, use a generator or iterate over the list:
gen_extracted = ([d['id'], d['scan_uuid'], d['time_start']] for d in c['history'])
If you don't want to work with a dict, I would recommend looking into regular expressions.
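For completeness, the regular-expression route could look roughly like the sketch below. It assumes that the three keys always appear in the order id, scan_uuid, time_start within each scan, as they do in the sample response; the shortened response string is made up. Parsing with json remains the more robust option, since a regex like this breaks if the key order or formatting changes.

```python
import re

# Shortened, made-up response in the same shape as the one in the question.
response = (
    '{"history":['
    '{"id":12345678,"status":"completed",'
    '"scan_uuid":"15b6e7cd-447b-84ab-84d3-48a62b18fe6c",'
    '"reindexing":null,"time_start":1639111111,"time_end":1639111166},'
    '{"id":23456789,"status":"completed",'
    '"scan_uuid":"8a468cff-c64f-668a-3015-101c218b68ae",'
    '"reindexing":null,"time_start":1632222222,"time_end":1632222255}'
    ']}'
)

# One capture group per wanted value; ".*?" skips the keys in between lazily.
pattern = re.compile(
    r'"id":(\d+)'
    r'.*?"scan_uuid":"([0-9a-f-]+)"'
    r'.*?"time_start":(\d+)'
)

# findall returns one tuple of strings per scan.
matches = pattern.findall(response)
```

Each tuple in matches holds the id, scan_uuid and time_start of one scan as strings, which is close to the flat list of strings the question asks for.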

Whoosh: How to get search excerpts as a list

From the Whoosh documentation I can get matched search terms with some context with:
results = mysearcher.search(myquery)
for hit in results:
    print(hit["title"])
    # Assume "content" field is stored
    print(hit.highlights("content"))
I'd like to access the "highlights" as a list of separate items (so that I can enumerate them in an HTML list), but the output of hit.highlights() appears to be of type <class 'str'>, and it's not clear to me that there's a unique delimiter.
Is there a way I can get a list instead of everything concatenated into one string?
You can split the highlighted string on its separator, "...", to get a list.
It's as simple as this:
highlights_list = hit.highlights("content").split("...")
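A plain split can leave stray whitespace or empty strings in the list, so a small wrapper may be handy. This sketch assumes the excerpts really are joined by "..." as described above; the helper name split_highlights and the example string are hypothetical and stand in for the value returned by hit.highlights("content"). Note that any literal "..." inside the matched text would also be treated as a separator.

```python
def split_highlights(highlighted, separator="..."):
    """Split a highlights string into a list of trimmed, non-empty excerpts."""
    return [part.strip() for part in highlighted.split(separator) if part.strip()]


# Made-up string standing in for hit.highlights("content").
example = (
    'first <b class="match term0">match</b> here'
    '...second <b class="match term0">match</b> there'
)
excerpts = split_highlights(example)
```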

How to insert new line between Python dict value

I'm trying to manipulate the dict output shown in an HTML page using Flask and a Jinja2 template.
I'm looking for help with:
Adding new lines between the dict values.
Making the http text clickable for https://example.com:8087
The way I created my dictionary is:
usedPort[node][z_port] = (z_owner, docker_stack, url)
The expected result is:
john_doe
Zeppelin-Engineer-Individual-TAP
https://example.com:8087
But what I actually get is:
(john_doe, Zeppelin-Engineer-Individual-TAP, https://example.com:8087)
No print operation is involved; I don't want to print the output in the terminal, I want to show this dict value in the HTML page instead.
For the http text, I tried the webbrowser module, but unfortunately it didn't work.
You are using a tuple, and you do not say how you display it. If you simply pass the tuple to something that displays it (be it print or anything else), it will use the default representation, which is what you get.
Instead, pass what you actually want to display:
'\n'.join(str(x) for x in my_tuple)  # or just '\n'.join(my_tuple) if every item is already a string
If you want to go overkill, you can define your own tuple type by inheriting from tuple (the collections module provides UserList and UserDict but no UserTuple), although subclassing tuple can cause problems for some uses:
class Tuple(tuple):
    def __repr__(self):
        return '\n'.join(str(x) for x in self)
Then you would have to use Tuple(...) instead of just (...), but by default you would get newlines between values wherever the tuple's representation is displayed.
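Since the target is an HTML page rather than a terminal, a newline character alone will not show up as a line break in the browser. The sketch below is one possible way to build the HTML fragment in Python before handing it to the template; the helper name render_values is made up. In a Flask/Jinja2 template with autoescaping on, the resulting string would presumably have to be marked as safe (for example with the safe filter) so it is not escaped a second time.

```python
from html import escape


def render_values(values):
    """Join values with <br> tags, turning http(s) URLs into links."""
    parts = []
    for value in values:
        text = escape(str(value))  # escape before embedding in HTML
        if text.startswith(("http://", "https://")):
            text = '<a href="{0}">{0}</a>'.format(text)
        parts.append(text)
    return "<br>\n".join(parts)


entry = ("john_doe", "Zeppelin-Engineer-Individual-TAP", "https://example.com:8087")
html_fragment = render_values(entry)
```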

Parsing multiple occurrences of an item into a dictionary

I'm attempting to parse several separate image links from JSON data in Python, but I'm having trouble drilling down to the right level, which I believe is due to having a list of strings.
For the majority of the items, I've had success with the below example, pulling back everything I need. Outside of this instance, everything is a 1:1 ratio of keys:values, but for this one, there are multiple values associated with one key.
resultsdict['item_name'] = item['attribute_key']
I've been adding it all to a resultsdict={}, but am only able to get to the below sample string when I print.
INPUT:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
OUTPUT (only relevant section):
'images': [{u'VariationSpecificPictureSet': [{u'PictureURL': [u'http//imagelink1'], u'VariationSpecificValue': u'color1'}, {u'PictureURL': [u'http//imagelink2'], u'VariationSpecificValue': u'color2'}, {u'PictureURL': [u'http//imagelink3'], u'VariationSpecificValue': u'color3'}, {u'PictureURL': [u'http//imagelink4'], u'VariationSpecificValue': u'color4'}]
I feel like I could add ['VariationPictureSet']['PictureURL'] at the end of my initial input, but that throws an error due to the indices not being integers, but strings.
Ideally, I would like to see the output as a simple comma-separated list of just the URLs, as follows:
OUTPUT:
'images': http//imagelink1, http//imagelink2, http//imagelink3, http//imagelink4
This answers your comment, which needed a bit of code.
When using
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures']
you get a list with one element, so I recommend using this:
for item in data['Item']:
    resultsdict['images'] = item['Variations']['Pictures'][0]
Now you can use
for image in resultsdict['images']['VariationSpecificPictureSet']:
    print(image['PictureURL'])
Thanks for the help, @uzzee, it's appreciated. I kept tinkering with it and was able to pull all the image URLs into one flat list with the following code.
resultsdict['images'] = sum([x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']],[])
Without the sum it looks like this and pulls in the whole list of lists...
resultsdict['images'] = [x['PictureURL'] for x in item['variations']['Pictures'][0]['VariationSpecificPictureSet']]
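An alternative to sum(..., []) for flattening is itertools.chain.from_iterable, followed by a join to get the comma-separated output the question asks for. This is only a sketch: the picture_sets data is made up in the shape of the sample output, standing in for the list found under 'VariationSpecificPictureSet'.

```python
from itertools import chain

# Made-up data in the same shape as the output shown in the question.
picture_sets = [
    {"PictureURL": ["http//imagelink1"], "VariationSpecificValue": "color1"},
    {"PictureURL": ["http//imagelink2"], "VariationSpecificValue": "color2"},
    {"PictureURL": ["http//imagelink3", "http//imagelink4"],
     "VariationSpecificValue": "color3"},
]

# Flatten the per-variation URL lists into one list.
urls = list(chain.from_iterable(entry["PictureURL"] for entry in picture_sets))

# Comma-separated string of just the URLs.
images = ", ".join(urls)
```

chain.from_iterable avoids building a new intermediate list for every addition, which sum(..., []) does, so it scales better when there are many variations.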

Search for specific XML element Attribute values

Using Python ElementTree to construct and edit test messages:
Part of XML as follows:
<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201" MtchTyp="4" LastMkt="ABCD" LastPx="104.11">
The attribute TrdID contains a value beginning with $$ to mark it as variable data that needs to be amended once the message is constructed from a template, in this case to the next sequential number. That number is stored in a dictionary: the overall idea is to load a dictionary from a file that lists each attribute key with its associated value, such as the next sequential number (e.g. the dictionary file contains $$+TrdID# 12345, using a space as the delimiter).
So far my script iterates the parsed XML and examines each indexed element in turn. There will be several fields in the xml file that require updating so I need to avoid using hard coded references to element tags.
How can I search the element/attribute to identify if the attribute contains a key where the corresponding value starts with or contains the specific string $$?
And for reasons unknown to me we cannot use lxml!
You can use XPath.
import lxml.etree as etree
from io import StringIO  # on Python 2: from StringIO import StringIO
xml = """<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#"
RptTyp="0"
TrdDt="20120201"
MtchTyp="4"
LastMkt="ABCD"
LastPx="104.11"/>
</FIXML>"""
tree = etree.parse(StringIO(xml))
To find elements TrdMtchRpt where the attribute TrdID starts with $$:
r = tree.xpath("//TrdMtchRpt[starts-with(@TrdID, '$$')]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
If you want to find any element where at least one attribute starts with $$, test each attribute individually (starts-with(@*, '$$') would only look at the first attribute):
r = tree.xpath("//*[@*[starts-with(., '$$')]]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
Look at the documentation:
http://lxml.de/xpathxslt.html#the-xpath-method
http://www.w3schools.com/xpath/xpath_functions.asp#string
http://www.w3schools.com/xpath/xpath_syntax.asp
You can use the ElementTree package. It gives you an object with a hierarchical data structure built from the XML document.
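Since lxml is ruled out in the question, the same search can be done with the standard library by walking every element and checking each attribute value. This is a minimal sketch under assumptions: the replacements dict is a hypothetical stand-in for the dictionary loaded from the file, and the XML is the fragment from the question.

```python
import xml.etree.ElementTree as ET

xml = """<FIXML>
  <TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201"
              MtchTyp="4" LastMkt="ABCD" LastPx="104.11"/>
</FIXML>"""

# Hypothetical lookup table, as if loaded from the dictionary file.
replacements = {"$$+TrdID#": "12345"}

root = ET.fromstring(xml)
found = []
for element in root.iter():  # visits the root and every descendant
    # Copy the items so the attributes can be updated while looping.
    for name, value in list(element.attrib.items()):
        if value.startswith("$$"):
            found.append((element.tag, name, value))
            # Substitute the placeholder when a replacement is known.
            element.set(name, replacements.get(value, value))
```

No element tags or attribute names are hard-coded, so the loop picks up any field whose value starts with $$. Use "$$" in value instead of startswith if the marker may appear in the middle of the value.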
