Python & JSON | Can't access dictionary using an index - python

I have a program that looks like this:
import json
import requests
article_name = "BT Centre"
article_api_url = "https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles={}".format(article_name)
called_data = requests.get(article_api_url)
formatted_data = called_data.json()
print(formatted_data)
pages = formatted_data["query"]["pages"]
print(pages)
first_page = pages[0]["extract"]
print(first_page)
For the first print statement, where it prints the whole JSON, it returns this:
{
'batchcomplete': '',
'query':{
'pages':{
'18107207':{
'pageid': 18107207,
'ns': 0,
'title':'BT Centre',
'extract': "The BT Centre is the global headquarters and registered office of BT Group..."
}
}
}
}
When I try to access the "extract" data with the first_page variable, it returns:
Traceback (most recent call last):
File "wiki_json_printer.py", line 15, in <module>
first_page = pages[0]["extract"]
KeyError: 0
The problem is, I can't set first_page to pages["18107207"]["extract"] because the Page ID changes for every article.
Edit: Solution from Ann Zen works:
You can use a for loop to loop through the keys of the pages
dictionary, and detect which one is the ID via the str.isdigit()
method:
for key in pages:
if key.isdigit():
print(pages[key]["extract"])

You can use a for loop to loop through the keys of the pages dictionary, and detect which one is the ID via the str.isdigit() method:
for key in pages:
if key.isdigit():
print(pages[key]["extract"])

You could use next on an iterator on the dict to find the first key:
...
key = next(iter(pages))
first_page = pages[key]["extract"]
...

pages is dictionary not a list you can't select it by index, use it key
print(pages['18107207']['extract'])
of course the following will work because the key is 18107207
for key in pages:
print(pages[key]["extract"])

Related

What would be the best way to cycle through an array of elements from HTML in order to use 2 separate tag names in the order they came?

Not really sure how to word this question properly, but I'm basically playing around with python and using Selenium to scrape a website and I'm trying to create a JSON file with the data.
Here's the goal I'm aiming to achieve:
{
"main1" : {
"sub1" : "data",
"sub2" : "data",
"sub3" : "data",
"sub4" : "data"
},
"main2" : {
"sub1" : "data",
"sub2" : "data",
"sub3" : "data",
"sub4" : "data"
}
}
The problem I'm facing at the moment is that the website has no indentation or child elements. It looks like this (but longer and actual copy, of course):
<h3>Main1</h3>
<p>Sub1</p>
<p>Sub2</p>
<p>Sub3</p>
<p>Sub4</p>
<h3>Main2</h3>
Now I want to iterate through the HTML in order to use the <h3> tags as the parent ("Main" in the JSON example) and <p> tags as the children(sub[num]). I'm new to both python and Selenium, so I may have done this wrong, but I've tried using items.find_elements_by_tag_name('el') to separate two, but I don't know how to put them back together in the order that they originally came.
I then tried looping through all the elements and separating the tags using if (item.tag_name == "el"): loops. This works perfectly when I print the results of each loop, but when it comes to putting them together in a JSON file, I have the same issue as the previous method where I cannot seem to get the 2 to join. I've tried a few variations and I either get key errors or only the last item in the loop gets recorded.
Just for reference, here's the code for this step:
items = browser.find_element_by_xpath(
'//*[#id="main-content"]') #Main Content
itemList = items.find_elements_by_xpath(".//*")
statuses = [
"Status1",
"Status2",
"Status3",
"Status4"
]
for item in itemList: #iterate through the HTML
if (item.tag_name == "h3"): #Separate H3 Tags
main = item.text
print("======================================")
print(main)
print("======================================")
if (item.tag_name == 'p'): #Separate P tags
for status in statuses:
if(status in item.text): #Filter P tags to only display info that contains words in the Status array
delimeters = ":", "(", "See"
regexPattern = "|".join(map(re.escape, delimeters))
zoneData = re.split(regexPattern, item.text)
#Split P tags into separate parts
sub1 = zoneData[0]
sub2 = zoneData[1].translate({ord('*'): None})
sub3 = zoneData[2].translate({ord(")"): None})
print(sub1)
print(sub2)
print(sub3)
The final option I've decided to try is to try going through all the HTML again, but using enumerate() and using the element's IDs and including all the tags between the 2 IDs, but I'm not really sure what my plan of action is with this just yet.
In general, the last option seems a bit convoluted and I'm pretty certain there's a simpler way to do this. What would you suggest?
Here's my idea, but I didn't do the data part, you can add it later.
I assume that there's no duplicate in main name, or else you will lose some info.
items = browser.find_element_by_xpath(
'//*[#id="main-content"]') #Main Content
itemList = items.find_elements_by_xpath(".//p|.//h3") # only finds h3 or p
def construct(item_list):
current_main = ''
final_dict: dict = {}
for item in item_list:
if item.tag_name == "h3":
current_main = item.text
final_dict[current_main] = {} # create empty dict inside main. remove if you want to update the main dict
if item.tag_name == "p":
p_name = item.text
final_dict[current_main][p_name] = "data"
return final_dict

List out auto scaling group names with a specific application tag using boto3

I was trying to fetch auto scaling groups with Application tag value as 'CCC'.
The list is as below,
gweb
prd-dcc-eap-w2
gweb
prd-dcc-emc
gweb
prd-dcc-ems
CCC
dev-ccc-wer
CCC
dev-ccc-gbg
CCC
dev-ccc-wer
The script I coded below gives output which includes one ASG without CCC tag.
#!/usr/bin/python
import boto3
client = boto3.client('autoscaling',region_name='us-west-2')
response = client.describe_auto_scaling_groups()
ccc_asg = []
all_asg = response['AutoScalingGroups']
for i in range(len(all_asg)):
all_tags = all_asg[i]['Tags']
for j in range(len(all_tags)):
if all_tags[j]['Key'] == 'Name':
asg_name = all_tags[j]['Value']
# print asg_name
if all_tags[j]['Key'] == 'Application':
app = all_tags[j]['Value']
# print app
if all_tags[j]['Value'] == 'CCC':
ccc_asg.append(asg_name)
print ccc_asg
The output which I am getting is as below,
['prd-dcc-ein-w2', 'dev-ccc-hap', 'dev-ccc-wfd', 'dev-ccc-sdf']
Where as 'prd-dcc-ein-w2' is an asg with a different tag 'gweb'. And the last one (dev-ccc-msp-agt-asg) in the CCC ASG list is missing. I need output as below,
dev-ccc-hap-sdf
dev-ccc-hap-gfh
dev-ccc-hap-tyu
dev-ccc-mso-hjk
Am I missing something ?.
In boto3 you can use Paginators with JMESPath filtering to do this very effectively and in more concise way.
From boto3 docs:
JMESPath is a query language for JSON that can be used directly on
paginated results. You can filter results client-side using JMESPath
expressions that are applied to each page of results through the
search method of a PageIterator.
When filtering with JMESPath expressions, each page of results that is
yielded by the paginator is mapped through the JMESPath expression. If
a JMESPath expression returns a single value that is not an array,
that value is yielded directly. If the result of applying the JMESPath
expression to a page of results is a list, then each value of the list
is yielded individually (essentially implementing a flat map).
Here is how it looks like in Python code with mentioned CCP value for Application tag of Auto Scaling Group:
import boto3
client = boto3.client('autoscaling')
paginator = client.get_paginator('describe_auto_scaling_groups')
page_iterator = paginator.paginate(
PaginationConfig={'PageSize': 100}
)
filtered_asgs = page_iterator.search(
'AutoScalingGroups[] | [?contains(Tags[?Key==`{}`].Value, `{}`)]'.format(
'Application', 'CCP')
)
for asg in filtered_asgs:
print asg['AutoScalingGroupName']
Elaborating on Michal Gasek's answer, here's an option that filters ASGs based on a dict of tag:value pairs.
def get_asg_name_from_tags(tags):
asg_name = None
client = boto3.client('autoscaling')
while True:
paginator = client.get_paginator('describe_auto_scaling_groups')
page_iterator = paginator.paginate(
PaginationConfig={'PageSize': 100}
)
filter = 'AutoScalingGroups[]'
for tag in tags:
filter = ('{} | [?contains(Tags[?Key==`{}`].Value, `{}`)]'.format(filter, tag, tags[tag]))
filtered_asgs = page_iterator.search(filter)
asg = filtered_asgs.next()
asg_name = asg['AutoScalingGroupName']
try:
asgX = filtered_asgs.next()
asgX_name = asg['AutoScalingGroupName']
raise AssertionError('multiple ASG\'s found for {} = {},{}'
.format(tags, asg_name, asgX_name))
except StopIteration:
break
return asg_name
eg:
asg_name = get_asg_name_from_tags({'Env':env, 'Application':'app'})
It expects there to be only one result and checks this by trying to use next() to get another. The StopIteration is the "good" case, which then breaks out of the paginator loop.
I got it working with below script.
#!/usr/bin/python
import boto3
client = boto3.client('autoscaling',region_name='us-west-2')
response = client.describe_auto_scaling_groups()
ccp_asg = []
all_asg = response['AutoScalingGroups']
for i in range(len(all_asg)):
all_tags = all_asg[i]['Tags']
app = False
asg_name = ''
for j in range(len(all_tags)):
if 'Application' in all_tags[j]['Key'] and all_tags[j]['Value'] in ('CCP'):
app = True
if app:
if 'Name' in all_tags[j]['Key']:
asg_name = all_tags[j]['Value']
ccp_asg.append(asg_name)
print ccp_asg
Feel free to ask if you have any doubts.
The right way to do this isn't via describe_auto_scaling_groups at all but via describe_tags, which will allow you to make the filtering happen on the server side.
You can construct a filter that asks for tag application instances with any of a number of values:
Filters=[
{
'Name': 'key',
'Values': [
'Application',
]
},
{
'Name': 'value',
'Values': [
'CCC',
]
},
],
And then your results (in Tags in the response) are all the times when a matching tag is applied to an autoscaling group. You will have to make the call multiple times, passing back NextToken every time there is one, to go through all the pages of results.
Each result includes an ASG ID that the matching tag is applied to. Once you have all the ASG IDs you are interested in, then you can call describe_auto_scaling_groups to get their names.
yet another solution, in my opinion simple enough to extend:
client = boto3.client('autoscaling')
search_tags = {"environment": "stage"}
filtered_asgs = []
response = client.describe_auto_scaling_groups()
for group in response['AutoScalingGroups']:
flattened_tags = {
tag_info['Key']: tag_info['Value']
for tag_info in group['Tags']
}
if search_tags.items() <= flattened_tags.items():
filtered_asgs.append(group)
print(filtered_asgs)

Selecting values from a JSON file in Python

I am getting JIRA data using the following python code,
how do I store the response for more than one key (my example shows only one KEY but in general I get lot of data) and print only the values corresponding to total,key, customfield_12830, summary
import requests
import json
import logging
import datetime
import base64
import urllib
serverURL = 'https://jira-stability-tools.company.com/jira'
user = 'username'
password = 'password'
query = 'project = PROJECTNAME AND "Build Info" ~ BUILDNAME AND assignee=ASSIGNEENAME'
jql = '/rest/api/2/search?jql=%s' % urllib.quote(query)
response = requests.get(serverURL + jql,verify=False,auth=(user, password))
print response.json()
response.json() OUTPUT:-
http://pastebin.com/h8R4QMgB
From the the link you pasted to pastebin and from the json that I saw, its a you issues as list containing key, fields(which holds custom fields), self, id, expand.
You can simply iterate through this response and extract values for keys you want. You can go like.
data = response.json()
issues = data.get('issues', list())
x = list()
for issue in issues:
temp = {
'key': issue['key'],
'customfield': issue['fields']['customfield_12830'],
'total': issue['fields']['progress']['total']
}
x.append(temp)
print(x)
x is list of dictionaries containing the data for fields you mentioned. Let me know if I have been unclear somewhere or what I have given is not what you are looking for.
PS: It is always advisable to use dict.get('keyname', None) to get values as you can always put a default value if key is not found. For this solution I didn't do it as I just wanted to provide approach.
Update: In the comments you(OP) mentioned that it gives attributerror.Try this code
data = response.json()
issues = data.get('issues', list())
x = list()
for issue in issues:
temp = dict()
key = issue.get('key', None)
if key:
temp['key'] = key
fields = issue.get('fields', None)
if fields:
customfield = fields.get('customfield_12830', None)
temp['customfield'] = customfield
progress = fields.get('progress', None)
if progress:
total = progress.get('total', None)
temp['total'] = total
x.append(temp)
print(x)

add an array of linked document _ids to couchdb documents in python

I want to add a links property to each couchdb document based on data in a csv file.
the value of the links property is to be an array of dicts containing the couchdb _id of the linked document and the linkType
When I run the script i get a links error (see error info below)
I am not sure how to create the dict key links if it doesn't exist and add the link data, or otherwise append to the links array if it does exist.
an example of a document with the links will look like this:
{
_id: p_3,
name: 'Smurfette'
links: [
{to_id: p_2, linkType: 'knows'},
{to_id: o_56, linkType: 'follows'}
]
}
python script for processing the csv file:
#!/usr/bin/python
# coding: utf-8
# Version 1
#
# csv fields: ID,fromType,fromID,toType,toID,LinkType,Directional
import csv, sys, couchdb
def csv2couchLinks(database, csvfile):
# CouchDB Database Connection etc
server = couchdb.Server()
#assumes that couchdb runs on http://localhost:5984
db = server[database]
#assumes that db is already created
# CSV file
data = csv.reader(open(csvfile, "rb")) # Read in the CSV file rb=read/binary
csv_links= csv.DictReader(open(csvfile, "rb"))
def makeLink(from_id, to_id, linkType):
# get doc from db
doc = db[from_id]
# construct link object
link = {'to_id':to_id, 'linkType':linkType}
# add link reference to array at key 'links'
if doc['links'] in doc:
doc['links'].append(link)
else:
doc['links'] = [link]
# update the record in the database
db[doc.id] = doc
# read each row in csv file
for row in csv_links:
# get entityTypes as lowercase and entityIDs
fromType = row['fromType'].lower()
fromID = row['fromID']
toType = row['toType'].lower()
toID = row['toID']
linkType = row['LinkType']
# concatenate 'entity type' and 'id' to make couch '_id'
fromIDcouch = fromType[0]+'_'+fromID #eg 'p_2' <= person 2
toIDcouch = toType[0]+'_'+toID
makeLink(fromIDcouch, toIDcouch, linkType)
makeLink(toIDcouch, fromIDcouch, linkType)
# Run csv2couchLinks() if this is not an imported module
if __name__ == '__main__':
DATABASE = sys.argv[1]
CSVFILE = sys.argv[2]
csv2couchLinks(DATABASE,CSVFILE)
error info:
$ python LINKS_csv2couchdb_v1.py "qmhonour" "./tablesAsCsv/links.csv"
Traceback (most recent call last):
File "LINKS_csv2couchdb_v1.py", line 65, in <module>
csv2couchLinks(DATABASE,CSVFILE)
File "LINKS_csv2couchdb_v1.py", line 57, in csv2couchLinks
makeLink(fromIDcouch, toIDcouch, linkType)
File "LINKS_csv2couchdb_v1.py", line 33, in makeLink
if doc['links'] in doc:
KeyError: 'links'
Another option is condensing the if block to this:
doc.setdefault('links', []).append(link)
The dictionary's setdefault method checks to see if links exists in the dictionary, and if it doesn't, it creates a key and makes the value an empty list (the default). It then appends link to that list. If links does exist, it just appends link to the list.
def makeLink(from_id, to_id, linkType):
# get doc from db
doc = db[from_id]
# construct link object
link = {'to_id':to_id, 'linkType':linkType}
# add link reference to array at key 'links'
doc.setdefault('links', []).append(link)
# update the record in the database
db[doc.id] = doc
Replace:
if doc['links'] in doc:
With:
if 'links' in doc:

How do I access a dictionary value for use with the urllib module in python?

Example - I have the following dictionary...
URLDict = {'OTX2':'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=OTX2&action=view_all',
'RAB3GAP':'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=RAB3GAP1&action=view_all',
'SOX2':'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=SOX2&action=view_all',
'STRA6':'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=STRA6&action=view_all',
'MLYCD':'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=MLYCD&action=view_all'}
I would like to use urllib to call each url in a for loop, how can this be done?
I have successfully done this with with the urls in a list format like this...
OTX2 = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=OTX2&action=view_all'
RAB3GAP = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=RAB3GAP1&action=view_all'
SOX2 = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=SOX2&action=view_all'
STRA6 = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=STRA6&action=view_all'
MLYCD = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=MLYCD&action=view_all'
URLList = [OTX2,RAB3GAP,SOX2,STRA6,PAX6,MLYCD]
for URL in URLList:
sourcepage = urllib.urlopen(URL)
sourcetext = sourcepage.read()
but I want to also be able to print the key later when returning data. Using a list format the key would be a variable and thus not able to access it for printing, I would lonly be able to print the value.
Thanks for any help.
Tom
Have you tried (as a simple example):
for key, value in URLDict.iteritems():
print key, value
Doesn't look like a dictionary is even necessary.
dbs = ['OTX2', 'RAB3GAP', 'SOX2', 'STRA6', 'PAX6', 'MLYCD']
urlbase = 'http://lsdb.hgu.mrc.ac.uk/variants.php?select_db=%s&action=view_all'
for db in dbs:
sourcepage = urllib.urlopen(urlbase % db)
sourcetext = sourcepage.read()
I would go about it like this:
for url_key in URLDict:
URL = URLDict[url_key]
sourcepage = urllib.urlopen(URL)
sourcetext = sourcepage.read()
The url is obviously URLDict[url_key] and you can retain the key value within the name url_key. For exemple:
print url_key
On the first iteration will printOTX2.

Categories