I am trying to retrieve the URLs from the sub-elements, given comp1 and comp2 as input to the Python script:
{
"main1": {
"comp1": {
"url": [
"http://kcdclcm.com",
"http://dacklsd.com"
]
},
"comp2": {
"url": [
"http://dccmsdlkm.com",
"http://clsdmcsm.com"
]
}
},
"main2": {
"comp3": {
"url": [
"http://csdc.com",
"http://uihjkn.com"
]
},
"comp4": {
"url": [
"http://jkll.com",
"http://ackjn.com"
]
}
}
}
Following is the snippet of the Python function I am using to try to grab the URLs:
import json
data = json.load(open('test.json'))
def geturl(comp):
    if comp in data[comp]:
        for url in data[comp]['url']:
            print(url)
geturl('comp1')
geturl('comp2')
I totally understand the error is in the 4th and 5th lines of the script, since I am trying to grab the URL information from the second-level element of the JSON data without passing the first-level element 'main1' or 'main2'. The same script works fine if I replace the 4th and 5th lines as below:
    if comp in data['main1']:
        for url in data['main1'][comp]['url']:
In my case, I would not know main1 and main2, as the user would just pass the comp1, comp2, comp3, or comp4 part as input to the script. Is there a way to find the URL information given that only the second-level element is known?
Any inputs would be highly appreciated.
You need to iterate through the keys/values of the dict to check whether the second-level key you are searching for is present:
import json
data = json.load(open('test.json'))
def geturl(comp):
    for k, v in data.items():
        if comp in v and 'url' in v[comp]:
            print("\n".join(v[comp]['url']))
geturl('comp1')
geturl('comp2')
If you want to search the URLs given only the comp key, looking inside every main, you just need to do it like this:
import json
data = json.load(open('test.json'))
def geturl(comp):
    for mainKey in data:
        main = data[mainKey]
        if comp in main:
            urls = main[comp]['url']
            for url in urls:
                print(url)
geturl('comp1')
geturl('comp2')
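If the caller needs the URLs rather than printed output, the same idea can return a list instead. A minimal sketch using the sample data from the question (inlined here so the snippet is self-contained):

```python
def get_urls(data, comp):
    """Collect the URLs stored under `comp`, whichever top-level key it lives in."""
    urls = []
    for main in data.values():
        if comp in main and 'url' in main[comp]:
            urls.extend(main[comp]['url'])
    return urls

data = {
    "main1": {"comp1": {"url": ["http://kcdclcm.com", "http://dacklsd.com"]}},
    "main2": {"comp3": {"url": ["http://csdc.com"]}},
}
print(get_urls(data, 'comp1'))  # ['http://kcdclcm.com', 'http://dacklsd.com']
```

Returning a list also means an unknown comp simply yields an empty list instead of printing nothing.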
Problem: the output of this code repeats a lot of the same entries in the final list, making it far longer than it should be.
The goal is to complete the query and then print the final list with all the cities within each region.
[
{
"name": "Herat",
"id": "AF~HER~Herat"
}
]
[
{
"name": "Herat",
"id": "AF~HER~Herat"
},
{
"name": "Kabul",
"id": "AF~KAB~Kabul"
}
]
[
{
"name": "Herat",
"id": "AF~HER~Herat"
},
{
"name": "Kabul",
"id": "AF~KAB~Kabul"
},
{
"name": "Kandahar",
"id": "AF~KAN~Kandahar"
}
]
My goal is to get a list of city IDs. First, I do a GET request and parse the JSON response to get the country IDs into a list.
Second: I have a for loop that makes another GET request for the region IDs, but now I need to add the country IDs to the API URL. I do that by calling .format on the URL in the GET request, iterate through all the countries and their respective region IDs, parse them, and store them in a list.
Third: I have another for loop that makes another GET request for the city IDs, looping through all the cities under the region IDs collected above, and collects the city IDs that I actually need.
Code :
from requests.auth import HTTPBasicAuth
import requests
import json
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def countries():
    data = requests.get("https://localhost/api/netim/v1/countries/", verify=False, auth=HTTPBasicAuth("admin", "admin"))
    rep = data.json()
    a = []
    for elem in rep['items']:
        a.extend([elem.get("id", "")])
    print(a)
    return a

def regions():
    ids = []
    for c in countries():
        url = requests.get("https://localhost/api/netim/v1/countries/{}/regions".format(c), verify=False, auth=HTTPBasicAuth("admin", "admin"))
        response = url.json()
        for cid in response['items']:
            ids.extend([cid.get("id", "")])
        data = []
        for r in ids:
            url = requests.get("https://localhost/api/netim/v1/regions/{}/cities".format(r), verify=False, auth=HTTPBasicAuth("admin", "admin"))
            response = url.json()
            data.extend([{"name": r.get("name", ""), "id": r.get("id", "")} for r in response['items']])
        print(json.dumps(data, indent=4))
    return data

regions()
print(regions())
You will see the output contains several copies of the same entry.
I'm not a programmer, so I'm not sure where I'm getting it wrong.
It looks as though the output you're concerned about is due to the fact that you're printing data as you iterate through it in the regions() method.
Try removing the line:
print(json.dumps(data, indent=4))
Also, and more importantly: you're setting data to an empty list every time you iterate over an item in countries(). You should declare that variable once, before the initial loop.
You're already printing the final result when you call the function, so printing as you iterate only really makes sense if you're debugging and need to review the data as you go through it.
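Putting both suggestions together, the loop structure would look roughly like this. This is a sketch, not the exact original script: the HTTP calls are hidden behind a fetch parameter standing in for requests.get(url, ...).json(), so the shape of the fix is easier to see:

```python
def regions(country_ids, fetch):
    """fetch(url) stands in for requests.get(url, ...).json()."""
    # first gather every region id for every country
    region_ids = []
    for c in country_ids:
        resp = fetch("https://localhost/api/netim/v1/countries/{}/regions".format(c))
        region_ids.extend(item.get("id", "") for item in resp["items"])

    # declare the result list ONCE, after the country loop, so earlier
    # regions are not re-fetched and re-appended on every iteration
    data = []
    for r in region_ids:
        resp = fetch("https://localhost/api/netim/v1/regions/{}/cities".format(r))
        data.extend({"name": i.get("name", ""), "id": i.get("id", "")}
                    for i in resp["items"])
    return data  # print once, at the call site
```

Each city now appears exactly once, because the city loop runs a single time over the complete region list.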
I'm trying to scrape a website and get an items list from it using Python. I parsed the HTML using BeautifulSoup and built a JSON object using json.loads(data). The JSON object looks like this:
{ ".1768j8gv7e8__0":{
"context":{
//some info
},
"pathname":"abc",
"showPhoneLoginDialog":false,
"showLoginDialog":false,
"showForgotPasswordDialog":false,
"isMobileMenuExpanded":false,
"showFbLoginEmailDialog":false,
"showRequestProductDialog":false,
"isContinueWithSite":true,
"hideCoreHeader":false,
"hideVerticalMenu":false,
"sequenceSeed":"web-157215950176521",
"theme":"default",
"offerCount":null
},
".1768j8gv7e8.6.2.0.0__6":{
"categories":[
],
"products":{
"count":12,
"items":[
{
//item info
},
{
//item info
},
{
//item info
}
],
"pageSize":50,
"nextSkip":100,
"hasMore":false
},
"featuredProductsForCategory":{
},
"currentCategory":null,
"currentManufacturer":null,
"type":"Search",
"showProductDetail":false,
"updating":false,
"notFound":false
}
}
I need the items list from the products section. How can I extract that?
Just do:
products = jsonObject[list(jsonObject.keys())[1]]["products"]["items"]
Import the json package and map every entry to a list of items if it has any.
This solution is more universal: it checks all the entries in your JSON and finds all the items without hardcoding the index of an element.
import json

data = '{"p1": { "pathname":"abc" }, "p2": { "pathname":"abcd", "products": { "items" : [1,2,3]} }}'

# use the json package to convert the JSON string to a dictionary
jsonData = json.loads(data)
type(jsonData)  # dict

# use a list comprehension to iterate over all the entries in the JSON:
#   itemData['products']["items"] - select the items from an entry
#   if "products" in itemData - check whether the given entry has products
[itemData['products']["items"] for itemId, itemData in jsonData.items() if "products" in itemData]
Edit: added comments to code
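One caveat: the comprehension above returns a list of item lists (one per entry that has products). If a single flat list is wanted, a nested comprehension flattens it. A sketch on the same sample data:

```python
import json

data = '{"p1": {"pathname": "abc"}, "p2": {"pathname": "abcd", "products": {"items": [1, 2, 3]}}}'
jsonData = json.loads(data)

# outer loop over the entries, inner loop over each entry's items
items = [item
         for entry in jsonData.values() if "products" in entry
         for item in entry["products"]["items"]]
print(items)  # [1, 2, 3]
```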
I'll just call the object you got the JSON from "response", and then pick a sample key from the items array, like itemId:
import json

json_obj = json.load(response)
array = []
for i in json_obj['items']:
    array.append(i['itemId'])
print(array)
Suppose I have a JSON file called jsondata.json:
{
"apiURL": [
{
"name":"Target",
"url":"https://redsky.target.com/v2/plp/collection/13562231,14919690,13728423,14919033,13533833,13459265,14917313,13519319,13533837,14919691,13479115,47778362,15028201,51467685,50846848,50759802,50879657,13219631,13561421,52062271,14917361,51803965,13474244,13519318?key=eb2551e4accc14f38cc42d32fbc2b2ea&pricing_store_id=2088&multichannel_option=basics&storeId=321"
},
{
"name":"Safeway",
"url":"https://shop.safeway.com/bin/safeway/product/aemaisle?aisleId=1_23_2&storeId=1483"
}
]
}
I want to tell my script to retrieve data from the APIs those URLs point to, as follows:
# Load JSON containing URLs of APIs of grocery stores
with open(json_data, 'r') as data_f:
    data_dict = json.load(data_f)

# Organize API URLs
for apiurl in data_dict['apiURL']:
    responses.append('')
    responses[index] = requests.get(apiurl['url'])
    responses[index].raise_for_status()
    storenames.append(apiurl['name'])
    index += 1
first_target_item = responses[0].json()['search_response']['items']['Item'][0]['title']
first_safeway_item = responses[1].json()['productsinfo'][0]['description']
As you can see, my current implementation requires me to manually enter in my script which keys to parse from each API (last two lines). I want to eventually be able to retrieve information from a dynamic number of grocery stores, but each website stores data on its items under a different key of its API.
How can I automate the process (e.g. store the key to parse from in jsondata.json) so that I don't have to update my script every time I add a new grocery store?
If you are okay with modifying jsondata.json, you can keep an accessKeys array alongside each entry, like this:
{
    "name":"Target",
    "accessKeys": ["search_response", "items", "Item", "0", "title"],
    "url":"https://redsky.target.com/v2/plp/collection/13562231,14919690,13728423,14919033,13533833,13459265,14917313,13519319,13533837,14919691,13479115,47778362,15028201,51467685,50846848,50759802,50879657,13219631,13561421,52062271,14917361,51803965,13474244,13519318?key=eb2551e4accc14f38cc42d32fbc2b2ea&pricing_store_id=2088&multichannel_option=basics&storeId=321"
}
In your Python code:

keys = ["search_response", "items", "Item", "0", "title"]  # apiurl['accessKeys']
target_item = responses[0].json()
for key in keys:
    if isinstance(target_item, list):
        # list indices come through as strings in the JSON config
        key = int(key)
    target_item = target_item[key]
You can automate this further:

def get_keys(data, keys):
    for key in keys:
        if isinstance(data, list):
            # list indices come through as strings in the JSON config
            key = int(key)
        data = data[key]
    return data
items = []
for index, apiurl in enumerate(data_dict['apiURL']):
    responses.append('')
    responses[index] = requests.get(apiurl['url'])
    responses[index].raise_for_status()
    storenames.append(apiurl['name'])
    items.append(get_keys(responses[index].json(), apiurl['accessKeys']))
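To illustrate, get_keys just walks the nested structure one key at a time. A self-contained sketch (the sample dict below is made up to mirror the shape of the Target response):

```python
def get_keys(data, keys):
    # walk the nested structure one key at a time;
    # list indices arrive as strings ("0") from the JSON config
    for key in keys:
        if isinstance(data, list):
            key = int(key)
        data = data[key]
    return data

sample = {"search_response": {"items": {"Item": [{"title": "first item"}]}}}
print(get_keys(sample, ["search_response", "items", "Item", "0", "title"]))
# first item
```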
I'm trying to parse the following JSON data and get URL values using a Python function. From the JSON example below, I would like to get the URLs under the jobs tag and store them in two arrays: one array for URLs that have a color tag, and the other for URLs that do not. Once the two arrays are ready, I would like to return them. I'm very new to Python and need some help with this.
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"actions":[ ],
"description":"This is a TSG level folder.",
"displayName":"CONSOLIDATED",
"displayNameOrNull":null,
"fullDisplayName":"CONSOLIDATED",
"fullName":"CONSOLIDATED",
"name":"CONSOLIDATED",
"url":"https://cyggm.com/job/CONSOLIDATED/",
"healthReport":[
{
"description":"Projects enabled for building: 187 of 549",
"iconClassName":"icon-health-20to39",
"iconUrl":"health-20to39.png",
"score":34
}
],
"jobs":[
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"yyfyiff",
"url":"https://tdyt.com/job/
CONSOLIDATED/job/yfiyf/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"Ops-Prod-Jobs",
"url":"https://ygduey.com/job/
CONSOLIDATED/job/Ops-Prod-Jobs/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"TEST-DATA-MGMT",
"url":"https://futfu.com/job/
CONSOLIDATED/job/TEST-DATA-MGMT/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"TESTING-OPS",
"url":"https://gfutfu.com/job/
CONSOLIDATED/job/TESTING-OPS/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"Performance_Engineering Team",
"url":"https://ytdyt.com/job/
CONSOLIDATED/job/Performance_Engineering%20Team/"
},
{
"_class":"hudson.model.FreeStyleProject",
"name":"test",
"url":"https://tduta.com/job/
CONSOLIDATED/job/test/",
"color":"notbuilt"
}
],
"primaryView":{
"_class":"hudson.model.AllView",
"name":"all",
"url":"https://fuyfi.com/job/
CONSOLIDATED/"
},
"views":[
{
"_class":"hudson.model.AllView",
"name":"all",
"url":"https://utfufu.com/job/
CONSOLIDATED/"
}
]
}
The following is the Python code I used to get the jobs data, but I'm not able to iterate through the jobs data to get all the URLs; I'm only getting one at a time if I change the code:
req = requests.get(url, verify=False, auth=(username, password))
j = json.loads(req.text)
jobs = j['jobs']
print(jobs[1]['url'])
I'm getting the 2nd URL here, but there is no way to check whether this entry has a color tag.
First of all, your JSON is improperly formatted. You will have to use a JSON formatter to check its validity and fix any issues.
That said, you'll have to read in the file as a string with
In [87]: with open('data.json', 'r') as f:
...: data = f.read()
...:
Then using the json library, load the data into a dict
In [88]: d = json.loads(data)
You can then use 2 list comprehensions to get the data you want
In [90]: no_color = [record['url'] for record in d['jobs'] if 'color' not in record]
In [91]: color = [record['url'] for record in d['jobs'] if 'color' in record]
In [93]: no_color
Out[93]:
['https://tdyt.com/job/CONSOLIDATED/job/yfiyf/',
'https://ygduey.com/job/CONSOLIDATED/job/Ops-Prod-Jobs/',
'https://futfu.com/job/CONSOLIDATED/job/TEST-DATA-MGMT/',
'https://gfutfu.com/job/CONSOLIDATED/job/TESTING-OPS/',
'https://ytdyt.com/job/CONSOLIDATED/job/Performance_Engineering%20Team/']
In [94]: color
Out[94]: ['https://tduta.com/job/CONSOLIDATED/job/test/']
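Since the question asks for a function that returns the two arrays, the comprehensions can be wrapped up like this (a sketch; the inline sample dict here is trimmed to two jobs from the question's JSON):

```python
def split_job_urls(d):
    """Return (color_urls, no_color_urls) from the 'jobs' list."""
    color = [job['url'] for job in d['jobs'] if 'color' in job]
    no_color = [job['url'] for job in d['jobs'] if 'color' not in job]
    return color, no_color

d = {"jobs": [
    {"name": "test", "url": "https://tduta.com/job/CONSOLIDATED/job/test/", "color": "notbuilt"},
    {"name": "yyfyiff", "url": "https://tdyt.com/job/CONSOLIDATED/job/yfiyf/"},
]}
color, no_color = split_job_urls(d)
```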
I think I am messing up my XPath. What I am trying to do is get the information from each row of the table on this page.
This is what I have so far, but it's not outputting what I'm looking for.
import requests
from lxml import etree

r = requests.get('http://mtgoclanteam.com/Cards?edition=DTK')
doc = etree.HTML(r.text)

# get list of cards
cards = [card for card in doc.xpath('id("cardtable")/x:tbody/x:tr[1]/x:td[3]')]
for card in cards:
    print(card)
The primary problem here is that the actual document served from the server contains an empty table:
<table id="cardtable" class="cardlist"/>
The data is filled in after the page loads by the embedded javascript that follows the empty table element:
<script>
$('#cardtable').dataTable({
"aLengthMenu": [[25, 100, -1], [25, 100, "All"]],
"bDeferRender": true,
"aaSorting": [],
"bPaginate": false,
"aaData": [
...DATA IS HERE...
],
"aoColumns": [
{ "sTitle": "Card name", "sWidth": "260" },
{ "sTitle": "Rarity", "sWidth": "40" },
{ "sTitle": "Buy", "sWidth": "80" },
{ "sTitle": "Sell", "sWidth": "80" },
{ "sTitle": "Bots with stock" }]
})
</script>
The data itself is contained in the aaData element of the dictionary that is passed to the dataTable() method. Extracting this in Python is going to be tricky (this isn't just a JSON document). Possibly a suitable regular expression applied to the script text would get you what you want (or just iterate over the lines of the script and take the one after the aaData key).
For example:
import pprint
import json
import requests
from lxml import etree

r = requests.get('http://mtgoclanteam.com/Cards?edition=DTK')
doc = etree.HTML(r.text)
script = doc.xpath('id("templatemo_content")/script')[0].text

found = False
result = None
for line in script.splitlines():
    if found:
        if '[' in line:
            result = line
            break
    if 'aaData' in line:
        found = True

if result:
    result = json.loads('[' + result + ']')
    pprint.pprint(result)
This is ugly and fragile (it would break if the format of the script changed), but it works for the current input.
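For completeness, here is what the regular-expression route might look like. It is just as fragile as the line scan, and it assumes aaData is followed by an "aoColumns" key, as in the script above (the sample script text below is made up for illustration):

```python
import json
import re

script = '''
$('#cardtable').dataTable({
    "aaData": [
        ["Card A", "Rare", "1.0", "0.5", "bot1"],
        ["Card B", "Common", "0.1", "0.05", "bot2"]
    ],
    "aoColumns": [{"sTitle": "Card name"}]
})
'''

# non-greedy match from the aaData opening bracket up to the closing
# bracket that immediately precedes the next key
m = re.search(r'"aaData":\s*\[(.*?)\]\s*,\s*"aoColumns"', script, re.DOTALL)
rows = json.loads('[' + m.group(1) + ']') if m else None
print(rows)
```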