I'm trying to scrape a website and get items list from it using python. I parsed the html using BeaufitulSoup and made a JSON file using json.loads(data). The JSON object looks like this:
{ ".1768j8gv7e8__0":{
"context":{
//some info
},
"pathname":"abc",
"showPhoneLoginDialog":false,
"showLoginDialog":false,
"showForgotPasswordDialog":false,
"isMobileMenuExpanded":false,
"showFbLoginEmailDialog":false,
"showRequestProductDialog":false,
"isContinueWithSite":true,
"hideCoreHeader":false,
"hideVerticalMenu":false,
"sequenceSeed":"web-157215950176521",
"theme":"default",
"offerCount":null
},
".1768j8gv7e8.6.2.0.0__6":{
"categories":[
],
"products":{
"count":12,
"items":[
{
//item info
},
{
//item info
},
{
//item info
}
],
"pageSize":50,
"nextSkip":100,
"hasMore":false
},
"featuredProductsForCategory":{
},
"currentCategory":null,
"currentManufacturer":null,
"type":"Search",
"showProductDetail":false,
"updating":false,
"notFound":false
}
}
I need the items list from product section. How can I extract that?
Just do:
products = jsonObject[list(jsonObject.keys())[1]]["products"]["items"]
import json packagee and map every entry to a list of items if it has any:
This solution is more universal, it will check all items in your json and find all the items without hardcoding the index of an element
import json
data = '{"p1": { "pathname":"abc" }, "p2": { "pathname":"abcd", "products": { "items" : [1,2,3]} }}'
# use json package to convert json string to dictionary
jsonData = json.loads(data)
type(jsonData) # dictionary
# use "list comprehension" to iterate over all the items in json file
# itemData['products']["items"] - select items from data
# if "products" in itemData.keys() - check if given item has products
[itemData['products']["items"] for itemId, itemData in jsonData.items() if "products" in itemData.keys()]
Edit: added comments to code
I'll just call the URL of the JSON file you got from BeautifulSoup "response" and then put in a sample key in the items array, like itemId:
import json
json_obj = json.load(response)
array = []
for i in json_obj['items']:
array[i] = i['itemId']
print(array)
Related
I'm new to Python and I'm quite stuck (I've gone through multiple other stackoverflows and other sites and still can't get this to work).
I've the below json coming out of an API connection
{
"results":[
{
"group":{
"mediaType":"chat",
"queueId":"67d9fb5e-26b2-4db5-b062-bbcfa8d2ca0d"
},
"data":[
{
"interval":"2021-01-14T13:12:19.000Z/2022-01-14T13:12:19.000Z",
"metrics":[
{
"metric":"nOffered",
"qualifier":null,
"stats":{
"max":null,
"min":null,
"count":14,
"count_negative":null,
"count_positive":null,
"sum":null,
"current":null,
"ratio":null,
"numerator":null,
"denominator":null,
"target":null
}
}
],
"views":null
}
]
}
]
}
and what I'm mainly looking to get out of it is (or at least something as close as)
MediaType
QueueId
NOffered
Chat
67d9fb5e-26b2-4db5-b062-bbcfa8d2ca0d
14
Is something like that possible? I've tried multiple things and I either get the whole of this out in one line or just get different errors.
The error you got indicates you missed that some of your values are actually a dictionary within an array.
Assuming you want to flatten your json file to retrieve the following keys: mediaType, queueId, count.
These can be retrieved by the following sample code:
import json
with open(path_to_json_file, 'r') as f:
json_dict = json.load(f)
for result in json_dict.get("results"):
media_type = result.get("group").get("mediaType")
queue_id = result.get("group").get("queueId")
n_offered = result.get("data")[0].get("metrics")[0].get("count")
If your data and metrics keys will have multiple indices you will have to use a for loop to retrieve every count value accordingly.
Assuming that the format of the API response is always the same, have you considered hardcoding the extraction of the data you want?
This should work: With response defined as the API output:
response = {
"results":[
{
"group":{
"mediaType":"chat",
"queueId":"67d9fb5e-26b2-4db5-b062-bbcfa8d2ca0d"
},
"data":[
{
"interval":"2021-01-14T13:12:19.000Z/2022-01-14T13:12:19.000Z",
"metrics":[
{
"metric":"nOffered",
"qualifier":'null',
"stats":{
"max":'null',
"min":'null',
"count":14,
"count_negative":'null',
"count_positive":'null',
"sum":'null',
"current":'null',
"ratio":'null',
"numerator":'null',
"denominator":'null',
"target":'null'
}
}
],
"views":'null'
}
]
}
]
}
You can extract the results as follows:
results = response["results"][0]
{
"mediaType": results["group"]["mediaType"],
"queueId": results["group"]["queueId"],
"nOffered": results["data"][0]["metrics"][0]["stats"]["count"]
}
which gives
{
'mediaType': 'chat',
'queueId': '67d9fb5e-26b2-4db5-b062-bbcfa8d2ca0d',
'nOffered': 14
}
I am getting following response from an API, I want to extract Phone Number from this object in Python. How can I do that?
{
"ParsedResults": [
{
"TextOverlay": {
"Lines": [
{
"Words": [
{
"WordText": "+971555389583", //this field
"Left": 0,
"Top": 5,
"Height": 12,
"Width": 129
}
],
"MaxHeight": 12,
"MinTop": 5
}
],
"HasOverlay": true,
"Message": "Total lines: 1"
},
"TextOrientation": "0",
"FileParseExitCode": 1,
"ParsedText": "+971555389583 \r\n",
"ErrorMessage": "",
"ErrorDetails": ""
}
],
"OCRExitCode": 1,
"IsErroredOnProcessing": false,
"ProcessingTimeInMilliseconds": "308",
"SearchablePDFURL": "Searchable PDF not generated as it was not requested."**strong text**}
Store the API response to a variable. Let's call it response.
Now convert the JSON string to a Python dictionary using the json module.
import json
response_dict = json.loads(response)
Now traverse through the response_dict to get the required text.
phone_number = response_dict["ParsedResults"][0]["TextOverlay"]["Lines"][0]["Words"][0]["WordText"]
Wherever the dictionary value is an array, [0] is used to access the first element of the array. If you want to access all elements of the array, you will have to loop through the array.
You have to parse the resulting stirng into a dictionary using the library json, afterwards you can traverse the result by looping over the json structure like this:
import json
raw_output = '{"ParsedResults": [ { "Tex...' # your api response
json_output = json.loads(raw_output)
# iterate over all lists
phone_numbers = []
for parsed_result in json_output["ParsedResults"]:
for line in parsed_result["TextOverlay"]["Lines"]:
# now add all phone numbers in "Words"
phone_numbers.extend([word["WordText"] for word in line["Words"]])
print(phone_numbers)
You may want to check if all keys exist within that process, depending on the API you use, like
# ...
for line in parsed_result["TextOverlay"]["Lines"]:
if "Words" in line: # make sure key exists
phone_numbers.extend([word["WordText"] for word in line["Words"]])
# ...
I'm trying to parse the following JSON data and get URL values using python Function.From the below JSON example I would like to get the URL from under the Jobs tag and store it in 2 arrays. 1 array will store URL that has color tag and other will store URL that do not have color tag. Once the 2 arrays are ready I would like to return these two arrays. I'm very new to python and need some help with this.
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"actions":[ ],
"description":"This is a TSG level folder.",
"displayName":"CONSOLIDATED",
"displayNameOrNull":null,
"fullDisplayName":"CONSOLIDATED",
"fullName":"CONSOLIDATED",
"name":"CONSOLIDATED",
"url":"https://cyggm.com/job/CONSOLIDATED/",
"healthReport":[
{
"description":"Projects enabled for building: 187 of 549",
"iconClassName":"icon-health-20to39",
"iconUrl":"health-20to39.png",
"score":34
}
],
"jobs":[
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"yyfyiff",
"url":"https://tdyt.com/job/
CONSOLIDATED/job/yfiyf/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"Ops-Prod-Jobs",
"url":"https://ygduey.com/job/
CONSOLIDATED/job/Ops-Prod-Jobs/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"TEST-DATA-MGMT",
"url":"https://futfu.com/job/
CONSOLIDATED/job/TEST-DATA-MGMT/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"TESTING-OPS",
"url":"https://gfutfu.com/job/
CONSOLIDATED/job/TESTING-OPS/"
},
{
"_class":"com.cloudbees.hudson.plugins.folder.Folder",
"name":"Performance_Engineering Team",
"url":"https://ytdyt.com/job/
CONSOLIDATED/job/Performance_Engineering%20Team/"
},
{
"_class":"hudson.model.FreeStyleProject",
"name":"test",
"url":"https://tduta.com/job/
CONSOLIDATED/job/test/",
"color":"notbuilt"
}
],
"primaryView":{
"_class":"hudson.model.AllView",
"name":"all",
"url":"https://fuyfi.com/job/
CONSOLIDATED/"
},
"views":[
{
"_class":"hudson.model.AllView",
"name":"all",
"url":"https://utfufu.com/job/
CONSOLIDATED/"
}
]
}
The following is the python code I used to get the jobs data but then I'm not able to iterate through the jobs data to get all URL. I'm only getting 1 at a time if I change the code
req = requests.get(url, verify=False, auth=(username, password))
j = json.loads(req.text)
jobs = j['jobs']
print(jobs[1]['url'])
I'm getting 2nd URL here but no way to check if this entry has color tag
First of all, your JSON is improperly formatted. You will have to use a JSON formatter to check its validity and fix any issues.
That said, you'll have to read in the file as a string with
In [87]: with open('data.json', 'r') as f:
...: data = f.read()
...:
Then using the json library, load the data into a dict
In [88]: d = json.loads(data)
You can then use 2 list comprehensions to get the data you want
In [90]: no_color = [record['url'] for record in d['jobs'] if 'color' not in record]
In [91]: color = [record['url'] for record in d['jobs'] if 'color' in record]
In [93]: no_color
Out[93]:
['https://tdyt.com/job/CONSOLIDATED/job/yfiyf/',
'https://ygduey.com/job/CONSOLIDATED/job/Ops-Prod-Jobs/',
'https://futfu.com/job/CONSOLIDATED/job/TEST-DATA-MGMT/',
'https://gfutfu.com/job/CONSOLIDATED/job/TESTING-OPS/',
'https://ytdyt.com/job/CONSOLIDATED/job/Performance_Engineering%20Team/']
In [94]: color
Out[94]: ['https://tduta.com/job/CONSOLIDATED/job/test/']
I'm kinda new JSON and python and i wish to use the keys and values of JSON to compare it.
I'm getting the JSON from a webpage using requests lib.
Recently, I've done this:
import requests;
URL = 'https://.../update.php';
PARAMS = { 'platform':'0', 'authcode':'code', 'app':'60' };
request = requests.get( url=URL, params=PARAMS );
data = request.json( );
I used this loop to get the keys and values from that json:
for key, value in data.items( ):
print( key, value );
it return JSON part like this:
rescode 0
desc success
config {
"app":"14",
"os":"2",
"version":"6458",
"lang":"zh-CN",
"minimum":"5",
"versionName":"3.16.0-6458",
"md5":"",
"explain":"",
"DiffUpddate":[ ]
}
But in Firefox using pretty print i get different result look like this:
{
"rescode": 0,
"desc": "success",
"config": "{
\n\t\"app\":\"14\",
\n\t\"os\":\"2\",
\n\t\"version\":\"6458\",
\n\t\"lang\":\"zh-CN\",
\n\t\"minimum\":\"5\",
\n\t\"versionName\":\"3.16.0-6458\",
\n\t\"md5\":\"\",
\n\t\"explain\":\"\",
\n\t\"DiffUpddate\":[\n\t\t\n\t]\n
}"
}
What I'm planing to do is:
if data['config']['version'] == '6458':
print('TRUE!');
But everytime i get this error:
TypeError: string indices must be integers
You need to parse the config
json.loads(data['config'])['version']
Or edit the PHP to return an associative array rather than a string for the config object
I am trying to retrieve the urls from the sub elements given comp1 and comp2 as input to the python script
{
"main1": {
"comp1": {
"url": [
"http://kcdclcm.com",
"http://dacklsd.com"
]
},
"comp2": {
"url": [
"http://dccmsdlkm.com",
"http://clsdmcsm.com"
]
}
},
"main2": {
"comp3": {
"url": [
"http://csdc.com",
"http://uihjkn.com"
]
},
"comp4": {
"url": [
"http://jkll.com",
"http://ackjn.com"
]
}
}
}
Following is the snippet of the python function, I am trying to use to grab the urls
import json
data = json.load(open('test.json'))
def geturl(comp):
if comp in data[comp]:
for url in data[comp]['url']:
print url
geturl('comp1')
geturl('comp2')
I totally understand the error is in the 4th and 5th line of the script, since i am trying to grab the url information from the second element of the json data without passing the first element 'main1' or 'main2'. Same script works fine if I replace the 4th and 5th line as below:
if comp in data['main1']:
for url in data['main1'][comp]['url']:
In my case, i would not know main1 and main2 as the user would just pass comp1, comp2, comp3 and comp4 part as input to the script. Is there a way to find the url information given that only the second element is known
Any inputs would be highly appreciated.
You need to iterate through the keys/values in the dict to check if the second level key you are searching for is present:
import json
data = json.load(open('test.json'))
def geturl(comp):
for k, v in data.items():
if comp in v and 'url' in v[comp]:
print "%s" % "\n".join(v[comp]['url'])
geturl('comp1')
geturl('comp2')
If you want to search the urls with only comp key in every main, you just need to do it like this:
import json
data = json.load(open('test.json'))
def geturl(comp):
for mainKey in data:
main = data[mainKey]
if comp in main:
urls = main[comp]['url']
for url in urls:
print url
geturl('comp1')
geturl('comp2')