I have the following Python code:
import requests
import json
from bs4 import BeautifulSoup
url = requests.get('https://www.perfectimprints.com/custom-promos/20492/Beach-Balls.html')
source = BeautifulSoup(url.text, 'html.parser')
products = source.find_all('div', class_="product_wrapper")
def get_product_details(product):
product_name = product.find('div', class_="product_name").a.text
sku = product.find('div', class_="product_sku").text
product_link = product.find('div', class_="product_image_wrapper").find("a")["href"]
src = product.find('div', class_="product_image_wrapper").find('a').find("img")["src"]
return {
"title": product_name,
"link": product_link,
"sku": sku,
"src": src
}
all_products = [get_product_details(product) for product in products]
with open("products.json", "w") as write_file:
json.dump(all_products, write_file)
print("Success")
This code works perfectly as written. The problem is I want the structure instead of
[
{
"title": "12\" Beach Ball",
"link": "/promos/PI-255-751/12-Beach-Ball.html?cid=20492",
"sku": " \n\t\t\t\t#PI-255-751\n\t\t\t",
"src": "https://12f598f3b6e7e912e4cd-a182d9508ed57781ad8837d0e4f7a945.ssl.cf5.rackcdn.com/thumb/751_group.jpg"
},
]
I want it to be:
{
"items": [
{
"title": "12\" Beach Ball",
"link": "/promos/PI-255-751/12-Beach-Ball.html?cid=20492",
"sku": " \n\t\t\t\t#PI-255-751\n\t\t\t",
"src": "https://12f598f3b6e7e912e4cd-a182d9508ed57781ad8837d0e4f7a945.ssl.cf5.rackcdn.com/thumb/751_group.jpg"
},
]
}
Here's a link to what I have working in Repl.it, just so you don't have to set up your own: https://repl.it/repls/AttractiveDimpledTheory
Side note: Would love to also be able to remove all of the \n and \t in the skus if possible.
Here you're dumping your all_products list directly to JSON:
with open("products.json", "w") as write_file:
json.dump(all_products, write_file)
The JSON you want just has that list in an object. Something like
with open("products.json", "w") as write_file:
json.dump({'items': all_products}, write_file)
should do what you want.
Generally speaking there's a 1:1 relationship between your Python data structure and the JSON it generates. If you build the right Python data structure you'll get the right JSON. Here we're using a dict (which maps to a JSON object) to wrap your existing list (which maps to a JSON array).
Side note: Would love to also be able to remove all of the \n and \t in the skus if possible.
Assuming you also want to remove spaces, you can just use str.strip(), which strips whitespace by default:
return {
"title": product_name,
"link": product_link,
"sku": sku.strip(), # <-- here
"src": src
}
Related
Can I get specific values from the json?
I'm trying to get the id, but I don't know how exactly to do it.
This is the json:
{
"searchType": "Movie",
"expression": "Lord Of The Rings 2022.json",
"results": [{
"id": "tt18368278",
"resultType": "Title",
"image": "https://m.media-amazon.com/images/M/MV5BZTMwZjUzNGMtMWI3My00ZGJmLWFmYWEtYjk2YWYxYzI2NWRjXkEyXkFqcGdeQXVyODY0NzcxNw##._V1_Ratio1.7600_AL_.jpg",
"title": "The Lord of the Rings Superfans Review the Rings of Power",
"description": "(2022 Video)"
}],
"errorMessage": ""
}
I just want to get the values of result, but I want get specific values, for example the id.
This is my code:
import requests
import json
movie = input("Movies:")
base_url = ("https://imdb-api.com/en/API/SearchMovie/myapi/"+movie)
r = (requests.get(base_url))
b = r.json()
print(b['results'])
Considering your json valid, and to accommodate more than one result, you could do:
[...]
r = (requests.get(base_url))
b = r.json()
for result in b['results']:
print(result['id'])
To get just one item (first item from array), you can do:
print(b['results'][0]['id'])
Requests documentation: https://requests.readthedocs.io/en/latest/
I'm very new to programming so excuse any terrible explanations. Basically I have 1000 json files all that need to have the same text added to the end. Here is an example:
This is what it looks like now:
{"properties": {
"files": [
{
"uri": "image.png",
"type": "image/png"
}
],
"category": "image",
"creators": [
{
"address": "wallet address",
"share": 100
}
]
}
}
Which I want to look like this:
{"properties": {
"files": [
{
"uri": "image.png",
"type": "image/png"
}
],
"category": "image",
"creators": [
{
"address": "wallet address",
"share": 100
}
]
},
"collection": {"name": "collection name"}
}
I've tried my best with append and update but it always tells me there is no attribute to append. I also don't really know what I'm doing.
This will be embarrassing but here is what I tried and failed.
import json
entry= {"collection": {"name": "collection name"}}
for i in range((5)):
a_file = open("./testjsons/" + str(i) + ".json","r")
json_obj = json.load(a_file)
print(json_obj)
json_obj["properties"].append(entry)
a_file = open(str(i) + ".json","w")
json.dump(json_obj,a_file,indent=4)
a_file.close()
json.dump(a_file, f)
Error code: json_obj["properties"].append(entry)
AttributeError: 'dict' object has no attribute 'append'
you don't use append() to add to a dictionary. You can either assign to the key to add a single entry, or use .update() to merge dictionaries.
import json
entry= {"collection": {"name": "collection name"}}
for i in range((5)):
with open("./testjsons/" + str(i) + ".json","r") as a_file:
a_file = open("./testjsons/" + str(i) + ".json","r")
json_obj = json.load(a_file)
print(json_obj)
json_obj.update(entry)
with open(str(i) + ".json","w") as a_file:
json.dump(json_obj,a_file,indent=4)
JSON, like XML, is a specialized data format. You should always parse the data and work with it as JSON where possible. This is different from a plain text file where you would 'add to the end' or 'append' text.
There are a number of json parsing libraries in Python, but you'll probably want to use the json encoder that is built in to the standard Python library. For a file, myfile.json, you can:
import json
with open('myfile.json`, 'r') as f:
myfile = json.load(f) # read the file into a Python dict
myfile["collection"] = {"name": "collection name"} # here you're adding the "collection" field to the end of the Python dict
# If you want to add "collection" inside "properties", you'd do something like
#. myfile["properties"]["collection"] = {"name": "collection name"}
with open('myfile.json', 'w') as f:
json.dump(myfile, f) # save the modified dict into the json file
I write a script for web scraping and it is successfully scraping data. Only problem is with exporting data to JSON file
def scrape_post_info(url):
content = get_page_content(url)
title, description, post_url = get_post_details(content, url)
job_dict = {}
job_dict['title'] = title
job_dict['Description'] = description
job_dict['url'] = post_url
#here json machanism
json_job = json.dumps(job_dict)
with open('data.json', 'r+') as f:
f.write("[")
f.seek(0)
f.write(json_job)
txt = f.readline()
if txt.endswith("}"):
f.write(",")
def crawl_web(url):
while True:
post_url = get_post_url(url)
for urls in post_url:
urls = urls
scrape_post_info(urls)
# Execute the main fuction 'crawl_web'
if __name__ == '__main__':
crawl_web('www.examp....com')
The data is exported to JSON but it is not proper format of JSON. I am expecting the data should look like:
[
{
"title": "this is title",
"Description": " Fendi is an Italian luxury labelarin. ",
"url": "https:/~"
},
{
"title": " - Furrocious Elegant Style",
"Description": " the Italian luxare vast. ",
"url": "https://www.s"
},
{
"title": "Rome, Fountains and Fendi Sunglasses",
"Description": " Fendi started off as a store. ",
"url": "https://www.~"
},
{
"title": "Tipsnglasses",
"Description": "Whether irregular orn season.",
"url": "https://www.sooic"
},
]
How can I achieve this?
How about:
def scrape_post_info(url):
content = get_page_content(url)
title, description, post_url = get_post_details(content, url)
return {"title": title, "Description": description, "url": post_url}
def crawl_web(url):
while True:
jobs = []
post_urls = get_post_url(url)
for url in post_urls:
jobs.append(scrape_post_info(url))
with open("data.json", "w") as f:
json.dumps(jobs)
# Execute the main fuction 'crawl_web'
if __name__ == "__main__":
crawl_web("www.examp....com")
Note that this will rewrite your entire file on each iteration of "post_urls", so it might become quite slow with large files and slow I/O.
Depending on how long your job is running and how much memory you have, you might want to move the file writing out of the for loop, and only write it out once.
Note: if you really want to write JSON streaming, you might want to look at something like this package: https://pypi.org/project/jsonstreams/, however I'd suggest to choose another format such as CSV that is much more well suited to streaming writes.
Okay, so I've been banging my head on this for the last 2 days, with no real progress. I am a beginner with python and coding in general, but this is the first issue I haven't been able to solve myself.
So I have this long file with JSON formatting with about 7000 entries from the youtubeapi.
right now I want to have a short script to print certain info ('videoId') for a certain dictionary key (refered to as 'key'):
My script:
import json
f = open ('path file.txt', 'r')
s = f.read()
trailers = json.loads(s)
print(trailers['key']['Items']['id']['videoId'])
# print(trailers['key']['videoId'] gives same response
Error:
print(trailers['key']['Items']['id']['videoId'])
TypeError: string indices must be integers
It does work when I want to print all the information for the dictionary key:
This script works
import json
f = open ('path file.txt', 'r')
s = f.read()
trailers = json.loads(s)
print(trailers['key'])
Also print(type(trailers)) results in class 'dict', as it's supposed to.
My JSON File is formatted like this and is from the youtube API, youtube#searchListResponse.
{
"kind": "youtube#searchListResponse",
"etag": "",
"nextPageToken": "",
"regionCode": "",
"pageInfo": {
"totalResults": 1000000,
"resultsPerPage": 1
},
"items": [
{
"kind": "youtube#searchResult",
"etag": "",
"id": {
"kind": "youtube#video",
"videoId": ""
},
"snippet": {
"publishedAt": "",
"channelId": "",
"title": "",
"description": "",
"thumbnails": {
"default": {
"url": "",
"width": 120,
"height": 90
},
"medium": {
"url": "",
"width": 320,
"height": 180
},
"high": {
"url": "",
"width": 480,
"height": 360
}
},
"channelTitle": "",
"liveBroadcastContent": "none"
}
}
]
}
What other information is needed to be given for you to understand the problem?
The following code gives me all the videoId's from the provided sample data (which is no id's at all in fact):
import json
with open('sampledata', 'r') as datafile:
data = json.loads(datafile.read())
print([item['id']['videoId'] for item in data['items']])
Perhaps you can try this with more data.
Hope this helps.
I didn't really look into the youtube api but looking at the code and the sample you gave it seems you missed out a [0]. Looking at the structure of json there's a list in key items.
import json
f = open ('json1.json', 'r')
s = f.read()
trailers = json.loads(s)
print(trailers['items'][0]['id']['videoId'])
I've not used json before at all. But it's basically imported in the form of dicts with more dicts, lists etc. Where applicable. At least from my understanding.
So when you do type(trailers) you get type dict. Then you do dict with trailers['key']. If you do type of that, it should also be a dict, if things work correctly. Working through the items in each dict should in the end find your error.
Pythons error says you are trying find the index/indices of a string, which only accepts integers, while you are trying to use a dict. So you need to find out why you are getting a string and not dict when using each argument.
Edit to add an example. If your dict contains a string on key 'item', then you get a string in return, not a new dict which you further can get a dict from. item in the json for example, seem to be a list, with dicts in it. Not a dict itself.
I'm in over my head, trying to parse JSON for my first time and dealing with a multi dimensional array.
{
"secret": "[Hidden]",
"minutes": 20,
"link": "http:\/\/www.1.com",
"bookmark_collection": {
"free_link": {
"name": "#free_link#",
"bookmarks": [
{
"name": "1",
"link": "http:\/\/www.1.com"
},
{
"name": "2",
"link": "http:\/\/2.dk"
},
{
"name": "3",
"link": "http:\/\/www.3.in"
}
]
},
"boarding_pass": {
"name": "Boarding Pass",
"bookmarks": [
{
"name": "1",
"link": "http:\/\/www.1.com\/"
},
{
"name": "2",
"link": "http:\/\/www.2.com\/"
},
{
"name": "3",
"link": "http:\/\/www.3.hk"
}
]
},
"sublinks": {
"name": "sublinks",
"link": [
"http:\/\/www.1.com",
"http:\/\/www.2.com",
"http:\/\/www.3.com"
]
}
}
}
This is divided into 3 parts, the static data on my first dimension (secret, minutes, link) Which i need to get as seperate strings.
Then I need a dictionary per "bookmark collection" which does not have fixed names, so I need the name of them and the links/names of each bookmark.
Then there is the seperate sublinks which is always the same, where I need all the links in a seperate dictionary.
I'm reading about parsing JSON but most of the stuff I find is a simple array put into 1 dictionary.
Does anyone have any good techniques to do this ?
After you parse the JSON, you will end up with a Python dict. So, suppose the above JSON is in a string named input_data:
import json
# This converts from JSON to a python dict
parsed_input = json.loads(input_data)
# Now, all of your static variables are referenceable as keys:
secret = parsed_input['secret']
minutes = parsed_input['minutes']
link = parsed_input['link']
# Plus, you can get your bookmark collection as:
bookmark_collection = parsed_input['bookmark_collection']
# Print a list of names of the bookmark collections...
print bookmark_collection.keys() # Note this contains sublinks, so remove it if needed
# Get the name of the Boarding Pass bookmark:
print bookmark_collection['boarding_pass']['name']
# Print out a list of all bookmark links as:
# Boarding Pass
# * 1: http://www.1.com/
# * 2: http://www.2.com/
# ...
for bookmark_definition in bookmark_collection.values():
# Skip sublinks...
if bookmark_definition['name'] == 'sublinks':
continue
print bookmark_definition['name']
for bookmark in bookmark_definition['bookmarks']:
print " * %(name)s: %(link)s" % bookmark
# Get the sublink definition:
sublinks = parsed_input['bookmark_collection']['sublinks']
# .. and print them
print sublinks['name']
for link in sublinks['link']:
print ' *', link
Hmm, doesn't json.loads do the trick?
For example, if your data is in a file,
import json
text = open('/tmp/mydata.json').read()
d = json.loads(text)
# first level fields
print d['minutes'] # or 'secret' or 'link'
# the names of each of bookmark_collections's items
print d['bookmark_collection'].keys()
# the sublinks section, as a dict
print d['bookmark_collection']['sublinks']
The output of this code (given your sample input above) is:
20
[u'sublinks', u'free_link', u'boarding_pass']
{u'link': [u'http://www.1.com', u'http://www.2.com', u'http://www.3.com'], u'name': u'sublinks'}
Which, I think, gets you what you need?