I wrote a web scraping script and it works well. I am trying to write the scraped data to a JSON file, but I have not been able to get it right.
This is my snippet:
def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url
    json_job = json.dumps(job_dict)
    with open('data.json', 'a') as f:
        json.dump(json_job, f)

if __name__ == '__main__':
    urls = ['url1', 'url2', 'url3', 'url4']
    for url in urls:
        scrape_post_info(url)
Ignore the two functions called inside scrape_post_info; the problem is not with them.
My only problem is writing to JSON.
Currently the scraped data is written as shown below, which is the wrong format.
data.json looks like this:
{
"title": "this is title",
"Description": " Fendi is an Italian luxury labelarin. ",
"url": "https:/~"
}
{
"title": " - Furrocious Elegant Style",
"Description": " the Italian luxare vast. ",
"url": "https://www.s"
}
{
"title": "Rome, Fountains and Fendi Sunglasses",
"Description": " Fendi started off as a store. ",
"url": "https://www.~"
}
{
"title": "Tipsnglasses",
"Description": "Whether irregular orn season.",
"url": "https://www.sooic"
}
But it should look like this:
[
{
"title": "this is title",
"Description": " Fendi is an Italian luxury labelarin. ",
"url": "https:/~"
},
{
"title": " - Furrocious Elegant Style",
"Description": " the Italian luxare vast. ",
"url": "https://www.s"
},
{
"title": "Rome, Fountains and Fendi Sunglasses",
"Description": " Fendi started off as a store. ",
"url": "https://www.~"
},
{
"title": "Tipsnglasses",
"Description": "Whether irregular orn season.",
"url": "https://www.sooic"
}
]
I do not understand why the data in the JSON file is not in the proper format.
Can anyone help me with this?
You can try the code below to solve your problem.
It writes all the objects into a single JSON array, as you expect:
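import json

def scrape_post_info(url, f):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url
    json_job = json.dumps(job_dict)
    # Re-read what has been written so far. The file is a single line,
    # so readline() returns everything and leaves the position at the end.
    f.seek(0)
    txt = f.readline()
    # Separate this object from the previous one with a comma.
    if txt.endswith("}"):
        f.write(",")
    f.write(json_job)

if __name__ == '__main__':
    urls = ['url1', 'url2', 'url3', 'url4']
    # 'w+' creates or truncates the file and still allows reading it back
    # ('r+' fails if data.json does not exist and keeps stale content).
    with open('data.json', 'w+') as f:
        f.write("[")
        for url in urls:
            scrape_post_info(url, f)
        f.write("]")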
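A simpler variant, sketched here on the assumption that all results fit in memory, is to collect the dictionaries in a list and serialise the list once. The two helpers below are hypothetical stand-ins for the asker's get_page_content and get_post_details, and scrape_all is a name introduced only for this illustration.

```python
import json

def get_page_content(url):
    # Hypothetical stand-in for the asker's helper: pretend to download the page.
    return "<html>content of %s</html>" % url

def get_post_details(content, url):
    # Hypothetical stand-in: return (title, description, post_url).
    return "title for " + url, "description for " + url, url

def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    return {"title": title, "Description": description, "url": post_url}

def scrape_all(urls, path):
    # Build the whole list first, then serialise it in one call so the
    # file always holds a single valid JSON array.
    jobs = [scrape_post_info(url) for url in urls]
    with open(path, "w") as f:
        json.dump(jobs, f, indent=4)
    return jobs
```

Calling scrape_all(['url1', 'url2', 'url3', 'url4'], 'data.json') would then write the array in one go, with no manual handling of brackets or commas.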
I have a script that takes an ID from a JSON file and adds it to a URL for an API request. The aim is to loop through the 1000-ish IDs and produce one JSON file containing all the information.
The current code makes the first request and creates and populates the JSON file, but when run in a loop it throws a KeyError.
import json
import requests

fname = "NewdataTest.json"

def write_json(data, fname):
    fname = "NewdataTest.json"
    with open(fname, "w") as f:
        json.dump(data, f, indent = 4)

    with open (fname) as json_file:
        data = json.load(json_file)
        temp = data[0]
        #print(newData)
        y = newData
        data.append(y)

# Read test.json to get tmdb IDs
tmdb_ids = []
with open('test.json', 'r') as json_fp:
    imdb_info = json.load(json_fp)
tmdb_ids = [movie_info['tmdb_id'] for movies_chunk in imdb_info for movie_index, movie_info in movies_chunk.items()]

# Add IDs to API call URL
for tmdb_id in tmdb_ids:
    print("https://api.themoviedb.org/3/movie/" + str(tmdb_id) + "?api_key=****")
    # Send API Call
    response_API = requests.get("https://api.themoviedb.org/3/movie/" + str(tmdb_id) + "?api_key=****")
    # Check API Call Status
    print(response_API.status_code)
    write_json((response_API.json()), "NewdataTest.json")
The error is on the line "temp = data[0]". I have tried printing the keys of data, but nothing is printed. At this point I have hacked the code about so much that it barely resembles a cohesive piece of code. My aim was to write one simple function to get the data from the JSON, one to produce the API call URLs, and one to write the results to the new JSON.
Example of an API response JSON:
{
"adult": false,
"backdrop_path": "/e1cC9muSRtAHVtF5GJtKAfATYIT.jpg",
"belongs_to_collection": null,
"budget": 0,
"genres": [
{
"id": 10749,
"name": "Romance"
},
{
"id": 35,
"name": "Comedy"
}
],
"homepage": "",
"id": 1063242,
"imdb_id": "tt24640474",
"original_language": "fr",
"original_title": "Disconnect: The Wedding Planner",
"overview": "After falling victim to a scam, a desperate man races the clock as he attempts to plan a luxurious destination wedding for an important investor.",
"popularity": 34.201,
"poster_path": "/tGmCxGkVMOqig2TrbXAsE9dOVvX.jpg",
"production_companies": [],
"production_countries": [
{
"iso_3166_1": "KE",
"name": "Kenya"
},
{
"iso_3166_1": "NG",
"name": "Nigeria"
}
],
"release_date": "2023-01-13",
"revenue": 0,
"runtime": 107,
"spoken_languages": [
{
"english_name": "English",
"iso_639_1": "en",
"name": "English"
},
{
"english_name": "Afrikaans",
"iso_639_1": "af",
"name": "Afrikaans"
}
],
"status": "Released",
"tagline": "",
"title": "Disconnect: The Wedding Planner",
"video": false,
"vote_average": 5.8,
"vote_count": 3
}
You can store the results of all the API calls in a list and then save that list to a file in JSON format. For example:
#...
all_data = []
for tmdb_id in tmdb_ids:
    print("https://api.themoviedb.org/3/movie/" + str(tmdb_id) + "?api_key=****")
    # Send API Call
    response_API = requests.get("https://api.themoviedb.org/3/movie/" + str(tmdb_id) + "?api_key=****")
    # Check API Call Status
    print(response_API.status_code)
    if response_API.status_code == 200:
        # store the JSON data in a list:
        all_data.append(response_API.json())

# write the list to file
with open('output.json', 'w') as f_out:
    json.dump(all_data, f_out, indent=4)
This will produce output.json containing all the responses as a single JSON array.
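On the KeyError itself: a single API response like the example in the question is a JSON object, which the json module returns as a Python dict. A dict is looked up by key, and there is no key 0, so data[0] raises KeyError. A small illustration, using a shortened copy of the example response:

```python
import json

# Shortened copy of the example response from the question.
single_response = json.loads('{"id": 1063242, "title": "Disconnect: The Wedding Planner"}')

error = None
try:
    single_response[0]  # a dict is indexed by key, not by position
except KeyError as exc:
    error = exc

# Wrapping responses in a list is what makes positional indexing work.
all_data = [single_response]
first = all_data[0]
```

This is why collecting the responses in a list before dumping avoids the problem: the top-level value in the file is then an array rather than an object.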
I extracted the following script from HTML using Beautiful Soup:
<script>
dataLayer =[{
"pageTitle": "PRODUCT: Macculloch Parka Print( 9512MP )",
"pageCategory": "shop-mens-parkas",
"visitorLoginState": "Guest",
"EmployeeLoginState": false,
"customerEmail": "null",
"customerOrders": "null",
"customerValue": "0",
"Country": "CA",
"State": "ON",
"ecommerce": {
"currencyCode": "CAD",
"detail": {
"actionField": {
"list": "Product Category / Search Results"
},
"products": [
{
"name": "Macculloch Parka Print",
"id": "9512MP",
"price": 1295,
"brand": "Canada Goose",
"category": "shop-mens-parkas"}]}}}];</script>
I want to extract the information related to the product (name, id, price and brand) as a dataframe. Is there a way to do it without using regex?
You can use a regex to extract the JSON and then parse it:
import json
import re
# d is assumed to be the text of the <script> tag shown in the question
data = json.loads(re.search(r"dataLayer =(.*);", d, re.DOTALL).group(1))
products = data[0]["ecommerce"]["detail"]["products"]
product_name = products[0]["name"]
product_id = products[0]["id"]
product_price = products[0]["price"]
product_brand = products[0]["brand"]
product_category = products[0]["category"]
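Since the question asks for a dataframe, one way to get there is to build one row per product and hand the rows to pandas. The sketch below uses only the standard library and assumes the script text is available as a string; script_text is a shortened copy of the tag from the question, and wanted and rows are names introduced for the illustration.

```python
import json
import re

# Shortened copy of the script text from the question (assumed to be a str).
script_text = '''<script>
dataLayer =[{
"pageTitle": "PRODUCT: Macculloch Parka Print( 9512MP )",
"ecommerce": {
"currencyCode": "CAD",
"detail": {
"products": [
{
"name": "Macculloch Parka Print",
"id": "9512MP",
"price": 1295,
"brand": "Canada Goose",
"category": "shop-mens-parkas"}]}}}];</script>'''

match = re.search(r"dataLayer =(.*);", script_text, re.DOTALL)
data = json.loads(match.group(1))

# Keep only the requested columns, one dict per product.
wanted = ("name", "id", "price", "brand")
rows = [{key: product[key] for key in wanted}
        for product in data[0]["ecommerce"]["detail"]["products"]]

# With pandas installed, pandas.DataFrame(rows) would turn these rows
# into a dataframe with one column per key.
print(rows)
```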
Here is a temporary solution, contingent on receiving more information on the format of the data.
import re
import json

def get_datalayer_json(raw_script_tag: str):
    parser_re = r"<script>\s*dataLayer =(.*);\s*</script>"
    parser_result = re.match(parser_re, raw_script_tag.strip(), re.DOTALL)
    if parser_result is None:
        return None
    else:
        return json.loads(parser_result.group(1))
I wrote a script for web scraping and it scrapes the data successfully. The only problem is exporting the data to a JSON file.
def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url
    # JSON mechanism goes here
    json_job = json.dumps(job_dict)
    with open('data.json', 'r+') as f:
        f.write("[")
        f.seek(0)
        f.write(json_job)
        txt = f.readline()
        if txt.endswith("}"):
            f.write(",")

def crawl_web(url):
    while True:
        post_url = get_post_url(url)
        for urls in post_url:
            urls = urls
            scrape_post_info(urls)

# Execute the main function 'crawl_web'
if __name__ == '__main__':
    crawl_web('www.examp....com')
The data is written to the file, but it is not valid JSON. I expect the data to look like this:
[
{
"title": "this is title",
"Description": " Fendi is an Italian luxury labelarin. ",
"url": "https:/~"
},
{
"title": " - Furrocious Elegant Style",
"Description": " the Italian luxare vast. ",
"url": "https://www.s"
},
{
"title": "Rome, Fountains and Fendi Sunglasses",
"Description": " Fendi started off as a store. ",
"url": "https://www.~"
},
{
"title": "Tipsnglasses",
"Description": "Whether irregular orn season.",
"url": "https://www.sooic"
}
]
How can I achieve this?
How about:
import json

def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)
    return {"title": title, "Description": description, "url": post_url}

def crawl_web(url):
    while True:
        jobs = []
        post_urls = get_post_url(url)
        for post_url in post_urls:  # renamed so the loop does not shadow 'url'
            jobs.append(scrape_post_info(post_url))
            with open("data.json", "w") as f:
                # json.dump writes to the file; json.dumps would only build a string
                json.dump(jobs, f)

# Execute the main function 'crawl_web'
if __name__ == "__main__":
    crawl_web("www.examp....com")
Note that this rewrites the entire file on each iteration over "post_urls", so it may become quite slow with large files and slow I/O.
Depending on how long your job runs and how much memory you have, you may want to move the file writing out of the for loop and write the file only once.
Note: if you really want to stream JSON as you write it, you could look at a package such as https://pypi.org/project/jsonstreams/; however, I'd suggest choosing another format, such as CSV, that is much better suited to streaming writes.
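As a rough sketch of the CSV suggestion, using only the standard library: each scraped job is appended as one row, so nothing already written has to be rewritten. The function name append_job and the FIELDNAMES list are assumptions made for this illustration; the field names mirror the keys used in the question.

```python
import csv
import os

FIELDNAMES = ["title", "Description", "url"]

def append_job(path, job):
    # Write the header only when the file is new or empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" is what the csv module expects for files it writes to.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow(job)
```

Inside the crawl loop this would be called as append_job("data.csv", scrape_post_info(post_url)), and the file stays readable even if the job is interrupted part-way.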
I am trying to extract the "ratingValue" and "alternateName" fields of the rating from the HTML code of https://www.truthorfiction.com/are-americans-annually-healthcare-undocumented/:
<script type=application/ld+json>{
"#context": "http://schema.org",
"#type": "ClaimReview",
"datePublished": "2019-01-03 ",
"url": "https://www.truthorfiction.com/are-americans-annually-healthcare-undocumented/",
"author": {
"#type": "Organization",
"url": "https://www.truthorfiction.com/",
"image": "https://dn.truthorfiction.com/wp-content/uploads/2018/10/25032229/truth-or-fiction-logo-tagline.png",
"sameAs": "https://twitter.com/whatstruecom"
},
"claimReviewed": "More Americans die every year from a lack of affordable healthcare than by terrorism or at the hands of undocumented immigrants.",
"reviewRating": {
"#type": "Rating",
"ratingValue": -1,
"worstRating":-1,
"bestRating": -1,
"alternateName": "True"
},
"itemReviewed": {
"#type": "CreativeWork",
"author": {
"#type": "Person",
"name": "Person",
"jobTitle": "",
"image": "",
"sameAs": [
""
]
},
"datePublished": "",
"name": ""
}
}</script>
I have tried to do that using the following code:
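import json
import urllib3
from bs4 import BeautifulSoup
# Assumption: 'http' was not defined in the snippet; a urllib3 PoolManager fits the call below.
http = urllib3.PoolManager()
slink = 'https://www.truthorfiction.com/are-americans-annually-healthcare-undocumented/'
response = http.request('GET', slink)
soup = BeautifulSoup(response.data)
tmp = json.loads(soup.find('script', type='application/ld+json').text)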
However, tmp contains the dictionary for an 'application/ld+json' block that precedes the one with the ratings I would like to extract. How can I loop through to the relevant part of the page, where the ratings are stored?
The page has two <script type=application/ld+json> tags. You can select the second one from the result of find_all():
tmp = json.loads(soup.find_all('script', type='application/ld+json')[1].text)
Or loop over them and check whether the text contains the string:
tmp = None
for ldjson in soup.find_all('script', type='application/ld+json'):
    if 'ratingValue' in ldjson.text:
        tmp = json.loads(ldjson.text)
You need to access the element using the keys.
rating_value = tmp['reviewRating']['ratingValue'] # -1
alternate_name = tmp['reviewRating']['alternateName'] # 'True'
or
review_rating = tmp['reviewRating']
rating_value = review_rating['ratingValue'] # -1
alternate_name = review_rating['alternateName'] # 'True'
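Checking for the parsed key rather than searching the raw text avoids a false hit when "ratingValue" merely appears inside some other string. A sketch, assuming the text of each ld+json script has already been collected into a list; the two strings below are made-up stand-ins for the page's blocks, and find_review_rating is a hypothetical helper name.

```python
import json

# Made-up stand-ins for the text of the page's two ld+json <script> tags.
script_texts = [
    '{"#type": "WebSite", "url": "https://www.truthorfiction.com/"}',
    '{"#type": "ClaimReview", "reviewRating": {"ratingValue": -1, "alternateName": "True"}}',
]

def find_review_rating(texts):
    # Return the first "reviewRating" object found, or None.
    for text in texts:
        try:
            block = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(block, dict) and "reviewRating" in block:
            return block["reviewRating"]
    return None

review_rating = find_review_rating(script_texts)
rating_value = review_rating["ratingValue"]
alternate_name = review_rating["alternateName"]
```

With Beautiful Soup, the list could be built from the .text of each tag returned by find_all('script', type='application/ld+json').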
I have the following Python code:
import requests
import json
from bs4 import BeautifulSoup
url = requests.get('https://www.perfectimprints.com/custom-promos/20492/Beach-Balls.html')
source = BeautifulSoup(url.text, 'html.parser')
products = source.find_all('div', class_="product_wrapper")
def get_product_details(product):
    product_name = product.find('div', class_="product_name").a.text
    sku = product.find('div', class_="product_sku").text
    product_link = product.find('div', class_="product_image_wrapper").find("a")["href"]
    src = product.find('div', class_="product_image_wrapper").find('a').find("img")["src"]
    return {
        "title": product_name,
        "link": product_link,
        "sku": sku,
        "src": src
    }

all_products = [get_product_details(product) for product in products]
with open("products.json", "w") as write_file:
    json.dump(all_products, write_file)
print("Success")
This code works perfectly as written. The problem is that instead of this structure:
[
{
"title": "12\" Beach Ball",
"link": "/promos/PI-255-751/12-Beach-Ball.html?cid=20492",
"sku": " \n\t\t\t\t#PI-255-751\n\t\t\t",
"src": "https://12f598f3b6e7e912e4cd-a182d9508ed57781ad8837d0e4f7a945.ssl.cf5.rackcdn.com/thumb/751_group.jpg"
},
]
I want it to be:
{
"items": [
{
"title": "12\" Beach Ball",
"link": "/promos/PI-255-751/12-Beach-Ball.html?cid=20492",
"sku": " \n\t\t\t\t#PI-255-751\n\t\t\t",
"src": "https://12f598f3b6e7e912e4cd-a182d9508ed57781ad8837d0e4f7a945.ssl.cf5.rackcdn.com/thumb/751_group.jpg"
},
]
}
Here's a link to what I have working in Repl.it, just so you don't have to set up your own: https://repl.it/repls/AttractiveDimpledTheory
Side note: Would love to also be able to remove all of the \n and \t in the skus if possible.
Here you're dumping your all_products list directly to JSON:
with open("products.json", "w") as write_file:
    json.dump(all_products, write_file)
The JSON you want just has that list in an object. Something like
with open("products.json", "w") as write_file:
    json.dump({'items': all_products}, write_file)
should do what you want.
Generally speaking there's a 1:1 relationship between your Python data structure and the JSON it generates. If you build the right Python data structure you'll get the right JSON. Here we're using a dict (which maps to a JSON object) to wrap your existing list (which maps to a JSON array).
Side note: Would love to also be able to remove all of the \n and \t in the skus if possible.
Assuming you also want to remove spaces, you can just use str.strip(), which strips whitespace by default:
return {
    "title": product_name,
    "link": product_link,
    "sku": sku.strip(),  # <-- here
    "src": src
}
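Putting both points together, here is a small sketch of the dict-wraps-list idea with the SKU cleaned up. The sample entry is made up in the shape of the question's output, and the image URL is a placeholder rather than the real one.

```python
import json

# Made-up sample in the shape produced by get_product_details().
all_products = [
    {
        "title": '12" Beach Ball',
        "link": "/promos/PI-255-751/12-Beach-Ball.html?cid=20492",
        "sku": " \n\t\t\t\t#PI-255-751\n\t\t\t",
        "src": "https://example.com/thumb/751_group.jpg",  # placeholder URL
    },
]

# Clean each SKU: strip() removes leading and trailing whitespace,
# including the newlines and tabs.
for product in all_products:
    product["sku"] = product["sku"].strip()

# Wrap the list in a dict so the top-level JSON value is an object
# with an "items" array.
payload = {"items": all_products}
text = json.dumps(payload, indent=4)
print(text)
```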