I have two different txt files:
1) one with just the jpg file names
2) another with five captions for each image, in the following format:
1000268201_693b08cb0e.jpg#0 A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1 A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2 A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg#3 A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg#4 A little girl in a pink dress going into a wooden cabin .
where 0, 1, 2, 3, 4 are the caption IDs for the same image.
I want to create something like this for the whole dataset:
{"images":[
{"imgurl": "static/img/667626_18933d713e.jpg",
"description": "A girl is stretched out in shallow water",
},
{"imgurl": "static/img/667626_18933d713e.jpg",
"description": "A girl is stretched out in water",
},
{"imgurl": "static/img/667626_18933d713e.jpg",
"description": "description 3",
},
{"imgurl": "static/img/667626_18933d713e.jpg",
"description": "description 4",
},
{"imgurl": "static/img/667626_18933d713e.jpg",
"description": "description 5",
}
]
}
I am looking for Python code that does this in one pass for the whole dataset; it is very tedious to do manually. This is my current attempt:
import glob
import json

from PIL import Image  # assuming Pillow is what displays each image

# match every jpg in the folder, not a single literal filename
image_database = glob.glob('static/img/*.jpg')

dataset_list = []
for image_path in image_database:
    Image.open(image_path).show()
    description = input('Enter the description: ')
    img_data = {}
    img_data['imgurl'] = image_path
    img_data['description'] = description
    dataset_list.append(img_data)

dataset_json = {}
dataset_json['images'] = dataset_list

with open('custom_dataset.json', 'w') as f:
    json.dump(dataset_json, f)
I have placed the image.txt under the static folder; what I am currently doing is entering each description by hand.
Parse the captions file and generate your JSON from it instead:
import json

filepath = 'img.txt'
asset_path = 'static/img/'

images = []
with open(filepath) as file:
    for line in file:
        # "1000268201_693b08cb0e.jpg#0 A child ..." splits into the
        # filename and the caption (still prefixed with its caption ID)
        name, caption = line.split('#', 1)
        image = {
            "imgurl": asset_path + name,
            # drop the leading caption ID ("0 ", "1 ", ...) and the newline
            "description": caption.split(' ', 1)[1].strip()
        }
        images.append(image)

with open('data.json', 'w') as outfile:
    json.dump({"images": images}, outfile)
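With the sample lines above, the first entry written to data.json would be:

{"images": [{"imgurl": "static/img/1000268201_693b08cb0e.jpg", "description": "A child in a pink dress is climbing up a set of stairs in an entry way ."}, ...]}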
I am trying to write a .jsonl file that needs to look like this:
{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}
This is my attempt:
import json
import pandas as pd

dt = pd.read_csv('data.csv')
df = pd.DataFrame(dt)

file_name = df['image']
file_caption = df['text']

data = []
for i in range(len(file_name)):
    entry = {"file_name": file_name[i], "text": file_caption[i]}
    data.append(entry)

json_object = json.dumps(data, indent=4)

# Writing to sample.json
with open("metadata.jsonl", "w") as outfile:
    outfile.write(json_object)
But this is the output I get:
[
    {
        "file_name": "images/image_0.jpg",
        "text": "Fattoush Salad with Roasted Potatoes"
    },
    {
        "file_name": "images/image_1.jpg",
        "text": "an analysis of self portrayal in novels by virginia woolf A room of one's own study guide contains a biography of virginia woolf, literature essays, quiz questions, major themes, characters, and a full summary and analysis about a room of one's own a room of one's own summary."
    },
    {
        "file_name": "images/image_2.jpg",
        "text": "Christmas Comes Early to U.K. Weekly Home Entertainment Chart"
    },
    {
        "file_name": "images/image_3.jpg",
        "text": "Amy Garcia Wikipedia a legacy of reform: dorothea dix (1802\u20131887) | states of"
    },
    {
        "file_name": "images/image_4.jpg",
        "text": "3D Metal Cornish Harbour Painting"
    },
    {
        "file_name": "images/image_5.jpg",
        "text": "\"In this undated photo provided by the New York City Ballet, Robert Fairchild performs in \"\"In Creases\"\" by choreographer Justin Peck which is being performed by the New York City Ballet in New York. (AP Photo/New York City Ballet, Paul Kolnik)\""
    },
    ...
]
I know it's because I am dumping a list, so I can see where I'm going wrong, but how do I create a .jsonl file in the format above?
Don't indent the generated JSON and don't append it to a list. Just write out each line to the file:
import json
import pandas as pd

df = pd.DataFrame([['0001.png', "This is a golden retriever playing with a ball"],
                   ['0002.png', "A german shepherd"],
                   ['0003.png', "One chihuahua"]],
                  columns=['filename', 'text'])

with open("metadata.jsonl", "w") as outfile:
    for file, caption in zip(df['filename'], df['text']):
        entry = {"file_name": file, "text": caption}
        # one compact JSON object per line is exactly the JSON Lines format
        print(json.dumps(entry), file=outfile)
Output:
{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}
I want to use Scrapy to extract the titles of the books at a URL and store them as an array of dictionaries in a JSON file.
Here is my code:
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    star_urls = [
        "http://books.toscrape.com"
    ]

    def parse(self, response):
        titles = response.css("article.product_pod h3 a::attr(title)").getall()
        for title in titles:
            yield {"title": title}
Here is what I put in the terminal:
scrapy crawl books -o books.json
The books.json file is created but is empty.
I checked that I was in the right directory and venv but it still doesn't work.
However:
Earlier, I deployed this spider to scrape the whole HTML document and write it to a books.html file, and everything worked.
Here is my code for this:
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    star_urls = [
        "http://books.toscrape.com"
    ]

    def parse(self, response):
        with open("books.html", "wb") as file:
            file.write(response.body)
and here is what I put in my terminal:
scrapy crawl books
Any ideas on what I'm doing wrong? Thanks
Edit:
Entering response.css('article.product_pod h3 a::attr(title)').getall()
into the Scrapy shell outputs:
['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]
The problem is a typo: the attribute has to be called start_urls, not star_urls. With the misspelled name the spider has no start URLs, nothing is crawled, parse() is never called, and the output file stays empty. Now run the code; it should work:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        titles = response.css('.product_pod')
        for title in titles:
            yield {
                "title": title.css('h3 a::attr(title)').get()
                # "title": title.css('h3 a::text').get()
            }
Output:
[
    {
        "title": "A Light in the Attic"
    },
    {
        "title": "Tipping the Velvet"
    },
    {
        "title": "Soumission"
    },
    {
        "title": "Sharp Objects"
    },
    {
        "title": "Sapiens: A Brief History of Humankind"
    },
    {
        "title": "The Requiem Red"
    },
    {
        "title": "The Dirty Little Secrets of Getting Your Dream Job"
    },
    {
        "title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
    },
    {
        "title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
    },
    {
        "title": "The Black Maria"
    },
    {
        "title": "Starving Hearts (Triangular Trade Trilogy, #1)"
    },
    {
        "title": "Shakespeare's Sonnets"
    },
    {
        "title": "Set Me Free"
    },
    {
        "title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
    },
    {
        "title": "Rip it Up and Start Again"
    },
    {
        "title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
    },
    {
        "title": "Olio"
    },
    {
        "title": "Mesaerion: The Best Science Fiction Stories 1800-1849"
    },
    {
        "title": "Libertarianism for Beginners"
    },
    {
        "title": "It's Only the Himalayas"
    }
]
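One caveat when re-running the crawl: in Scrapy 2.0 and later, -o appends to an existing output file, which can leave books.json malformed after several runs. Either delete the empty file from the failed run first, or use the uppercase -O flag, which overwrites:

scrapy crawl books -O books.json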
I have weather alert data like this:
"alerts": [
{
"description": "There is a risk of frost (Level 1 of 2).\nMinimum temperature: ~ -2 \u00b0C",
"end": 1612522800,
"event": "frost",
"sender_name": "DWD / Nationales Warnzentrum Offenbach",
"start": 1612450800
},
{
"description": "There is a risk of widespread icy surfaces (Level 1 of 3).\ncause: widespread ice formation or traces of snow",
"end": 1612515600,
"event": "widespread icy surfaces",
"sender_name": "DWD / Nationales Warnzentrum Offenbach",
"start": 1612450800
},
{
"description": "Es treten Windb\u00f6en mit Geschwindigkeiten um 55 km/h (15m/s, 30kn, Bft 7) aus \u00f6stlicher Richtung auf. In exponierten Lagen muss mit Sturmb\u00f6en bis 65 km/h (18m/s, 35kn, Bft 8) gerechnet werden.",
"end": 1612587600,
"event": "WINDB\u00d6EN",
"sender_name": "DWD / Nationales Warnzentrum Offenbach",
"start": 1612522800
},
Now I want to add a key-value pair to every single alert dict containing the language detected from its 'description' field. I tried this, but can't get the syntax right:
import json
from langdetect import detect

with open("kiel.json", 'r') as f:
    data = json.loads(f.read())

data['ADDED_KEY'] = 'ADDED_VALUE'
# 'ADDED_KEY' = 'lang' - should be added as a data field to EVERY alert
# 'ADDED_VALUE' = 'en' or 'ger' - should be the detected language [via detect()] from the 'description' field of every alert

with open("kiel.json", 'w') as f:
    f.write(json.dumps(data, sort_keys=True, indent=4, separators=(',', ': ')))
At the moment the key is only added at the top level of the file, like this:
{
    "ADDED_KEY": "ADDED_VALUE",
    "alerts": [
        {
            "description": "There is a risk of frost (Level 1 of 2).\nMinimum temperature: ~ -2 \u00b0C",
            "end": 1612522800,
            "event": "frost",
            "sender_name": "DWD / Nationales Warnzentrum Offenbach",
            "start": 1612450800
        },
Can you help me complete the code so that it accesses the right data fields?
Further:
It can happen that 'alerts' is not included as a data field at all (for example when no alert data is transmitted because the weather is fine); I still want to generate the JSON in that case. I tried:

for item in data['alerts']:
    if 'alerts' not in data:
        continue
    else:
        item['lang'] = detect(item['description'])

But if there is no 'alerts' data field I get:

for item in data['alerts']:
KeyError: 'alerts'

How can I solve this? Is continue not the right tool here, or do I have to swap the if statement and the for loop?
Thanks again!
The following works: iterate over the alerts and add the key/value pair to each one, as you described.
import json
from langdetect import detect

with open("kiel.json", 'r') as f:
    data = json.loads(f.read())

# add 'lang' to every alert: the language detected from its 'description'
for item in data['alerts']:
    item['lang'] = detect(item['description'])

with open("kiel.json", 'w') as f:
    f.write(json.dumps(data, sort_keys=True, indent=4, separators=(',', ': ')))
You just need to iterate over the list under the dictionary key alerts and add the key and value to every item (each of which is a dictionary):

for item in data["alerts"]:
    item["ADDED_KEY"] = "ADDED_VALUE"
I wrote a web scraping script and it is working great. I am trying to write the scraped data to a JSON file, but it comes out wrong.
This is my snippet:
import json

def scrape_post_info(url):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)

    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url

    json_job = json.dumps(job_dict)
    with open('data.json', 'a') as f:
        json.dump(json_job, f)

if __name__ == '__main__':
    urls = ['url1', 'url2', 'url3', 'url4']
    for url in urls:
        scrape_post_info(url)
Ignore the two functions called inside scrape_post_info; the problem is not with them.
My problem is only with writing the JSON.
Currently I am getting the scraped data in the wrong format, as shown below.
data.json currently contains:
{
    "title": "this is title",
    "Description": " Fendi is an Italian luxury labelarin. ",
    "url": "https:/~"
}
{
    "title": " - Furrocious Elegant Style",
    "Description": " the Italian luxare vast. ",
    "url": "https://www.s"
}
{
    "title": "Rome, Fountains and Fendi Sunglasses",
    "Description": " Fendi started off as a store. ",
    "url": "https://www.~"
}
{
    "title": "Tipsnglasses",
    "Description": "Whether irregular orn season.",
    "url": "https://www.sooic"
}
but it should look like this:
[
    {
        "title": "this is title",
        "Description": " Fendi is an Italian luxury labelarin. ",
        "url": "https:/~"
    },
    {
        "title": " - Furrocious Elegant Style",
        "Description": " the Italian luxare vast. ",
        "url": "https://www.s"
    },
    {
        "title": "Rome, Fountains and Fendi Sunglasses",
        "Description": " Fendi started off as a store. ",
        "url": "https://www.~"
    },
    {
        "title": "Tipsnglasses",
        "Description": "Whether irregular orn season.",
        "url": "https://www.sooic"
    }
]
I don't understand why the data isn't written to the JSON file in the proper format.
Can anyone help me with this?
You can try the following code; it produces exactly the file you described above:
import json

def scrape_post_info(url, f):
    content = get_page_content(url)
    title, description, post_url = get_post_details(content, url)

    job_dict = {}
    job_dict['title'] = title
    job_dict['Description'] = description
    job_dict['url'] = post_url

    json_job = json.dumps(job_dict)

    # peek at what has been written so far: if the last character is '}',
    # an object is already there, so separate the next one with a comma
    f.seek(0)
    txt = f.readline()
    if txt.endswith("}"):
        f.write(",")
    f.write(json_job)

if __name__ == '__main__':
    urls = ['url1', 'url2', 'url3', 'url4']
    # 'w+' (rather than 'r+') creates or truncates the file and still allows reading
    with open('data.json', 'w+') as f:
        f.write("[")
        for url in urls:
            scrape_post_info(url, f)
        f.write("]")
I have the following Python code:
import requests
import json
from bs4 import BeautifulSoup

url = requests.get('https://www.perfectimprints.com/custom-promos/20492/Beach-Balls.html')
source = BeautifulSoup(url.text, 'html.parser')

products = source.find_all('div', class_="product_wrapper")

def get_product_details(product):
    product_name = product.find('div', class_="product_name").a.text
    sku = product.find('div', class_="product_sku").text
    product_link = product.find('div', class_="product_image_wrapper").find("a")["href"]
    src = product.find('div', class_="product_image_wrapper").find('a').find("img")["src"]
    return {
        "title": product_name,
        "link": product_link,
        "sku": sku,
        "src": src
    }

all_products = [get_product_details(product) for product in products]

with open("products.json", "w") as write_file:
    json.dump(all_products, write_file)

print("Success")
This code works perfectly as written. The problem is the structure of the output: instead of
[
    {
        "title": "12\" Beach Ball",
        "link": "/promos/PI-255-751/12-Beach-Ball.html?cid=20492",
        "sku": " \n\t\t\t\t#PI-255-751\n\t\t\t",
        "src": "https://12f598f3b6e7e912e4cd-a182d9508ed57781ad8837d0e4f7a945.ssl.cf5.rackcdn.com/thumb/751_group.jpg"
    },
]
I want it to be:
{
    "items": [
        {
            "title": "12\" Beach Ball",
            "link": "/promos/PI-255-751/12-Beach-Ball.html?cid=20492",
            "sku": " \n\t\t\t\t#PI-255-751\n\t\t\t",
            "src": "https://12f598f3b6e7e912e4cd-a182d9508ed57781ad8837d0e4f7a945.ssl.cf5.rackcdn.com/thumb/751_group.jpg"
        },
    ]
}
Here's a link to what I have working in Repl.it, just so you don't have to set up your own: https://repl.it/repls/AttractiveDimpledTheory
Side note: Would love to also be able to remove all of the \n and \t in the skus if possible.
Here you're dumping your all_products list directly to JSON:
with open("products.json", "w") as write_file:
json.dump(all_products, write_file)
The JSON you want just has that list in an object. Something like
with open("products.json", "w") as write_file:
json.dump({'items': all_products}, write_file)
should do what you want.
Generally speaking there's a 1:1 relationship between your Python data structure and the JSON it generates. If you build the right Python data structure you'll get the right JSON. Here we're using a dict (which maps to a JSON object) to wrap your existing list (which maps to a JSON array).
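A quick illustration of that mapping, using nothing beyond the standard library:

import json

# dict -> JSON object, list -> JSON array, str -> JSON string
print(json.dumps({'items': [{'title': '12" Beach Ball'}]}))
# prints: {"items": [{"title": "12\" Beach Ball"}]}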
Side note: Would love to also be able to remove all of the \n and \t in the skus if possible.
Assuming you also want to remove spaces, you can just use str.strip(), which strips whitespace by default:
return {
    "title": product_name,
    "link": product_link,
    "sku": sku.strip(),  # <-- here
    "src": src
}
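Note that str.strip() only removes leading and trailing whitespace, which happens to be all the sample sku contains. If whitespace ever appears in the middle of the string as well, one common idiom is to split and rejoin:

# collapses any run of spaces/tabs/newlines into a single space
sku = " ".join(sku.split())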