I am trying to write a .jsonl file that needs to look like this:
{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}
This is my attempt:
import json
import pandas as pd

df = pd.read_csv('data.csv')  # read_csv already returns a DataFrame
file_name = df['image']
file_caption = df['text']

data = []
for i in range(len(file_name)):
    entry = {"file_name": file_name[i], "text": file_caption[i]}
    data.append(entry)

json_object = json.dumps(data, indent=4)

# Writing to metadata.jsonl
with open("metadata.jsonl", "w") as outfile:
    outfile.write(json_object)
But this is the output I get:
[
{
"file_name": "images/image_0.jpg",
"text": "Fattoush Salad with Roasted Potatoes"
},
{
"file_name": "images/image_1.jpg",
"text": "an analysis of self portrayal in novels by virginia woolf A room of one's own study guide contains a biography of virginia woolf, literature essays, quiz questions, major themes, characters, and a full summary and analysis about a room of one's own a room of one's own summary."
},
{
"file_name": "images/image_2.jpg",
"text": "Christmas Comes Early to U.K. Weekly Home Entertainment Chart"
},
{
"file_name": "images/image_3.jpg",
"text": "Amy Garcia Wikipedia a legacy of reform: dorothea dix (1802\u20131887) | states of"
},
{
"file_name": "images/image_4.jpg",
"text": "3D Metal Cornish Harbour Painting"
},
{
"file_name": "images/image_5.jpg",
"text": "\"In this undated photo provided by the New York City Ballet, Robert Fairchild performs in \"\"In Creases\"\" by choreographer Justin Peck which is being performed by the New York City Ballet in New York. (AP Photo/New York City Ballet, Paul Kolnik)\""
},
...
]
I know it's because I am dumping a list, so I can see where I'm going wrong, but how do I create a .jsonl file in the format above?
Don't indent the generated JSON and don't append it to a list. Just write out each line to the file:
import json
import pandas as pd

df = pd.DataFrame([['0001.png', "This is a golden retriever playing with a ball"],
                   ['0002.png', "A german shepherd"],
                   ['0003.png', "One chihuahua"]],
                  columns=['filename', 'text'])

with open("metadata.jsonl", "w") as outfile:
    for file, caption in zip(df['filename'], df['text']):
        entry = {"file_name": file, "text": caption}
        print(json.dumps(entry), file=outfile)
Output:
{"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"}
{"file_name": "0002.png", "text": "A german shepherd"}
{"file_name": "0003.png", "text": "One chihuahua"}
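As an aside, reading a .jsonl file back works the same way in reverse: parse one line at a time with `json.loads` rather than calling `json.load` on the whole file. A stdlib-only sketch, reusing the file name from the answer above:

```python
import json

# Write a small metadata.jsonl the same way the answer does
rows = [
    {"file_name": "0001.png", "text": "This is a golden retriever playing with a ball"},
    {"file_name": "0002.png", "text": "A german shepherd"},
]
with open("metadata.jsonl", "w") as outfile:
    for row in rows:
        print(json.dumps(row), file=outfile)

# Read it back: one json.loads per line, never json.load on the whole file
with open("metadata.jsonl") as infile:
    records = [json.loads(line) for line in infile]
```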
I want to use Scrapy to extract the titles of the different books at a URL and output/store them as an array of dictionaries in a JSON file.
Here is my code:
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    star_urls = [
        "http://books.toscrape.com"
    ]

    def parse(self, response):
        titles = response.css("article.product_pod h3 a::attr(title)").getall()
        for title in titles:
            yield {"title": title}
Here is what I put in the terminal:
scrapy crawl books -o books.json
The books.json file is created but is empty.
I checked that I was in the right directory and venv but it still doesn't work.
However:
Earlier, I deployed this spider to scrape the whole HTML and write it to a books.html file, and everything worked.
Here is my code for this:
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    star_urls = [
        "http://books.toscrape.com"
    ]

    def parse(self, response):
        with open("books.html", "wb") as file:
            file.write(response.body)
and here is what I put in my terminal:
scrapy crawl books
Any ideas on what I'm doing wrong? Thanks
Edit:
inputting response.css('article.product_pod h3 a::attr(title)').getall()
into the scrapy shell outputs:
['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]
The problem is a typo: `star_urls` should be `start_urls`, so your spider has no URLs to crawl and the feed export stays empty. With that fixed, run the code again and it should work:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        titles = response.css('.product_pod')
        for title in titles:
            yield {
                "title": title.css('h3 a::attr(title)').get()
                # "title": title.css('h3 a::text').get()
            }
Output:
[
{
"title": "A Light in the Attic"
},
{
"title": "Tipping the Velvet"
},
{
"title": "Soumission"
},
{
"title": "Sharp Objects"
},
{
"title": "Sapiens: A Brief History of Humankind"
},
{
"title": "The Requiem Red"
},
{
"title": "The Dirty Little Secrets of Getting Your Dream Job"
},
{
"title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
},
{
"title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
},
{
"title": "The Black Maria"
},
{
"title": "Starving Hearts (Triangular Trade Trilogy, #1)"
},
{
"title": "Shakespeare's Sonnets"
},
{
"title": "Set Me Free"
},
{
"title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
},
{
"title": "Rip it Up and Start Again"
},
{
"title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
},
{
"title": "Olio"
},
{
"title": "Mesaerion: The Best Science Fiction Stories 1800-1849"
},
{
"title": "Libertarianism for Beginners"
},
{
"title": "It's Only the Himalayas"
}
]
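Incidentally, Scrapy can also emit the JSON-lines format from the first question directly: a `.jsonl` output extension selects the JSON-lines feed exporter, and (in Scrapy 2.0 and later) a capital `-O` overwrites the output file instead of appending to it:

```shell
# one JSON object per line; -O overwrites any existing file
scrapy crawl quotes -O books.jsonl
```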
How can I return or select only the parameters I need, as a Python dict, rather than all of the parameters being returned?
Here is the url we use:
https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20201020&facet=false&sort=newest&api-key=[YOUR_API_KEY]
Here is the response we get:
{
  "status": "OK",
  "copyright": "Copyright (c) 2020 The New York Times Company. All Rights Reserved.",
  "response": {
    "docs": [
      {
        "abstract": "Our latest survey shows a shift toward Biden among college-educated white voters, but surprising Trump gains among nonwhite voters.",
        "web_url": "https://www.nytimes.com/2020/10/20/upshot/poll-georgia-biden-trump.html",
        "snippet": "Our latest survey shows a shift toward Biden among college-educated white voters, but surprising Trump gains among nonwhite voters.",
        "lead_paragraph": "A shift against President Trump among white college-educated voters in Georgia has imperiled Republicans up and down the ballot, according to a New York Times/Siena College survey on Tuesday, as Republicans find themselves deadlocked or trailing in Senate races where their party was once considered the heavy favorite.",
        "source": "The New York Times",
        "multimedia": [
          {
            "rank": 0,
            "subtype": "xlarge",
            "caption": null,
            "credit": null,
            "type": "image",
            "url": "images/2020/10/20/us/undefined-promo-1603200878027/undefined-promo-1603200878027-articleLarge.jpg",
            "height": 399,
            "width": 600,
            "legacy": {
              "xlarge": "images/2020/10/20/us/undefined-promo-1603200878027/undefined-promo-1603200878027-articleLarge.jpg",
              "xlargewidth": 600,
              "xlargeheight": 399
            },
            "subType": "xlarge",
            "crop_name": "articleLarge"
          },
          ..........
How can I return only, for example, the web_url and source parameters in Python?
Please help! This is the code I use, but it returns all of the parameters:
import requests
import os
from pprint import pprint
apikey = os.getenv('VGSDRL9bWiWy70GdCPA4QX8flAsemVGJ', '...')
query_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=trump&sort=newest&api-key=VGSDRL9bWiWy70GdCPA4QX8flAsemVGJ"
r = requests.get(query_url)
pprint(r.json())
Pull out just the fields you want with a list comprehension over the docs list:
r = requests.get(query_url)
filtered = [{'web_url': d['web_url'], 'source': d['source']}
            for d in r.json()['response']['docs']]
pprint(filtered)
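To see that filtering in isolation, here is the same list comprehension applied to a mocked response with the same nesting as the Article Search reply (the two docs below are made-up placeholders, not real API output):

```python
# Mocked stand-in for r.json() -- same nesting as the Article Search response
mock_json = {
    "status": "OK",
    "response": {
        "docs": [
            {"web_url": "https://www.nytimes.com/a.html", "source": "The New York Times", "snippet": "..."},
            {"web_url": "https://www.nytimes.com/b.html", "source": "The New York Times", "snippet": "..."},
        ]
    },
}

# Keep only the two fields of interest from each doc
filtered = [{"web_url": d["web_url"], "source": d["source"]}
            for d in mock_json["response"]["docs"]]
```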
I have a JSON response that I receive from Lobbyview. I tried to put it in a data frame to access only some variables, but with no success. How can I access only some variables, such as the id and the committees, in a format exportable to .dta? Here is the code I have tried.
import requests, json
import pandas as pd

query = {"naics": "424430"}
results = requests.post('https://www.lobbyview.org/public/api/reports',
                        data=json.dumps(query))
print(results.json())
b = pd.DataFrame(results.json())
_id = data["_id"]
committee = data["_source"]["specific_issues"][0]["bills_by_algo"][0]["committees"]
An observation of the JSON looks like this:
{
  "_score": 4.421936,
  "_type": "object",
  "_id": "5EZUMbQp3hGKH8Uq2Vxuke",
  "_source": {
    "issue_codes": ["CPT"],
    "received": 1214320148,
    "client_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
    "amount": 240000,
    "client": {
      "legal_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
      "name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION",
      "naics": null,
      "gvkey": null,
      "ticker": "Unlisted",
      "id": null,
      "bvdid": "US131283992L"
    },
    "specific_issues": [
      {
        "text": "H.R. 34, H.R. 1908, H.R. 2336, H.R. 3093 S. 522, S. 681, S. 1145, S. 1745",
        "bills_by_algo": [
          {
            "titles": ["To amend title 35, United States Code, to provide for patent reform.", "Patent Reform Act of 2007", "Patent Reform Act of 2007", "Patent Reform Act of 2007"],
            "top_terms": ["Commerce", "Administrative fees"],
            "sponsor": {
              "firstname": "Howard",
              "district": 28,
              "title": "rep",
              "id": 400025
            },
            "committees": ["House Judiciary"],
            "introduced": 1176868800,
            "type": "HR",
            "id": "110_HR1908"
          },
          {
            "titles": ["To amend title 35, United States Code, relating to the funding of the United States Patent and Trademark Office."],
            "top_terms": ["Commerce", "Administrative fees"],
            "sponsor": {
              "firstname": "Howard",
              "district": 28,
              "title": "rep",
              "id": 400025
            },
            "committees": ["House Judiciary"],
            "introduced": 1179288000,
            "type": "HR",
            "id": "110_HR2336"
          }
        ],
        "gov_entities": ["U.S. House of Representatives", "Patent and Trademark Office (USPTO)", "U.S. Senate", "UNDETERMINED", "U.S. Trade Representative (USTR)"],
        "lobbyists": ["Valente, Thomas Silvio", "Wamsley, Herbert C"],
        "year": 2007,
        "issue": "CPT",
        "id": "S4nijtRn9Q5NACAmbqFjvZ"
      }
    ],
    "year": 2007,
    "is_latest_amendment": true,
    "type": "MID-YEAR AMENDMENT",
    "id": "1466CDCD-BA3D-41CE-B7A1-F9566573611A",
    "alternate_name": "INTELLECTUAL PROPERTY OWNERS ASSOCIATION"
  },
  "_index": "collapsed"
}
Since the data you need is nested quite deeply in the JSON response, you have to loop through it and collect the values into a temporary list. To understand the response better, I would advise you to inspect the structure with a tool such as an online JSON viewer. Not every entry in the JSON contains the necessary data, so I catch the error with a try/except. To make sure the id and committees are matched correctly, I add them to the list as small dicts, which Pandas can then read with ease. Saving to .dta requires converting the lists inside the committees column to strings; alternatively, you might want to save as .csv for a more generally usable format.
import requests, json
import pandas as pd

query = {"naics": "424430"}
results = requests.post(
    "https://www.lobbyview.org/public/api/reports", data=json.dumps(query)
)
json_response = results.json()["result"]

# to save the JSON response
# with open("data.json", "w") as outfile:
#     json.dump(results.json()["result"], outfile)

resulting_data = []
# loop through the response
for data in json_response:
    # try to find entries with specific_issues, bills_by_algo and committees
    try:
        # loop through the specific issues
        for special_issue in data["specific_issues"]:
            _id = special_issue["id"]
            # loop through the bills_by_algo entries
            for x in special_issue["bills_by_algo"]:
                # append the id and committees as a dict
                resulting_data.append({"id": _id, "committees": x["committees"]})
    except KeyError as e:
        print(e, "not found in entry.")
        continue

# create a DataFrame
df = pd.DataFrame(resulting_data)
# .dta does not support list objects in a column, therefore we convert
# them to strings with "; " as delimiter
df["committees"] = ["; ".join(map(str, l)) for l in df["committees"]]
print(df)
df.to_stata("result.dta")
Results in
id committees
0 D8BxG5664FFb8AVc6KTphJ House Judiciary
1 D8BxG5664FFb8AVc6KTphJ Senate Judiciary
2 8XQE5wu3mU7qvVPDpUWaGP House Agriculture
3 8XQE5wu3mU7qvVPDpUWaGP Senate Agriculture, Nutrition, and Forestry
4 kzZRLAHdMK4YCUQtQAdCPY House Agriculture
.. ... ...
406 ZxXooeLGVAKec9W2i32hL5 House Agriculture
407 ZxXooeLGVAKec9W2i32hL5 Senate Agriculture, Nutrition, and Forestry; H...
408 ZxXooeLGVAKec9W2i32hL5 House Appropriations; Senate Appropriations
409 ahmmafKLfRP8wZay9o8GRf House Agriculture
410 ahmmafKLfRP8wZay9o8GRf Senate Agriculture, Nutrition, and Forestry
[411 rows x 2 columns]
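The list-to-string step that makes the .dta export possible can be seen on its own (a stdlib-only sketch with made-up ids, no network or pandas needed):

```python
rows = [
    {"id": "5EZUMbQp3hGKH8Uq2Vxuke", "committees": ["House Judiciary"]},
    {"id": "5EZUMbQp3hGKH8Uq2Vxuke", "committees": ["House Judiciary", "Senate Judiciary"]},
]

# .dta cannot store list cells, so join each list into one delimited string
for row in rows:
    row["committees"] = "; ".join(map(str, row["committees"]))
```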
I have two different txt files:
1) one with just jpg file names
2) another with five captions for each image, in the following format:
1000268201_693b08cb0e.jpg#0 A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1 A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2 A little girl climbing into a wooden playhouse .
1000268201_693b08cb0e.jpg#3 A little girl climbing the stairs to her playhouse .
1000268201_693b08cb0e.jpg#4 A little girl in a pink dress going into a wooden cabin .
where 0, 1, 2, 3, 4 are caption ids for the same image.
I want to create something like this for the whole dataset:
{"images": [
    {"imgurl": "static/img/667626_18933d713e.jpg",
     "description": "A girl is stretched out in shallow water"},
    {"imgurl": "static/img/667626_18933d713e.jpg",
     "description": "A girl is stretched out in water"},
    {"imgurl": "static/img/667626_18933d713e.jpg",
     "description": "description 3"},
    {"imgurl": "static/img/667626_18933d713e.jpg",
     "description": "description 4"},
    {"imgurl": "static/img/667626_18933d713e.jpg",
     "description": "description 5"}
]}
I am looking for Python code to do this in one go, for the whole dataset. It is very frustrating to do it manually.
import glob
import json

image_database = glob.glob('static/img/*.jpg')
dataset_list = []

for image in image_database:
    print('Enter the description for ' + image + ':')
    description = input()
    img_data = {}
    img_data['imgurl'] = image
    img_data['description'] = description
    dataset_list.append(img_data)

dataset_json = {}
dataset_json['images'] = dataset_list
json.dump(dataset_json, open('custom_dataset.json', 'w'))
I have placed the image.txt under the static folder; what I am currently doing is entering each description manually.
Parse the captions file and generate your JSON from it:
import json

filepath = 'img.txt'
asset_path = 'static/img/'
images = []

with open(filepath) as file:
    for line in file:
        split = line.split("#")
        # split[1] is e.g. "0 A child in a pink dress ...": drop the caption id
        image = {
            "imgurl": asset_path + split[0],
            "description": split[1].strip('\n').split(" ", 1)[1]
        }
        images.append(image)

with open('data.json', 'w') as outfile:
    json.dump({"images": images}, outfile)
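If you would rather keep the five captions grouped under one entry per image instead of repeating the imgurl, a `collections.defaultdict` does it in the same single pass (a sketch under the same `#`-separated format; the `descriptions` key is my naming, not from the question):

```python
from collections import defaultdict

lines = [
    "1000268201_693b08cb0e.jpg#0 A child in a pink dress is climbing up a set of stairs .",
    "1000268201_693b08cb0e.jpg#1 A girl going into a wooden building .",
]

captions = defaultdict(list)
for line in lines:
    name, rest = line.split("#", 1)
    # rest looks like "0 A child ...": drop the caption id before the first space
    captions[name].append(rest.split(" ", 1)[1].strip())

grouped = {"images": [{"imgurl": "static/img/" + name, "descriptions": descs}
                      for name, descs in captions.items()]}
```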
I collected public course data from Udemy and put it all in a JSON file. Each course has an identifier number under which all of its data is stored. I can list out any details I want, except for these identifier numbers.
How can I list out these numbers themselves? Thanks.
{
  "153318": {
    "lectures data": "31 lectures, 5 hours video",
    "instructor work": "Academy Of Technical Courses, Grow Your Skills Today",
    "title": "Oracle Applications R12 Order Management and Pricing",
    "promotional price": "$19",
    "price": "$20",
    "link": "https://www.udemy.com/oracle-applications-r12-order-management-and-pricing/",
    "instructor": "Parallel Branch Inc"
  },
  "616990": {
    "lectures data": "24 lectures, 1.5 hours video",
    "instructor work": "Learning Sans Location",
    "title": "Cloud Computing Development Essentials",
    "promotional price": "$19",
    "price": "$20",
    "link": "https://www.udemy.com/cloud-computing-development-essentials/",
    "instructor": "Destin Learning"
  }
}
You want the keys of that dictionary.
import json

with open('course.json') as json_file:
    course = json.load(json_file)
print(list(course.keys()))
giving:
['616990', '153318']
Parse the JSON into a Python dict, then loop over the keys:
parsed = json.loads(input)
for key in parsed.keys():
    print(key)
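And if you need each identifier together with one of its fields, iterate over `items()` instead of `keys()` (a small sketch with two of the course entries inlined as a JSON string):

```python
import json

raw = """{
  "153318": {"title": "Oracle Applications R12 Order Management and Pricing"},
  "616990": {"title": "Cloud Computing Development Essentials"}
}"""
course = json.loads(raw)

# pair every identifier with that course's title
id_titles = [(course_id, info["title"]) for course_id, info in course.items()]
```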