How to get information from the html script below? - python

I extracted the following script from html using beautiful-soup:
<script>
dataLayer =[{
"pageTitle": "PRODUCT: Macculloch Parka Print( 9512MP )",
"pageCategory": "shop-mens-parkas",
"visitorLoginState": "Guest",
"EmployeeLoginState": false,
"customerEmail": "null",
"customerOrders": "null",
"customerValue": "0",
"Country": "CA",
"State": "ON",
"ecommerce": {
"currencyCode": "CAD",
"detail": {
"actionField": {
"list": "Product Category / Search Results"
},
"products": [
{
"name": "Macculloch Parka Print",
"id": "9512MP",
"price": 1295,
"brand": "Canada Goose",
"category": "shop-mens-parkas"}]}}}];</script>
I want to extract the information related to the product (name, id, price and brand) as a dataframe. Is there a way to do it without using regex?

You can use a regex to pull out the JSON and then parse it (here d is the text of the script tag):
import json
import re
data = json.loads(re.search(r"dataLayer =(.*);", d, re.DOTALL).group(1))
products = data[0]["ecommerce"]["detail"]["products"]
product_name = products[0]["name"]
product_id = products[0]["id"]
product_price = products[0]["price"]
product_brand = products[0]["brand"]
product_category = products[0]["category"]

Here is a temporary solution, contingent on receiving more information on the format of the data.
import re
import json
def get_datalayer_json(raw_script_tag: str):
    parser_re = r"<script>\s*dataLayer =(.*);\s*</script>"
    parser_result = re.match(parser_re, raw_script_tag.strip(), re.DOTALL)
    if parser_result is None:
        return None
    else:
        return json.loads(parser_result.group(1))
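Since the question asks for a way without regex, here is a minimal sketch that slices the array out with plain string methods instead. It assumes the script text contains exactly one `dataLayer = [...]` assignment and that the first "[" and the last "]" delimit it; `SCRIPT_TEXT`, `extract_products` and `FIELDS` are illustrative names, not part of the original post.

```python
import json

# Shortened copy of the script contents from the question (tags already stripped).
SCRIPT_TEXT = """
dataLayer =[{
"pageTitle": "PRODUCT: Macculloch Parka Print( 9512MP )",
"ecommerce": {
"currencyCode": "CAD",
"detail": {
"products": [
{
"name": "Macculloch Parka Print",
"id": "9512MP",
"price": 1295,
"brand": "Canada Goose",
"category": "shop-mens-parkas"}]}}}];
"""


def extract_products(script_text):
    """Return the product dicts from a `dataLayer = [...];` assignment."""
    start = script_text.find("[")   # first "[" opens the dataLayer array
    end = script_text.rfind("]")    # last "]" closes it
    if start == -1 or end == -1:
        return []
    data = json.loads(script_text[start:end + 1])
    return data[0]["ecommerce"]["detail"]["products"]


# Keep only the fields the question asks for.
FIELDS = ("name", "id", "price", "brand")
rows = [{key: product[key] for key in FIELDS}
        for product in extract_products(SCRIPT_TEXT)]
print(rows)
```

If pandas is installed, `pandas.DataFrame(rows)` should turn this list of dicts into the dataframe asked for.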

Related

Create a nested data dictionary in Python

I have the data as below
{
"employeealias": "101613177",
"firstname": "Lion",
"lastname": "King",
"date": "2022-04-21",
"type": "Thoughtful Intake",
"subject": "Email: From You Success Coach"
}
{
"employeealias": "101613177",
"firstname": "Lion",
"lastname": "King",
"date": "2022-04-21",
"type": null,
"subject": "Call- CDL options & career assessment"
}
I need to create a dictionary like the below:
You have to create a new dictionary holding a list, then use a for-loop to check whether an item with the same employeealias, firstname and lastname already exists. If it does, append the remaining information to its sublist. If it doesn't, create a new item with employeealias, firstname, lastname and the remaining information.
data = [
    {"employeealias":"101613177","firstname":"Lion","lastname":"King","date":"2022-04-21","type":"Thoughtful Intake","subject":"Email: From You Success Coach"},
    {"employeealias":"101613177","firstname":"Lion","lastname":"King","date":"2022-04-21","type":"null","subject":"Call- CDL options & career assessment"},
]
result = {'interactions': []}
for row in data:
    found = False
    for item in result['interactions']:
        if (row["employeealias"] == item["employeealias"]
                and row["firstname"] == item["firstname"]
                and row["lastname"] == item["lastname"]):
            item["activity"].append({
                "date": row["date"],
                "subject": row["subject"],
                "type": row["type"],
            })
            found = True
            break
    if not found:
        result['interactions'].append({
            "employeealias": row["employeealias"],
            "firstname": row["firstname"],
            "lastname": row["lastname"],
            "activity": [{
                "date": row["date"],
                "subject": row["subject"],
                "type": row["type"],
            }]
        })
print(result)
EDIT:
You read the lines as normal text, but you have to convert each line to a dictionary using the json module:
import json
data = []
with open("/Users/Downloads/amazon_activity_feed_0005_part_00.json") as a_file:
    for line in a_file:
        line = line.strip()
        dictionary = json.loads(line)
        data.append(dictionary)
print(data)
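As a variation on the nested loop above, the grouping can also be sketched with a dictionary keyed by the (employeealias, firstname, lastname) tuple, which avoids rescanning the result list for every row. This is only an illustration: `RAW_LINES` and `group_interactions` are made-up names, and the input is assumed to hold one JSON object per line.

```python
import json

# Two sample rows, one JSON object per line, as in the question.
RAW_LINES = """\
{"employeealias": "101613177", "firstname": "Lion", "lastname": "King", "date": "2022-04-21", "type": "Thoughtful Intake", "subject": "Email: From You Success Coach"}
{"employeealias": "101613177", "firstname": "Lion", "lastname": "King", "date": "2022-04-21", "type": null, "subject": "Call- CDL options & career assessment"}
"""


def group_interactions(rows):
    """Group rows by employee and collect the remaining fields under "activity"."""
    grouped = {}
    for row in rows:
        key = (row["employeealias"], row["firstname"], row["lastname"])
        if key not in grouped:
            grouped[key] = {
                "employeealias": row["employeealias"],
                "firstname": row["firstname"],
                "lastname": row["lastname"],
                "activity": [],
            }
        grouped[key]["activity"].append({
            "date": row["date"],
            "subject": row["subject"],
            "type": row["type"],
        })
    # Dicts keep insertion order, so employees appear in first-seen order.
    return {"interactions": list(grouped.values())}


rows = [json.loads(line) for line in RAW_LINES.splitlines() if line.strip()]
result = group_interactions(rows)
print(json.dumps(result, indent=2))
```

Note that `json.loads` turns the JSON `null` into Python's `None`, so no string placeholder is needed for the missing type.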
You can create a nested dictionary in Python like this (the inner dictionary needs a key of its own; "school" here is just an example):
student = {"name": "Suman", "age": 20, "gender": "male", "school": {"class": 11, "roll_no": 12}}

Python - How to retrieve element from json

Aloha,
My Python routine retrieves a JSON file from a site, checks it, downloads another JSON file based on the first answer, and eventually downloads a zip.
The first JSON file gives information about the documents.
Here's an example:
[
{
"id": "d9789918772f935b2d686f523d066a7b",
"originalName": "130010259_AC2_R44_20200101",
"type": "SUP",
"status": "document.deleted",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_AC2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.4212881,
47.6171589,
8.1598899,
50.1338684
],
"documentSource": "UPLOAD",
"uploadDate": "2020-06-25T14:56:27+02:00",
"updateDate": "2021-01-19T14:33:35+01:00",
"fileIdentifier": "SUP-AC2-R44-130010259-20200101",
"legalControlStatus": 101
},
{
"id": "6a9013bdde6acfa632861aeb1a02942b",
"originalName": "130010259_AC2_R44_20210101",
"type": "SUP",
"status": "document.production",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_AC2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.4212881,
47.6171589,
8.1598899,
50.1338684
],
"documentSource": "UPLOAD",
"uploadDate": "2021-01-18T16:37:01+01:00",
"updateDate": "2021-01-19T14:33:29+01:00",
"fileIdentifier": "SUP-AC2-R44-130010259-20210101",
"legalControlStatus": 101
},
{
"id": "efd51feaf35b12248966cb82f603e403",
"originalName": "130010259_PM2_R44_20210101",
"type": "SUP",
"status": "document.production",
"legalStatus": "APPROVED",
"name": "130010259_SUP_R44_PM2",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
3.6535762,
47.665021,
7.9509455,
49.907347
],
"documentSource": "UPLOAD",
"uploadDate": "2021-01-28T09:52:31+01:00",
"updateDate": "2021-01-28T18:53:34+01:00",
"fileIdentifier": "SUP-PM2-R44-130010259-20210101",
"legalControlStatus": 101
},
{
"id": "2e1b6104fdc09c84077d54fd9e74a7a7",
"originalName": "444619258_I4_R44_20210211",
"type": "SUP",
"status": "document.pre_production",
"legalStatus": "APPROVED",
"name": "444619258_SUP_R44_I4",
"grid": {
"name": "R44",
"title": "GRAND EST"
},
"bbox": [
2.8698336,
47.3373246,
8.0881368,
50.3796449
],
"documentSource": "UPLOAD",
"uploadDate": "2021-04-19T10:20:20+02:00",
"updateDate": "2021-04-19T14:46:21+02:00",
"fileIdentifier": "SUP-I4-R44-444619258-20210211",
"legalControlStatus": 100
}
]
What I'm trying to do is retrieve the "id" from this JSON file (e.g. "id": "2e1b6104fdc09c84077d54fd9e74a7a7").
I've tried:
import json
from jsonpath_rw import jsonpath, parse
import jsonpath_rw_ext as jp
with open('C:/temp/gpu/SUP/20210419/SUPGE.json') as f:
    d = json.load(f)
data = json.dumps(d)
print("oriName: {}".format( jp.match1("$.id[*]",data) ) )
It doesn't work. In fact, I'm not sure how jsonpath-rw is intended to work. Thankfully there was this blog post, but I'm still stuck.
Does anyone have a clue?
With the id, I'll be able to download another JSON file, and that one contains an archiveUrl pointing to the zip file.
Thanks in advance.
import json
with open('SUPGE.json') as f:
    d = json.load(f)
for i in d:
    print(i.get('id'))
This will give you the ids only:
d9789918772f935b2d686f523d066a7b
6a9013bdde6acfa632861aeb1a02942b
efd51feaf35b12248966cb82f603e403
2e1b6104fdc09c84077d54fd9e74a7a7
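Building on that loop, here is a small sketch that keeps only the ids whose status matches a given value, which is what the follow-up download step needs. The sample is trimmed to the two relevant keys, and `DOCUMENTS_JSON` and `ids_with_status` are illustrative names.

```python
import json

# The four documents from the question, reduced to "id" and "status".
DOCUMENTS_JSON = """[
  {"id": "d9789918772f935b2d686f523d066a7b", "status": "document.deleted"},
  {"id": "6a9013bdde6acfa632861aeb1a02942b", "status": "document.production"},
  {"id": "efd51feaf35b12248966cb82f603e403", "status": "document.production"},
  {"id": "2e1b6104fdc09c84077d54fd9e74a7a7", "status": "document.pre_production"}
]"""


def ids_with_status(documents, status):
    """Return the ids of all documents whose "status" equals `status`."""
    return [doc["id"] for doc in documents if doc.get("status") == status]


documents = json.loads(DOCUMENTS_JSON)
production_ids = ids_with_status(documents, "document.production")
print(production_ids)
```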
Ok.
Here's what I've done.
import json
import urllib.request
# not sure it's the best way to load json from a url, but it works fine
# and I could test most of the code if needed.
def getResponse(url):
    jsonData = None
    operUrl = urllib.request.urlopen(url)
    if(operUrl.getcode()==200):
        data = operUrl.read()
        jsonData = json.loads(data)
    else:
        print("Error received", operUrl.getcode())
    return jsonData
# Here I get the json from the url.
# In the final script that part will be a parameter,
# because I have a lot of territories to control.
d = getResponse('https://www.geoportail-urbanisme.gouv.fr/api/document?documentFamily=SUP&grid=R44&legalStatus=APPROVED')
for i in d:
    if i['status'] == 'document.production' :
        print('id of the document in production:',i.get('id'))
        # here we use the id to fetch the whole document.
        # Same server, same API but different url
        _URL = 'https://www.geoportail-urbanisme.gouv.fr/api/document/' + i.get('id')+'/details'
        d2 = getResponse(_URL)
        print('archive',d2['archiveUrl'])
        urllib.request.urlretrieve(d2['archiveUrl'], 'c:/temp/gpu/SUP/'+d2['metadata']+'.zip' )
        # I used wget in the past and loved the progress bar.
        # Maybe I'll switch to wget because of it.
# Works fine.
Thanks for your answer. I'm delighted to see that even with only the json library you could do amazing things. Just normal stuff. But amazing.
Feel free to comment if you think I've missed something.

Saving a JSON list - Django

I am working with a requests.get call whose JSON response is a list. I would like to save each individual object in the response to my models, so I did this:
In views.py:
def save_ram_api(request):
    r = requests.get('https://ramb.com/ciss-api/v1/')
    # data = json.loads(r)
    data = r.json()
    for x in data:
        title = x["title"]
        ramyo_donotuse = x["ramyo"]
        date = x["date"]
        thumbnail = x["thumbnail"]
        home_team_name = x["side1"]["name"]
        away_team_name = x["side2"]["name"]
        competition_name = x["tournament"]["name"]
    ramAdd = ramSample.objects.create(title=title, ramyo_donotuse=ramyo_donotuse, date=date, thumbnail=thumbnail, home_team_name=home_team_name, away_team_name=away_team_name, competition_name=competition_name)
    ramAdd.save()
    return HttpResponse("Successfully submitted!")
This works fine, except that it only saves the last object in the list.
The JSON response list (a random 60 objects at any time) looks something like:
[
{
"title": "AY - BasketMouth",
"ramyo": "AY de comedian",
"side1": {
"name": "Comedy Central",
"url": "https:\/\/www.rabithole.com\/laugh\/dave-chappel\/"
},
"side2": {
"name": "Basket Mouth",
"url": "https:\/\/www.rabithole.com\/laugh\/chris-rockie\/"
},
"tournament": {
"name": "Night of a thousand laugh",
"id": 15,
"url": "https:\/\/www.rabithole.com\/laugh\/chris-rockie\/"
},
"points": [
{
"nature": "Gentle",
"phrase": "Just stay"
},
{
"nature": "Sarcastic",
"phrase": "Help me"
}
]
},
{
"title": "Dave - Chris",
"ramyo": "Dave Chapelle",
"side1": {
"name": "Comedy Central",
"url": "https:\/\/www.rabithole.com\/laugh\/dave-chappel\/"
},
"side2": {
"name": "Chris Rockie",
"url": "https:\/\/www.rabithole.com\/laugh\/chris-rockie\/"
},
"tournament": {
"name": "Tickle me",
"id": 15,
"url": "https:\/\/www.rabithole.com\/laugh\/chris-rockie\/"
},
"points": [
{
"nature": "Rogue",
"phrase": "Just stay"
}
]
}
]
In this case my views.py only saves the last dictionary in the list, ignoring the other 59.
My question is:
How do I get views.py to save all the objects in the list?
Notice that "points" is also a list that contains one or more dictionaries. Any help on how to save this as well?
Your code saves only the last object in the list because you create and save the object outside of the loop. Try this:
def save_ram_api(request):
    r = requests.get('https://ramb.com/ciss-api/v1/')
    # data = json.loads(r)
    data = r.json()
    for x in data:
        title = x["title"]
        ramyo_donotuse = x["ramyo"]
        date = x["date"]
        thumbnail = x["thumbnail"]
        home_team_name = x["side1"]["name"]
        away_team_name = x["side2"]["name"]
        competition_name = x["tournament"]["name"]
        # objects.create() already saves the new row, so no separate save() call is needed
        ramAdd = ramSample.objects.create(title=title, ramyo_donotuse=ramyo_donotuse, date=date, thumbnail=thumbnail, home_team_name=home_team_name, away_team_name=away_team_name, competition_name=competition_name)
    return HttpResponse("Successfully submitted!")
How do I get the views.py to save the entire objects on the list?
Notice that the "points" is also a list that contains one or more
dictionaries, any help how to save this as well?
Regarding those questions:
If you are using PostgreSQL as your database, you can use Django's built-in JSONField and ArrayField for PostgreSQL.
And if your database is not PostgreSQL, you can use the jsonfield library.
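To make the per-item logic easier to check outside Django, here is a framework-free sketch that builds one dictionary of field values per item in the response. `SAMPLE_RESPONSE` is a shortened copy of the question's data and `build_rows` is a hypothetical helper. The "points" entry assumes the model has a JSON-capable field of that name, which the original model may not have. Since the sample has no "date" or "thumbnail" keys, `.get()` is used for those.

```python
# Shortened copy of the response from the question (urls and ids omitted).
SAMPLE_RESPONSE = [
    {
        "title": "AY - BasketMouth",
        "ramyo": "AY de comedian",
        "side1": {"name": "Comedy Central"},
        "side2": {"name": "Basket Mouth"},
        "tournament": {"name": "Night of a thousand laugh"},
        "points": [
            {"nature": "Gentle", "phrase": "Just stay"},
            {"nature": "Sarcastic", "phrase": "Help me"},
        ],
    },
    {
        "title": "Dave - Chris",
        "ramyo": "Dave Chapelle",
        "side1": {"name": "Comedy Central"},
        "side2": {"name": "Chris Rockie"},
        "tournament": {"name": "Tickle me"},
        "points": [{"nature": "Rogue", "phrase": "Just stay"}],
    },
]


def build_rows(data):
    """Build one dict of model field values per item in the response."""
    rows = []
    for x in data:
        rows.append({
            "title": x["title"],
            "ramyo_donotuse": x["ramyo"],
            "date": x.get("date"),            # missing in the sample, so None
            "thumbnail": x.get("thumbnail"),  # missing in the sample, so None
            "home_team_name": x["side1"]["name"],
            "away_team_name": x["side2"]["name"],
            "competition_name": x["tournament"]["name"],
            "points": x.get("points", []),    # hypothetical JSON field
        })
    return rows


rows = build_rows(SAMPLE_RESPONSE)
for row in rows:
    print(row["title"], len(row["points"]))
```

Inside the view, each row could then be passed to `ramSample.objects.create(**row)` within the loop, provided the model's fields match these keys.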

Extracting text from json within <script> tag when multiple jsons are present

I am trying to extract the "ratingValue" and "alternateName" fields of the rating from the HTML code of https://www.truthorfiction.com/are-americans-annually-healthcare-undocumented/:
<script type=application/ld+json>{
"#context": "http://schema.org",
"#type": "ClaimReview",
"datePublished": "2019-01-03 ",
"url": "https://www.truthorfiction.com/are-americans-annually-healthcare-undocumented/",
"author": {
"#type": "Organization",
"url": "https://www.truthorfiction.com/",
"image": "https://dn.truthorfiction.com/wp-content/uploads/2018/10/25032229/truth-or-fiction-logo-tagline.png",
"sameAs": "https://twitter.com/whatstruecom"
},
"claimReviewed": "More Americans die every year from a lack of affordable healthcare than by terrorism or at the hands of undocumented immigrants.",
"reviewRating": {
"#type": "Rating",
"ratingValue": -1,
"worstRating":-1,
"bestRating": -1,
"alternateName": "True"
},
"itemReviewed": {
"#type": "CreativeWork",
"author": {
"#type": "Person",
"name": "Person",
"jobTitle": "",
"image": "",
"sameAs": [
""
]
},
"datePublished": "",
"name": ""
}
}</script>
I have tried to do that using the following code:
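import json
import urllib3
from bs4 import BeautifulSoup
# assuming http is a urllib3 PoolManager, which matches the request()/.data calls below
http = urllib3.PoolManager()
slink = 'https://www.truthorfiction.com/are-americans-annually-healthcare-undocumented/'
response = http.request('GET', slink)
soup = BeautifulSoup(response.data, 'html.parser')
tmp = json.loads(soup.find('script', type='application/ld+json').text)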
However, tmp instead shows a dictionary of the 'application/ld+json' item from the bit preceding the ratings that I would like to extract, and I was wondering how to cycle or loop to the relevant part of the script where the ratings are stored.
The page has two <script type=application/ld+json> tags. You can select the second one by index from find_all():
tmp = json.loads(soup.find_all('script', type='application/ld+json')[1].text)
or loop over them and pick the one that contains the string:
tmp = None
for ldjson in soup.find_all('script', type='application/ld+json'):
    if 'ratingValue' in ldjson.text:
        tmp = json.loads(ldjson.text)
You need to access the element using the keys.
rating_value = tmp['reviewRating']['ratingValue'] # -1
alternate_name = tmp['reviewRating']['alternateName'] # 'True'
or
review_rating = tmp['reviewRating']
rating_value = review_rating['ratingValue'] # -1
alternate_name = review_rating['alternateName'] # 'True'
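The same idea can be sketched with only the standard library, which may help when checking the selection logic without BeautifulSoup. This version collects every ld+json script with `html.parser.HTMLParser` and returns the first block that has a "reviewRating" key. `SAMPLE_HTML`, `LdJsonCollector` and `find_review_rating` are illustrative names, and the sample markup is a reduced stand-in for the real page.

```python
import json
from html.parser import HTMLParser

# Reduced stand-in for the page: two ld+json scripts, only the second has a rating.
SAMPLE_HTML = """
<html><head>
<script type=application/ld+json>{"name": "Example site"}</script>
<script type=application/ld+json>{
  "reviewRating": {"ratingValue": -1, "alternateName": "True"}
}</script>
</head></html>
"""


class LdJsonCollector(HTMLParser):
    """Collect the parsed contents of every ld+json script tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # text chunks while inside a matching script tag

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None


def find_review_rating(html):
    """Return the first "reviewRating" dict found in the page, or None."""
    collector = LdJsonCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        if "reviewRating" in block:
            return block["reviewRating"]
    return None


rating = find_review_rating(SAMPLE_HTML)
print(rating["ratingValue"], rating["alternateName"])
```

Checking for the key after parsing, rather than searching the raw text for a substring, avoids false matches when the string appears elsewhere in a script.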

How to take two keys with a same name in JSON

The code below already takes "street": "Manhattan street 15", but how can I take "PL 300", since both use the same key name?
My current python code:
contact_info = dict(business_id=business_id,
name=business_info['name'],
street=address['street'],
post_code=address['postCode'],
city=address['city'],
website=address['website'],
phone=address['phone'],
register_date=register_date
)
And this is the JSON format:
"addresses": [
{
"street": "Manhattan street 15",
"postCode": "53100",
"type": 1,
"city": "Monaco",
"country": "MC",
"website": null,
"phone": null,
"fax": null,
"registrationDate": "2014-11-17",
"endDate": null
},
{
"street": "PL 300",
"postCode": "00089",
"type": 2,
"city": "Halic",
"country": "Hc",
"website": null,
"phone": null,
"fax": null,
"registrationDate": "2014-11-17",
"endDate": null
}
]
The JSON you posted is an array of objects, so you first have to pick the object you want to fetch the street from:
address = addresses[1]
street = address['street']
You can also iterate over the array.
It seems addresses is a list with two dicts, so:
addresses[0]['street']  # will give you the street in the first dict
addresses[1]['street']  # will give you the street in the second dict
import json
with open('your.json') as f:
    business_info = json.load(f)
streets = [address['street'] for address in business_info['addresses']]
TRY:
from urllib.request import urlopen
import json
url = 'http://example.com'
response = urlopen(url)
json_obj = json.load(response)
for i in json_obj['addresses']:
    print(i['street'])
It should work. It will print all the street names within the addresses array.
For other values you need to specify those key names, like I did for street.
It's a JSON array with two contacts, therefore json["addresses"][0]["street"] and json["addresses"][1]["street"] are different.
import json
contact_infos = []
# json_string holds the JSON text containing the "addresses" array
parsed_json = json.loads(json_string)
for addr in parsed_json["addresses"]:
    contact_infos.append(
        dict(
            business_id=9999,
            name="Jason Derulo",
            street=addr["street"],
            post_code=addr["postCode"],
            city=addr["city"],
            website=addr["website"],
            phone=addr["phone"],
            register_date=addr["registrationDate"]
        )
    )
# A list of two contact infos
print(contact_infos)
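If the position in the array is not guaranteed, one option is to look the street up by the "type" field instead of by index. The sketch below assumes each address has a distinct "type" (1 and 2 in the question's sample). `ADDRESSES_JSON` and `streets_by_type` are illustrative names, and the sample is trimmed to a few keys.

```python
import json

# Trimmed copy of the addresses from the question, wrapped in an object.
ADDRESSES_JSON = """{"addresses": [
  {"street": "Manhattan street 15", "postCode": "53100", "type": 1, "city": "Monaco"},
  {"street": "PL 300", "postCode": "00089", "type": 2, "city": "Halic"}
]}"""


def streets_by_type(addresses):
    """Map each address "type" to its street (later duplicates overwrite earlier ones)."""
    return {addr["type"]: addr["street"] for addr in addresses}


parsed = json.loads(ADDRESSES_JSON)
streets = streets_by_type(parsed["addresses"])
print(streets[1])
print(streets[2])
```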
