Issue Parsing a site with lxml and xpath in python

Issue Parsing a site with lxml and xpath in python - python

I think I am messing up my xpath. What I am trying to do is get the information of each row on the table in this page.
This is what I have so far but its not outputting what I'm looking for.
import requests
from lxml import etree
r = requests.get('http://mtgoclanteam.com/Cards?edition=DTK')
doc = etree.HTML(r.text)
#get list of cards
cards = [card for card in doc.xpath('id("cardtable")/x:tbody/x:tr[1]/x:td[3]')]
for card in cards:
print card

The primary problem here is that the actual document served from the server contains an empty table:
<table id="cardtable" class="cardlist"/>
The data is filled in after the page loads by the embedded javascript that follows the empty table element:
<script>
$('#cardtable').dataTable({
"aLengthMenu": [[25, 100, -1], [25, 100, "All"]],
"bDeferRender": true,
"aaSorting": [],
"bPaginate": false,
"aaData": [
...DATA IS HERE...
],
"aoColumns": [
{ "sTitle": "Card name", "sWidth": "260" },
{ "sTitle": "Rarity", "sWidth": "40" },
{ "sTitle": "Buy", "sWidth": "80" },
{ "sTitle": "Sell", "sWidth": "80" },
{ "sTitle": "Bots with stock" }]
})
</script>
The data itself is contained the aaData element of the dictionary
that is passed to the dataTable() method. Extracting this in Python
is going to be tricky (this isn't just a JSON document). Possibly a
suitable regular expression applied to the script text would get you
what you want (or just iterate over the lines of the script and take the one after the aaData key).
For example:
import pprint
import json
import requests
from lxml import etree
r = requests.get('http://mtgoclanteam.com/Cards?edition=DTK')
doc = etree.HTML(r.text)
script = doc.xpath('id("templatemo_content")/script')[0].text
found = False
result = None
for line in script.splitlines():
if found:
if '[' in line:
result=line
break
if 'aaData' in line:
found = True
if result:
result =json.loads('[' + result + ']')
pprint.pprint(result)
This is ugly and fragile (it would break if the format of the script
changed), but it works for the current input.

Related

Get specific information out of application/ld+json

I am looking to get just the "ratingValue" and "reviewCount" from the following application/ld+json but cannot figure out how to do this after looking through numerous How-to's, so I've essentially given up. In advance thank you for your help.
Sample of the application/ld+json
{
"#context": "http://schema.org",
"#graph": [
{
"#type": "Product",
"name": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) ",
"description": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) - FilterBuy.com",
"productID": 30100,
"sku": "ABR20x20x5M8",
"mpn": "ABR20x20x5M8",
"url": "https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20/merv-8/",
"itemCondition": "new",
"brand": "FilterBuy",
"image": "https://filterbuy.com/media/pla_images/20x25x5AB/20x25x5AB-m8-(x1).jpg",
"aggregateRating": {
"#type": "AggregateRating",
**"ratingValue": 4.79926,
"reviewCount": 538**}
My Code:
from bs4 import BeautifulSoup
import bs4
import requests
import json
import re
import numpy as np
import csv
urls = ['https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20',
'https://filterbuy.com/brand/trion-air-bear-air-filters/16x25x5-air-bear-1400/?selected_merv=11']
for url in urls:
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
mervs = BeautifulSoup(response.text, 'lxml').find_all('strong')
product = BeautifulSoup(response.text, 'lxml').find("h1", class_="text-center")
jsonString = soup.find_all('script', type='application/ld+json')[1].text
json_schema = soup.find_all('script', attrs={'type': 'application/ld+json'})[1]
json_file = json.loads(json_schema.get_text())
for i, cart in enumerate(BeautifulSoup(response.text, 'lxml').find_all('form', class_='cart')):
for tax in cart.attrs:
if 'data-price' in tax:
print(product.text, mervs[i].get_text(), [tax], cart[tax], json_file)

In python, pretty much anything can be nested inside other things. This is an example of nesting lists and dictionaries inside a dictionary. You can go about getting the value by thinking about what you need to do on each level.
Start by assigning the above dictionary to a variable, like the_dict. You want to access the "#graph" key, then access the first item in the list it returns, then access "aggregateRating". From there, you can get both the values you want. Your code may look something like this:
the_dict = ...
d = the_dict['#graph'][0][aggregateRating']
rating_value, review_count = d['ratingValue'], d['reviewCount']

How to requests all sizes in stock - Python

I'm trying to request all the sizes in stock from Zalando. I can not quite figure out how to do it since the video I'm watching
showing how to request sizes look different than min.
The video that I watch was this. Video - 5.30
Does anyone know how to request the sizes in stock and print the sizes that in stock?
The site in trying to request sizes of: here
My code looks like this:
import requests
from bs4 import BeautifulSoup as bs
session = requests.session()
def get_sizes_in_stock():
global session
endpoint = "https://www.zalando.dk/nike-sportswear-air-max-90-sneakers-ni112o0bt-a11.html"
response = session.get(endpoint)
soup = bs(response.text, "html.parser")
I have tried to go to the View page source and look for the sizes, but I could not see the sizes in the page source.
I hope someone out there can help me what to do.

The sizes are in the page
I found them in the html, in a javascript tag, in the format
{
"sku": "NI112O0BT-A110090000",
"size": "42.5",
"deliveryOptions": [
{
"deliveryTenderType": "FASTER"
}
],
"offer": {
"price": {
"promotional": null,
"original": {
"amount": 114500
},
"previous": null,
"displayMode": null
},
"merchant": {
"id": "810d1d00-4312-43e5-bd31-d8373fdd24c7"
},
"selectionContext": null,
"isMeaningfulOffer": true,
"displayFlags": [],
"stock": {
"quantity": "MANY"
},
"sku": "NI112O0BT-A110090000",
"size": "42.5",
"deliveryOptions": [
{
"deliveryTenderType": "FASTER"
}
],
"offer": {
"price": {
"promotional": null,
"original": {
"amount": 114500
},
"previous": null,
"displayMode": null
},
"merchant": {
"id": "810d1d00-4312-43e5-bd31-d8373fdd24c7"
},
"selectionContext": null,
"isMeaningfulOffer": true,
"displayFlags": [],
"stock": {
"quantity": "MANY"
}
},
"allOffers": [
{
"price": {
"promotional": null,
"original": {
"amount": 114500
},
"previous": null,
"displayMode": null
},
"merchant": {
"id": "810d1d00-4312-43e5-bd31-d8373fdd24c7"
},
"selectionContext": null,
"isMeaningfulOffer": true,
"displayFlags": [],
"stock": {
"quantity": "MANY"
},
"deliveryOptions": [
{
"deliveryWindow": "2022-05-23 - 2022-05-25"
}
],
"fulfillment": {
"kind": "ZALANDO"
}
}
]
}
}
If you parse the html with bs4 you should be able to find the script tag and extract the JSON.

The sizes for the default color of shoe are shown in html. Alongside this are the urls for the other colors. You can extract these into a dictionary and loop, making requests and pulling the different colors and their availability, which I think is what you are actually requesting, as follows (note: I have kept quite generic to avoid hardcoding keys which change across requests):
import requests, re, json
def get_color_results(link):
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(link, headers=headers).text
data = json.loads(re.search(r'(\{"enrichedEntity".*size.*)<\/script', r).group(1))
results = []
color = ""
for i in data["graphqlCache"]:
if "ern:product" in i:
if "product" in data["graphqlCache"][i]["data"]:
if "name" in data["graphqlCache"][i]["data"]["product"]:
results.append(data["graphqlCache"][i]["data"]["product"])
if (
color == ""
and "color" in data["graphqlCache"][i]["data"]["product"]
):
color = data["graphqlCache"][i]["data"]["product"]["color"]["name"]
return (color, results)
link = "https://www.zalando.dk/nike-sportswear-air-max-90-sneakers-ni112o0bt-a11.html"
final = {}
color, results = get_color_results(link)
colors = {
j["node"]["color"]["name"]: j["node"]["uri"]
for j in [
a
for b in [
i["family"]["products"]["edges"]
for i in results
if "family" in i
if "products" in i["family"]
]
for a in b
]
}
final[color] = {
j["size"]: j["offer"]["stock"]["quantity"]
for j in [i for i in results if "simples" in i][0]["simples"]
}
for k, v in colors.items():
if k not in final:
color, results = get_color_results(v)
final[color] = {
j["size"]: j["offer"]["stock"]["quantity"]
for j in [i for i in results if "simples" in i][0]["simples"]
}
print(final)
Explanatory notes from chat:
Use chrome browser to navigate to link
Press Ctrl + U to view page source
Press Ctrl + F to search for 38.5 in html
The first match is the long string you already know about. The string is long and difficult to navigate in page source and identify which tag it is part of. There are a number of ways I could identify the right script from these, but for now, an easy way would be:
from bs4 import BeautifulSoup as bs
link = 'https://www.zalando.dk/nike-sportswear-air-max-90-sneakers-ni112o0bt-a11.html'
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get(link, headers = headers)
soup = bs(r.text, 'lxml')
for i in soup.select('script[type="application/json"]'):
if '38.5' in i.text:
print(i)
break
Slower method would be:
soup.find("script", text=re.compile(r'.*38.5.*'))
Whilst I used bs4 to get the right script tag contents, this was so I knew the start and end of the string denoting the JavaScript object I wanted to use re to extract, and then to deserialize into a JSON object with json; this in a re-write to use re rather than bs4 i.e. use re on entire response text, from the request, and pass a regex pattern which would pull out the same string
I put the entire page source in a regex tool and wrote a regex to return that same string as identified above. See that regex here
Click on right hand side, match 1 group 1, to see highlighted the same string being returned from regex as you saw with BeautifulSoup. Two different ways of getting the same string containing the sizes
That is the string which I needed to examine, as JSON, the structure of. See in json viewer here
You will notice the JSON is very nested with some keys to dictionaries that are likely dynamic, meaning I needed to write code which could traverse the JSON and use certain more stable keys to pull out the colours available, and for the default shoe colour the sizes and availability
There is an expand all button in that JSON viewer. You can then search with Ctrl + F for 38.5 again
10a) I noticed that size and availability were for the default shoe colour
10b) I also noticed that within JSON, if I searched by one of the other colours from the dropdown, I could find URIs for each colour of show listed
I used Wolf as my search term (as I suspected less matches for that term within the JSON)
You can see one of the alternate colours and its URI listed above
I visited that URI and found the availability and shoe sizes for that colour in same place as I did for the default white shoes
I realised I could make an initial request and get the default colour and sizes with availability. From that same request, extract the other colours and their URIs
I could then make requests to those other URIs and re-use my existing code to extract the sizes/availability for the new colours
This is why I created my get_color_results() function. This was the re-usable code to extract the sizes and availability from each page
results holds all the matches within the JSON to certain keys I am looking for to navigate to the right place to get the sizes and availabilities, as well as the current colour
This code traverses the JSON to get to the right place to extract data I want to use later
results = []
color = ""
for i in data["graphqlCache"]:
if "ern:product" in i:
if "product" in data["graphqlCache"][i]["data"]:
if "name" in data["graphqlCache"][i]["data"]["product"]:
results.append(data["graphqlCache"][i]["data"]["product"])
if (
color == ""
and "color" in data["graphqlCache"][i]["data"]["product"]
):
color = data["graphqlCache"][i]["data"]["product"]["color"]["name"]
The following pulls out the sizes and availability from results:
{
j["size"]: j["offer"]["stock"]["quantity"]
for j in [i for i in results if "simples" in i][0]["simples"]
}
For the first request only, the following gets the other shoes colours and their URIs into a dictionary to later loop:
colors = {
j["node"]["color"]["name"]: j["node"]["uri"]
for j in [
a
for b in [
i["family"]["products"]["edges"]
for i in results
if "family" in i
if "products" in i["family"]
]
for a in b
]
}
This bit gets all the other colours and their availability:
for k, v in colors.items():
if k not in final:
color, results = get_color_results(v)
final[color] = {
j["size"]: j["offer"]["stock"]["quantity"]
for j in [i for i in results if "simples" in i][0]["simples"]
}
Throughout, I update the dictionary final with the found colour and associated size and availabilities

Always check if an hidden api is available, it will save you a looooot of time.
In this case I found this api:
https://www.zalando.dk/api/graphql
You can pass a payload and you obtain a json answer
# I extracted the payload from the network tab of my browser debbuging tools
payload = """[{"id":"0ec65c3a62f6bd0b29a59f22021a44f42e6282b7f8ff930718a1dd5783b336fc","variables":{"id":"ern:product::NI112O0S7-H11"}},{"id":"0ec65c3a62f6bd0b29a59f22021a44f42e6282b7f8ff930718a1dd5783b336fc","variables":{"id":"ern:product::NI112O0RY-A11"}}]"""
conn = http.client.HTTPSConnection("www.zalando.dk")
headers = {
'content-type': "application/json"
}
conn.request("POST", "/api/graphql", payload, headers)
res = conn.getresponse()
res = res.read() # json output
res contains for each product a json leaf containing the available size:
"simples": [
{
"size": "38.5",
"sku": "NI112O0P5-A110060000"
},
{
"size": "44.5",
"sku": "NI112O0P5-A110105000"
},
{
...
It's now easy to extract the informations.
There also is a field that indicate if the product got a promotion or not, cool if you want to track a discount.

How to traverse over JSON object

I am getting following response from an API, I want to extract Phone Number from this object in Python. How can I do that?
{
"ParsedResults": [
{
"TextOverlay": {
"Lines": [
{
"Words": [
{
"WordText": "+971555389583", //this field
"Left": 0,
"Top": 5,
"Height": 12,
"Width": 129
}
],
"MaxHeight": 12,
"MinTop": 5
}
],
"HasOverlay": true,
"Message": "Total lines: 1"
},
"TextOrientation": "0",
"FileParseExitCode": 1,
"ParsedText": "+971555389583 \r\n",
"ErrorMessage": "",
"ErrorDetails": ""
}
],
"OCRExitCode": 1,
"IsErroredOnProcessing": false,
"ProcessingTimeInMilliseconds": "308",
"SearchablePDFURL": "Searchable PDF not generated as it was not requested."**strong text**}

Store the API response to a variable. Let's call it response.
Now convert the JSON string to a Python dictionary using the json module.
import json
response_dict = json.loads(response)
Now traverse through the response_dict to get the required text.
phone_number = response_dict["ParsedResults"][0]["TextOverlay"]["Lines"][0]["Words"][0]["WordText"]
Wherever the dictionary value is an array, [0] is used to access the first element of the array. If you want to access all elements of the array, you will have to loop through the array.

You have to parse the resulting stirng into a dictionary using the library json, afterwards you can traverse the result by looping over the json structure like this:
import json
raw_output = '{"ParsedResults": [ { "Tex...' # your api response
json_output = json.loads(raw_output)
# iterate over all lists
phone_numbers = []
for parsed_result in json_output["ParsedResults"]:
for line in parsed_result["TextOverlay"]["Lines"]:
# now add all phone numbers in "Words"
phone_numbers.extend([word["WordText"] for word in line["Words"]])
print(phone_numbers)
You may want to check if all keys exist within that process, depending on the API you use, like
# ...
for line in parsed_result["TextOverlay"]["Lines"]:
if "Words" in line: # make sure key exists
phone_numbers.extend([word["WordText"] for word in line["Words"]])
# ...

Retrieve data from json using Python

I am trying to retrieve the urls from the sub elements given comp1 and comp2 as input to the python script
{
"main1": {
"comp1": {
"url": [
"http://kcdclcm.com",
"http://dacklsd.com"
]
},
"comp2": {
"url": [
"http://dccmsdlkm.com",
"http://clsdmcsm.com"
]
}
},
"main2": {
"comp3": {
"url": [
"http://csdc.com",
"http://uihjkn.com"
]
},
"comp4": {
"url": [
"http://jkll.com",
"http://ackjn.com"
]
}
}
}
Following is the snippet of the python function, I am trying to use to grab the urls
import json
data = json.load(open('test.json'))
def geturl(comp):
if comp in data[comp]:
for url in data[comp]['url']:
print url
geturl('comp1')
geturl('comp2')
I totally understand the error is in the 4th and 5th line of the script, since i am trying to grab the url information from the second element of the json data without passing the first element 'main1' or 'main2'. Same script works fine if I replace the 4th and 5th line as below:
if comp in data['main1']:
for url in data['main1'][comp]['url']:
In my case, i would not know main1 and main2 as the user would just pass comp1, comp2, comp3 and comp4 part as input to the script. Is there a way to find the url information given that only the second element is known
Any inputs would be highly appreciated.

You need to iterate through the keys/values in the dict to check if the second level key you are searching for is present:
import json
data = json.load(open('test.json'))
def geturl(comp):
for k, v in data.items():
if comp in v and 'url' in v[comp]:
print "%s" % "\n".join(v[comp]['url'])
geturl('comp1')
geturl('comp2')

If you want to search the urls with only comp key in every main, you just need to do it like this:
import json
data = json.load(open('test.json'))
def geturl(comp):
for mainKey in data:
main = data[mainKey]
if comp in main:
urls = main[comp]['url']
for url in urls:
print url
geturl('comp1')
geturl('comp2')

Traversing json array in python

I'm using urllib.request.urlopen to get a JSON response that looks like this:
{
"batchcomplete": "",
"query": {
"pages": {
"76972": {
"pageid": 76972,
"ns": 0,
"title": "Title",
"thumbnail": {
"original": "https://linktofile.com"
}
}
}
}
The relevant code to get the response:
response = urllib.request.urlopen("https://example.com?title="+object.title)
data = response.read()
encoding = response.info().get_content_charset('utf-8')
json_object = json.loads(data.decode(encoding))
I'm trying to retrieve the value of "original", but I'm having a hard time getting there.
I can do print(json_object['query']['pages'] but once I do print(json_object['query']['pages'][0] I run into a KeyError: 0.
How would I be able to, with python retrieve the value of original?

Do this instead:
my_content = json_object['query']['pages']['76972']['thumbnail']['original']
The reason is, you need to mention index as [0] only when you have list as the object. But in your case, every item is of dict type. You need to specify key instead of index
If number is dynamic, you may do:
page_content = json_object['query']['pages']
for content in page_content.values():
my_content = content['thumbnail']['original']
where my_content is the required information.

Doing [0] is looking for that key - which doesn't exist. Assuming you don't always know what the key of the page is, Try this:
pages = json_object['query']['pages']
for key, value in pages.items(): # this is python3
original = value['thumbnail']['original']
Otherwise you can simply grab it by the key if you do know (what appears to be) the pageid:
json_object['query']['pages']['76972']['thumbnail']['original']

You can iterate over keys:
for page_no in json_object['query']['pages']:
page_data = json_object['query']['pages'][page_no]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Issue Parsing a site with lxml and xpath in python - python

Related

Get specific information out of application/ld+json

How to requests all sizes in stock - Python

How to traverse over JSON object

Retrieve data from json using Python

Traversing json array in python

Categories

Resources