How do I get data from dict using xpath (Scrapy) - python

I'm trying to get the vendor id from this restaurant page using XPath, but I don't know how to get it because it's inside a dictionary. This is what I tried: //*[@class="vendor"], and then I got stuck:
<section class ="vendor" data-vendor="{"id":20707,"name":"Beard Papa\u0027s (Novena)","code":"x1jg","category":"Dessert,Bakery and Cakes","is_active":true,"timezone":"Asia\/Singapore","available_in":"2020-11-11T10:00:00+0800","city_id":1,"latitude":1.320027,"longitude":103.843897,"is_pickup_available":true,"is_delivery_available":true,"url_key":"beard-papas-novena","ddt":30,"chain":{"id":1992,"name":"Beard Papa","url_key":"beard-papa","is_accepting_global_vouchers":true,"code":"ck0vb","main_vendor_code":"v3lv"},"rating":4.6,"review_number":224,"minOrder":0,"deliveryFee":0,"is_promoted":false,"is_premium":false,"toppings":[],"accepts_instructions":true,"category_quantity":3,"loyalty_program_enabled":false,"loyalty_percentage_amount":0,"has_delivery_provider":true,"vertical":"restaurants","is_preorder_enabled":true,"is_vat_visible":true}">

The right way (as already pointed out by booleantrue) is to import json and then:
import json

data_vendor = response.xpath('//section[@class="vendor"]/@data-vendor').get()
data_vendor = json.loads(data_vendor)
vendor_id = data_vendor['id']
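The parsing step can be checked standalone: once Scrapy has unescaped the attribute, its value is plain JSON. A minimal sketch, with the attribute string shortened to a few of the fields shown above:

```python
import json

# The data-vendor attribute as Scrapy would return it: a JSON string
# (shortened here to a few of the fields from the page).
data_vendor = '{"id": 20707, "name": "Beard Papa\'s (Novena)", "code": "x1jg"}'

vendor = json.loads(data_vendor)   # parse the JSON into a dict
vendor_id = vendor['id']
print(vendor_id)  # 20707
```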

Related

Parse values using BeautifulSoup

I want to get the src URL from this HTML using Python.
I have this code to get the element:
avatar = html.find(class_ = "image-container image-container-player image-container-bigavatar")
print(avatar)
<div class="image-container image-container-player image-container-bigavatar"><img alt="twitch.tv/gh0stluko" class="image-player image-bigavatar" data-tooltip-url="/players/1227965603/tooltip" rel="tooltip-remote" src="https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/b6/b6e83cac75233208d4bb79811d768fdf17dbd46e_full.jpg" title="twitch.tv/gh0stluko"/></div>
Assuming you only want to find images, add "img" as the first argument to find().
Then you can look at the ["src"] attribute directly:
avatar = html.find("img", class_="...")
print(avatar["src"])
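For reference, the same approach runs standalone against the snippet printed above (the tag and class names come straight from that snippet):

```python
from bs4 import BeautifulSoup

html_snippet = '''
<div class="image-container image-container-player image-container-bigavatar">
  <img alt="twitch.tv/gh0stluko" class="image-player image-bigavatar"
       src="https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/b6/b6e83cac75233208d4bb79811d768fdf17dbd46e_full.jpg"
       title="twitch.tv/gh0stluko"/>
</div>'''

html = BeautifulSoup(html_snippet, 'html.parser')
# find the <img> inside the container, then read its src attribute directly
avatar = html.find("img", class_="image-player")
print(avatar["src"])
```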

Problem with lxml.xpath not putting elements into a list

So here's my problem: I'm trying to use lxml to scrape a website and get some information, but the elements the information pertains to aren't found by the var.xpath call. It finds the page, but after applying the xpath it returns nothing.
import requests
from lxml import html

def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)
    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)

if __name__ == "__main__":
    main()
OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"
its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is: []
Process finished with exit code 0
The output next to "the 3s rank is:" should look something like this
[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]
Because the xpath string does not match anything, page.xpath(..) returns an empty list. It's difficult to say exactly what you are looking for, but given the name "threesRank" I assume you want the table values, i.e. the ranking and so on.
You can get a more accurate and self-explanatory xpath using the Chrome extension "XPath Helper". Usage: open the site and activate the extension, then hold down the shift key and hover over the element you are interested in.
Since the HTML used by rocketleague.tracker.network is built dynamically with JavaScript (BootstrapVue together with Moment/Typeahead/jQuery), there is a big risk that the dynamic rendering produces different results from time to time.
Instead of scraping the rendered HTML, I suggest you use the structured data needed for the rendering, which in this case is stored as JSON in a JavaScript variable called window.__INITIAL_STATE__:
import requests
import re
import json
from contextlib import suppress

# get page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

# Extract everything needed to render the current page. Data is stored as JSON in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
json_string = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", result.text).group(1)

# convert the text string to structured json data
rocketleague = json.loads(json_string)

# Save the structured json data to a text file that helps you orient yourself and pick
# the parts you are interested in.
with open('rocketleague_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))

# Access members by name
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])

# To avoid an exception when a key is missing or an index is out of range, use "with suppress"
# as in the example below: since there is no platform no. 99, the variable "platform99"
# simply stays unassigned instead of an error being raised.
with suppress(KeyError, IndexError):
    platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
    platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']

# print platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
    print(platform['name'])

# print all titles with their corresponding platforms
for title in rocketleague['titles']['titles']:
    print(f"\nTitle: {title['name']}")
    for platform in title['platforms']:
        print(f"\tPlatform: {platform['name']}")
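The extraction step itself can be verified offline against a synthetic page (the page content below is a made-up stand-in, not the real tracker response):

```python
import re
import json

# A made-up stand-in for result.text with the same overall structure
fake_page = '''<script>
window.__INITIAL_STATE__={"route":{"path":"/profile"},"titles":{"currentTitle":{"platforms":[{"name":"Xbox"}]}}};
</script>'''

# Grab everything between the first "{" after the assignment and the closing "};"
json_string = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", fake_page).group(1)
state = json.loads(json_string)
print(state['titles']['currentTitle']['platforms'][0]['name'])  # Xbox
```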
lxml parses the page source as served, which doesn't contain the tbody elements shown in the browser inspector (the browser inserts those itself). Change your xpath to
'//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'
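This difference is easy to demonstrate offline with a tiny made-up table (not the tracker page): an xpath copied from DevTools includes a tbody that the raw HTML never contained, so it matches nothing in the parsed source.

```python
from lxml import html

# Raw HTML as served: no <tbody>, even though DevTools would show one
page = html.fromstring(
    '<html><body><table id="ranks"><tr><td>Champion II</td></tr></table></body></html>')

with_tbody = page.xpath('//table[@id="ranks"]/tbody/tr/td/text()')
without_tbody = page.xpath('//table[@id="ranks"]/tr/td/text()')

print(with_tbody)     # the DevTools-style path misses
print(without_tbody)  # the source-accurate path matches
```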

Get Item's Title Python eBay SDK

I am trying to get an item's title with the "GetSingleItem" method by providing the ItemID, but it does not work.
Here is the code:
from ebaysdk.shopping import Connection as Shopping
api = Shopping(appid='&',certid='&',devid='&',token='&')
ItemID=&
a = print (api.execute('GetSingleItem',{'ItemID':ItemID,'IncludeSelector':['Title']}))
print(a)
The response:
<ebaysdk.response.Response object at 0x003A3B10>
None
You don't need to specify the title in your request; eBay's Shopping API returns that output field by default. You can check their documentation here.
It should be noted, however, that when using 'IncludeSelector' it should come before 'ItemID', as the order seems to matter. So your code should look like this:
api.execute('GetSingleItem', {'IncludeSelector': outputField, 'ItemID': ItemID})
where outputField can be one of: Compatibility, Description, Details, ItemSpecifics, ShippingCosts, TextDescription, Variations.
To answer your question, simply execute:
response = api.execute('GetSingleItem', {'ItemID':ItemID})
title = response.dict()['Item']['Title']
print(title)
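On the ordering point above: whether ebaysdk really serializes fields in dict order is the answer's claim, but it is worth remembering that Python 3.7+ dicts preserve insertion order, so writing 'IncludeSelector' first keeps it first in the payload. A quick standalone check (the values are placeholders):

```python
# Python 3.7+ guarantees dicts keep insertion order, which is what lets
# you control the order of request fields when building the payload.
payload = {'IncludeSelector': 'Details', 'ItemID': '000000000000'}
print(list(payload))  # ['IncludeSelector', 'ItemID']
```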
I think you need to pass the ItemID like this:
{
"ItemID": "000000000000"
}

Using Beautiful Soup to create new_tag with attribute named "name"

I've got a block of XML that I need to insert some elements into
<importer in="!SRCFILE!" media="movie">
<video-out id="video_2_importer"></video-out>
<audio-out id="audio_2_importer"></audio-out>
</importer>
What I need to do is insert a few options into this block so my output looks like this:
<importer media="movie" in="!SRCFILE!">
<video-out id="video_2_importer"></video-out>
<audio-out id="audio_2_importer"></audio-out>
<option name="start-time" value="60"></option>
<option name="end-time" value="120"></option>
</importer>
I've successfully used bs4 to find the element and create new tags, but it appears the argument 'name' is a reserved word in bs4. I've tried the following:
in_point = soup.new_tag('option', **{'value':'60','name':'start-time'})
But I get the following error
TypeError: new_tag() got multiple values for keyword argument 'name'
If I remove 'name': 'start-time' from my dict, it inserts properly. If I change 'name' to any other string, it also works; the following results in proper tag creation:
in_point = soup.new_tag('option', **{'value':'60','stuff':'start-time'})
I know there is likely something syntactic I'm doing wrong that would let me use the attribute 'name', I just have no idea what.
In this case, you can create the instance of the Tag this way:
from bs4 import BeautifulSoup, Tag
in_point = Tag(builder=soup.builder,
               name='option',
               attrs={'value': '60', 'name': 'start-time'})
which is essentially what new_tag() is doing under the hood:
def new_tag(self, name, namespace=None, nsprefix=None, **attrs):
    """Create a new tag associated with this soup."""
    return Tag(None, self.builder, name, namespace, nsprefix, attrs)
Beautiful Soup tags behave like dictionaries, so you don't need the Tag import at all: create the tag first, then set each attribute the way you'd set a key: value pair in a dict.
For example: importer['media'] = "movie"
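A minimal sketch of that dictionary-style approach, sidestepping the keyword collision entirely (using the importer XML from the question):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<importer in="!SRCFILE!" media="movie"></importer>', 'html.parser')

# create the tag without the clashing 'name' keyword argument,
# then assign attributes the way you'd set dict keys
option = soup.new_tag('option')
option['name'] = 'start-time'
option['value'] = '60'
soup.importer.append(option)

print(soup)
```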

Unable to view item in browser based on its key in Python GAE

I'm using python GAE with webapp.
I have a form for a user to create a object in the database, something like:
class SpamRecord(db.Model):
    author = db.ReferenceProperty(Author, required=True)
    text = db.StringProperty()
After it's created, the user is redirected to a page whose URL contains that object's key... using code such as:
spam = SpamRecord(author=author, text=text)
spam.put()
new_spam_key = spam.key()
self.redirect("/view_spam/%s" % new_spam_key)
And this mostly works, with me being able to view items at:
sitename.com/view_spam/ag1waWNreXByZXNlbnRzchQLEgxBbm5vdW5jZW1lbnQYy8oJDA
sitename.com/view_spam/ag1waWNreXByZXNlbnRzchQLEgxBbm5vdW5jZW1lbnQY_boJDA
However, there's an occasional key that won't work. Here are 2 recent examples of pages that won't load and return HTTP 404 not found errors:
sitename.com/view_spam/ag1waWNreXByZXNlbnRzchQLEgxBbm5vdW5jZW1lbnQY-5MJDA
sitename.com/view_spam/ag1waWNreXByZXNlbnRzchQLEgxBbm5vdW5jZW1lbnQY-boJDA
My html-mappings.py contains the following mapping:
(r"/view_spam/(\w+)", ViewSpamPage)
And the ViewSpamPage looks something like:
class ViewSpamPage(webapp.RequestHandler):
    def get(self, spam_id):
        self.response.out.write("Got here")
Can anyone offer any insight as to why this is occurring and how it may be prevented?
Thanks very much!
In regular expressions, \w doesn't match hyphens (it does match underscores), so for that second pair of keys only the part of the key before the hyphen is passed to your handler.
In your URL pattern, try r"/view_spam/(.*)" instead.
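The mismatch is easy to check with Python's re module, using the keys quoted above:

```python
import re

good_key = "ag1waWNreXByZXNlbnRzchQLEgxBbm5vdW5jZW1lbnQYy8oJDA"
bad_key = "ag1waWNreXByZXNlbnRzchQLEgxBbm5vdW5jZW1lbnQY-5MJDA"

# \w+ stops at the hyphen, so only part of the second key is captured
print(re.match(r"(\w+)", good_key).group(1) == good_key)  # True
print(re.match(r"(\w+)", bad_key).group(1) == bad_key)    # False

# (.*) captures the full key, hyphen included
print(re.match(r"(.*)", bad_key).group(1) == bad_key)     # True
```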
