Parse values using BeautifulSoup - python

I want to get src url from all this html using python.
I had this code to get text:
avatar = html.find(class_="image-container image-container-player image-container-bigavatar")
print(avatar)
<div class="image-container image-container-player image-container-bigavatar"><img alt="twitch.tv/gh0stluko" class="image-player image-bigavatar" data-tooltip-url="/players/1227965603/tooltip" rel="tooltip-remote" src="https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/b6/b6e83cac75233208d4bb79811d768fdf17dbd46e_full.jpg" title="twitch.tv/gh0stluko"/></div>

Assuming you only want to find images, add "img" as the first argument to find().
Then you can look at the ["src"] attribute directly:
avatar = html.find("img", class_="...")
print(avatar["src"])


xpath extract text that is wrapped in href

I am new to using XPath and was confused about how to retrieve text that is wrapped in an <a href> element (I do not need the href itself).
URL I am using:
https://en.wikipedia.org/wiki/2022_Major_League_Soccer_season
The HTML (which I could only capture as a screenshot; see the Edit below) looks like this,
where I want to extract the text "Gonzalo Pineda". The code I have written is this:
import requests
from lxml import html
t = requests.get("https://en.wikipedia.org/wiki/2022_Major_League_Soccer_season")
dom_tree = html.fromstring(t.content)
coach = dom_tree.xpath('//table[contains(@class, "wikitable")][2]//tbody/tr/td[2]/text()')
This is the second table on the Wikipedia page, which is why I wrote //table[contains(@class, "wikitable")][2].
Then I wrote //tbody/tr/td[2]/text(), just following the HTML structure. However, the code returns " " (an empty string with "\n") without the actual text.
I tried changing the code to a[contains(@href, "wiki")] after searching through Stack Overflow, but that gave me an empty list ([]).
I would really appreciate it if you could help me with this!
Edit:
I can't copy-paste HTML because I just got that from inspect tool (F12).
The HTML line can be found in "Personnel and sponsorships" table second column. Thank you!
This is the approach you can take to extract the table. The extra pandas import is just to see how the table data is coming along.
import requests
from lxml import html
import pandas as pd  # for data storage, using pandas

t = requests.get("https://en.wikipedia.org/wiki/2022_Major_League_Soccer_season")
dom_tree = html.fromstring(t.content)
# coach = dom_tree.xpath('//table[contains(@class, "wikitable")][2]//tbody/tr/td[2]/text()')

table_header = []
table = dom_tree.xpath("//table")[2]
for e_h in table.findall(".//tr")[0]:
    table_header.append(e_h.text_content().strip())
table_data = {0: [], 1: [], 2: [], 3: [], 4: []}
for e_tr in table.findall(".//tr")[1:]:
    col_no = 0
    for e_td in e_tr.findall(".//td"):
        if e_td.text_content().strip():
            table_data[col_no].append(e_td.text_content().strip())
        else:
            table_data[col_no].append('N/A')  # handle empty cells
        if e_td.get('colspan') == '2':  # handle e.g. Colorado Rapids, where one td spans two columns
            table_data[col_no + 1].append('N/A')
            col_no += 2  # advance past the spanned column as well
        else:
            col_no += 1
for i in range(0, 5):
    table_data[table_header[i]] = table_data.pop(i)  # rename numeric keys to header names
df = pd.DataFrame(table_data)
If you just want the Head coach column, then table_data['Head coach'] will give you those values.
Your td[2]/text() returned only whitespace because the coach names sit inside <a> elements, so the td's direct text nodes are just newlines. If I understand you correctly, what you are looking for is
dom_tree.xpath('(//table[contains(@class, "wikitable")]//span[@class="flagicon"])[1]//following-sibling::span//a/text()')[0]
Output should be
'Gonzalo Pineda'
EDIT:
To get the names of all head coaches, try the below. Note the parentheses: //table[...][2] selects every wikitable that is the second table among its siblings, whereas (//table[...])[2] selects the second wikitable in the document.
for coach in dom_tree.xpath('(//table[contains(@class, "wikitable")])[2]//tbody//td[2]'):
    print(coach.xpath('.//a/text()')[0])
Another way to do it - using pandas
import pandas as pd
tab = dom_tree.xpath('(//table[contains(@class, "wikitable")])[2]')[0]
pd.read_html(html.tostring(tab))[0].loc[:, 'Head coach']
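One caveat (my addition, not part of the answer above): html.tostring() returns bytes, and newer pandas versions deprecate passing literal markup to read_html(), preferring a file-like object instead:

import io

# encoding="unicode" makes lxml return str instead of bytes
tab_html = html.tostring(tab, encoding="unicode")
df = pd.read_html(io.StringIO(tab_html))[0]
print(df.loc[:, 'Head coach'])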

Problem with lxml.xpath not putting elements into a list

So here's my problem. I'm trying to use lxml to web scrape a website and get some information, but the elements the information pertains to aren't found by the xpath() call. It finds the page, but after applying the XPath it doesn't find anything.
import requests
from lxml import html
def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)
    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)

if __name__ == "__main__":
    main()
OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"
its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is: []
Process finished with exit code 0
The output next to "the 3s rank is:" should look something like this
[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]
Because the XPath string does not match anything, page.xpath(..) returns an empty result set. It's difficult to say exactly what you are looking for, but considering "threesRank" I assume you are looking for all the table values, i.e. ranking and so on.
You can get a more accurate and self-explanatory XPath using the Chrome extension "XPath Helper". Usage: open the site and activate the extension, then hold down the shift key and hover over the element you are interested in.
Since the HTML used by rocketleague.tracker.network is built dynamically by JavaScript with BootstrapVue (and Moment/Typeahead/jQuery), there is a big risk that the dynamic rendering produces different results from time to time.
Instead of scraping the rendered HTML, I suggest you use the structured data the rendering is based on, which in this case is stored as JSON in a JavaScript variable called __INITIAL_STATE__:
import requests
import re
import json
from contextlib import suppress

# get the page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

# Extract everything needed to render the current page. The data is stored as JSON in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
json_string = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)

# convert the text string to structured json data
rocketleague = json.loads(json_string)

# Save the structured json data to a text file that helps you orient yourself and pick
# out the parts you are interested in.
with open('rocketleague_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))

# Access members using names
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])

# To avoid an exception when a key is missing or an index is out of range, use "with suppress".
# In the example below there is no platform no. 99, so the variable "platform99" simply stays
# unassigned. (Note: a bad list index raises IndexError, not KeyError, so both are suppressed.)
with suppress(KeyError, IndexError):
    platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
    platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']

# print platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
    print(platform['name'])

# print all titles with corresponding platforms
for title in rocketleague['titles']['titles']:
    print(f"\nTitle: {title['name']}")
    for platform in title['platforms']:
        print(f"\tPlatform: {platform['name']}")
The raw HTML served by the site has no <tbody> elements (browsers insert those in the dev-tools view), so lxml never sees them. Change your XPath to
'//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'

How do I get data from dict using xpath (Scrapy)

I'm trying to get the vendor id from this restaurant page using XPath, but I don't know how to get it because it's inside a dictionary. This is what I tried: //*[@class="vendor"], and then I got stuck.
<section class ="vendor" data-vendor="{"id":20707,"name":"Beard Papa\u0027s (Novena)","code":"x1jg","category":"Dessert,Bakery and Cakes","is_active":true,"timezone":"Asia\/Singapore","available_in":"2020-11-11T10:00:00+0800","city_id":1,"latitude":1.320027,"longitude":103.843897,"is_pickup_available":true,"is_delivery_available":true,"url_key":"beard-papas-novena","ddt":30,"chain":{"id":1992,"name":"Beard Papa","url_key":"beard-papa","is_accepting_global_vouchers":true,"code":"ck0vb","main_vendor_code":"v3lv"},"rating":4.6,"review_number":224,"minOrder":0,"deliveryFee":0,"is_promoted":false,"is_premium":false,"toppings":[],"accepts_instructions":true,"category_quantity":3,"loyalty_program_enabled":false,"loyalty_percentage_amount":0,"has_delivery_provider":true,"vertical":"restaurants","is_preorder_enabled":true,"is_vat_visible":true}">
The right way (as already pointed out by booleantrue) is to import json and then:
import json

data_vendor = response.xpath('//section[@class="vendor"]/@data-vendor').get()
data_vendor = json.loads(data_vendor)
vendor_id = data_vendor['id']
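Outside a Scrapy callback you can do the same with parsel, the selector library Scrapy uses under the hood (a sketch; page_html is a stand-in for the fetched page source):

import json
from parsel import Selector

sel = Selector(text=page_html)
data_vendor = sel.xpath('//section[@class="vendor"]/@data-vendor').get()
vendor = json.loads(data_vendor)
print(vendor['id'])    # 20707
print(vendor['name'])  # Beard Papa's (Novena)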

Using Beautiful Soup to create new_tag with attribute named "name"

I've got a block of XML that I need to insert some elements into
<importer in="!SRCFILE!" media="movie">
<video-out id="video_2_importer"></video-out>
<audio-out id="audio_2_importer"></audio-out>
</importer>
What I need to do is insert a few options into this block so my output looks like this:
<importer media="movie" in="!SRCFILE!">
<video-out id="video_2_importer"></video-out>
<audio-out id="audio_2_importer"></audio-out>
<option name="start-time" value="60"></option>
<option name="end-time" value="120"></option>
</importer>
I've successfully used bs4 to find the element and create new tags, but it appears the argument 'name' is a reserved word in bs4. I've tried the following:
in_point = soup.new_tag('option', **{'value':'60','name':'start-time'})
But I get the following error
TypeError: new_tag() got multiple values for keyword argument 'name'
If I remove the 'name':'start-time' from my dict, it does properly insert. If I change 'name' to any other text, it works. So doing the following results in a proper tag creation.
in_point = soup.new_tag('option', **{'value':'60','stuff':'start-time'})
I know there is likely something I'm doing wrong syntactically to allow me to use the attribute 'name'; I just have no idea what.
In this case, you can create the instance of the Tag this way:
from bs4 import BeautifulSoup, Tag
in_point = Tag(builder=soup.builder,
               name='option',
               attrs={'value': '60', 'name': 'start-time'})
which is essentially what new_tag() is doing under-the-hood:
def new_tag(self, name, namespace=None, nsprefix=None, **attrs):
    """Create a new tag associated with this soup."""
    return Tag(None, self.builder, name, namespace, nsprefix, attrs)
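As an aside (my addition; check the signature in your installed version): newer Beautiful Soup releases give new_tag() an explicit attrs parameter, which sidesteps the keyword collision without touching Tag directly:

# assumes a bs4 version whose new_tag() accepts attrs=
in_point = soup.new_tag('option', attrs={'value': '60', 'name': 'start-time'})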
Beautiful Soup uses dictionaries to build tags. You don't need a Tag import to do this. Just create an attribute as you'd create a key: value in a dictionary.
For example: importer['media'] = "movie"
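A minimal sketch of that approach applied to the XML from the question (assuming lxml is installed, since BeautifulSoup's "xml" parser needs it):

from bs4 import BeautifulSoup

xml = '''<importer in="!SRCFILE!" media="movie">
<video-out id="video_2_importer"></video-out>
<audio-out id="audio_2_importer"></audio-out>
</importer>'''

soup = BeautifulSoup(xml, 'xml')

# create the tag first, then assign attributes like dictionary keys
in_point = soup.new_tag('option')
in_point['name'] = 'start-time'
in_point['value'] = '60'
soup.importer.append(in_point)
print(soup)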

Convert YouTube link into an embed link

I am using Python and Flask, and I have some YouTube URLs I need to convert to their embed versions. For example, this:
https://www.youtube.com/watch?v=X3iFhLdWjqc
has to be converted into this:
https://www.youtube.com/embed/X3iFhLdWjqc
Should I use Regexp, or is there a Flask method to convert the URLs?
Assuming your URLs are just strings, you don't need regexes or special Flask functions to do it.
This code will replace all YouTube URLs with the embedded versions, based off of how you said it should be handled:
url = "https://youtube.com/watch?v=TESTURLNOTTOBEUSED"
url = url.replace("watch?v=", "embed/")
All you have to do is replace url with whatever variable you store the URL in.
To do this for a list, use:
new_url_list = list()
for address in old_url_list:
    new_address = address.replace("watch?v=", "embed/")
    new_url_list.append(new_address)
old_url_list = new_url_list
where old_url_list is the list which your URLs are included in.
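The same transformation reads more compactly as a list comprehension (equivalent behavior, just more idiomatic):

old_url_list = [address.replace("watch?v=", "embed/") for address in old_url_list]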
There are two types of YouTube links:
http://www.youtube.com/watch?v=xxxxxxxxxxx
or
http://youtu.be/xxxxxxxxxxx
Use this function; it will work with both kinds of YouTube links.
import re

def embed_url(video_url):
    regex = r"(?:https:\/\/)?(?:www\.)?(?:youtube\.com|youtu\.be)\/(?:watch\?v=)?(.+)"
    return re.sub(regex, r"https://www.youtube.com/embed/\1", video_url)
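A quick check of both link styles (my addition; expected output shown as comments):

print(embed_url("https://www.youtube.com/watch?v=X3iFhLdWjqc"))  # https://www.youtube.com/embed/X3iFhLdWjqc
print(embed_url("https://youtu.be/X3iFhLdWjqc"))                 # https://www.youtube.com/embed/X3iFhLdWjqc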
You can try this:
import re
videoUrl = "https://www.youtube.com/watch?v=X3iFhLdWjqc"
embedUrl = re.sub(r"(?ism).*?=(.*?)$", r"https://www.youtube.com/embed/\1", videoUrl )
print (embedUrl)
Output:
https://www.youtube.com/embed/X3iFhLdWjqc
