For my project I need to extract the CSS selector for a given element that I find through parsing. What I do is navigate to a page with Selenium and then parse the page with BeautifulSoup in Python to check whether there are any elements that I need the CSS selector of.
For example I may try to find any input tags with id "print".
soup.find_all('input', {'id': 'print'})
If I manage to find such an element I want to extract its CSS selector, something like "input#print". I don't just search by id but also by a combination of classes and regular expressions.
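For instance, a lookup that mixes a class with a regular expression might look like this (the class name and pattern here are only illustrative):
import re
# hypothetical class name and id pattern, just to show the kind of query I mean
soup.find_all('input', {'id': re.compile(r'^print'), 'class': 'print-button'})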
Is there any way to achieve this?
Try this.
from scrapy.selector import Selector
from selenium import webdriver
link = "https://example.com"
xpath_desire = "normalize-space(//input[@id = 'print'])"
path1 = "./chromedriver"
driver = webdriver.Chrome(executable_path=path1)
driver.get(link)
temp_test = driver.find_element_by_css_selector("body")
elem = temp_test.get_attribute('innerHTML')
value = Selector(text=elem).xpath(xpath_desire).extract()[0]
print(value)
Ok, I am totally new to Python so I am sure that there is a better answer for this, but here's my two cents :)
import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/49168556/extract-css-selector-for-an-element-with-selenium"
element = 'a'
idName = 'nav-questions'

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find_all(element, id=idName)

if tags:
    for tag in tags:
        getClassNames = tag.get('class')
        classNames = ''.join(str('.' + x) for x in getClassNames)
        print element + '#' + idName + classNames
else:
    print ':('
This would print something like:
a#nav-questions.-link.js-gps-track
I need help extracting the URLs of all the 2020/21 season matches from this [website][1] so that I can scrape them.
I am sending a request to this link.
The section of the HTML that I want to retrieve is this part:
Here's the code that I am using:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse

website = 'https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')

match_result = soup.find_all('a', {'class': 'match-info-link-FIXTURES'})
soup.get('href')

url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))

url_joined = []
for link_2 in url_part_2:
    url_joined.append(urllib.parse.urljoin(url_part_1, link_2))

first_link = url_joined[0]

match_url = soup.find_all('div', {'class': 'link-container border-bottom'})
soup.get('href')

url_part_3 = 'https://www.espncricinfo.com/'
url_part_4 = []
for item in match_result:
    url_part_4.append(item.get('href'))

print(url_part_4)
[1]: https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results
You don't need the second soup.find_all call or the second for item in match_result: loop, since you already have the tags with the hrefs from the first find_all.
You can get the href with item.get('href').
You can do:
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
The result will look something like:
['/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-sunrisers-hyderabad-qualifier-2-1237180/full-scorecard',
'/series/ipl-2020-21-1210595/royal-challengers-bangalore-vs-sunrisers-hyderabad-eliminator-1237178/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-qualifier-1-1237177/full-scorecard',
'/series/ipl-2020-21-1210595/sunrisers-hyderabad-vs-mumbai-indians-56th-match-1216495/full-scorecard',
...
]
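If you then want absolute URLs, you can keep your own urljoin step on top of that list, e.g.:
import urllib.parse
url_joined = [urllib.parse.urljoin('https://www.espncricinfo.com/', link) for link in url_part_2]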
From the official docs:
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
Try
soup.find_all("a", class_="match-info-link-FIXTURES")
Hello, this is the page I want to grab the first link from using BeautifulSoup.
view-source:https://www.binance.com/en/blog
I want to grab the first article here so it would be "Trust Wallet Now Supports Stellar Lumens, 4 More Tokens"
I am trying to use Python for this.
I use this code, but it grabs all the links; I only want to grab the first one.
with open('binanceblog1.html', 'w') as article:
    before13 = requests.get("https://www.binance.com/en/blog", headers=headers2)
    data1b = before13.text
    xsoup2 = BeautifulSoup(data1b, "lxml")
    for div in xsoup2.findAll('div', attrs={'class':'title sc-0 iaymVT'}):
        before_set13 = div.find('a')['href']
How can I do this?
The simplest solution I can think of at the moment that works with your code is to use break, because findAll returns every match:
for div in xsoup2.findAll('div', attrs={'class':'title sc-62mpio-0 iIymVT'}):
    before_set13 = div.find('a')['href']
    break
For just the first element you can use find
before_set13 = soup.find('div', attrs={'class':'title sc-62mpio-0 iIymVT'}).find('a')['href']
Try extracting the href from the 'Read more' button:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.binance.com/en/blog')
soup = BeautifulSoup(r.text, "html.parser")
div = soup.find('div', attrs={'class': 'read-btn sc-62mpio-0 iIymVT'})
print(div.find('a')['href'])
You can assess the situation inside the loop and break when you find a satisfactory result.
for div in xsoup2.findAll('div', attrs={'class':'title sc-62mpio-0 iIymVT'}):
    before_set13 = div.find('a')['href']
    if before_set13 != '/en/blog':
        break
    print('skipping ' + before_set13)
print('grab ' + before_set13)
Output of the code with these changes:
skipping /en/blog
grab /en/blog/317619349105270784/Trust-Wallet-Now-Supports-Stellar-Lumens-4-More-Tokens
Use the class name with a class CSS selector (.) for the content section, then a descendant combinator with a type selector to specify the child a tag element. select_one returns the first match.
soup.select_one('.content a')['href']
Code:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.binance.com/en/blog')
soup = bs(r.content, 'lxml')
link = soup.select_one('.content a')['href']
print('https://www.binance.com' + link)
Thanks for your attention!
I'm trying to retrieve the hrefs of products in a search result.
For example this page:
However, when I narrow down to the product image class, the retrieved hrefs are image links.
Can anyone solve that? Thanks in advance!
url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
content = urllib2.urlopen(url).read()
content = preprocess_yelp_page(content)
soup = BeautifulSoup(content)
content = soup.findAll('div',{'class':'content dynamic'})
draft = str(content)
soup = BeautifulSoup(draft)
items = soup.findAll('div',{'class':'cell_section1'})
draft = str(items)
soup = BeautifulSoup(draft)
content = soup.findAll('div',{'class':'product-image'})
draft = str(content)
soup = BeautifulSoup(draft)
You don't need to load the content of each found tag with BeautifulSoup over and over again.
Use CSS selectors to get all product links (a tag under a div with class="product-image")
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
soup = BeautifulSoup(urllib2.urlopen(url))
for link in soup.select('div.product-image > a:nth-of-type(1)'):
    print link.get('href')
Prints:
http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371
http://www.homedepot.com/p/Husky-26-in-6-Drawer-Chest-and-Cabinet-Combo-Black-C-296BF16/203420937
http://www.homedepot.com/p/Husky-52-in-18-Drawer-Tool-Chest-and-Cabinet-Set-Black-HOTC5218B1QES/204825971
http://www.homedepot.com/p/Husky-26-in-4-Drawer-All-Black-Tool-Cabinet-H4TR2R/204648170
...
The div.product-image > a:nth-of-type(1) CSS selector matches the first a tag directly under each div with class product-image.
To save the links into a list, use a list comprehension:
links = [link.get('href') for link in soup.select('div.product-image > a:nth-of-type(1)')]
I would like to get the links to all of the elements in the first column in this page (http://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama).
I am comfortable using BeautifulSoup, but it seems less well-suited to this task (I've been trying to access the first child of the contents of each tr but that hasn't been working so well).
The xpaths follow a regular pattern, the row number updating for each new row in the following expression:
xpath = '//*[@id="mw-content-text"]/table[1]/tbody/tr[' + str(counter) + ']/td[1]/a'
Would someone help me by posting a means of iterating through the rows to get the links?
I was thinking something along these lines:
urls = []
while counter < 100:
    urls.append(get the xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[' + str(counter) + ']/td[1]/a'))
    counter += 1
Thanks!
Here's an example of how you can get all of the links from the first column:
from lxml import etree
import requests
URL = "http://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama"
response = requests.get(URL)
parser = etree.HTMLParser()
tree = etree.fromstring(response.text, parser)
for row in tree.xpath('//*[@id="mw-content-text"]/table[1]/tr'):
    links = row.xpath('./td[1]/a')
    if links:
        link = links[0]
        print link.text, link.attrib.get('href')
Note that tbody is appended by the browser; lxml won't see this tag (just skip it in the xpath).
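For example, the path copied from browser devtools and the path lxml actually needs differ only in that tbody:
# path as browser devtools shows it (tbody inserted by the browser)
'//*[@id="mw-content-text"]/table[1]/tbody/tr'
# the same path against the raw HTML that lxml parses
'//*[@id="mw-content-text"]/table[1]/tr'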
Hope that helps.
This should work:
from lxml import html
urls = []
parser = html.parse("http://url/to/parse")
for element in parser.xpath(your_xpath_query):
    urls.append(element.attrib['href'])
You could also access the href attribute in the XPath query directly, e.g.:
for href in parser.xpath("//a/@href"):
    urls.append(href)
The page you linked to does not seem to have content at the XPath you specified. Here is a different XPath which does the job:
import urllib2
import lxml.html as LH
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', "Mozilla/5.0")]
url = 'http://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama'
xpath = '//table[@class="wikitable sortable"]//tr/td[1]/a/@href'
doc = LH.parse(opener.open(url))
urls = doc.xpath(xpath)
print(urls)
Maybe you are looking for something like:
urls = []
while True:
    try:
        counter = len(urls)+1
        (node,) = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[' + str(counter) + ']/td[1]/a')
        urls.append(node)
    except ValueError:
        break
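This assumes tree has already been built from the page; a minimal sketch of that setup with lxml would be:
from lxml import html
# parse the page into an element tree directly from its URL
tree = html.parse('http://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama')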
The link that contains 'alpha' in the URL has many links (hrefs) which I would like to collect from 20 different pages and paste onto the end of the general URL (second to last line). The hrefs are found in a table: the td has the class mys-elastic mys-left, and the a is obviously the element which contains the href attribute. Any help would be greatly appreciated; I have been working at this for about a week.
for i in range(1, 11):
    # The HTML scraper for the 20 pages that list all the exhibitors
    url = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=' + str(i) + '#GotoResults'
    print url
    list_html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(list_html)
    href_element = root.cssselect('td.mys-elastic mys-left a')

    for element in href_element:
        # Convert HTML to lxml object
        href = href_element.get('href')
        print href
        page_html = scraperwiki.scrape('http://ahr13.mapyourshow.com' + href)
        print page_html
No need to muck about with JavaScript - it's all there in the HTML:
import scraperwiki
import lxml.html
html = scraperwiki.scrape('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1')
root = lxml.html.fromstring(html)

# get the links
hrefs = root.xpath('//td[@class="mys-elastic mys-left"]/a')

for href in hrefs:
    print 'http://ahr13.mapyourshow.com' + href.attrib['href']
import lxml.html as lh
from itertools import chain
URL = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page='
BASE = 'http://ahr13.mapyourshow.com'
path = '//table[2]//td[@class="mys-elastic mys-left"]//@href'
results = []
for i in range(1,21):
    doc = lh.parse(URL + str(i))
    results.append(BASE + i for i in doc.xpath(path))
print list(chain(*results))