Grabbing first link in this code with Python

Hello, this is the code I want to grab the first link from using BeautifulSoup.
view-source:https://www.binance.com/en/blog
I want to grab the first article here, so it would be "Trust Wallet Now Supports Stellar Lumens, 4 More Tokens".
I am trying to use Python for this.
I use this code, but it grabs all the links; I only want to grab the first one:
with open('binanceblog1.html', 'w') as article:
    before13 = requests.get("https://www.binance.com/en/blog", headers=headers2)
    data1b = before13.text
    xsoup2 = BeautifulSoup(data1b, "lxml")
    for div in xsoup2.findAll('div', attrs={'class': 'title sc-62mpio-0 iIymVT'}):
        before_set13 = div.find('a')['href']
How can I do this?

The simplest solution I can think of at the moment that works with your code is to use break, because findAll returns every match:
for div in xsoup2.findAll('div', attrs={'class': 'title sc-62mpio-0 iIymVT'}):
    before_set13 = div.find('a')['href']
    break
For just the first element you can use find:
before_set13 = xsoup2.find('div', attrs={'class': 'title sc-62mpio-0 iIymVT'}).find('a')['href']
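For reference, here is a minimal self-contained sketch of the find approach, assuming requests is available and the class name from the question is still current (Binance regenerates these CSS class names, so it may change):
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.binance.com/en/blog")
soup = BeautifulSoup(r.text, "lxml")
# find() returns only the first matching div, or None if nothing matches
first_div = soup.find('div', attrs={'class': 'title sc-62mpio-0 iIymVT'})
if first_div is not None:
    print(first_div.find('a')['href'])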

Try extracting the href from the 'Read more' button:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.binance.com/en/blog')
soup = BeautifulSoup(r.text, "html.parser")
div = soup.find('div', attrs={'class': 'read-btn sc-62mpio-0 iIymVT'})
print(div.find('a')['href'])

You can assess the situation inside the loop and break when you find a satisfactory result.
for div in xsoup2.findAll('div', attrs={'class': 'title sc-62mpio-0 iIymVT'}):
    before_set13 = div.find('a')['href']
    if before_set13 != '/en/blog':
        break
    print('skipping ' + before_set13)
print('grab ' + before_set13)
Output of the code with these changes:
skipping /en/blog
grab /en/blog/317619349105270784/Trust-Wallet-Now-Supports-Stellar-Lumens-4-More-Tokens

Use the class name with a class CSS selector (.) for the content section, then a descendant combinator with a type CSS selector to specify the child a tag element. select_one returns the first match:
soup.select_one('.content a')['href']
Code:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.binance.com/en/blog')
soup = bs(r.content, 'lxml')
link = soup.select_one('.content a')['href']
print('https://www.binance.com' + link)

Related

Which element to use in Selenium?

I want to find "Moderat" in <p class="text-spread-level">Moderat</p>
I have tried with id, name, xpath and link text.
Would you like to try this?
from bs4 import BeautifulSoup
import requests
sentences = []
res = requests.get(url)  # assign your url to a variable
soup = BeautifulSoup(res.text, "lxml")
tag_list = soup.select("p.text-spread-level")
for tag in tag_list:
    sentences.append(tag.text)
print(sentences)
Find the element by class name and get the text.
el=driver.find_element_by_class_name('text-spread-level')
val=el.text
print(val)
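If the page builds that paragraph with JavaScript, a plain lookup can race the render. A hedged sketch using an explicit wait (the url variable is assumed):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(url)  # assign your url to a variable
# wait up to 10 seconds for the element to appear before reading its text
el = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'text-spread-level'))
)
print(el.text)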

Extract CSS Selector for an element with Selenium

For my project I need to extract the CSS Selectors for a given element that I will find through parsing. What I do is navigate to a page with selenium and then with python-beautiful soup I parse the page and find if there are any elements that I need the CSS Selector of.
For example I may try to find any input tags with id "print".
soup.find_all('input', {'id': 'print'})
If I manage to find such an element I want to extract its CSS selector, something like "input#print". I don't just search by id but also by a combination of classes and regular expressions.
Is there any way to achieve this?
Try this.
from scrapy.selector import Selector
from selenium import webdriver
link = "https://example.com"
xpath_desire = "normalize-space(//input[@id = 'print'])"
path1 = "./chromedriver"
driver = webdriver.Chrome(executable_path=path1)
driver.get(link)
temp_test = driver.find_element_by_css_selector("body")
elem = temp_test.get_attribute('innerHTML')
value = Selector(text=elem).xpath(xpath_desire).extract()[0]
print(value)
OK, I am totally new to Python so I am sure that there is a better answer for this, but here's my two cents :)
import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/49168556/extract-css-selector-for-an-element-with-selenium"
element = 'a'
idName = 'nav-questions'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find_all(element, id=idName)
if tags:
    for tag in tags:
        getClassNames = tag.get('class')
        classNames = ''.join('.' + x for x in getClassNames)
        print(element + '#' + idName + classNames)
else:
    print(':(')
This would print something like:
a#nav-questions.-link.js-gps-track
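The same idea can be wrapped in a small helper; here is a sketch (the function name css_selector is my own, and it builds a simple tag#id.class selector rather than a guaranteed-unique path):
def css_selector(tag):
    # build "tag#id.class1.class2" from a bs4 Tag object
    selector = tag.name
    if tag.get('id'):
        selector += '#' + tag['id']
    for cls in tag.get('class', []):  # 'class' is stored as a list of names
        selector += '.' + cls
    return selector
Called as css_selector(soup.find('a', id='nav-questions')), it would return the string shown above.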

Getting a specific div tag with Beautful soup

Usually I would just call the div by a class name, but it's not unique. The only unique thing the div tag has is the attribute "data-sc-replace" right after div. This is a shortened example of the source code:
<div data-sc-replace data-sc-slot="1234" class="inlineblock" data-sc-params="{'magnet': 'magnet:?......', 'extension': 'epub', 'stream': '' }"></div>
How would I go about selecting the div by "data-sc-replace" if it's not attached to a class or an id?
This is the code I have
import requests
from bs4 import BeautifulSoup
url_to_scrape = "http://example.com"
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html5lib")
list = soup.findAll('div', {'class':'inlineblock'})
print(list)
# list = soup.findAll("div", "data-sc-params")
# list = soup.find('data-sc-replace')
# list = soup.find('data-sc-params')
# list = soup.find('div', {'class':'inlineblock'}, 'data-sc-params')
Use a CSS selector. This finds all divs with a data-sc-replace attribute:
result = soup.select('div[data-sc-replace]')
That distinctive mark seems to be an HTML attribute without value. So try this:
soup.find('div', attrs={'data-sc-replace': ''})
# or use find_all() to get all such div containers
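Both approaches can be checked against a minimal inline document. A sketch (html.parser stores a valueless attribute as an empty string, which is why the attrs form matches):
from bs4 import BeautifulSoup

html = '<div data-sc-replace data-sc-slot="1234" class="inlineblock"></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('div[data-sc-replace]'))                   # CSS attribute-presence selector
print(soup.find_all('div', attrs={'data-sc-replace': ''}))   # attrs dict form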

Webcrawler BeautifulSoup - how do I get titles from links without class tags

The site I am trying to gather data from is http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm. Right now I want to get all the titles of the movies on this page and later move onto the rest of the data (studio, etc.) and additional data inside each of the links. This is what I have so far:
import requests
from bs4 import BeautifulSoup
from urllib2 import urlopen

def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2015&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'div': 'body'}):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            title = link.string
            print title
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('section', {'id': 'postingbody'}):
        print item_name.text

trade_spider(1)
The section I am having trouble with is:
for link in soup.findAll('a', {'div': 'body'}):
    href = 'http://www.boxofficemojo.com' + link.get('href')
The issue is that on the site, there's no identifying class that all the links belong to. The links just have an "<a href>" tag.
How can I get all the titles of the links on this page?
One possible way is to use the .select() method, which accepts a CSS selector as its parameter:
for link in soup.select('td > b > font > a[href^=/movies/?]'):
    ......
    ......
A brief explanation of the CSS selector being used (a runnable sketch follows the list):
td > b : find all td elements, then from each td find the direct child b elements
> font : from the filtered b elements, find the direct child font elements
> a[href^=/movies/?] : from the filtered font elements, return the direct child a elements whose href attribute value starts with "/movies/?"
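A minimal self-contained sketch of that answer; note that newer BeautifulSoup/soupsieve versions require quotes around attribute values containing characters such as / or ?, so the value is quoted here:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.boxofficemojo.com/yearly/chart/?yr=2015&p=.htm')
soup = BeautifulSoup(r.text, 'html.parser')
# quoting "/movies/?" is required by newer soupsieve versions
for link in soup.select('td > b > font > a[href^="/movies/?"]'):
    print(link.string, 'http://www.boxofficemojo.com' + link['href'])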
Sorry for not giving a full answer, but here's a clue.
I have made-up names for these kinds of scraping problems.
When I use the find() and find_all() methods I call it Abstract Identification, since you can get random data when the tag class/id names are not data oriented.
Then there's Nested Identification. That's when you have to find data without using the find() and find_all() methods, and instead literally crawl through a nest of tags. This requires more proficiency in BeautifulSoup.
Nested Identification is a longer process that's generally messy, but is sometimes the only solution.
So how to do it? When you have hold of a <class 'bs4.element.Tag'> object, you can locate tags that are stored as attributes of the tag object:
from bs4 import element, BeautifulSoup as BS

html = '' +\
    '<body>' +\
    '<h3>' +\
    '<p>Some text to scrape</p>' +\
    '<p>Some text NOT to scrape</p>' +\
    '</h3>' +\
    '\n\n' +\
    '<strong>' +\
    '<p>Some more text to scrape</p>' +\
    '\n\n' +\
    '<a href="http://example.com">Some Important Link</a>' +\
    '</strong>' +\
    '</body>'

soup = BS(html, 'html.parser')

# Starting point to extract a link
h3_tag = soup.find('h3')   # finds the first h3 tag in the soup object
child_of_h3__p = h3_tag.p  # locates the first p tag in the h3 tag

# climbing in the nest
child_of_h3__forbidden_p = h3_tag.p.next_sibling
# or
#child_of_h3__forbidden_p = child_of_h3__p.next_sibling

# sometimes `.next_sibling` will yield '' or '\n'; think of this element as a
# tag separator, in which case you need to continue using `.next_sibling`
# to get past the separator and onto the tag.

# Grab the tag below the h3 tag, which is the strong tag.
# We need to go up 1 tag, and down 2 from our current object
# (down 2 so we skip the tag separator).
tag_below_h3 = child_of_h3__p.parent.next_sibling.next_sibling

# Here are 3 different ways to get to the link tag using Nested Identification

# 1.) getting a list of children from our object
children_tags = tag_below_h3.contents
p_tag = children_tags[0]
tag_separator = children_tags[1]
a_tag = children_tags[2]  # or children_tags[-1] to get the last tag
print(a_tag)
print('1.) We found the link: %s' % a_tag['href'])

# 2.) there's only 1 <a> tag, so we can just grab it directly
a_href = tag_below_h3.a['href']
print('\n2.) We found the link: %s' % a_href)

# 3.) using next_sibling to crawl
tag_separator = tag_below_h3.p.next_sibling
a_tag = tag_below_h3.p.next_sibling.next_sibling  # or tag_separator.next_sibling
print('\n3.) We found the link: %s' % a_tag['href'])

print('\nWe also found a tag separator: %s' % repr(tag_separator))
# our tag separator is a NavigableString
if type(tag_separator) == element.NavigableString:
    print('\nNavigableStrings are usually plain text that resides inside a tag.')
    print('In this case however it is a tag separator.\n')
Now, if I remember right, accessing a tag separator gives you a NavigableString rather than a Tag, in which case you need to pass it through BeautifulSoup again to be able to use methods such as find(). To check for this you can do as follows:
from bs4 import element, BeautifulSoup

# ... do some BeautifulSoup data mining
# ... reach a NavigableString object
if type(formerly_a_tag_obj) == element.NavigableString:
    formerly_a_tag_obj = BeautifulSoup(formerly_a_tag_obj, 'html.parser')  # is now a soup

Beautifulsoup to retrieve the href list

Thanks for your attention!
I'm trying to retrieve the hrefs of products in a search result.
For example this page:
However, when I narrow down to the product image class, the retrieved hrefs are image links.
Can anyone solve that? Thanks in advance!
url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
content = urllib2.urlopen(url).read()
content = preprocess_yelp_page(content)
soup = BeautifulSoup(content)
content = soup.findAll('div',{'class':'content dynamic'})
draft = str(content)
soup = BeautifulSoup(draft)
items = soup.findAll('div',{'class':'cell_section1'})
draft = str(items)
soup = BeautifulSoup(draft)
content = soup.findAll('div',{'class':'product-image'})
draft = str(content)
soup = BeautifulSoup(draft)
You don't need to load the content of each found tag with BeautifulSoup over and over again.
Use a CSS selector to get all product links (an a tag directly under a div with class="product-image"):
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
soup = BeautifulSoup(urllib2.urlopen(url))
for link in soup.select('div.product-image > a:nth-of-type(1)'):
    print link.get('href')
Prints:
http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371
http://www.homedepot.com/p/Husky-26-in-6-Drawer-Chest-and-Cabinet-Combo-Black-C-296BF16/203420937
http://www.homedepot.com/p/Husky-52-in-18-Drawer-Tool-Chest-and-Cabinet-Set-Black-HOTC5218B1QES/204825971
http://www.homedepot.com/p/Husky-26-in-4-Drawer-All-Black-Tool-Cabinet-H4TR2R/204648170
...
The div.product-image > a:nth-of-type(1) CSS selector matches the first a tag directly under each div with class product-image.
To save the links into a list, use a list comprehension:
links = [link.get('href') for link in soup.select('div.product-image > a:nth-of-type(1)')]
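A sketch of a slightly more robust variant: urljoin handles both relative and absolute hrefs, where plain string concatenation would break on absolute ones (urlparse is the Python 2 module, matching the answer's urllib2 usage; in Python 3 it is urllib.parse):
from urlparse import urljoin

base = 'http://www.homedepot.com'
links = [urljoin(base, link.get('href'))
         for link in soup.select('div.product-image > a:nth-of-type(1)')]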
