How to get the href value of a link with bs4? - python

I need help extracting all of the 2020/2021 match URLs from this [website][1] so that I can scrape them.
I am sending a request to this link.
The section of the HTML that I want to retrieve is this part:
Here's the code that I am using:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse

website = 'https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')

match_result = soup.find_all('a', {'class': 'match-info-link-FIXTURES'})
soup.get('href')

url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))

url_joined = []
for link_2 in url_part_2:
    url_joined.append(urllib.parse.urljoin(url_part_1, link_2))
first_link = url_joined[0]

match_url = soup.find_all('div', {'class': 'link-container border-bottom'})
soup.get('href')

url_part_3 = 'https://www.espncricinfo.com/'
url_part_4 = []
for item in match_result:
    url_part_4.append(item.get('href'))
print(url_part_4)
[1]: https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results

You don't need a second soup.find_all('a', {'class': 'match-info-link-FIXTURES'}) call inside for item in match_result: since match_result already contains the tags with the hrefs.
You can get the href with item.get('href').
You can do:
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
The result will look something like:
['/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-sunrisers-hyderabad-qualifier-2-1237180/full-scorecard',
'/series/ipl-2020-21-1210595/royal-challengers-bangalore-vs-sunrisers-hyderabad-eliminator-1237178/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-qualifier-1-1237177/full-scorecard',
'/series/ipl-2020-21-1210595/sunrisers-hyderabad-vs-mumbai-indians-56th-match-1216495/full-scorecard',
...
]
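These relative paths can then be turned into absolute URLs with urllib.parse.urljoin, as the question's own code does. A minimal sketch using the first href from the list above:

```python
from urllib.parse import urljoin

base = 'https://www.espncricinfo.com/'
relative = '/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard'

# urljoin resolves the root-relative path against the site root
full_url = urljoin(base, relative)
print(full_url)
```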

From the official docs:
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
Try
soup.find_all("a", class_="match-info-link-FIXTURES")
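As a self-contained sketch on a stand-in snippet (the class name comes from the question; the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

# A stand-in for the real page, which has many such anchors
html = '''
<a class="match-info-link-FIXTURES" href="/series/1/full-scorecard">Match 1</a>
<a class="other-link" href="/series/2/full-scorecard">Match 2</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# class_ works around "class" being a reserved word in Python
links = soup.find_all('a', class_='match-info-link-FIXTURES')
hrefs = [a.get('href') for a in links]
print(hrefs)  # ['/series/1/full-scorecard']
```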

Related

Which element to use in Selenium?

I want to find "Moderat" in <p class="text-spread-level">Moderat</p>
I have tried with id, name, xpath and link text.
Would you like to try this?
from bs4 import BeautifulSoup
import requests

sentences = []
res = requests.get(url)  # assign your url to a variable first
soup = BeautifulSoup(res.text, "lxml")
tag_list = soup.select("p.text-spread-level")
for tag in tag_list:
    sentences.append(tag.text)
print(sentences)
Find the element by class name and get the text.
el = driver.find_element_by_class_name('text-spread-level')
val = el.text
print(val)

Struggling to extract part of beautiful soup result set

I am trying to extract a url from a bs4.element.ResultSet. I have distilled the result set down to the following.
dirty_true_url = [<meta content="https://escapefromtarkov.gamepedia.com/AK-103_7.62x39_assault_rifle" property="og:url"/>]
Obviously, the url is present in the meta content tag, and is a 'pedia.com page.
In case it is required, this is the type of object I am handling:
type(dirty_true_url) #bs4.element.ResultSet
type(dirty_true_url[0]) #bs4.element.Tag
I have tried to obtain the url using variations of the following:
print(dirty_true_url('content'))
print(dirty_true_url[0]('content'))
print(dirty_true_url[0]('meta', {'content'}))
print(dirty_true_url[0]('meta', {'content' : ''}))
print(dirty_true_url[0]('meta', {'content' : 'https://'}))
print(dirty_true_url[0]('meta', {'content' : 'https://.*'}))
from urllib.parse import parse_qs, urlparse
parameters = parse_qs(urlparse(dirty_true_url).query)
print( parameters['https'][0] )
How can I extract the content/url from my soup?
If necessary, the following code can be used to obtain dirty_true_url:
#modules
import requests
from bs4 import BeautifulSoup
from io import StringIO
#body of code that gets the page
pseudo_temp_weapon_page_url = 'https://escapefromtarkov.gamepedia.com/index.php?title=AK-103&redirect=yes'
pseudo_temp_weapon_page = requests.get(pseudo_temp_weapon_page_url)
pseudo_temp_weapon_soup = BeautifulSoup(pseudo_temp_weapon_page.content, 'html.parser')
pseudo_temp_weapon_soup_as_string = StringIO(str(pseudo_temp_weapon_soup))
pseudo_temp_weapon_soup_parsed = BeautifulSoup(pseudo_temp_weapon_soup_as_string, 'html.parser')
#Snippet of html containing URL
dirty_true_url = pseudo_temp_weapon_soup_parsed.find_all(property="og:url")
You can try this:
print(dirty_true_url[0].get('content'))
You can access a tag's attributes by treating the tag like a dictionary (see the docs).
dirty_true_url = pseudo_temp_weapon_soup_parsed.find_all(property="og:url")
print(dirty_true_url[0]['content'])
output:
'https://escapefromtarkov.gamepedia.com/AK-103_7.62x39_assault_rifle'
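Both dictionary-style access and .get() work here; a self-contained sketch on the question's own meta tag:

```python
from bs4 import BeautifulSoup

html = '<meta content="https://escapefromtarkov.gamepedia.com/AK-103_7.62x39_assault_rifle" property="og:url"/>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('meta', property='og:url')

# tag['content'] raises KeyError for a missing attribute;
# tag.get(...) returns None instead
url = tag['content']
print(url)
print(tag.get('no-such-attribute'))  # None
```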

Using startswith function to filter a list of urls

I have the following piece of code, which extracts all links from a page and puts them in a list (links = []), which is then passed to the function filter_links().
I wish to filter out any links that are not from the same domain as the starting link, aka the first link in the list. This is what I have:
import requests
from bs4 import BeautifulSoup
import re
start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links

print(filter_links(links))
I have used the built-in startswith function, but it's filtering out everything except the starting URL.
Eventually I want to pass several different start URLs through this program, so I need a generic way of filtering URLs that are within the same domain as the starting URL. I think I could use regex, but this function should work too?
Try this:
import requests
from bs4 import BeautifulSoup
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links

print(filter_links(links))
Note:
You need to move the return statement out of the for loop. As written, the function returns after the first iteration, so only the first matching item ends up in the returned list.
Use the tldextract module to extract the domain name from the URLs more robustly. If you want to explicitly check whether each link starts with links[0] instead, that's up to you.
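If you'd rather avoid the extra dependency, a stdlib-only sketch of the same idea compares hosts with urllib.parse (the helper name and the sample links are made up for illustration):

```python
from urllib.parse import urlparse

start_url = 'http://www.enzymebiosystems.org/'

def same_domain(link, start=start_url):
    """True when the link's host matches the start URL's host, ignoring 'www.'.
    Relative links (empty host) are filtered out; urljoin them first if needed."""
    def host(u):
        h = urlparse(u).netloc
        return h[4:] if h.startswith('www.') else h
    return host(link) != '' and host(link) == host(start)

links = [
    'http://enzymebiosystems.org/contact-us/',
    'https://twitter.com/share',
    'http://www.enzymebiosystems.org/leadership/about/',
]
print([l for l in links if same_domain(l)])
```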
Output :
['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 
'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
Okay, so you made an indentation error in filter_links(links). The function should be like this:
def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
    return filtered_links
Notice that in your code the return statement sits inside the for loop, so the loop executes only once and then returns the list.
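The difference is easy to see in isolation (toy functions, made up for illustration):

```python
def return_inside_loop(items):
    results = []
    for item in items:
        results.append(item)
        return results  # executes on the very first iteration

def return_after_loop(items):
    results = []
    for item in items:
        results.append(item)
    return results  # executes once the loop has finished

print(return_inside_loop([1, 2, 3]))  # [1]
print(return_after_loop([1, 2, 3]))   # [1, 2, 3]
```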
Hope this helps :)
Possible Solution
What about if you kept all links which 'contain' the domain?
For example
import pandas as pd

links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

all_links = pd.DataFrame(links, columns=["Links"])
enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
# results in a dataframe with links containing "enzymebiosystems"
If you want to search multiple domains, see this answer

Extract CSS Selector for an element with Selenium

For my project I need to extract the CSS Selectors for a given element that I will find through parsing. What I do is navigate to a page with selenium and then with python-beautiful soup I parse the page and find if there are any elements that I need the CSS Selector of.
For example I may try to find any input tags with id "print".
soup.find_all('input', {'id': 'print'})
If I manage to find such an element, I want to extract its CSS selector, something like "input#print". I don't only search by id but also by a combination of classes and regular expressions.
Is there any way to achieve this?
Try this.
from scrapy.selector import Selector
from selenium import webdriver

link = "https://example.com"
xpath_desire = "normalize-space(//input[@id = 'print'])"
path1 = "./chromedriver"
driver = webdriver.Chrome(executable_path=path1)
driver.get(link)
temp_test = driver.find_element_by_css_selector("body")
elem = temp_test.get_attribute('innerHTML')
value = Selector(text=elem).xpath(xpath_desire).extract()[0]
print(value)
Ok, I am totally new to Python, so I am sure there is a better answer for this, but here's my two cents :)
import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/49168556/extract-css-selector-for-an-element-with-selenium"
element = 'a'
idName = 'nav-questions'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tags = soup.find_all(element, id=idName)
if tags:
    for tag in tags:
        getClassNames = tag.get('class')
        classNames = ''.join(str('.' + x) for x in getClassNames)
        print(element + '#' + idName + classNames)
else:
    print(':(')
This would print something like:
a#nav-questions.-link.js-gps-track
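The same idea can be wrapped in a small helper that builds a selector from any bs4 Tag; a sketch only (the helper name is made up, and it does not escape special characters in names):

```python
from bs4 import BeautifulSoup

def css_selector_for(tag):
    """Build a simple tag#id.class1.class2 selector from a bs4 Tag."""
    selector = tag.name
    if tag.get('id'):
        selector += '#' + tag['id']
    # bs4 exposes the class attribute as a list of class names
    for cls in tag.get('class', []):
        selector += '.' + cls
    return selector

html = '<a id="nav-questions" class="-link js-gps-track" href="/questions">Questions</a>'
soup = BeautifulSoup(html, 'html.parser')
selector = css_selector_for(soup.find('a'))
print(selector)  # a#nav-questions.-link.js-gps-track
```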

python passing argument containing quote

I'm learning to scrape text from the web. I've written the following function:
from bs4 import BeautifulSoup
import requests

def get_url(source_url):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # get H3 tags with class ...
    h3list = soup.findAll("h3", {"class": "entry-title td-module-title"})
    # create data structure to store links in
    ulist = []
    # pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist
I am calling this from a separate file...
from print1 import get_url
ulist = get_url("http://www.startupsmart.com.au/")
print(ulist[3])
The problem is that the css selector I am using is quite unique to the site I am parsing. So the function is a bit 'brittle'. I want to pass the css selector as an argument to the function
If I add a parameter to the function definition
def get_url(source_url, css_tag):
and try to pass "h3", {"class": "entry-title td-module-title"}, it fails with:
TypeError: get_url() takes exactly 1 argument (2 given)
I tried escaping all the quotes but it still doesn't work.
I'd really appreciate some help. I can't find a previous answer to this one.
Here's a version that works:
from bs4 import BeautifulSoup
import requests

def get_url(source_url, tag_name, attrs):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # get tags matching the given name and attributes
    h3list = soup.findAll(tag_name, attrs)
    # create data structure to store links in
    ulist = []
    # pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist
ulist = get_url("http://www.startupsmart.com.au/", "h3", {"class": "entry-title td-module-title"})
print(ulist[3])
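An alternative sketch is to accept a single CSS selector string and use soup.select, so the caller passes one argument instead of a tag name plus an attrs dict (the function name and sample HTML here are made up for illustration):

```python
from bs4 import BeautifulSoup

def get_urls(html, css_selector):
    """Pull the link from each heading matched by one CSS selector string."""
    soup = BeautifulSoup(html, 'html.parser')
    return [h.a['href'] for h in soup.select(css_selector) if h.a]

# A stand-in for the fetched page
html = '''
<h3 class="entry-title td-module-title"><a href="/article-1">One</a></h3>
<h3 class="entry-title td-module-title"><a href="/article-2">Two</a></h3>
'''
print(get_urls(html, 'h3.entry-title.td-module-title'))  # ['/article-1', '/article-2']
```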
