My goal is to scrape the href links from the base_url site.
My code:
from bs4 import BeautifulSoup
from selenium import webdriver
import requests, csv, re
game_links = []
link_pages = []
base_url = "http://www.basket.fi/sarjat/ohjelma_tulokset/?season_id=93783&league_id=4#mbt:2-303$f&stage=177155:$p&0="
browser = webdriver.PhantomJS()
browser.get(base_url)
table = BeautifulSoup(browser.page_source, 'lxml')
for game in table.find_all("a", {'game_id': re.compile(r'\d+')}):
    href = game.get("href")
    print(href)
Result:
http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4
http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4
http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4
http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4
......
The problem is that I can't understand why each href link always appears twice in the result.
As you can notice in the image, there are two links with the same game_id.
Modified Code:
This will help you get each link only once:
for game in table.find_all("a", {'game_id': re.compile(r'\d+')}):
    if game.children:
        href = game.get("href")
        print(href)
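If the duplicates come from two separate <a> tags pointing at the same game, another option is to deduplicate the collected hrefs while keeping their original order. A plain-Python sketch, independent of BeautifulSoup (the sample hrefs below are taken from the output above):

```python
def dedupe_preserve_order(links):
    """Return links with duplicates removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for link in links:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique

hrefs = [
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4",
]
print(dedupe_preserve_order(hrefs))  # each URL printed once
```

Unlike `list(set(hrefs))`, this keeps the links in the order they appear on the page.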
Related
I am trying to webscrape the following webpage to get a specific href using BS4. They've just changed the page layout and due to that I am unable to parse it correctly. Hope anyone can help.
Webpage trying to scrape: https://www.temit.co.uk/investor/resources/temit-literature
Tag trying to get: href="https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet"
Currently I am using the following code (BS4), however, I am unable to get the specific element anymore.
url = "https://www.temit.co.uk/investor/resources/temit-literature"
page = requests.get(url) # Requests website
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find_all('div', attrs={'class':'row ng-star-inserted'})
url_TEM = table[1].find_all('a')[0].get('href')
url_TEM = 'https://www.temit.co.uk' + url_TEM
The page is rendered dynamically, which is why I use Selenium together with BeautifulSoup, and I get the desired output as follows:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
url = "https://www.temit.co.uk/investor/resources/temit-literature"
driver.get(url)
time.sleep(8)
soup = BeautifulSoup(driver.page_source, 'html.parser')
table_urls = soup.select('table.table-striped tbody tr td a')
for table_url in table_urls:
    url = table_url['href']
    print(url)
Output:
https://franklintempletonprod.widen.net/s/gd2tmrc8cl/final-temit-annual-report-31-march-2021
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://www.londonstockexchange.com/stock/TEM/templeton-emerging-markets-investment-trust-plc/analysis
https://franklintempletonprod.widen.net/s/c667rc7chx/173-fact-sheet-retail-investor-factsheet
https://franklintempletonprod.widen.net/s/bdxrtlljxg/temit_manager_update
https://franklintempletonprod.widen.net/s/6djgk6xknx/temit-holdings-report
/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf
https://franklintempletonprod.widen.net/s/flblzqxmcg/temit-investor-disclosure-document.pdf-26-04-2021
https://franklintempletonprod.widen.net/s/l5gmdbf6vp/agm-results-announcement
https://franklintempletonprod.widen.net/s/qmkphz9s5s/agm-shareholder-documentation
https://franklintempletonprod.widen.net/s/b258tgpjb7/agm-uk-voting-guide
https://franklintempletonprod.widen.net/s/c5zhbbnxql/agm-nz-voting-guide
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/shmljbvtjq/temit-annual-report-mar-2019
https://franklintempletonprod.widen.net/s/5bjq2qkmh5/temit-annual-report-mar-2018
https://franklintempletonprod.widen.net/s/bnx9mfwlzw/temit-annual-report-mar-2017
https://franklintempletonprod.widen.net/s/rfqc7xrnfn/temit-annual-report-mar-2016
https://franklintempletonprod.widen.net/s/zfzxlflxnq/temit-annual-report-mar-2015
https://franklintempletonprod.widen.net/s/dj9zl8rpcm/temit-annual-report-mar-2014
https://franklintempletonprod.widen.net/s/7xshxmkpnh/temit-annual-report-mar-2013
https://franklintempletonprod.widen.net/s/7gwx2qmcdr/temit-annual-report-mar-2012
https://franklintempletonprod.widen.net/s/drpd7gbvxl/temit-annual-report-mar-2011
https://franklintempletonprod.widen.net/s/2pb2kxkgbl/temit-annual-report-mar-2010
https://franklintempletonprod.widen.net/s/g6pdr9hq2d/temit-annual-report-mar-2009
https://franklintempletonprod.widen.net/s/7pvjf6fhl9/temit-annual-report-mar-2008
https://franklintempletonprod.widen.net/s/lzrcnmhpvr/temit-semi-annual-report-temit-semi-annual-report-30-09-2020
https://franklintempletonprod.widen.net/s/xwvrncvkj2/temit-half-year-report-sept-2019
https://franklintempletonprod.widen.net/s/lbp5ssv8mc/temit-half-year-report-sept-2018
https://franklintempletonprod.widen.net/s/hltddqhqcf/temit-half-year-report-sept-2017
https://franklintempletonprod.widen.net/s/2tlqxxflgn/temit-half-year-report-sept-2016
https://franklintempletonprod.widen.net/s/lbcgztjjkj/temit-half-year-report-sept-2015
https://franklintempletonprod.widen.net/s/2tjxzgbdvx/temit-half-year-report-sept-2014
https://franklintempletonprod.widen.net/s/gzrpjwb7bf/temit-half-year-report-sept-2013
https://franklintempletonprod.widen.net/s/lxhbdrmc8z/temit-half-year-report-sept-2012
https://franklintempletonprod.widen.net/s/zzpxrrrpmc/temit-half-year-report-sept-2011
https://franklintempletonprod.widen.net/s/zjdd2gn5jc/temit-half-year-report-sept-2010
https://franklintempletonprod.widen.net/s/7sbqfxxkrd/temit-half-year-report-sept-2009
https://franklintempletonprod.widen.net/s/pvswpqkdvb/temit-half-year-report-sept-2008-1
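Note that one of the extracted hrefs (/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf) is relative. If you need absolute URLs, the standard library's urljoin can resolve such paths against the page URL; a small sketch using two hrefs from the output above:

```python
from urllib.parse import urljoin

base = "https://www.temit.co.uk/investor/resources/temit-literature"
hrefs = [
    "https://franklintempletonprod.widen.net/s/6djgk6xknx/temit-holdings-report",
    "/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf",
]

# urljoin leaves absolute URLs untouched and resolves relative ones against base
absolute = [urljoin(base, href) for href in hrefs]
print(absolute)
```

The first URL is returned unchanged; the root-relative path becomes https://www.temit.co.uk/content-kid/kid/en-GB/KID-GB0008829292-GB-en-GB.pdf.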
I'm trying to scrape a link in the video description on YouTube, but the list always comes back empty.
I've tried changing the tag I'm scraping from, but there is no change in either the output or the error message.
Here's the code I'm using:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/watch?v=gqUqGaXipe8').text
soup = BeautifulSoup(source, 'lxml')
link = [i['href'] for i in soup.findAll('a', class_='yt-simple-endpoint style-scope yt-formatted-string', href=True)]
print(link)
What is wrong, and how can I solve it?
In your case, requests doesn't return the fully rendered HTML of the page. Since YouTube fills in the data using JavaScript, we must run the page through a real browser to get its source, such as headless Chrome via the Selenium library. Here is the general solution:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
url = "https://www.youtube.com/watch?v=Oh1nqnZAKxw"
driver.get(url)
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
link = [i['href'] for i in soup.select('div#meta div#description [href]')]
print(link)
I'm looking to extract all of the brands from this page using Beautiful Soup. My program so far is:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
def main():
    opts = Options()
    opts.headless = True
    assert opts.headless  # Operating in headless mode
    browser = Firefox(options=opts)
    browser.get('https://neighborhoodgoods.com/pages/brands')
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    brand = []
    for tag in soup.find('table'):
        brand.append(tag.contents.text)
    print(brand)
    browser.close()
    print('This program is terminated.')
I'm struggling with figuring out the right tag to use as the data is nested in tr/td. Any advice? Thanks so much!
If I understand your question correctly, you only want to get the company name (the first <td> of each row).
Try using the CSS selector td:nth-of-type(1), which selects the first <td> of every row.
import requests
from bs4 import BeautifulSoup
URL = "https://neighborhoodgoods.com/pages/brands"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print([tag.text for tag in soup.select("td:nth-of-type(1)")])
Output:
['A.N Other', 'Act + Acre', ...And on.. , 'Wild One']
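The nth-of-type behavior can be checked on a minimal static snippet (the HTML below is a made-up example, just to illustrate the selector):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>A.N Other</td><td>Apparel</td></tr>
  <tr><td>Act + Acre</td><td>Beauty</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# nth-of-type counts among siblings, so this matches the first <td> of each <tr>
print([td.text for td in soup.select("td:nth-of-type(1)")])
```

This prints only the first column, ['A.N Other', 'Act + Acre'], skipping the category cells.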
I am trying to make a list of the links that are inside a product page.
I have multiple links through which I want to get the links of the product page.
I am just posting the code for a single link.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://funskoolindia.com/products.php?search=9723100")
soup = BeautifulSoup(r.content, 'html.parser')
for a_tag in soup.find_all('a', class_='product-bg-panel', href=True):
    print('href: ', a_tag['href'])
This is what it should print: https://funskoolindia.com/product_inner_page.php?product_id=1113
The site is dynamic, so you can use selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://funskoolindia.com/products.php?search=9723100')
results = [*{i.a['href'] for i in soup(d.page_source, 'html.parser').find_all('div', {'class':'product-media light-bg'})}]
Output:
['product_inner_page.php?product_id=1113']
The data is loaded dynamically through JavaScript from a different URL. One solution is using selenium, which executes the JavaScript and loads the links that way.
Another solution is to use the re module and request the data URL manually:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://funskoolindia.com/products.php?search=9723100'
data_url = 'https://funskoolindia.com/admin/load_data.php'
data = {'page': '1',
        'sort_val': 'new',
        'product_view_val': 'grid',
        'show_list': '12',
        'brand_id': '',
        'checkboxKey': re.findall(r'var checkboxKey = "(.*?)";', requests.get(url).text)[0]}
soup = BeautifulSoup(requests.post(data_url, data=data).text, 'lxml')
for a in soup.select('#list-view .product-bg-panel > a[href]'):
    print('https://funskoolindia.com/' + a['href'])
Prints:
https://funskoolindia.com/product_inner_page.php?product_id=1113
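The checkboxKey extraction above works because re.findall with a capturing group pulls the quoted value of a JavaScript variable straight out of the page source. Illustrated on a static snippet (the fragment below is a made-up example):

```python
import re

# A made-up fragment of page source containing the JavaScript variable
page_source = '''
<script>
    var checkboxKey = "a1b2c3d4";
    var otherVar = 42;
</script>
'''

# The capturing group (.*?) returns only the quoted value, not the whole match
keys = re.findall(r'var checkboxKey = "(.*?)";', page_source)
print(keys[0])  # a1b2c3d4
```

If the variable is missing from the page, findall returns an empty list, so indexing with [0] would raise an IndexError; checking the list first makes the scraper more robust.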
Try this: print('href: ', a_tag.get("href")),
and pass features="lxml" to the BeautifulSoup constructor.
I am trying to fetch the download CSV file link from this: https://patents.google.com/?assignee=intel
This is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://patents.google.com/?assignee=intel")
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('a', class_='style-scope search-results')
soup.find_all('a', class_='style-scope')
But the last two lines return an empty list. What am I missing here?
Even this returns nothing:
soup.find(id="resultsLayout")
That's because the elements are generated by JavaScript. You can use selenium to get the fully rendered page source.
Here's an edited version of your code using selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://patents.google.com/?assignee=intel')
page = browser.page_source
browser.quit()
soup = BeautifulSoup(page, 'html.parser')
soup.find_all('a', class_='style-scope search-results')
soup.find_all('a', class_='style-scope')
Let me know if you need clarifications. Thanks!