Webscraping an alternate version/hidden item from a webpage using python beautifulsoup - python

I am trying to web-scrape information (points scored, tackles made, time played, position, etc.) about Top 14 rugby players from a website.
For each player I get info from this page :
http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon
For each player, I can get info for the 2015-2016 season easily, but I also need info for the 2014-2015 season.
The problem is that when I open the corresponding link (http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon#season=14535), the source code is the same and the info my program scrapes is still the 2015-2016 data.
I can't seem to find a way to get the info for previous seasons even though it appears on the webpage.
Does anyone know how to solve this?
Here is my code for the player I gave as an example.
import bs4
from lxml import html
import requests
import string
import _pickle as pickle
from bs4 import BeautifulSoup

dic = {}
url_player = 'http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon'
page = requests.get(url_player)
html = page.content
parsed_html = BeautifulSoup(html, 'html.parser')
body = parsed_html.body
saison14_15 = body.find('a', attrs={'data-title': 'Saison 2014-2015'})
link = saison14_15['href']
url_season = 'http://www.lnr.fr/rugby-top-14/joueurs/nicholas-abendanon' + link
page_season = requests.get(url_season)
html_season = page_season.content
parsed_html_season = BeautifulSoup(html_season, 'html.parser')
body_season = parsed_html_season.body
dic['nom'] = body_season.find('h1', attrs={'id': 'page-title'}).text
dic[body_season.find('span', attrs={'class': 'title'}).text] = body_season.find('span', attrs={'class': 'text'}).text
info1 = body_season.find('ul', attrs={'class': 'infos-list small-bold'})
try:
    for item in info1.findAll('li'):
        dic[item.find('span', attrs={'class': 'title'}).text] = item.find('span', attrs={'class': 'text'}).text
    info2 = body_season.find('ul', attrs={'class': 'fluid-block-grid-3 team-stats'})
    if info2 is not None:
        for item in info2.findAll('li'):
            dic[item.find('span', attrs={'class': 'title'}).text] = item.find('span', attrs={'class': 'text'}).text
    info3 = body_season.find('ul', attrs={'class': 'number-list small-block-grid-2'})
    if info3 is not None:
        for item in info3.findAll('li'):
            dic[item.find('span', attrs={'class': 'title'}).text] = item.find('span', attrs={'class': 'text'}).text
except AttributeError:
    pass

When you choose the 2014-2015 season, the page makes an AJAX request to
http://www.lnr.fr/ajax_player_stats_detail?player=33249&compet_type=1&=undefined&season=14535&_filter_current_tab_id=panel-filter-season&ajax-target-selector=%23player_stats_detail_block
If you then switch back to 2015-2016, it makes an AJAX request to
http://www.lnr.fr/ajax_player_stats_detail?player=33249&compet_type=1&=undefined&season=18505&_filter_current_tab_id=panel-filter-season&ajax-target-selector=%23player_stats_detail_block
Each request returns a chunk of HTML which gets inserted into the page.
If you can figure out the parameters needed for player and season, I suggest you request the data directly (without loading the parent page at all).
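As a rough sketch of that direct approach (the player and season IDs are the ones visible in the URLs above, and the stray '=undefined' parameter from the captured URL is dropped on the assumption that it isn't needed):

import requests
from bs4 import BeautifulSoup

# 33249 = Nicholas Abendanon, 14535 = the 2014-2015 season (taken from the URLs above)
params = {
    'player': '33249',
    'compet_type': '1',
    'season': '14535',
    '_filter_current_tab_id': 'panel-filter-season',
    'ajax-target-selector': '#player_stats_detail_block',
}
response = requests.get('http://www.lnr.fr/ajax_player_stats_detail', params=params)

# The response is the HTML fragment the page would insert, so the same
# span.title / span.text parsing from the question applies to it directly.
fragment = BeautifulSoup(response.content, 'html.parser')
stats = {}
for item in fragment.find_all('li'):
    title = item.find('span', attrs={'class': 'title'})
    text = item.find('span', attrs={'class': 'text'})
    if title is not None and text is not None:
        stats[title.text] = text.text
print(stats)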

Related

Cannot select HTML element with BeautifulSoup

Novice web scraper here:
I am trying to scrape the name and address from this website: https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1. I have attempted the following code, which only returns 'None' (or an empty array if I replace find() with find_all()). I would like it to return the HTML of this particular section so I can extract the text and later add it to a CSV file. If the link doesn't work, or doesn't take you to where I'm working, simply go to the Knox County TN website > property search > select a property.
Much appreciation in advance!
from splinter import Browser
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
from webdriver_manager.chrome import ChromeDriverManager
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find('td', class_='DataletData')
owner_elem
OR
# this being the tag and class of the whole section where the info is located
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find_all('div', class_='datalet_div_2')
owner_elem
OR when I try:
browser.find_by_css('td.DataletData')[15]
it returns:
<splinter.driver.webdriver.WebDriverElement at 0x11a763160>
and I can't pull the html contents from that element.
There are a few issues I see, but it could be that you didn't include your code exactly as you actually have it.
Splinter works on its own to get page data by letting you control a browser. You don't need BeautifulSoup or requests if you're using splinter. You use requests if you want the raw response without running any of the things that browsers do for you automatically.
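If you stay with Splinter, a rough sketch of handing the rendered page over to BeautifulSoup for the final parsing (the navigation steps are left as comments because they depend on the search you run):

from splinter import Browser
from bs4 import BeautifulSoup as soup

browser = Browser('chrome')  # assumes chromedriver is available on your PATH
browser.visit('https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop')
# ... agree to the disclaimer and navigate to a property in the controlled browser ...

# browser.html is the page source as the browser currently sees it
owner_soup = soup(browser.html, 'html.parser')
print(owner_soup.find('td', class_='DataletData'))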
One of the things browsers handle for you automatically is redirects. The link you provided does not serve the HTML that you are seeing; it just returns a response header that redirects you to https://propertyinfo.knoxcountytn.gov/, which redirects you again to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, which redirects again to https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop.
On this page you have to hit the 'agree' button to get redirected to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, this time with these cookies set:
Cookie: ASP.NET_SessionId=phom3bvodsgfz2etah1wwwjk; DISCLAIMER=1
I'm assuming the session id is autogenerated, and the Disclaimer value just needs to be '1' for the server to know you agreed to their terms.
So you really have to study a page and understand what's going on before you can do this on your own with just the requests and beautifulsoup libraries. Besides the redirects I mentioned, you still have to figure out which network request gives you that session id so you can manually add it to the Cookie header you send on all future requests. Doing it this way avoids some requests and is a lot faster, but you do need to be able to follow along in the developer tools' Network tab.
Postman is a good tool to help you set up requests yourself and see their result. Then you can bring all the set up from there into your code.
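As a rough sketch of what that requests-only setup could look like (the cookie handling follows the description above; reaching a specific property will likely still require replaying the search request you see in the Network tab):

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# First request: lets the server set the ASP.NET_SessionId cookie on the session.
session.get('https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop')

# Mimic clicking 'agree' (assumption: DISCLAIMER=1 is all the server checks, as described above).
session.cookies.set('DISCLAIMER', '1', domain='propertyinfo.knoxcountytn.gov')

# Now the search page should load instead of bouncing back to the disclaimer.
response = session.get('https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop')
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code, soup.title)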

Web Scraping Stock Ticker Price from Yahoo Finance using BeautifulSoup

I'm trying to scrape Gold stock ticker from Yahoo! Finance.
from bs4 import BeautifulSoup
import requests, lxml
response = requests.get('https://finance.yahoo.com/quote/GC=F?p=GC=F')
soup = BeautifulSoup(response.text, 'lxml')
gold_price = soup.findAll("div", class_='My(6px) Pos(r) smartphone_Mt(6px)')[2].find_all('p').text
Whenever I run this it returns: list index out of range.
When I do print(len(ssoup)) it returns 4.
Any ideas?
Thank you.
You can make a direct request to the yahoo server. To locate the query URL you need to open Network tab via Dev tools (F12) -> Fetch/XHR -> find name: spark?symbols= (refresh page if you don't see any), find the needed symbol, and see the response (preview tab) on the newly opened tab on the right.
You can make direct requests to all of these links if the request method is GET since POST methods are much more complicated.
You need the json and requests libraries; there's no need for bs4. Note that making a lot of such requests might get your IP blocked (or rate-limited), or you might stop getting responses, because their system can detect that it's a bot: a regular user won't hit the server with such requests repeatedly. So you need to figure out how to work around that.
Update:
There's possibly a hard limit on how many requests can be made in an X period of time.
Code and example in the online IDE (contains full JSON response):
import requests, json
response = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=GC%3DF&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance').text
data_1 = json.loads(response)
gold_price = data_1['spark']['result'][0]['response'][0]['meta']['previousClose']
print(gold_price)
# 1830.8
P.S. I have a blog post about scraping the Yahoo! Finance home page, which may be somewhat relevant.

Python Web Scraping with search and non dynamic URI

I'm a beginner in the world of Python and web scrapers. I am used to making scrapers with dynamic URLs, where the URI changes when I input specific parameters into the URL itself.
Ex: Wikipedia.
(If I input a search for "Stack Overflow", I get a URI that looks like this: https://en.wikipedia.org/wiki/Stack_Overflow)
I was recently challenged to develop a web scraper to collect data from this page.
The field "Texto/Termos a serem pesquisados" (text/terms to search for) corresponds to a search field, but when I submit the search the URL stays the same, which keeps me from getting the right HTML code for my research.
I am used to working with BeautifulSoup and Requests for scraping, but in this case that is no use, since the URL stays the same after the search.
import requests
from bs4 import BeautifulSoup

url = 'http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp'
html = requests.get(url)
bsObj = BeautifulSoup(html.content, 'html.parser')
print(bsObj)
# And from now on I can't go any further
Usually I would do something like:
url = 'https://en.wikipedia.org/wiki/'
term = input('Input your search :) ')
search = url + term
And then do all the BeautifulSoup and findAll work to get my data from the HTML code.
I have tried Selenium too, but I'm looking for something other than that because of all the webdriver setup. With the following piece of code I have achieved some odd results, but I still can't scrape the HTML in a good way.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from bs4 import BeautifulSoup
# Access the page and input the search in the field
driver = webdriver.Chrome()
driver.get('http://comprasnet.gov.br/acesso.asp?url=/ConsultaLicitacoes/ConsLicitacao_texto.asp')
driver.switch_to.frame('main2')
busca = driver.find_element_by_id("txtTermo")
busca.send_keys("GESTAO DE PESSOAS")
#data_inicio = driver.find_element_by_id('dt_publ_ini')
#data_inicio.send_keys("01/01/2018")
#data_fim = driver.find_element_by_id('dt_publ_fim')
#data_fim.send_keys('20/12/2018')
botao = driver.find_element_by_id('ok')
botao.click()
So given all that:
Is there a way to scrape data from these static URLs?
Can I input a search in the field via code?
Why can't I scrape the right source code?
The problem is that your initial search page is using frames for the searching & results, which makes it harder for BeautifulSoup to work with it. I was able to obtain the search results by using a slightly different URL and MechanicalSoup instead:
>>> from mechanicalsoup import StatefulBrowser
>>> sb = StatefulBrowser()
>>> sb.open('http://comprasnet.gov.br/ConsultaLicitacoes/ConsLicitacao_texto.asp')
<Response [200]>
>>> sb.select_form() # select the search form
<mechanicalsoup.form.Form object at 0x7f2c10b1bc18>
>>> sb['txtTermo'] = 'search text' # input the text to search for
>>> sb.submit_selected() # submit the form
<Response [200]>
>>> page = sb.get_current_page() # get the returned page in BeautifulSoup form
>>> type(page)
<class 'bs4.BeautifulSoup'>
Note that the URL I'm using here is that of the frame that has the search form and not the page you provided that was inlining it. This removes one layer of indirection.
MechanicalSoup is built on top of BeautifulSoup and provides some tools for interacting with websites in a similar way to the old mechanize library.
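From there, page is an ordinary BeautifulSoup object, so the usual find/find_all calls apply. A minimal sketch (the row selector is a guess; inspect the results frame to find the real structure):

# page comes from sb.get_current_page() above
for row in page.find_all('tr'):
    text = row.get_text(strip=True)
    if text:
        print(text)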

Python3 Parse more than 30 videos at a time from youtube

I recently decided to get into parsing with Python. I made up a project where I need to get data from all of a YouTuber's videos. I figured it would be easy to just go to the Videos tab in their channel and parse it for all of its links. However, when I do parse it I can only get 30 videos at a time. I was wondering why this is, because the link never seems to change when you load more, and whether there is a way around it.
Here is my code
import bs4 as bs
import requests

# Placeholder URL: this should be the channel's Videos tab,
# e.g. https://www.youtube.com/user/<channel>/videos
page = requests.get("https://www.youtube.com/user/<channel>/videos")
soup = bs.BeautifulSoup(page.text, 'html.parser')
soup.find_all("a", "watch-view-count")
k = soup.find_all("div", "yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2")
storage = open('data.csv', 'a')
for link in k:
    href = link.get('href')
    if href:
        storage.write(href + '\n')
storage.close()
Any help is appreciated, thanks
I should first say that I agree with #jonrsharpe. Using the YouTube API is the more sensible choice.
However, if you must do this by scraping, here's a suggestion.
Let's take MKBHD's videos page as an example. The Load more button at the bottom of the page has a button tag with this attribute (You can use your browser's 'inspect element' feature to see this value):
data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"
When you click the Load more button, it makes an AJAX request to this /browse_ajax url. The response is a JSON object that looks like this:
{
content_html: "the html for the videos",
load_more_widget_html: " \n\n\n\n \u003cbutton class=\"yt-uix-button yt-uix-button-size-default yt-uix-button-default load-more-button yt-uix-load-more browse-items-load-more-button\" type=\"button\" onclick=\";return false;\" aria-label=\"Load more\n\" data-uix-load-more-href=\"\/browse_ajax?action_continuation=1\u0026amp;continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk03Z0JBQSUzRCUzRA%253D%253D\" data-uix-load-more-target-id=\"channels-browse-content-grid\"\u003e\u003cspan class=\"yt-uix-button-content\"\u003e \u003cspan class=\"load-more-loading hid\"\u003e\n \u003cspan class=\"yt-spinner\"\u003e\n \u003cspan class=\"yt-spinner-img yt-sprite\" title=\"Loading icon\"\u003e\u003c\/span\u003e\n\nLoading...\n \u003c\/span\u003e\n\n \u003c\/span\u003e\n \u003cspan class=\"load-more-text\"\u003e\n Load more\n\n \u003c\/span\u003e\n\u003c\/span\u003e\u003c\/button\u003e\n\n\n"
}
The content_html contains the html for the new page of videos. You can parse that to get the videos in that page. To get to the next page, you need to use the load_more_widget_html value and extract the url which again looks like:
data-uix-load-more-href="/browse_ajax?action_continuation=1&continuation=4qmFsgJAEhhVQ0JKeWNzbWR1dllFTDgzUl9VNEpyaVEaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"
The only thing in that url that changes is the value of the continuation parameter. You can keep making requests to this 'continuation' url, until the returning JSON object does not have the load_more_widget_html.
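A rough sketch of that loop, assuming the endpoint still behaves as described (the continuation token below is a placeholder, and yt-uix-tile-link is the anchor class from the tiles in the question):

import re
import requests
from bs4 import BeautifulSoup

base = 'https://www.youtube.com'
# Starting point: the data-uix-load-more-href scraped from the channel's /videos page
# ('...' is a placeholder, not a real continuation token).
next_url = '/browse_ajax?action_continuation=1&continuation=...'

video_links = []
while next_url:
    data = requests.get(base + next_url).json()

    # content_html holds the HTML for the next batch of video tiles.
    tiles = BeautifulSoup(data['content_html'], 'html.parser')
    for a in tiles.find_all('a', class_='yt-uix-tile-link'):
        video_links.append(a.get('href'))

    # load_more_widget_html holds the next continuation URL, if there is one.
    widget = data.get('load_more_widget_html', '')
    match = re.search(r'data-uix-load-more-href="([^"]+)"', widget)
    next_url = match.group(1).replace('&amp;', '&') if match else None

print(len(video_links))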

Scraping website only provides a partial or random data into CSV

I am trying to extract a list of golf courses name and addresses from the Garmin Website using the script below.
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(893):  # 893
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("div", {"class": "result"})
    for item in g_data2:
        try:
            name = item.contents[3].find_all("div", {"class": "name"})[0].text
            print name
        except:
            name = ''
        try:
            address = item.contents[3].find_all("div", {"class": "location"})[0].text
        except:
            address = ''
        course = [name, address]
        courses_list.append(course)

with open('PGA_Garmin2.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
After running the script, I don't get the full data that I need; it also produces seemingly random values rather than a complete data set. I need to extract information from 893 pages and get a list of at least 18,000 courses, but after running this script I only get 122. How do I fix this script so it collects the complete data set and produces a CSV with the full list of golf courses from the Garmin website? I adjusted the page numbers to reflect the pagination on the Garmin site, where results advance in steps of 20.
Just taking a guess here, but try checking r.status_code and confirm that it's 200? It's possible that you're not actually reaching every page of the site.
Stab in the dark.
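A quick sketch of that check, using the same URL and selector as the question:

import requests
from bs4 import BeautifulSoup

for i in range(893):
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i * 20)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    results = soup.find_all("div", {"class": "result"})
    # Pages with a non-200 status or zero results are where the data is going missing.
    print(i, r.status_code, len(results))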