I am trying to get some data from this website:
http://www.espn.com.br/futebol/resultados/_/liga/BRA.1/data/20181018
When I inspect the page in my browser I can see all the values I need in the HTML. I want to fetch the game results and the players' names (for each date; in this example, 2018-10-18).
On days with no games the website shows:
"Sem jogos nesta data" ("No games on this date"), which is easy to find when inspecting the page in the browser.
But when using
import requests
url = 'http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018'
page = requests.get(url)
The output is basically the page, but I can't find the phrase "Sem jogos nesta data" in it.
How can I fetch the HTML containing the script results? Is it possible with requests? urllib?
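It looks like the data you are looking for comes from their backend API and is added to the page by JavaScript, so it is not in the HTML that requests receives. I would use Selenium (the selenium Python package) instead of requests.
Here is an example: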
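from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018")
# XPath attribute tests use @, not #
value = driver.find_elements(By.XPATH, '//*[@id="events"]/div')
driver.quit()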
I didn't check the code but it should be working
I want to scrape data from the tables on this URL, but I couldn't find the table class.
I tried the first steps in BeautifulSoup but I couldn't get any further.
from pandas.io.html import read_html
import requests
from bs4 import BeautifulSoup
url = 'https://www.cbe.org.eg/en/Auctions/Pages/AuctionsEGPTBillsHistorical.aspx'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# The table's class is unknown, so match on the tag alone for now
bills_table = soup.find('table')
I appreciate any help.
Thanks.
When clicking on that link, a page is displayed containing a date range with the ability to show a table (which I assume is the table you are after) upon a click. You have 2 options here:
1. Simulate the browser using a package such as Selenium
2. Send the POST request yourself directly
There is documentation available on Selenium which should help. I will describe option 2 in more detail.
Open the linked website, open the developer tools in whichever browser you are using, and navigate to the 'Network' tab. Enter a date range and press the show online button as required. A new entry will appear, which is the request we want to make. Click on it and create a new requests.post(..) call with the listed request headers and body. You may want to edit the body to change the date range.
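As a rough sketch of option 2, the request seen in the Network tab can be rebuilt in code. The endpoint, header values, and form field names below (dateFrom, dateTo) are hypothetical placeholders, not the real ones for that site; copy the actual values from the captured request. The sketch uses urllib from the standard library and only builds the request, without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and headers: replace them with the values
# listed for the captured request in the browser's Network tab.
URL = "https://example.com/search"
HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "application/x-www-form-urlencoded",
}


def build_post_request(date_from, date_to):
    """Build (but do not send) a POST request for the given date range."""
    # Assumed field names; the real form may use different ones.
    form = {"dateFrom": date_from, "dateTo": date_to}
    body = urlencode(form).encode("ascii")
    return Request(URL, data=body, headers=HEADERS, method="POST")


request = build_post_request("2020-01-01", "2020-01-31")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Sending it would be a call to urllib.request.urlopen(request). With requests, the same headers and form dictionary would be passed as requests.post(URL, headers=HEADERS, data=form).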
Let me briefly describe the problem. When I use urllib3 to scrape the HTML from a website, it isn't the same as the HTML that I see when I open the website in Chrome and use 'Inspect element'.
Here is an example from my code. The problem is that the html code I got here is different from the html code I would get when I use inspect element on chrome
import urllib3
from bs4 import BeautifulSoup

# myUrl is the URL of the website I'm trying to scrape
http = urllib3.PoolManager()
response = http.request('GET', myUrl)
soup = BeautifulSoup(response.data.decode('utf-8'), features="html.parser")
m = str(soup)
That problem is probably because the content on the page is loaded with JavaScript. urllib3 only downloads the HTML the server sends; it does not run scripts. To get all of the data, you have to use a library that runs JavaScript. I recommend Selenium.
To verify that this is the case, disable JavaScript in your browser and try loading the page.
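The same check can be approximated in code by testing whether the text you want appears in the HTML the server sent, ignoring anything inside script or style tags. This is only a sketch using the standard library's html.parser; the two sample pages and the phrase "Price: 10" are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def in_static_html(html, phrase):
    """Return True if `phrase` appears in the text the server sent."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return phrase in " ".join(parser.parts)


# Hypothetical responses: one rendered on the server, one filled in by JavaScript.
static_page = "<html><body><p>Price: 10</p></body></html>"
dynamic_page = (
    "<html><body><div id='app'></div>"
    "<script>render('Price: 10')</script></body></html>"
)

print(in_static_html(static_page, "Price: 10"))   # True
print(in_static_html(dynamic_page, "Price: 10"))  # False
```

If the function returns False for a phrase you can see in the browser, the content is most likely added after the page loads.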
I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/), which pulls up a box with a list of product categories (really just redirect URLs; if it doesn't come up for you, click the button that says "Browse all products"). I tried using Python Requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect URLs. My code is as basic as it gets:
import requests
url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)
# The with block closes the file automatically
with open("laptops.txt", "w", encoding="utf-8") as outf:
    outf.write(page.text)
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks
This page uses JavaScript to get and display these links - but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (Network tab), I found that the page reads them from this URL:
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use it to get the links.
You may have to change country=pl&language=pl in the URL to get the list in a different language.
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"
response = requests.get(url)
soup = BS(response.text, 'html.parser')
all_items = soup.find_all('a')
for item in all_items:
    print(item.text, item['href'])
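Changing the country and language by hand is error-prone in a URL this long, so here is a small sketch that rewrites individual query parameters with urllib.parse. The helper name with_query is made up for this example, and whether the server accepts a given combination of country, language, region, and customerset values has not been checked.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url, **overrides):
    """Return `url` with the given query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(overrides)
    # safe="/" keeps the slash in values such as "all-products/esuprt_"
    return urlunsplit(parts._replace(query=urlencode(params, safe="/")))


url = (
    "https://www.dell.com/support/components/productselector/allproducts"
    "?category=all-products/esuprt_&country=pl&language=pl&region=emea"
    "&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"
)

# Example values only; other parameters may need to change as well.
print(with_query(url, country="us", language="en"))
```

The returned string can be passed to requests.get in place of the hard-coded URL.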
BTW: Another method is to use Selenium to control a real web browser, which can run JavaScript.
Try using Selenium with ChromeDriver. It helps with dynamic data on a website and also supports actions such as clicking buttons and handling page refreshes.
Beginner guide to web scraping
I am making a project where I want to see the average karma of users on various subreddits on Reddit. As such I am in the process of scraping users' karma, which is proving a bit difficult with the new Reddit structure.
I am not able to use PRAW as the karma figures there are not correct.
According to the page source of a user's profile, all I need is to find the following two variables: commentKarma and postKarma. Both of these variables are found under the "" section; see the example here: view-source:https://www.reddit.com/user/loganb3171. However, when I use Selenium's page_source or BeautifulSoup, they do not show up.
I have been working on this problem for a couple of hours now and I am nowhere near it.
Any and all help is appreciated.
Neither of these snippets gives me the entire page source that you get when right-clicking and choosing "View page source":
source_var = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
source_var = driver.page_source
Okay, so I see from the snippet in the question that you're using Selenium. If that's the case, note that the web driver has no built-in way to set arbitrary request headers, so Reddit will know you are a bot.
If you only need the page source, you can use requests to get the page and then either open it with Selenium or parse it with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://www.reddit.com/user/loganb3171"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
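Once the page source is available as text (for example page.text from the snippet above), the two values can be pulled out with a regular expression. This is a sketch that assumes they appear as JSON-style pairs such as "commentKarma": 1520; the sample string is invented, and the real page structure may differ.

```python
import re

# Assumed shape: JSON-style "commentKarma": <int> and "postKarma": <int> pairs.
KARMA_PATTERN = re.compile(r'"(commentKarma|postKarma)"\s*:\s*(-?\d+)')


def extract_karma(page_source):
    """Return a dict of the karma values found; empty if none are present."""
    # If a key occurs more than once, the last occurrence wins.
    return {key: int(value) for key, value in KARMA_PATTERN.findall(page_source)}


# Hypothetical fragment of a profile page's source.
sample = '<script>window.data = {"commentKarma": 1520, "postKarma": 87};</script>'
print(extract_karma(sample))  # {'commentKarma': 1520, 'postKarma': 87}
```

An empty dictionary means the values were not in the response at all, which points back to content being loaded after the initial request.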
I want to scrape the data in a <span> element on a given website using BeautifulSoup. You can see where it is located in the screenshot. However, the code that I'm using just returns an empty list. I can't find the data I want in the list. What am I doing wrong?
from bs4 import BeautifulSoup
import urllib.request
url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.find_all('span', attrs={'id': 'mc1_legend_value'}):
    your_data.append(line.text)
for line in soup.find_all('span'):
    your_data.append(line.text)
ScreenShot : https://imgur.com/a/z0vNh
Thank you.
The dashboard from the screenshot looks to me like something javascript would generate. If you can't find the tag in the page source, that means it was later added by some javascript code or your browser tried to fix some html which it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves you the plain html back. A browser would parse the html and execute any javascript code if it finds any. In your case, beautiful soup or urllib doesn't execute any javascript code. urllib fetches the html and beautiful soup makes it easier to parse and extract relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render your page and only then parsing its HTML with Beautiful Soup or any other parser.
Give a try to selenium: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically. You can make it request the page for you, render it, save the new HTML in a variable, parse it using Beautiful Soup and extract the values you're interested in. I believe that it already has its own element lookup implemented, which you can use directly to search for that tag.
Or maybe even scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real time and that value is continuously received from the server, you could take a look at what requests are sent to the server in order to get that value. Press F12 to open the developer console and click on the Network tab. Refresh the page and you should see all the requests sent to the server along with the responses. Requests sent by JavaScript are usually XMLHttpRequests. Click on XHR in the Network tab to filter out any other requests. (These are instructions for Google Chrome. Firefox might differ a bit.)
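If one of those XHR requests turns out to return JSON, the value can be read directly without rendering the page. The sketch below assumes a payload of the form {"legend": {"value": ...}}; that shape, the key names, and the sample body are hypothetical, and the real endpoint and response have to be taken from the Network tab.

```python
import json
from urllib.request import Request, urlopen


def parse_legend_value(body):
    """Pull the value out of a JSON response body (assumed payload shape)."""
    payload = json.loads(body)
    return payload["legend"]["value"]


def fetch_legend_value(endpoint):
    """Request the endpoint found in the Network tab and parse its response."""
    request = Request(endpoint, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return parse_legend_value(response.read().decode("utf-8"))


# Offline demonstration with a made-up response body.
sample_body = '{"legend": {"value": 23.5, "unit": "kW"}}'
print(parse_legend_value(sample_body))  # 23.5
```

Keeping the parsing separate from the fetching makes it easy to adjust the key lookup once the real response is known.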