I want to scrape data from the tables on this URL, but I couldn't find the table class.
I tried the first steps in BeautifulSoup but I couldn't get any further.
from pandas.io.html import read_html
import requests
from bs4 import BeautifulSoup
url = 'https://www.cbe.org.eg/en/Auctions/Pages/AuctionsEGPTBillsHistorical.aspx'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# No class that I can find to filter on
bills_table = soup.find('table')
I appreciate any help.
Thanks.
When you open that link, the page shows a date-range form that loads a table (which I assume is the table you are after) when you click a button. You have two options here:
Automate the browser using a package such as Selenium
Send the POST request yourself directly
There is plenty of documentation available on Selenium, which should help with option 1. I will describe option 2 in more detail.
Open the linked website, open the dev panel in whichever browser you are using, and go to the 'Network' tab. Enter a date range and press the button that shows the table. A new entry will appear in the Network tab; that is the request we want to reproduce. Click on it and create a new requests.post(...) call with the listed request headers and body. You can then edit the body to change the date range.
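A minimal sketch of that approach (the endpoint, headers, and form fields below are placeholders, not the real ones; copy the actual values from the request you see in the Network tab):
import requests

# Placeholder values - replace with the URL, headers and body copied
# from the Network tab entry that appears when you click the button.
post_url = 'https://www.cbe.org.eg/...'          # the request URL shown in DevTools
headers = {'User-Agent': 'Mozilla/5.0'}          # plus any headers the site requires
payload = {'FromDate': '01/01/2021', 'ToDate': '31/12/2021'}  # hypothetical field names

r = requests.post(post_url, headers=headers, data=payload)
print(r.status_code)
print(r.text[:500])  # the response should contain the table data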
Related
Novice web scraper here:
I am trying to scrape the name and address from this website: https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1. I have attempted the following code, which only returns None, or an empty list if I replace find() with find_all(). I would like it to return the HTML of this particular section so I can extract the text and later add it to a CSV file. If the link doesn't work, or doesn't take you to where I'm working, simply go to the Knox County TN website > property search > select a property.
Much appreciation in advance!
from splinter import Browser
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
from webdriver_manager.chrome import ChromeDriverManager
# html comes from the splinter browser, e.g. html = browser.html
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find('td', class_='DataletData')
owner_elem
OR
# this being the tag and class of the whole section where the info is located
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find_all('div', class_='datalet_div_2')
owner_elem
OR when I try:
browser.find_by_css('td.DataletData')[15]
it returns:
<splinter.driver.webdriver.WebDriverElement at 0x11a763160>
and I can't pull the html contents from that element.
There are a few issues I see, but it could be that you didn't include your code exactly as you actually have it.
Splinter works on its own to get page data by letting you control a browser. You don't need BeautifulSoup or requests if you're using splinter. You use requests if you want the raw response without running any of the things that browsers do for you automatically.
One of these automatic things is redirects. The link you provided does not provide the HTML that you are seeing. This link just has a response header that redirects you to https://propertyinfo.knoxcountytn.gov/, which redirects you again to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, which redirects again to https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop
On this page you have to hit the 'agree' button to get redirected to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, this time with these cookies set:
Cookie: ASP.NET_SessionId=phom3bvodsgfz2etah1wwwjk; DISCLAIMER=1
I'm assuming the session id is autogenerated, and the Disclaimer value just needs to be '1' for the server to know you agreed to their terms.
So you really have to study a page and understand what's going on to do it on your own with just the requests and BeautifulSoup libraries. Besides the redirects I mentioned, you still have to figure out which network request gives you that session id so you can add it to the cookie header you send on all future requests. You can skip some requests this way, so it's a lot faster, but you need to be able to follow along in the developer tools 'Network' tab.
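As a rough sketch (assuming a requests.Session is enough to keep the ASP.NET_SessionId, and that setting DISCLAIMER=1 is all the server checks; the 'agree' button may also POST a form you need to reproduce):
import requests
from bs4 import BeautifulSoup

# Sketch of the manual approach, not a tested implementation.
session = requests.Session()  # keeps ASP.NET_SessionId across requests

# Visit the disclaimer page first so the server issues a session id.
session.get('https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx'
            '?FromUrl=../search/commonsearch.aspx?mode=realprop')

# Pretend we clicked 'agree' by setting the cookie mentioned above.
session.cookies.set('DISCLAIMER', '1')

resp = session.get('https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop')
owner_soup = BeautifulSoup(resp.text, 'html.parser')
print(owner_soup.find('td', class_='DataletData'))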
Postman is a good tool to help you set up requests yourself and see their result. Then you can bring all the set up from there into your code.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.moneycontrol.com/india/stockpricequote/cigarettes/itc/ITC'
response = requests.get(URL)
soup = BeautifulSoup(response.text,'html.parser')
# time.sleep(5)
var1 = float(soup.find('td', attrs={'class': 'espopn'}).get_text().replace(",",""))
With this code I am able to get the value of var1, but the web page does not show the real-time data the moment you land on it; it takes about a second after the page loads for the real-time value to update.
Because of this, the value I am getting in var1 is not the real-time value.
I wanted to know how I can wait after landing on the web page before doing the scraping.
Thanks in Advance.
1. The data is updated dynamically, so it is hard to get with bs4. You can get it from the API itself; here is how to find it.
2. Go to Chrome developer mode, open the Network tab, filter by XHR, and reload the website. Under the Name column you will find links, but there are a lot of them.
3. On the left side there is a search box, so you can search for the price. It gives you the URL; click on it, go to Headers, copy that URL, and make the call using the requests module.
import requests

res = requests.get("https://api.moneycontrol.com/mcapi/v1/stock/get-stock-price?scIdList=ITC%2CVST%2CGPI%2CIWP540954%2CGTC&scId=ITC")
main_data = res.json()
main_data['data'][0]
Output:
{'companyName': 'ITC',
'lastPrice': '215.25',
'perChange': '-0.62',
'marketCap': '264947.87',
'scTtm': '19.99',
'perform1yr': '7.33',
'priceBook': '4.16'}
I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/), which pulls up a box with a list of product categories (really just redirect URLs; if it doesn't come up for you, click the button that says "Browse all products"). I tried using Python requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect URLs. My code is as basic as it gets:
import requests

url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)

with open("laptops.txt", "w", encoding="utf-8") as outf:
    outf.write(page.text)
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks
This page uses JavaScript to get and display these links - but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (Network tab), I found that it reads the data from this URL:
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use it to get links.
You may have to change country=pl&language=pl in the URL to get it in a different language.
import requests
from bs4 import BeautifulSoup as BS

url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"

response = requests.get(url)
soup = BS(response.text, 'html.parser')

all_items = soup.find_all('a')
for item in all_items:
    print(item.text, item['href'])
BTW: another method is to use Selenium to control a real web browser, which can run JavaScript.
Try using the Selenium Chrome driver. It helps with handling dynamic data on a website and also supports things like clicking buttons, handling page refreshes, etc. A rough sketch is below.
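A minimal sketch of that approach (assuming chromedriver is available and the Dell page from the question is the target; the fixed wait and the tag selector are guesses, not a tested recipe):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Sketch only: assumes chromedriver is on PATH.
driver = webdriver.Chrome()
driver.get("https://www.dell.com/support/home/us/en/04/products/")
time.sleep(5)  # crude wait for the JavaScript-rendered content to load

# Grab every link the rendered page contains.
for link in driver.find_elements(By.TAG_NAME, "a"):
    print(link.text, link.get_attribute("href"))

driver.quit()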
Beginner guide to web scraping
I am trying to get some data from this website:
http://www.espn.com.br/futebol/resultados/_/liga/BRA.1/data/20181018
When I inspect the page in my browser I can see all the values I need in the HTML. I want to fetch the game results and the players' names (for each date, in this example 2018-10-18).
On no game days the website shows:
"Sem jogos nesta data", which is it easy to find on browser inspection:
But when using
url = 'http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018'
page = requests.get(url, "lxml")
The output is basically the page HTML, but I can't find the phrase "Sem jogos nesta data" anywhere in it.
How can I fetch the HTML containing the script-rendered results? Is it possible with requests? urllib?
Looks like the data you are looking for comes from their backend API. I would use the selenium Python package instead of requests.
Here is an example:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018")
value = driver.find_elements(By.XPATH, '//*[@id="events"]/div')
driver.close()
I didn't run the code, but it should work.
I am making a project where I want to see the average karma of users on various subreddits. As such I am in the process of scraping users' karma, which is proving a bit difficult with the new Reddit structure.
I am not able to use PRAW as the karma figures there are not correct.
According to the page source of a user's profile, all I need is to find the following two variables: commentKarma and postKarma. Both of these variables can be found in the page source, see this example: view-source:https://www.reddit.com/user/loganb3171. However, when I use selenium's page_source or BeautifulSoup they do not show up.
I have been working on this problem for a couple of hours now and I am nowhere near solving it.
Any and all help is appreciated.
Neither of these snippets gives me the entire page source that you get when right-clicking and selecting "view page source":
source_var = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
source_var=driver.page_source
Okay, so I see that you're using selenium from the snippet in the question. If that's the case, then there's no way to set request headers with the web driver. Reddit will know you are a bot.
If you only need the page source, you can use requests to get the page and then either open it with selenium or parse it with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://www.reddit.com/user/loganb3171"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())