I'm trying to scrape the crash-number values off of csgocrash.com, but Beautiful Soup can't seem to load the whole webpage. Is this because the content doesn't load until after the page says "Game loading..."? And how would I fix that? Once I can navigate to the history tab of the website in Beautiful Soup, I think I'll be able to accomplish my goal.
I've been trying to figure this out, but with no luck. I found a thread (How to scrape data from flexbox element/container with Python and Beautiful Soup) that I thought would help, but I couldn't make any headway with it.
The site I'm trying to scrape is http://www.northwest.williams.com/NWP_Portal/. In particular, I want the data from the 'Storage Levels' tab/frame, but for the life of me I can't navigate to the right spot to get it. I've tried various iterations of the code below with no success: I've changed 'lxml' to 'html.parser', looked for tables, looked for 'tr' elements, and so on, but the code always comes back empty. I've also looked at the network info, but when I click on any of the tabs (System Status, PAL/System Balancing, etc.) I don't see any change in network activity. I'm sure it's something simple that I'm overlooking, but I just can't put my finger on it.
from bs4 import BeautifulSoup
import requests

url = 'http://www.northwest.williams.com/NWP_Portal/'
r = requests.get(url)
html = BeautifulSoup(r.content, 'lxml')
# Always comes back as an empty list
page = html.find_all('div', {'class': 'dailyOperations-panels'})
How can I 'navigate' to the 'Storage Levels' frame/tab? What HTML am I actually looking for? Can I do this with just requests and Beautiful Soup? I'm not opposed to using Selenium, but I haven't used it before and would prefer to stick with requests and Beautiful Soup if possible.
Thanks in advance!
Hey, so what I notice is that you're trying to get 'dailyOperations-panels' out of the raw HTML, which won't work: those panels are generated by JavaScript after the page loads, so they never show up in the static source that requests receives.
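One option, then, is to let a real browser render the page first and only hand the finished HTML to Beautiful Soup. Here is a minimal sketch, assuming Selenium and chromedriver are installed; the class name is taken straight from your snippet, and the 20-second wait is an arbitrary choice:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://www.northwest.williams.com/NWP_Portal/')
# Wait until the panel container actually exists in the rendered DOM
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'dailyOperations-panels')))
# Hand the rendered HTML to Beautiful Soup for the actual parsing
html = BeautifulSoup(driver.page_source, 'lxml')
panels = html.find_all('div', {'class': 'dailyOperations-panels'})
driver.quit()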
I'm trying to download a file using Selenium and BeautifulSoup, but I'm running into some issues with the way the website is set up. I can see there is a table object containing the link I want deep in the code, but I'm having trouble actually instructing BeautifulSoup and Selenium to navigate that far down and find the link. The website is https://www.theice.com/clear-us/risk-management#margin-rates, and I want to download the Margin Scanning file.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.theice.com/clear-us/risk-management#margin-rates'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=hdr)
icepage = urllib.request.urlopen(req)
htmlitem = icepage.read()
soup = BeautifulSoup(htmlitem, 'lxml')
divs = soup.find('div', {'class': 'sticky-header__main'})
print(divs.findChild().find('div', {'class': 'row'})
          .find('div', {'class': '1-main true-grid-10'})
          .find_all('div')[2])
From there, divs.findChild().find('div',{'class':'row'}).find('div',{'class':'1-main true-grid-10'}).find_all('div')[2] is the closest I have gotten to selecting the next div, the one with id='content-5485eefe-b105-49ed-b1ac-7e9470d29262', and I want to drill down from there to the ICUS_MARGIN_SCANNING csv in the table five or six div levels below.
With Selenium I'm even more lost: I've been trying variations of driver.find_element_by_link_text('Margin Scanning') and getting nothing back.
Any help with accessing that table and the ICUS_MARGIN_SCANNING file would be much appreciated. Thank you!
I used F12 → the Network tab and found the page that actually serves the table, so here you go:
from bs4 import BeautifulSoup
import requests
import datetime

BASE_API_URL = 'https://www.theice.com'
# The table is served by this endpoint; the trailing '_' parameter is the
# current time in milliseconds
r = requests.get(f'https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml'
                 f'?getParameterFileTable&category=Current'
                 f'&_={int(datetime.datetime.now().timestamp() * 1000)}')
soup = BeautifulSoup(r.content, features='lxml')
# Take the href of the first link whose text is exactly "Margin Scanning"
margin_scanning_link = BASE_API_URL + soup.find_all('a', string='Margin Scanning')[0].attrs['href']
margin_scanning_file = requests.get(margin_scanning_link)
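The '_' query parameter appears to be just a cache-busting timestamp that the page's own JavaScript appends to each request. To actually keep the file, write the response body to disk; continuing from the snippet above, something like this should work (the local filename is an arbitrary choice):

# Save the downloaded CSV locally (filename is just an example)
with open('ICUS_MARGIN_SCANNING.csv', 'wb') as f:
    f.write(margin_scanning_file.content)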
I'm currently using Beautiful Soup to try to scrape a website for data, but the Python module only reads the page's source code. The information I need isn't in the source code, yet if I right-click the page in Chrome and inspect an element, it is there.
I was wondering whether there is any way a Python module could scrape the rendered elements of a webpage rather than the source code.
In Beautiful Soup I've tried to search for the elements, but they just don't come up, because it's searching the source code. I'm also not sure why or how they don't appear there.
When the contents are loaded by JavaScript, you cannot get the data via Beautiful Soup, because the data never appears in the HTML the server sends. In this situation the Selenium library is used, as it is more useful and handy for extracting dynamically generated content.
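For instance, a bare-bones Selenium version might look like the sketch below; the URL and the CSS selector are placeholders, not taken from any real page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://example.com/dynamic-page')  # placeholder URL
# Wait for the JavaScript-generated element to show up, then read its text
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, '.dynamic-data')))  # placeholder selector
print(element.text)
driver.quit()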
I am trying to make a coronavirus tracker using BeautifulSoup, just for some practice.
My code is:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://sample.com")
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find("div", class_="ZDcxi")
print(table)  # prints None
The output shows None, but the div tag with the class ZDcxi does have content. Please help!
The data you see in the browser, including the target div, is dynamic content, generated by scripts that are included with the page and run in the browser. If you just search for the class name in page.content, you will find it is not there.
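You can verify this with a quick check against the raw response (reusing the placeholder URL from the question):

import requests

page = requests.get("https://sample.com")  # placeholder URL from the question
# The class name never appears in the raw HTML that requests receives,
# so this is expected to print False for JavaScript-rendered content
print("ZDcxi" in page.text)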
What many people do is use Selenium to open the desired pages in Chrome (or another web browser) and then, after the page finishes loading and generating its dynamic content, use BeautifulSoup to harvest the content from the browser and continue processing from there.
Find out more at Requests vs Selenium Python, and by searching for 'selenium vs requests'.
I am currently using Selenium to crawl data from some websites. Unlike with urllib, it seems that I don't really need a parser like BeautifulSoup to parse the HTML: I can simply find an element with Selenium and use WebElement.text to get the data I need. Still, I've seen people use Selenium and BeautifulSoup together in web crawling. Is that really necessary? Are there any special features bs4 offers that improve the crawling process? Thank you.
Selenium itself is quite powerful in terms of locating elements, and it basically has everything you need for extracting data from HTML. The problem is that it is slow: every single Selenium command goes through the JSON Wire HTTP protocol, which adds substantial overhead.
To speed up the HTML-parsing part, it is usually much faster to let BeautifulSoup or lxml parse the page source retrieved via driver.page_source.
In other words, a common workflow for a dynamic web page looks something like this (see the sketch after the list):
open the page in a browser controlled by selenium
make the necessary browser actions
once the desired data is on the page, get the driver.page_source and close the browser
pass the page source to an HTML parser for further parsing
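Put together, a sketch of that workflow might look like this; the URL, the link text, and the row selector are all placeholders:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# 1. Open the page in a browser controlled by Selenium
driver = webdriver.Chrome()
driver.get('https://example.com/data')  # placeholder URL

# 2. Make the necessary browser actions, e.g. open a tab
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, 'History'))).click()  # placeholder link text

# 3. Once the desired data is on the page, grab the source and close the browser
html = driver.page_source
driver.quit()

# 4. Pass the page source to an HTML parser for further parsing
soup = BeautifulSoup(html, 'lxml')
rows = soup.select('table tr')  # placeholder selector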