I'm trying to download a file using Selenium and BeautifulSoup, but am running into some issues with the way the website is set up. I can see there is a table object containing the link I want deep in the code, but I'm having difficulty instructing BeautifulSoup and Selenium to navigate that far and find the link. The website is https://www.theice.com/clear-us/risk-management#margin-rates and I want to download the Margin Scanning file.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.theice.com/clear-us/risk-management#margin-rates'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request(url, headers=hdr)
icepage = urllib.request.urlopen(req)
htmlitem = icepage.read()
soup = BeautifulSoup(htmlitem, 'lxml')
divs = soup.find('div', {'class': 'sticky-header__main'})
print(divs.findChild().find('div', {'class': 'row'}).find('div', {'class': '1-main true-grid-10'}).find_all('div')[2])
From there, divs.findChild().find('div',{'class':'row'}).find('div',{'class':'1-main true-grid-10'}).find_all('div')[2] is the closest I have gotten to selecting the next div, which has id='content-5485eefe-b105-49ed-b1ac-7e9470d29262', and I want to drill down from there to the ICUS_MARGIN_SCANNING csv in the table five or six div levels below.
With Selenium I'm even more lost: I've been trying variations of driver.find_element_by_link_text('Margin Scanning') and getting nothing back.
Any help with accessing that table and the ICUS_Margin_scanning file would be much appreciated. Thank you!
I used F12 => the Network tab and found the page that actually serves the table, so here you go:
from bs4 import BeautifulSoup
import requests
import datetime

BASE_API_URL = 'https://www.theice.com'
# The table is served from this endpoint; the trailing parameter is a
# cache-busting timestamp in milliseconds
r = requests.get(f'https://www.theice.com/marginrates/ClearUSMarginParameterFiles.shtml?getParameterFileTable&category=Current&_={int(datetime.datetime.now().timestamp()*1000)}')
soup = BeautifulSoup(r.content, features='lxml')
margin_scanning_link = BASE_API_URL + soup.find_all("a", string="Margin Scanning")[0].attrs['href']
margin_scanning_file = requests.get(margin_scanning_link)
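The anchor-text lookup at the heart of this answer can be sketched offline on a small stand-in snippet (the HTML below is hypothetical, not the real ICE markup), which makes the find_all("a", string=...) step easy to verify on its own:

```python
from bs4 import BeautifulSoup

BASE_API_URL = 'https://www.theice.com'

# Hypothetical stand-in for the table fragment the endpoint returns
html = """
<table>
  <tr><td><a href="/publicdocs/ICUS_MARGIN_SCANNING.csv">Margin Scanning</a></td></tr>
  <tr><td><a href="/publicdocs/other.csv">Something Else</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
# string= matches anchors whose text is exactly "Margin Scanning"
link = BASE_API_URL + soup.find_all("a", string="Margin Scanning")[0].attrs['href']
print(link)
```

Once the link is known, the bytes from requests.get(link).content can be written straight to a .csv file on disk.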
I was writing a scraping program. I first used Selenium to get the element's source (an mp4 file), but then I read that Selenium is mainly meant for automation and testing, not scraping, and I thought using other scraper modules would be more reasonable. However, when I use requests+BeautifulSoup or urllib2/3+BeautifulSoup, I can't get to the elements I see in the inspector. They fetch the page source, but on the page I'm working with, the page source is not the same as the HTML that shows up when I inspect it. (I don't know much about the difference between inspect and page source, but I guess it has something to do with JS.) Any ideas how I can solve this issue?
here is my code:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://animefrenzy.org/stream/one-piece-episode-974")
soup = BeautifulSoup(response.text,"lxml")
print(soup)
here is the HTML I want as a string: (screenshot of the inspector)
here is the result I get when I execute the above code: (screenshot of the terminal output)
If you just want the HTML (source), here's the code to get that:
from selenium import webdriver
import time

# Use webdriver.Firefox(...) or webdriver.Chrome(...) depending on your browser
driver = webdriver.Chrome(executable_path=r'/path/to/webdriver')
driver.get('https://animefrenzy.org/stream/one-piece-episode-974')
time.sleep(10)  # give the page time to run its JavaScript
html = driver.page_source
print(html)
This should give you the HTML that you want. We used time.sleep(10) because the page has to load JavaScript and change its content. If you are not getting the desired HTML, try increasing the sleep time a little so the page loads completely.
I am trying to make a coronavirus tracker using BeautifulSoup, just for some practice.
My code is:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://sample.com")
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find("div", class_="ZDcxi")
print(table)
The output shows None, but the div tag with the class ZDcxi does have content.
Please help.
The data that you see in the browser, including the target div, is dynamic content, generated by scripts that are included with the page and run in the browser. If you search for the class name in page.content, you will find it is not there.
What many people do is use selenium to open desired pages through Chrome (or another web browser), and then, after the page finishes loading and generating dynamic content, use BeautifulSoup to harvest the content from the browser, and continue processing from there.
Find out more at Requests vs Selenium Python, and also by searching for "selenium vs requests".
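The Selenium-then-BeautifulSoup handoff described above is just two lines of glue: take driver.page_source after the page settles and feed it to BeautifulSoup. A minimal sketch, with an inline string standing in for page_source so the parsing step can be shown on its own:

```python
from bs4 import BeautifulSoup

# In the real flow this string would come from Selenium after the page's
# scripts have run, e.g.:
#   driver.get(url); time.sleep(5); page_source = driver.page_source
page_source = '<html><body><div class="ZDcxi">42 cases</div></body></html>'

# Harvest the browser-rendered HTML exactly as you would a static page
soup = BeautifulSoup(page_source, 'html.parser')
table = soup.find("div", class_="ZDcxi")
print(table.text)
```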
I'm working on an app that uses web scraping, but I'm having a hard time figuring out how to get some data from a web page. I can see the info I'm looking for when I use "inspect element" in Firefox:
The thing is that it doesn't appear in the HTML source of the page, which I can actually get using Selenium. The data I'm looking for is obviously database driven, and I'm stuck right there. Is there a way to scrape this out with Selenium?
This is the url btw: http://2ez.gg/#gg?name=Doombag&server=lan
You should probably be trying to scrape http://lan.op.gg/summoner/userName=Doombag instead; http://2ez.gg/#gg?name=Doombag&server=lan contains an iframe, which is why you can't find 55% in the document body.
The reason is that the data you want to retrieve is contained inside an iframe, so Selenium cannot reach it directly.
Try the following code:
from selenium import webdriver

driver = webdriver.Chrome()
URL = 'http://2ez.gg/#gg?name=Doombag&server=lan'
driver.get(URL)
driver.switch_to.frame('iframe-content')  # enter the iframe before locating the element
elem = driver.find_element_by_css_selector('.WinRatioGraph div.Text')
print(elem.text)
Output: 55%
Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty, even though I can see the divs when I use "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (by their class name) using BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I want to get all the divs with class 'product-list-item'.
Try using Selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
You can get all the info by changing the URL; the right link can be found in Chrome dev tools > Network.
The reason you got nothing from that specific URL is simply that the info you need is not there.
So first let me explain a little about how that page is loaded in a browser. When you request the page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase; that is what you got from your urllib2 request. The browser then starts to read and parse that content, which basically tells the browser where to find everything it needs to render the whole page (e.g. CSS to control layout, additional javascript/urls/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded. The info you want is not at the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all the traffic when that page loads (I would recommend Fiddler).
As you can see, lots of things happen when you open that page in a browser! (And that's only part of the whole page-loading process.) By educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job; you can get it here.
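Since those api.hm.com responses are already JSON, the built-in json module is all that's needed once the raw bytes are in hand. A sketch with a made-up payload (the field names below are hypothetical; the real structure would have to be read off the actual response):

```python
import json

# Hypothetical payload standing in for an api.hm.com response body;
# in the real flow this string would come from the captured request's URL
raw = '{"products": [{"title": "Knit jumper", "link": "/sg/product/12345"}]}'

# Parse the JSON and walk the structure directly; no HTML parsing needed
data = json.loads(raw)
for product in data["products"]:
    print(product["title"], product["link"])
```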
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')
scrapdiv = open('scrapdiv.txt', 'w')
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()
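Since the goal is ultimately to follow the hrefs under those divs, the harvesting step can be sketched on a stand-in fragment (the markup below is hypothetical; the real structure would come from the fully loaded page). This sketch is written for Python 3, unlike the urllib2 answer above:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fully rendered product list
html = """
<div class="product-list-item"><a href="/sg/product/1">Dress</a></div>
<div class="product-list-item"><a href="/sg/product/2">Top</a></div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Collect the href of the first anchor inside each product div
hrefs = [div.find('a')['href']
         for div in soup.find_all('div', class_='product-list-item')]
print(hrefs)
```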
I'm trying to scrape the crash numbers off of csgocrash.com, but BeautifulSoup can't seem to load the whole webpage. Is this because the content doesn't load until after it says "game loading..."? How would I fix that? Once I can navigate to the history tab of the website in BeautifulSoup, I think I will be able to accomplish my goal.