I would like to scrape this web page:
but I'm finding it hard, nearly impossible. How do I scrape the news headlines, once for each radio button? (Bearish and Bullish are radio buttons.)
I also need to do it for EUR and USD (so 4 scrapes from that page). How can I do this?
Nothing seems to work. I don't have a lot of knowledge of BeautifulSoup, but I've tried. I've tried with CSS selectors, with classes and IDs, and I always get nothing back. It seems it doesn't find those classes.
This was my last attempt:
for e in html.res.select("[class~=fxs_tab]"):
    print(e.li.text)
    print(type(e.li.text))
    titulares.append(e.li.text)
These are the headers I would like to scrape:
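Since the selector in that attempt may be fine while the content simply isn't in the downloaded HTML, a useful first check is to fetch the page with requests and run the same kind of select over it. A minimal sketch, assuming the fxs_tab elements hold <li> items as in the attempt above (the URL is a placeholder, not the real page):

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/news"  # placeholder; use the actual page URL
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

titulares = []
for e in soup.select("[class~=fxs_tab]"):
    li = e.find("li")  # guard against tabs that have no <li>
    if li is not None:
        titulares.append(li.get_text(strip=True))
print(titulares)

If this prints an empty list, the headlines are most likely rendered by JavaScript after the page loads, in which case requests + BeautifulSoup alone won't see them and a browser-driven tool like Selenium is needed.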
Related
I'm trying to scrape data from a specific match from the website Sofascore (I use Python & Selenium).
I can access it by first going to https://www.sofascore.com/football/2022-05-12 and then clicking the match Tottenham - Arsenal with URL https://www.sofascore.com/arsenal-tottenham-hotspur/IsR
However, when I enter this link directly from my browser, I arrive at a completely different page for a future match.
Is there a way to differentiate the two pages so I can scrape the original match?
Thanks
I checked the pages you talked about. The first page (https://www.sofascore.com/football/2022-05-12) sends a lot of information to the server.
That's why you get a certain page. If you want to solve this with requests, you'll need to record everything it sends with Burp Suite or a similar tool.
You're probably better off just opening the first page with Selenium, clicking through, and getting the page you want...
If you want to check what the current page is in Selenium, you can check whether the content is what you expect to be on that page...
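A minimal sketch of that Selenium route, assuming Chrome and that the fixture shows up as a link whose text contains "Tottenham" (both assumptions; adjust the locator to what the page actually renders):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # give the page time to render its links
driver.get("https://www.sofascore.com/football/2022-05-12")

# Click through to the match from the date page, so you land on the
# played match rather than the future fixture.
match_link = driver.find_element(By.PARTIAL_LINK_TEXT, "Tottenham")
match_link.click()

# Sanity-check that this is the page you expect before scraping it,
# e.g. by looking for text that only the finished match would show.
if "Tottenham" in driver.page_source:
    print("on the match page")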
Jonthan
I'm trying to scrape a BBC website
https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests
and I would like to get all the news articles. But the URL doesn't change when clicking on the next page button, so I can only get the first page's information. Can anyone help? I'm using Selenium but am familiar with requests too. Thanks!
Use the developer console in your browser, go to the Network tab, and disable the cache.
You can see API requests being made for each page change. You don't need Selenium; you can just use requests or aiohttp.
This is an example:
https://push.api.bbci.co.uk/batch?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged%2Fabout%2Fd5803bfc-472d-4abf-b334-d3fc4aa8ebf9%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro%2FpageNumber%2F2%2Fversion%2F1.5.6?timeout=5
Type "batch" in the filter bar and you should see only the API calls that I believe are responsible for the page change.
You can get the about id (d5803bfc-472d-4abf-b334-d3fc4aa8ebf9) of this topic from the webpage source, in this case at https://www.bbc.com/news/topics/c95yz8vxvy8t/hong-kong-anti-government-protests
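A minimal sketch of paging with requests, built from the example URL above by varying the pageNumber segment (the JSON layout of the response is an assumption; dump one response first to see where the article data sits):

import requests

about_id = "d5803bfc-472d-4abf-b334-d3fc4aa8ebf9"  # taken from the page source
base = ("https://push.api.bbci.co.uk/batch?t=%2Fdata%2Fbbc-morph-lx-commentary-data-paged"
        "%2Fabout%2F{aid}%2FisUk%2Ffalse%2Flimit%2F20%2FnitroKey%2Flx-nitro"
        "%2FpageNumber%2F{page}%2Fversion%2F1.5.6?timeout=5")

for page in range(1, 4):  # first three pages, as an example
    data = requests.get(base.format(aid=about_id, page=page)).json()
    # Inspect the structure once to find the article list before parsing it.
    print(page, type(data))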
I've developed a web scraper that extracts reviews from a particular shopping website. It's coded in Python, and the scraping is based on Selenium + BS4. But my client thinks it's TOO SLOW and wants it to use Requests instead. To scrape the reviews, I have to wait until the reviews show up (or click a review tab) and then page through every review. I'm guessing the review div is an XHR element or AJAX, because the whole page doesn't reload when I click the next page. All the parsing is done with BeautifulSoup.
I'm leaving a URL so you can all go and check!
https://smartstore.naver.com/hoskus/products/4351439834?NaPm=ct%3Dkeuq83q8%7Cci%3Df5e8bd34633b9b48d81db83b289f1b2e0512d2f0%7Ctr%3Dslsl%7Csn%3D888653%7Chk%3D9822b0c3e9b322fa2d874575218c223ce2454a42
I've always thought Requests reads the HTML far faster than Selenium does. But I don't know how to obtain the HTML when it's all hidden behind buttons. Does anybody have an idea? Or something I can refer to?
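The usual way is to find the review XHR in the Network tab and call it directly with requests. A sketch of that pattern, where the URL, the parameter names, and the response shape are all hypothetical placeholders to be replaced with what the Network tab actually shows when you click the review pager:

import requests

url = "https://smartstore.naver.com/example/review-endpoint"  # hypothetical placeholder
params = {"page": 1, "pageSize": 20}  # assumed parameter names
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, params=params, headers=headers)
data = response.json()  # XHR endpoints typically return JSON rather than HTML
print(data)

Once the real endpoint is known, looping over its page parameter replaces all the clicking and waiting that makes the Selenium version slow.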
I am currently beginning to use BeautifulSoup to scrape websites. I think I have the basics down, even though I lack theoretical knowledge about webpages; I will do my best to formulate my question.
What I mean by a dynamic webpage is the following: a site whose HTML changes based on user action; in my case, it's collapsible tables.
I want to obtain the data inside some "div" tag, but when you load the page, the data seems unavailable in the HTML code. When you click on the table, it expands, and the "class" of this "div" changes from something like "something blabla collapsible" to "something blabla collapsible active", and that I can scrape with my knowledge.
Can I get this data using BeautifulSoup? If I can't, I thought of using something like Selenium to click on all the tables and then download the HTML, which I could then scrape. Is there an easier way?
Thank you very much.
It depends. If the data is already loaded when the page loads, then it is available to scrape; it's just in a different element, or being hidden. If the click event triggers loading of the data in some way, then no: you will need Selenium or another headless browser to automate this.
Beautiful Soup is only an HTML parser, so whatever data you get by requesting the page is the only data that Beautiful Soup can access.
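If the click does trigger a network load, here is a minimal sketch of the Selenium-then-BeautifulSoup approach, assuming Chrome and the "collapsible" / "collapsible active" class names described in the question (the URL is a placeholder):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com/tables")  # placeholder URL

# Expand every collapsible table so its content is rendered into the DOM.
for toggle in driver.find_elements(By.CSS_SELECTOR, "div.collapsible"):
    toggle.click()

# Hand the fully expanded HTML to BeautifulSoup for parsing.
soup = BeautifulSoup(driver.page_source, "html.parser")
for div in soup.select("div.collapsible.active"):
    print(div.get_text(strip=True))

driver.quit()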
I am new to Python and web crawling. I intend to scrape links in the top stories of a website. I was told to look at its AJAX requests and send similar ones. The problem is that all requests for the links are the same: http://www.marketwatch.com/newsviewer/mktwheadlines
My question is how to extract links from an infinite scrolling box like this. I am using Beautiful Soup, but I don't think it's suitable for this task. I am also not familiar with Selenium or JavaScript. I do know how to scrape certain requests with Scrapy, though.
It is indeed an AJAX request. If you take a look at the Network tab in your browser inspector:
You can see that it's making a POST request to download the URLs of the articles.
Every value is self-explanatory here except maybe for docId and timestamp. docId seems to indicate which box to pull articles for (there are multiple boxes on the page), and it seems to be the id attached to the <li> element under which the article URLs are stored.
Fortunately, in this case POST and GET are interchangeable. Also, the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":
http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true
This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
You can mess around more to reverse engineer how the website does it and what every keyword is used for, but this is a good start.
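A minimal sketch of issuing that same request with requests, with the parameters copied from the URL above (whether the links come back as HTML or JSON is an assumption; print response.text once to confirm before parsing):

import requests
from bs4 import BeautifulSoup

url = "http://www.marketwatch.com/newsviewer/mktwheadlines"
params = {
    "blogs": "true",
    "commentary": "true",
    "docId": "1275261016",
    "premium": "true",
    "pullCount": "100",
    "pulse": "true",
    "rtheadlines": "true",
    "topic": "All Topics",
    "topstories": "true",
    "video": "true",
}
response = requests.get(url, params=params)

# Assuming the response is HTML holding the article links:
soup = BeautifulSoup(response.text, "html.parser")
for a in soup.find_all("a", href=True):
    print(a["href"])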