I am trying to scrape this table from Etherscan.io: https://etherscan.io/tokens?ps=100&p=1
Since I am not that familiar with XPath, I have chosen to use the Chrome extension Webscrap.
The problem is that the table is spread over several pages.
Is there a way of scraping all the pages at once without mapping each page one by one?
I could also try to do it with Python directly; I know there are some pretty good libraries out there.
Is that doable? Would it take me a long time to learn (I know very little about HTML and XPath)?
If so, what would be the easiest and quickest library to learn?
Related
I have made several attempts to scrape the following page:
https://www.finanzen.ch/rohstoffe/historisch/weizenpreis/euro/17.4.2022_17.5.2022
Somehow, I have not been successful with either the requests or the Selenium approach.
Does anybody have an idea how to scrape the data from the historical data table?
Thanks for your hints.
ThinkerBell
You can't get past this website with a simple requests.get, and even Selenium/Splash or rotating proxies won't always work. This is because the website uses a CAPTCHA service and can tell how you are trying to access the page. The headers contain "Content-Disposition: form-data; name='recaptcha-token';" with a long encoded value, and since this value is based on your browsing activity, copy-pasting it into your own headers won't work either.
For such tricky websites, the best option is to use a browser-based add-on like "iMacro". You may also improve your chances with Selenium if you start at the homepage and load a few more dummy links before reaching the target link.
I am planning to automate the maintenance of our web scraper's selector rules. Currently we use Selenium to scrape web pages, and it is a universal problem that websites tend to change their DOM structure. Is there a way to automate the process so that, if a particular rule breaks, the rules are rebuilt based on the new DOM structure? I know it is extremely difficult to do when the DOM structure changes completely, but is there a way to identify and fix leaf-level changes?
As far as I know, there is no Python framework that gives a direct solution to this. Is there any Python library I should look at to help me with this?
I am starting with the URL below:
http://www.imdb.com/chart/top
The structure of the HTML file seems very confusing:
"
Metascore: "
I am trying to use a format like this:
movie['metascore'] = self.get_text(soup.find('h4', attrs={' ':'Metascore'}))
I'll take a stab at this since it sounds like you're new to scraping. What it sounds like you're actually trying to do is to get the budget, gross, and metascore from each of the individual 250 movie pages on IMDB. You're on the right track by mentioning Scrapy because you do have to crawl to those pages from the initial URL you provided. Scrapy has some excellent documentation, so if you want to use it, I highly recommend you start there first.
However, if all you need is to scrape those 250 pages, you're better off just using Beautiful Soup to do the whole job. Simply do a soup.findAll("td", {"class":"titleColumn"}), extract the links, then write a loop that fetches each of those pages one at a time and parses it with Beautiful Soup. If you're not sure how to do that, again, BS has excellent documentation.
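As a rough illustration of the link-extraction step, here is a minimal sketch. It uses only the standard library's html.parser instead of Beautiful Soup so that it runs without installing anything; with Beautiful Soup, the findAll call mentioned above does the same job in one line. The SAMPLE_HTML markup and the names TitleLinkParser and extract_title_links are made up for the example, and the real chart page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical sample of the chart markup; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td class="titleColumn"><a href="/title/tt0000001/">First Movie</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/">Second Movie</a></td></tr>
  <tr><td class="ratingColumn"><a href="/ignored/">Not a title link</a></td></tr>
</table>
"""


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <td class="titleColumn">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_title_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Track whether we are inside a cell with the titleColumn class.
            classes = (attrs.get("class") or "").split()
            self.in_title_cell = "titleColumn" in classes
        elif tag == "a" and self.in_title_cell and attrs.get("href"):
            # Turn the relative href into an absolute URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title_cell = False


def extract_title_links(html, base_url="http://www.imdb.com/chart/top"):
    parser = TitleLinkParser(base_url)
    parser.feed(html)
    return parser.links


movie_links = extract_title_links(SAMPLE_HTML)
for link in movie_links:
    print(link)
```

Each URL in movie_links would then be fetched and parsed in turn inside the loop.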
From there, it's just a matter of scraping the relevant data you want during each iteration. For instance, the metascore of each film is inside a <div> with the class star-box-details. Do a .find for that and then you'll have to use some regular expressions to extract the exact piece you want (regular-expressions.info has a great tutorial on regex and if you really get into regex, you'll probably end up sinking hours into RexEgg).
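To give an idea of the regex step, here is a small sketch that pulls a number out of the text of such a <div>. The SAMPLE_DIV fragment, the pattern, and the function name extract_metascore are all assumptions for illustration; the actual markup on the movie pages may look different and the pattern would need adjusting.

```python
import re

# Hypothetical fragment of a movie page; the real markup may differ.
SAMPLE_DIV = """
<div class="star-box-details">
  Ratings: <strong>8.5</strong>/10 from 1,234 users
  Metascore: <a href="criticreviews">82/100</a>
</div>
"""

# "Metascore:", then any tags or whitespace, then a score out of 100.
METASCORE_RE = re.compile(r"Metascore:\s*(?:<[^>]+>\s*)*(\d{1,3})\s*/\s*100")


def extract_metascore(html):
    """Return the Metascore as an int, or None if the pattern is not found."""
    match = METASCORE_RE.search(html)
    return int(match.group(1)) if match else None


print(extract_metascore(SAMPLE_DIV))
print(extract_metascore("<div>no score here</div>"))
```

Returning None for a missing score keeps the loop over the 250 pages from crashing on a film that has no Metascore.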
I'm not going to code the whole thing since you'll learn a lot through the trial and error that comes with attempting to solve things, but hopefully that puts you on the right track. However, do note that IMDB forbids scraping, but for small projects I'm sure no one will care. But if you want to get serious, the "Does IMDB provide an API?" post has some excellent resources for how to do it via various third-party APIs (and some even directly from IMDB). In your case, the best might be to simply download the data as text files directly from IMDB. Click on any of the FTP links. The files you'll probably want are business.list.gz and ratings.list.gz. As for the metascore on each movie page, that rating actually comes from Metacritic, so you'll want to go there to pull that data.
Good luck!
I'd like to know if it is possible to browse all the links in a site (including the parent links and sublinks) using Python and Selenium (example: yahoo.com):
fetch all links in the homepage,
open each one of them,
open all the links in the sublinks, three or four levels deep.
I'm using selenium on python.
Thanks
Ala'a
You want "web-scraping" software like Scrapy and possibly Beautifulsoup4 - the first is used to build a program called a "spider" which "crawls" through web pages, extracting structured data from them, and following certain (or all) links in them. BS4 is also for extracting data from web pages, and combined with libraries like requests can be used to build your own spider, though at this point something like Scrapy is probably more relevant to what you need.
There are numerous tutorials and examples out there to help you - just start with the google search I linked above.
Sure, it is possible, but you have to instruct Selenium to visit these links one by one, as you are working within one browser.
If the links on the pages are not rendered by JavaScript in the browser, it would be much more efficient to fetch these pages with direct HTTP requests and process them that way. In this case I would recommend using requests. However, even with requests, it is up to your code to locate all the URLs in a page and follow up by fetching those pages.
There might also be other Python packages that specialize in this kind of task, but I have no real experience with them.
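To make the "locate all the URLs and follow them" part concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It uses urllib.request from the standard library rather than requests so that it has no dependencies; the fetch function is pluggable, so requests.get(url).text could be swapped in. The names crawl, fetch_html, LinkParser and the FAKE_SITE data are invented for this example, and the demo runs against an in-memory site instead of a real one.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url):
    """Download a page over plain HTTP (no JavaScript is executed)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=3, fetch=fetch_html, same_host_only=True):
    """Return {url: depth} for every URL discovered within max_depth hops."""
    start_host = urlparse(start_url).netloc
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if not link.startswith(("http://", "https://")):
                continue  # ignore mailto:, javascript:, etc.
            if same_host_only and urlparse(link).netloc != start_host:
                continue  # stay on the starting site
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen


# In-memory stand-in for a website, so the sketch runs without network access.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/c">C</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="http://other.example/x">external</a>',
    "http://example.com/c": '<a href="/d">D</a>',
    "http://example.com/d": "",
}


def fake_fetch(url):
    return FAKE_SITE[url]


visited = crawl("http://example.com/", max_depth=2, fetch=fake_fetch)
for url, depth in sorted(visited.items()):
    print(depth, url)
```

The seen dictionary is what stops the crawl from revisiting pages that link back to each other, and same_host_only keeps it from wandering off to the rest of the web, which matters on a portal-style homepage with many outbound links.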
I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual urls ahead of time. For example, I want to visit every page whose url starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.
To grab a specific bit of data from a website, you could use a web scraping tool, e.g., Scrapy.
If the required data is generated by JavaScript, then you might need a browser-like tool such as Selenium WebDriver and have to implement the crawling of the links by hand.
For example, you can make a simple for loop, like this:
def web_iterate():
    base_link = "http://stackoverflow.com/questions/"
    # Print the URL for each question number from 0 to 23.
    for i in range(24):
        print("%s%d" % (base_link, i))

web_iterate()
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
http://stackoverflow.com/questions/3
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the question numbers and do whatever you want with them.
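On the question of avoiding a giant list of URLs: one option is a generator, which produces each URL only when the loop asks for it. This is a sketch under the same assumption as the loop above, namely that pages can be addressed by consecutive numbers, which will not hold for every ID on a real site; the name question_urls and its parameters are made up for the example.

```python
from itertools import islice


def question_urls(start=0, stop=None,
                  base_link="http://stackoverflow.com/questions/"):
    """Yield question URLs one at a time instead of building a list."""
    question_id = start
    # With stop=None the generator is unbounded; the caller decides when to stop.
    while stop is None or question_id < stop:
        yield "%s%d" % (base_link, question_id)
        question_id += 1


# Take only the first three URLs from the unbounded generator.
first_three = list(islice(question_urls(), 3))
print(first_three)

# Or iterate over a bounded range without storing it anywhere.
for url in question_urls(start=20, stop=24):
    print(url)
```

Because nothing is stored, memory use stays constant no matter how many URLs are visited.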