So I want to make a program that scrapes info from the Lifesaving Society website: https://www.lifesavingsociety.com/find-a-member.aspx
I would be coding this in Python, by the way. Basically, I want the program to take in a list of lifesaving IDs and return info about when each person's certifications expire. The problem is that, to scrape this data, I feel the program would need to individually enter each lifesaver's ID and then scrape the results. This feels like something that would take a long time, and I thought maybe there was a better way to do it. Any ideas?
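For reference, the straightforward loop-per-ID approach might look like the sketch below. The form-field name here is hypothetical; inspect the actual form in your browser's dev tools (ASP.NET pages usually also require hidden fields such as __VIEWSTATE, which is why the sketch echoes back every input it finds).

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.lifesavingsociety.com/find-a-member.aspx"

def lookup_member(member_id):
    session = requests.Session()
    # Fetch the form page first so we can echo back the hidden ASP.NET fields.
    soup = BeautifulSoup(session.get(SEARCH_URL).text, "html.parser")
    form_data = {
        field["name"]: field.get("value", "")
        for field in soup.find_all("input") if field.get("name")
    }
    form_data["memberId"] = member_id  # hypothetical field name -- check the real one
    return session.post(SEARCH_URL, data=form_data).text

for member_id in ["12345", "67890"]:  # your list of lifesaving IDs
    html = lookup_member(member_id)
    # ... parse the certification expiry dates out of `html` here ...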
Related
I'm scraping AliExpress products such as this one using Python. It has multiple variations, each with its own price. When one is clicked, the price updates to reflect the choice.
In a similar fashion, there are multiple buttons to choose where you want the item to be shipped from, which updates the shipping cost accordingly.
I want to scrape each variation's price as shipped from each country. How can I do that without simulating clicks to change the prices? Where is the underlying logic that governs these price changes laid out? I couldn't find it when inspecting elements. Is it easily decipherable?
Or do I just need to give up and simulate clicks? If so, would that be done with Selenium? The reason I would prefer to extract the data without clicking is that, for products such as the one I linked to, there are 49 variations and 5 places from which the product is shipped, so it would be a lot of clicking and a rather inelegant approach.
Thanks a lot!
Take a look in the browser; all the data is already in the DOM.
Type window.runParams.data.skuModule.skuPriceList in your console and you will see it.
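If that object is embedded in the page, you can often get it without a browser at all: fetch the raw HTML and pull out the JSON. A rough sketch follows; the regex and the assumption that the object literal parses as strict JSON both need verifying against the real page source.

import json
import re
import requests

product_url = "https://www.aliexpress.com/item/your-product-id.html"  # replace with the real product URL
resp = requests.get(product_url, headers={"User-Agent": "Mozilla/5.0"})

# The page assigns its data to window.runParams; extract the object literal.
# This pattern is an assumption -- adjust it to match the actual page source,
# and note that json.loads only works if the literal happens to be valid JSON.
match = re.search(r"window\.runParams\s*=\s*(\{.*?\});", resp.text, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    for sku in data["data"]["skuModule"]["skuPriceList"]:
        print(sku)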
I know that e-commerce companies apply this kind of logic in their backend APIs, and they protect those APIs from ordinary users (for instance with tools like Consul, which resolves the IPs received from the front end).
Now, coming to your question, there can be two cases.
First case: the frontend receives the data from the backend and applies its own logic. In that case the frontend has already received all the data related to the variants and their prices, is storing it in some data structure on its end, and updates the values in the view only when you click an item. (You can tell this is the case if there is no delay after clicking and the result is shown instantly.) You can also inspect the response fetched from the backend; it is bound to contain all the data the frontend is receiving and storing. Check in Chrome dev tools -> Network and filter for the gql requests.
Second case: the frontend fetches the data from the backend each time you click. In that case it is changing some parameters on the request URL. If you can work out the logic behind how the parameters change between variants, you can fetch the information directly. (In this case there will be a delay in showing results after clicking.)
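For that second case, the usual approach is to reproduce the backend request yourself: watch the Network tab while clicking, copy the request, then vary its parameters from Python. A sketch, with an entirely hypothetical endpoint and parameter names:

import requests

API_URL = "https://example.com/api/price"  # copy the real endpoint from the Network tab

for sku_id in ["sku-1", "sku-2"]:      # hypothetical variant identifiers
    for ship_from in ["CN", "US"]:     # hypothetical shipping origins
        resp = requests.get(API_URL, params={"skuId": sku_id, "shipFrom": ship_from})
        print(sku_id, ship_from, resp.json())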
I think it's a good idea to use Selenium or Cypress. I know it will take time, but it's the best option you've got.
I'm looking into web scraping/crawling possibilities and have been reading up on Scrapy. I was wondering if anyone knows whether it's possible to give the script instructions so that, once it has visited the URL, it can choose pre-selected dates from a calendar on the website?
The end result is for this to be used for price comparisons on sites such as Trivago. I'm hoping I can get the program to select certain criteria, such as dates, once it's on the website, like a human would.
Thanks,
Alex
In theory, for a website like Trivago you can use the URL to set the dates you want to query, but you will need to research user agents and proxies, because otherwise your IP will get blacklisted really fast.
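A sketch of that idea, assuming hypothetical query-parameter names for the dates (copy the real ones from the URL your browser shows after picking dates on the site) and a small pool of user agents:

import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def fetch_results(check_in, check_out):
    # "checkIn"/"checkOut" and the /search path are hypothetical.
    params = {"checkIn": check_in, "checkOut": check_out}
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get("https://www.trivago.com/search", params=params, headers=headers)

for dates in [("2024-07-01", "2024-07-05"), ("2024-08-01", "2024-08-05")]:
    resp = fetch_results(*dates)
    # ... parse resp.text for prices here ...
    time.sleep(random.uniform(2, 5))  # pause between queries so you don't hammer the site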
To help me learn Python, I decided to screen-scrape the football commentaries from the ESPNFC website's 'live' pages (such as here).
It was working up until a day ago, but having finally sorted some things out, I went to test it, and the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and any quick and easy ways around it? I am using Scrapy/XPath and urllib2.
Edit//
for game_id in processQueue:
    data_text = getInformation(game_id)
    clean_events_dict = getEvents(data_text)
    break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list of game_ids. The first one is given to the script to start scraping, and the loop is broken out of before it has a chance to move on to another game_id.
In the second sample I use a single game id.
The first one fails and the second one works, and I have absolutely no idea why. Any ideas?
There are a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that website operators are generally within their rights to block you; this is why projects that rely on scraping a single site are a risky proposition. Here they are:
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer here that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well-behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent error I see in this Stack Overflow tag is simply that scrapers are being run far too fast, accidentally causing a (minor) denial of service. So try slowing down your scrapes first and see if that helps; a minimal sketch of the slow-and-polite approach follows.
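Combining the first few suggestions, e.g. a session that keeps cookies plus a randomized delay between requests, might look like this (URLs and timings are placeholders):

import random
import time
import requests

session = requests.Session()  # a Session accepts and re-sends cookies automatically

urls = ["http://www.espnfc.com/page-1", "http://www.espnfc.com/page-2"]  # placeholder URLs

for url in urls:
    resp = session.get(url)
    # ... parse resp.text here ...
    time.sleep(random.uniform(3, 8))  # random delay between scrapes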
Please help me solve the following case:
Imagine a typical classified-ads category page: a page with a list of items. When you click on an item, you land on its internal page. Currently my crawler scrapes all these URLs, then scrapes each URL to get the details of the item, then checks whether the initial seed URL has a next page. If it does, it goes to the next page and does the same. I am storing these items in a SQL database.
Say that 3 days later there are new items on the seed URL and I want to scrape only the new items. Possible solutions are:
At the time of scraping each item, I check the database to see if the URL has already been scraped. If it has, I simply ask Scrapy to stop crawling further.
Problem: I don't want to query the database every time. My database is going to be really large, and that would eventually make crawling super slow.
I store the last scraped URL and pass it in at the beginning; the moment the crawler finds this last_scraped_url, it simply stops.
Problem: not possible, given the asynchronous nature of crawling; URLs are not scraped in the same order they are received from the seed URLs.
(I have tried every way of making it orderly, but that just isn't possible.)
Can anybody suggest any other ideas? I have been struggling with this for the past three days.
Appreciate your replies.
Before trying to give you an idea...
I must say I would try your database option first. Databases are made for exactly that, and even if your DB gets really big, it should not slow the crawling significantly.
And one lesson I have learned: "First do the dumb implementation. After that, you try to optimize." Most of the time, when you optimize first, you just optimize the wrong part.
But, if you really want another idea...
By default, Scrapy does not crawl the same URL twice. So, before starting the crawl, you could put the already-scraped URLs (from 3 days before) into the list Scrapy uses to know which URLs have already been visited. (I don't know how to do that.)
Or, simpler, in your item parser you can just check whether the URL has already been scraped, and return None or scrape the new item accordingly; see the sketch below.
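A sketch of that second idea inside a Scrapy spider: load the already-scraped URLs into a set once at start-up (a set lookup is far cheaper than a per-item database query), then skip known items in the parser. The selector and helper names here are illustrative.

import scrapy

def load_scraped_urls():
    # Placeholder: in your case, read the URL column out of your SQL database.
    return []

class ClassifiedsSpider(scrapy.Spider):
    name = "classifieds"
    start_urls = ["http://example.com/category"]  # placeholder seed URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_urls = set(load_scraped_urls())  # loaded once, not per item

    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():  # illustrative selector
            url = response.urljoin(href)
            if url in self.seen_urls:
                continue  # already scraped on a previous run
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url}  # plus whatever item fields you need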
I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual URLs ahead of time. For example, I want to visit every page whose URL starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of URLs?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data rather than extracting it. Instead of copy-pasting the code that's already in the tutorial, I'll leave it to you to read it.
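For a taste, a minimal CrawlSpider that follows only links under /questions/ might look like this (a sketch, not a complete scraper; the title selector is an assumption):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class QuestionsSpider(CrawlSpider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions/"]
    rules = [
        # Follow only question pages and hand each one to parse_question.
        Rule(LinkExtractor(allow=r"/questions/\d+"), callback="parse_question", follow=True),
    ]

    def parse_question(self, response):
        # Grab whatever bit of data you need from each page.
        yield {"url": response.url, "title": response.css("h1 ::text").get()}

process = CrawlerProcess()
process.crawl(QuestionsSpider)
process.start()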
To grab a specific bit of data from a website, you could use a web-scraping tool, e.g., Scrapy.
If the required data is generated by JavaScript, then you might need a browser-driving tool such as Selenium WebDriver, and to implement the crawling of the links by hand.
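A sketch of that Selenium approach: render the page in a real browser, collect the links, then visit each one. The driver setup and the start URL are assumptions; adjust them for your site and browser.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome(); needs the matching driver installed
driver.get("http://stackoverflow.com/questions/")

# Collect the link targets first, since navigating away invalidates the elements.
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]

for url in links:
    if url and url.startswith("http://stackoverflow.com/questions/"):
        driver.get(url)
        # ... extract the JavaScript-generated data you need here ...

driver.quit()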
For example, you can make a simple for loop, like this:
def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in range(24):
        print("%s%d" % (base_link, i))
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the range of question numbers and do whatever you want with them.