Web Scraping (Python) Multiple Request Runtime too Slow [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I'm working on a personal project where I need to make multiple requests (~800) to scrape keyword and abstract data from different pages. Every time I run my program, it takes about 30 minutes to scrape all the data.
I'm thinking of two ways to speed up the runtime:
read the data into a CSV file once and use pandas to read it back from the CSV file for future reference (a rough sketch of this follows below);
create a MySQL database and store the data there.
Are these two approaches feasible? It would be great if I could get some insights.
Thanks
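For the first approach, a minimal sketch of a CSV cache read back with pandas could look like the following (the column names and the scrape_fn hook are placeholders, not part of the original question):

import os
import pandas as pd

CACHE = "scraped.csv"

def load_or_scrape(scrape_fn):
    # Return cached results if present; otherwise scrape once and write the CSV cache.
    if os.path.exists(CACHE):
        return pd.read_csv(CACHE)      # later runs read from disk in seconds
    rows = scrape_fn()                 # list of (url, keywords, abstract) tuples from the scraper
    df = pd.DataFrame(rows, columns=["url", "keywords", "abstract"])
    df.to_csv(CACHE, index=False)      # the first run still pays the full scraping cost
    return df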

Having some experience with scraping, I can say you have several options, such as using the requests library to make your GET and POST calls (please remember to keep the session), or using a framework such as Scrapy.
The main things for scraping in an optimal way are:
Split your work [1];
Use plenty of try/except handling and save the errors [2];
If you are scraping a lot, rate-limit your requests to avoid being blocked [3];
Save information at each step;
And if you get lost, use the Inspect tools in your browser to see the network calls :)
[1] - A timeout is very time consuming and will stall your process until the timeout exception occurs; splitting your work helps with that.
[2] - Several errors may happen and "stop" all your work over a single failure. Using try and catching the exception lets you save the errors and work on them later. Saving where you are in the process also lets you resume later.
[3] - Some sites will block you if you make too many requests per minute, so be reasonable.
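A minimal sketch of points [1]-[3] (the URL list, output file names, and one-second delay are assumptions for illustration, not part of the original answer):

import csv
import time
import requests

def scrape(urls, delay=1.0, out_path="results.csv", err_path="errors.csv"):
    # Fetch each URL with one shared session, rate-limited, saving results and errors as we go.
    session = requests.Session()  # keep the session so connections (and cookies) are reused
    with open(out_path, "a", newline="") as out, open(err_path, "a", newline="") as err:
        results, errors = csv.writer(out), csv.writer(err)
        for url in urls:
            try:
                resp = session.get(url, timeout=10)
                resp.raise_for_status()
                results.writerow([url, resp.text[:200]])   # placeholder: parse keywords/abstract here
            except requests.RequestException as exc:
                errors.writerow([url, str(exc)])           # save the failure so this URL can be retried later
            time.sleep(delay)                              # crude rate limit to avoid being blocked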

Related

A safe way to make an website run a python script [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I made a small script that solves a combinatorial optimization problem, and I would like to put it on a website so users can "play" with it: they could send a list of "points" to the server, and the script would use a database to return the best combination of those "points".
The problem is that I do not have much experience in web dev. I searched for how to make an HTML button execute a script and found this thread: https://stackoverflow.com/questions/48552343/how-can-i-execute-a-python-script-from-an-html-button#:~:text=To%20run%2C%20open%20command%20prompt,Hope%20this%20helpful.
But it says that having an HTML button call a Python script directly is not safe. So what would be an ideal, safe alternative, so that I could make sure that anyone who accesses my website can execute this script safely?
Well, there's no "easy" answer to your question. What you'd really need to do is create a website in Python on your host computer, using a tool such as Django, and have one of the URLs supported by that website call your script.
Honestly, what you're asking for here really isn't the sort of question that Stack Overflow is intended to answer. It's too big. Another one of the SE-family sites might be more appropriate, although I'm not quite sure which one ...
The solution that comes to mind would be setting up some Python-API (e.g. with Flask) which you could call with HTTP via JS, having different routes for different usages.
Here's a short overview of Flask showcasing how it could be used.
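A minimal sketch of such an API (the /solve route, the JSON payload shape, and the solve function are assumptions for illustration; the real optimization script would plug in where indicated):

from flask import Flask, jsonify, request

app = Flask(__name__)

def solve(points):
    # Placeholder for the combinatorial optimization script and its database lookup.
    return sorted(points)

@app.route("/solve", methods=["POST"])
def solve_route():
    payload = request.get_json(silent=True) or {}
    points = payload.get("points", [])               # JS on the page sends {"points": [...]} via fetch()
    return jsonify(best_combination=solve(points))   # the script runs server-side, never in the browser

if __name__ == "__main__":
    app.run()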

Is Python a suitable tool for automating data scraping? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I am working on a project that involves working with a large amount of data. Essentially, a website hosts a large repository of Excel files that can be downloaded. The site has several different lists of filters, and I have several different parameter combinations I filter on before collecting the data. Overall, this process requires me to download upwards of 1,000 Excel files and copy and paste them together.
Does Python have the functionality to automate this process? Essentially what I am doing is setting Filter 1 = A, Filter 2 = B, Filter 3 = C, downloading the file, then repeating with different parameters and pasting the files together. If Python is suitable for this, can anyone point me toward a good tutorial or starting point? If not, what language would be more suitable for someone with little background?
Thanks!
Personally I would prefer to use Python for this. I would look in particular at the pandas library, a powerful data-analysis library whose DataFrame object can be used like a headless spreadsheet. I use it for a small number of spreadsheets and it's been very quick. Perhaps take a look at this website for more guidance: https://pythonprogramming.net/data-analysis-python-pandas-tutorial-introduction/
I'm not 100% sure whether your question was only about spreadsheets; my first paragraph was really about working on the files once you have downloaded them. If you're interested in actually fetching the files or 'scraping' the data, look at the Requests library for the HTTP side of things (this is what you'd use if there is a RESTful way of doing it), or look at Scrapy (https://scrapy.org) for web scraping.
Sorry if I misunderstood in parts.
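As a rough sketch of the download-and-combine loop (the export URL, filter parameters, and sheet layout are invented for illustration; reading .xlsx files with pandas also needs openpyxl installed):

import pandas as pd
import requests

# Hypothetical filter combinations; the real site defines what these look like.
FILTERS = [
    {"filter1": "A", "filter2": "B", "filter3": "C"},
    {"filter1": "A", "filter2": "B", "filter3": "D"},
]

frames = []
for params in FILTERS:
    resp = requests.get("https://example.com/export.xlsx", params=params, timeout=30)
    resp.raise_for_status()
    with open("download.xlsx", "wb") as fh:
        fh.write(resp.content)                      # save the Excel file the site returns
    frames.append(pd.read_excel("download.xlsx"))   # load it into a DataFrame

combined = pd.concat(frames, ignore_index=True)     # the "copy and paste together" step
combined.to_excel("combined.xlsx", index=False)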

Django website content built from 3rd party REST API [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm currently working on my first website using the Django framework. Major parts of my content are fetched from a third-party API, which requires three requests in order to fetch all the data I need.
My problem is that this slows down performance a lot: my page load time is about 1-2 seconds, which I don't find satisfactory at all.
I'm looking for a few alternatives/best practices for this kind of scenario. What would one do to speed up page load times? So far, I've been thinking of running a cron job in the background that calls the API for all currently logged-in users and stores the data in my local database, which has a much faster response time.
The other alternative would be to load the API data separately and add it to the page once it has loaded, but I don't know at all how that would work.
Any other ideas or any tips on how I can improve this?
Thank you!
Tobias
A common practice is to build a cache: first look for the data in your local database, and if it doesn't exist, call the API and save the data.
Without more information it's impossible to write a complete working example.
You could make a custom method that does it all in one place.
import requests

def call_data(api_id):
    try:
        # Serve from the local cache first.
        data = DataModel.objects.get(api_id=api_id)
    except DataModel.DoesNotExist:
        # Cache miss: fetch from the third-party API and store it locally.
        payload = requests.get("http://api-call/").json()
        data = DataModel.objects.create(api_id=api_id, **payload)
    return data
This is only an example, not something to use in production; it needs at least some validation that the API call succeeded and returned the expected fields.
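As an alternative to a dedicated database table, Django's cache framework can hold the API response for a while; a rough sketch (the cache key, endpoint, and 15-minute timeout are arbitrary choices, not from the original answer):

import requests
from django.core.cache import cache

def get_api_data(api_id):
    key = "api-data-%s" % api_id
    data = cache.get(key)
    if data is None:
        data = requests.get("http://api-call/").json()  # hypothetical endpoint, as in the snippet above
        cache.set(key, data, timeout=60 * 15)           # keep it for 15 minutes before refetching
    return data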

Google maps API error handling [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm doing some tests with the Google Maps Python API. The script runs fine: it loads GPS coordinates from a CSV file and returns the address, using the googlemaps.reverse_geocode function.
However, after a couple of "searches" I get a .....googlemaps.exceptions.TransportError: ('Connection aborted.....
It would be great if the script just continued when it receives an error.
I have no clue how to handle these error messages, and the online documentation doesn't seem very clear about this.
One possibility is that you're using the free API, which limits you to 10 requests per second, and your Python interpreter probably works a little bit faster than that. I would use a queue and a timer to fire at most 10 API requests per second. That would be the correct way to do the job if this is indeed the problem you're facing.
A straightforward solution would be using try/except: when the TransportError exception is raised, you know you should time.sleep() for a while and retry the last API request before moving on to the next iteration.
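A minimal sketch of that retry idea (the API key, retry count, and pause length are placeholders):

import time
import googlemaps
from googlemaps.exceptions import TransportError

gmaps = googlemaps.Client(key="YOUR_API_KEY")   # placeholder key

def safe_reverse_geocode(lat, lng, retries=3, pause=2):
    # Retry a few times instead of letting one TransportError kill the whole run.
    for _ in range(retries):
        try:
            return gmaps.reverse_geocode((lat, lng))
        except TransportError:
            time.sleep(pause)   # back off, then retry the same coordinates
    return None                 # give up on this row so the CSV loop can continue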

Creating a generic web crawler in python for news aggregation like Flipboard [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Recently I have been assigned a project at my college, which is a news aggregator. I found Flipboard to be a very interesting and popular app for news aggregation. To achieve this, I am building a web crawler that will crawl websites to fetch recent news and posts. I was going through a post on Gizmodo, which includes this exchange:
Is the scraper universal/generic, or are there custom scrapers for certain sites?
Doll: It is mostly universal/generic. However, we can limit the amount of content displayed on a site-specific basis. We already try to do this with some sites that publish extremely abbreviated RSS feeds; even though we aren't using RSS directly, we attempt to achieve display parity with their feed.
I am quite familiar with the process of fetching data from a single website, but I am not sure how I could fetch data from multiple websites and blogs, all with completely different structures.
I am currently using Python 2.7, urllib2 and BeautifulSoup for crawling a single website.
Question:
I want to know how I could achieve the objective of fetching data from thousands of websites with just one generic crawler.
I recommend creating one big Spider class, then subclassing it for individual sites. I wrote a short answer to a similar question here on Stack Overflow.
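A minimal sketch of that structure (the example site, URL, and CSS selectors are invented; a real subclass would encode whatever markup the target site actually uses):

import requests
from bs4 import BeautifulSoup

class BaseSpider:
    # Generic part: fetch a page and hand the parsed soup to a site-specific parser.
    start_url = None

    def fetch(self, url):
        return BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

    def parse(self, soup):
        raise NotImplementedError   # each site overrides this with its own structure

    def run(self):
        return self.parse(self.fetch(self.start_url))

class ExampleNewsSpider(BaseSpider):
    start_url = "https://example.com/news"   # placeholder site

    def parse(self, soup):
        return [a.get_text(strip=True) for a in soup.select("h2 a")]   # assumed headline markup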
I have done something similar, although with only a basic knowledge of Python; Google-fu taught me how to make a script that more advanced users would scoff at. But hey, it works for my use and doesn't leave too much of a footprint.
I made several functions that used requests to fetch the sites and BeautifulSoup to parse the individual sites, based on the structure I reverse-engineered with the inspector in Chrome.
When the script is run, it runs all of the functions, fetching the info I want.
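A stripped-down version of that per-site-function approach might look like this (the two sites and their selectors are made up for illustration):

import requests
from bs4 import BeautifulSoup

def fetch_site_a():
    soup = BeautifulSoup(requests.get("https://site-a.example/news", timeout=10).text, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("article h1 a")]   # structure found via the inspector

def fetch_site_b():
    soup = BeautifulSoup(requests.get("https://site-b.example/blog", timeout=10).text, "html.parser")
    return [li.get_text(strip=True) for li in soup.select("ul.posts li a")]

def run_all():
    # Run every per-site function and collect the results in one place.
    return {fn.__name__: fn() for fn in (fetch_site_a, fetch_site_b)}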
