I'm looking into web scraping/crawling possibilities and have been reading up on the Scrapy framework. Does anyone know whether it's possible to add instructions to the script so that, once it has visited the URL, it can choose pre-selected dates from a calendar on the website?
The end result is to use this for price comparisons on sites such as Trivago. I'm hoping the program can select certain criteria, such as dates, once on the website, just as a human would.
Thanks,
Alex
In theory, for a website like Trivago you can set the dates you want to query through the URL itself, but you will need to research user agents and proxies, because otherwise your IP will get blacklisted very quickly.
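As a rough sketch of the URL approach (the `checkin`/`checkout` parameter names here are hypothetical — you would need to inspect the actual query string the site produces when you pick dates in the browser):

```python
from urllib.parse import urlencode

def build_search_url(base, checkin, checkout):
    """Encode check-in/check-out dates as URL query parameters.
    NOTE: 'checkin' and 'checkout' are hypothetical parameter names --
    inspect the real site's URLs to find the actual ones."""
    params = {"checkin": checkin, "checkout": checkout}
    return base + "?" + urlencode(params)

# A browser-like User-Agent header to send with each request
# (rotating these, plus proxies, is what helps avoid quick blacklisting).
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

url = build_search_url("https://www.trivago.com/search", "2024-06-01", "2024-06-05")
print(url)
```

From there, any HTTP client (or a Scrapy spider) can request that URL with the headers attached.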
Related
I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search, which I can then use in a separate script to do text analysis of a large corpus (mainly news sites). This seems relatively straightforward, but none of the attempts I've tried have worked properly.
This is as close as I got:
from google import search
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason why you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number, but it seems their limit on the number of search queries per day is rather strict too: 100 search queries per day, according to their JSON Custom Search API documentation.
Nonetheless, there's no harm trying out other alternatives to see if they work better:
BeautifulSoup
Scrapy
ParseHub - this one is not code, but it is a useful piece of software with good documentation, including a tutorial on how to scrape a list of URLs.
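Whichever tool you pick, the core step is extracting `<a href>` links from result pages and dumping them to a text file. A minimal standard-library sketch of that step (real result pages are messier, and you would still need delays between requests to avoid being blocked again):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect all href values from <a> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Sample HTML standing in for a fetched results page.
html = '<a href="https://www.cbc.ca/news/1">story</a> <a href="https://www.cbc.ca/news/2">story</a>'
parser = LinkExtractor()
parser.feed(html)

# Write the collected URLs to a text file, one per line,
# for use by the separate text-analysis script.
with open("urls.txt", "w") as f:
    f.write("\n".join(parser.links))
```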
I'm facing a problem like this. I used Tweepy to collect over 10,000 tweets, then used NLTK's naive Bayes classifier to filter them down to about 5,000.
I want to generate a graph of user friendships from those 5,000 classified tweets. The problem is that I can check friendships with tweepy.api.show_friendship(), but it takes a very long time and sometimes ends in an endless rate-limit error.
Is there any way I can check the friendships more efficiently?
I don't know much about the limits with Tweepy, but you can always write a basic web scraper with urllib and BeautifulSoup to do so.
You could use a website such as www.doesfollow.com, which accomplishes what you are trying to do. (I'm not sure about the request limits on this page, but there are dozens of other websites that do the same thing.) This website is interesting because the URL is very simple.
For example, in order to check if Google and Twitter are "friends" on Twitter, the link is simply www.doesfollow.com/google/twitter.
This would make it very easy for you to run through the users, as you can just append the usernames to the URL, such as 'www.doesfollow.com/' + user1 + '/' + user2.
The results page of doesfollow has this tag if the users are friends on Twitter:
<div class="yup">yup</div>,
and this tag if the users are not friends on Twitter:
<div class="nope">nope</div>
So you could parse the page source code and search to find which of those tags exist to determine if the users are friends on Twitter.
This might not be the way that you wanted to approach the problem, but it's a possibility. I'm not entirely sure how to approach the graphing part of your question though. I'd have to look into that.
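A sketch of that check, assuming the yup/nope markup described above still matches the live page (the fetch uses urllib; the "parsing" is just a substring test on the page source):

```python
from urllib.request import urlopen

def parse_follow_page(html):
    """Return True/False based on the yup/nope divs described above,
    or None if neither marker is present (e.g. the page changed)."""
    if 'class="yup"' in html:
        return True
    if 'class="nope"' in html:
        return False
    return None

def is_following(user1, user2):
    # Hypothetical usage -- check the site's robots.txt and rate
    # limits before running this against the live site.
    url = "https://www.doesfollow.com/" + user1 + "/" + user2
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return parse_follow_page(html)
```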
I have a python script, which scrapes some information from some site.
This site has a daily limit of 20 connections.
So I decided to use the requests module with specified proxies.
After a couple of hours testing different "proxy list" sites, I found one and have been pulling proxies from http://free-proxy-list.net/.
It seems this site doesn't update its list often, and after testing my script I've burned through all the proxies and can't access the target site anymore.
All this searching has left me exhausted, and I feel like my script completely sucks.
Is there any way I can avoid being detected by the site, or do I just need to find another list of proxies? If there are any sites with daily-updated, fresh proxy lists, please let me know.
P.S. I have often stumbled upon sites like https://hide.me where I just enter the link and it gives me full access. Maybe I can code this in Python? If that's possible, please show me how.
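For reference, the usual pattern with requests is to pass a proxies dict and rotate through your list, skipping proxies that fail. A sketch (the proxy addresses below are placeholders, not working proxies):

```python
import requests

# Placeholder proxies -- replace with addresses from a live proxy list.
PROXIES = ["203.0.113.1:8080", "203.0.113.2:3128"]

def make_proxies(proxy):
    """Build the proxies dict that requests expects."""
    return {"http": "http://" + proxy, "https": "http://" + proxy}

def fetch_with_rotation(url, proxies=PROXIES):
    """Try each proxy in turn until one succeeds."""
    for proxy in proxies:
        try:
            resp = requests.get(url, proxies=make_proxies(proxy), timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            continue  # proxy dead or blocked; move on to the next one
    raise RuntimeError("all proxies failed")
```

This doesn't solve the detection problem by itself, but it does stop one dead proxy from killing the whole run.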
I have a task to complete. I need to make a web-crawler kind of application. What I need to do is pass a URL to my application. This URL is the website of a government agency, and it contains links to the individual agencies approved by that government agency. I need to follow those links and collect some information from each site about that agency. I hope I'm making myself clear. I also have to make this application generic, meaning I can't hard-code it for just one website (government agency); given any such URL, it should check it, extract all the links, and proceed. On some websites these links are in PDFs, and on others they are on a page.
I have to use Python for this, and I don't know how to approach it. I spent time on this using BeautifulSoup, but that requires a lot of parsing. Other options are Scrapy or twill. Honestly, I am new to Python and don't know which one is better for this task. Can anyone help me select the right tool and the right approach to solve this problem? Thanks in advance.
There is plenty of information out there about building web scrapers with Python. Python is a great tool for the job.
There are also tons of posts about web scrapers on this website if you search for them.
To help me learn Python, I decided to screen scrape the football commentaries from the ESPNFC website from the 'live' page (such as here).
It was working up until a day ago but having finally sorted some things out, I went to test it and the only piece of commentary I got back was [u'Commentary Not Available'].
Does anyone have any idea how they are doing this, and are there any quick and easy ways around it? I am using Scrapy/XPath and urllib2.
Edit//
for game_id in processQueue:
    data_text = getInformation(game_id)
    clean_events_dict = getEvents(data_text)
    break
Doesn't work the same as
i = getInformation(369186)
j = getEvents(i)
In the first sample, processQueue is a list of game_ids. The first of these is given to the script to start scraping, and the loop breaks out before it has a chance to move on to another game_id.
In the second sample I use a single game id.
The first one fails and the second one works and I have absolutely no idea why. Any ideas?
There are a few things you can try, assuming you can still access the data from your browser. Bear in mind, however, that website operators are generally within their rights to block you; this is why projects that rely on scraping a single site are a risky proposition. Here they are:
Delay a few seconds between each scrape
Delay a random number of seconds between each scrape
Accept cookies during your scraping session
Run JavaScript during your session (not possible with Scrapy as far as I know)
Share the scraping load between several IP ranges
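The first two items, fixed and randomized delays, are easy to combine in a small helper you call between scrapes (the commented loop below is a hypothetical usage, not working code):

```python
import random
import time

def polite_delay(base=2.0, jitter=3.0):
    """Sleep for a fixed base plus a random extra amount of seconds,
    and return the delay actually used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Hypothetical usage between scrapes:
# for url in urls_to_scrape:
#     page = fetch(url)   # your scraping call
#     polite_delay()      # waits 2-5 seconds before the next request
```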
There are other strategies which, I generally argue, are less ethical:
Modify your User Agent string to make your scraper look like a browser
I suggest in this answer here that scrapers should be set up to obey robots.txt. However, if you program your scraper to be well-behaved, site operators will have fewer reasons to go to the trouble of blocking you. The most frequent errors I see in this Stack Overflow tag are simply that scrapers are being run far too fast, and they are accidentally causing a (minor) denial of service. So, try slowing down your scrapes first, and see if that helps.