Batch searching on Google: 403 error - Python

I am trying to do batch searching: go over a list of strings and print the first address that a Google search returns for each one:
#!/usr/bin/python
import json
import urllib
import time
import pandas as pd

df = pd.read_csv("test.csv")
saved_column = df.Name  # you can also use df['column_name']

for name in saved_column:
    query = urllib.urlencode({'q': name})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    address = data[u'results'][0][u'url']
    print address
I get a 403 error from the server:
'Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors', u'responseStatus': 403
Is what I'm doing not allowed under Google's terms of service?
I also tried putting time.sleep(5) in the loop, but I get the same error.
Thank you in advance

Not allowed by Google's TOS. You really can't scrape Google without them getting angry. Their blocking is also pretty sophisticated, so you can get around it for a little while with random delays, but it fails pretty quickly.
Sorry, you're out of luck on this one.

https://developers.google.com/errors/?csw=1
The Google Search and Language APIs shown to the right have been officially deprecated.
Also
We received automated requests, such as scraping and prefetching. Automated requests are prohibited; all requests must be made as a result of an end-user action.
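If you still need programmatic search results, Google's supported route is a keyed API rather than the deprecated AJAX endpoint. Below is a minimal sketch against the Custom Search JSON API; the API key and search engine ID (cx) are placeholders you would have to create yourself, and the field access assumes the documented items/link response layout.
import requests

API_KEY = '<your API key>'          # placeholder: create one in the Google API console
CX = '<your search engine ID>'      # placeholder: the cx of a Custom Search Engine

def first_result(query):
    """Return the first result URL for a query, or None if there are no hits."""
    params = {'key': API_KEY, 'cx': CX, 'q': query}
    resp = requests.get('https://www.googleapis.com/customsearch/v1', params=params)
    resp.raise_for_status()
    items = resp.json().get('items', [])
    return items[0]['link'] if items else None

print first_result('test query')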

Related

How to parse all Facebook targetings with Graph API

I'm trying to fetch job titles with the Graph API (the problem exists for other methods too) using the search method type=adworkposition.
import json
import requests
import pandas as pd

TOKEN = '<access token>'  # placeholder; the token is defined elsewhere in the original script

def getJobs(job):
    URL = 'https://graph.facebook.com/search?type=adworkposition&q=' + job + '&limit=10000&locale=ru_RU&access_token=' + TOKEN
    response = requests.get(URL)
    response = json.loads(response.text)['data']
    df = pd.DataFrame.from_dict(response)
    return df
When job = 'developer' I expect to see at least Java Developer, Game Developer, and .NET Developer, because I know for sure that they exist: I can see them in Facebook Ads Manager.
An example from the documentation also suggests it works that way, but it does not: the request returns only ".NET developer". What am I doing wrong? Is there some regex involved that I don't know about?

PRAW: Authorizing with OAuth prevents me from getting submissions/comments

If I use OAuth, I am unable to get new submissions or comments from a subreddit.
My OAuth code looks like this:
import praw
import webbrowser
r = praw.Reddit(user_agent)
r.set_oauth_app_info(CLIENT_ID, CLIENT_SECRET, REDIRECT_URI)
authURL = r.get_authorize_url("FUZZYPICKLES", "identity submit", True)
webbrowser.open(authURL)
authCode = input("Enter the code: ")
accInfo = r.get_access_information(authCode)
After that I can try to get submissions
submission = r.get_subreddit("test").get_new()
or comments
comments = r.get_comments("test")
but if I try to use either result, the program crashes with the error:
raise OAuthInsufficientScope('insufficient_scope', response.url)
praw.errors.OAuthInsufficientScope: insufficient_scope on url https://oauth.reddit.com/r/test/comments/.json
If I don't use OAuth, either by using login() or by just not authorizing, I have no such issues. I am using Python 3.4. What am I doing wrong?
I found the solution myself. To read posts, you need "read" in your list of requested scopes. So, "identity submit" should be "identity read submit".
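Concretely, the only change needed in the snippet above is the scope string passed to get_authorize_url:
# request the "read" scope in addition to "identity" and "submit"
authURL = r.get_authorize_url("FUZZYPICKLES", "identity read submit", True)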

Writing code using graph APIs

I am extremely new to Python, scripting, and APIs; I am just learning. I came across a very cool piece of code which uses the Facebook API to reply to birthday wishes.
I will add my questions and number them so that it will be easier for someone else later too. I hope this question will clear up a lot of newbie doubts.
1) Talking about APIs, what form do they usually come in? Is it a library file which we need to download and later import? For instance, for the Twitter API, do we need to import twitter?
Here is the code:
import requests
import json

AFTER = 1353233754
TOKEN = ' <insert token here> '

def get_posts():
    """Returns dictionary of id, first names of people who posted on my wall
    between start and end time"""
    query = ("SELECT post_id, actor_id, message FROM stream WHERE "
             "filter_key = 'others' AND source_id = me() AND "
             "created_time > 1353233754 LIMIT 200")
    payload = {'q': query, 'access_token': TOKEN}
    r = requests.get('https://graph.facebook.com/fql', params=payload)
    result = json.loads(r.text)
    return result['data']

def commentall(wallposts):
    """Comments thank you on all posts"""
    # TODO convert to batch request later
    for wallpost in wallposts:
        r = requests.get('https://graph.facebook.com/%s' %
                         wallpost['actor_id'])
        url = 'https://graph.facebook.com/%s/comments' % wallpost['post_id']
        user = json.loads(r.text)
        message = 'Thanks %s :)' % user['first_name']
        payload = {'access_token': TOKEN, 'message': message}
        s = requests.post(url, data=payload)
        print "Wall post %s done" % wallpost['post_id']

if __name__ == '__main__':
    commentall(get_posts())
Questions:
2) Why is json imported here? To give a structured reply?
3) What are the 'AFTER' and the empty 'TOKEN' variables here?
4) What are the 'query' and 'payload' variables inside the get_posts() function?
5) Please explain roughly what each method and function does.
I know I am extremely naive, but this could be a good start. With a little hint, I can carry on.
If you don't want to explain the code, which I understand is pretty boring, please at least tell me how to hook up to an API after the code is written, i.e. how does a written script communicate with the desired API?
This is not my code; I copied it from a source.
json is imported to interpret the data the web service sends back over HTTP: the Graph API returns JSON, and json.loads turns it into Python objects.
The 'AFTER' variable is a Unix timestamp; it is supposed to be used so that all posts after that point in time are assumed to be birthday wishes (note that the query above actually hard-codes the same value).
To make the program work, you need a token which you can obtain from Graph API Explorer with the appropriate permissions.
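As a small illustration of the json point: the Graph API sends back a JSON document over HTTP, and json.loads turns it into ordinary Python dictionaries and lists (the field names below are just the ones the script above expects):
import json

raw = '{"data": [{"post_id": "1_2", "actor_id": "3", "message": "Happy birthday!"}]}'
result = json.loads(raw)            # parse the JSON text into Python objects
print result['data'][0]['message']  # prints: Happy birthday!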

Speed up an HTTP request in Python and 500 error

I have code that retrieves news results from this newspaper for a given query and time frame (which could be up to a year).
The results are paginated at up to 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve the title, URL, and date of each article. Each cycle (the HTTP request plus the parsing) takes from 30 seconds to a minute, which is extremely slow, and eventually it stops with a response code of 500. I am wondering whether there are ways to speed it up, or maybe make multiple requests at once. I simply want to retrieve the article details from all the pages.
Here is the code:
import requests
import re
from bs4 import BeautifulSoup
import csv

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def run(**params):
    countryFile = open("EgyptDaybyDay.csv", "a")
    i = 1
    results = True
    while results:
        params["index"] = str(i)
        response = requests.get(URL.format(**params))
        print response.status_code
        htmlFile = BeautifulSoup(response.content)
        articles = htmlFile.findAll("div", {"class": "newslist"})
        for article in articles:
            url = (article.a['href']).encode('utf-8', 'ignore')
            title = (article.img['alt']).encode('utf-8', 'ignore')
            dateline = article.find("div", {"class": "floatright"})
            m = re.search("([0-9]{2}\-[0-9]{2}\-[0-9]{4})", dateline.string)
            date = m.group(1)
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            w.writerow((date, title, url))
        if not articles:
            results = False
        i += 1
    countryFile.close()

run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")
This is a good opportunity to try out gevent.
You should have a separate routine for the requests.get part so that your application doesn't block waiting for I/O.
You can then spawn multiple workers and have queues to pass requests and articles around.
Maybe something similar to this:
import gevent.monkey
from gevent.queue import Queue
from gevent import sleep
gevent.monkey.patch_all()

MAX_REQUESTS = 10

requests = Queue(MAX_REQUESTS)
articles = Queue()

mock_responses = range(100)
mock_responses.reverse()

def request():
    print "worker started"
    while True:
        print "request %s" % requests.get()
        sleep(1)
        try:
            articles.put('article response %s' % mock_responses.pop())
        except IndexError:
            articles.put(StopIteration)
            break

def run():
    print "run"
    i = 1
    while True:
        requests.put(i)
        i += 1

if __name__ == '__main__':
    for worker in range(MAX_REQUESTS):
        gevent.spawn(request)
    gevent.spawn(run)
    for article in articles:
        print "Got article: %s" % article
The most probable slowdown is the server, so parallelising the HTTP requests would be the best way to make the code run faster, although there's very little you can do to speed up the server's response. There's a good tutorial over at IBM for doing exactly this.
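As a rough sketch of that idea (not the IBM tutorial's code), a simple thread pool can fetch several result pages concurrently; the page range here is an arbitrary placeholder, and URL is the same template string as in the question:
import requests
from multiprocessing.dummy import Pool  # thread pool; fine for I/O-bound work

URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'

def fetch(index):
    # same query parameters as the question's run() call
    response = requests.get(URL.format(index=str(index), query="Egypt",
                                       datefrom="12-01-2010", dateto="12-01-2011"))
    return index, response.content

pool = Pool(10)                        # 10 requests in flight at a time
pages = pool.map(fetch, range(1, 51))  # fetch pages 1-50 concurrently (placeholder range)
pool.close()
pool.join()
# 'pages' is now a list of (index, html) tuples that can be parsed as before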
It seems to me that you're looking for a feed, which that newspaper doesn't advertise. However, it's a problem that has been solved before: there are many sites that will generate feeds for an arbitrary website, thus at least solving one of your problems. Some of these require some human guidance, and others leave less opportunity for tweaking and are more automatic.
If you can at all avoid doing the pagination and parsing yourself, I'd recommend it. If you cannot, I second the use of gevent for simplicity. That said, if they're sending you back 500's, your code is likely less of an issue and added parallelism may not help.
You can try making all the calls asynchronously.
Have a look at this:
http://pythonquirks.blogspot.in/2011/04/twisted-asynchronous-http-request.html
You could use gevent as well, rather than Twisted; I'm just telling you the options.
This might very well come close to what you're looking for.
Ideal method for sending multiple HTTP requests over Python? [duplicate]
Source code:
https://github.com/kennethreitz/grequests
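A minimal sketch of what grequests usage looks like (the URL list below is just a placeholder):
import grequests

# placeholder URLs; in the question's case these would be the paginated search pages
urls = ['http://example.com/page/%d' % i for i in range(1, 11)]

pending = (grequests.get(u) for u in urls)
responses = grequests.map(pending)  # sends the requests concurrently via gevent

for response in responses:
    if response is not None:        # failed requests come back as None
        print response.status_code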

What is a post_form_id? (using python urllib2)

I'm interested in writing a Python script to log into Facebook and then request some data (mainly checking the inbox). There are a few nice examples out there on how to do this. One interesting script I found over here, and there is a nice example on Stack Overflow itself.
Now I could just copy-paste the code I need and get it to do what I want, but that wouldn't be a good way to learn. So I am trying to understand what I am actually coding, and I can't understand some elements of the script in the first example, namely: what is a post_form_id?
Here is the section of the code which refers to "post_form_id" (lines 56-72):
# Initialize the cookies and get the post_form_data
print 'Initializing..'
res = browser.open('http://m.facebook.com/index.php')
mxt = re.search('name="post_form_id" value="(\w+)"', res.read())
pfi = mxt.group(1)
print 'Using PFI: %s' % pfi
res.close()

# Initialize the POST data
data = urllib.urlencode({
    'lsd'          : '',
    'post_form_id' : pfi,
    'charset_test' : urllib.unquote_plus('%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84'),
    'email'        : user,
    'pass'         : passw,
    'login'        : 'Login'
})
Would you be so kind as to tell me what a post_form_id is? And, incidentally, would you know what the lsd key/value stands for?
Thanks.
I don't understand why you are trying to "hack" this...
There is an official API from Facebook to read a user's mailbox, and you need to request the "read_mailbox" permission for this.
So I advise you to check my post here on how to use Facebook and Python/Django together, and how to log in to Facebook from Python.
Then I would recommend reading the Facebook documentation about the messages/inbox.
Basically you need an access_token, and then you can do http://graph.facebook.com/me/inbox/?access_token=XXX
You can also ask for the "offline_access" permission, so you'll only need to get an access token once and will be able to use it "forever".
And then you can do http://graph.facebook.com/MESSAGE_ID?access_token=XXX to get the details of a particular message.
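In plain Python that is just an HTTP GET; here is a minimal sketch with requests, where the access token is a placeholder obtained through the OAuth flow with the permissions mentioned above, and the 'data'/'id' fields follow the usual Graph API list layout:
import requests

ACCESS_TOKEN = '<token with read_mailbox permission>'  # placeholder

# list the inbox, then fetch the details of each message/thread
inbox = requests.get('https://graph.facebook.com/me/inbox',
                     params={'access_token': ACCESS_TOKEN}).json()

for thread in inbox.get('data', []):
    details = requests.get('https://graph.facebook.com/%s' % thread['id'],
                           params={'access_token': ACCESS_TOKEN}).json()
    print details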
Or, using the API I use in the other thread:
f = Facebook()
res = f.get_object("me/inbox")
...
Feel free to comment if you have any questions about this.
