Crawl images from Google search with Python

I am trying to write a script in Python to crawl images from Google search. I want to collect the URLs of the images and then store those images on my computer. I found some code to do so, but it only tracks 60 URLs; after that, a timeout message appears. Is it possible to track more than 60 images?
My code:
import os
import json
import time
import urllib
import requests

def crawl_images(query, path):
    BASE_URL = 'https://ajax.googleapis.com/ajax/services/search/images?'\
               'v=1.0&q=' + query + '&start=%d'
    BASE_PATH = os.path.join(path, query)

    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)

    counter = 1
    urls = []
    start = 0  # Google's start query string parameter for pagination.
    while start < 60:  # Google will only return a max of 56 results.
        r = requests.get(BASE_URL % start)
        for image_info in json.loads(r.text)['responseData']['results']:
            url = image_info['unescapedUrl']
            print url
            urls.append(url)
            image = urllib.URLopener()
            try:
                image.retrieve(url, "model runway/image_" + str(counter) + ".jpg")
                counter += 1
            except IOError, e:
                # Throw away some gifs...blegh.
                print 'could not save %s' % url
                continue
        print start
        start += 4  # 4 images per page.
        time.sleep(1.5)

crawl_images('model runway', '')

Have a look at the Documentation: https://developers.google.com/image-search/v1/jsondevguide
You should get up to 64 results:
Note: The Image Searcher supports a maximum of 8 result pages. When
combined with subsequent requests, a maximum total of 64 results are
available. It is not possible to request more than 64 results.
Another note: you can restrict the file type, so you don't need to throw away GIFs etc., as shown in the sketch below.
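If I remember the deprecated API correctly, the file type restriction is just an extra query parameter; treat the parameter name below as an assumption and verify it against the devguide linked above. A minimal sketch that limits the query to JPEGs:

# Sketch: 'as_filetype' is assumed from the deprecated Image Search docs;
# check the devguide linked above before relying on it.
BASE_URL = ('https://ajax.googleapis.com/ajax/services/search/images?'
            'v=1.0&as_filetype=jpg&q=' + query + '&start=%d')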
As an additional note, please keep in mind that this API may only be used for user searches, not for automated ones:
Note: The Google Image Search API must be used for user-generated
searches. Automated or batched queries of any kind are strictly
prohibited.

You can try the icrawler package. It's extremely easy to use, and I've never had problems with the number of images to be downloaded.
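For reference, a minimal icrawler sketch using its built-in GoogleImageCrawler; the keyword and output directory below are placeholders taken from the question above, and the exact options are worth checking against the icrawler docs:

from icrawler.builtin import GoogleImageCrawler

# Placeholder keyword and output directory; max_num caps the download count.
google_crawler = GoogleImageCrawler(storage={'root_dir': 'model_runway_images'})
google_crawler.crawl(keyword='model runway', max_num=100)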

Related

Having trouble using Beautiful Soup's 'Next Sibling' to extract some information

On auction websites, there is a clock counting down the time remaining. I am trying to extract that piece of information (among others) to print to a CSV file.
For example, I am trying to take the value after 'Time Left:' on this site: https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx
I have tried three different options, without any success:
1)
time = ''
try:
    time = soup.find(id='tzcd').text.replace('Time Left:','')
    #print("Time: ", time)
except Exception as e:
    print(e)
2)
time = ''
try:
    time = soup.find(id='tzcd').text
    #print("Time: ", time)
except:
    pass
3)
time = ''
try:
    time = soup.find('div', id="BiddingTimeSection").find_next_sibling("div").text
    #print("Time: ", time)
except:
    pass
I am a new user of Python and don't know if it's because of the date/time structure of the pull or because of something else inherently flawed in my code.
Any help would be greatly appreciated!
That information is being pulled into the page via a JavaScript XHR call. You can see that by inspecting the Network tab in your browser's dev tools. The following code will get you the time left in seconds:
import requests
s = requests.Session()
header = {'X-AjaxPro-Method': 'GetTimerText'}
payload = '{"inventoryId":271177}'
r = s.get('https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx')
s.headers.update(header)
r = s.post('https://auctionofchampions.com/ajaxpro/LotDetail,App_Web_lotdetail.aspx.cdcab7d2.1voto_yr.ashx', data=payload)
print(r.json()['value']['timeLeft'])
Response:
792309
792309 seconds is a bit over 9 days. There are easy ways to convert that into days/hours/minutes if you want, for example:
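A minimal sketch using divmod, assuming timeLeft is the integer number of seconds shown above:

seconds = int(r.json()['value']['timeLeft'])   # e.g. 792309
days, rest = divmod(seconds, 86400)            # 86400 seconds in a day
hours, rest = divmod(rest, 3600)
minutes, secs = divmod(rest, 60)
print(f"{days}d {hours}h {minutes}m {secs}s")  # 792309 -> 9d 4h 5m 9s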

Transform a multiple line url request into a function in Python

I am trying to download a series of text files from different websites, using urllib.request with Python. I want to expand the list of URLs without making the code long.
The working sequence is
import urllib.request
url01 = 'https://web.site.com/this.txt'
url02 = 'https://web.site.com/kind.txt'
url03 = 'https://web.site.com/of.txt'
url04 = 'https://web.site.com/link.txt'
[...]
urllib.request.urlretrieve(url01, "Liste n°01.txt")
urllib.request.urlretrieve(url02, "Liste n°02.txt")
urllib.request.urlretrieve(url03, "Liste n°03.txt")
[...]
The number of files to download is increasing and I want to keep the second part of the code short. I tried:
i = 0
while i<51
    i = i +1
    urllib.request.urlretrieve( i , "Liste n°0+"i"+.txt")
It doesn't work, and I am thinking that a while loop can be used for strings but not for requests. So I was thinking of making it a function:
def newfunction(i)
    return urllib.request.urlretrieve(url"i", "Liste n°0"+1+".txt")
But it seems I am missing a big chunk of it. The request itself works, but it seems I cannot transform it for a long list of URLs.
As a general suggestion, I'd recommend the requests module for Python, rather than urllib.
Based on that, some naive code for a possible function:
import requests

def get_file(site, filename):
    target = site + "/" + filename
    try:
        r = requests.get(target, allow_redirects=True)
        open(filename, 'wb').write(r.content)
        return r.status_code
    except requests.exceptions.RequestException as e:
        print("File not downloaded, error: {}".format(e))
You can then call the function, passing in parameters of site and file name:
get_file('https://web.site.com', 'this.txt')
The function will report the exception, but not stop execution, if it cannot download a file. You could expand the exception handling to deal with files not being writable, but this should be a start.
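To cover the original goal of downloading a growing list of files, one possible follow-up is to loop over the remote file names and reuse get_file; the names below are the ones from the question, and the files are saved under their remote names rather than the "Liste n°0X.txt" scheme:

site = 'https://web.site.com'
files = ['this.txt', 'kind.txt', 'of.txt', 'link.txt']  # remote names from the question

for i, name in enumerate(files, start=1):
    status = get_file(site, name)
    print("Liste n°{:02d} ({}): HTTP status {}".format(i, name, status))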
It seems as if you're not casting the variable i to a string before concatenating it to the URL string. That may be the reason why your code isn't working. The while-loop/for-loop approach shouldn't affect whether or not the requests get sent out. I recommend using the requests module for making requests as well. Mike's post covers what the function should roughly look like. I also recommend creating a Session object if you're going to be making a whole lot of requests in a piece of code. The Session object will keep the underlying TCP connection open while you make your requests, which should reduce latency, CPU usage, and network congestion (https://en.wikipedia.org/wiki/HTTP_persistent_connection#Advantages). The code would look something like this:
import requests

with requests.Session() as s:
    for i in range(10):
        s.get(str(i)+'.com')  # make request
        # write to file here
To cast i to a string you would want something like this:
i = 0
while i < 51:
    i = i + 1
    urllib.request.urlretrieve(i, "Liste n°0" + str(i) + ".txt")

Batch create campaigns via Facebook ads API with Python?

I'm trying to build an API tool for creating 100+ campaigns at a time, but so far I keep running into timeout errors. I have a feeling it's because I'm not doing this as a batch/async request, but I can't seem to find straightforward instructions specifically for batch creating campaigns in Python. Any help would be GREATLY appreciated!
I have all the campaign details prepped and ready to go in a Google sheet, which my script then reads (using pygsheets) and attempts to create the campaigns. Here's what it looks like so far:
from facebookads.adobjects.campaign import Campaign
from facebookads.adobjects.adaccount import AdAccount
from facebookads.api import FacebookAdsApi
from facebookads.exceptions import FacebookRequestError
import time
import pygsheets

FacebookAdsApi.init(access_token=xxx)

gc = pygsheets.authorize(service_file='xxx/client_secret.json')
sheet = gc.open('Campaign Prep')
tab1 = sheet.worksheet_by_title('Input')
tab2 = sheet.worksheet_by_title('Output')

# gets range size, offsetting it by 1 to account for the range starting on row 2
row_range = len(tab1.get_values('A1', 'A', returnas='matrix', majdim='ROWS', include_empty=False)) + 1

# finds first empty row in the output sheet
start_row = len(tab2.get_values('A1', 'A', returnas='matrix', majdim='ROWS', include_empty=False))

def create_campaigns(row):
    campaign = Campaign(parent_id=row[6])
    campaign.update({
        Campaign.Field.name: row[7],
        Campaign.Field.objective: row[9],
        Campaign.Field.buying_type: row[10],
    })
    c = campaign.remote_create(params={'status': Campaign.Status.active})
    camp_name = c['name']
    camp_id = 'cg:' + c['id']
    return camp_name, camp_id

r = start_row
# there's a header so I have the range starting at 2
for x in range(2, int(row_range)):
    r += 1
    row = tab1.get_row(x)
    camp_name, camp_id = create_campaigns(row)
    # pastes the generated campaign ID, campaign name and account id back into the sheet
    tab2.update_cells('A'+str(r)+':C'+str(r).format(r), [[camp_id, camp_name, row[6].rsplit('_', 1)[1]]])
I've tried putting this in a try loop so that if it runs into a FacebookRequestError it does time.sleep(5) and then keeps trying, but I'm still running into timeout errors every 5-10 rows it loops through. When it doesn't time out it does work; I guess I just need to figure out a way to make this handle big batches of campaigns more efficiently.
Any thoughts? I'm new to the Facebook API and I'm still a relative newb at Python, but I find this stuff so much fun! If anyone has any advice for how this script could be better (as well as general Python advice), I'd love to hear it! :)
Can you post the actual error message?
It sounds like what you are describing is that you hit the rate limits after making a certain number of calls. If that is so, time.sleep(5) won't be enough. The rate score decays over time and will be reset after 5 minutes (https://developers.facebook.com/docs/marketing-api/api-rate-limiting). In that case I would suggest sleeping between each call instead, as in the sketch below. However, a better option would be to upgrade your API status. If you hit the rate limits this fast I assume you are on the Developer level. Try upgrading first to Basic and then Standard and you should not have these problems: https://developers.facebook.com/docs/marketing-api/access
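For the sleep-between-calls idea, a rough sketch that wraps the create_campaigns function from the question with a pause and a simple retry on FacebookRequestError (the retry count and delays are arbitrary):

import time
from facebookads.exceptions import FacebookRequestError

def create_with_retry(row, retries=3, delay=10):
    # Retry one campaign creation, backing off a little longer each attempt.
    for attempt in range(retries):
        try:
            return create_campaigns(row)
        except FacebookRequestError:
            time.sleep(delay * (attempt + 1))
    raise RuntimeError("Giving up on this row after {} attempts".format(retries))

# In the main loop, spread the calls out:
# camp_name, camp_id = create_with_retry(row)
# time.sleep(2)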
Also, as you mention, utilizing Facebook's batch request API could be a good idea. https://developers.facebook.com/docs/marketing-api/asyncrequests/v2.11
Here is a thread with examples of the Batch API working with the Python SDK: https://github.com/facebook/facebook-python-ads-sdk/issues/116
I'll paste the code snippet (copied from the last link that @reaktard posted), credit to GitHub user @williardx. It helped me a lot in my development.
# ----------------------------------------------------------------------------
# Helper functions
def generate_batches(iterable, batch_size_limit):
    # This function can be found in examples/batch_utils.py
    batch = []
    for item in iterable:
        if len(batch) == batch_size_limit:
            yield batch
            batch = []
        batch.append(item)
    if len(batch):
        yield batch

def success_callback(response):
    batch_body_responses.append(response.body())

def error_callback(response):
    # Error handling here
    pass
# ----------------------------------------------------------------------------

batches = []
batch_body_responses = []
api = FacebookAdsApi.init(your_app_id, your_app_secret, your_access_token)

for ad_set_list in generate_batches(ad_sets, batch_limit):
    next_batch = api.new_batch()
    requests = [ad_set.get_insights(pending=True) for ad_set in ad_set_list]
    for req in requests:
        next_batch.add_request(req, success_callback, error_callback)
    batches.append(next_batch)

for batch_request in batches:
    batch_request.execute()
    time.sleep(5)

print batch_body_responses
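The snippet above batches insights requests; for the campaign creation in the question, the same pattern should apply if your SDK version's remote_create() accepts a batch argument (the linked GitHub issue shows working variations). An untested sketch:

# Untested sketch: assumes remote_create() accepts a 'batch' argument so the
# call is queued on the batch instead of being sent immediately.
api = FacebookAdsApi.init(access_token=xxx)
rows = [tab1.get_row(x) for x in range(2, int(row_range))]

for row_batch in generate_batches(rows, 50):
    batch = api.new_batch()
    for row in row_batch:
        campaign = Campaign(parent_id=row[6])
        campaign.update({
            Campaign.Field.name: row[7],
            Campaign.Field.objective: row[9],
            Campaign.Field.buying_type: row[10],
        })
        campaign.remote_create(batch=batch, params={'status': Campaign.Status.active})
    batch.execute()
    time.sleep(5)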

how to make xgoogle return google first page

I have been trying to use xgoogle to search for PDFs on the internet. The problem I am having is that if I search for "Medicine:pdf", the first page it returns is not the first page Google returns when I actually use google.com. I don't know what's wrong. Here is my code:
try:
    page = 0
    gs = GoogleSearch(searchfor)
    gs.results_per_page = 100
    results = []
    while page < 2:
        gs.page = page
        results += gs.get_results()
        page += 1
except SearchError, e:
    print "Search failed: %s" % e

for res in results:
    print res.desc
If I actually use the Google website to search for the query, the first result Google displays for me is:
Title : Medicine - British Council
Desc :United Kingdom medical training has a long history of excellence and of ... Leaders in medicine throughout the world have received their medical education.
Url : http://www.britishcouncil.org/learning-infosheets-medicine.pdf
But if I use my Python xgoogle search I get:
Python output:
Descrip:UCM175757.pdf
Title:Medicines in My Home: presentation for students - Food and Drug ...
Url:http://www.fda.gov/downloads/Drugs/ResourcesForYou/Consumers/BuyingUsingMedicineSafely/UnderstandingOver-the-CounterMedicines/UCM175757.pdf
I noticed there is a difference between using xgoogle and using Google in the browser. I have no idea why, but you could try the Google Custom Search API. It may give you results closer to the browser's, with no risk of being banned from Google (if you use xgoogle too many times in a short period, you get an error return instead of search results).
First you have to register and enable your custom search in Google to get your key and cx:
https://www.google.com/cse/all
the api format is:
'https://www.googleapis.com/customsearch/v1?key=yourkey&cx=yourcx&alt=json&q=yourquery'
customsearch is the Google service you want to use; in your case it is customsearch
v1 is the version of the API
yourkey and yourcx are provided by Google; you can find them on your dashboard
yourquery is the term you want to search; in your case it is "Medicine:pdf"
json is the return format
Example returning the first 3 pages of Google custom search results:
import urllib2
import urllib
import simplejson

def googleAPICall():
    userInput = urllib.quote("global warming")
    KEY = "##################"   # get yours
    CX = "###################"   # get yours
    for i in range(0, 3):
        index = i*10 + 1
        url = ('https://www.googleapis.com/customsearch/v1?'
               'key=%s'
               '&cx=%s'
               '&alt=json'
               '&q=%s'
               '&num=10'
               '&start=%d') % (KEY, CX, userInput, index)
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        results = simplejson.load(response)
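        # A possible follow-on (assumption: Custom Search v1 responses list
        # matches under 'items', each with 'title' and 'link' fields):
        for item in results.get('items', []):
            print(item['title'] + ' - ' + item['link'])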

Twilio Python helper library - How do you know how many pages list resource returned?

I'm trying to write a simple script to download call details information from Twilio using the python helper library. So far, it seems that my only option is to use .iter() method to get every call ever made for the subaccount. This could be a very large number.
If I use the .list() resource, it doesn't seem to give me a page count anywhere, so I don't know how long to continue paging to get all calls for the time period. What am I missing?
Here are the docs with code samples:
http://readthedocs.org/docs/twilio-python/en/latest/usage/basics.html
It's not very well documented at the moment, but you can use the following API calls to page through the list:
import twilio.rest

client = twilio.rest.TwilioRestClient(ACCOUNT_SID, AUTH_TOKEN)

# iterating vars
remaining_messages = client.calls.count()
current_page = 0
page_size = 50  # any number here up to 1000, although paging may be slow...

while remaining_messages > 0:
    calls_page = client.calls.list(page=current_page, page_size=page_size)
    # do something with the calls_page object...
    remaining_messages -= page_size
    current_page += 1
You can pass in page and page_size arguments to the list() function to control which results you see. I'll update the documentation today to make this more clear.
As mentioned in the comment, the above code did not work because remaining_messages = client.calls.count() always returns 50, making it absolutely useless for paging.
Instead, I ended up just trying the next page until it fails, which is fairly hacky. The library should really include numpages in the list resource for paging.
import twilio.rest
import csv

account = <ACCOUNT_SID>
token = <ACCOUNT_TOKEN>
client = twilio.rest.TwilioRestClient(account, token)

csvout = open("calls.csv", "wb")
writer = csv.writer(csvout)

current_page = 0
page_size = 50
started_after = "20111208"

test = True
while test:
    try:
        calls_page = client.calls.list(page=current_page, page_size=page_size, started_after=started_after)
        for calls in calls_page:
            writer.writerow((calls.sid, calls.to, calls.duration, calls.start_time))
        current_page += 1
    except:
        test = False
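For what it's worth, a shorter sketch of the same idea using the .iter() helper mentioned in the question, assuming it accepts the same started_after filter as .list() (worth verifying against your version of the helper library):

import csv
import twilio.rest

client = twilio.rest.TwilioRestClient(account, token)

with open("calls.csv", "wb") as csvout:
    writer = csv.writer(csvout)
    # Assumption: iter() handles paging itself and takes the same filters as
    # list(), so the date filter bounds how far back it goes.
    for call in client.calls.iter(started_after="20111208"):
        writer.writerow((call.sid, call.to, call.duration, call.start_time))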
