Problems with iterations and retrieving information with python3 and selenium - python

I am new to Python and managed to write a little program (using Python 3) to retrieve information from a website. I have three problems:
I do not know how to tell Python to wait at every 80th step, i.e. when i = 80, 160, 240, etc.
I do not know how to tell Python to retrieve from the website how many steps exist in total (as this varies from page to page), see the image below. I can see in the picture that the maximum amount of 260 is "hard-coded" in this example. Can I tell Python to retrieve the 260 by itself (or any other number if this changes on another web page)?
How can I tell Python to check which page the script starts on, so that it can adjust i to that page's number? Normally I presume to start at page 0 (i = 0), but, for example, if I were to start at page 30, my script should be able to set i = 30, or if I start at page 200, it should set i = 200, etc., before it goes into the while loop.
Is it clear what I am struggling with?
This is the pseudo code:
import time
from selenium import webdriver
url = input('Please, enter url: ')
driver = webdriver.Firefox()
driver.get(url)
i = 0
while i > 260: # how to determine (book 1 = 260 / book 2 = 500)?
    # do something
    if i == 80: # each 80th page?
        # pause
    else:
        # do something else
    i = i + 1
else:
    quit()

1) sleep
import time
....
if i % 80 == 0: # each 80th page?
    # Wait for 5 seconds
    time.sleep(5)
2) element selectors
html = driver.find_element_by_css_selector('afterInput').get_attribute('innerHTML')
3) arguments
import sys
....
currentPage = sys.argv[2]
or extract it from the source (see 2)
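A small sketch of option 3 (the snippet above reads sys.argv[2]; the argument position in this sketch is an assumption, adjust it to however the script is actually invoked):

```python
import sys

def starting_page(argv, default=0):
    """Return the starting page from the command line, or a default.
    Assumes the page number is the first argument after the script name."""
    return int(argv[1]) if len(argv) > 1 else default

# e.g. run as `python scrape.py 30`, then:
# i = starting_page(sys.argv)   # i would be 30
```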

First, if you want to know whether your i is a multiple ("step") of 80, you can use the modulo operator and check whether the result equals 0, for instance:
if i % 80 == 0:
    time.sleep(1) # One second
Second, you need to query the html you receive from the server, for instance:
from selenium import webdriver
url = input('Please, enter url: ')
driver = webdriver.Firefox()
driver.get(url)
total_pages = driver.find_element_by_css_selector('afterInput').get_attribute('innerHTML').split()[1] # Take only the number
After your edit: all you have to do is assign i the value you want, by defining a variable in your script, parsing the arguments from the command line, or scraping it from the website. This depends on your implementation and needs.
Other notes
I know you're on your beginning steps, but if you want to improve your code and make it a bit more pythonic I would do the following changes:
Using while and i = i + 1 is not a common pattern in Python; instead use for i in range(total_pages) - of course you need to know the number of pages (from your second question)
There is no need to call quit(); your script will end anyway at the end of the file.
I think you meant while i < 260.
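Putting the pieces above together, a minimal sketch (the 'afterInput' selector and the exact text layout of the page counter are assumptions based on the question's screenshot):

```python
import time

def parse_total(counter_text):
    """Extract the trailing number from counter text like '1 of 260'.
    The text layout is an assumption; adjust the split to the real markup."""
    return int(counter_text.split()[-1])

def visit_pages(total, start=0, pause_every=80, pause_seconds=5):
    """Loop over pages, pausing on every 80th one."""
    visited = []
    for i in range(start, total):
        if i > 0 and i % pause_every == 0:
            time.sleep(pause_seconds)   # pause on every 80th page
        visited.append(i)               # placeholder for the real scraping step
    return visited
```

With Selenium, total would come from something like parse_total(driver.find_element_by_css_selector('afterInput').get_attribute('innerHTML')), and start from sys.argv or from the page itself.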

Related

Having trouble using Beautiful Soup's 'Next Sibling' to extract some information

On Auction websites, there is a clock counting down the time remaining. I am trying to extract that piece of information (among others) to print to a csv file.
For example, I am trying to take the value after 'Time Left:' on this site: https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx
I have tried 3 different options, without any success
1)
time = ''
try:
    time = soup.find(id='tzcd').text.replace('Time Left:','')
    #print("Time: ",time)
except Exception as e:
    print(e)
2)
time = ''
try:
    time = soup.find(id='tzcd').text
    #print("Time: ",time)
except:
    pass
3)
time = ''
try:
    time = soup.find('div', id="BiddingTimeSection").find_next_sibling("div").text
    #print("Time: ",time)
except:
    pass
I am a new user of Python and don't know if it's because of the date/time structure of the pull or because of something else inherently flawed with my code.
Any help would be greatly appreciated!
That information is pulled into the page via a JavaScript XHR call. You can see that by inspecting the Network tab in the browser's Dev tools. The following code will get you the time left in seconds:
import requests
s = requests.Session()
header = {'X-AjaxPro-Method': 'GetTimerText'}
payload = '{"inventoryId":271177}'
r = s.get('https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx')
s.headers.update(header)
r = s.post('https://auctionofchampions.com/ajaxpro/LotDetail,App_Web_lotdetail.aspx.cdcab7d2.1voto_yr.ashx', data=payload)
print(r.json()['value']['timeLeft'])
Response:
792309
792309 seconds are a bit over 9 days. There are easy ways to return them in days/hours/minutes, if you want.
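As noted, converting the seconds into days/hours/minutes is straightforward, for example with divmod:

```python
def human_time_left(seconds):
    """Break a second count into days, hours, minutes and seconds."""
    days, rem = divmod(seconds, 86400)      # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"

print(human_time_left(792309))  # → 9d 4h 5m 9s
```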

Using automated testing to send text and an image from an Excel list to WhatsApp, but it is not sending to every contact in the list

The loop works when the image import is not scripted.
import os
import pandas
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By

pre = os.path.dirname(os.path.realpath(__file__))
f_name = 'wpcontacts.xlsx'
path = os.path.join(pre, f_name)
f_name = pandas.read_excel(path)
count = 0
image_url = input("url here")
driver = webdriver.Chrome(executable_path='D:/Old Data/Integration Files/new/chromedriver')
driver.get('https://web.whatsapp.com')
sleep(25)
for column in f_name['Contact'].tolist():
    try:
        driver.get('https://web.whatsapp.com/send?phone=' + str(f_name['Contact'][count]) + '&text=' + str(
            f_name['Messages'][0]))
        sent = False
        sleep(7)
        # It tries 3 times to send a message in case any error occurred
        click_btn = driver.find_element(By.XPATH,
            '/html/body/div[1]/div/div/div[4]/div/footer/div[1]/div/span[2]/div/div[2]/div[2]/button/span')
        file_path = 'amazzon.jpg'
        driver.find_element(By.XPATH,
            '//*[@id="main"]/footer/div[1]/div/span[2]/div/div[1]/div[2]/div/div/span').click()
        sendky = driver.find_element(By.XPATH,
            '//*[@id="main"]/footer/div[1]/div/span[2]/div/div[1]/div[2]/div/span/div/div/ul/li[1]/button/span')
        input_box = driver.find_element(By.TAG_NAME, 'input')
        input_box.send_keys(image_url)
        sleep(3)
    except Exception:
        print("Sorry, message could not be sent to " + str(f_name['Contact'][count]))
    else:
        sleep(3)
        driver.find_element(By.XPATH,
            '//*[@id="app"]/div/div/div[2]/div[2]/span/div/span/div/div/div[2]/div/div[2]/div[2]/div/div').click()
        sleep(2)
        print('Message sent to: ' + str(f_name['Contact'][count]))
count = count + 1
The output is:
Message sent to: 919891350373
Process finished with exit code 0
How do I convert this code into a loop so that I can send the text to every number mentioned in the Excel file?
Thanks!
Firstly, if what you've written in the question is the code you are using, I am confused how you aren't getting a syntax error due to the tab spacing, e.g. here:
try:
driver.get('https://web.whatsapp.com/send?phone=' + str(f_name['Contact'][count]) + '&text=' + str(
f_name['Messages'][0]))
I am going to assume this is a mixup related to copy-paste.
Next, I'll just mention the following: I highly doubt you need a 25-second sleep for the page to load, and the default timeout for Selenium tests is 30 seconds, so with the other sleeps you've added I'm not sure why it's not simply timing out, unless you've overridden this timeout in some other part of the code that's not included in your question.
What is the point of doing driver.get('https://web.whatsapp.com'), then following it with another driver.get()?
All this aside, it would make sense to me that your problem lies with the spacing for your increment count = count + 1; it is not inside your for loop in the code as I see it. So, the count is not actually incremented in the loop itself but rather after the whole loop is executed. If it does not help to add a tab before the count increment, I'm quite sure that you've made some mistake(s) pasting the code here so please organize it such that we can see what code is actually being executed.
Finally, another comment I have: the xpaths you've got scare me. You should almost NEVER use an absolute xpath (like '/html/body/div[1]/div/div/div[4]/div/footer/div[1]/div/span[2]/div/div[2]/div[2]/button/span'). Just about any change to the HTML on the page will cause this to break. I haven't the time to find better selectors for you, but I highly recommend you examine these.
Let me know whether any of the above helps or not!
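To illustrate the counter point: iterating with enumerate keeps the index in step with the loop automatically, so there is no separate count to forget to indent. A sketch with dummy data standing in for the spreadsheet rows:

```python
def send_all(contacts):
    """Pretend-send a message to each contact and return the log lines.
    The actual driver.get(...) / click logic from the question goes where
    the comment is; `contacts` stands in for f_name['Contact'].tolist()."""
    log = []
    for count, contact in enumerate(contacts):
        # driver.get('https://web.whatsapp.com/send?phone=' + str(contact) + ...)
        log.append('Message sent to: ' + str(contact))  # count advances per contact
    return log
```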

Loading time of each page with selenium python

Hello everyone,
Can anyone please help me calculate the load time of each page? I want to do performance analysis of a web page. The code below gives the complete execution time, but I want to measure the load time of each page after every click.
navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
responseStart = driver.execute_script("return window.performance.timing.responseStart")
domComplete = driver.execute_script("return window.performance.timing.domComplete")
backendPerformance_calc = responseStart - navigationStart
frontendPerformance_calc = domComplete - responseStart
print("Back End: %s" % backendPerformance_calc)
print("Front End: %s" % frontendPerformance_calc)
Can anyone help me solve this problem?
You can use this js to perform this check:
state = driver.execute_script(" return document.readyState; ")
Or, you can simply add an explicit wait for a specific element and see when it was displayed, then do some math (when you clicked vs. when the element was displayed).
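One way to combine the two ideas: after each click, wait until document.readyState is 'complete', re-read the Navigation Timing marks with the same execute_script calls as in the question, and diff them. The arithmetic can be factored into a small reusable helper:

```python
def page_timings(navigation_start, response_start, dom_complete):
    """Split Navigation Timing marks (epoch ms) into back-end and front-end portions."""
    return {
        "backend_ms": response_start - navigation_start,
        "frontend_ms": dom_complete - response_start,
    }

# After each click, once document.readyState is 'complete', something like:
# t = driver.execute_script("return window.performance.timing")
# print(page_timings(t["navigationStart"], t["responseStart"], t["domComplete"]))
```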

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to Python, I learn by doing, so I thought I'd give this project a shot. I am trying to create a script which finds the Google Analytics request for a certain website, parses the request payload, and does something with it.
Here are the requirements:
Ask user for 2 urls ( for comparing the payloads from 2 diff. HAR payloads)
Use selenium to open the two urls, use browsermobproxy/phantomJS to get all HAR
Store the HAR as a list
From the list of all HAR files, find the google analytics request, including the payload
If Google Analytics tag found, then do things....like parse the payload, etc. compare the payload, etc.
Issue: Sometimes for a website that I know has Google Analytics, e.g. nytimes.com, the HAR that I get is incomplete, i.e. my prog. will say "GA Not found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching HAR it wasn't there. This issue is intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on and that the complete HAR didn't get captured. I tried the "wait for traffic to stop" but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run fast, its fairly slow. As I mentioned, I'm new to python so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime
def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any'] # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)
urlLists = []
collectTags = []
gaCollect = 0
varList = []
for x in range(0, 2): # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()
# gets the two user urls and stores them in a list
for urlList in urlLists:
    print urlList, 'start 2nd loop' # printing for debug purpose, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    #proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = ent['request']['url']
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()
cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() -- this causes your programme to stop after the first iteration through the ent list, so if what you want is not the first thing, it won't be found
2 - call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.
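With captureContent enabled, the GA hit and its payload can be pulled out of the HAR dict along these lines (postData follows the standard HAR 1.2 layout; the regex matches the one used in the question):

```python
import re

def find_ga_entries(har):
    """Return (url, payload) pairs for google-analytics requests in a HAR dict."""
    hits = []
    for ent in har['log']['entries']:
        url = ent['request']['url']
        if re.search(r'google-analytics\.com/r\b', url):
            # request payload is under postData.text when captureContent is on
            payload = ent['request'].get('postData', {}).get('text', '')
            hits.append((url, payload))
    return hits
```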

Crawl images from google search with python

I am trying to write a script in Python to crawl images from a Google search. I want to track the urls of the images and then store those images on my computer. I found some code to do so; however, it only tracks 60 urls, and after that a timeout message appears. Is it possible to track more than 60 images?
My code:
import os
import json
import time
import urllib
import requests

def crawl_images(query, path):
    BASE_URL = 'https://ajax.googleapis.com/ajax/services/search/images?'\
               'v=1.0&q=' + query + '&start=%d'
    BASE_PATH = os.path.join(path, query)
    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)
    counter = 1
    urls = []
    start = 0 # Google's start query string parameter for pagination.
    while start < 60: # Google will only return a max of 56 results.
        r = requests.get(BASE_URL % start)
        for image_info in json.loads(r.text)['responseData']['results']:
            url = image_info['unescapedUrl']
            print url
            urls.append(url)
            image = urllib.URLopener()
            try:
                image.retrieve(url, "model runway/image_" + str(counter) + ".jpg")
                counter += 1
            except IOError, e:
                # Throw away some gifs...blegh.
                print 'could not save %s' % url
                continue
        print start
        start += 4 # 4 images per page.
        time.sleep(1.5)

crawl_images('model runway', '')
Have a look at the Documentation: https://developers.google.com/image-search/v1/jsondevguide
You should get up to 64 results:
Note: The Image Searcher supports a maximum of 8 result pages. When
combined with subsequent requests, a maximum total of 64 results are
available. It is not possible to request more than 64 results.
Another note: you can restrict the file type; this way you don't need to ignore gifs etc.
And as an additional note, please keep in mind that this API should only be used for user operations and not for automated searches!
Note: The Google Image Search API must be used for user-generated
searches. Automated or batched queries of any kind are strictly
prohibited.
You can try the icrawler package. Extremely easy to use. I've never had problems with the number of images to be downloaded.
