Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

I'm fairly new to Python and learn by doing, so I thought I'd give this project a shot. I'm trying to create a script that finds the Google Analytics request for a given website, parses the request payload, and does something with it.
Here are the requirements:
Ask the user for 2 URLs (to compare the payloads from 2 different HARs)
Use selenium to open the two URLs, and use browsermobproxy/PhantomJS to get all HAR entries
Store the HAR entries as a list
From the list of all HAR entries, find the Google Analytics request, including the payload
If the Google Analytics tag is found, do things with it: parse the payload, compare the payloads, etc.
Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR that I get is incomplete. My program says "GA Not found", but only because the complete HAR was not captured, so the matching entry wasn't there when the regex ran. This issue is intermittent and does not happen every time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on before the complete HAR was captured. I tried the "wait for traffic to stop" call, but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)
This is what I've got thus far:
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime
def cleanup():
    s.stop()
    driver.quit()
proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any'] # so that i can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)
urlLists = []
collectTags = []
gaCollect = 0
varList = []
for x in range(0,2): # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()
#gets the two user url and stores in list
for urlList in urlLists:
    print urlList, 'start 2nd loop' #printing for debug purpose, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    #proxy.wait_for_traffic_to_stop(15, 30) #<-- tried this but did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()
cleanup()

This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove sys.exit() from the else branch -- it stops your programme after the first iteration through the ent list, so if the entry you want is not the first one, it won't be found.
2 - Call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders':True, 'captureContent': True})
See if that helps.
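To illustrate point 1, here is a minimal sketch that scans every HAR entry before deciding whether GA is present, instead of exiting on the first non-match. The function name find_ga_requests, the sample HAR fragment, and the regex (which assumes GA hits go to /collect or /r/collect) are my own assumptions for illustration, not part of the original script.

```python
import re

# Assumed pattern: GA hits sent to /collect or /r/collect. Adjust to your tags.
GA_PATTERN = re.compile(r"google-analytics\.com/(?:r/)?collect\b")


def find_ga_requests(har, pattern=GA_PATTERN):
    """Return every request URL in a HAR dict that matches the pattern."""
    entries = har.get("log", {}).get("entries", [])
    return [
        entry["request"]["url"]
        for entry in entries
        if pattern.search(entry["request"]["url"])
    ]


# Hand-made HAR fragment standing in for proxy.har (hypothetical data).
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://www.example.com/index.html"}},
            {"request": {"url": "https://www.google-analytics.com/r/collect?v=1&tid=UA-0"}},
            {"request": {"url": "https://www.example.com/app.js"}},
        ]
    }
}

matches = find_ga_requests(sample_har)
print(matches if matches else "No GA found")
```

In the real script you would pass proxy.har to the function after the page has finished loading. If I read the browsermob-proxy Python client correctly, both arguments of wait_for_traffic_to_stop are in milliseconds, so (15, 30) would return almost immediately; something like (2000, 30000) may be closer to what was intended.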

Related

Having trouble using Beautiful Soup's 'Next Sibling' to extract some information

On auction websites, there is a clock counting down the time remaining. I am trying to extract that piece of information (among others) to write to a CSV file.
For example, I am trying to take the value after 'Time Left:' on this site: https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx
I have tried 3 different options, without any success:
1)
time = ''
try:
    time = soup.find(id='tzcd').text.replace('Time Left:','')
    #print("Time: ",time)
except Exception as e:
    print(e)
2)
time = ''
try:
    time = soup.find(id='tzcd').text
    #print("Time: ",time)
except:
    pass
3)
time = ''
try:
    time = soup.find('div', id="BiddingTimeSection").find_next_sibling("div").text
    #print("Time: ",time)
except:
    pass
I am a new user of Python and don't know if it's because of the date/time structure of the pull or because of something else inherently flawed with my code.
Any help would be greatly appreciated!

That information is pulled into the page via a JavaScript XHR call, so it is not in the HTML that Beautiful Soup sees. You can confirm that by inspecting the Network tab in your browser's dev tools. The following code will get you the time left in seconds:
import requests
s = requests.Session()
header = {'X-AjaxPro-Method': 'GetTimerText'}
payload = '{"inventoryId":271177}'
r = s.get('https://auctionofchampions.com/Michael_Jordan___Magic_Johnson_Signed_Lmt__Ed__Pho-LOT271177.aspx')
s.headers.update(header)
r = s.post('https://auctionofchampions.com/ajaxpro/LotDetail,App_Web_lotdetail.aspx.cdcab7d2.1voto_yr.ashx', data=payload)
print(r.json()['value']['timeLeft'])
Response:
792309
792309 seconds are a bit over 9 days. There are easy ways to return them in days/hours/minutes, if you want.
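For example, a minimal sketch of the conversion using only the standard library. The helper name format_time_left is made up for illustration.

```python
from datetime import timedelta


def format_time_left(seconds):
    """Split a number of seconds into days, hours, minutes and seconds."""
    days, remainder = divmod(int(seconds), 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


print(format_time_left(792309))   # 9d 4h 5m 9s
print(timedelta(seconds=792309))  # 9 days, 4:05:09
```

timedelta is the shorter route if its default formatting is good enough; divmod gives you the individual parts if you want separate CSV columns.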

adding fake_useragent to people_also_ask module

I want to scrape Google's 'People also ask' questions and answers. I am doing it successfully with the following module:
pip install people_also_ask
The problem is that the library is configured so that no one can send many requests to Google. I want to send 1000 requests per day, and to achieve that I have to add fake_useragent to the module. I tried a lot, but when I try to add a fake user agent to the header it gives an error. I am not a pro, so I must have done something wrong myself. Can anyone help me add fake_useragent to the module (people_also_ask)? Here is the working code to get the questions and answers:
from encodings import utf_8
import people_also_ask as paa
from fake_useragent import UserAgent
ua = UserAgent()
while True:
    input("Please make sure the queries are in \\query.txt file.\npress Enter to continue...")
    try:
        query_file = open("query.txt","r")
        queries = query_file.readlines()
        query_file.close()
        break
    except:
        print("Error with the query.txt file...")
for query in queries:
    res_file = open("result.csv","a",encoding="utf_8")
    try:
        query = query.replace("\n","")
    except:
        pass
    print(f'Searching for "{query}"')
    questions = paa.get_related_questions(query, 14)
    questions.insert(0,query)
    print("\n________________________\n")
    main_q = True
    for i in questions:
        i = i.split('?')[0]
        try:
            answer = str(paa.get_answer(i)['response'])
            if answer[-1].isdigit():
                answer = answer[:-11]
            print(f"Question:{i}?")
        except Exception as e:
            print(e)
        print(f"Answer:{answer}")
        if main_q:
            a = ""
            b = ""
            main_q = False
        else:
            a = "<h2>"
            b = "</h2>"
        res_file.writelines(str(f'{a}{i}?{b},"<p>{answer}</p>",'))
        print("______________________")
    print("______________________")
    res_file.writelines("\n")
    res_file.close()
print("\nSearch Complete.")
input("Press any key to Exit!")

This is against Google's terms of service, and the wishes of the people_also_ask package. This answer is for educational purposes only.
You asked why fake_useragent is prevented from working. It's not prevented from working, but the people_also_ask package simply isn't implementing any calls to make use of any fake_useragent methods. You can't just import a package and expect another package to start using it. You manually have to make packages work together.
To do that, you have to have some idea of how the 2 packages work. Have a look at the source code and you will see you can make them work together very easily. Just substitute the constant header in people_also_ask with one generated by fake_useragent before you request any data.
paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
questions = paa.get_related_questions(query, 14)
and
paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
answer = str(paa.get_answer(i)['response'])
NOTE:
Not all user agents will work. Depending on the user agent, Google may not return related questions at all. That is not the fault of either the fake_useragent or the people_also_ask package.
To alleviate this somewhat, make sure you call ua.update(). You can also use PR #122 of fake_useragent to select only a subset of the newest user agents, which are more likely to work, though you will still get a few missed queries. There is a reason the people_also_ask package didn't bypass or work around this limitation from Google.

Test if URL's in bookmark is offline or online in python

I wonder if it is possible to test the URLs from my bookmarks, so I can see whether each URL is still online or offline.
I can see that I can test a single URL with urllib2.
urllib2 code:
import socket
from urllib2 import urlopen, URLError, HTTPError
socket.setdefaulttimeout( 23 ) # timeout in seconds
url = 'http://google.com/'
try :
    response = urlopen( url )
except HTTPError, e:
    print 'The server couldn\'t fulfill the request. Reason:', str(e.code)
except URLError, e:
    print 'We failed to reach a server. Reason:', str(e.reason)
else :
    html = response.read()
    print 'got response!'
    # do something, turn the light on/off or whatever
My question is: can I get the links/URLs from my bookmarks (Chrome) and then test them in a for loop to see whether each URL is offline or online?
EDIT 26/02/2019...
I have tried this code, and I get a "File not found" error:
import json
from jsonpath_rw import parse
import os
# Parse the Bookmarks file from JSON into a dict
input_filename = os.path.join(os.getenv("APPDATA"), "\\Local\\Google\\Chrome\\User Data\\Default\\Bookmarks")
if os.path.isfile(input_filename):
    with open(input_filename) as data_file:
        bookmark_data = json.load(data_file)
    # Set a JSONPath expression for all 'url' children
    expr = parse('$..url')
    # print the value of all url keys
    print([x.value for x in expr.find(bookmark_data)])
else:
    print("File not found!")
    print(input_filename)

Chrome (or at least Chromium) stores your bookmarks in a file called Bookmarks in your Chrome config area. On Linux this is usually .config/chromium/Default/Bookmarks; on Windows it is AppData\Local\Google\Chrome\User Data\Default\Bookmarks (though you may need to hunt for it if your system is different).
Assuming you want to check all links, you probably want to recursively walk the tree, looking for url keys and getting their values. Since this is JSON, I would recommend using the JSONPath library (https://readthedocs.org/projects/jsonpath-rw/) rather than writing your own recursive function:
import json
from jsonpath_rw import parse
# Parse the Bookmarks file from JSON into a dict
with open('Bookmarks') as bm:
    data = json.load(bm)
# Set a JSONPath expression for all 'url' children
expr = parse('$..url')
# print the value of all url keys
print([x.value for x in expr.find(data)])
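If you would rather not install anything, a hand-written walk is also short. The following is a rough Python 3 sketch using only the standard library; collect_urls, is_online and the sample data are names and values I made up for illustration, and the sample only imitates the nested shape of a Bookmarks file.

```python
import urllib.error
import urllib.request


def collect_urls(node):
    """Recursively gather every 'url' value from parsed bookmark JSON."""
    urls = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                urls.append(value)
            else:
                urls.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(collect_urls(item))
    return urls


def is_online(url, timeout=10):
    """Return True if the URL can be opened without an HTTP or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


# Hypothetical data shaped like a small Bookmarks file.
sample = {
    "roots": {
        "bookmark_bar": {
            "children": [
                {"type": "url", "name": "Example", "url": "https://example.com/"},
                {
                    "type": "folder",
                    "name": "News",
                    "children": [
                        {"type": "url", "name": "Org", "url": "https://example.org/"}
                    ],
                },
            ]
        }
    }
}

print(collect_urls(sample))
```

To check real bookmarks, load the file with json.load as above and call is_online(url) for each URL returned by collect_urls. Some sites reject requests without a browser-like User-Agent, so a False result is a hint rather than proof that the page is gone.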

Python script for "Google search by image"

I have checked Google Search API's and it seems that they have not released any API for searching "Images". So, I was wondering if there exists a python script/library through which I can automate the "search by image feature".

This was annoying enough to figure out that I thought I'd throw a comment on the first python-related stackoverflow result for "script google image search". The most annoying part of all this is setting up your proper application and custom search engine (CSE) in Google's web UI, but once you have your api key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil
url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]
i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
    link = result['link']
    image = requests.get(link, stream=True)
    if image.status_code == 200:
        m = re.search(r'[^\.]+$', link)
        filename = './{}-{}.{}'.format(q, i, m.group())
        with open(filename, 'wb') as f:
            image.raw.decode_content = True
            shutil.copyfileobj(image.raw, f)
        i += 1
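One caveat: the script formats the query straight into the URL, so a search term containing characters such as & could end up being read as extra parameters. A small sketch of building the same URL with proper encoding follows; build_image_search_url is a hypothetical helper name, and the key and cx values are placeholders.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_image_search_url(api_key, cx, query):
    """Build the image search URL with every parameter URL-encoded."""
    params = {"key": api_key, "cx": cx, "searchType": "image", "q": query}
    return API_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials, for illustration only.
print(build_image_search_url("MY_KEY", "MY_CSE", "red panda & friends"))
```

With requests you can get the same effect by passing the dict as params= to requests.get instead of formatting the URL yourself.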

There is no API available, but you can parse the page and imitate a browser. I don't know how much data you need to parse, though, and Google may limit or block access.
You can imitate a browser by simply using urllib and setting the correct headers. If you think parsing complex web pages from Python may be difficult, you can use a headless browser like PhantomJS directly; inside a browser it is trivial to get the correct elements using JavaScript and the DOM.
Note: before trying any of this, check Google's TOS.

You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.

fetch text from a web site and displaying it back

Currently, there's a game that has different groups, and you can play for a prize, 'gold', every hour. Sometimes there is gold, sometimes there isn't. It is posted on Facebook every hour ("gold in group2" or "gold in group6"), and other times there is no post because no gold is on offer for that hour. I want to write a small script that will check the site hourly, grab the result (whether there is gold or not, and in which group), and display it back to me. I wanted to write it in Python as I'm learning it. Would this be the best language to use? And how would I go about doing this? All I can really find is information on extracting links. I don't want to extract links, just the text. Thanks for any and all help. I appreciate it.

Check out urllib2 for getting html from a url and BeautifulSoup/HTMLParser/etc to parse the html. Then, you could use something like this as a starting point for the script:
import sys
import time
import urllib2
import BeautifulSoup
import HTMLParser
def getSource(url, postdata):
    source = ""
    req = urllib2.Request(url, postdata)
    try:
        sock = urllib2.urlopen(req)
    except urllib2.URLError, exc:
        # handle the error..
        pass
    else:
        source = sock.read()
    finally:
        try:
            sock.close()
        except:
            pass
    return source
def parseSource(source):
    pass
    # parse source with BeautifulSoup/HTMLParser, or here...
def main():
    last_run = 0
    while True:
        t1 = time.time()
        # check if 1 hour has passed since last_run
        if t1 - last_run >= 3600:
            source = getSource("someurl.com", "user=me&blah=foo")
            last_run = time.time()
            parseSource(source)
        else:
            # sleep for 60 seconds and check time again.
            time.sleep(60)
    return 0
if __name__ == "__main__":
    sys.exit(main())
Here is a good article about parsing-html-with-python

I have something similar to what you have, but you left out what my main question revolves around. I looked at HTMLParser and BeautifulSoup, but I am unsure how to do something like if($posttext == gold) echo "gold in so and so". It seems like BeautifulSoup deals a lot with tags. Since Facebook posts can use a variety of tags, how would I go about searching just the text and returning the 'post'?
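One possible approach, sketched in Python 3 with only the standard library: throw the tags away, keep the text, and run a regular expression over it. The class TextExtractor, the function find_gold_group, and the sample markup are all made up for illustration; real Facebook pages are far messier and may not expose posts in the raw HTML at all.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumed wording of the posts: "gold in group<N>", in any letter case.
GOLD_PATTERN = re.compile(r"gold\s+in\s+group\s*(\d+)", re.IGNORECASE)


def find_gold_group(html):
    """Return the group number mentioned in the page text, or None."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    match = GOLD_PATTERN.search(text)
    return int(match.group(1)) if match else None


print(find_gold_group("<div><p>Gold in <b>group6</b> this hour!</p></div>"))
print(find_gold_group("<div><p>No prize this hour.</p></div>"))
```

Because the text is joined before matching, it does not matter which tags the post happens to use. The same idea works with BeautifulSoup by calling get_text() on the parsed page and searching the result.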
