Python 3, POST method misleading

Here are two examples that demonstrate successful POST requests, but I cannot replicate this process autonomously.
Example 1
visually required: data={'SearchTxt':'bla'}
actually required: data={'page':'search', 'SearchTxt':'bla'}
import requests

session = requests.Session()
a = session.head('https://www.axemusic.com')  # initial request just to pick up the site's cookies
session.cookies['Lm722stores'] = None  # assigning None removes the existing cookie
session.cookies.set('Lm722stores', '5h5i1rm6q3ur4mg67rs7kb77p4', domain='.axemusic.com', path='/')  # pin a known session id
response = session.post('https://www.axemusic.com/', data={'page':'search', 'SearchTxt':'bla'})
if response.text.find('Search results for bla') != -1:
    print('found')
else:
    print('not found')
Example 2
visually required: https://stackoverflow.com data={'q':'bla'}
actually required: https://stackoverflow.com/search data={'q':'bla'}
import requests

session = requests.Session()
a = session.head('https://stackoverflow.com')  # initial request just to pick up the site's cookies
session.cookies['prov'] = None  # assigning None removes the existing cookie
session.cookies.set('prov', '2922137c-e851-cd7e-8df4-9e5eb968ab33', domain='.stackoverflow.com', path='/')
response = session.post('https://stackoverflow.com/search', data={'q':'bla'})
if response.text.find('highlight">bla</span>') != -1:
    print('found')
else:
    print('not found')
Is there a way to make this process more autonomous? I'd rather not have to manually test every input in the browser and examine the GET output before knowing what the request actually requires to perform the POST.

I'd rather not have to manually test every input in the browser and examine the GET output before knowing what the request actually requires to perform the POST
Yes, by reading the documentation for public APIs, e.g. https://api.stackexchange.com/docs
For non-public APIs like axemusic, you're on your own. It's like asking "how does the baker like his eggs cooked?" No idea 😁
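That said, when the target is an ordinary HTML search form (as in both examples above), the "actually required" URL and field names are usually spelled out in the page itself: the form's action attribute gives the POST target, and its input elements, hidden ones included, give the field names. Below is a minimal sketch of pulling those out with requests and BeautifulSoup; the helper name and the single-page assumption are mine, and JavaScript-built forms won't show up this way.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def discover_forms(url):
    # Fetch a page and list each form's target URL, method, and input names/defaults.
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    forms = []
    for form in soup.find_all('form'):
        action = urljoin(url, form.get('action') or '')  # empty/relative action resolves against the page URL
        fields = {inp.get('name'): inp.get('value', '')
                  for inp in form.find_all(['input', 'textarea', 'select'])
                  if inp.get('name')}
        forms.append({'method': (form.get('method') or 'get').lower(),
                      'action': action,
                      'fields': fields})
    return forms

# e.g. inspect the axemusic home page; hidden inputs (if any) show up alongside the visible search box
for f in discover_forms('https://www.axemusic.com/'):
    print(f['method'], f['action'], f['fields'])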

Related

Trying to check if the webdirectory is showing the same thing as index.html

I'm on a black-box penetration training. Last time I asked a question about SQL injection, which I'm making progress on so far; I was able to retrieve the database and the column.
This time I need to find the admin login, so I used dirsearch for that. I checked each web directory from dirsearch and sometimes it would show the same page as index.html.
So I'm trying to fix this by automating the process with a script:
import requests

url = "http://depedqc.ph";
webdirectory_path = "C:/PentestingLabs/Dirsearch/reports/depedqc.ph/scanned_webdirectory9-3-2022.txt";
index = requests.get(url);
same = index.content

for webdirectory in open(webdirectory_path, "r").readlines():
    webdirectory_split = webdirectory.split();
    result = result = [i for i in webdirectory_split if i.startswith(url)];
    result = ''.join(result);
    print(result);
    response = requests.get(result);
    if response.content == same:
        print("same content");
The only problem is, I get this error:
Invalid URL '': No scheme supplied. Perhaps you meant http://?
Even though the printed result is: http://depedqc.ph/html
What am I doing wrong here? I'd appreciate any feedback.
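The error points at the report lines that contain no token starting with http://depedqc.ph: for those lines the list comprehension is empty, result joins to an empty string, and requests.get('') raises "No scheme supplied" even though other iterations print a valid URL. A minimal sketch of the loop that simply skips such lines (same hypothetical report path as in the question):
import requests

url = "http://depedqc.ph"
webdirectory_path = "C:/PentestingLabs/Dirsearch/reports/depedqc.ph/scanned_webdirectory9-3-2022.txt"

same = requests.get(url).content

with open(webdirectory_path, "r") as report:
    for line in report:
        # dirsearch report lines mix status codes, sizes and URLs; keep only tokens that are URLs on the target
        targets = [token for token in line.split() if token.startswith(url)]
        if not targets:
            continue  # nothing to request on this line; skip it instead of calling requests.get('')
        response = requests.get(targets[0])
        if response.content == same:
            print("same content:", targets[0])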

adding fake_useragent to people_also_ask module

I want to scrape Google "People also ask" questions/answers. I am doing it successfully with the following module:
pip install people_also_ask
The problem is that the library is configured so that no one can send many requests to Google. I want to send 1000 requests per day, and to achieve that I have to add fake_useragent to the module. I tried a lot, but when I try to add a fake user agent to the header it gives an error. I am not a pro, so I must have done something wrong myself. Can anyone help me add fake_useragent to the people_also_ask module? Here is the working code to get questions/answers.
from encodings import utf_8
import people_also_ask as paa
from fake_useragent import UserAgent

ua = UserAgent()

while True:
    input("Please make sure the queries are in \\query.txt file.\npress Enter to continue...")
    try:
        query_file = open("query.txt","r")
        queries = query_file.readlines()
        query_file.close()
        break
    except:
        print("Error with the query.txt file...")

for query in queries:
    res_file = open("result.csv","a",encoding="utf_8")
    try:
        query = query.replace("\n","")
    except:
        pass
    print(f'Searching for "{query}"')
    questions = paa.get_related_questions(query, 14)
    questions.insert(0,query)
    print("\n________________________\n")
    main_q = True
    for i in questions:
        i = i.split('?')[0]
        try:
            answer = str(paa.get_answer(i)['response'])
            if answer[-1].isdigit():
                answer = answer[:-11]
            print(f"Question:{i}?")
        except Exception as e:
            print(e)
        print(f"Answer:{answer}")
        if main_q:
            a = ""
            b = ""
            main_q = False
        else:
            a = "<h2>"
            b = "</h2>"
        res_file.writelines(str(f'{a}{i}?{b},"<p>{answer}</p>",'))
        print("______________________")
    print("______________________")
    res_file.writelines("\n")
    res_file.close()

print("\nSearch Complete.")
input("Press any key to Exit!")
This is against Google's terms of service, and the wishes of the people_also_ask package. This answer is for educational purposes only.
You asked why fake_useragent is prevented from working. It's not prevented from working; the people_also_ask package simply doesn't implement any calls that make use of fake_useragent methods. You can't just import a package and expect another package to start using it. You have to make the packages work together manually.
To do that, you have to have some idea of how the 2 packages work. Have a look at the source code and you will see you can make them work together very easily. Just substitute the constant header in people_also_ask with one generated by fake_useragent before you request any data.
paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
questions = paa.get_related_questions(query, 14)
and
paa.google.HEADERS = {'User-Agent': ua.random} # replace the HEADER with a randomised HEADER from fake_useragent
answer = str(paa.get_answer(i)['response'])
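Putting those two substitutions into the question's loop might look like the sketch below. It reuses the paa.google.HEADERS trick from above; the queries list is a hypothetical stand-in for the lines read from query.txt, and error handling is omitted.
import people_also_ask as paa
from fake_useragent import UserAgent

ua = UserAgent()
queries = ["example query"]  # hypothetical stand-in for the contents of query.txt

for query in queries:
    # randomise the User-Agent before asking for related questions
    paa.google.HEADERS = {'User-Agent': ua.random}
    questions = paa.get_related_questions(query, 14)
    for question in questions:
        # and again before each answer request
        paa.google.HEADERS = {'User-Agent': ua.random}
        answer = str(paa.get_answer(question)['response'])
        print(question, answer)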
NOTE:
Not all user agents will work. Google doesn't return related questions for every user agent; that is not the fault of either the fake_useragent or the people_also_ask package.
To alleviate this somewhat, make sure you call ua.update(), and you can also use PR #122 of fake_useragent to select only a subset of the newest user agents, which are more likely to work, though you will still get a few missed queries. There is a reason the people_also_ask package didn't bypass or work around this limitation from Google.

Incomplete HAR list using Python: Browsermobproxy, selenium, phantomJS

Fairly new to Python, I learn by doing, so I thought I'd give this project a shot: trying to create a script which finds the Google Analytics request for a certain website, parses the request payload, and does something with it.
Here are the requirements:
Ask the user for 2 URLs (for comparing the payloads from 2 different HARs)
Use selenium to open the two URLs, and use browsermobproxy/phantomJS to get all HAR entries
Store the HAR as a list
From the list of all HAR entries, find the Google Analytics request, including the payload
If the Google Analytics tag is found, then do things... like parse the payload, compare the payloads, etc.
Issue: Sometimes, for a website that I know has Google Analytics (e.g. nytimes.com), the HAR that I get is incomplete, i.e. my program will say "GA not found", but that's only because the complete HAR was not captured, so when the regex ran to find the matching entry it wasn't there. This issue is intermittent and does not happen all the time. Any ideas?
I'm thinking that due to some dependency or latency, the script moved on and the complete HAR didn't get captured. I tried the "wait for traffic to stop" approach but maybe I didn't do something right.
Also, as a bonus, I would appreciate any help you can provide on how to make this script run faster; it's fairly slow. As I mentioned, I'm new to Python, so go easy :)
This is what I've got thus far.
import browsermobproxy as mob
from selenium import webdriver
import re
import sys
import urlparse
import time
from datetime import datetime

def cleanup():
    s.stop()
    driver.quit()

proxy_path = '/Users/bob/Downloads/browsermob-proxy-2.1.4-bin/browsermob-proxy-2.1.4/bin/browsermob-proxy'
s = mob.Server(proxy_path)
s.start()
proxy = s.create_proxy()
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
service_args = [proxy_address, '--ignore-ssl-errors=yes', '--ssl-protocol=any']  # so that I can do https connections
driver = webdriver.PhantomJS(executable_path='/Users/bob/Downloads/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs', service_args=service_args)
driver.set_window_size(1400, 1050)

urlLists = []
collectTags = []
gaCollect = 0
varList = []

for x in range(0, 2):  # I want to ask the user for 2 inputs
    url = raw_input("Enter a website to find GA on: ")
    time.sleep(2.0)
    urlLists.append(url)
    if not url:
        print "You need to type something in...here"
        sys.exit()

# gets the two user urls and stores them in the list
for urlList in urlLists:
    print urlList, 'start 2nd loop'  # printing for debug purposes, no need for this
    if not urlList:
        print 'Your Url list is empty'
        sys.exit()
    proxy.new_har()
    driver.get(urlList)
    # proxy.wait_for_traffic_to_stop(15, 30)  # <-- tried this but it did not do anything
    for ent in proxy.har['log']['entries']:
        gaCollect = (ent['request']['url'])
        print gaCollect
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            time.sleep(2.0)
            break
        else:
            print 'No GA Found - Ending Prog.'
            cleanup()
            sys.exit()

cleanup()
This might be a stale question, but I found an answer that worked for me.
You need to change two things:
1 - Remove the sys.exit() inside the entries loop -- it causes your program to stop after the first iteration through the ent list, so if what you want is not the first entry, it won't be found.
2 - Call new_har with the captureContent option enabled to get the payload of requests:
proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
See if that helps.
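For the first change, the "not found" handling has to move out of the loop so that every entry is inspected before deciding there is no GA request. A minimal sketch of the reworked second loop, keeping the rest of the script above unchanged and folding in both suggestions (for/else is just one way to express it):
for urlList in urlLists:
    proxy.new_har(options={'captureHeaders': True, 'captureContent': True})
    driver.get(urlList)
    for ent in proxy.har['log']['entries']:
        gaCollect = ent['request']['url']
        if re.search(r'google-analytics.com/r\b', gaCollect):
            print 'Found GA'
            collectTags.append(gaCollect)
            break
    else:
        # runs only if the loop finished without a break, i.e. no entry matched
        print 'No GA Found'
cleanup()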

Python Facebook API - cursor pagination

My question involves learning how to retrieve my entire list of friends using Facebook's Python API. The current result returns an object with a limited number of friends and a link to the 'next' page. How do I use this to fetch the next set of friends? (Please post links to possible duplicates.) Any help would be much appreciated. In general, I need to learn about the pagination involved in using the API.
import facebook
import json
ACCESS_TOKEN = "my_token"
g = facebook.GraphAPI(ACCESS_TOKEN)
print json.dumps(g.get_connections("me","friends"),indent=1)
Sadly, the documentation of pagination has been an open issue for almost two years. You should be able to paginate like this (based on this example) using requests:
import facebook
import requests

ACCESS_TOKEN = "my_token"
graph = facebook.GraphAPI(ACCESS_TOKEN)
friends = graph.get_connections("me", "friends")
allfriends = []
# Wrap this block in a while loop so we can keep paginating requests until
# finished.
while True:
    try:
        for friend in friends['data']:
            allfriends.append(friend['name'].encode('utf-8'))
        # Attempt to make a request to the next page of data, if it exists.
        friends = requests.get(friends['paging']['next']).json()
    except KeyError:
        # When there are no more pages (['paging']['next']), break from the
        # loop and end the script.
        break
print allfriends
Update: There's a new generator method available which implements the above behavior and can be used to iterate over all friends like this:
for friend in graph.get_all_connections("me", "friends"):
    # Do something with this friend.
    pass
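For parity with the while loop above, collecting every friend's name through the generator might look like this (a sketch, assuming the same graph object and that each yielded item is a dict like the entries in friends['data']):
allfriends = [friend['name'] for friend in graph.get_all_connections("me", "friends")]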
Meanwhile, while searching for an answer, I found a much better approach:
import facebook

access_token = ""
graph = facebook.GraphAPI(access_token=access_token)
totalFriends = []
friends = graph.get_connections("me", "/friends&summary=1")
while 'paging' in friends:
    for i in friends['data']:
        totalFriends.append(i['id'])
    friends = graph.get_connections("me", "/friends&summary=1&after=" + friends['paging']['cursors']['after'])
At the end you will get one response where data is empty and there is no 'paging' key, so at that point the loop breaks and all the data will have been stored.
I couldn't find this anywhere. These answers seem super complicated, and there's just no way I would even use an SDK if I had to do stuff like that, when paging from a simple POST is so easy to start with. However:
from facebook_business.api import FacebookAdsApi
from facebook_business.adobjects.adaccount import AdAccount

FacebookAdsApi.init(my_app_id, my_app_secret, my_access_token)
my_account = AdAccount('act_23423423423423423')
# In the below, I added the limit to the max rows, 250.
# Also, more importantly, paging: the SDK has a really sneaky way of doing this.
# Enclose the request in list() -- the results end up the same, but this will
# make the script request new objects until there are no more.
# I tested this example and compared it to the Graph API and, as of right now,
# 1/22 9:47AM, I get 81 from Graph and 81 here.
fields = ['name']
params = {'limit': 250}
ads = list(my_account.get_ads(
    fields=fields,
    params=params,
))
Trick from the docs: "NOTE: We wrap the return value of get_ad_accounts with list() because get_ad_accounts returns an EdgeIterator object (located in facebook_business.adobjects) and we want to get the full list right away instead of having the iterator lazily loading accounts."
https://github.com/facebook/facebook-python-business-sdk
In this example you offset / paginate by one at a time. I think my while loop is simpler since it only looks for the pagination key "next" to be None; if it doesn't exist, it means we have finished looping, and you will have your results in a list.
In this example I am just looking for all the people called Jacob.
import requests
import facebook

token = access_token = "your token goes here"
fb = facebook.GraphAPI(access_token=token)
limit = 1
offset = 0
data = {"q": "jacob",
        "type": "user",
        "fields": "id",
        "limit": limit,
        "offset": offset}
req = fb.request('/search', args=data, method='GET')
users = []
for item in req['data']:
    users.append(item["id"])
pag = req['paging']
while pag.get("next") is not None:
    offset += limit
    data["offset"] = offset
    req = fb.request('/search', args=data, method='GET')
    for item in req['data']:
        users.append(item["id"])
    pag = req.get('paging')
print users

How to get all YouTube comments with Python's gdata module?

Looking to grab all the comments from a given video, rather than go one page at a time.
from gdata import youtube as yt
from gdata.youtube import service as yts
client = yts.YouTubeService()
client.ClientLogin(username, pwd) #the pwd might need to be application specific fyi
comments = client.GetYouTubeVideoComments(video_id='the_id')
a_comment = comments.entry[0]
The above code will let you grab a single comment, likely the most recent comment, but I'm looking for a way to grab all the comments at once. Is this possible with Python's gdata module?
The YouTube API docs for comments, the comment feed docs and the Python API docs
The following achieves what you asked for using the Python YouTube API:
from gdata.youtube import service

USERNAME = 'username#gmail.com'
PASSWORD = 'a_very_long_password'
VIDEO_ID = 'wf_IIbT8HGk'

def comments_generator(client, video_id):
    comment_feed = client.GetYouTubeVideoCommentFeed(video_id=video_id)
    while comment_feed is not None:
        for comment in comment_feed.entry:
            yield comment
        next_link = comment_feed.GetNextLink()
        if next_link is None:
            comment_feed = None
        else:
            comment_feed = client.GetYouTubeVideoCommentFeed(next_link.href)

client = service.YouTubeService()
client.ClientLogin(USERNAME, PASSWORD)

for comment in comments_generator(client, VIDEO_ID):
    author_name = comment.author[0].name.text
    text = comment.content.text
    print("{}: {}".format(author_name, text))
Unfortunately the API limits the number of entries that can be retrieved to 1000. This was the error I got when I tried a tweaked version with a hand crafted GetYouTubeVideoCommentFeed URL parameter:
gdata.service.RequestError: {'status': 400, 'body': 'You cannot request beyond item 1000.', 'reason': 'Bad Request'}
Note that the same principle should apply to retrieve entries in other feeds of the API.
If you want to hand craft the GetYouTubeVideoCommentFeed URL parameter, its format is:
'https://gdata.youtube.com/feeds/api/videos/{video_id}/comments?start-index={start_index}&max-results={max_results}'
The following restrictions apply: start-index <= 1000 and max-results <= 50.
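Walking the feed by hand within those restrictions might look like the sketch below, reusing the client from above and passing the hand-crafted URI the same way the generator passes next_link.href (that call signature is an assumption carried over from that example):
FEED_URL = ('https://gdata.youtube.com/feeds/api/videos/{video_id}/comments'
            '?start-index={start_index}&max-results={max_results}')

def manual_comments(client, video_id, max_results=50):
    start_index = 1
    while start_index <= 1000:  # the API refuses requests beyond item 1000
        uri = FEED_URL.format(video_id=video_id,
                              start_index=start_index,
                              max_results=max_results)
        feed = client.GetYouTubeVideoCommentFeed(uri)
        if not feed.entry:
            break  # ran out of comments before hitting the hard limit
        for comment in feed.entry:
            yield comment
        start_index += max_results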
This is the only solution I've got for now, but it's not using the API and gets slow when there are several thousand comments.
import bs4, re, urllib2

# grab the page source for the video
data = urllib2.urlopen(r'http://www.youtube.com/all_comments?v=video_id')  # example XhFtHW4YB7M
# pull out comments
soup = bs4.BeautifulSoup(data)
cmnts = soup.findAll(attrs={'class': 'comment yt-tile-default'})
# do something with them, i.e. count them
print len(cmnts)
Note that due to 'class' being a built-in Python name, you can't do regular searches for 'startswith' via regex or lambdas as seen here, since you're using a dict rather than regular parameters. It also gets pretty slow due to BeautifulSoup, but it needs to be used because etree and minidom don't find the matching tags for some reason, even after prettify()ing with bs4.
