I am trying to simulate a manual process to get the cookies from the request headers for a specific URL.
Here's my attempt so far:
import webbrowser
import pyautogui
import time

url = 'https://www.moi.gov.kw/mfservices/immigration-fines/residency/281080801871'

webbrowser.register('chrome',
                    None,
                    webbrowser.BackgroundBrowser("C:/Program Files/Google/Chrome/Application/chrome.exe"))
webbrowser.get('chrome').open_new(url)

time.sleep(3)                           # give the browser window time to open and take focus
pyautogui.hotkey('ctrl', 'shift', 'i')  # open Chrome DevTools
time.sleep(3)                           # give DevTools time to open
pyautogui.press('f5')                   # reload so the Network tab captures the requests
The code simply opens the URL in the Chrome browser, then opens the Network panel (Chrome Developer Tools) and refreshes the page so that the requests get captured in the Network tab.
How can I do "Copy as cURL (cmd)" from the code, or, to put the question another way, how can I read the cookies shown in the request headers section?
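For reference, a minimal hedged sketch of an alternative that skips scripting DevTools with pyautogui altogether: drive Chrome through Selenium and read the cookies directly with get_cookies(). This assumes Selenium and a matching chromedriver are installed; the URL is the one from the question:

import requests
from selenium import webdriver

url = 'https://www.moi.gov.kw/mfservices/immigration-fines/residency/281080801871'

driver = webdriver.Chrome()
driver.get(url)

# these are the cookies the browser would send in the Cookie request header
for cookie in driver.get_cookies():
    print(cookie['name'], '=', cookie['value'])

# replay them in a requests session, much like a copied cURL command would
s = requests.Session()
for cookie in driver.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'])
response = s.get(url)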
Goal: Collect all the images from a site as I browse.
I've tried:
requests and wget don't work even with cookies set and all headers changed to mimic Firefox.
Firefox cache has the images, but they all have a random string as the name. I need logical names to sort them.
selenium-wire is very close to working. When I do driver.get(), driver.requests gives me all the requests as expected, which can then be saved. The problem is that when I click buttons on the site, the new requests do not get added to driver.requests. I tried the following (see also the wait_for_request sketch after the snippet):
import time
from seleniumwire import webdriver  # selenium-wire's patched webdriver

driver = webdriver.Firefox()
driver.get("url")
while True:
    time.sleep(1)
    # browse site
    for request in driver.requests:
        if request.response:
            if "image/jpeg" in request.response.headers.get('Content-Type', ''):
                # the raw URL is not a valid file path; use its last segment
                filename = request.url.split('/')[-1]
                with open(filename, 'wb') as f:
                    f.write(request.response.body)
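As a hedged suggestion rather than a confirmed fix: selenium-wire also provides driver.wait_for_request(), which blocks until a request whose URL matches a pattern is captured, so requests fired by button clicks can be picked up as they happen. A sketch, reusing the driver above:

# wait up to 30 s for the next request whose URL matches the pattern;
# raises TimeoutException if nothing matching is captured
request = driver.wait_for_request(r'\.jpe?g', timeout=30)
if request.response:
    with open(request.url.split('/')[-1], 'wb') as f:
        f.write(request.response.body)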
I have been looking on the internet for an answer to this, but so far I haven't found quite what I was looking for. I'm able to open a webpage via Python's webbrowser module, but what I want to know is how to download the HTML of the webpage that Python has asked the browser (Firefox in this case) to open. This is because certain webpages have sections I cannot fully access without a particular browser extension/add-on (MetaMask), which also requires logging in from within the extension; that happens automatically if I open Firefox normally or with the webbrowser module. This is why requesting the HTML with a URL directly from Python, with code such as this, doesn't work:
import requests

url = 'https://www.google.com/'
r = requests.get(url)
r.text

or, using only the standard library:

from urllib.request import urlopen

with urlopen(url) as f:
    html = f.read()
The only solution I have got so far is to open the webpage with the webbrowser module and then use the pyautogui module to make my PC automatically press Ctrl+S (the Firefox hotkey to save the HTML of the current page) and then press Enter:
import webbrowser
import pyautogui
import time

def get_html():
    url = 'https://example.com/'
    webbrowser.open_new(url)  # open the webpage in the default browser (Firefox)
    time.sleep(1.2)
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.press('enter')

get_html()
However, I was wondering if there is a more sophisticated and efficient way that doesn't involve simulated key pressing with pyautogui.
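A hedged sketch of one possibility, not from the original post: Selenium's Firefox driver can load an extension into the automated browser with install_addon(), so the page can be opened with the add-on active and the HTML read back without simulated key presses. The .xpi path below is a placeholder:

from selenium import webdriver

driver = webdriver.Firefox()

# load the extension the site requires; the path is hypothetical
driver.install_addon('/path/to/metamask.xpi', temporary=True)

driver.get('https://example.com/')
html = driver.page_source  # the rendered HTML, with the extension active

with open('page.html', 'w', encoding='utf-8') as f:
    f.write(html)

Note that logging in inside the extension itself may still need manual interaction or a pre-configured Firefox profile.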
Can you try the following:
import requests

url = 'https://www.google.com/'
r = requests.get(url)
with open('page.html', 'w', encoding='utf-8') as outfile:
    outfile.write(r.text)
If the above solution doesn't work, you can use the Selenium library to open a browser:
import time
from selenium import webdriver

url = 'https://www.google.com/'
driver = webdriver.Firefox()
driver.get(url)
time.sleep(2)
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
Now let me start off by saying that I know bs4, Scrapy, Selenium and much more can do this, but that isn't what I want, for numerous reasons.
What I would like to do is open a web browser (Chrome, IE, Firefox) with the webbrowser module and extract the HTML from the page after the site has loaded in that browser.
import webbrowser

url = 'https://www.google.com/'
webbrowser.get("C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s").open(url)
# get html from browser that is open
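For what it's worth, the webbrowser module only launches the browser; it offers no channel to read the page back. A hedged sketch of the closest equivalent, pointing Selenium at the same Chrome binary (the path is assumed, and a matching chromedriver must be installed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"

driver = webdriver.Chrome(options=options)
driver.get('https://www.google.com/')
html = driver.page_source  # the HTML of the loaded page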
import requests
import pdfkit

# start a session
s = requests.Session()
data = {'username': 'name', 'password': 'pass'}

# POST request to log in; the session stores the returned cookies
s.post('https://www.facebook.com/login.php', data=data)

url = 'https://www.facebook.com'

# navigate to the page with the cookies set
options = {'cookie': s.cookies.items(), 'javascript-delay': 1000}
pdfkit.from_url(url, 'file.pdf', options=options)
I'm trying to automate the process of saving a login-protected webpage as a PDF by setting the cookies and navigating to the page using requests. Is there a better way to tackle this/something I'm doing wrong?
The portal sends the login and password under different names, and it also sends hidden values that can change with every request. It posts to a different URL than login.php, and it can check headers to block bots/scripts.
It can be easier with Selenium, which controls a real browser: you can take a picture or get the HTML to generate a PDF.
import selenium.webdriver
from selenium.webdriver.common.by import By
import pdfkit
#import time

driver = selenium.webdriver.Chrome()
#driver = selenium.webdriver.Firefox()

driver.get('https://www.facebook.com/login.php')
#time.sleep(1)

# find_element(By.ID, ...) is the Selenium 4 spelling of find_element_by_id
driver.find_element(By.ID, 'email').send_keys('your_login')
driver.find_element(By.ID, 'pass').send_keys('your_password')
driver.find_element(By.ID, 'loginbutton').click()
#time.sleep(2)

driver.save_screenshot('output.png')  # only the visible part of the page
#print(driver.page_source)
pdfkit.from_string(driver.page_source, 'file.pdf')
Maybe with the "PhantomJS" driver or the PIL/Pillow module you could get the full page as a screenshot.
See generate-full-page-screenshot-in-chrome
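As a hedged aside, not part of the original answer: with Chrome and Selenium 4, a full-page screenshot can also be taken through the DevTools protocol, avoiding the unmaintained PhantomJS. A sketch, reusing the Chrome driver above (execute_cdp_cmd is Chrome-only):

import base64

# Page.captureScreenshot with captureBeyondViewport renders the whole
# page, not just the visible viewport
result = driver.execute_cdp_cmd('Page.captureScreenshot',
                                {'captureBeyondViewport': True})
with open('full_page.png', 'wb') as f:
    f.write(base64.b64decode(result['data']))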
With wkhtmltopdf, you can do something like this from the command line. The first call submits the login form with --post and saves the session cookies to cookies.txt; the second call reuses that cookie jar to fetch the protected page:

wkhtmltopdf --cookie-jar cookies.txt https://example.com/loginform.html --post 'user_id' 'my_id' --post 'user_pass' 'my_pass' --post 'submit_btn' 'submit' throw_away.pdf

wkhtmltopdf --cookie-jar cookies.txt https://example.com/securepage.html keep_this_one.pdf
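Since pdfkit forwards its options as wkhtmltopdf flags, the same two-step cookie-jar trick can be sketched from Python. This is an assumption based on pdfkit's documented option pass-through (repeatable flags given as lists of tuples), not something from the original answer:

import pdfkit

# step 1: submit the login form; wkhtmltopdf stores the cookies in cookies.txt
pdfkit.from_url('https://example.com/loginform.html', 'throw_away.pdf',
                options={'cookie-jar': 'cookies.txt',
                         'post': [('user_id', 'my_id'),
                                  ('user_pass', 'my_pass'),
                                  ('submit_btn', 'submit')]})

# step 2: fetch the protected page with the saved cookie jar
pdfkit.from_url('https://example.com/securepage.html', 'keep_this_one.pdf',
                options={'cookie-jar': 'cookies.txt'})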
I'm trying to write a web crawler that downloads a CSV file from a dynamic URL.
The URL is like http://aaa/bbb.mcv/Download?path=xxxx.csv
When I put this URL into my Chrome browser, the download starts immediately and the page doesn't change.
I can't even find any request in the developer tools.
I've tried two ways to get the file:
Put the URL in Selenium:
driver.get(url)
Try to get the file with the requests library:
requests.get(url)
Neither worked...
Any advice?
Output of the two ways:
With Selenium I took a screenshot, and the page doesn't seem to change (just like in Chrome).
With requests I printed out the data I got, and it looks like an HTML file; opening it in the browser shows it is a login page.
import requests

url = '...'
save_location = '...'

session = requests.Session()
response = session.get(url)

with open(save_location, 'wb') as t:
    for chunk in response.iter_content(1024):
        t.write(chunk)
Thanks for everyone's help!
I finally found the problem: I logged in to the website with Selenium but used requests to download the file, so the requests session didn't have any of the authentication information!
My solution is to get the cookies from Selenium first, then feed them into requests.
Here is my code:
cookies = driver.get_cookies()  # from the logged-in Selenium webdriver
s = requests.Session()
for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])
response = s.get(url)
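Putting this together with the earlier answer, a minimal sketch of the whole flow (url, save_location, and the logged-in driver are assumed from the snippets above):

import requests

# copy the authenticated cookies out of the browser
s = requests.Session()
for cookie in driver.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'])

# stream the CSV to disk with the authenticated session
response = s.get(url, stream=True)
with open(save_location, 'wb') as t:
    for chunk in response.iter_content(1024):
        t.write(chunk)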