Requirement
Get the final URL after the redirect chain completes, as it happens in a normal browser, using Selenium.
The URLs are article URLs taken from Twitter.
Behavior in a normal desktop browser, after inspecting the redirects and headers:
The URL is a Twitter (t.co) URL which returns a 301 (Moved Permanently).
The browser then follows the Location header to a shortened URL, which in turn returns a 302.
It follows the redirect chain again and lands on the final article page.
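For reference, the same hop-by-hop chain can be traced in code by disabling automatic redirects and following the Location header manually. A minimal sketch using requests (the t.co URL is the example given further below; the headers are an assumption):
import requests
from urllib.parse import urljoin

url = "https://t.co/Mg3IYF8ZLm?amp=1"          # example URL from this question
headers = {"User-Agent": "Mozilla/5.0"}        # any desktop user agent

# follow the chain manually so every hop and status code is visible
while True:
    resp = requests.get(url, headers=headers, allow_redirects=False, timeout=30)
    print(resp.status_code, url)
    if resp.status_code in (301, 302, 303, 307, 308):
        url = urljoin(url, resp.headers["Location"])  # Location may be relative
    else:
        break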
Behavior using Selenium
Selenium finally redirects to the main website homepage/index rather than the actual article page. The final URL is not the same as the actual one.
Initial basic code
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:70.0) Gecko/20100101 Firefox/70.0')
# the above user agent is an example; multiple different user agents were tried
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
browser = webdriver.Chrome(driver_location, chrome_options=chrome_options)
browser.set_page_load_timeout(45)
browser.get(url)
final_url = browser.current_url
Various attempts to get the final URL instead of the main website home/index page
With normal wait
browser.get(url)
time.sleep(25)
With WebDriverWait-
WebDriverWait(browser,20)
with expected_conditions
WebDriverWait(browser, 20).until(EC.url_contains("portion of the final url"))
Ends up timing out every time, even with different conditions like url_to_be etc.
Behavior when trying non-Selenium options
1. Wget -
Below is the response from a wget call, edited to obscure actual details -
Resolving t.co (t.co)... Connecting to t.co (t.co)|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: [following]
Resolving domain (domain)... Connecting to ... connected.
HTTP request sent, awaiting response... 302 Found
Location: [following]
--Date-- Resolving website (website)... ip, connected.
HTTP request sent, awaiting response... 200 OK
As seen, we finally get the homepage rather than the actual article page.
2. Requests library -
response = requests.get(Url, allow_redirects=True, headers=self.headers, timeout=30)
(The headers contain user agents, but I also tried the exact same request headers from a browser that gets the proper final URL.) It still gets the homepage.
Checking the redirects via response.history, we see that from t.co (the Twitter URL) we redirect to the short URL, then redirect to the website homepage, and end there.
3. urllib library -
Same final response.
Test e.g. URL - t.co/Mg3IYF8ZLm?amp=1 (add the https:// - I removed it for posting)
After days of different approaches, I am stuck. I somehow think that Selenium is key to resolving this: if it works in normal desktop browsers, then it should work with Selenium, right?
Edit: It seems this happens with other versions of the drivers and Selenium too. It would be great if we could at least find out the actual reason it happens with certain links like the example given.
I have used your code only.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:70.0) Gecko/20100101 Firefox/70.0')
#above user agent is an example but multiple different user agents were tried
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.set_page_load_timeout(120)
browser.get("url")
final_url = browser.current_url
print(final_url)
The issue is that after the intermediate URL, the request cannot redirect to the final page due to missing headers; the intermediate URL (http://www.cumhuriyet.com.tr/) is blocked with a 403 status code.
Once we add the headers to the request, the URL navigates to the final URL.
I have given a Java implementation below. Sorry, I am not familiar with doing this in Python; converting this code to Python may solve the issue.
@Test
public void RUN() {
    BrowserMobProxy proxy = new BrowserMobProxyServer();
    proxy.start(0);
    Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);
    proxy.addRequestFilter((request, contents, messageInfo) -> {
        request.headers().add("Accept-Language", "en-US,en;q=0.5");
        request.headers().add("Upgrade-Insecure-Requests", "1");
        return null;
    });
    String proxyOption = "--proxy-server=" + seleniumProxy.getHttpProxy();
    ChromeOptions options = new ChromeOptions();
    options.addArguments("--headless");
    options.addArguments(proxyOption);
    System.setProperty("webdriver.chrome.driver", "<driverLocation>");
    WebDriver driver = new ChromeDriver(options);
    driver.get("<url>");
    System.out.println(driver.getCurrentUrl());
}
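As a rough Python equivalent (my own sketch, not from the original answer): the selenium-wire package can inject the same headers without running a separate BrowserMob server. The driver setup and the URL are placeholders; adjust to your environment.
# pip install selenium-wire
from seleniumwire import webdriver  # drop-in wrapper around selenium's webdriver

def interceptor(request):
    # delete first in case the header already exists, then add the browser-like headers
    del request.headers['Accept-Language']
    request.headers['Accept-Language'] = 'en-US,en;q=0.5'
    del request.headers['Upgrade-Insecure-Requests']
    request.headers['Upgrade-Insecure-Requests'] = '1'

options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)   # assumes chromedriver is on PATH
browser.request_interceptor = interceptor     # selenium-wire request hook
browser.get('<url>')
print(browser.current_url)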
Related
I have a simple script where I want to scrape a menu from a url:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I can see that the menu is contained in the section <div class="menu-area" id="section_1026228">.
So my script is fairly simple as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved copy of the page and it works. But when I run it against the full URL using the requests library, it does not work. It cannot find the div and throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?
I just logged out from my browser and it showed me a different page. However, my script has no login part at all. I'm not even sure how that would work.
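One quick way to confirm that requests is being served a different, logged-out page (a minimal debugging sketch, not from the original post):
import requests

venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers={'User-agent': 'Mozilla/5.0'})
print(response.status_code)            # 200 does not guarantee the same page a browser sees
print('menu-area' in response.text)    # False confirms the menu div is missing from the served HTML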
[It doesn't work with all sites, but it seems to be enough for this site so far.] You can log in with requests.Session.
# import requests
sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url) ## should print 200 OK...
response = sess.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the menu data.
Note: they have a captcha when logging in, so I was worried it would be too hard to automate. If it becomes an issue, you can [probably] still log in in your browser before going to the page, then paste the request from your network log into curlconverter to get the cookies as a dictionary. Of course, the process is then no longer fully automated, since you'll have to repeat this manual login every time the cookies expire (which could be as fast as a few hours). If you wanted to automate the login at that point, you might have to use some kind of browser automation, like Selenium.
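For example, once curlconverter gives you the cookies as a dictionary, they can be passed straight to requests (the cookie name and value below are placeholders, not the site's real ones):
# import requests
cookies = {'session_cookie_name': 'session_cookie_value'}   # placeholder, paste from curlconverter
response = requests.get(venue_url, headers={'User-agent': 'Mozilla/5.0'}, cookies=cookies)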
I want to make a GET request to a TikTok URL via Python, but it does not work.
Let's say we have a TikTok link from the mobile app - https://vt.tiktok.com/ZS81uRSRR/ - and I want to get its video_id, which is available in the canonical link. This is the canonical link for the provided TikTok: https://www.tiktok.com/@notorious_foodie/video/7169643841316834566?_r=1&_t=8XdwIuoJjkX&is_from_webapp=v1&item_id=7169643841316834566
video_id comes after /video/; for example, in the link above the video_id would be 7169643841316834566.
When I open the mobile link in a browser on my laptop, it redirects me to the canonical link. I wanted to achieve the same behavior via code and managed to do it like so:
import requests
def get_canonical_url(url):
return requests.get(url, timeout=5).url
It was working for a while but then started raising timeout errors every time. I managed to fix it by providing a cookie: I made the request in Postman (the GET request works fine through Postman), copied the cookies, and modified my function to accept cookies, and it started working again. It had been working with cookies for about 6 months, but last week it stopped working again. I thought the reason might be expired cookies, but updating them didn't help.
This is the error I keep getting:
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.tiktok.com', port=443): Read timed out. (read timeout=5)
The weirdest thing is that I can make the desired request just fine via curl or via Postman.
Recap
So the problem is that my Python GET request never succeeds and I can't understand why. I tried using a VPN in case TikTok has banned my IP, and I also tried running this request from some of my servers to try different server locations, but none of my attempts worked.
Could you give me advice on how to debug this issue further, or any other ideas on how I can get the video_id out of a mobile TikTok link?
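(Once a canonical URL has been obtained by any of the methods below, extracting the video_id described above is a one-line regex - a small sketch:)
import re

canonical = "https://www.tiktok.com/@notorious_foodie/video/7169643841316834566?_r=1"
match = re.search(r"/video/(\d+)", canonical)   # video_id is the digits after /video/
if match:
    print(match.group(1))                       # 7169643841316834566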
Method 1 - Using subprocess
Execute the curl command and capture its output; it takes ~0.5 seconds.
import subprocess
import re

# run curl and capture the response body; the redirect page contains the canonical link
process_detail = subprocess.Popen(["curl", "https://vt.tiktok.com/ZS81uRSRR/"], stdout=subprocess.PIPE)
output = process_detail.communicate()[0].decode()
process_detail.kill()
canonical_link = re.search(r"(?P<url>https?://[^\s]+)+\?", output).group("url")
print("Canonical link: ", canonical_link)
Method 2 - Using proxies
We need to use proxies. Here is a solution using free proxies, which we can scrape from free-proxy-list.net and apply dynamically, with BeautifulSoup doing the parsing.
First install BeautifulSoup using pip install beautifulsoup4
Solution:
from bs4 import BeautifulSoup
import requests

def scrap_now(url):
    print(f"<======================> Scraping Started <======================>")
    print(f"<======================> Getting proxy <======================>")
    source = requests.get('https://free-proxy-list.net/').text
    soup = BeautifulSoup(source, "html.parser")
    ips_container = soup.findAll("table", {"class": "table table-striped table-bordered"})
    ip_trs = ips_container[0].findAll('tr')
    for i in ip_trs[1:]:
        proxy_ip = i.findAll('td')[0].text + ":" + i.findAll('td')[1].text
        try:
            proxy = {"https": proxy_ip}
            print(f"<======================> Trying with: {proxy_ip} <======================>")
            headers = {'User-Agent': 'Mozilla/5.0'}
            resp = requests.get(url, headers=headers, proxies=proxy, timeout=5)
            if resp.status_code == requests.codes.ok:
                print(f"<======================> Got success with: {proxy_ip} <======================>")
                return resp.url
        except Exception as e:
            print(e)
            continue
    return ""

canonical_link = scrap_now("https://vt.tiktok.com/ZS81uRSRR/")
print("Canonical link: ", canonical_link)
Method 3 - Using Selenium
We can do this with Selenium as well. It takes almost 5 seconds.
First, install selenium using pip install selenium==3.141.0
Then execute the lines below:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
"profile.default_content_setting_values.media_stream_mic": 1,
"profile.default_content_setting_values.media_stream_camera": 1,
"profile.default_content_setting_values.geolocation": 1,
"profile.default_content_setting_values.notifications": 1,
"credentials_enable_service": False,
"profile.password_manager_enabled": False
})
options.add_argument('--headless')
options.add_experimental_option("excludeSwitches", ['enable-automation'])
browser = webdriver.Chrome(ChromeDriverManager(cache_valid_range=365).install(), options=options)
browser.get("https://vt.tiktok.com/ZS81uRSRR/")
print("Canonical link: ", browser.current_url)
Note: on the first run it will take a bit more time as it downloads the web driver automatically, but after that it will use the cache only.
I am new to web scraping and would like to learn how to do it properly and politely. My problem is similar to [this][1].
'So I am trying to log into and navigate to a page using python and requests. I'm pretty sure I am getting logged in, but once I try to navigate to a page the HTML I print from that page states you must be logged in to see this page.'
I've checked robots.txt of the website I would like to scrape. Is there something which prevents me from scraping?
User-agent: *
Disallow: /caching/
Disallow: /admin3003/
Disallow: /admin5573/
Disallow: /members/
Disallow: /pp/
Disallow: /subdomains/
Disallow: /tags/
Disallow: /templates/
Disallow: /bin/
Disallow: /emails/
My code with the solution from the link above which does not work for me:
import requests
from bs4 import BeautifulSoup

login_page = <login url>
link = <required url>

payload = {
    'username': <some username>,
    'password': <some password>
}

p = requests.post(login_page, data=payload)
cookies = p.cookies
page_response = requests.get(link, cookies=cookies)
page_content = BeautifulSoup(page_response.content, "html.parser")
The RequestsCookieJar shows the cookie ASP.NET_SessionId=1adqylnfxbqf5n45p0ooy345 for WEBSITE (via p.cookies)
Output of p.status_code : 200
UPDATE:
s = requests.session()
doesn't solve my problem. I had tried this before I started looking into cookies.
Update 2:
I am trying to collect news from a particular website. First I filtered the news with a search word and saved the links that appeared on the first page, using Python requests + BeautifulSoup. Now I would like to go through the links and extract the news from them. The full text is visible with credentials only. There is no special login page; it is possible to log in from any page. There is a login button, and when you hover over it a login window appears, as in the attached image. I tried to log in both via the main page and via the page from which I would like to extract the text (not at the same time, but in different trials). Neither works.
I also tried to find a CSRF token by searching for "csrf_token", "authentication_token", "csrfmiddlewaretoken", "csrf", "auth". Nothing was found in the HTML on the web pages.
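One extra check worth trying (a debugging sketch of my own, assuming the login form is plain HTML rather than rendered by JavaScript) is to list every hidden input on the login page; a token, if present, usually shows up there:
import requests
from bs4 import BeautifulSoup

resp = requests.get(<login url>, headers={'User-Agent': 'Mozilla/5.0'})  # same placeholder as above
soup = BeautifulSoup(resp.text, 'html.parser')
# print every hidden input field; CSRF-style tokens normally appear among these
for hidden in soup.find_all('input', {'type': 'hidden'}):
    print(hidden.get('name'), hidden.get('value'))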
You can use requests.Session() to stay logged in, but you have to save the login cookies as a JSON file. The example below shows scraping code that saves a Facebook login session as cookies in JSON format:
import selenium
import mechanicalsoup
import json
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import requests
import time

s = requests.Session()
email = raw_input("Enter your facebook login username/email: ")
password = raw_input("Enter your facebook password: ")

def get_driver():
    driver = webdriver.Chrome(executable_path='your_path_to_chrome_driver')
    driver.wait = WebDriverWait(driver, 3)
    return driver

def get_url_cookie(driver):
    driver.get('https://facebook.com')
    driver.find_element_by_name('email').send_keys(email)
    driver.find_element_by_name('pass').send_keys(password)
    driver.find_element_by_id('loginbutton').click()
    cookies_list = driver.get_cookies()
    script = open('facebook_cookie.json', 'w')
    json.dump(cookies_list, script)
    script.close()

driver = get_driver()
get_url_cookie(driver)
The code above gets you the login session cookies using driver.get_cookies() and saves them as a JSON file. To use the cookies, just load them:
with open('facebook_cookie.json') as c:
    load = json.load(c)
    for cookie in load:
        s.cookies.set(cookie['name'], cookie['value'])

url = 'https://facebook.com/the_url_you_want_to_visit_on_facebook'
browser = mechanicalsoup.StatefulBrowser(session=s)
browser.open(url)
and you get your session loaded...
options = webdriver.ChromeOptions()
#options.add_argument('-headless')
browser = webdriver.Chrome(executable_path="./chromedriver", options=options)
browser.get("http://127.0.0.1:8080/")
print browser.title
browser.find_element_by_name('username').send_keys("admin")
browser.find_element_by_name("password").send_keys("hunter2")
browser.find_element_by_tag_name("button").click()
print browser.get_cookies()
print 'loading another page: ' + url
#example url = example.com
browser.get(url)
I'm trying to do an automated test involving CORS. My requirement is that I log in to domain A successfully and get some cookies set. This works, and I see the cookies set when I call get_cookies(). Next, I navigate to another domain B, which makes a CORS request to domain A (all CORS headers are properly set, and tested manually). But this request fails because it appears that when I navigate to domain B the cookies are cleared, so the request is unsuccessful.
Is there any way to force the cookies not to be cleared?
Note: same behavior with the Chrome and Firefox drivers on OS X.
Use browser.navigate().to(url) instead of get().
Credits: https://stackoverflow.com/a/42465954/4536543
I'm trying to make Selenium capture the page source after it has fully rendered. If I go to the page and capture it straight away, only part of the page has rendered; if I put in a sleep of 30 seconds it fully renders, but I want this to be more efficient.
If we use https://twitter.com/i/notifications as an example, you'll see that about 5 seconds after the page loads there are a toast_poll and a timeline XHR request.
I want to be able to detect one of these requests and wait until one fires; that would be the indicator that the page has fully loaded.
The site that I am using fires console.log("Done"), so if I could detect console output in PhantomJS and Firefox this would be an even better option than waiting for an XHR request: just wait until "Done" appears in the console, and that is the indicator that the page has fully loaded.
Regarding the duplicate flagging of this post:
This question is about PhantomJS and Firefox. The post "Detect javascript console output with python" is from over a year ago, and the answer given only works on Chrome. I am looking for a PhantomJS and Firefox option, which, based on StackOverflow, I already suspect isn't possible; that's why my post starts with waiting for an XHR request.
I've already tried the following code, but it doesn't work for me. I get no output even though the website is issuing a console.log("Done"):
from seleniumrequests import PhantomJS
from seleniumrequests import Firefox
from selenium import webdriver
import os
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36'
webdriver.DesiredCapabilities.PHANTOMJS['loggingPrefs'] = { 'browser':'ALL' }
browser = PhantomJS(executable_path="phantomjs.exe", service_log_path=os.path.devnull)
browser = webdriver.Firefox()
browser.set_window_size(1400, 1000)
url = "https://website.com"
browser.get(url)
for entry in browser.get_log('browser'):
print entry
I'm unable to test with browser = webdriver.Firefox() commented out because I am not sure how to have two lots of DesiredCapabilities set.
You could override the console.log function and wait for the "Done" message with execute_async_script:
from selenium import webdriver
driver = webdriver.Firefox()
driver.set_script_timeout(10)
driver.get("...")
# wait for console.log("Done") to be called
driver.execute_async_script("""
var callback = arguments[0];
console.log = function(message) {
if(message === "Done")
callback();
};
""")