I'm using this code to unshorten URLs. It works correctly, but I can't get it to work on this particular one: "https://www.shareasale-analytics.com/u.cfm?d=654202&m=52031&u=1363577&shrsl_analytics_sscid=41k4%5F9si0z&shrsl_analytics_sstid=41k4%5F9si0z" --> a URL containing an affiliate link.
response = requests.get(url, timeout=15)
if response.history:
    url_new = response.url
It simply does not find the final URL. The result should be https://www.gearbest.com/other-novelty-lights/pp_009234504925.html
The redirect for this particular URL is performed by JavaScript. That works fine in a browser, but the Python requests module cannot follow these JS redirects.
I used Postman to find this out in the first place. These are the steps I performed:
Used a browser (Firefox) to verify that the redirect does work. -> It worked.
Used Postman to see the actual response. This is what Postman received:
<head></head>
<body>
<script LANGUAGE="JavaScript1.2">
window.location.replace('https:\/\/www.gearbest.com\/other-novelty-lights\/pp_009234504925.html?wid=1433363&sscid=41k4_9si0z&utm_source=shareasale&utm_medium=shareasale&utm_campaign=shareasale&sascid=41k4_9si0z&userID=1363577')
</script>
</body>
</html>
I trimmed the whitespace from the response.
So it is clear that JS is redirecting this further.
To make this work, you will need to perform two steps:
Update the User-Agent header so that the response contains the HTML with the JS redirect information.
Follow the redirect yourself.
Hope this helps!
The issue you're seeing is that the redirect is performed via JavaScript rather than a regular HTTP redirect; also, in order to receive the JS code you need to change your user agent:
import re
import requests

url = "https://www.shareasale-analytics.com/u.cfm?d=654202&m=52031&u=1363577&shrsl_analytics_sscid=41k4%5F9si0z&shrsl_analytics_sstid=41k4%5F9si0z"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"}

response = requests.get(url, headers=headers, timeout=15)
if response.history:
    # A regular HTTP redirect: requests has already followed it.
    url_new = response.url
else:
    # A JS redirect: pull the target URL out of the returned script.
    matches = re.findall(r"window\.location\.replace\('(.*)'\)", response.content.decode(), re.DOTALL)
    if matches:
        # The URL comes back with escaped slashes (https:\/\/...), so strip the backslashes.
        url_new = matches[0].strip().replace("\\", "")
As the else branch shows, after that you just retrieve the new URL with a simple regex.
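To see the extraction step in isolation, you can run the regex against the Postman response quoted above (the script body is reproduced here as a Python string, with the query parameters trimmed from the URL for brevity):

```python
import re

# The body of the response quoted earlier; \/ in the JS source is a literal
# backslash followed by a slash, which gets stripped afterwards.
html = r"""<script LANGUAGE="JavaScript1.2">
window.location.replace('https:\/\/www.gearbest.com\/other-novelty-lights\/pp_009234504925.html')
</script>"""

matches = re.findall(r"window\.location\.replace\('(.*?)'\)", html, re.DOTALL)
url_new = matches[0].strip().replace("\\", "")
print(url_new)  # https://www.gearbest.com/other-novelty-lights/pp_009234504925.html
```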
I am trying to login to a site with POST method, then navigate to another page and scrape the HTML data from that second page.
However, the website is not accepting the package I am pushing to it, and the script is returning the data for a non-member landing page instead of the member page I want.
Below is the current code that does not run.
# Import packages
import requests
from bs4 import BeautifulSoup

# Login data
url = "https://WEBSITE.com/ajax/ajax.login.php"
data = {'username': 'NAME%40MAIL.com', 'password': 'PASSWORD%23', 'token': 'ea83a09716ffea1a3a34a1a2195af96d2e91f4a32e4b00544db3962c33d82c40'}
# note that '@' is encoded in the HTML as %40, i.e. NAME%40MAIL.com
# note that '#' is encoded in the HTML as %23, i.e. PASSWORD%23
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}

# Post the data package and log in
session = requests.Session()
r = session.post(url, headers=headers, data=data)

# Navigate to the next page and scrape the data
s = session.get('https://WEBSITE.com/page/93/EXAMPLE-PAGE')
soup = BeautifulSoup(s.text, 'html.parser')
print(soup)
I have inspected the elements on the login page and the AJAX URL for the login action is correct, and there are 3 form fields that need to be filled, as seen in the image below. I pulled the hidden token value from the inspect element panel and passed it along with the username/e-mail and password:
Inspect Element Panel
I really have no clue what the issue might be, but there is a boolean variable IS_GUEST returning TRUE in the returned HTML, which tells me I have done something wrong and the script has not been granted access.
This is also puzzling to troubleshoot, since there is a redirect landing page and no server error codes to analyze or give me a hint.
I am using a different header than my actual machine, but that has never stopped me before from more simple logins.
I have encoded the '@' in the login e-mail as '%40' and the special character required in the password, '#', as '%23' (i.e. NAME@MAIL.COM = 'NAME%40MAIL.COM' and PASSWORD# = 'PASSWORD%23'). Whenever I change the e-mail back to using the plain '@' I get a garbage response, and I tried putting the plain '#' back in the password, but that changed nothing either.
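For reference, %40 is the percent-encoding of '@' and %23 of '#', and requests form-encodes a data dict by itself, so passing already-encoded values gets them encoded twice. The standard library makes this easy to check:

```python
from urllib.parse import quote, unquote

# %40 decodes to '@' and '#' encodes to %23
assert unquote('NAME%40MAIL.com') == 'NAME@MAIL.com'
assert quote('PASSWORD#', safe='') == 'PASSWORD%23'

# A value that is already percent-encoded gets encoded again
# ('%' itself becomes %25), which the server will not recognize:
assert quote('NAME%40MAIL.com', safe='') == 'NAME%2540MAIL.com'
```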
I'm trying to fetch a product title and its description from a webpage using the requests module. The title and description appear to be static, as both are present in the page source. However, I failed to grab them with the following attempt. The script throws an AttributeError at this moment.
import requests
from bs4 import BeautifulSoup

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title, product_desc)
How can I scrape title and description from above pages using requests module?
The page is dynamic. Go after the data from the API source:
import requests
import pandas as pd
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object
When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VS Code has a similar function now) to set up URLs, headers, methods, and parameters, and to see the full response with headers. When you have everything working right, just convert it to Python code. Postman even has some 'export to code' functions for common languages.
Anyways...
I tried your request on Postman and got this response:
Requests made from Python and from a browser are the same thing: if the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is comparing the difference between your request and the request made by the browser:
So one or more of the headers included by the browser gets a good response from the server, but just using User-Agent is not enough.
I would try to identify which headers, but unfortunately, Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to my sending an obvious handmade request. I think it's my IP that's blocked, since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
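A sketch of what sending a fuller, browser-like header set looks like with requests (the header names and values below are illustrative; copy the real ones from your own browser's network tab):

```python
import requests

# Illustrative browser-like headers -- replace these with the exact set
# your browser sends, taken from the dev tools network tab.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.nordstrom.com/",
    "Connection": "keep-alive",
}

def fetch(url):
    # A Session also keeps cookies between requests, which some sites check.
    with requests.Session() as s:
        s.headers.update(browser_headers)
        return s.get(url, timeout=15)
```

Narrowing down which header matters is then a matter of removing them one at a time and re-sending.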
Best of luck!
I am new to web scraping and have stumbled upon an unexpected challenge. The goal is to input an incomplete URL string for a website and "catch" the corrected URL output returned by the website's redirect function. The specific website I am referring to is Marine Traffic.
When searching for a specific vessel profile, a proper query string should contain the parameters shipid, mmsi and imo. For example, this link will return a webpage with the profile for a specific vessel:
https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA/_:97e0de64144a0d7abfc154ea3bd1010e
As it turns out, a query string with only the imo parameter will redirect to the exact same url. So, for example, the following query will redirect to the same one as above:
https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
My question is: using cURL in bash or another tool such as the Python requests library, how could one catch the redirect URL in an automated way? Curling the first URL returns the full HTML, while curling the second URL throws an Access Denied error. Why is this allowed in the browser? What is the workaround for this, if any, and what are some best practices for catching redirect URLs (using either Python or bash)?
curl https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
#returns Access Denied
Note: Adding a user agent with curl --user-agent 'Chrome/79' does not get around the issue; the error is avoided but nothing is returned.
You can try .url on the response object:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9337987"

r = requests.get(url, headers=headers)
print(r.url)
Prints:
https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA
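For the cURL half of the question, the final URL after redirects can be printed with curl's `%{url_effective}` write-out variable. A small wrapper sketch (the user agent mirrors the Python example above and is only a guess at what the site accepts):

```shell
# -s: silent, -L: follow redirects, -o /dev/null: discard the body,
# -w '%{url_effective}': print the URL curl finally landed on.
unshorten() {
  curl -sL -o /dev/null \
    -A 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0' \
    -w '%{url_effective}\n' "$1"
}

# unshorten 'https://www.marinetraffic.com/en/ais/details/ships/imo:9337987'
```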
I want to get data from this site.
When I request the main URL, I get an HTML file that contains the structure but not the values.
import requests
from bs4 import BeautifulSoup

url = 'http://option.ime.co.ir/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')  # parse the response text, not the Response object
print(soup.prettify())
I found out that the site gets its values from
url1 = 'http://option.ime.co.ir/GetTime'
url2 = 'http://option.ime.co.ir/GetMarketData'
When I view the responses from those URLs in the browser, I see a JSON response and the time in a specific format,
but when I use requests to get the data, it gives me the same HTML that I get from the main URL.
Do you know what the reason is? How can I get the responses that I see in the browser?
I checked the headers for all the URLs and didn't find anything special that I should send with my request.
You have to provide the proper HTTP headers in the request. In my case, I was able to make it work using the following headers. Note that in my testing the HTTP response was a 200 OK rather than a redirect to the root website (as when no HTTP headers were provided in the request).
Raw HTTP Request:
GET http://option.ime.co.ir/GetTime HTTP/1.1
Host: option.ime.co.ir
Referer: http://option.ime.co.ir/
Accept: application/json, text/plain, */*
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0
This should give you the proper JSON response you need.
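Translated into a requests call, the raw request above would look like this (a direct translation; the timeout and the `raise_for_status` check are additions for robustness):

```python
import requests

# The same headers as in the raw HTTP request above.
headers = {
    "Referer": "http://option.ime.co.ir/",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0",
}

def get_time():
    # Same request as the raw HTTP above; .json() parses the JSON body.
    r = requests.get("http://option.ime.co.ir/GetTime", headers=headers, timeout=15)
    r.raise_for_status()
    return r.json()
```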
Your first connection using the browser is getting a 302 Redirect response (to the same URL).
Then the page runs some JS, so the second request doesn't redirect anymore and gets the expected JSON.
It is a common technique to stop other people from using an API without permission.
Set the "Preserve log" checkbox in the dev tools so you can see it for yourself.
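You can observe the same redirect chain from requests by inspecting `response.history`, which lists every intermediate response in order (a small helper sketch; pass in whatever URL and headers you are testing):

```python
import requests

def show_redirect_chain(url, headers=None):
    # r.history holds the intermediate (redirect) responses, in order;
    # r.url is where the chain finally landed.
    r = requests.get(url, headers=headers, allow_redirects=True, timeout=15)
    for hop in r.history:
        print(hop.status_code, hop.url)
    print(r.status_code, r.url)
    return r
```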
I am new to Python and programming and would really appreciate any help here.
I am trying to log in to this website using the code below, and I just cannot get beyond the first page.
Below is the code I have been trying...
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect')
soup = BeautifulSoup(response.text, 'html.parser')
formtoken = soup.find('input', {'name': '__RequestVerificationToken'}).get('value')

payload = {'UserName': username, 'Password': password, '__RequestVerificationToken': formtoken}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}

with requests.Session() as s:
    p = s.post('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect', data=payload, headers=headers)
    r = s.get('http://www.dell.com/account/', headers=headers)
    print(r.text)
I am just not able to get beyond the login page, and I am not sure what parameters apart from the login details I need to pass. I also tried checking the form data in the Chrome dev tools, but that is encrypted. Form Data - Dev Tool screenshot
Any help here is highly appreciated.
EDIT
I have edited the code to pass the token in the payload as suggested below, but no luck yet.
You are not following the correct approach for making the POST request.
Steps you can follow:
First make a GET request to your URL.
Extract the access token from the response.
Use that access token in your POST request.
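A sketch of those three steps with requests (the form-field and token names here are illustrative, borrowed from the question above; take the real ones from the login page's HTML):

```python
import re
import requests

def extract_token(html):
    # Step 2: pull the hidden verification token out of the login page HTML.
    # The input name below is an example; match it to the actual form.
    m = re.search(r'name="__RequestVerificationToken"[^>]*value="([^"]+)"', html)
    return m.group(1) if m else None

def login(login_url, username, password):
    with requests.Session() as s:
        # Step 1: GET the login page inside the session, so the token
        # is issued against the same cookies used for the POST.
        page = s.get(login_url, timeout=15)
        token = extract_token(page.text)
        # Step 3: POST the credentials together with the extracted token.
        payload = {"UserName": username, "Password": password,
                   "__RequestVerificationToken": token}
        return s.post(login_url, data=payload, timeout=15)
```

Doing the GET and POST through one Session matters: anti-forgery tokens are usually tied to a session cookie, so a token fetched outside the session is rejected.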