My goal is to scrape the macys.com website, but I cannot get access. The following code is my initial attempt.
Attempt 1
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.macys.com').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
This resulted in the following response:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access the requested URL on this server.
<p>Reference: 18.c503d417.1587673952.4f27a98</p>
</body>
</html>
After finding similar issues on Stack Overflow, I see the most common solution is to add a header. Here is the main code from that attempt.
Attempt 2
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml')
print(soup)
Here is the last error message I have received. After researching the site, I am still unsure how to proceed.
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 586833: character maps to <undefined>
I am very much a beginner, so I appreciate any insight. I am also just genuinely curious why I don't have permission for Macy's site, as testing other sites works fine.
I tried your Attempt 2 code, and it works fine for me.
Try setting BeautifulSoup's from_encoding argument to utf-8, like so:
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml', from_encoding='utf-8')
print(soup)
I am also just genuinely curious why I don't have permission for Macy's site, as testing other sites works fine.
This is something the administrators for Macy's have done to prevent bots from accessing their website. It's an extremely trivial form of protection, though, since you only need to change the user-agent header to something typical.
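For instance, if a bare 'Mozilla/5.0' ever stops being enough, a fuller, realistic browser string usually still passes this kind of check. A minimal sketch, assuming any current desktop-browser user agent is accepted (the exact Chrome-style value below is only an illustration, not something Macy's specifically requires):
import requests

# Assumption: any realistic desktop-browser UA string works; this one is just an example
headers = {'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/103.0.0.0 Safari/537.36')}
res = requests.get('https://www.macys.com', headers=headers)
print(res.status_code)  # should be 200 once the request looks like a normal browser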
Related
I have a simple script where I want to scrape a menu from a url:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I can see that the menu is contained in the section <div class="menu-area" id="section_1026228">.
So my script is fairly simple as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved copy of the page and it works. But when I run it against the live URL using the requests library, it does not work: it cannot find the div and throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?
I just logged out in my browser and it showed me a different page. However, my script has no login step at all; I'm not even sure how that would work.
[It doesn't work with all sites, but it seems to be enough for this site so far.] You can log in with requests.Session.
# import requests
sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url) ## should print 200 OK...
response = sess.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the menu data for that venue.
Note: They have a captcha when logging in, so I was worried it would be too hard to automate; if it becomes an issue, you can [probably] still log in via your browser before going to the page and then paste the request from your network log into curlconverter to get the cookies as a dictionary. Of course, the process is then no longer fully automated, since you'll have to repeat this manual login every time the cookies expire (which could be as quickly as a few hours). If you wanted to automate the login at that point, you might have to use some kind of browser automation such as Selenium.
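For example, a minimal sketch of that cookie route, assuming you have already pasted in the dictionary that curlconverter produced (the cookie name below is made up purely for illustration):
import requests
from bs4 import BeautifulSoup

venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'

# Hypothetical placeholder; replace with the real dict generated by curlconverter
# from the request copied out of your browser's network log while logged in.
cookies = {'example_session_cookie': 'PASTE_VALUE_HERE'}

headers = {'User-agent': 'Mozilla/5.0'}
response = requests.get(venue_url, headers=headers, cookies=cookies)

soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text if menu else 'menu not found, cookies may have expired')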
I want to read this webpage:
http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html
If I use pd.read_html the content usually loads properly, but recently, I have started getting an HTTP Error 400: Bad Request.
So I tried to use:
import pandas as pd
import requests

link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)
df = pd.read_html(r.text, encoding='utf-8')[1]
which gets over the 400 error, but the Chinese characters aren't readable, as the screenshot shows.
Why does this encoding problem occur with requests vs. pd.read_html, and how can I solve it? Thanks.
I think I've solved it. Use r.content rather than r.text
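In other words, something along these lines (a minimal sketch of the same snippet, where the only change is passing r.content to pd.read_html instead of r.text):
import pandas as pd
import requests

link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)

# r.content is the raw bytes, so the parser decodes them itself (here as UTF-8)
# instead of relying on requests' guess, which garbled the Chinese text
df = pd.read_html(r.content, encoding='utf-8')[1]
print(df.head())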
Code:
from bs4 import BeautifulSoup as bs
import requests
keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"
r = requests.get(url)
soup = bs(r.content, "html.parser")
print(r)
Output: <Response [200]>
If I change the print(r) to print(soup), I get a lot of text that doesn't look like HTML.
Something like: ..."marketing_brand_charlotte_phoenix":"control","marketing_brand_houston_chicago":"enabled","marketing_brand_houston_miami_business":"enabled","marketing_brand_northnhinenestphalia_bavaria":"enabled","marketing_brand_northrhinewestphalia_bavaria_business":"enabled","marketing_brand_seattle_dallas_business":"control","marketing_brand_seattle_orlando":"control","merchant_discovery_shopify_boosting":"control","merchant_storefront_mojito_migration":"enabled","merchant_success_activation_banner_collapse_over_dismiss":"enabled","merchant_success_auto_enroll_approved_merchant":"enabled","merchant_success_catalog_activation_card_copy_update":"control","merchant_success_claim_your_website_copy_update":"control","merchant_success_i18n_umr_review_flag":"enabled","merchant_success_product_tagging":"enabled","merchant_success_switch_merchant_review_queues":"enabled","merchant_success_tag_activation_card_copy_update":"control","merchant_success_tag_installation_redirect":"enabled","merchant_success_unified_merchant_review":"enabled","mini_renux_homefeed_refresh":"enabled","more_ideas_email_notifications":"enabled","more_ideas_newshub_notifications":"enabled","more_ideas_push_notifications":"enabled","msft_pwa_announcement_email":"control","multi_format_ad_group":"enabled","mweb_account_switcher_v2":"enabled","mweb_advertiser_growth_add_biz_create_entrypoint":"enabled","mweb_all_profiles_follow_parity":"enabled","mweb_auth_android_lite_low_res_limit_width":"enabled_736x","mweb_auth_low_res_limit_width":"enabled_736x","mweb_auth_no_client_context":"enabled"...
How can I get the HTML so I can eventually scrape the pictures of a page depending on the keyword? And how do I handle endless pages when scraping?
You can get the right page if you request it properly. To prevent scraping, servers use some basic checks to filter out such requests, so the problem concerns only the request, not the scraping part. The first and cheapest attempt is to pass a "fake" user agent with your request.
user_agent = "Mozilla/5.0"  # or a full UA string from Firefox, Chrome, Safari, Opera, ...
response = requests.get(url, headers={"user-agent": user_agent})
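Applied to your snippet, that might look like this (a sketch; the user-agent value below is just an example of a typical browser string):
import requests
from bs4 import BeautifulSoup as bs

keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"

# Assumption: any realistic browser UA string; this Firefox-style value is only an example
user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
r = requests.get(url, headers={"user-agent": user_agent})

soup = bs(r.content, "html.parser")
print(soup.title)  # if the UA passes the server's checks, this should be the regular HTML page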
import requests
from bs4 import BeautifulSoup
url = "https://www.sahibinden.com/hyundai/"
req = requests.get(url)
context = req.content
soup = BeautifulSoup(context, "html.parser")
print(soup.prettify())
I am getting an error with the above code. If I try to parse another website it works, but there is a problem with sahibinden.com. When I run the program it waits for about a minute and then throws an error. I have to parse this website. Could you please help me by explaining what the issue is?
Your problem is that the server expects a user agent and won't serve the request without one.
Is the error you're getting possibly a timeout?
Add the following to your code:
user_agent = 'Mozilla/5.0'  # any typical browser user-agent string
headers_dict = {'User-Agent': user_agent}
req = requests.get(url, headers=headers_dict)
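Putting it together with your original code, a minimal sketch (the user-agent string below is only an example of a typical browser value):
import requests
from bs4 import BeautifulSoup

url = "https://www.sahibinden.com/hyundai/"
# Assumption: any realistic browser UA string should do; this one is just an example
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
headers_dict = {'User-Agent': user_agent}

# A timeout stops the request from hanging for a minute if the site still refuses to answer
req = requests.get(url, headers=headers_dict, timeout=30)
soup = BeautifulSoup(req.content, "html.parser")
print(soup.prettify())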
I am trying to retrieve the HTML code of a site using the code below:
import requests

url = 'http://www.somesite.com'
obj = requests.get(url, timeout=60, verify=True, allow_redirects=True)
print(obj.encoding)
print(obj.text.encode('utf-8'))
but the result I get is strangely encoded, like the text below:
\xb72\xc2\xacBD\xc3\xb70\xc2\xacAN\xc3\xb7n\xc2\xac~AA\xc3\xb7M1FX7q3K\xc2\xacAD\xc3\xb71414690200\xc2\xacAB\xc3\xb73\xc2\xacCR\xc3\xb73\xc2\xacAC\xc3\xb73\xc
Any ideas on how I can decode the text?