import requests
from bs4 import BeautifulSoup
url = "https://www.sahibinden.com/hyundai/"
req = requests.get(url)
context = req.content
soup = BeautifulSoup(context, "html.parser")
print(soup.prettify())
I am getting an error with the above code. If I try to parse another website it works, but there is a problem with sahibinden.com. When I run the program it waits for about a minute and then throws an error. I have to parse this website. Could you please help me by explaining what the issue is?
Your problem is that the server expects a User-Agent header and won't serve the request without one.
Is it possible that the error you're getting is a timeout?
Add the following to your code:
user_agent = 'Mozilla/5.0'  # any browser-like string works here
headers_dict = {'User-Agent': user_agent}
req = requests.get(url, headers=headers_dict)
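If the request really is timing out rather than failing immediately, you can make that explicit by setting a timeout and catching the exception. A minimal sketch (the 10-second value is an arbitrary choice):

import requests

url = "https://www.sahibinden.com/hyundai/"
headers_dict = {"User-Agent": "Mozilla/5.0"}
try:
    # fail fast instead of hanging for a minute on a filtered request
    req = requests.get(url, headers=headers_dict, timeout=10)
    req.raise_for_status()
except requests.exceptions.Timeout:
    print("No response within 10 seconds; the server is probably filtering the request.")
except requests.exceptions.HTTPError as err:
    print("Request rejected:", err)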
Related
I have a simple script where I want to scrape a menu from a url:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I can see that the menu is contained in the section <div class="menu-area" id="section_1026228">.
So my script is fairly simple as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved copy of the page and it works. But when I run it against the live URL using the requests library, it does not work. It cannot find the div. It throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?
I just logged out in my browser and it showed me a different page. However, my script has no login part at all; I'm not even sure how that would work.
[It doesn't work with all sites, but it seems to be enough for this site so far.] You can log in with requests.Session.
# import requests
sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url) ## should print 200 OK...
response = sess.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the menu data (shown as an image in the original answer).
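For context, here is a minimal sketch of what such a scrape_untappd_menu helper might look like. The name and cookies parameter come from the earlier answer; the body below is my assumption, not the original implementation:

import requests
from bs4 import BeautifulSoup

def scrape_untappd_menu(url, cookies=None):
    # fetch the venue page, reusing logged-in cookies if provided
    response = requests.get(url, headers={'user-agent': 'Mozilla/5.0'}, cookies=cookies)
    soup = BeautifulSoup(response.text, 'html.parser')
    menu = soup.find('div', {'class': 'menu-area'})
    # return None instead of raising if the menu section is missing
    return menu.get_text(strip=True) if menu else None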
Note: they have a captcha on login, so I was worried it would be too hard to automate. If it becomes an issue, you can [probably] still log in in your browser before going to the page, then paste the request from your network log into curlconverter to get the cookies as a dictionary. Of course, the process is then no longer fully automated, since you'll have to repeat the manual login every time the cookies expire (which could be as fast as a few hours). If you wanted to automate the login at that point, you might have to use browser automation such as Selenium.
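If you go the manual route, the output of curlconverter is just a dictionary you pass to requests. A sketch (the cookie name and value are placeholders, not the site's real ones):

import requests

# placeholder cookie pasted from curlconverter; use your real browser cookies
cookies = {'untappd_session': 'PASTE_VALUE_FROM_BROWSER'}
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers={'user-agent': 'Mozilla/5.0'}, cookies=cookies)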
I have been trying to scrape some data off the Pnet job site, but the code does not print the scraped results. I run the file and nothing happens; the code does not finish executing. I have waited over an hour and it still has not completed. I don't know what the solution is.
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.pnet.co.za/5/job-search-detailed.html?what=Data%20analysis&where=Durban&radius=30&searchOrigin=Resultlist_top-search&whatType=skillAutosuggest').text
print(html_text)
soup = BeautifulSoup(html_text, 'lxml')
You should add headers. An example is given below.
It's also better to add a timeout value, though this should run without one.
headers = {"User-Agent": "Chrome/100.0.4896.127"}
response = requests.get('https://www.pnet.co.za/5/job-search-detailed.html?what=Data%20analysis&where=Durban&radius=30&searchOrigin=Resultlist_top-search&whatType=skillAutosuggest', headers=headers, timeout=10)
print(response.text)
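From there you can parse as usual. A sketch of the next step (the selector below is a placeholder; inspect the live page in dev tools for the real job-card markup):

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')
# placeholder selector -- replace with the actual job-listing element
for card in soup.select('article'):
    print(card.get_text(strip=True)[:100])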
Code:
from bs4 import BeautifulSoup as bs
import requests
keyword = "dog"
url = f"https://www.pinterest.de/search/pins/?q={keyword}&rs=typed&term_meta[]={keyword}%7Ctyped"
r = requests.get(url)
soup = bs(r.content, "html.parser")  # pass a parser explicitly to avoid the GuessedAtParserWarning
print(r)
Output: <Response [200]>
If I change print(r) to print(soup), I get a lot of text that doesn't look like HTML.
Something like: ..."marketing_brand_charlotte_phoenix":"control","marketing_brand_houston_chicago":"enabled","marketing_brand_houston_miami_business":"enabled","marketing_brand_northnhinenestphalia_bavaria":"enabled","marketing_brand_northrhinewestphalia_bavaria_business":"enabled","marketing_brand_seattle_dallas_business":"control","marketing_brand_seattle_orlando":"control","merchant_discovery_shopify_boosting":"control","merchant_storefront_mojito_migration":"enabled","merchant_success_activation_banner_collapse_over_dismiss":"enabled","merchant_success_auto_enroll_approved_merchant":"enabled","merchant_success_catalog_activation_card_copy_update":"control","merchant_success_claim_your_website_copy_update":"control","merchant_success_i18n_umr_review_flag":"enabled","merchant_success_product_tagging":"enabled","merchant_success_switch_merchant_review_queues":"enabled","merchant_success_tag_activation_card_copy_update":"control","merchant_success_tag_installation_redirect":"enabled","merchant_success_unified_merchant_review":"enabled","mini_renux_homefeed_refresh":"enabled","more_ideas_email_notifications":"enabled","more_ideas_newshub_notifications":"enabled","more_ideas_push_notifications":"enabled","msft_pwa_announcement_email":"control","multi_format_ad_group":"enabled","mweb_account_switcher_v2":"enabled","mweb_advertiser_growth_add_biz_create_entrypoint":"enabled","mweb_all_profiles_follow_parity":"enabled","mweb_auth_android_lite_low_res_limit_width":"enabled_736x","mweb_auth_low_res_limit_width":"enabled_736x","mweb_auth_no_client_context":"enabled"...
How can I get the HTML so I can eventually scrape the pictures on a page for a given keyword? And how do I handle endless scrolling when scraping?
You can get the right page if you request it properly. To prevent scraping, servers run some basic checks to filter out bots, so the problem concerns only the request, not the parsing. The first and cheapest thing to try is passing a "fake" user agent with your request.
user_agent = "Mozilla/5.0"  # any user agent string from Firefox, Chrome, Safari, Opera, ...
response = requests.get(url, headers={"user-agent": user_agent})
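Once the server returns real HTML, here is a generic sketch for collecting image URLs, continuing from the snippet above (no Pinterest-specific selectors; verify against the actual page):

from urllib.parse import urljoin

soup = bs(response.content, "html.parser")
# collect whatever <img> tags are present in the static HTML
for img in soup.find_all("img"):
    src = img.get("src")
    if src:
        print(urljoin(url, src))  # resolve relative URLs against the page URL

Note that heavily scripted sites like Pinterest render most content with JavaScript, so the static HTML may contain few images; the endless scrolling is driven by background JSON requests, which you can find in the browser's network tab and replay if needed.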
My goal is to scrape the macys.com website, but I cannot get access. The following code is my initial attempt.
Attempt 1
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.macys.com').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
This resulted in the following error.
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access the requested URL on this server.
<p>Reference: 18.c503d417.1587673952.4f27a98</p>
</body>
</html>
After finding similar issues on Stack Overflow, I see the most common solution is to add a header. Here is the main code from that attempt.
Attempt 2
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml')
print(soup)
Here is the last error message I have received. After researching the site, I am still unsure how to proceed.
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 586833: character maps to <undefined>
I am very intro level, so I appreciate any insight. I am also just genuinely curious why I don't have permission for Macy's site when testing other sites works fine.
I tried your Attempt 2 code, and it works fine for me.
Try setting BeautifulSoup's from_encoding argument to utf-8, like so:
url = 'https://www.macys.com'
headers = {'User-agent': 'Mozilla/5.0'}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'lxml', from_encoding='utf-8')
print(soup)
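If the error persists, note that a 'charmap' UnicodeEncodeError is raised by print() when the console's encoding cannot represent some character, not by the parsing itself. Writing the markup to a UTF-8 file sidesteps it (a sketch, assuming you just want to save the page):

with open('macys.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())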
I am also just genuinely curious why I don't have permission for Macy's site when testing other sites works fine.
This is something the administrators for Macy's have done to prevent bots from accessing their website. It's an extremely trivial form of protection, though, since you only need to change the user-agent header to something typical.
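For instance, a slightly more browser-like header set (the values are examples; any current browser's strings will do):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
res = requests.get('https://www.macys.com', headers=headers)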
Here is the URL:
"https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994"
Login details:
username: life#tech69.com
password: shiva#123
When opening the page with the above credentials, we can see info like:
Contact details
0770228XXXX
However, adding ?srn=true at the end of the URL gives the following info:
(https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true)
Contact details
07702287887
The code I've used is below:
import requests
from bs4 import BeautifulSoup
s = requests.session()
login_data = dict(email='life#tech69.com', password='shiva#123')
s.post('https://my.gumtree.com/login', data=login_data)
r = s.get('https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true')
soup = BeautifulSoup(r.content, 'lxml')
y = soup.find('strong' , 'txt-large txt-emphasis form-row-label').text
print(y)
However, the above Python code still gives the partial info:
0770228XXXX
How can I fetch the full info using Python?
That site is protected by reCAPTCHA, a technology that is specifically designed to prevent automated logins.
So the line s.post('https://my.gumtree.com/login', data=login_data)
results in a reCAPTCHA challenge page rather than a successful login (the original answer showed this as a screenshot).
So when you try to go to the other URL you are not actually logged in, and it will not reveal the number...
There may be ways to circumvent this, but I'm not sure of any offhand...
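One way to confirm this is to check the response from the login POST before continuing. A sketch (the markers below are assumptions; inspect the real response to pick reliable ones):

resp = s.post('https://my.gumtree.com/login', data=login_data)
# placeholder markers -- a captcha widget or a password field in the response
# usually means the login did not go through
blocked = 'recaptcha' in resp.text.lower() or 'type="password"' in resp.text.lower()
print('login blocked' if blocked else 'login may have succeeded')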