Python Web scrape returning [] [duplicate] - python

I've used Selenium previously and now wanted to try out bs4. I ran the following code but received None as the output.
import bs4
import requests

res_pewdiepie = requests.get('https://www.youtube.com/user/PewDiePie')
soup = bs4.BeautifulSoup(res_pewdiepie.content, "lxml")
subs = soup.find(id="sub-count")
print(subs)
After researching for a while, I found out that requests doesn't load dynamic content like the subscriber count on YouTube or Socialblade. Is there a way to get this information with bs4, or do I have to switch back to something like Selenium?
Thanks in advance!

BeautifulSoup can only parse the text you give it, in this case the page source that requests downloaded. If the information is not in that source, because the page fills it in later with JavaScript, BeautifulSoup can't do anything about it. So I believe you have to switch back to something that executes JavaScript.
Some options:
python-selenium
requests-html
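To make the point above concrete, here is a minimal sketch, not a definitive implementation. RAW_HTML is a made-up page standing in for what requests downloads, and text_of and render_with_selenium are hypothetical helper names. The Selenium part assumes a local Chrome with a matching driver, a recent Chrome that accepts the "--headless=new" flag, and that the element id you wait for really exists in the rendered page, which you should check in your browser's inspector first.

```python
from bs4 import BeautifulSoup

# Made-up page source, standing in for what requests.get(...).text returns.
# The <span> is empty in the raw HTML; only a browser running the script fills it.
RAW_HTML = """
<html><body>
  <span id="sub-count"></span>
  <script>
    document.getElementById("sub-count").textContent = "1,234 subscribers";
  </script>
</body></html>
"""


def text_of(html, element_id):
    """Return the stripped text of the element with this id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id=element_id)
    return None if node is None else node.get_text(strip=True)


def render_with_selenium(url, element_id, timeout=10):
    """Load url in headless Chrome, wait for the element, return the rendered source."""
    # Imported lazily so the static demo below runs without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until JavaScript has inserted the element, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
        return driver.page_source
    finally:
        driver.quit()


# Static parse: the element exists but is empty, because no script ever ran.
static_text = text_of(RAW_HTML, "sub-count")
# An id that is not in the source at all gives None, as in the question.
missing_text = text_of(RAW_HTML, "view-count")
print(repr(static_text), repr(missing_text))
```

Once a browser has rendered the page, the same parsing code works unchanged, for example text_of(render_with_selenium(url, "sub-count"), "sub-count"), so bs4 stays useful for the extraction step.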

I use Splash for stuff like this. You can run it in a Docker container, and you can tweak how long it waits for rendering on a per-request basis. There's also a Scrapy plugin if you're doing any serious crawling. Here's a snippet from one of my crawlers, running Splash locally using Docker. Good luck.
import json
import requests

target_url = "https://somewhere.example.com/"
splash_url = "http://localhost:8050/render.json"
# "wait" is the per-request render delay in seconds; "html": 1 asks for the rendered source
body = json.dumps({"url": target_url, "har": 0, "html": 1, "wait": 10})
headers = {"Content-Type": "application/json"}
response = requests.post(splash_url, data=body, headers=headers)
result = json.loads(response.text)
html = result["html"]
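If you want the wait to vary per request, the snippet can be wrapped roughly like this. It is only a sketch: build_render_body, extract_html and render_html are hypothetical names, the endpoint is the same local one as above, and the extra 30 seconds on the client timeout is an arbitrary margin I picked, not a Splash setting.

```python
import json

SPLASH_URL = "http://localhost:8050/render.json"  # same local endpoint as above


def build_render_body(target_url, wait_seconds=10.0):
    """Serialize the render.json arguments, with the wait chosen per request."""
    return json.dumps(
        {"url": target_url, "har": 0, "html": 1, "wait": wait_seconds}
    )


def extract_html(response_text):
    """Return the rendered HTML from a render.json reply, or raise if it is missing."""
    result = json.loads(response_text)
    if "html" not in result:
        raise ValueError(f"no 'html' key in Splash reply: {sorted(result)}")
    return result["html"]


def render_html(target_url, wait_seconds=10.0):
    """POST to Splash and return the rendered page source (needs Splash running)."""
    import requests  # imported lazily so the two helpers above work without it

    response = requests.post(
        SPLASH_URL,
        data=build_render_body(target_url, wait_seconds),
        headers={"Content-Type": "application/json"},
        timeout=wait_seconds + 30,  # arbitrary margin so the client outlasts the wait
    )
    response.raise_for_status()
    return extract_html(response.text)
```

The returned string is ordinary page source, so it can go straight into bs4, e.g. bs4.BeautifulSoup(render_html(target_url, wait_seconds=3), "lxml"), and a slow page just gets a larger wait_seconds.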

Related

Why soup can only find 9 address tag while there's 20+ on the page? [duplicate]


Web scraping, getting access to content that is dynamically generated after site is opened [duplicate]


I can't scrap this website, the elements does not show :c [duplicate]


Why can't I get the value of a table when requesting this site in python? [duplicate]


Can bs4 get the dynamic content of a webpage if requests can't?

