Trying to webscrape but my code always returns None or [] [closed] - python

I'm trying to scrape data from Glassdoor, but BeautifulSoup can't find the data at all.
So far I've tried this:
import requests
from bs4 import BeautifulSoup

html_text = requests.get('https://www.glassdoor.co.in/Job/data-analyst-jobs-SRCH_KO0,12.htm?fromAge=7').text
soup1 = BeautifulSoup(html_text, 'lxml')
soup2 = soup1.prettify()
jobs = soup1.find_all('li', class_='react-job-listing css-7x0jr eigr9kq3')
print(jobs)
I've seen solutions using Selenium, but is there any other way to get the actual data? I've tried this for the 'ul' tag, the 'li' tag, and so on, but nothing seems to work.

There is no li tag with the class react-job-listing css-7x0jr eigr9kq3 in the HTML that requests downloads from that URL. The class names you see in the browser's DevTools belong to the page after JavaScript has run, so they are not necessarily present in the raw response. Look at the HTML you actually receive (for example by printing soup1.prettify()) and pick selectors that exist there.
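If you want to stay with requests, here is a minimal sketch of how to check what the raw response actually contains; the User-Agent header and the idea of listing the li classes are illustrative assumptions, not something from the original question:
import requests
from bs4 import BeautifulSoup

url = 'https://www.glassdoor.co.in/Job/data-analyst-jobs-SRCH_KO0,12.htm?fromAge=7'
# Some sites serve different or truncated HTML to clients without a browser-like User-Agent.
headers = {'User-Agent': 'Mozilla/5.0'}

html_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(html_text, 'lxml')

# Print the class attributes that actually appear on <li> tags in the downloaded HTML,
# instead of copying class names from the browser's DevTools view of the rendered page.
for li in soup.find_all('li', class_=True):
    print(li.get('class'))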

Related

Cannot find html tag I am looking for from new york times when webscraping [closed]

My goal is to use Python to scrape nytimes.com and find today's date.
I did some research and here is my code:
from bs4 import BeautifulSoup
import requests

link = "https://www.nytimes.com/"
response = requests.get(link)
soup = BeautifulSoup(response.text, "html.parser")
time = soup.find_all("span", {"data-testid": "todays-date"})
print(time)
This is a picture of the HTML from the nytimes website:
[screenshot of the nytimes HTML]
And this is what my terminal shows after running the code: an empty list; it could not find anything.
I think the element might be rendered via JavaScript, so you don't find it when downloading the HTML via requests. For example, if you look for the masthead section:
masthead = soup.find('section', {'id':'masthead-bar-one'})
what you get is:
<section class="hasLinks css-1oajkic e1csuq9d3" id="masthead-bar-one"><div><div class="css-1jxco98 e1csuq9d0"></div><div class="css-bfvq22 e1csuq9d2"><a class="css-hnzl8o" href="https://www.nytimes.com/section/todayspaper">Today’s Paper</a></div></div><div class="css-103zufb" id="masthead-bar-one-widgets"><div class="css-i1s3vq e1csuq9d1"></div><div class="css-77hcv e1ll57lj2"></div></div><div class="css-9e9ivx"><a class="css-1k0lris" data-testid="login-link" href="https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi"></a></div></section>
There is no sign at all of the element you are looking for. I would suggest looking into the Selenium library for this: it drives a real browser, so you can also scrape data that is generated by JavaScript.
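For completeness, a minimal Selenium sketch, assuming Selenium 4 and a local Chrome install; the CSS selector just mirrors the data-testid attribute from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.nytimes.com/")

# The element only exists after the browser has executed the page's JavaScript.
for span in driver.find_elements(By.CSS_SELECTOR, 'span[data-testid="todays-date"]'):
    print(span.text)

driver.quit()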

HTML - How to scrape not visible elements using python? [closed]

I'm using Beautiful Soup to scrape a webpage.
I am trying to scrape data from https://painel-covid19.saude.ma.gov.br/vacinas, but the tags in the output come back empty. In Inspect Element I can see the data, but it is not in the page source. How can I retrieve it using Python? Can someone help me?
The issue isn't that the elements are "not visible". The issue is that the data is being filled in by JavaScript code. You won't see the data unless you execute the JavaScript on the page. You can do that with the selenium package, which drives a real browser such as Chrome to do the rendering.
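As a rough sketch of that approach; the .card selector below is purely an illustrative guess, so inspect the dashboard to find a real element to wait for:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://painel-covid19.saude.ma.gov.br/vacinas")

# Wait until the JavaScript has rendered something before reading the page.
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".card"))  # hypothetical selector
)
print(driver.page_source[:1000])  # the rendered HTML now contains the data
driver.quit()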

How to extract URL of Facebook image [closed]

Given a Facebook URL like:
https://www.facebook.com/photo.php?fbid[LONGUSERID]&set=a.313002535549859&type=3&theater
How could I extract the real photo URL using PHP or Python?
Normally the actual URL looks like this (as seen in the Chrome Network tab):
https://scontent.fbru1-1.fna.fbcdn.net/v/t31.0-1/cp0/p32x32/11942095_139657766378816_623531952343456734_o.jpg?_nc_cat=106&_nc_sid=0081f9&_nc_ohc=VpijQtyWbUQAX-fsPMj&_nc_ht=scontent.fbru1-1.fna&oh=eb4435eed183716c807b405d0d57c3a4&oe=5F674BAB
But is there a way to automate this extraction with a script? Any example would be appreciated.
The simplest example: fetch the HTML page, split the text on double quotes, and check each piece for the .jpg extension.
import requests
from html import unescape
from urllib.parse import unquote

url = "https://www.facebook.com/photo.php?fbid=445552432123146"
response = requests.get(url)
if response:
    # Split the raw HTML on double quotes and keep the pieces containing ".jpg"
    lines = response.text.split('\"')
    for line in lines:
        if ".jpg" in line:
            # Undo HTML entities and URL-encoding to get a usable link
            print(unquote(unescape(line)))
else:
    print("fail!")
With Selenium you can render the page in a real browser and then search the HTML for the image elements directly.
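A minimal sketch of that Selenium route; whether the photo is reachable without logging in is an assumption here:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.facebook.com/photo.php?fbid=445552432123146")

# Read the src attribute of every rendered <img> and keep the JPEGs.
for img in driver.find_elements(By.TAG_NAME, "img"):
    src = img.get_attribute("src")
    if src and ".jpg" in src:
        print(src)

driver.quit()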

How to search for links in a given page with Bash, or Python or any other popular scripts [closed]

Given an HTTP/HTTPS page, I would like to find certain links on that page. Does anyone know how to achieve this with Bash, Python, or any other popular scripting language?
Try this in Python. It will print all <a> tags that have an href:
import requests
from bs4 import BeautifulSoup as soup
print(soup(requests.get('Your link').content, 'html.parser').find_all('a', href=True))
You should use Beautiful Soup. It's an HTML parser library for Python. Look for <a> tags and grab the href attribute or the inner content.
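A small self-contained sketch of that idea; the example.com URL is just a placeholder:
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com').content
soup = BeautifulSoup(html, 'html.parser')

# Print each link's target and its visible text.
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text(strip=True))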

Extracting HTML code from a list of URLs [closed]

I want to extract a value from the HTML of many URLs on the same domain. For example, the value is the person's name and the domain is Facebook.
For a URL like
https://www.facebook.com/mohamed.nazem2
the page shows the name Mohamed Nazem, as in this snippet from the HTML:
‏‎Mohamed Nazem‎‏ ‏(ناظِم)‏
Likewise, for the Facebook URL
https://www.facebook.com/zuck
the name is Mark Zuckerberg.
So the value in the first URL is >Mohamed Nazem< and in the second it's Mark Zuckerberg. Hopefully you get what I mean.
To fetch the HTML page for each URL you will need to use something like the requests library. To install it, use pip install requests, and then use it in your code like so:
import requests
response = requests.get('https://facebook.com/zuck')
print(response.text)
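To then pull the name out of the HTML, a hedged sketch: on public profile pages the name often appears in the <title> tag, but Facebook's markup changes frequently and may require being logged in, so treat this as illustrative only.
import requests
from bs4 import BeautifulSoup

urls = ['https://www.facebook.com/zuck', 'https://www.facebook.com/mohamed.nazem2']

for url in urls:
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # The profile name is often embedded in the page title.
    title = soup.title.string if soup.title else None
    print(url, '->', title)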
