Cannot find html tag I am looking for from new york times when webscraping [closed] - python

My goal is to use python to web scrape nytimes.com and find today's date.
I did some research and here is my code:
from bs4 import BeautifulSoup
import requests

link = "https://www.nytimes.com/"
response = requests.get(link)
soup = BeautifulSoup(response.text, "html.parser")
# Look for the span that holds today's date
time = soup.find_all("span", {"data-testid": "todays-date"})
print(time)
This is a screenshot of the HTML from the nytimes website (image omitted), showing the element I am searching for.
And this is what my terminal showed after running the code - an empty list; it could not find anything.

I think the element might be rendered via JS, so you don't find it when downloading the html via requests.
masthead = soup.find('section', {'id':'masthead-bar-one'})
What you get is
<section class="hasLinks css-1oajkic e1csuq9d3" id="masthead-bar-one"><div><div class="css-1jxco98 e1csuq9d0"></div><div class="css-bfvq22 e1csuq9d2"><a class="css-hnzl8o" href="https://www.nytimes.com/section/todayspaper">Today’s Paper</a></div></div><div class="css-103zufb" id="masthead-bar-one-widgets"><div class="css-i1s3vq e1csuq9d1"></div><div class="css-77hcv e1ll57lj2"></div></div><div class="css-9e9ivx"><a class="css-1k0lris" data-testid="login-link" href="https://myaccount.nytimes.com/auth/login?response_type=cookie&client_id=vi"></a></div></section>
No sign at all of the element you are looking for. I would suggest you look into the selenium library for this - it mocks a browser, so you can also scrape data that is generated by JS.
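The answer does not include code, but here is a minimal Selenium sketch of that idea. It assumes Chrome and a matching chromedriver are installed, and the span[data-testid="todays-date"] selector is taken from the question's screenshot, so it may break whenever NYT changes its markup:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # give the JS-rendered content time to appear
driver.get("https://www.nytimes.com/")
# Selector taken from the question's screenshot; NYT may change it at any time
date_el = driver.find_element(By.CSS_SELECTOR, 'span[data-testid="todays-date"]')
print(date_el.text)
driver.quit()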

Related

Trying to webscrape but my code always returns None or [] [closed]

I'm trying to scrape data from Glassdoor, but BeautifulSoup can't read the data at all.
So far I've tried this:
import requests
from bs4 import BeautifulSoup

html_text = requests.get('https://www.glassdoor.co.in/Job/data-analyst-jobs-SRCH_KO0,12.htm?fromAge=7').text
soup1 = BeautifulSoup(html_text, 'lxml')
# Search for the job cards by the class names seen in the browser's dev tools
jobs = soup1.find_all('li', class_='react-job-listing css-7x0jr eigr9kq3')
print(jobs)
I've seen solutions using Selenium, but is there any other way to get the actual data? I've tried this for the 'ul' class, the 'li' class and so on, but nothing seems to work.
There is no li tag with the class react-job-listing css-7x0jr eigr9kq3 in the HTML returned for that URL.
Look through the HTML page you actually download for what you need to scrape, and search for a tag and class that are really present in it.
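One way to see what is really there is to list the li classes in the downloaded HTML. This is a minimal sketch, not part of the original answer; the User-Agent header is an assumption, since Glassdoor may block the default requests client:
import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML the server actually sends (before any JavaScript runs)
html_text = requests.get(
    'https://www.glassdoor.co.in/Job/data-analyst-jobs-SRCH_KO0,12.htm?fromAge=7',
    headers={'User-Agent': 'Mozilla/5.0'},
).text
soup = BeautifulSoup(html_text, 'lxml')
# Collect every class that appears on an li tag, so you can pick one that exists
classes = {c for li in soup.find_all('li') for c in (li.get('class') or [])}
print(sorted(classes))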

Extract full HTML of a website by using pyppeteer in python [closed]

I'm using the below code to extract full HTML:
cont = await page1.content()
The website I intend to extract from is:
https://www.mohmal.com/en
which is a website for making temporary email accounts. What I actually want to do is read the content of received emails, but with the code above I could not extract the HTML of the inner frame in which the received email content is placed. How can I do that?
Did you try using urllib?
You can use the urllib module to read HTML from websites:
from urllib.request import urlopen
f = urlopen("https://www.google.com")
print(f.read())
f.close()
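For the inner-frame part of the question specifically, pyppeteer exposes child frames through page.frames, and each frame has its own content(). A minimal sketch, assuming the email body is rendered inside an iframe on that page:
import asyncio
from pyppeteer import launch

async def dump_frames():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.mohmal.com/en')
    # page.content() returns only the main frame; iterate the frames to reach the iframe's HTML
    for frame in page.frames:
        print(frame.url)              # identify which frame holds the email content
        html = await frame.content()  # full HTML of that particular frame
        print(len(html))
    await browser.close()

asyncio.get_event_loop().run_until_complete(dump_frames())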

How to extract URL of Facebook image [closed]

Given a FB url like:
https://www.facebook.com/photo.php?fbid[LONGUSERID]&set=a.313002535549859&type=3&theater
How could I extract the real photo URL using PHP or Python?
Normally the actual URL looks like this (as seen in the Chrome Network tab):
https://scontent.fbru1-1.fna.fbcdn.net/v/t31.0-1/cp0/p32x32/11942095_139657766378816_623531952343456734_o.jpg?_nc_cat=106&_nc_sid=0081f9&_nc_ohc=VpijQtyWbUQAX-fsPMj&_nc_ht=scontent.fbru1-1.fna&oh=eb4435eed183716c807b405d0d57c3a4&oe=5F674BAB
But is there a way to automate this extraction with a script? Any example would be appreciated.
The simplest example: fetch the HTML page, split the text on double quotes, and check each piece for the .jpg extension.
import requests
from html import unescape
from urllib.parse import unquote
url = "https://www.facebook.com/photo.php?fbid=445552432123146"
response = requests.get(url)
if response:
    # Split the raw HTML on double quotes and keep the pieces that contain a .jpg URL
    lines = response.text.split('"')
    for line in lines:
        if ".jpg" in line:
            # Decode HTML entities and URL escapes so the printed link is usable
            print(unquote(unescape(line)))
else:
    print("fail!")
Alternatively, with the help of Selenium you can search for the right elements in the rendered HTML directly.

How to search for links in a given page with Bash, or Python or any other popular scripts [closed]

Given an http/https page, I would like to search for some links on that page. Does anyone know how to achieve this with Bash, Python, or any other popular scripting language?
Try this in Python. It will print all tags with a link:
import requests
from bs4 import BeautifulSoup as soup
# Replace 'Your link' with the page URL; href=True keeps only anchors that actually have an href
print(soup(requests.get('Your link').content, 'html.parser').find_all('a', href=True))
You should use Beautiful Soup. It's an HTML parser library for Python. Look for <a> tags and grab their href attributes and inner content.
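A slightly fuller sketch of that approach (the URL below is just a placeholder):
import requests
from bs4 import BeautifulSoup

page = requests.get('https://example.com')         # placeholder URL
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.find_all('a', href=True):            # only anchors that carry an href
    print(a['href'], '-', a.get_text(strip=True))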

extracting html code from urls list [closed]

I want to extract a value from the HTML of many URLs on the same domain. As an example, the value is the person's name and the domain is Facebook, with URLs like:
https://www.facebook.com/mohamed.nazem2
If you open that URL you will see the name Mohamed Nazem, shown in the code as:
Mohamed Nazem (ناظِم)
Likewise, for this Facebook URL:
https://www.facebook.com/zuck
the name is Mark Zuckerberg.
So the value at the first URL was >Mohamed Nazem< and at the second URL it is Mark Zuckerberg. Hopefully you get what I am thinking of.
To fetch the HTML page for each URL you will need something like the requests library. To install it, run pip install requests, then use it in your code like so:
import requests
response = requests.get('https://facebook.com/zuck')
print(response.text)  # the raw HTML of the page (a requests Response has no .data attribute)
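Once you have the HTML, here is a hedged sketch for pulling the name out of it. It assumes the profile is publicly visible and that the name appears in the <title> tag; in practice Facebook often redirects anonymous requests to a login page, in which case you will not get the profile HTML at all:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://facebook.com/zuck')
soup = BeautifulSoup(response.text, 'html.parser')
if soup.title:
    # e.g. "Mark Zuckerberg | Facebook" when the page is accessible
    print(soup.title.get_text())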
