I have been trying to scrape a web page using bs4, but the HTML doesn't seem to match what I see when using 'View page source' in Chrome. As a novice in this area, I would much appreciate any guidance. Details below:
An example target web page and the code I used are shown below.
import requests
from bs4 import BeautifulSoup
my_url = 'https://finance.yahoo.com/m/63c37511-b114-3718-a601-7e898a22439e/a-big-tech-encore-and-twitter.html'
response = requests.get(my_url)
doc = BeautifulSoup(response.text, "html.parser")
with open("output1.html", "w") as file:
    file.write(str(doc))
When viewing the page source in my browser (Chrome), the snippet below is included in the html:
"siteAttribute":"ticker=\"GOOGL;AAPL;PYPL;TWTR\"
However, when looking at the file output from the code above, the siteAttribute has changed and no longer has the same information. Instead, it shows:
"siteAttribute":"wiki_topics=\"Big_Tech;Apple_Inc.;Facebook;
After researching online, I can't figure out what is causing the discrepancy. Thanks in advance.
If you right-click the page, choose Inspect to open the Elements tab of Chrome DevTools, press Ctrl+F, and paste siteAttribute":"ticker=\"GOOGL;AAPL;PYPL;TWTR\ then you will see that the desired result is inside a script tag.
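Because the value sits inside a script tag rather than in ordinary markup, one way to get at it is to search the raw response text directly. The following is only a minimal sketch: the helper name find_site_attributes and the sample string are made up for illustration, and it assumes the value is stored as a JSON-style string with escaped quotes, as in the snippet from the question.

```python
import re

def find_site_attributes(html):
    """Return every "siteAttribute" string value found in the raw HTML text."""
    # Match "siteAttribute":"..." and allow escaped characters (such as \")
    # inside the value, so the match does not stop at the first escaped quote.
    pattern = re.compile(r'"siteAttribute":"((?:[^"\\]|\\.)*)"')
    return pattern.findall(html)

# Hypothetical sample shaped like the snippet quoted in the question.
sample = r'<script>{"siteAttribute":"ticker=\"GOOGL;AAPL;PYPL;TWTR\""}</script>'
print(find_site_attributes(sample))
```

With a live page you would pass response.text to the helper. If it returns a different value than the browser shows, the server is probably sending different content to the script than to Chrome.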
Related
I tried to scrape my YouTube subscriptions list into a csv file. But I faced a problem in the middle of the code. Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.youtube.com/feed/channels'
source = requests.get(url)
soup = BeautifulSoup(source.content, 'lxml')
text = soup.find_all('yt-formatted-string', {'id': 'text'})
for i in range(len(text)):
    print(text[i].yt-formatted-string.text)
I am wondering why vscode didn't recognize 'yt-formatted-string' while it's found on the HTML page. Also when I tried another div from HTML, this code didn't give any output.
Your code is not working because you are not logged in to your account when sending requests to "https://www.youtube.com/feed/channels". You have to log in first to get all subscriptions.
You can solve this problem by using selenium. First, login to your account using selenium, and then you can use either selenium or beautifulsoup to extract subscriptions from that page.
I want to copy the text from this website (https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2) to use it later in a Python script.
How can I do this? (It doesn't really work with requests...)
If you search for Python web scraping, you will find a lot of information!
Basically, you start by executing
response = requests.get(url)
Which provides you with the html content of the webpage. Now you can use beautifulsoup to navigate through the content to get what you need.
First we need to create a soup:
soup = BeautifulSoup(response.text, "lxml")
in which we can now find the content. For example, to find all the links in the web page, you can use:
soup.find_all('a')
Here is a complete example that prints all the URLs of a web page:
import requests
from bs4 import BeautifulSoup
url = "https://google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for link in soup.find_all('a'):
    # Print the URL itself rather than the whole <a> tag
    print(link.get('href'))
Here is the documentation of beautifulsoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
As the information that Johann was looking for isn't static but dynamic, I'm writing a second answer to explain how I got the info.
When visiting the webpage https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2
Open the development tools of your browser (in my case it is firefox and I'm opening by pressing F12).
When the development tools are open, click on the "network" tab, which will be empty at this point.
Reload the page by clicking the reload arrow or by pressing F5.
Now we can see requests being loaded in the "network" tab.
As we are looking for data that is loaded after the page content, we look for "xml" or "json" responses in the "type" column.
Right-click a response that has one of those types and click "open in new tab".
If multiple responses match, test each of them until you find the information you are looking for.
In this case we found https://www.reclamgymnasium.de/mobil/mobdaten/PlanKl20210618.xml?_=1623933794858
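Once the data URL is known, it can be fetched and parsed directly, without touching the HTML page. The sketch below uses only the standard library (urllib.request instead of requests). The helper names fetch_bytes and leaf_values and the sample document are assumptions for illustration, because the real structure of the XML file is not shown here.

```python
import urllib.request
import xml.etree.ElementTree as ET

def fetch_bytes(url, timeout=10):
    """Download the resource found in the network tab and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def leaf_values(xml_data):
    """Return (tag, text) pairs for every element that has no child elements."""
    root = ET.fromstring(xml_data)
    return [(el.tag, (el.text or "").strip()) for el in root.iter() if len(el) == 0]

# Hypothetical sample; the real file will use different tag names.
sample = "<plan><day>Friday</day><lesson><subject>Math</subject><room>101</room></lesson></plan>"
print(leaf_values(sample))
```

For the real page you would call leaf_values(fetch_bytes(url)) with the URL found above, then narrow the result down to the tags you care about.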
I am new to web scraping/coding, and I am trying to use Python requests/BeautifulSoup to parse through the html code in order to get some physical and chemical properties.
For some reason, although I have used the following script for other websites successfully, BeautifulSoup has only printed a few lines from the header and footer, and then pages of HTML code that doesn't really make sense. This is the code I have been using:
import requests
from bs4 import BeautifulSoup
url='https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties'
response = requests.get(url).text
soup=BeautifulSoup(response,'lxml')
print(soup.prettify())
When I try to find the table or even a row, it gives no output. Is there something I haven't accounted for? Any help would be greatly appreciated!
It is present in one of the attributes. You can extract it as follows (there is a lot more info there, but I subset to the physical properties):
import requests
from bs4 import BeautifulSoup as bs
import json
url = "https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties"
r = requests.get(url)
soup = bs(r.content, 'lxml')
soup.select_one('[data-result]')['data-result']
data = json.loads(soup.select_one('[data-result]')['data-result'])
properties = data['physprop']
print(properties)
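If lxml or bs4 is not available, the same idea (read a JSON string out of an HTML attribute) can be sketched with the standard library. The class name DataAttrParser, the helper load_json_attribute, and the sample markup with its boiling_point key are hypothetical; only the data-result attribute and the physprop key come from the answer above.

```python
import json
from html.parser import HTMLParser

class DataAttrParser(HTMLParser):
    """Collect the values of one named attribute from every start tag."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == self.attr_name and value is not None:
                # HTMLParser has already unescaped entities such as &quot;
                self.values.append(value)

def load_json_attribute(html, attr_name="data-result"):
    """Parse the first matching attribute value as JSON, or return None."""
    parser = DataAttrParser(attr_name)
    parser.feed(html)
    parser.close()
    return json.loads(parser.values[0]) if parser.values else None

# Hypothetical sample; the real attribute holds far more data.
sample = '<div data-result="{&quot;physprop&quot;: {&quot;boiling_point&quot;: -33.3}}"></div>'
print(load_json_attribute(sample)["physprop"])
```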
It's pretty common that if a page is populated by JavaScript after it loads, requests and BeautifulSoup will not see the content, because they only receive the initial HTML and never run the scripts. The best thing to do is likely to switch to the selenium module, which allows your program to access the page dynamically and interact with elements. After loading (and maybe clicking on a couple of elements) you can feed the HTML to BeautifulSoup and process it however you wish. The basic framework I recommend you start with would look like:
from selenium import webdriver
browser = webdriver.Chrome() # You'll need to download drivers from link above
browser.implicitly_wait(10) # probably unnecessary, just makes sure all pages you visit fully load
browser.get('https://stips.co.il/explore')
while True:
    input('Press Enter to print HTML')
    HTML = browser.page_source
    print(HTML)
Just click around in the browser and when you want to see if the HTML is correct, click back to your prompt and press ENTER. This is how you would locate elements automatically, so you don't have to manually interact with the page every time
I am trying to get video links from 'https://www.youtube.com/trendsdashboard#loc0=ind'. When I inspect elements, the browser shows me the HTML source for each video. In the source code retrieved using
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
the HTML source for the videos is not displayed. Is there any other way to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This simple code appears when we inspect elements in the browser, but not in the source code retrieved by urllib
To view the source code you need to use the read method
If you just use urlopen, it gives you something like this.
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source use read
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()
Whenever you compare the source code fetched by Python with what the web browser shows, don't do it through Inspect Element. Right-click on the web page and click View Source; that shows the actual source the server sent. Inspect Element displays the aggregated result of all the network requests the page made, together with the changes made by the JavaScript code that was executed.
Keep the Developer Console open before opening the web page, stay on the Network tab, and make sure that 'Preserve Log' is enabled in Chrome or 'Persist' in Firebug for Firefox. Then you will see all the network requests that were made.
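A quick way to apply this advice from a script is to check whether a string that is visible in Inspect Element also appears in the raw HTML the server returns. This is only a rough sketch: the helper names and the sample page are made up for illustration, and it uses urllib.request from Python 3 rather than urllib2.

```python
import urllib.request

def fetch_raw_html(url, timeout=10):
    """Return the HTML exactly as the server sends it (what View Source shows)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def missing_from_raw_html(raw_html, marker):
    """Return True if `marker` never appears in the server's HTML.

    A marker that is visible in Inspect Element but missing here was most
    likely added later by JavaScript or by a separate network request.
    """
    return marker not in raw_html

# Hypothetical page whose content is filled in by a script after loading.
raw_html = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"
print(missing_from_raw_html(raw_html, "/watch?v="))  # True: not in the raw source
print(missing_from_raw_html(raw_html, "id='app'"))   # False: present in the raw source
```

If the marker is missing, look for the request that delivers it in the Network tab, or render the page with a browser driver such as selenium.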
works for me...
import urllib2
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib2.urlopen(url).read()
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks as per your edit. I use the library BeautifulSoup to parse the html:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]
We also need to decode the data as UTF-8. Note that decode returns a new string, so the result has to be assigned.
Here is the code:
response = response.decode('utf-8')
print(response)
http://www.snapdeal.com/
I was trying to scrape all links from this site, and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.
Under the "See All categories" tab you will find all major product categories. If you hover the mouse over any category, it expands to show its subcategories. I want those links from each major category.
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l
But this gave me a different result than I expected (I turned off JavaScript and looked at the page source, and the output matched that source).
I just want to find all sub-links from each major category. Any suggestions will be appreciated.
This is happening because you are letting BeautifulSoup choose its own best parser, and you might not have lxml installed.
The best option is to use html.parser to parse the page.
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.
I think you should try another library such as selenium. It provides a web driver for you, which is the advantage of this library. I couldn't handle JavaScript with bs4 myself.
Categories Menu is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest).
To examine the components of a website, get familiar with the Firebug add-on in Firefox or the built-in Developer Tools in Chrome. You can check the XHR requests a website makes under the Network tab in either tool.
Use a web scraping tool such as scrapy or mechanize.
In mechanize, to get all the links on the snapdeal homepage:
from mechanize import Browser
br = Browser()
br.open("http://www.snapdeal.com")
# Iterate over the links of the page that was just opened
for link in br.links():
    print link.text
    print link.url
I have been looking into a way to scrape links from webpages that are only rendered in an actual browser but wanted the results to be run using a headless browser.
I was able to achieve this using phantomJS, selenium and beautiful soup
#!/usr/bin/python
import bs4
import requests
from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content, 'html.parser')
links = [a.attrs.get('href') for a in soup.find_all('a')]
for paths in links:
    print paths
driver.close()
The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2
url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl
# to open up HTTPS URLs with certificate verification enabled
gcontext = ssl.create_default_context()
# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print(l)
Other Languages
For other languages, please see this answer.