source code of web page not available using urllib.urlopen() - python

I am trying to get video links from 'https://www.youtube.com/trendsdashboard#loc0=ind'. When I inspect elements, the browser shows me the HTML source for each video. But the source retrieved with
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
does not contain the HTML for the videos. Is there any other way to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This simple markup appears when we inspect elements in the browser, but not in the source code retrieved by urllib.

To view the source code you need to use the read method.
If you just call urlopen, it gives you something like this:
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source, use read:
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()

Whenever you compare source code between Python and a web browser, don't do it through Inspect Element; right-click on the webpage and click View Source instead, and you will see the actual source the server returned. Inspect Element displays the aggregated DOM built from however many network requests the page has made, plus whatever JavaScript has executed.
Keep the developer console open before opening the webpage, stay on the Network tab, and make sure 'Preserve Log' is enabled in Chrome (or 'Persist' in Firebug for Firefox); then you will see all the network requests made.
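As a quick sanity check from Python (a minimal sketch reusing the urllib2 call from the question), you can test whether the video markup appears in the raw source at all:
import urllib2

# Fetch the raw HTML exactly as the server sends it (no JavaScript executed).
html = urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()

# If the anchors are injected by JavaScript, this markup will be absent from
# the raw source even though Inspect Element shows it.
print('/watch?v=' in html)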

works for me...
import urllib2
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib2.urlopen(url).read()
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks as per your edit. I use the library BeautifulSoup to parse the html:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]
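If you just want the URL strings rather than the whole tags, a small follow-up sketch using the links list above:
# Safe to index directly because links was filtered with has_attr('href').
hrefs = [tag['href'] for tag in links]  # e.g. '/watch?v=dCdvyFkctOo'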

We may also need to decode the data to UTF-8. response.content is bytes, so decode it before use:
html = response.content.decode('utf-8')
print(html)

Related

How to copy the text from a certain website with Python?

I want to copy the text from this website (https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2) to use it later in a Python script.
How can I do this? (It doesn't really work with requests...)
If you google python web scraping, you will find a lot of information!
Basically you start by executing
response = requests.get(url)
which provides you with the HTML content of the webpage. Now you can use BeautifulSoup to navigate through that content to get what you need.
First we need to create a soup:
soup = BeautifulSoup(response.text, "lxml")
in which we can now find the content. If we want, for example, to find all the URLs in the webpage, we can use:
soup.find_all('a')
Here is a complete example for printing all the URLs of a webpage:
import requests
from bs4 import BeautifulSoup
url = "https://google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for link in soup.find_all('a'):
    print(link)
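If you only want the URL rather than the whole tag, link.get('href') returns just the attribute (and None when a tag has no href):
for link in soup.find_all('a'):
    print(link.get('href'))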
Here is the documentation for BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
As the information Johann was looking for isn't static but loaded dynamically, I'm adding a second answer to explain how I found it.
When visiting the webpage https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2
open the development tools of your browser (in my case Firefox, by pressing F12).
When the development tools are open, click on the "Network" tab, which will be empty at this point.
Reload the page by clicking the reload arrow or by pressing F5.
Now we can see requests being loaded in the "network" tab.
As we are looking for data that loads after the page content, look for "xml" or "json" responses in the "Type" column.
Right-click the response with a matching type and click "Open page in new tab".
If multiple responses match, test each one until you find the information you are looking for.
In this case we found https://www.reclamgymnasium.de/mobil/mobdaten/PlanKl20210618.xml?_=1623933794858
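From there, a minimal sketch for fetching and parsing that XML with requests and the standard library (assuming the file is plain XML and needs no authentication; the date-stamped filename changes as the plan is updated, so the URL below will go stale):
import requests
import xml.etree.ElementTree as ET

# URL found in the Network tab above; adjust the filename to the current one.
url = "https://www.reclamgymnasium.de/mobil/mobdaten/PlanKl20210618.xml"
response = requests.get(url)
root = ET.fromstring(response.content)

# Walk the whole tree to see which elements and text the file contains.
for element in root.iter():
    print(element.tag, element.text)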

How to scrape all the home page text content of a website?

So I am new to web scraping; I want to scrape all the text content of only the home page.
This is my code, but it's not working correctly.
from bs4 import BeautifulSoup
import requests
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
When I print full_text it gives me a lot of HTML content, but not all of it: when I Ctrl+F for "traiteurcheminfaisant#hotmail.com", the email address that is on the home page (in the footer) is not found in full_text.
Thank you for helping!
A quick glance at the website that you're attempting to scrape from makes me suspect that not all content is loaded when sending a simple get request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with Javascript.
If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load and then parse the fully loaded source code. For this, the most common tool would be Selenium. It can be a bit tricky to set up the first time since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Point Selenium at the geckodriver executable you downloaded.
driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)  # give the JavaScript a moment to finish loading content

# Parse the fully rendered source instead of the bare server response.
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()
print(full_text)
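As a side note, instead of a fixed time.sleep you can use Selenium's explicit waits, which continue as soon as the element you care about is present (a sketch; the 'footer' selector is an assumption you would adjust for the real page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a footer element to appear (selector is hypothetical).
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'footer'))
)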
I haven't used BeautifulSoup before, but try using urlopen instead. This will store the webpage as a string, which you can use to find the email.
from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding="UTF8", errors='ignore')
    print(html.find("traiteurcheminfaisant#hotmail.com"))
except:
    print("Cannot open webpage")

How to fix Python requests/BeautifulSoup response from database

I am new to web scraping/coding, and I am trying to use Python requests/BeautifulSoup to parse through the html code in order to get some physical and chemical properties.
For some reason, although I have used the following script for other websites successfully, BeautifulSoup has only printed a few lines from the header and footer, and then pages of HTML code that doesn't really make sense. This is the code I have been using:
import requests
from bs4 import BeautifulSoup
url='https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties'
response = requests.get(url).text
soup = BeautifulSoup(response, 'lxml')
print(soup.prettify())
When I try to find the table or even a row, it gives no output. Is there something I haven't accounted for? Any help would be greatly appreciated!
It is present in one of the attributes. You can extract it as follows (there is a lot more info in there, but I subset to the physical properties):
import requests
from bs4 import BeautifulSoup as bs
import json
url = "https://comptox.epa.gov/dashboard/dsstoxdb/results?search=ammonia#properties"
r = requests.get(url)
soup = bs(r.content, 'lxml')
data = json.loads(soup.select_one('[data-result]')['data-result'])
properties = data['physprop']
print(properties)
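Since there is a lot more in that attribute than just the physical properties, a quick way to see what else is available (continuing from the data dict above):
# List the top-level sections embedded in the data-result attribute.
print(list(data.keys()))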
It's pretty common that if a page is populated by JavaScript after it loads, requests and BeautifulSoup will not see that content. The best thing to do is likely to switch to the selenium module, which allows your program to dynamically access the page and interact with elements. After loading (and maybe clicking on a couple of elements) you can feed the HTML to BeautifulSoup and process it however you wish. The basic framework I recommend you start with looks like this:
from selenium import webdriver
browser = webdriver.Chrome() # You'll need to download drivers from link above
browser.implicitly_wait(10) # probably unnecessary, just makes sure all pages you visit fully load
browser.get('https://stips.co.il/explore')
while True:
    input('Press Enter to print HTML')
    HTML = browser.page_source
    print(HTML)
Just click around in the browser, and when you want to check whether the HTML is correct, click back to your prompt and press Enter. Once you know which elements hold your data, you can locate them automatically so you don't have to interact with the page manually every time; see the sketch below.
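Once the printed HTML contains what you need, you can hand the page source to BeautifulSoup instead of reading it by eye (a minimal sketch continuing from the loop above):
from bs4 import BeautifulSoup

# Parse the current page state and pull out whatever you need from it.
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup.title)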

How to use BeautifulSoup to scrape a webpage url

I am getting started with web scraping, and I would like to get the URLs from the page provided below.
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/homes/for_sale/fore_lt/2-_beds/any_days/globalrelevanceex_sort/57.610107,-65.170899,15.707662,-128.452149_rect/3_zm/"
response = requests.get(page)
soup = Soup(response.text)
Now I have all the info of the page in the soup content, and I would like to get the URLs of all the homes shown in the image.
When I inspect any of the home listings, Chrome opens this DOM element in the image:
How would I get the link inside the <a href=""> tag using the soup? I think the parent is <div id="lis-results">, but I need a way to navigate to the element. Actually, I need all of the URLs (391,479) in a text file.
Zillow has an API and also a Python wrapper for the convenience of this kind of data job, and I'm looking at that code now. All I need are the URLs for the FOR SALE -> Foreclosures and POTENTIAL LISTING -> Foreclosed and Pre-foreclosed listings.
The issue is that the request you send doesn't return the URLs. In fact, if I look at the response (using e.g. Jupyter), the listings aren't in it.
I would suggest a different strategy: these kinds of websites often communicate via JSON files.
From the Network tab of the Web Developer tools in Firefox you can find the URL to request the JSON file:
Now, with this file you can get all the information needed.
import json
import requests
from bs4 import BeautifulSoup as Soup

page = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=110001&lt=001000&ht=111111&pr=,&mp=,&bd=2%2C&ba=0%2C&sf=,&lot=,&yr=,&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=any&ds=all&pmf=1&pf=1&zoom=3&rect=-134340820,16594081,-56469727,54952386&p=1&sort=globalrelevanceex&search=maplist&disp=1&listright=true&isMapSearch=true&zoom=3"
response = requests.get(page)  # request the json file
json_response = json.loads(response.text)  # parse the json file
soup = Soup(json_response['list']['listHTML'], 'html.parser')
and the soup has what you are looking for. If you explore the json, you will find a lot of useful information.
The list of all the URLs can be found with
links = [i.attrs['href'] for i in soup.findAll("a", {"class": "hdp-link"})]
All the URLs appear twice. If you want them to be unique, you can fix the list, or otherwise look for "hdp-link routable" in the class above.
But I always prefer more than less!
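That said, if you do want the list deduplicated while keeping the original order, one small sketch building on the links list above:
# dict.fromkeys keeps insertion order (Python 3.7+) while dropping duplicates.
unique_links = list(dict.fromkeys(links))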

Python equivalent of full webpage download

I'm trying to create a basic scraper that will scrape Username and Song Title from a search on SoundCloud. By inspecting the element I needed (using Chrome), I found that I need the string associated with every 'span' tag with title="soundTitle__usernameText". Using BeautifulSoup, urllib2, and lxml, I have the following code for the search 'robert delong':
from lxml import html
from bs4 import BeautifulSoup
from urllib2 import urlopen
import requests

def search_results(url):
    html = urlopen(url).read()
    # html = requests.get(url)  # I've tried this also
    soup = BeautifulSoup(html, "lxml")
    usernames = [span.string for span in soup.find_all("span", "soundTitle__usernameText")]
    return usernames

print search_results('http://soundcloud.com/search?q=robert%20delong')
This returns an empty list. However, when I save the complete webpage in Chrome (File > Save > Format: Webpage, Complete) and use that saved HTML file instead of the one fetched with urlopen, the code prints
[u'Two Door Cinema Club', u'whatever-28', u'AWOLNATION', u'Two Door Cinema Club', u'Sean Glass', u'Capital Cities', u'Robert DeLong', u'RAC', u'JR JR']
which is the desired outcome. It appears that urlopen retrieves only a truncated version of the HTML, which is why the search returns an empty list.
Any thoughts on how I can access the same HTML I get by manually saving the webpage, but from Python/the terminal? Thank you.
You guessed right: the downloaded HTML does not contain all the data. JavaScript is used to request information in JSON format, which is then inserted into the document.
By looking at the request Chrome made (Ctrl+Shift+I, "Network" tab), I see that it requested https://api-v2.soundcloud.com/search?q=robert%20delong. I believe the response to that has the information you need.
Actually, this is good for you: reading JSON should be much more straightforward than parsing HTML ;)
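A sketch of what reading that JSON could look like (assumptions: api-v2 endpoints generally require a client_id query parameter, and the response keys used here, 'collection' and 'user', would need to be verified against the actual response):
import requests

url = 'https://api-v2.soundcloud.com/search'
# client_id is hypothetical; SoundCloud's API generally requires a valid one.
params = {'q': 'robert delong', 'client_id': 'YOUR_CLIENT_ID'}
response = requests.get(url, params=params)
data = response.json()

# Key names are assumptions; inspect the real response to confirm them.
for item in data.get('collection', []):
    user = item.get('user', {})
    print(user.get('username'))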
This is the command you can use from the terminal to download the HTML of a webpage together with its related links and images:
wget -p --convert-links http://www.website.com/directory/webpage.html
