Open the first video from a YouTube search, using Python

I tried this, but I don't know how to open the first video. This code only opens the search results page in the browser.
import time
import webbrowser

def findYT(search):
    words = search.split()
    link = "http://www.youtube.com/results?search_query="
    for i in words:
        link += i + "+"
    time.sleep(1)
    webbrowser.open_new(link[:-1])  # drop the trailing "+"
This successfully opens the search results, but how do I open the first result?

The most common approach would be to use two very popular libraries: requests to get the page, and BeautifulSoup to parse it.
import requests
from bs4 import BeautifulSoup
import webbrowser

def findYT(search):
    words = search.split()
    search_link = "http://www.youtube.com/results?search_query=" + '+'.join(words)
    search_result = requests.get(search_link).text
    soup = BeautifulSoup(search_result, 'html.parser')
    videos = soup.select(".yt-uix-tile-link")
    if not videos:
        raise KeyError("No video found")
    link = "https://www.youtube.com" + videos[0]["href"]
    webbrowser.open_new(link)
Note that the Python convention (PEP 8) is lowercase snake_case names for functions and variables, so find_yt would be more idiomatic than findYT.

To do that you have to web-scrape; Python can't see what is on your screen. You have to scrape the YouTube results page for your search and then open, for example, the first <a> element that comes up (<a> is the link tag in HTML).
Things you need for that:
BeautifulSoup or selenium for example
requests
That should be all that you need to do what you want.
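
Note that YouTube has since moved to rendering search results with JavaScript, so a plain requests-based scrape may come back without the result links. Here is a minimal Selenium sketch of the same idea (an assumption-laden example: it presumes Selenium 4+ with a Chrome driver available, and the a#video-title selector reflects YouTube's result markup at the time of writing and may change):

import urllib.parse
import webbrowser

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def find_yt(search):
    query = urllib.parse.quote_plus(search)
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.youtube.com/results?search_query=" + query)
        # Wait for the first rendered result link; "a#video-title" is an
        # assumption about YouTube's current markup
        first = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a#video-title"))
        )
        webbrowser.open_new(first.get_attribute("href"))
    finally:
        driver.quit()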

Why won't BeautifulSoup extract all the HTML from a public Twitter page?

I am trying to write some code to extract tweets from a public Twitter page (the Nike store) using the Python BS4 module. When I print the page HTML to the console, only some of the HTML is printed: when I search (Ctrl+F) the console output for the specific class values of a tag, it returns zero results. Why is this happening?
Here is a code snippet:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

if __name__ == '__main__':
    # Read the webpage into 'page_html' and close the connection
    first_page = 'https://twitter.com/nikestore'
    url_client = urlopen(first_page)
    page_html = url_client.read()
    url_client.close()
    print(page_html)
I came across the accepted answer at the following link; it also suggests using Selenium to circumvent the problem.
Problem while scraping twitter using beautiful soup
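
For completeness, a rough Selenium sketch of that suggestion might look like this (it assumes a working Selenium and geckodriver setup; Twitter's markup and access rules change frequently, so treat this only as a starting point):

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get('https://twitter.com/nikestore')
time.sleep(5)  # give the JavaScript-rendered timeline time to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify())
driver.quit()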

How to scrape all the home page text content of a website?

So I am new to web scraping. I want to scrape all the text content of the home page only.
This is my code, but it is not working correctly.
from bs4 import BeautifulSoup
import requests
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
When I print full_text it gives me a lot of HTML content, but not all of it. When I Ctrl+F for "traiteurcheminfaisant@hotmail.com", the email address that appears in the home page footer, it is not found in full_text.
Thank you for helping!
A quick glance at the website that you're attempting to scrape from makes me suspect that not all content is loaded when sending a simple get request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with Javascript.
If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load and then parse the fully loaded source code. For this, the most common tool would be Selenium. It can be a bit tricky to set up the first time since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)  # give the JavaScript-loaded components time to render
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()
print(full_text)
I haven't used BeautifulSoup before, but try using urlopen instead. This stores the webpage as a string, which you can search for the email address.
from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding="UTF8", errors='ignore')
    # str.find returns the index of the first match, or -1 if not found
    print(html.find("traiteurcheminfaisant@hotmail.com"))
except:
    print("Cannot open webpage")

Web scraping with Python (beginner)

I'm doing the first example of the web scraping tutorial from the book "Automate the Boring Stuff with Python". The project consists of typing a search term on the command line and having my computer automatically open a browser with all the top search results in new tabs.
It mentions that I need to locate the
<h3 class="r">
elements in the page source, which are the links to each search result.
The r class is used only for search result links.
But the problem is that I can't find it anywhere, even using Chrome DevTools. Any help as to where it is would be greatly appreciated.
Note: just for reference, this is the complete program as seen in the book.
# lucky.py - Opens several Google search results.
import requests, sys, webbrowser, bs4

print('Googling...')  # display text while downloading the Google page
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text)

# Open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
This will work for you:
>>> import requests
>>> from lxml import html
>>> r = requests.get("https://www.google.co.uk/search?q=how+to+do+web+scraping&num=10")
>>> source = html.fromstring((r.text).encode('utf-8'))
>>> links = source.xpath('//h3[@class="r"]//a//@href')
>>> for link in links:
...     print link.replace("/url?q=", "").split("&sa=")[0]
Output:
http://newcoder.io/scrape/intro/
https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/
http://docs.python-guide.org/en/latest/scenarios/scrape/
http://webscraper.io/
https://blog.hartleybrody.com/web-scraping/
https://first-web-scraper.readthedocs.io/
https://www.youtube.com/watch%3Fv%3DE7wB__M9fdw
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
http://analystcave.com/web-scraping-tutorial/
https://en.wikipedia.org/wiki/Web_scraping
Note: I am using Python 2.7.x; for Python 3.x you just have to wrap the print output in parentheses, like this: print(link.replace("/url?q=", "").split("&sa=")[0])
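
For reference, a full Python 3 version of the same snippet might look like this (with the same caveat: it depends on Google still serving h3 elements with the r class, which Google has since changed, so the XPath may need updating):

import requests
from lxml import html

r = requests.get("https://www.google.co.uk/search?q=how+to+do+web+scraping&num=10")
source = html.fromstring(r.text)
links = source.xpath('//h3[@class="r"]//a//@href')
for link in links:
    print(link.replace("/url?q=", "").split("&sa=")[0])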

Scraping all links using Python BeautifulSoup/lxml

http://www.snapdeal.com/
I was trying to scrape all links from this site, and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.
Under the "See All Categories" tab you will find all major product categories. If you hover the mouse over any category, it expands. I want those links from each major category.
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l
But this gave me a different result than what I expected (I turned off JavaScript, looked at the page source, and the output matched that source).
I just want to find all the sub-links under each major category. Any suggestions would be appreciated.
This is happening just because you are letting BeautifulSoup choose its own best parser, and you might not have lxml installed.
The best option is to use html.parser to parse the URL.
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.
I think you should try another library such as Selenium; it provides a web driver for you, and that is the advantage of this library. For myself, I couldn't handle JavaScript with bs4.
The Categories Menu is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest).
In order to examine the components of a website, get familiar with the Firebug add-on in Firefox or the built-in Developer Tools in Chrome. You can check the XHR requests a website makes under the Network tab in the aforementioned tools.
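
Once you have identified the XHR request in the Network tab, you can often call it directly with requests. A sketch of that idea (the endpoint below is purely hypothetical; substitute the URL you actually observe):

import requests

# Hypothetical endpoint; replace it with the real URL from the Network tab
xhr_url = 'http://www.snapdeal.com/json/categories-menu'
response = requests.get(xhr_url, headers={'X-Requested-With': 'XMLHttpRequest'})
print(response.text)  # typically JSON or an HTML fragment with the category links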
Use a web scraping tool such as Scrapy or mechanize.
In mechanize, to get all the links on the Snapdeal homepage:
from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.name
    print link.url
I have been looking into a way to scrape links from webpages that are only rendered in an actual browser, but wanted it to run using a headless browser.
I was able to achieve this using PhantomJS, Selenium and Beautiful Soup:
#!/usr/bin/python
import bs4
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
links = [a.attrs.get('href') for a in soup.find_all('a')]
for path in links:
    print path
driver.close()
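
PhantomJS has since been discontinued, so if you want to reproduce this today, a rough equivalent using headless Chrome might look like the following (an assumption: Selenium 4+ with a Chrome driver available):

import bs4
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get('http://www.snapdeal.com/')
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.find_all('a'):
    print(a.attrs.get('href'))
driver.quit()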
The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2

url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl

# to open up HTTPS URLs
gcontext = ssl.SSLContext()

# You can give any URL here. I have given the Stack Overflow homepage.
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print(l)
Other Languages
For other languages, please see this answer.

Downloading files from multiple websites

This is my first Python project so it is very basic and rudimentary.
I often have to clean off viruses for friends and the free programs that I use are updated often. Instead of manually downloading each program, I was trying to create a simple way to automate the process. Since I am also trying to learn python I thought it would be a good opportunity to practice.
Questions:
I have to find the .exe file within some of the links. I can find the correct URL, but I get an error when it tries to download.
Is there a way to add all of the links to a list, and then create a function that goes through the list and runs on each URL? I've Googled quite a bit and I just cannot seem to make it work. Maybe I am not thinking in the right direction?
import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup

# Website list
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tds, mbam, tr]
urllist2 = []

# Find exe files to download
match = re.compile(r'\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)

# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)
    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))
As you can see, I have left the function commented out as I cannot get it to work correctly.
Should I abandon the list and just download them individually? I was trying to be efficient.
Any suggestions, or a pointer in the right direction, would be most appreciated.
In addition to mikez302's answer, here's a slightly more readable way to write your code:
import os
import re
import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]

download_links = []

for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()
    for link in soup.findAll('a', {'href': re.compile(r'\.exe$')}):
        download_links.append(link['href'])

for url in download_links:
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
urllib2.urlopen is a function for accessing a single URL. If you want to access multiple URLs, loop over the list, something like this:
for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)
    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)
        except KeyError:
            pass

os.chdir(r"C:\_VirusFixes")
for url in urllist2:
    urllib.urlretrieve(url, os.path.basename(url))
The code above didn't work for me; in my case it was because the pages assemble their links through a script instead of including them in the HTML. When I ran into that problem I used the following code, which is just a scraper:
import os
import re
import urllib
import urllib2

from bs4 import BeautifulSoup

url = ''
connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection)  # Everything the same up to here

regex = '(.+?).zip'  # Here we insert the pattern we are looking for
pattern = re.compile(regex)
link = re.findall(pattern, str(soup))  # This finds all the .zip (.exe) in the text

x = 0
for i in link:
    # When it finds all the .zip, it usually comes back with a lot of
    # undesirable text; luckily the file name is almost always separated
    # by a space from the rest of the text, which is why we do the split
    link[x] = i.split(' ')[-1]
    x += 1

os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download
for i in link:
    # Remember that the text we found doesn't include the .zip (or .exe in
    # your case), so we want to re-establish that
    urllib.urlretrieve(url, filename=i + ".zip")
This is not as efficient as the code in the previous answers, but it will work for almost any site.
