Downloading files from multiple websites - Python

This is my first Python project, so it is very basic and rudimentary.
I often have to clean viruses off friends' computers, and the free programs that I use are updated often. Instead of manually downloading each program, I was trying to create a simple way to automate the process. Since I am also trying to learn Python, I thought it would be a good opportunity to practice.
Questions:
For some of the links I have to find the .exe file on the page first. I can find the correct URL, but I get an error when the script tries to download it.
Is there a way to add all of the links to a list, and then create a function that goes through the list and runs on each URL? I've Googled quite a bit and I just cannot seem to make it work. Maybe I am not thinking in the right direction?
import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup

# Website List
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tr, tds, tr]
urllist2 = []

# Find exe files to download
match = re.compile('\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)

# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)
    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))
As you can see, I have left the function commented out, as I cannot get it to work correctly.
Should I abandon the list and just download the files individually? I was trying to be efficient.
Any suggestions, or a pointer in the right direction, would be most appreciated.

In addition to mikez302's answer, here's a slightly more readable way to write your code:
import os
import re
import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]

download_links = []

for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()
    # Collect every link whose href ends in .exe
    for link in soup.findAll('a', {'href': re.compile(r'\.exe$')}):
        download_links.append(link['href'])

for url in download_links:
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))

urllib2.urlopen is a function for accessing a single URL. If you want to access multiple URLs, you should loop over the list and do something like this:
for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)
    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)
        except KeyError:
            pass

os.chdir(r"C:\_VirusFixes")
for href in urllist2:
    urllib.urlretrieve(href, os.path.basename(href))

The code above didn't work for me; in my case it was because the pages assemble their links through a script instead of including them directly in the HTML. When I ran into that problem, I used the following code, which is just a scraper:
import os
import re
import urllib
import urllib2

from bs4 import BeautifulSoup

url = ''

connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection)  # Everything is the same up to here

regex = r'(.+?)\.zip'  # Here we insert the pattern we are looking for
pattern = re.compile(regex)
links = re.findall(pattern, str(soup))  # This finds all the .zip (.exe in your case) references in the text

# When it finds all the .zip references, they usually come back with a lot of
# undesirable text around them. Luckily the file name is almost always separated
# by a space from the rest of the text, which is why we do the split.
for x, i in enumerate(links):
    links[x] = i.split(' ')[len(i.split(' ')) - 1]

os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download

for i in links:
    # The text we found doesn't include the .zip (or .exe in your case), so we
    # re-append it. Building the link as url + name assumes the files are served
    # relative to the page URL.
    urllib.urlretrieve(url + i + ".zip", filename=i + ".zip")
This is not as efficient as the code in the previous answers, but it will work for almost any site.

Related

Scraping the URLs of dynamically changing images from a website

I'm creating a Python program that collects images from this website by Google.
The images on the website change after a certain number of seconds, and the image URL also changes with time. This change is handled by a script on the website, and I have no idea how to get the image links from it.
I tried using BeautifulSoup and the requests library to get the image links from the site's HTML code:
import requests
from bs4 import BeautifulSoup

url = 'https://clients3.google.com/cast/chromecast/home'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
tags = soup('img')
for tag in tags:
    print(tag)
But the code only returns tags with {{backgroundUrl}} in the image src ("ng-src"). For example:
<img class="S9aygc-AHe6Kc" id="picture-background" image-error-handler="" image-index="0" ng-if="backgroundUrl" ng-src="{{backgroundUrl}}"/>
How can I get the image links from a dynamically changing site? Can BeautifulSoup handle this? If not, what library will do the job?
import requests
import re

def main(url):
    r = requests.get(url)
    match = re.search(r"(lh4\.googl.+?mv)", r.text).group(1)
    match = match.replace("\\", "").replace("u003d", "=")
    print(match)

main("https://clients3.google.com/cast/chromecast/home")
Just a minor addition to the answer by αԋɱҽԃ αмєяιcαη (ahmed american), in case anyone is wondering: the subdomain (lhx) in lhx.google.com is also dynamic, so the link can be lh3 or lh4, et cetera.
This code fixes the problem:
import requests
import re
r = requests.get("https://clients3.google.com/cast/chromecast/home").text
match = re.search(r"(lh.\.googl.+?mv)", r).group(1)
match = match.replace('\\', '').replace("u003d", "=")
print(match)
The major difference is that the lh4 in the code by ahmed american has been replaced with "lh." so that all images can be collected no matter the url.
EDIT: This line does not work:
match = match.replace('\\', '').replace("u003d", "=")
Replace with:
match = match.replace("\\", "")
match = match.replace("u003d", "=")
None of the provided answers worked for me. Issues may be related to using an older version of python and/or the source page changing some things around.
Also, this will return all matches instead of only the first match.
Tested in Python 3.9.6.
import requests
import re

url = 'https://clients3.google.com/cast/chromecast/home'
r = requests.get(url)
for match in re.finditer(r"(ccp-lh\..+?mv)", r.text, re.S):
    image_link = 'https://%s' % (match.group(1).replace("\\", "").replace("u003d", "="))
    print(image_link)

Open the first video from a YouTube search, using Python

I tried this but I don't know how to open the first video. This code opens the search results in the browser.
import time
import webbrowser

def findYT(search):
    words = search.split()
    link = "http://www.youtube.com/results?search_query="
    for i in words:
        link += i + "+"
    time.sleep(1)
    webbrowser.open_new(link[:-1])
This successfully searches the video, but how do I open the first result?
The most common approach would be to use two very popular libraries: requests and BeautifulSoup. requests to get the page, and BeautifulSoup to parse it.
import requests
from bs4 import BeautifulSoup
import webbrowser

def findYT(search):
    words = search.split()
    search_link = "http://www.youtube.com/results?search_query=" + '+'.join(words)
    search_result = requests.get(search_link).text
    soup = BeautifulSoup(search_result, 'html.parser')
    videos = soup.select(".yt-uix-tile-link")
    if not videos:
        raise KeyError("No video found")
    link = "https://www.youtube.com" + videos[0]["href"]
    webbrowser.open_new(link)
Note that it is recommended not to use uppercase letters in Python when naming variables and functions.
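For example, a PEP 8 style spelling of the function name would look like this (purely illustrative, same behaviour as above):
# PEP 8 prefers lowercase_with_underscores for function and variable names,
# so findYT would conventionally be written as:
def find_yt(search):
    pass  # same body as findYT above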
To do that you have to web scrape. Python can't see what is on your screen; you have to scrape the YouTube results page you are searching, and then you can open the first <a> that comes up, for example (<a> is a link tag in HTML).
Things you need for that:
BeautifulSoup or Selenium, for example
requests
That should be all that you need to do what you want.
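For illustration, here is a rough Selenium-based sketch of that idea. This is not the answerer's code: the helper name find_first_video is made up for this sketch, the a#video-title selector is an assumption about YouTube's result markup and may need adjusting, and you need a WebDriver (e.g. geckodriver for Firefox) installed.
import webbrowser

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def find_first_video(search):
    # Let a real browser render the results, since YouTube builds its
    # result list with JavaScript.
    query = "+".join(search.split())
    driver = webdriver.Firefox()
    try:
        driver.get("https://www.youtube.com/results?search_query=" + query)
        # 'a#video-title' is an assumed selector for result links; inspect
        # the page and adjust it if YouTube's markup has changed.
        first = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a#video-title"))
        )
        link = first.get_attribute("href")
    finally:
        driver.quit()
    webbrowser.open_new(link)

find_first_video("python tutorial")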

How to scrape all the home page text content of a website?

So I am new to web scraping, and I want to scrape all the text content of the home page only.
This is my code, but it is not working correctly.
from bs4 import BeautifulSoup
import requests
website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")
full_text = soup.find_all()
print(full_text)
When I print full_text it gives me a lot of HTML content, but not all of it: when I Ctrl+F for "traiteurcheminfaisant#hotmail.com", the email address that is in the home page footer, it is not found in full_text.
Thank you for helping!
A quick glance at the website that you're attempting to scrape from makes me suspect that not all content is loaded when sending a simple get request via the requests module. In other words, it seems likely that some components on the site, such as the footer you mentioned, are being loaded asynchronously with Javascript.
If that is the case, you'll probably want to use some sort of automation tool to navigate to the page, wait for it to load and then parse the fully loaded source code. For this, the most common tool would be Selenium. It can be a bit tricky to set up the first time since you'll also need to install a separate webdriver for whatever browser you'd like to use. That said, the last time I set this up it was pretty easy. Here's a rough example of what this might look like for you (once you've got Selenium properly set up):
from bs4 import BeautifulSoup
from selenium import webdriver
import time
driver = webdriver.Firefox(executable_path='/your/path/to/geckodriver')
driver.get('http://www.traiteurcheminfaisant.com')
time.sleep(2)
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
full_text = soup.find_all()
print(full_text)
I haven't used BeautifulSoup before, but try using urlopen instead. This will store the webpage as a string, which you can use to find the email.
from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding="UTF8", errors='ignore')
    print(html.find("traiteurcheminfaisant#hotmail.com"))
except:
    print("Cannot open webpage")

Searching through HTML pages for certain text?

I wanted to play around with Python to learn it, so I'm taking on a little project, but part of it requires me to search for a name on this list:
https://bughunter.withgoogle.com/characterlist/1
(the number one at the end is to be incremented by one each time while searching for the name)
So I will be scraping the HTML. I'm new to Python and would appreciate it if someone could give me an example of how to make this work.
import json

import requests
from bs4 import BeautifulSoup

URL = 'https://bughunter.withgoogle.com'

def get_page_html(page_num):
    r = requests.get('{}/characterlist/{}'.format(URL, page_num))
    r.raise_for_status()
    return r.text

def get_page_profiles(page_html):
    page_profiles = {}
    soup = BeautifulSoup(page_html)
    for table_cell in soup.find_all('td'):
        profile_name = table_cell.find_next('h2').text
        profile_url = table_cell.find_next('a')['href']
        page_profiles[profile_name] = '{}{}'.format(URL, profile_url)
    return page_profiles

if __name__ == '__main__':
    all_profiles = {}
    for page_number in range(1, 81):
        current_page_html = get_page_html(page_number)
        current_page_profiles = get_page_profiles(current_page_html)
        all_profiles.update(current_page_profiles)
    with open('google_hall_of_fame_profiles.json', 'w') as f:
        json.dump(all_profiles, f, indent=2)
Your question wasn't clear about how you wanted the data structured after scraping, so I just saved the profiles in a dict (with key/value pairs of {profile_name: profile_url}) and then dumped the results to a JSON file.
Let me know if anything is unclear!
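Since the original goal was to search for a name, here is a small follow-up sketch (not part of the original answer) showing how the saved JSON file could be searched afterwards; 'Some Name' is a placeholder.
import json

# Load the profiles scraped above and check whether a given name appears.
with open('google_hall_of_fame_profiles.json') as f:
    all_profiles = json.load(f)

name_to_find = 'Some Name'  # placeholder: the name you are looking for
if name_to_find in all_profiles:
    print(name_to_find, '->', all_profiles[name_to_find])
else:
    print(name_to_find, 'not found')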
Try this. You will need to install bs4 first (Python 3). It will get all of the names of the people on the website page:
from bs4 import BeautifulSoup as soup
import urllib.request

text = str(urllib.request.urlopen('https://bughunter.withgoogle.com/characterlist/1').read())
text = soup(text)
print(text.findAll(class_='item-list')[0].get_text())

Write a python script that goes through the links on a page recursively

I'm doing a project for my school in which I would like to compare scam emails. I found this website: http://www.419scam.org/emails/
Now what I would like to do is save every scam in a separate document so that I can analyse them later on.
Here is my code so far:
import BeautifulSoup, urllib2
address='http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()
f = open('test.txt', 'wb')
f.write(html)
f.close()
This saves me the whole HTML file in text format. Now I would like to strip the file down and save only the content of the HTML links to the scams (01, 02, 03, etc.).
If I get that, I would still need to go a step further and open and save the page behind each of those hrefs. Any idea how I can do all of this in one Python script?
Thank you!
Thank you!
You picked the right tool in BeautifulSoup. Technically you could do it all in one script, but you might want to segment it, because it looks like you'll be dealing with tens of thousands of e-mails, all of which are separate requests, and that will take a while.
This page is going to help you a lot, but here's just a little code snippet to get you started. It gets all of the <a> tags that point to the e-mail index pages, extracts their href links and prepends the base of the URL so they can be accessed directly.
from bs4 import BeautifulSoup
import re
import urllib2
soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))
tags = soup.find_all(href=re.compile(r"20......../index\.htm"))
links = []
for t in tags:
    links.append("http://www.419scam.org/emails/" + t['href'])
're' is Python's regular expressions module. In the fifth line, I told BeautifulSoup to find all of the tags in the soup whose href attribute matches that regular expression. I chose this regular expression to get only the e-mail index pages rather than all of the href links on that page; I noticed that the index page links all follow that pattern in their URLs.
Having all of the proper 'a' tags, I then looped through them, extracting the string from the href attribute with t['href'] and prepending the rest of the URL to the front of the string, to get complete URLs.
Reading through that documentation, you should get an idea of how to expand these techniques to grab the individual e-mails.
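To sketch that expansion (not part of the original answer): once you have the index-page links, you can fetch each index page the same way and collect the links to the individual e-mails. The href filter below is an assumption about how the e-mail pages are linked from each index page, and it assumes those hrefs are relative.
# Rough sketch: fetch each index page and collect the per-e-mail links.
email_links = []
for index_url in links:
    index_soup = BeautifulSoup(urllib2.urlopen(index_url))
    base = index_url.rsplit('/', 1)[0]
    for a in index_soup.find_all('a', href=True):
        href = a['href']
        # Assumed pattern: individual e-mails are .htm pages other than index.htm
        if href.endswith('.htm') and not href.endswith('index.htm'):
            email_links.append(base + '/' + href)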
You might also find value in requests and lxml.html. Requests is another way to make http requests and lxml is an alternative for parsing xml and html content.
There are many ways to search the html document but you might want to start with cssselect.
import requests
from lxml.html import fromstring
url = 'http://www.419scam.org/emails/'
doc = fromstring(requests.get(url).content)
atags = doc.cssselect('a')
# using .get('href', '') syntax because not all a tags will have an href
hrefs = (a.attrib.get('href', '') for a in atags)
Or as suggested in the comments using .iterlinks(). Note that you will still need to filter if you only want 'a' tags. Either way the .make_links_absolute() call is probably going to be helpful. It is your homework though, so play around with it.
doc.make_links_absolute(base_url=url)
hrefs = (l[2] for l in doc.iterlinks() if l[0].tag == 'a')
Next up for you... how to loop through and open all of the individual spam links.
To get all of the links on the page, you could use BeautifulSoup. Take a look at this page; it can help, and it actually tells you how to do exactly what you need.
To save all of the pages, you could do the same as what you do in your current code, but inside a loop that iterates over all of the links you will have extracted and stored, say, in a list.
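A minimal sketch of that loop (not the answerer's code), reusing the urllib2 approach from the question and assuming the extracted absolute URLs are already stored in a list called links:
import urllib2

# 'links' is assumed to hold the absolute URLs of the individual scam pages.
for n, link in enumerate(links):
    html = urllib2.urlopen(link).read()
    # Save each scam page to its own file.
    with open('scam_%05d.html' % n, 'wb') as f:
        f.write(html)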
Here's a solution using lxml + XPath and urllib2:
#!/usr/bin/env python2 -u
# -*- coding: utf8 -*-
import cookielib, urllib2
from lxml import etree

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
page = opener.open("http://www.419scam.org/emails/")
reddit = etree.HTML(page.read())

# XPath expression: we get all links under body/p[2] containing *.htm
for node in reddit.xpath('/html/body/p[2]/a[contains(@href, ".htm")]'):
    for i in node.items():
        url = 'http://www.419scam.org/emails/' + i[1]
        page = opener.open(url)
        lst = url.split('/')
        try:
            if lst[6]:  # else it's a "month" link
                filename = '/tmp/' + url.split('/')[4] + '-' + url.split('/')[5]
                f = open(filename, 'w')
                f.write(page.read())
                f.close()
        except:
            pass

# vim:ts=4:sw=4
You could use HTMLParser and specify the type of tag you are searching for.
from HTMLParser import HTMLParser
import urllib2

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    print attr[1]

address = 'http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()

f = open('test.txt', 'wb')
f.write(html)
f.close()

parser = MyHTMLParser()
parser.feed(html)
