I would like to load a list/tuple dynamically from a settings file.
I need to write a crawler that crawls a website, but I want to be notified about files that are found, rather than pages.
I let the user specify those file types in a settings.py file, like this:
# Document Types during crawling
textFiles = ['.doc', '.docx', '.log', '.msg', '.pages', '.rtf', '.txt', '.wpd', '.wps']
dataFiles = ['.csv', '.dat', '.efx', '.gbr', '.key', '.pps', '.ppt', '.pptx', '.sdf', '.tax2010', '.vcf', '.xml']
audioFiles = ['.3g2','.3gp','.asf','.asx','.avi','.flv','.mov','.mp4','.mpg','.rm','.swf','.vob','.wmv']
#What lists would you like to use?
fileLists = ['textFiles', 'dataFiles', 'audioFiles']
I import my settings file in crawler.py.
I use the BeautifulSoup module to find links in the HTML content and process them as follows:
for item in soup.find_all("a"):
    # we don't want some of them because it is just a link to the current page or the start page
    if item['href'] in dontWantList:
        continue
    # check if the link is a file, based on the fileLists from the settings
    urlpath = urlparse.urlparse(item['href']).path
    ext = os.path.splitext(urlpath)[1]
    file = False
    for list in settings.fileLists:
        if ext in settings.list:
            file = True
            # found file link
            if self.verbose:
                messenger("Found a file of type: %s" % ext, Colors.PURPLE)
            if ext not in fileLinks:
                fileLinks.append(item['href'])
    # Only add the link if it is not a file
    if file is not True:
        links.append(item['href'])
    else:
        # Do not add the file to the other lists
        continue
The following code segment throws an error:
for list in settings.fileLists:
    if ext in settings.list:
clearly because Python looks for an attribute literally named list on the settings module, rather than using the loop variable.
Is there any way to tell Python to dynamically look up the lists from the settings file?
I think that what you are looking for is this: instead of
if ext in settings.list:
you need
ext_list = getattr(settings, list)
if ext in ext_list:
EDIT:
I agree with jonrsharpe that list shouldn't be used as a variable name (it shadows the built-in), so I renamed it in my code.
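For reference, here is a minimal sketch of the whole corrected loop (soup, settings, dontWantList, fileLinks and links are assumed from the question; list_name is the renamed loop variable):

import os
import urlparse  # Python 2; on Python 3 use urllib.parse instead

for item in soup.find_all("a"):
    if item['href'] in dontWantList:
        continue
    urlpath = urlparse.urlparse(item['href']).path
    ext = os.path.splitext(urlpath)[1]
    is_file = False
    for list_name in settings.fileLists:
        # getattr looks up the settings attribute whose name is stored in list_name
        if ext in getattr(settings, list_name):
            is_file = True
            if item['href'] not in fileLinks:
                fileLinks.append(item['href'])
            break  # no need to check the remaining lists
    if not is_file:
        links.append(item['href'])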
Related
I am working on a project where I need to scrape images off of the web. To do this, I write the image links to a file, and then I download each of them to a folder with requests. At first, I used Google as the scrape site, but due to several reasons, I have decided that Wikipedia is a much better alternative.

However, after I tried the first time, many of the images couldn't be opened, so I tried again with the change that when I downloaded the images, I downloaded them to names with endings that matched the endings of the links. More images were able to be accessed like this, but many were still not able to be opened. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I used my function to download them afterwards, they kept downloading correctly (i.e. I could access them).

I am not sure if it is important, but the image endings that I generally come across are .svg.png and .png. I want to know why this is occurring and what I may be able to do to prevent it. I have left some of my code below. Thank you.
Function:
def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
    f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
    f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png
If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)
My previous answer works for PNGs only.
For SVG files you need to check if the file contents start with the string "<svg" and create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path

# urls are stored in a file 'urls.txt'.
Path('downloads').mkdir(exist_ok=True)  # make sure the output directory exists
with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip()  # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except requests.RequestException:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue  # skip this url, otherwise ext would be stale or undefined
        # reuse the already-downloaded bytes instead of fetching the url again
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
I have written some code to read the contents from a specific url:
import requests
import os

def read_doc(doc_ID):
    filename = doc_ID + ".txt"
    if not os.path.exists(filename):
        my_url = encode_url(doc_ID)  # this is a call to another function that would encode the url
        my_response = requests.get(my_url)
        if my_response.status_code == requests.codes.ok:
            return my_response.text
    return None
This checks if there's a file named doc_ID.txt (where doc_ID could be any name provided). And if there's no such file, it would read the contents from a specific url and would return them. What I would like to do is to store those returned contents in a file called doc_ID.txt. That is, I would like to finish my function by creating a new file in case it didn't exist at the beginning.
How can I do that? I tried this:
my_text = my_response.text
output = os.rename(my_text, filename)
return output
but then, the actual contents of the file would become the name of the file and I would get an error saying the filename is too long.
So the issue I think I'm seeing is that you want to put the contents of your request's response into the file, rather than naming the file with the contents. The code below should create a file with the filename you want, and insert the text from your response!
import requests
import os

def read_doc(doc_ID):
    filename = doc_ID + ".txt"
    if not os.path.exists(filename):
        my_url = encode_url(doc_ID)  # this is a call to another function that would encode the url
        my_response = requests.get(my_url)
        if my_response.status_code == requests.codes.ok:
            with open(filename, "w") as file:
                file.write(my_response.text)
            # return the text itself; the file object is closed once the with block ends
            return my_response.text
    return None
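A quick usage sketch (the doc_ID value here is hypothetical; encode_url is assumed to be defined elsewhere, as in the question):

text = read_doc("some_doc_ID")
if text is not None:
    print(text[:100])  # first 100 characters of the freshly fetched document
else:
    print("File already existed on disk, or the request failed.")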
To write the response text to the file, you can simply use a Python file object: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
with open(filename, "w") as file:
    file.write(my_text)
I'm new to Python and trying to write a program to do the following:
Open all folders and subfolders in a directory path
Identify the HTML files
Load the HTML in BeautifulSoup
Find the first body tag
If the body tag is immediately followed by <!-- Google Tag Manager --> then continue
If not, then add the <!-- Google Tag Manager --> code and save the file
I'm not able to scan all subfolders within each folder.
I'm not able to set seen() when <!-- Google Tag Manager --> appears immediately after the body tag.
Any help to perform the above tasks is appreciated.
My code attempt is as follows:
import sys
import os
from os import path
from bs4 import BeautifulSoup

directory_path = '/input'
files = [x for x in os.listdir(directory_path) if path.isfile(directory_path + os.sep + x)]
for root, dirs, files in os.walk(directory_path):
    for fname in files:
        seen = set()
        a = directory_path + os.sep + fname
        if fname.endswith(".html"):
            with open(a) as f:
                soup = BeautifulSoup(f)
                for li in soup.select('body'):
                    if li in seen:
                        continue
                    else:
                        seen.add("<!-- Google Tag Manager --><noscript><iframe src='//www.googletagmanager.com/ns.html?id=GTM-54QWZ8' height='0' width='0' style='display:none;visibility:hidden'></iframe></noscript><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-54QWZ8');</script><!-- End Google Tag Manager -->\n")
You can use glob.iglob from Python's standard glob module. With iglob you can recursively traverse the main directory you specify and its sub-directories, and list all the files with a given extension. Then open up each HTML file, read all the lines, and traverse through the lines manually until you find "<body>", since some users who use a framework might have other content inside the body tag. Either way, loop through the lines looking for the start of the body tag, then check the next line; if the text you specified, "Google Tag Manager", is not in the next line, write it out. Please keep in mind I wrote this on the assumption that you will always have the Google Tag Manager tags right after the body tag.
Please keep in mind that:
In the event the Google Tag Manager text is not directly after the body tag, this code will add it anyway; so if Google Tag Manager is somewhere else between the two body tags, and works, this could break the functionality of your Google Tag Manager.
I am using Python 3.x for this, so if you are using Python 2, you might have to translate this to that version of Python.
Replace the 'path.html' with the variable path so that it rewrites the file it is looking at with the modifications. I put in 'path.html' so that I could see the output and compare to the original while I was writing the script.
Here is the code:
import glob

types = ('*.html', '*.htm')
paths = []
for fType in types:
    for filename in glob.iglob('./**/' + fType, recursive=True):
        paths.append(filename)
#print(paths)
for path in paths:
    print(path)
    with open(path, 'r') as f:
        lines = f.readlines()
    with open(path, 'w') as w:
        for i in range(0, len(lines)):
            w.write(lines[i])
            if "<body>" in lines[i]:
                # guard against running past the last line of the file
                if i + 1 >= len(lines) or "<!-- Google Tag Manager -->" not in lines[i + 1]:
                    w.write('<!-- Google Tag Manager --> <!-- End Google Tag Manager -->\n')
My take on it, might have some bugs:
edited to add: I have since realized that this code does not ensure <!-- Google Tag Manager --> is the first tag after <body>; instead it ensures it is the first comment after <body>, which is not what the question asked for.
import fnmatch
import os
from bs4 import BeautifulSoup, Comment
from HTMLParser import HTMLParser

def get_soup(filename):
    with open(filename, 'r') as myfile:
        data = myfile.read()
    return BeautifulSoup(data, 'lxml')

def write_soup(filename, soup):
    with open(filename, "w") as file:
        output = HTMLParser().unescape(soup.prettify())
        file.write(output)

def needs_insertion(soup):
    comments = soup.find_all(text=lambda text: isinstance(text, Comment))
    try:
        if comments[0] == ' Google Tag Manager ':
            return False  # has correct comment
        else:
            return True  # has comments, but not correct one
    except IndexError:
        return True  # has no comments

def get_html_files_in_dir(top_level_directory):
    matches = []
    for root, dirnames, filenames in os.walk(top_level_directory):
        for filename in fnmatch.filter(filenames, '*.html'):
            matches.append(os.path.join(root, filename))
    return matches

my_html_files_path = '/home/azrad/whateveryouneedhere'
for full_file_name in get_html_files_in_dir(my_html_files_path):
    soup = get_soup(full_file_name)
    if needs_insertion(soup):
        soup.body.insert(0, '<!-- Google Tag Manager --> <!-- End Google Tag Manager -->')
        write_soup(full_file_name, soup)
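To address the caveat in the edit above, one might check <body>'s first meaningful child directly instead of the document's comment list. A minimal sketch, assuming BeautifulSoup 4 (the function name gtm_is_first is my own):

from bs4 import BeautifulSoup, Comment, NavigableString

def gtm_is_first(soup):
    # Walk <body>'s direct children, skipping whitespace-only text nodes.
    body = soup.body
    if body is None:
        return False
    for child in body.children:
        # Comment must be tested before NavigableString, since Comment subclasses it.
        if isinstance(child, Comment):
            return 'Google Tag Manager' in child  # first meaningful node is a comment
        if isinstance(child, NavigableString):
            if not child.strip():
                continue  # skip whitespace between tags
            return False  # first meaningful node is plain text, not the GTM comment
        return False  # first meaningful node is a tag, not the GTM comment
    return False  # empty body

html = '<html><body><!-- Google Tag Manager --><p>hi</p></body></html>'
print(gtm_is_first(BeautifulSoup(html, 'lxml')))  # True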
I am trying to let a user upload an image, save the image to disk, and then have it display on a webpage, but I can't get the image to display properly. Here is my bin/app.py:
import web

urls = (
    '/hello', 'index'
)
app = web.application(urls, globals())
render = web.template.render('templates/', base="layout")

class index:
    def GET(self):
        return render.hello_form()

    def POST(self):
        form = web.input(greet="Hello", name="Nobody", datafile={})
        greeting = "%s, %s" % (form.greet, form.name)
        filedir = 'absolute/path/to/directory'
        filename = None
        if form.datafile:
            # replaces the windows-style slashes with linux ones.
            filepath = form.datafile.filename.replace('\\', '/')
            # splits the path and chooses the last part (the filename with extension)
            filename = filepath.split('/')[-1]
            # creates the file where the uploaded file should be stored
            fout = open(filedir + '/' + filename, 'w')
            # writes the uploaded file to the newly created file.
            fout.write(form.datafile.file.read())
            # closes the file, upload complete.
            fout.close()
            filename = filedir + "/" + filename
        return render.index(greeting, filename)

if __name__ == "__main__":
    app.run()
and here is templates/index.html:
$def with (greeting, datafile)
$if greeting:
    I just wanted to say <em style="color: green; font-size: 2em;">$greeting</em>
$else:
    <em>Hello</em>, world!
<br>
$if datafile:
    <img src=$datafile alt="your picture">
<br>
Go Back
When I do this, I get a broken link for the image. How do I get the image to display properly? Ideally, I wouldn't have to read from disk to display it, although I'm not sure if that's possible. Also, is there a way to write the file to the relative path, instead of the absolute path?
You can also serve all images in a folder by adding an entry to your URL tuple.
URL = ('/hello', 'Index',
       '/hello/image/(.*)', 'ImageDisplay'
      )

...

class ImageDisplay(object):
    def GET(self, fileName):
        imageBinary = open("../relative/path/from/YourApp/" + fileName, 'rb').read()
        return imageBinary
Note the ../YourApp, not ./YourApp: it looks up one directory from where your program is. Now, in the HTML, you can use
<img src="/hello/image/$datafile" alt="your picture">
I would recommend using with or try around the imageBinary = open(...) line.
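For instance, a minimal sketch of that suggestion (the relative path is a placeholder, as above):

class ImageDisplay(object):
    def GET(self, fileName):
        # 'with' guarantees the file handle is closed, even if reading fails
        with open("../relative/path/from/YourApp/" + fileName, 'rb') as imageFile:
            return imageFile.read()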
Let me know if more info is needed. This is my first response.
Sorry to ask a question in a response, but is there a way to use a regular expression, like (.jpg), in place of the (.*) I have in the URL definition?
web.py doesn't automatically serve all of the files from the directory your application is running in; if it did, anyone would be able to read your application's source code. It does, however, have a directory it serves files out of: static.
To answer your other question: yes, there is a way to avoid using an absolute path: give it a relative path!
Here's how your code might look afterwards:
import os
import shutil

filename = form.datafile.filename.replace('\\', '/').split('/')[-1]
# It might be a good idea to sanitize filename further.
# A with statement ensures that the file will be closed even if an exception is
# thrown.
with open(os.path.join('static', filename), 'wb') as f:
    # shutil.copyfileobj copies the file in chunks, so it will still work if the
    # file is too large to fit into memory
    shutil.copyfileobj(form.datafile.file, f)
Omit the filename = filedir + "/" + filename line. Your template need not include the absolute path; in fact, it should not. It must include the static/ prefix, no more and no less:
<img src="static/$datafile" alt="your picture">
I am attempting to write a program that sends a url request to a site which then produces an animation of weather radar. I then scrape that page to get the image urls (they're stored in a Java module) and download them to a local folder. I do this iteratively over many radar stations and for two radar products.

So far I have written the code to send the request, parse the html, and list the image urls. What I can't seem to do is rename and save the images locally. Beyond that, I want to make this as streamlined as possible, which is probably NOT what I've got so far. Any help 1) getting the images to download to a local folder and 2) pointing me to a more pythonic way of doing this would be great.
# import modules
import os
import re
import urllib2
from bs4 import BeautifulSoup

## test variables ##
stationName = "KBYX"
prod = ("bref1", "vel1")  # a tuple of both ref and vel
bkgr = "black"
duration = "1"
home_dir = "/path/to/home/directory/folderForImages"

## program ##
# This program needs to do the following:
# read the folder structure from home directory to get radar names
# left off here
list_of_folders = os.listdir(home_dir)
for each_folder in list_of_folders:
    if each_folder.startswith('k'):
        print each_folder
# here each folder that starts with a "k" represents a radar station, and within
# each folder are two other folders, bref1 and vel1, the two products. I want the
# program to read the folders to decide which radar to retrieve the data for...
# so if I decide to add radars, all I have to do is add the folders to the
# directory tree.

# first request will be for prod[0] - base reflectivity
# second request will be for prod[1] - base velocity
# sample path:
# http://weather.rap.ucar.edu/radar/displayRad.php?icao=KMPX&prod=bref1&bkgr=black&duration=1

# base part of the path
base = "http://weather.rap.ucar.edu/radar/displayRad.php?"
# additional parameters
call = base + "icao=" + stationName + "&prod=" + prod[0] + "&bkgr=" + bkgr + "&duration=" + duration
# read in the webpage
urlContent = urllib2.urlopen(call).read()
webpage = urllib2.urlopen(call)
# parse the webpage with BeautifulSoup
soup = BeautifulSoup(urlContent)
#print(soup.prettify())  # if you want to take a look at the parsed structure

# find the tag that holds all the filenames (which are nested in the PARAM tag,
# and located in the "value" parameter for PARAM name="filename")
tag = soup.param.param.param.param.param.param.param
files_in = str(tag['value'])
files = files_in.split(',')  # they're in a single element, so split them by comma
directory = home_dir + "/" + stationName + "/" + prod[1] + "/"
counter = 0
for file in files:  # now we should THEORETICALLY be able to iterate over them to download them... here I just print them
    print file
To save images locally, something like this:
import os

IMAGES_OUTDIR = '/path/to/image/output/directory'

for file_url in files:
    image_content = urllib2.urlopen(file_url).read()
    image_outfile = os.path.join(IMAGES_OUTDIR, os.path.basename(file_url))
    with open(image_outfile, 'wb') as wfh:
        wfh.write(image_content)
If you want to rename them, use the name you want instead of os.path.basename(file_url).
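For example, a sketch of renaming (stationName and prod come from the question's variables; the naming scheme and .gif extension are assumptions for illustration):

for i, file_url in enumerate(files):
    image_content = urllib2.urlopen(file_url).read()
    # name each file by station, product and a zero-padded index instead of the URL basename
    image_outfile = os.path.join(IMAGES_OUTDIR, '%s_%s_%03d.gif' % (stationName, prod[0], i))
    with open(image_outfile, 'wb') as wfh:
        wfh.write(image_content)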
I use these three functions for downloading from the internet:

from os import path, mkdir
from urllib import urlretrieve

def checkPath(destPath):
    # Add a final slash if missing
    if destPath != None and len(destPath) and destPath[-1] != '/':
        destPath += '/'
    if destPath != '' and not path.exists(destPath):
        mkdir(destPath)
    return destPath

def saveResource(data, fileName, destPath=''):
    '''Saves data to file in binary write mode'''
    destPath = checkPath(destPath)
    with open(destPath + fileName, 'wb') as fOut:
        fOut.write(data)

def downloadResource(url, fileName=None, destPath=''):
    '''Saves the content at url in folder destPath as fileName'''
    # Default filename
    if fileName == None:
        fileName = path.basename(url)
    destPath = checkPath(destPath)
    try:
        urlretrieve(url, destPath + fileName)
    except Exception as inst:
        print 'Error retrieving', url
        print type(inst)  # the exception instance
        print inst.args   # arguments stored in .args
        print inst
There are a bunch of examples here for downloading images from various sites.