I have a set of URLs and names in a file as follows:
www.test.yom/something/somethingelse/Profile.aspx?id=1
John Doe
www.test.yom/something/somethingelse/Profile.aspx?id=24
John Benjamin
www.test.yom/something/somethingelse/Profile.aspx?id=307
Benjamin Franklin
....
Each URL page contains normal HTML and any amount of text, tables, etc., but will always have exactly one image in an <img> tag.
My goal is to download this image somehow to my drive, renaming it with the second line name (i.e. "John Doe.jpg" and "John Benjamin.jpg").
Is there an easy way to accomplish this? I parsed out the URL-Name file from raw HTML on a different page using UNIX commands (grep, tr, sed), but I'm guessing this will require something a bit more intricate. Right now I'm thinking Python script, but I'm not exactly sure which libraries to look at or where to start in general (although I am familiar with Python language itself). I would also be down to use Java or any other language if it would make the process easier. Any advice?
Edit: So... ran into a problem where the urls require authentication to access. This is fine but the problem is that it is two-step authentication, and the second step is a passcode sent to mobile. :-( But thanks for the help!
You can put the links in a list or a file and use requests to get the HTML, then use BeautifulSoup to find the image you want, extract its src attribute, and use requests again to download the file. Both libraries are quite simple to use; you won't have a problem writing that script :).
A working version of the pseudo-code to help you start:
import requests
from bs4 import BeautifulSoup

url_list = ['url1', 'url2']
for url in url_list:
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    img_element = soup.find('img')
    image_url = img_element['src']
    # Download the image bytes and write them to a file
    img_data = requests.get(image_url).content
    with open('image.jpg', 'wb') as f:
        f.write(img_data)
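To feed that loop from the URL-name file in the question, a small helper can pair each URL with the name on the following line (a sketch, assuming the file strictly alternates URL and name lines; the helper name is made up):

```python
def parse_pairs(lines):
    """Pair up alternating URL / name lines from the listing file."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    return list(zip(lines[::2], lines[1::2]))
```

Each (url, name) pair then gives both the page to fetch and the filename to save, e.g. open(name + '.jpg', 'wb').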
You can use the extraction module with the requests module:
pip install requests
pip install extraction
Then:
import extraction
import requests
url = "http://google.com/"
html = requests.get(url).text
extracted = extraction.Extractor().extract(html, source_url=url)
print(extracted.image) # If you know that there is only one image in your page
print(extracted.images) # List of images on page
Output:
http://google.com/images/srpr/logo9w.png
['http://google.com/images/srpr/logo9w.png']
Note that source_url is optional in extract, but is recommended as it makes it possible to rewrite relative URLs and image URLs into absolute paths.
extracted.image is the first item of extracted.images if any exist, or None otherwise.
This is what I ended up doing to bypass the two-step authentication. Note that for the URLs I had, logging into one of them with the "Remember Me" option checked avoids the login page for the following method.
Download the "Save images" extension on Firefox. Restart Firefox.
In Tools -> "Save images" -> Options. Go to "Save" tab. In "Folder Options", pick folder to save files. In "File Names", pick "Use file name:". Enter appropriate file name.
Go to "http://tejji.com/ip/open-multiple-urls.aspx" in Firefox (not Chrome).
Copy and paste only the URLs into the textbox. Click "Submit". After all tabs load, close the tejji.com tab.
On the first profile page, right click -> "Save images" -> "Save images from ALL tabs".
Close the Save prompt if everything looks right.
All the images should now be in your designated folder.
All that's left is to rename the files based on the names (the files are numbered in order which coincide with order of names if you kept URLs in same order), but that should be rudimentary.
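That renaming step can be scripted too. A sketch, assuming the extension saved the images as 1.jpg, 2.jpg, ... in the same order as the names list (the numbering scheme and helper name are assumptions):

```python
import os

def rename_numbered(folder, names, ext='.jpg'):
    """Rename 1.jpg, 2.jpg, ... to 'First Last.jpg' using the names in order."""
    for i, name in enumerate(names, start=1):
        # Assumes files were saved with sequential numeric names
        os.rename(os.path.join(folder, f'{i}{ext}'),
                  os.path.join(folder, f'{name}{ext}'))
```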
I have a bunch of folders in Dropbox with pictures in them, and I'm trying to get a list of URLs for all of the pictures in a specific folder.
import requests
import json
import dropbox

TOKEN = 'my_access_token'
dbx = dropbox.Dropbox(TOKEN)
for entry in dbx.files_list_folder('/Main/Test').entries:
    # print(entry.name)
    print(entry.file_requests.FileRequest.url)
    # print(entry.files.Metadata.path_lower)
    # print(entry.file_properties.PropertyField)
Printing the entry name correctly lists all of the file names in the folder, but everything else says 'FileMetadata' object has no attribute 'get_url'.
The files_list_folder method returns a ListFolderResult, where ListFolderResult.entries is a list of Metadata. Files in particular are FileMetadata.
Also, note that you aren't guaranteed to get everything back from files_list_folder method, so make sure you implement files_list_folder_continue as well. Refer to the documentation for more information.
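A minimal sketch of that pagination loop, assuming dbx is a dropbox.Dropbox client as in the question (the helper name is made up):

```python
def list_all_entries(dbx, folder):
    """Collect every entry in a folder, following the cursor past the first page."""
    result = dbx.files_list_folder(folder)
    entries = list(result.entries)
    while result.has_more:
        # Keep asking for the next page until the server says we have everything
        result = dbx.files_list_folder_continue(result.cursor)
        entries.extend(result.entries)
    return entries
```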
The kind of link you mentioned is a shared link. FileMetadata entries don't themselves contain a link like that. You can get the path from path_lower though. For example, in the for loop in your code, that would look like print(entry.path_lower).
You should use sharing_list_shared_links to list existing links, and/or sharing_create_shared_link_with_settings to create shared links for any particular file as needed.
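A sketch of the creation side, again assuming dbx is a dropbox.Dropbox client (the helper name is hypothetical, and note that sharing_create_shared_link_with_settings raises an ApiError when a link already exists, which is why sharing_list_shared_links is also mentioned):

```python
def shared_links_for_folder(dbx, folder):
    """Create a shared link for every file in a folder, keyed by file name."""
    links = {}
    for entry in dbx.files_list_folder(folder).entries:
        # Create a new shared link for this file's path
        link = dbx.sharing_create_shared_link_with_settings(entry.path_lower)
        links[entry.name] = link.url
    return links
```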
I'm trying to make an automated program that downloads a certain file from a link.
Problem is, I don't know what this file will be called. It's always a .zip, for example: filename_4213432.zip. The link does not include this filename in it; it looks something like https://link.com/api/download/433265902, so it's impossible to get the filename through the link. Is there a way to fetch this name and download the file?
import requests

print("link:")
url = input("> ")
request = requests.get(url, allow_redirects=True)
I'm stuck at this point because I don't know what to put in my open() now.
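The server usually sends the real filename in the Content-Disposition response header (e.g. attachment; filename="filename_4213432.zip"). A sketch of a helper that pulls it out (the regex, helper name, and fallback name are assumptions, and not every server sends the header):

```python
import re

def filename_from_headers(headers, fallback='download.zip'):
    """Extract the filename from a Content-Disposition header, if present."""
    cd = headers.get('Content-Disposition', '')
    # Match both filename="quoted.zip" and filename=bare.zip forms
    match = re.search(r'filename="?([^";]+)"?', cd)
    return match.group(1) if match else fallback
```

With the request above, that would be open(filename_from_headers(request.headers), 'wb') followed by writing request.content.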
I need to upload a file using an 'upload' button. After that a window appears, but I can't find the exact ID in the HTML code. Here are the screenshots and my code:
time.sleep(1)
element = driver.find_element_by_id("Upload-Action-Ico").click()
driver.find_element_by_xpath("//*[contains(text(), 'File')]").send_keys("file path")
I think that the ID is 'file' so I think this should work
time.sleep(1)
element=driver.find_element_by_id("file").click()
Try clicking on it and show the HTML code. There should be a word like "Button" or similar. Can you share the URL of the site?
I hope I can help you, and excuse me for my English (it isn't my mother language).
The input field does not contain any text. And its id is explicitly mentioned in the html, so you can try find_element_by_id:
driver.find_element_by_id("file").send_keys("file path")
If this doesn't work for you, then you can try using the xpath:
driver.find_element_by_xpath("//*[@id='file']").send_keys("file path")
You can use this code to select files, and after that, you should click on the upload button.
import os

filePath = os.path.join(os.getcwd(), 'img.jpg')
driver.find_element_by_id('Upload-Action-Ico').send_keys(filePath)
os.getcwd() returns the current working directory; img.jpg is located right next to the running script in the same directory.
I'm crawling a website and saving selected pages for offline browsing. Of course the links between pages are broken so I'd like to do something about that. Is there a somewhat simple way in python to relativize url paths like this without reinventing the wheel:
/folder1/folder2/somepage.html becomes----> folder2/somepage.html
/folder1/otherpage.html becomes----> ../otherpage.html
I understand the function would need two url paths to determine the relative path as the linked resource's path is relative to the page it appears on.
The newer pathlib (only briefly mentioned in the duplicate) also works with URLs:
from pathlib import Path
abs_p = Path('https://docs.python.org/3/library/pathlib.html')
rel_p = abs_p.relative_to('https://docs.python.org/3/')
print(rel_p) # library/pathlib.html
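For the two-path case the question actually asks about (a link's path relativized against the page it appears on), posixpath.relpath is a closer fit. A sketch, assuming both arguments are site-absolute paths (the helper name is made up):

```python
import posixpath

def relativize(target, page):
    """Rewrite an absolute site path relative to the directory of the page it appears on."""
    # Links resolve relative to the page's directory, not the page itself
    return posixpath.relpath(target, posixpath.dirname(page))

print(relativize('/folder1/folder2/somepage.html', '/folder1/index.html'))      # folder2/somepage.html
print(relativize('/folder1/otherpage.html', '/folder1/folder2/somepage.html'))  # ../otherpage.html
```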
I have to download a lot of documents from a webpage. They are wmv files, PDFs, BMPs, etc. Of course, all of them have links to them. So each time, I have to right-click a file, select 'Save Link As', then save it as type All Files. Is it possible to do this in Python? I searched the SO DB and folks have answered the question of how to get the links from the webpage; I want to download the actual files. Thanks in advance. (This is not a HW question :)).
Here is an example of how you could download some chosen files from http://pypi.python.org/pypi/xlwt
You will need to install mechanize first: http://wwwsearch.sourceforge.net/mechanize/download.html
import mechanize
from time import sleep

# Make a Browser (think of this as Chrome or Firefox etc.)
br = mechanize.Browser()

# Visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
# for more ways to set up your br browser object, e.g. so it looks like Mozilla,
# and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

with open("source.html", "wb") as f:
    f.write(br.response().read())  # can be helpful for debugging

filetypes = [".zip", ".exe", ".tar.gz"]  # you will need to do some kind of pattern matching on your files
myfiles = []
for l in br.links():  # you can also iterate through br.forms() to print forms on the page!
    for t in filetypes:
        if t in str(l):  # check if this link has the file extension we want (you may choose to use regular expressions)
            myfiles.append(l)

def downloadlink(l):
    # Perhaps you should ensure the file doesn't already exist before opening it.
    with open(l.text, "wb") as f:
        br.click_link(l)
        f.write(br.response().read())
    print(l.text, "has been downloaded")
    # br.back()

for l in myfiles:
    sleep(1)  # throttle so you don't hammer the site
    downloadlink(l)
Note: In some cases you may wish to replace br.click_link(l) with br.follow_link(l). The difference is that click_link returns a Request object whereas follow_link will directly open the link. See Mechanize difference between br.click_link() and br.follow_link()
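mechanize is showing its age; the same link-filtering step can be sketched with a small helper (the name is hypothetical) that would slot into a requests + BeautifulSoup version of the script:

```python
def matching_links(hrefs, filetypes=('.zip', '.exe', '.tar.gz')):
    """Keep only the link targets that end with one of the wanted extensions."""
    # Case-insensitive comparison so .ZIP and .zip both match
    return [h for h in hrefs if h.lower().endswith(filetypes)]
```

With requests and BeautifulSoup you would collect hrefs as [a['href'] for a in soup.find_all('a', href=True)], filter them with this helper, and write each requests.get(url).content to a file.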
Follow the Python code in this link: wget-vs-urlretrieve-of-python.
You can also do this very easily with Wget. Try the --accept, --recursive and --level options. For example:
wget --recursive --level=2 --accept wmv,doc http://www.example.com/files/