I'm downloading Excel files from a website using beautifulsoup4.
I only need to download the files. I don't need to rename them, just download them to a folder, relative to where the code is.
The function takes in a BeautifulSoup object, searches for `<a>` tags, then makes a request to each link.
def save_excel_files(sfile):
    print("starting")
    for link in sfile.find_all("a"):
        candidate_link = link.get("href")
        if (candidate_link is not None
                and "Flat.File" in candidate_link):
            xfile = requests.get(candidate_link)
            if xfile:
                ### I just don't know what to do...
I've tried using os.path, with open("xtest", "wb") as f:, and many other variations. Been at this for two evenings and totally stuck.
The first issue is that I can't even get the files to download and save anywhere. xfile resolves to <Response [200]>, so that part is working; I'm just having a hard time coding the actual download and save.
Something like this should've worked :
xfile = requests.get(candidate_link)
file_name = candidate_link.split('/')[-1]
if xfile:
    with open(file_name, "wb") as f:
        f.write(xfile.content)
Tested with the following link I found randomly on Google:
candidate_link = "http://berkeleycollege.edu/browser_check/samples/excel.xls"
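Putting those pieces together, a minimal end-to-end sketch might look like the following. The base_url parameter and the "downloads" folder name are assumptions for illustration; urljoin is used in case the hrefs on the page are relative rather than absolute:

```python
import os
from urllib.parse import urljoin

import requests


def filename_from_url(url):
    # use the last path segment of the URL as the local file name
    return url.split("/")[-1]


def save_excel_files(sfile, base_url, folder="downloads"):
    # folder is created relative to where the code runs
    os.makedirs(folder, exist_ok=True)
    for link in sfile.find_all("a"):
        candidate_link = link.get("href")
        if candidate_link is not None and "Flat.File" in candidate_link:
            full_url = urljoin(base_url, candidate_link)  # resolves relative hrefs
            xfile = requests.get(full_url)
            if xfile.ok:
                path = os.path.join(folder, filename_from_url(full_url))
                with open(path, "wb") as f:
                    f.write(xfile.content)
```

Writing `xfile.content` in binary mode is the key step: `.content` is the raw bytes of the response, which is what an .xls file is.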
I will be the first to tell you that my Python skills are beginner at best, so please forgive my ignorance here.
By way of background, I have created a Python script in Anaconda Jupyter Notebooks that reads a single PDF from a folder, C:\Users\...\PDFs , extracts the text of said PDF, and then through some splicing puts the text of interest into a CSV file that it creates.
The problem is that I want to execute this script on hundreds of PDFs (the ipynb script itself works just fine when executed on individual PDFs, I just don't want to keep manually changing the file name in the Notebook/Python script). Using pdfreader, my script starts with the following:
import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer
fd = open(r'C:\Users\...\PDFs\[pdf name].pdf', 'rb')
viewer = SimplePDFViewer(fd)
doc = PDFDocument(fd)
This is where I get stuck - I cannot figure out how to run this on/import all PDFs in the folder. I have seen some people use a variable file name with an asterisk, eg C:\Users\...\PDFs\*.pdf, however I can't get that to go. It seems like it might be possible to save my ipynb as a py file, and then somehow run it in Anaconda Prompt, however I have struggled getting this method to work as well. I am unfamiliar with bat files, but those too seem potentially promising.
Does anyone know of a way to run this script on many PDFs in a single directory at once? I have scrounged around a ton, but for the life of me cannot figure this out. Any help would be greatly appreciated! :)
You can use the glob module to gather all of the files names, then loop through them.
import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer
from glob import glob
pdf_files = glob(r'C:\Users\...\PDFs\*.pdf')
for path in pdf_files:
    fd = open(path, 'rb')
    viewer = SimplePDFViewer(fd)
    doc = PDFDocument(fd)
    ...
    fd.close()
I'm trying to make an automated program that downloads a certain file from a link.
Problem is, I don't know what this file will be called. It's always a .zip, for example: filename_4213432.zip. The link does not include this filename in it. It looks something like this: https://link.com/api/download/433265902. Therefore it's impossible to get the filename through the link. Is there a way to fetch this name and download it?
print("link:")
url = input("> ")
request = requests.get(url, allow_redirects=True)
I'm stuck at this point because I don't know what to put in my open() now.
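One common approach is to read the filename out of the Content-Disposition response header, which servers often (but not always) set for attachment downloads. A minimal sketch, assuming the server sends that header and falling back to a placeholder name when it doesn't:

```python
import re

import requests


def filename_from_headers(headers, fallback="download.zip"):
    # Content-Disposition typically looks like:
    #   attachment; filename="filename_4213432.zip"
    cd = headers.get("content-disposition", "")
    match = re.search(r'filename="?([^";]+)"?', cd)
    return match.group(1) if match else fallback


def download(url):
    response = requests.get(url, allow_redirects=True)
    name = filename_from_headers(response.headers)
    with open(name, "wb") as f:
        f.write(response.content)
    return name
```

If the server doesn't send Content-Disposition at all, there is no filename to recover, and you'd have to invent one yourself (e.g. from the numeric ID at the end of the URL).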
I am working on this project that involves downloading a daily .csv. I have successfully written the code to download the .csv file through selenium. however, I am having trouble changing the directory when running the entire code.
The code in question is as follows:
download_purchases = driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[3]/div/div[2]')
download_purchases.click()
fp = os.path.expanduser('~')+'/Desktop/Export_Purchasing/CSV/'
os.chdir(fp)
files = [f for f in os.listdir(fp)]
When I run the whole script up until this point, the files list comprehension produces an empty list. However, when I re-run it (after having tried to run the whole code from the start), the list comprehension is able to detect the downloaded .csv.
How can I make it so that the files are detected on the first pass? I tried quitting the driver with:
driver.quit()
but this didn't fix the problem.
It looks like the files are probably not downloaded by the time you are hitting your last line: files = [f for f in os.listdir(fp)]. To test this you can add in a sleep like:
import time
download_purchases = driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[3]/div/div[2]')
download_purchases.click()
fp = os.path.expanduser('~')+'/Desktop/Export_Purchasing/CSV/'
os.chdir(fp)
print("Printed immediately.")
time.sleep(10)
files = [f for f in os.listdir(fp)]
If that works then you know that it is merely a timing issue and you can employ a more sophisticated solution to continue after the download is complete.
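As a sketch of that more sophisticated solution, you could poll the download folder until a finished file appears and no partial-download files remain. Which partial-file extensions to watch for is an assumption about the browser: Chrome writes .crdownload files while downloading, Firefox uses .part.

```python
import os
import time


def wait_for_download(folder, timeout=60, poll=0.5):
    """Poll `folder` until at least one finished file exists, or raise on timeout."""
    partial_exts = (".crdownload", ".part", ".tmp")  # browser-dependent assumption
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = os.listdir(folder)
        finished = [f for f in files if not f.endswith(partial_exts)]
        in_progress = [f for f in files if f.endswith(partial_exts)]
        if finished and not in_progress:
            return finished
        time.sleep(poll)
    raise TimeoutError("download did not finish within {} seconds".format(timeout))
```

You would call `files = wait_for_download(fp)` right after `download_purchases.click()` instead of using a fixed sleep, so the script continues as soon as the download actually completes.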
I need to remove the last page of a pdf file. I have multiple pdf files in the same directory. So far I have the next code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
def changefile(file):
    infile = PdfFileReader(file, "rb")
    output = PdfFileWriter()
    numpages = infile.getNumPages()
    for i in range(numpages - 1):
        p = infile.getPage(i)
        output.addPage(p)
    with open(file, 'wb') as f:
        output.write(f)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.endswith(".pdf") or file.endswith(".PDF"):
        changefile(file)
My script worked while testing it. I need this script for my work. Each day I need to download several e-invoices from our main external supplier. The last page always mentions the conditions of sales and is useless. Unfortunately our supplier left a signature, causing my script not to work properly.
When I am trying to run it on the invoices I receive the following error:
line 1901, in read
raise utils.PdfReadError("Could not find xref table at specified location")
PyPDF2.utils.PdfReadError: Could not find xref table at specified location
I was able to fix it on my Linux laptop by running qpdf invoice.pdf invoice-fix. I can't install QPDF at work, where Windows is being used.
I suppose this error is triggered by the signature left by our supplier on each PDF file.
Does anyone know how to fix this error? I am looking for an efficient method to fix the issue with a broken PDF file and its signature. There must be something better than opening each PDF file with Adobe and removing the signature manually ...
Automation would be nice, because I daily put multiple invoices in the same directory.
Thanks.
The problem is probably a corrupt PDF file. QPDF is indeed able to work around that. That's why I would recommend using the pikepdf library instead of PyPDF2 in this case. Pikepdf is based on QPDF. First install pikepdf (for example using pip or your package manager) and then try this code:
import pikepdf
import os
def changefile(file):
    print("Processing {0}".format(file))
    pdf = pikepdf.Pdf.open(file)
    lastPageNum = len(pdf.pages)
    pdf.pages.remove(p=lastPageNum)
    pdf.save(file + '.tmp')
    pdf.close()
    os.unlink(file)
    os.rename(file + '.tmp', file)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.lower().endswith(".pdf"):
        changefile(file)
Link to pikepdf docs: https://pikepdf.readthedocs.io/en/latest/
Let me know if that works for you.
I have a python program that downloads article text and then turns it into a txt file. The program currently spits out the txt files in the directory the program is located in. I would like to arrange this text in folders specific to the news source they came from. Could I save the data in the folder in the python program itself and change the directory as the news source file changes? Or should I create a shell script that runs the python program inside the folder it needs to be in? Or is there a better way to sort these files that I am missing?
Here is the code of the Python program:
import feedparser
from goose import Goose
import urllib2
import codecs

url = "http://rss.cnn.com/rss/cnn_tech.rss"
feed = feedparser.parse(url)
g = Goose()
entryLength = len(feed['entries'])
count = 0
while True:
    article = g.extract(feed.entries[count]['link'])
    title = article.title
    text = article.cleaned_text
    file = codecs.open(feed['entries'][count]['title'] + ".txt", 'w', encoding='utf-8')
    file.write(text)
    file.close()
    count = count + 1
    if count == entryLength:
        break
If you only give your save functions filenames, they will save to the current directory. However, if you provide them with paths, your files will end up there. Python takes care of it.
import os

folder = 'whatever'  # the folder you wish to save the files in
name = 'somefilename.txt'
filename = os.path.join(folder, name)
Using that filename will make the file end up in the folder 'whatever/'.
Edit: I see you've posted your code now. As br1ckb0t mentioned in his comment, in your code you could write something like codecs.open(folder + feed['entries'].... Make sure to append a slash to folder if you do that, or it'll just end up as part of the filename.
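Putting it together with the code above, a small helper like this sketch would keep each source's articles in its own folder. The "articles" parent folder and the source name are assumptions; you'd derive the source from the feed URL you're processing (it is written with codecs.open so it also works under Python 2, which the urllib2 import above suggests you're using):

```python
import codecs
import os


def save_article(source, title, text):
    # one folder per news source, created on first use
    folder = os.path.join("articles", source)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    path = os.path.join(folder, title + ".txt")
    f = codecs.open(path, "w", encoding="utf-8")
    f.write(text)
    f.close()
    return path
```

In the loop you would then replace the codecs.open(...) / write / close lines with a single save_article("cnn_tech", feed['entries'][count]['title'], text) call, so no directory changing or shell script is needed.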