My code is working correctly to scour a directory of PDFs, download the weblinks embedded within those PDFs, and sequentially name them with the appropriate file extension.
That being said - I am getting a few random files that download but DON'T have an extension associated with them. In doing quality checks, I have all the attachments that matter - these extra files are truly garbage.
Is there a way to not download them or build in a check in the code so that I don't end up with these phantom files?
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse
import requests
## Accessing and Creating Six Digit File Code
pdf_dir = "./"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = 1
    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x)
            extension = os.path.splitext(x)[1]
            r = requests.get(x)
            with open('temporary', 'wb') as f:
                f.write(r.content)
            ## Concatenate File Name Once Downloaded
            os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
            ## Increase Attachment Count
            attachment_counter += 1
    for x in url_list["pdf"]:
        parsed_url = urllib.parse.quote(x)
        extension = os.path.splitext(x)[1]
        r = requests.get(x)
        with open('temporary', 'wb') as f:
            f.write(r.content)
        ## Concatenate File Name Once Downloaded
        os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
        ## Increase Attachment Count
        attachment_counter += 1
It's not clear which part of your code produces these "phantom" files, but anywhere you want to avoid downloading a file which doesn't have an extension, you can make the download conditional. If the component after the last slash doesn't contain a dot, do nothing.
if '.' in x.split('/')[-1]:
    ... download(x) etc
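Applied to the loop in your script, the check might look roughly like this (a sketch only, reusing the names from your code):

for x in url_list["url"]:
    if x[0:4] == "http":
        # Skip "phantom" links whose last path component has no dot, i.e. no extension
        if '.' not in x.split('/')[-1]:
            continue
        extension = os.path.splitext(x)[1]
        r = requests.get(x)
        with open('temporary', 'wb') as f:
            f.write(r.content)
        os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
        attachment_counter += 1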
Related
I am working on a project where I need to scrape images off of the web. To do this, I write the image links to a file and then download each of them to a folder with requests. At first I used Google as the scrape site, but due to several reasons I have decided that Wikipedia is a much better alternative. However, after I tried the first time, many of the images couldn't be opened, so I tried again with the change that when I downloaded the images, I saved them under names whose endings matched the endings of the links. More images were accessible like this, but many still could not be opened. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I used my function to download them afterwards, they kept downloading correctly (i.e. I could access them). I am not sure if it is important, but the image endings that I generally come across are svg.png and png. I want to know why this is occurring and what I may be able to do to prevent it. I have left some of my code below. Thank you.
Function:
def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
    f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png
If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)
My previous answer works for PNGs only.
For SVG files you need to check whether the file contents start with the string "<svg" and create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path

# urls are stored in a file 'urls.txt'.
with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip()  # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue
        # Reuse the already-downloaded content instead of fetching the url a second time
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
I would like to convert a large batch of MS Word files into the plain text format. I have no idea how to do it in Python. I found the following code online. My path is local and all file names are like cx-xxx (i.e. c1-000, c1-001, c2-000, c2-001 etc.):
from docx import Document
import io
import shutil
import os

def convertDocxToText(path):
    for d in os.listdir(path):
        fileExtension = d.split(".")[-1]
        if fileExtension == "docx":
            docxFilename = path + d
            print(docxFilename)
            document = Document(docxFilename)
            textFilename = path + d.split(".")[0] + ".txt"
            with io.open(textFilename, "w", encoding="utf-8") as textFile:
                for para in document.paragraphs:
                    textFile.write(para.text)

path = "/home/python/resumes/"
convertDocxToText(path)
Convert docx to txt with pypandoc:
import pypandoc
# Example file:
docxFilename = 'somefile.docx'
output = pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
assert output == ""
See the official documentation here:
https://pypi.org/project/pypandoc/
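Since the question is about a large batch of files, the single-file call can be wrapped in a loop over the folder. A minimal sketch, assuming the .docx files sit in /home/python/resumes/ as in the question and each output file should simply swap the extension:

import glob
import os
import pypandoc

for docx_path in glob.glob('/home/python/resumes/*.docx'):
    # Replace the .docx extension with .txt for the output file
    txt_path = os.path.splitext(docx_path)[0] + '.txt'
    pypandoc.convert_file(docx_path, 'plain', outputfile=txt_path)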
You can also use the docx2txt library in Python. Here's an example:
I use glob to iterate over all DOCX files in the folder.
Note: I use a little slicing on the original name in order to re-use it in the TXT filename.
If there's anything I've forgotten to explain, tag me and I'll edit it in.
import docx2txt
import glob

directory = glob.glob('C:/folder_name/*.docx')
for file_name in directory:
    with open(file_name, 'rb') as infile:
        with open(file_name[:-5] + '.txt', 'w', encoding='utf-8') as outfile:
            doc = docx2txt.process(infile)
            outfile.write(doc)
print("=========")
print("All done!")
GroupDocs.Conversion Cloud SDK for Python supports 50+ file formats conversion. Its free plan provides 150 free API calls monthly.
# Import module
import groupdocs_conversion_cloud
from shutil import copyfile
# Get your client_id and client_key at https://dashboard.groupdocs.cloud (free registration is required).
client_id = "xxxxx-xxxx-xxxx-xxxx-xxxxxxxx"
client_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(client_id, client_key)
try:
    # Convert DOCX to TXT
    # Prepare request
    request = groupdocs_conversion_cloud.ConvertDocumentDirectRequest("txt", "C:/Temp/sample.docx")
    # Convert
    result = convert_api.convert_document_direct(request)
    copyfile(result, 'C:/Temp/sample.txt')
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling get_supported_conversion_types: {0}".format(e.message))
This question already has answers here: Python-FTP download all files in directory (6 answers)
I am trying to download data from the site below using Python.
ftp://chain.physics.unb.ca/gps/data/nvd/
When I open the link in a browser, I am redirected to a credentials page. I have the appropriate credentials to get access to the data, and after putting them in I am taken to the data listing.
Once I have access to this page, I can loop over the URLs and use the wget library to download the required data.
The URL to a datafile looks like ftp://chain.physics.unb.ca/gps/data/nvd/arvc/2017/03/arvc060B.17_.gz.
I believe that I can pick it up from the data page with an appropriate loop to navigate around the data using the URL. I am having a really hard time trying to get past the credentials page to the URL with the FTP protocol.
Please advise.
Something like this, perhaps?
# write all files in one folder, with formatted date and time, to a text file...
import ftplib
import datetime
from datetime import datetime

ftp = ftplib.FTP('ftp.yours.com', 'u_name', 'pswd')
#ftp.nlst()
#directory = ftp.nlst('/emm1/')
ftp.cwd('')
ftp.retrlines('LIST')

filenames = []
ftp.retrlines('NLST', filenames.append)

# writes file name, modified date and file size.
with open('C:\\your_path\\test.txt', 'w') as f:
    for filename in filenames:
        datetimeftp = ftp.sendcmd('MDTM ' + filename)
        modifiedTimeFtp = datetime.strptime(datetimeftp[4:], "%Y%m%d%H%M%S").strftime("%d %b %Y %H:%M:%S")
        size = ftp.size(filename)
        filesize = "{:.2f}".format(size / 1024)
        f.write(filename)
        f.write(':')
        f.write(modifiedTimeFtp)
        f.write(':')
        f.write(filesize + ' KB')
        f.write('\n')
Or, maybe this?
import ftplib
from ftplib import FTP
from datetime import datetime

filenames = []
data = []

ftp = ftplib.FTP('ftp.anything.com', 'u_name', 'ps_wd')

def get_dirs_ftp(folder=""):
    contents = ftp.nlst(folder)
    folders = []
    for item in contents:
        if "." not in item:
            folders.append(item)
    return folders

def get_all_dirs_ftp(folder=""):
    dirs = []
    new_dirs = get_dirs_ftp(folder)
    while len(new_dirs) > 0:
        for dir in new_dirs:
            dirs.append(dir)
        old_dirs = new_dirs[:]
        new_dirs = []
        for dir in old_dirs:
            for new_dir in get_dirs_ftp(dir):
                new_dirs.append(new_dir)
    dirs.sort()
    return dirs

#allfiles = []
# get parent and child folders in directory
all_dirs = get_all_dirs_ftp()

# create a list to append metadata
dir_list = []
for dir in all_dirs:
    ftp.cwd('/' + dir + '/')
    print(dir)
    dir_list.append(dir)
    ftp.dir(dir_list.append)

len(dir_list)

# you probably want to dump the results to a file...
outF = open('C:/your_path/filenames.csv', 'w')
for line in dir_list:
    # write line to output file
    outF.write(line)
    outF.write("\n")
outF.close()
print('Done!!')
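Note that both snippets above only list directories and file metadata; they don't fetch the data itself. A minimal sketch of that last step, assuming the host and path from the question and using ftplib's login and retrbinary (the username and password are placeholders):

import ftplib

ftp = ftplib.FTP('chain.physics.unb.ca')
ftp.login('your_username', 'your_password')  # credentials go here rather than in the URL
ftp.cwd('/gps/data/nvd/arvc/2017/03')
for name in ftp.nlst():
    if name.endswith('.gz'):
        # Download each compressed data file to the current local directory
        with open(name, 'wb') as f:
            ftp.retrbinary('RETR ' + name, f.write)
ftp.quit()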
I used the following code to read the pdf file, but it does not read it. What could possibly be the reason?
from PyPDF2 import PdfFileReader
reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)
The output is [u''] instead of reading the content.
import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # split into lines; iterating over the string directly would yield single characters
    for line in text_lower.split("\n"):
        if re.search("abc", line):
            print(line)
I use it to iterate page by page of the PDF, search for key terms in it and process further.
Maybe this can help you read the PDF.
import pyPdf

def getPDFContent(path):
    content = ""
    pages = 10
    # Note: pyPdf and the built-in file() are Python 2 only
    p = file(path, "rb")
    pdf_content = pyPdf.PdfFileReader(p)
    for i in range(0, pages):
        content += pdf_content.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
I think you need to specify the drive name; it's missing from your path. For example "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried it and can read the file without any problem.
Or, if you want to locate the file with the os module (which you didn't really associate with your directory), you can try the following:
from PyPDF2 import PdfFileReader
import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

f = open(directory, 'rb')
reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')
f.close()
print(contents)
The find function can be found in Nadia Alramli's answer here Find a file in python
To read files from multiple folders in a directory, the code below can be used.
This example reads PDF files:
import os
from tika import parser

path = "/usr/local/"  # path directory
directory = os.path.join(path)
for r, d, f in os.walk(directory):  # going through subdirectories
    for file in f:
        if ".pdf" in file:  # reading only PDF files
            file_join = os.path.join(r, file)  # getting full path
            file_data = parser.from_file(file_join)  # parsing the PDF file
            text = file_data['content']  # read the content
            print(text)  # print the content
def getTextPDF(pdfFileName, password=''):
    """ Extract text from a pdf """
    import PyPDF2
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from nltk import sent_tokenize
    pdf_file = open(pdfFileName, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    if password != '':
        read_pdf.decrypt(password)
    text = []
    for i in range(0, read_pdf.getNumPages()):
        text.append(read_pdf.getPage(i).extractText())
    text = '\n'.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text
The issue was one of two things: (1) The text was not on page one - hence a user error. (2) PyPDF2 failed to extract the text - hence a bug in PyPDF2.
Sadly, the second one still happens for some PDFs.
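One quick way to tell the two cases apart is to run extraction over every page instead of only page 0. A small sketch using the same PdfFileReader API as in the question; if every page comes back empty, the PDF is most likely scanned/image-based or one that PyPDF2 cannot handle:

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
for i, page in enumerate(reader.pages):
    text = page.extractText()
    # An empty string on every page suggests the text isn't extractable with PyPDF2
    print(i, repr(text[:80]))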
Hello Rahul Pipalia,
If PyPDF2 is not installed in your Python environment, install it first, then use this module.
Installation steps for Ubuntu (install python-pypdf):
First, open a terminal.
Then type sudo apt-get install python-pypdf
Solution to your problem
Try the code below:
# Import Library
import PyPDF2

# Give the file name you want to read, with the ".pdf" extension
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Give the page number of the pdf file (how many pages are in the pdf file).
# @param Page_Number_of_the_PDF_file: give the page number here, i.e. 1
page = read_pdf.getPage(Page_Number_of_the_PDF_file)
page_content = page.extractText()

# Display content of the pdf
print(page_content)
Download the PDF from the link below and try this code:
https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1
I hope my answer is helpful.
If you have any queries, please comment.
I have a long list of .json files that I need to download to my computer. I want to download them as .json files (so no parsing or anything like that at this point).
I have some code that works for small files, but it is pretty buggy. Also it doesn't handle multiple links well.
Appreciate any advice to fix up this code:
import os
filename = 'test.json'
path = "C:/Users//Master"
fullpath = os.path.join(path, filename)
import urllib2
url = 'https://www.premierlife.com/secure/index.json'
response = urllib2.urlopen(url)
webContent = response.read()
f = open(fullpath, 'w')
f.write(webContent)
f.close
It's creating a blank file because the f.close at the end should be f.close() - without the parentheses the close method is only referenced, never called, so the file is never properly closed and flushed.
I took your code and made a little function, then called it in a little loop to go through a .txt file with the list of urls called "list_of_urls.txt", having one url per line (you can change the delimiter in the split function if you want to format it differently).
def save_json(url):
    import os
    import urllib2
    # this replaces / and : in urls
    filename = url.replace('/', '').replace(':', '')
    path = "C:/Users/Master"
    fullpath = os.path.join(path, filename)
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open(fullpath, 'w')
    f.write(webContent)
    f.close()
And then the loop:
f = open('list_of_urls.txt')
p = f.read()
url_list = p.split('\n') #here's where \n is the line break delimiter that can be changed
for url in url_list:
save_json(url)
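If you are on Python 3, where urllib2 no longer exists, roughly the same helper can be written with urllib.request and with blocks, which also makes the missing-parentheses close bug impossible. A sketch that keeps the same path and naming scheme as above:

import os
import urllib.request

def save_json(url):
    # same naming scheme as above: strip / and : from the url
    filename = url.replace('/', '').replace(':', '')
    fullpath = os.path.join("C:/Users/Master", filename)
    with urllib.request.urlopen(url) as response:
        content = response.read()
    # 'wb' because read() returns bytes; the with block always closes the file
    with open(fullpath, 'wb') as f:
        f.write(content)

with open('list_of_urls.txt') as f:
    for url in f.read().split('\n'):
        if url:
            save_json(url)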