How to create a text file from pdf using Python? - python

I am trying to write a block of code that does this: it first extracts text from a pdf and then creates a text file with the content in it. This is what I wrote:
import os
import pyPdf
import re
## function that extracts text from pdf
def pdfcontent(filename):
    ct = ""
    pdf = pyPdf.PdfFileReader(file(filename,"rb"))
    for i in range(0,pdf.getNumPages()):
        ct += pdf.getPage(i).extractText() + "\n"
    return ct
## function that generates a txt file from a pdf
def pdftotxt(filename):
    ##first, convert pdf to txt
    pdfct = pdfcontent(filename)
    ##fix filename problem
    newfn = re.sub(".pdf", "", filename)
    #now generate txt
    fo = open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\' + newfn + ".txt","wb")
    fo.write(pdfct)
    fo.close()
pdftotxt("PDFfromDocumentum.pdf")
EDIT: I fixed my previous problems and then another problem came up:
File "C:/Users/xxx/PycharmProjects/untitled/fdsa", line 22
fo = open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\' + newfn + ".txt","wb")
^
SyntaxError: EOL while scanning string literal
It seems to me that Python took
fo = open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\' + newfn + ".txt","wb")
as a string instead of a command. What's the solution to this problem?

If you want your script to create a new file if it does not exist use "wb" as the mode.
Refer to this for more information on using file modes.
EDIT ( Based on your edit )
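The reason you are getting "EOL while scanning string literal" is that the backslash before the closing apostrophe (\') escapes it, so the string literal never ends. This happens even in a raw string: a backslash immediately before a quote keeps that quote from terminating the literal. Escape the final backslash with another backslash, i.e. \\'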

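Even though you are using a raw string, you still need to escape the last backslash:
open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\\' + newfn + ".txt","wb")
The raw string then ends in two backslashes, which Windows accepts as a single path separator. See "Python raw strings and trailing backslash" for details.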
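Another way to sidestep the trailing-backslash problem is to let the standard library join the directory and the file name. This is a minimal sketch, not the original poster's code: the helper name txt_path_for is made up for illustration, and ntpath is used only so the example behaves the same on every platform (on Windows, os.path is the same module).

```python
import ntpath  # Windows flavour of os.path; importable on every platform


def txt_path_for(pdf_name, out_dir):
    """Return out_dir joined with pdf_name, its extension swapped for .txt.

    txt_path_for is a hypothetical helper, not part of the original code.
    """
    stem, _ext = ntpath.splitext(pdf_name)
    return ntpath.join(out_dir, stem + ".txt")


# No trailing backslash in the raw string, so the literal is valid.
out_dir = r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt'
print(txt_path_for("PDFfromDocumentum.pdf", out_dir))
```

Using splitext also avoids a side effect of re.sub(".pdf", "", filename), where the unescaped dot matches any character.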

Related

Python open .doc file

I'm working on a project in which I need to read the text from multiple doc and docx files. The docx files were easily handled with the docx2txt module, but I cannot for the life of me make it work for doc files. I've tried textract, but it doesn't seem to work on Windows. I just need the text in the file, no pictures or anything like that. Any ideas?
I found that this seems to work:
import win32com.client
text = win32com.client.Dispatch("Word.Application")
text.visible = False
wb = text.Documents.Open("myfile.doc")
document = text.ActiveDocument
print(document.Range().Text)
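I had a similar issue; the following function worked for me.
from pathlib import Path

# NOTE: special_chars is not defined in this answer. From its use below it is
# a dict whose keys are the str() form of a single byte (the same style as
# "b'\\x00'") and whose values are the replacement text.
def get_string(path: Path) -> str:
    string = ''
    with open(path, 'rb') as stream:
        # skip to the offset where the text started in the tested file
        stream.seek(2560)
        current_stream = stream.read(1)
        # read one byte at a time until the first null byte
        while not (str(current_stream) == "b'\\x00'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum() or char == ' ':
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string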
I tested it on a .doc file looking like the following:
picture of .doc file
The output from:
string = get_string(filepath)
print(string)
is:
The big red fox jumped over the small barrier to get to the chickens on the other side
And the chickens ran about but had no luck in surviving the day
this||||that||||The other||||

Reading csv from FTP folder

I am trying to read a CSV file from an FTP folder:
ftp = FTP('adr')
ftp.login(user='xxxx', passwd = 'xxxxx')
r = StringIO()
ftp.retrbinary('RETR /DataLoadFolder/xxx/xxx/xxx/'+str(file_name),r.write)
r.seek(0)
csvfile1 = csv.reader(r,delimiter=';')
input_file = [list(line) for line in csv.reader(r)]  # <-- error raised here
I get this error on the last line:
new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
My csv file
Text Version
There are white spaces at the end of each row (after 17.00).
Data starts from the second row.
What does the error mean? Any help would be much appreciated.
The error message is asking how you want newlines to be handled. This behavior exists for historical reasons; you can read the explanation here.
To solve the issue, specify the newline on StringIO like this:
r = StringIO(newline='')
According to the StringIO documentation, if newline is set to None, newlines are written as \n on all platforms, but universal newline decoding is still performed when reading.
I could partially reproduce and fix this. The error is caused by a line containing a bad end of line. I could reproduce it by adding a line \r \n at the end of an otherwise valid csv file.
A simple way to fix it is to use a filter to eliminate blank lines and clean end of lines:
def filter_bytes(fd):
    for line in fd:
        line = line.strip()
        if len(line) != 0:
            yield(line + b'\r\n')
Once this is done, your code could become:
ftp = FTP('adr')
ftp.login(user='xxxx', passwd = 'xxxxx')
r = BytesIO()
ftp.retrbinary('RETR /DataLoadFolder/xxx/xxx/xxx/'+str(file_name),r.write)
r.seek(0)
csvfile1 = csv.reader(filter_bytes(r),delimiter=';')
input_file = list(csvfile1)
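On Python 3, csv.reader expects text rather than bytes, so the same filtering idea needs a decoding step. The following is a minimal sketch under stated assumptions, not the answerer's code: the function name read_semicolon_csv is hypothetical, UTF-8 is an assumed encoding, and the bytes are assumed to have already been downloaded (for example into a BytesIO via retrbinary, then read with getvalue()).

```python
import csv


def read_semicolon_csv(raw_bytes, encoding="utf-8"):
    """Parse ';'-delimited CSV bytes, skipping blank or whitespace-only lines.

    Hypothetical helper; the encoding default is an assumption.
    """
    text = raw_bytes.decode(encoding)
    # splitlines() treats \n, \r\n and a bare \r as line endings.
    cleaned = (line.strip() for line in text.splitlines())
    non_blank = [line for line in cleaned if line]
    return list(csv.reader(non_blank, delimiter=";"))


sample = b"a;b\r\n1;17.00   \r\n \r \n"
print(read_semicolon_csv(sample))
```

Because the input is split into lines before parsing, this sketch does not support quoted fields that contain embedded newlines.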

Saving a file with an URL as the name of the file using Python on Windows

I am having trouble saving a file using Python on Windows.
Here's the URL variable that stores the URL:
my_url = "https://example.com/some-page"
I want to remove the "https:" part and all the "/" from this string. This is what I tried:
filename = my_url.replace('https://', '')
filename = filename.replace('http://', '')
filename = filename.replace('/', '|') + ".txt"
I want to remove these characters because Windows doesn't allow the : and / characters in a file name.
The error that I am getting is:
Traceback (most recent call last):
File "123.py", line 28, in <module>
f = open(filename, "w")
OSError: [Errno 22] Invalid argument: 'example.com|some-page.txt'
I want to do this with multiple URLs so even though the actual link uses https I tried to remove the http too.
The pipe character ("|") is not allowed in Windows filenames either. Source: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
There is a function in urllib called urllib.parse.quote, which replaces special characters in a URL with their percent-encoded equivalents.
urllib.parse.quote(string, safe='/', encoding=None, errors=None)
Replace special characters in string using the %xx escape. Letters, digits, and the characters '_.-' are never quoted. By default, this function is intended for quoting the path section of URL. The optional safe parameter specifies additional ASCII characters that should not be quoted — its default value is '/'.
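Since the default safe='/' leaves slashes untouched, quote only produces a usable file name if safe is set to an empty string. Here is a minimal sketch of that idea; the function name url_to_filename is made up for illustration, and any query string or fragment in the URL is simply dropped.

```python
from urllib.parse import quote, urlsplit


def url_to_filename(url, extension=".txt"):
    """Turn a URL into a percent-encoded file name (hypothetical helper)."""
    parts = urlsplit(url)
    # Drop the scheme ("https" or "http") and keep host + path.
    rest = parts.netloc + parts.path
    # safe="" makes quote() encode "/" too; the default would keep it.
    return quote(rest, safe="") + extension


print(url_to_filename("https://example.com/some-page"))
```

The percent sign is allowed in Windows file names, and the original URL path can be recovered later with urllib.parse.unquote.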
I managed to solve the problem :)
Here's what I did:
filename = my_url.replace('https://', '')
filename = filename.replace('http://', '')
filename = filename.replace('.', '_')
filename = filename.replace('-', '_')
filename = filename.replace('/', '_') + ".txt"
Thank you!

Can't escape control character "\r" when extracting file paths

I am trying to open each of the following files separately.
"C:\recipe\1,C:\recipe\2,C:\recipe\3,"
I attempt to do this using the following code:
import sys
import os
import re
line = "C:\recipe\1,C:\recipe\2,C:\recipe\3,"
line = line.replace('\\', '\\\\') # tried to escape control chars here
line = line.replace(',', ' ')
print line # should print "C:\recipe\1 C:\recipe\2 C:\recipe\3 "
for word in line.split():
    fo = open(word, "r+")
    # Do file stuff
    fo.close()
print "\nDone\n"
When I run it, it gives me:
fo = open(word, "r+")
IOError: [Errno 13] Permission denied: 'C:'
So it must be a result of the '\r's in the original string not escaping correctly. I tried many other methods of escaping control characters but none of them seem to be working. What am I doing wrong?
Use a raw string:
line = r"C:\recipe\1,C:\recipe\2,C:\recipe\3,"
If for whatever reason you can't use a raw string, you need to escape each backslash by doubling it:
line = "C:\\recipe\\1,C:\\recipe\\2,C:\\recipe\\3,"
print(line.split(','))
Output:
['C:\\recipe\\1', 'C:\\recipe\\2', 'C:\\recipe\\3', '']
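To see why the original code ended up trying to open 'C:', it helps to compare the two literals side by side. This is an illustrative sketch (the variable names are made up): without the r prefix, "\r" is a carriage return and "\1" is the character chr(1), so the string contains no backslashes at all, and split() breaks each path at the carriage return.

```python
# Non-raw literal: "\r" becomes a carriage return, "\1" becomes chr(1).
broken = "C:\recipe\1,C:\recipe\2,C:\recipe\3,"
words = broken.replace(",", " ").split()
print(repr(words[0]))  # the carriage return is whitespace, so the path is cut

# Raw literal: every backslash is kept as written.
raw = r"C:\recipe\1,C:\recipe\2,C:\recipe\3,"
# Filter out the empty string produced by the trailing comma.
paths = [p for p in raw.split(",") if p]
print(paths)
```

Calling replace('\\', '\\\\') afterwards cannot help, because by then the backslashes are already gone from the string.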

How to UTF-8 encode and replace text inside txt file?

I am trying to write an application that opens the txt files inside the selected (sub)folder(s), replaces every letter "ž" with the letter "š", and saves the files in UTF-8 format.
This is what I have managed to do so far (VERSION 2 - see edit):
import os
import codecs
startIn = os.getcwd()
print()
print("Pregledujem: " + startIn + "\\")  # "Scanning: "
print("-------------------------")
for dirName, subdirList, fileList in os.walk(startIn):
    print()
    print("Trenutna mapa: " + dirName + "\\")  # "Current folder: "
    for fname in fileList:
        if fname.endswith(".srt"):
            fullpath = dirName + "\\" + fname
            print(" Podnapis: " + fname )  # "Subtitle: "
            with codecs.open(fullpath, 'r+', "UTF-8-sig") as cursub:
                lines = cursub.read().replace("ž","š")
                cursub.seek(0)
                cursub.write(lines)
EDIT
Replacing the letters now works as it should, but I still can't figure out how to properly encode the file to UTF-8.
The current version outputs the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 220: invalid start byte
If you want to both read and write, open the file in r+ mode:
cursub = codecs.open(filename, 'r+',"utf-8")
lines = cursub.read().replace("š", "ž")
cursub.seek(0) # go back to start of file
cursub.write(lines) # rewrite updated lines
Using with will close the file automatically:
with codecs.open(filename, 'r+',"utf-8") as cursub:
    lines = cursub.read().replace("š", "ž")
    cursub.seek(0)
    cursub.write(lines)
If you are going to edit (or rather rewrite) a file, you shouldn't open it in write mode, because that makes it impossible to read from it.
Either read the full file into memory first or write to a copy while reading from the original (or make a copy first and read from the copy, rewriting the original).
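The UnicodeDecodeError suggests the files are not UTF-8 to begin with. Byte 0x9a is "š" in the Windows-1250 code page, so one plausible reading is that the subtitles are stored as cp1250; that is an assumption, not something the question confirms. Under that assumption, a minimal sketch of "read everything into memory, then rewrite" could look like this (the function name convert_to_utf8 and its parameters are made up for illustration):

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="cp1250", old="ž", new="š"):
    """Re-encode one file as UTF-8, replacing old with new on the way.

    Hypothetical helper; source_encoding="cp1250" is an assumption.
    """
    path = Path(path)
    # Read the whole file first, working on bytes so line endings survive.
    text = path.read_bytes().decode(source_encoding)
    # Only after reading is complete, overwrite the file as UTF-8.
    path.write_bytes(text.replace(old, new).encode("utf-8"))
```

It could be called as convert_to_utf8(fullpath) inside the os.walk loop. If the source encoding is guessed wrong, the decode step either fails or silently produces wrong characters, so testing on a copy first is a good idea.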
