Encoding issue when writing Hebrew to a text file with Python

Django project with an encoding problem. The flow of the code:
read an Excel file
manipulate the data from the Excel file
create a new txt file and write into it
send the txt file to the client
coding = "utf8"
file = open(filename, "w", encoding=coding, errors="ignore")
for row in excel_data_df.iloc():
line = manipulate(row)
file.write(line)
file.close()
file_data = open(filename, "r", encoding=coding, errors="ignore")
response = HttpResponse(file_data, content_type='application/vnd.ms-excel')
response['Content-Disposition'] = 'attachment; filename=' + filename
Everything is working just fine, but when I open the file with ANSI encoding all the Hebrew turns into gibberish.
I have tried changing the encoding to every Hebrew option listed at https://docs.python.org/2.4/lib/standard-encodings.html
The Hebrew should be written as ASCII NEW CODE or ASCII WINDOWS TEXT.
Any ideas?

You need to change the read mode to "rb" and remove the encoding parameter:
file = open(filename, "w")
for row in excel_data_df.iloc():
line = manipulate(row)
file.write(line)
file.close()
file_data = open(filename, "rb")
response = HttpResponse(file_data, content_type='application/vnd.ms-
excel')
response['Content-Disposition'] = 'attachment; filename=' + filename
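If the client really needs the file to open correctly in a viewer that assumes the Windows Hebrew ANSI code page, another option is to write the text with the cp1255 codec instead of UTF-8 and declare that charset in the response. This is only a sketch, not the accepted answer: cp1255 as the target code page and the text/plain content type are assumptions, and excel_data_df and manipulate are the objects from the question.
coding = "cp1255"  # Windows Hebrew ANSI code page (assumed to match the client's viewer)
with open(filename, "w", encoding=coding, errors="ignore") as out:
    for _, row in excel_data_df.iterrows():
        out.write(manipulate(row))
with open(filename, "rb") as file_data:
    response = HttpResponse(file_data.read(), content_type='text/plain; charset=windows-1255')
    response['Content-Disposition'] = 'attachment; filename=' + filename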

Related

Encoding any file type with python

I am trying to go through all files within a folder, read the file data encoded as utf-8, then rewrite that data to a new file, which should create a copy of that file. However, when doing so, the new copy of the file gets corrupted.
- Should I be using utf-8 text encoding to encode all file types (.py, .txt, .docx, .jpg)?
- Is there one standard text encoding format that works for all file types?
def read_files():
    files = ["program.py", "letter.docx", "cat.jpg", "hello_world.py"]
    for file in files:
        # open existing file
        f = open(file, encoding="utf-8")
        file_content = f.read()
        # get file name info
        file_extension = file.split(".")[1]
        file_name = file.split(".")[0]
        # write encoded data to new file
        f = open(file_name + "_converted." + file_extension, "wb")
        f.write(bytes(file_content, encoding="utf-8"))
        f.close()

read_files()
The proper way to copy files, with shutil:
import shutil
source = file
destination = file_name + "_converted." + file_extension
shutil.copy(source, destination)
The bad and slow way to copy files:
def read_files():
    files = ["program.py", "letter.docx", "cat.jpg", "hello_world.py"]
    for file in files:
        # open existing file
        f = open(file, 'rb')  # read file in binary mode
        file_content = f.read()
        f.close()  # don't forget to close the file!
        # get file name info
        file_extension = file.split(".")[1]
        file_name = file.split(".")[0]
        # write raw data to new file
        f = open(file_name + "_converted." + file_extension, "wb")
        f.write(file_content)
        f.close()

read_files()
If you don't need to decode the files to text, open them only in binary mode; formats like jpg and docx are not plain text, so decoding and re-encoding them as utf-8 corrupts them.
Alternatively, if you actually need to do some work on the docx or jpg files, use the proper modules for that, such as Pillow for jpg and the python-docx module for docx files.
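For illustration, a minimal sketch of handling those formats with their own libraries rather than as utf-8 text (this assumes Pillow and python-docx are installed; the file names are the hypothetical ones from the question):
from PIL import Image      # Pillow
from docx import Document  # python-docx

# open the image with Pillow and save a copy of it
img = Image.open("cat.jpg")
img.save("cat_converted.jpg")

# open the document with python-docx and save a copy of it
doc = Document("letter.docx")
doc.save("letter_converted.docx")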

How can I manipulate a txt file to be all in lowercase in python?

Let's say I have a txt file that I need to make all lowercase. I tried this:
def lowercase_txt(file):
    file = file.casefold()
    with open(file, encoding="utf8") as f:
        f.read()
Here I get "'str' object has no attribute 'read'".
Then I tried:
def lowercase_txt(file):
    with open(poem_filename, encoding="utf8") as f:
        f = f.casefold()
        f.read()
and here "'_io.TextIOWrapper' object has no attribute 'casefold'".
What can I do?
EDIT: I re-ran this exact code and now there are no errors (I don't know why), but the file doesn't change at all; all the letters stay the way they are.
This will rewrite the file. Warning: if there is some type of error in the middle of processing (power failure, you spill coffee on your computer, etc.) you could lose your file. So, you might want to first make a backup of your file:
def lowercase_txt(file_name):
    """
    file_name is the full path to the file to be opened
    """
    with open(file_name, 'r', encoding="utf8") as f:
        contents = f.read()  # read contents of file
    contents = contents.lower()  # convert to lower case
    with open(file_name, 'w', encoding="utf8") as f:  # open for output
        f.write(contents)
For example:
lowercase_txt('/mydirectory/test_file.txt')
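Since the answer warns that an error in the middle of the rewrite can lose the file, here is a small sketch of making a backup copy first (the .bak name and the use of shutil are my own choice, not part of the answer):
import shutil

def lowercase_txt_with_backup(file_name):
    shutil.copy2(file_name, file_name + ".bak")  # keep a backup copy of the original
    lowercase_txt(file_name)                     # then rewrite the file in place as above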
Update
The following version opens the file for reading and writing. After the file is read, the file position is reset to the start of the file before the contents are rewritten. This might be a safer option.
def lowercase_txt(file_name):
    """
    file_name is the full path to the file to be opened
    """
    with open(file_name, 'r+', encoding="utf8") as f:
        contents = f.read()  # read contents of file
        contents = contents.lower()  # convert to lower case
        f.seek(0, 0)  # position back to start of file
        f.write(contents)
        f.truncate()  # in case the new content is shorter than the old

Format Pdftotext output object

I am extracting text from a PDF file and I wanted to get a formatted output of the content. As the object is made of a list of sentences, I thought that using textwrap.wrap on the joined content would do the job.
I have tried with:
for file in listoffiles:
    with open(file, "rb") as f:
        pdf = pdftotext.PDF(f)
    with open("filename" + ".txt", 'w', encoding='utf-8') as f:
        f.write(textwrap.wrap("\n\n".join(pdf), width=70))
I have also tried with "{:<}".format("\n\n".join(pdf)) instead of textwrap, but it gives me back the same type of result.
Is there any way to pass a clean file to the wrapper?
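One relevant detail: textwrap.wrap returns a list of lines, while file.write expects a single string; textwrap.fill returns the wrapped text as one string. A sketch along those lines (assuming listoffiles is defined as in the question and that iterating pdftotext.PDF yields the pages as strings):
import textwrap
import pdftotext

for file in listoffiles:
    with open(file, "rb") as f:
        pdf = pdftotext.PDF(f)
    # wrap each page separately so the blank lines between pages survive
    wrapped = "\n\n".join(textwrap.fill(page, width=70) for page in pdf)
    with open(file + ".txt", "w", encoding="utf-8") as out:
        out.write(wrapped)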

Replace newlines with a space in all files in a directory - Python

I have about 4000 txt files in a directory. I'd like to replace newlines with spaces in each file using a for loop. Actually, the script works for that purpose, but when I save the file, it doesn't get saved or it gets saved with the newlines again. Here is my script:
import glob

path = "path_to_files/*.txt"
for file in glob.glob(path):
    with open(file, "r+") as f:
        data = f.read().replace('\n', ' ')
        f.write(data)
As I said I'm able to replace the newlines with a space, but at the end, it doesn't get saved. I also don't get any errors.
To further elaborate my comment ("It's almost always a bad idea to open a file in the 'r+' mode, because of the way the current position is handled. Open the file for reading, read the data, replace the newlines, open the same file for writing, write the data"):
for file in glob.glob(path):
    with open(file) as f:
        data = f.read().replace('\n', ' ')
    with open(file, "w") as f:
        f.write(data)
You need to reset the file position to 0 with seek, and then cut off the leftover with truncate after you finish writing the replacement string.
import glob

path = "path_to_files/*.txt"
for file in glob.glob(path):
    with open(file, "r+") as f:
        data = f.read().replace('\n', ' ')
        f.seek(0)
        f.write(data)
        f.truncate()
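The same read-then-rewrite pattern can also be written more compactly with pathlib; this is just a sketch, assuming Python 3.5+ and that the files are UTF-8 text in the same path_to_files directory:
from pathlib import Path

for txt in Path("path_to_files").glob("*.txt"):
    data = txt.read_text(encoding="utf-8").replace("\n", " ")
    txt.write_text(data, encoding="utf-8")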

Python file redirect gives an empty file

The code used to convert the data from hexadecimal to binary works perfectly, but when I redirect the output to the file, the output file is empty.
Here is the code:
for file in glob.glob("g1.txt.out"):
print file
myfile = open(file, "r")
outfile= open( file + ".binary",'a+')
for line in myfile:
data_binary="{0:16b}".format(int(line, 16))
print >> outfile,data_binary # redirect code.
Instead of redirecting the output of print, you can write directly to the output file:
with open("g1.txt.out", "r") as my_file, open("g1.txt.out.binary",'a+') as out_file:
for line in my_file:
data_binary = "{0:16b}\n".format(int(line, 16))
out_file.write(data_binary)
You need to close your files, as I/O is buffered. Always remember to close all your open files so that the buffered data actually gets written out.
for file in glob.glob("g1.txt.out"):
print file
myfile = open(file, "r")
outfile= open( file + ".binary",'a+')
for line in myfile:
data_binary="{0:16b}".format(int(line, 16))
print >> outfile,data_binary # redirect code.
myfile.close()
outfile.close()
Or, even better, learn the with statement, which will close the file for you automatically:
with open(filename) as f:
    data = f.read()
    # do something with data
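The snippets above use Python 2 print syntax (print file, print >> outfile). If this has to run under Python 3, a sketch of the same loop (keeping the g1.txt.out input from the question) would be:
import glob

for file in glob.glob("g1.txt.out"):
    print(file)
    with open(file, "r") as myfile, open(file + ".binary", "a+") as outfile:
        for line in myfile:
            data_binary = "{0:16b}".format(int(line, 16))
            print(data_binary, file=outfile)  # print redirected to the output file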
