UTF8 conversion takes too long and crashes if file is too big - python

I have written the following Python code to convert a file to UTF-8. It works well, but I noticed that if the file is too big (in this case about 10 GB) the program crashes.
In general it also seems slow: 9 minutes to convert 2 GB of text files. Can I make it more efficient? I suspect the cause is that I first read the whole file and then save it. Could that be it?
import sys
import codecs
filename = sys.argv[1]
with codecs.open(filename, 'r', encoding='iso-8859-1') as f:
    text = f.read()
with codecs.open(filename, 'w', encoding='utf8') as f:
    f.write(text)

Yes, this may happen because you're reading the whole file into memory at once.
It's better to read this file by pieces, convert them to utf-8 and then write those pieces to another file.
import sys
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
sourceFileName = sys.argv[1]
targetFileName = sourceFileName + '-converted'
with codecs.open(sourceFileName, "r", "iso-8859-1") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
I took the code from this question (and modified it a bit).
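A minimal Python 3 sketch of the same chunked approach, using the built-in open with an explicit encoding instead of codecs.open. The function name convert_to_utf8 and the chunk_chars parameter are illustrative assumptions, not part of the original answer. It writes to a separate target file, so the source is never truncated while it is still being read.

```python
def convert_to_utf8(source_path, target_path,
                    source_encoding="iso-8859-1", chunk_chars=1024 * 1024):
    """Re-encode source_path as UTF-8 into target_path, one chunk at a time."""
    # newline="" turns off newline translation, so line endings are copied unchanged.
    with open(source_path, "r", encoding=source_encoding, newline="") as src, \
         open(target_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            # In text mode, read(n) returns at most n characters, so memory
            # use stays bounded no matter how large the file is.
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```

Called as convert_to_utf8("big.txt", "big.txt-converted"), this should keep memory use roughly constant. The chunk size is a tuning knob, not a required value.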

Related

Processing large .txt file in Python works only on small files

I have several 1+ gb text files of URLs. I'm trying to use Python to find and replace in order to quickly strip down the URLs.
Because these files are big, I don't want to load them into memory.
My code works on small test files of 50 lines, but when I use this code on a big text file, it actually makes the file larger.
import re
import sys
def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r") as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return
ProcessLargeTextFile()
print("Finished")
Small files I tested my code with result in the twitter username (as desired)
username_1
username_2
username_3
while large files result in
https://twitter.com/username_1਍ഀ
https://twitter.com/username_2਍ഀ
https://twitter.com/username_3਍ഀ
It's a problem with the encoding of the file, this works:
import re
def main():
    inputfile = open("1-10_no_dups_split_2.txt", "r", encoding="UTF-16")
    outputfile = open("output.txt", "a", encoding="UTF-8")
    for line in inputfile:
        line = re.sub("^https://twitter.com/", "", line)
        outputfile.write(line)
    outputfile.close()
main()
The trick is to specify UTF-16 when reading the file, then write it out as UTF-8. And voilà, the weird stuff goes away :) I do a lot of work moving text files around with Python. There are many settings you can use to control the encoding, automatically replace certain characters and so on. Just read up on the "open" command if you get into a weird spot, or post back here :).
Doing a quick look at the results, you'll probably want to have a few regexes so you can catch https://mobile.twitter.com/ and other stuff, but that's another story.. Good luck!
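As a rough sketch of the "few regexes" idea above, a single pattern can cover several URL variants. The pattern and the helper name strip_twitter_prefix are illustrative assumptions. Only the http/https schemes and the www and mobile subdomains are handled here, and real data may need more cases.

```python
import re

# Hypothetical pattern: http or https, with an optional "www." or "mobile." prefix.
TWITTER_PREFIX = re.compile(r"^https?://(?:www\.|mobile\.)?twitter\.com/")

def strip_twitter_prefix(line):
    """Remove a leading Twitter URL prefix, leaving the rest of the line intact."""
    return TWITTER_PREFIX.sub("", line)
```

Inside the loop from the answer, line = strip_twitter_prefix(line) would replace the re.sub call.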
You can use the buffering parameter of the open() function.
Here is the code for it.
import re
import sys
def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r", buffering=200000000) as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return
ProcessLargeTextFile()
print("Finished")
With buffering=200000000, the read buffer holds roughly 200 MB of data at a time.

Raw string for variables in python?

I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
from csv import writer
addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
    write = writer(w)
    write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly what I want for the output, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape'), but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want. unicode-escape gave me an error. Any thoughts?
The r in front of a string is only for defining a string. If you're reading data from a file, it's already 'raw'. You shouldn't have to do anything special when reading in your data.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
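To illustrate the point that the r prefix only affects string literals in source code, here is a small Python 3 sketch (the original question targets Python 2). The temporary file and variable names are assumptions for the demo. It writes the characters to a file and reads them back, and the backslashes arrive as ordinary characters with no escape processing.

```python
import os
import tempfile

# A raw literal: the backslashes are kept as literal characters.
raw_literal = r"6253\342\200\2236387"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w", encoding="ascii") as f:
        f.write(raw_literal + "\n")
    # Reading never interprets escape sequences; that only happens when
    # Python parses a non-raw string literal in source code.
    with open(path, "r", encoding="ascii") as f:
        from_file = f.readline().rstrip("\n")

same = from_file == raw_literal          # the text read back matches the raw literal
backslashes = from_file.count("\\")      # number of literal backslash characters
```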
Text file contains...
1234\4567\7890
41\5432\345\6789
Code:
import csv
# textfilepath is the path to the text file shown above (defined elsewhere)
with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a csv with whole lines in a column. Maybe use 'ab' rather than 'a' as your file open type. I was getting extra blank records in my csv when using just 'a'.
I created this a while back. It helps you write to a CSV file.
import csv
def write2csv(fileName, theData):
    # Append one row to <fileName>.csv; the with block closes the file afterwards.
    with open(fileName + '.csv', 'a') as theFile:
        wr = csv.writer(theFile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        wr.writerow(theData)
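The answers above are written for Python 2, where 'ab' avoids the extra blank records. In Python 3 the csv module expects a text-mode file opened with newline="" instead. A minimal sketch, assuming Python 3 (the helper name append_rows is hypothetical):

```python
import csv

def append_rows(csv_path, rows):
    """Append rows to a CSV file; backslashes are written as ordinary characters."""
    # newline="" lets the csv module manage line endings itself, which
    # prevents the blank rows that otherwise appear on Windows.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

For example, append_rows("output.csv", [[line.strip()] for line in open_text_file]) would write each input line as a single-column row, where open_text_file stands for whatever file object is being read.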

Write decoded from base64 string to file

The question is how to write a string decoded from base64 to a file. I use the following piece of code:
import base64
input_file = open('Input.txt', 'r')
coded_string = input_file.read()
decoded = base64.b64decode(coded_string)
output_file = open('Output.txt', 'w')
output_file.write(decoded)
output_file.close()
Input.txt contains a base64 string (something like PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aW). After running the script I expect to see XML in Output.txt, but the output file contains some wrong symbols (like <?xml version="1.0" encoding="UTF-8"?><review-case create®vFFSТ#2). At the same time, if I do not read the base64 string from Input.txt but specify it in the script as coded_string = '''PD94bWwgdmVyc2lvbj0iMS4wIiBlbm...''', then Output.txt contains correct XML. Is something wrong with the UTF encoding? How can I fix this? I use Python 2.7 on Windows 7. Thanks in advance.
You have probably figured it out by now, 5 years later, but here is the solution (for Python 3) if anyone needs it.
import base64
with open('Input.txt', 'r') as input_file:
    coded_string = input_file.read()
decoded = base64.b64decode(coded_string)
with open('Output.txt', 'w', encoding="utf-8") as output_file:
    output_file.write(decoded.decode("utf-8"))
On Windows, open the file with 'rb' instead of 'r'.
In your case the code should be:
input_file = open('Input.txt', 'rb')
instead of
input_file = open('Input.txt', 'r')
btw:
http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
hope it helps
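Following the binary-mode advice quoted above, one way to sidestep text-mode translation entirely is to treat both files as bytes. This is a sketch under that assumption. The function name decode_base64_file is hypothetical, and the decoded payload is written as-is without assuming any text encoding.

```python
import base64

def decode_base64_file(input_path, output_path):
    """Decode a base64 file and write the resulting bytes unchanged."""
    # 'rb' and 'wb' avoid any newline or encoding translation on Windows.
    with open(input_path, "rb") as source:
        encoded = source.read()
    # With default settings, b64decode discards characters outside the
    # base64 alphabet, so a trailing newline in the input is harmless.
    decoded = base64.b64decode(encoded)
    with open(output_path, "wb") as target:
        target.write(decoded)
```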

Python reading from a file and saving to utf-8

I'm having problems reading from a file, processing its string and saving to an UTF-8 File.
Here is the code:
try:
    filehandle = open(filename,"r")
except:
    print("Could not open file " + filename)
    quit()
text = filehandle.read()
filehandle.close()
I then do some processing on the variable text.
And then
try:
    writer = open(output,"w")
except:
    print("Could not open file " + output)
    quit()
#data = text.decode("iso 8859-15")
#writer.write(data.encode("UTF-8"))
writer.write(text)
writer.close()
This outputs the file perfectly, but it does so in ISO 8859-15 according to my editor. Since the same editor recognizes the input file (in the variable filename) as UTF-8, I don't know why this happened. As far as my research has shown, the commented lines should solve the problem. However, when I use those lines the resulting file has gibberish in the special characters, mainly in words with a tilde, as the text is in Spanish. I would really appreciate any help as I am stumped....
Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):
with open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
with open(filename, 'w', encoding='utf8') as f:
    f.write(text)
If still using Python 2 or for Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:
import io
with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
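To see what the explicit encoding parameter buys you, the following Python 3 sketch writes some Spanish text, inspects the raw bytes on disk, and reads it back. The sample text and temporary file are assumptions for the demo. The value of locale.getpreferredencoding(False) differs between machines, which is why relying on the default is fragile.

```python
import locale
import os
import tempfile

text = "señal, corazón, año"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Encode on the way out ...
    with open(path, "w", encoding="utf8") as f:
        f.write(text)
    # ... look at the raw bytes actually stored on disk ...
    with open(path, "rb") as f:
        raw = f.read()
    # ... and decode on the way back in with the same encoding.
    with open(path, "r", encoding="utf8") as f:
        round_tripped = f.read()

# The encoding open() would use if none were given; platform dependent.
default_encoding = locale.getpreferredencoding(False)
```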
You can also get through it by the code below:
file=open(completefilepath,'r',encoding='utf8',errors="ignore")
file.read()
In Python 2 you can't do that with the built-in open, because it has no encoding parameter and reads and writes byte strings. Use codecs instead.
(In Python 3, open accepts an encoding parameter directly.) To write a file in UTF-8 with codecs, try this:
import codecs
file = codecs.open('data.txt','w','utf-8')
The encoding parameter is what does the trick.
my_list = ['1', '2', '3', '4']
with open('test.txt', 'w', encoding='utf8') as file:
    for i in my_list:
        file.write(i + '\n')
You can try using utf-16, it might work.
data = pd.read_table(filename, encoding='utf-16', delimiter="\t")

Python: Problems with latin characters in output

I have a document in Spanish I'd like to format using Python. Problem is that in the output file, the accented characters are messed up, in this manner: \xc3\xad.
I succeeded in keeping the proper characters when I did some similar editing a while back, and although I've tried everything I did then and more, somehow it won't work this time.
This is current version of the code:
# -*- coding: utf-8 -*-
import re
import pickle
inputfile = open("input.txt").read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
I'm using Python 2.7 on Windows 7.
Can anyone see any obvious problems? The inputfile is encoded in utf-8, but I've tried encoding it latin-1 too. Thanks.
To clarify: my problem is that the Latin characters don't show up properly in the output.
It's solved now, I just had to add this line as suggested by mata:
inputfile = inputfile.decode('utf-8')
If the input file is encoded in UTF-8, then you should decode it first to work with it:
import re
import pickle
inputfile = open("input.txt").read()
inputfile = inputfile.decode('utf-8')
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inputfile)
outputfile = open("output.txt", "w")
pickle.dump(mylist, outputfile)
outputfile.close()
The file created this way will contain a pickled version of your list. If you would rather have a human-readable file, then you might want to just use a plain file.
Another good way to deal with different encodings is the codecs module:
import re
import codecs
with codecs.open("input.txt", "r", "utf-8") as infile:
    inp = infile.read()
pat = re.compile(r"(#.*\*)")
mylist = pat.findall(inp)
with codecs.open("output.txt", "w", "utf-8") as outfile:
    outfile.write("\n".join(mylist))
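For context on where \xc3\xad comes from: those two bytes are the UTF-8 encoding of the single character í, so seeing them in the output means the text was still undecoded bytes. A short Python 3 illustration (the sample word is an assumption):

```python
# Bytes as they would be stored in a UTF-8 encoded file.
raw = b"Mart\xc3\xadn"

# Decoding turns the two bytes 0xC3 0xAD into the one character "í".
text = raw.decode("utf-8")

byte_count = len(raw)    # counts bytes
char_count = len(text)   # counts characters
```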
