I am reading a registry file with file.readline() in order to filter some substrings out. I make a copy of it (just to preserve the original) using shutil.copyfile(), process it with foo(), and see nothing filtered out. I tried debugging, and the contents of the lines look very binary:
'˙ţW\x00i\x00n\x00d\x00o\x00w\x00s\x00 \x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00y\x00 \x00E\x00d\x00i\x00t\x00o\x00r\x00 \x00V\x00e\x00r\x00s\x00i\x00o\x00n\x00 \x005\x00.\x000\x000\x00\n'
which is rather obvious in hindsight, but I was not aware of it (Notepad++ presents the text neatly). My question is: how can I filter my strings out?
I see two options: a reg->txt->reg approach (what I meant by the title), or converting these strings to bytes and then comparing them with the contents.
When I create the files by hand (copy and paste the contents of the input file) and give them a .txt extension, everything works fine, but I wish it could be automated.
inputfile = "filename_in.reg"
outputfile = "filename_out.reg"
copyfile(inputfile, output file)
with open(outputfile, 'r+') as fd:
contents = fd.readlines()
for d in data:
foo(fd, d, contents)
Reg files are usually UTF-16 (usually referred to in MS documentation as "Unicode"). It looks like your debug output is treating the data as 8-bit characters (so there are lots of \x00 for the high-order bytes of the 16-bit characters). Notepad++ can be persuaded to display UTF-16.
The fix is to tell Python that the text you are reading is in UTF-16 format:
open(outputfile, 'r+', encoding='utf16')
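For example, here is a minimal sketch of the copy-and-filter flow with the encoding specified; the substrings_to_remove list and the line-dropping logic are assumptions standing in for the asker's foo():

from shutil import copyfile

inputfile = "filename_in.reg"
outputfile = "filename_out.reg"
substrings_to_remove = ["SomeKey"]  # hypothetical filter terms

copyfile(inputfile, outputfile)

# read the copy as UTF-16 text
with open(outputfile, 'r', encoding='utf16') as fd:
    lines = fd.readlines()

# drop any line containing one of the unwanted substrings
filtered = [line for line in lines
            if not any(s in line for s in substrings_to_remove)]

# write the filtered lines back, still as UTF-16
with open(outputfile, 'w', encoding='utf16') as fd:
    fd.writelines(filtered)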
I have spent 5 hours trawling the dark recesses of SO, so I am posting this question as a last resort, and I am genuinely hoping someone can point me in the right direction here:
Scenario:
I have some .csv files (UTF-8 CSVs: verified with the file -I command) from Google surveys that are in multiple languages. Output:
download.csv: application/csv; charset=utf-8
I have a "dictionary" file that has the translations for the questions and answers (one column is the $language and the other is English).
There are LOTS of special characters (umlauts, French accented letters, etc.) in the data from Google, because the surveys are in French, German, and Dutch.
The dictionary file I built reads fine as UTF-8 including special characters and creates the find/replace keys accurately (verified with print commands)
The issue is that the Google files only read correctly (maintaining the proper characters) using the csv.reader function in Python. However, what it gives back does not have a .replace, and so I can do one or the other:
read in the source file, make no replacements, and get a perfect copy (not what I need)
convert the csv files/rows to a fileinput/string (UTF-8 still, mind) and get an utterly thrashed output file with missing replacements, because the data "loses" the encoding between the csv read and the string somehow?
The code (here) comes closest to working, except there is no .replace method on csv.reader:
import csv

# set source, output
source = 'fr_to_trans.csv'
output = 'fr_translated.csv'
dictionary = 'frtrans.csv'

find = []
replace = []

# build the dictionary itself:
with open(dictionary, encoding='utf-8') as dict_file:
    for line in dict_file:
        #print(line)
        temp_split = []
        temp_split = line.split(',')
        if "!!" in temp_split[0]:
            temp_split[0] = temp_split[0].replace("!!", ",")
        find.append(temp_split[0])
        if "!!" in temp_split[1]:
            temp_split[1] = temp_split[1].replace("!!", ",")
        replace.append(temp_split[1])
#print(len(find))
#print(len(replace))

# set loop counters
check_each = len(find)

# Read in the file to parse
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    output_writer = csv.writer(t_file)
    for row in csv.reader(s_file):
        the_row = row
        print(the_row)  # THIS RETURNS THE CORRECT, FORMATTED, UTF-8 DATA
        i = 0
        # find and replace everything in the find array with its value in the replace array
        while i < check_each:
            print(find[i])
            print(replace[i])
            # THIS LINE DOES NOT WORK:
            the_row = the_row.replace(find[i], replace[i])
            i = i + 1
        output_writer.writerow(the_row)
I have to assume that even though the Google files say they are UTF-8, they are a special "Google branded UTF-8" or some such nonsense. The fact that the file opens correctly with csv.reader, but then you can do nothing to it is infuriating beyond measure.
Just to clarify what I have tried:
Treat files as text and let Python sort out the encoding (fails)
Treat files as UTF-8 text (fails)
Open file as UTF-8, replace strings, and write out using the csv.writer (fails)
Convert the_row to a string, then replace, then write out with csv.writer (fails)
Quick edit - I tried utf-8-sig with strings - better, but the output is still totally mangled because it isn't being read as a CSV, but as strings
I have not tried:
"cell by cell" comparison instead of the whole row (working on that while this percolates on SO)
Different encoding of the file (I can only get UTF-8 CSVs so would need some sort of utility?)
If these were ASCII text I would have been done ages ago, but this whole "UTF-8 that isn't but is" thing is driving me mad. Anyone got any ideas on this?
Each row yielded by csv.reader is a list of cell values like
['42', 'spam', 'eggs']
Thus the line
# THIS LINE DOES NOT WORK:
the_row = the_row.replace(find[i], replace[i])
cannot possibly work, because lists don't have a replace method.
What might work is to iterate over the row list and find/replace on each cell value (I'm assuming they are all strings):

the_row = [cell.replace(find[i], replace[i]) for cell in the_row]
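As a sketch, folding that per-cell replacement into the asker's loop might look like this (keeping the file names and the find/replace lists built in the question's code):

with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    output_writer = csv.writer(t_file)
    for row in csv.reader(s_file):
        # apply every find/replace pair to every cell in the row
        for old, new in zip(find, replace):
            row = [cell.replace(old, new) for cell in row]
        output_writer.writerow(row)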
However, if all you want to do is replace all instances of some characters in the file with some other characters then it's simpler to open the file as a text file and replace without invoking any csv machinery:
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    for old, new in zip(find, replace):
        text = text.replace(old, new)
    t_file.write(text)
If the find/replace mapping is the same for all files and every find entry is a single character, you can use str.translate to avoid the inner loop (note that str.maketrans only accepts single-character keys, so this does not apply to multi-character find strings):

# Make a reusable translation table (single-character keys only)
trans_table = str.maketrans(dict(zip(find, replace)))

with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    text = text.translate(trans_table)
    t_file.write(text)
For clarity: CSVs are text files, just formatted so that their contents can be interpreted as rows and columns. If you want to manipulate their contents as pure text, it's fine to edit them as normal text files: as long as you don't change any of the characters used as delimiters or quote marks, they will still be usable as CSVs when you want to use them as such.
I'm trying to add the same text at the beginning of all the .txt files in a folder.
With this code I can do it, but there is a problem: I don't know why it overwrites part of the text at the beginning of each .txt file.
output_dir = "output"
if not os.path.exists(output_dir):
os.makedirs(output_dir)
for f in glob.glob("*.txt"):
with open(f, 'r', encoding="utf8") as inputfile:
with open('%s/%s' % (output_dir, ntpath.basename(f)), 'w', encoding="utf8") as outputfile:
for line in inputfile:
outputfile.write(line.replace(line,"more_text"+line+"text_that_is_overwrited"))
outputfile.seek(0,io.SEEK_SET)
outputfile.write('text_that_overwrite')
outputfile.seek(0, io.SEEK_END)
outputfile.write("more_text")
The content of the .txt files I'm trying to edit starts like this:
here 4 spaces text_line_1
here 4 spaces text_line_2
The result is:
On file1.txt: text_that_overwriteited
On file1.txt: text_that_overwriterited
Your mental model of how writing a file works seems to be at odds with what's actually happening here.
If you seek back to the beginning of the file, you will start overwriting all of the file. There is no such thing as writing into the middle of a file. A file - at the level of abstraction where you have open and write calls - is just a stream; seeking back to the beginning of the stream (or generally, seeking to a specific position in the stream) and writing replaces everything which was at that place in the stream before.
Granted, there is a lower level where you could actually write new bytes into a block on the disk whilst that block still remains the storage for a file which can then be read as a stream. With most modern file systems, the only way to make this work is to replace that block with exactly the same amount of data, which is very rarely feasible. In other words, you can't replace a block containing 1024 bytes with data which isn't also exactly 1024 bytes. This is so marginally useful that it's simply not an operation which is exposed to the higher level of the file system.
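A tiny illustration of that stream behaviour (the file name is just a placeholder):

with open("demo.txt", "w+", encoding="utf8") as fh:
    fh.write("abcdefgh")
    fh.seek(0)          # back to the start of the stream
    fh.write("XY")      # overwrites 'ab'; nothing is inserted or shifted
    fh.seek(0)
    print(fh.read())    # -> XYcdefgh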
With that out of the way, the proper way to "replace lines" is to not write those lines at all. Instead, write the replacement, followed by whichever lines were in the original file.
It's not clear from your question what exactly you want overwritten, so this is just a sketch with some guesses around that part.
output_dir = "output"
# prefer exist_ok=True over if not os.path.exists()
os.makedirs(output_dir, exist_ok=True)
for f in glob.glob("*.txt"):
# use a single with statement
# prefer os.path.basename over ntpath.basename; use os.path.join
with open(f, 'r', encoding="utf8") as inputfile, \
open(os.path.join(output_dir, os.path.basename(f)), 'w', encoding="utf8") as outputfile:
for idx, line in enumerate(inputfile):
if idx == 0:
outputfile.write("more text")
outputfile.write(line.rstrip('\n'))
outputfile.write("text that is overwritten\n")
continue
# else:
outputfile.write(line)
outputfile.write("more_text\n")
Given an input file like
here is some text
here is some more text
this will create an output file like
more texthere is some texttext that is overwritten
here is some more text
more_text
where the first line is a modified version of the original first line, and a new line is appended after the original file's contents.
I found this elsewhere on StackOverflow. Why does my text file keep overwriting the data on it?
Essentially, the w mode is meant to overwrite text.
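A quick sketch of the difference between 'w' and 'a' modes (the file name is just a placeholder):

with open("log.txt", "w") as fh:   # 'w' truncates: any previous contents are lost
    fh.write("first run\n")

with open("log.txt", "a") as fh:   # 'a' appends: previous contents are kept
    fh.write("second run\n")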
Also, you seem to be writing a sitemap manually. If you are using a web framework like Flask or Django, they have plugin or built-in support for auto-generated sitemaps — you should use that instead. Alternatively, you could create an XML template for the sitemap using Jinja or DTL. Templates are not just for HTML files.
I have a Python script that reads files as input which contain \ characters, alters some of the content, and writes it to another output file. A simplified version of this script looks like this:
import sys

inputFile = open(sys.argv[1], 'r')
input = inputFile.read()
outputFile = open(sys.argv[2], 'w')
outputFile.write(input.upper())
Given, this content in input file:
My name\'s Bob
the output is:
MY NAME\\'S BOB
instead of:
MY NAME'S BOB
I suspect that this is because of the input file's format, because direct string input yields the desired result (e.g. outputFile.write(('My name\'s Bob').upper())). This does not occur for all files (e.g. .txt files work, but .js files don't). Because I am reading different kinds of files as text files, the ideal solution should not require the input file to be of a certain type, so is there a better way to read files? This also leads me to question whether I should use different read/write functions.
Thanks in advance for all help
I'm using Notepad++ to do a find-and-replace. Currently I have a huge number of text files, and I need to do a replacement for a different string in each file. I want to do it in batch. For example:
I have a folder that holds the huge number of text files. I have another text file that has the strings to find and replace, in order:
Text1 Text1-corrected
Text2 Text2-corrected
I have a small script that does this replacement, but only for the files currently open in Notepad++. To achieve this I'm using a Python script inside Notepad++. The code is as follows:
with open('C:/replace.txt') as f:
    for l in f:
        s = l.split()
        editor.replace(s[0], s[1])
In simple words, the find and replace function should fetch the input from a file.
Thanks in advance.
with open('replace.txt') as f:
    replacements = [tuple(line.split()) for line in f]

# filenames: an iterable of the text files to process (e.g. from glob.glob)
for filename in filenames:
    # read first, then reopen with 'w' (which truncates) to write the result back
    with open(filename) as f:
        contents = f.read()
    for old, new in replacements:
        contents = contents.replace(old, new)
    with open(filename, 'w') as f:
        f.write(contents)
Read the replacements into a list of tuples, then go through each file: read the contents into memory, do the replacements, and write the result back. Note that each file is read before being reopened in 'w' mode, since opening with 'w' truncates the file.
I need to do some manipulation of a number of pdf files. As a first step I wanted to copy them from a single directory into a tree that supports my needs. I used the following code
for doc in docList:
    # these steps just create the directory structure I need from the file name
    fileName = doc.split('\\')[-1]
    ID = fileName.split('_')[0]
    basedate = fileName.split('.')[0].split('_')[-1].strip()
    rdate = '\\R' + basedate + '-' + 'C' + basedate
    newID = str(cikDict[ID])
    newpath = basePath + newID + rdate
    # check existence of the new path
    if not os.path.isdir(newpath):
        os.makedirs(newpath)
    # reads the file in and then writes it to the new directory
    fstring = open(doc).read()
    outref = open(newpath + '\\' + fileName, 'wb')
    outref.write(fstring)
    outref.close()
When I run this code the directories are created and there are files with the correct name in each directory. However, when I click to open a file I get an error from Acrobat informing me that the file was damaged and could not be repaired.
I was able to copy the files using
shutil.copy(doc,newpath)
to replace the last four lines, but I have not been able to figure out why I can't read the file in as a string and then write it out to a new location.
One thing I did was compare what was read from the source to what the file content was after a read after it had been written:
>>> newstring = open(newpath + '\\' +fileName).read()
>>> newstring == fstring
True
So it does not appear the content was changed?
I have not been able to figure out why I can't read the file as a string and then write it in a new location.
Please be aware that PDF is a binary file format, not a text file format. Methods treating files (or data in general) as text may change it in different ways, especially:
Reading data as text interprets bytes and byte sequences as characters according to some character encoding. Writing the text back out as data transforms it according to some character encoding, too.
If the applied encodings differ, the result obviously differs from the original file. But even if the same encoding was used, differences can creep in: if the original file contains bytes which have no meaning in the applied encoding, some replacement character is used instead, and the final result file contains the encoding of that replacement character, not the original byte sequence. Furthermore, some encodings allow multiple byte sequences for the same character, so some input byte sequence may be replaced by another sequence representing the same character in the output.
End-of-line sequences may be changed according to the preferences of the platform.
Binary files may contain different byte sequences used as end-of-line marker on one or the other platform, e.g. CR, LF, CRLF, ... Methods treating the data as text may replace all of them by the one sequence favored on the local platform. But as these bytes in binary files may have a different meaning than end-of-line, this replacement may be destructive.
Control characters in general may be ignored
In many encodings the bytes 0..31 have meanings as control characters. Methods treating binary data as text may interpret them somehow which may result in a changed output again.
All these changes can utterly destroy binary data, e.g. compressed streams inside PDFs.
You could try using binary mode for reading files by also opening them with a b in the mode string. Using binary mode both while reading and writing may solve your issue.
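As a sketch, the copy step of the original loop rewritten in binary mode could look like this (keeping the asker's variable names):

# read and write the PDF as raw bytes instead of text
with open(doc, 'rb') as inref:
    fstring = inref.read()
with open(newpath + '\\' + fileName, 'wb') as outref:
    outref.write(fstring)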
One thing I did was compare what was read from the source to what the file content was after a read after it had been written:
>>> newstring = open(newpath + '\\' +fileName).read()
>>> newstring == fstring
True
So it does not appear the content was changed?
Your comparison also reads the files as text. Thus, you do not compare the actual byte contents of the original and the copied file but their interpretations according to the encoding assumed while reading them. So damage has already been done on both sides of your comparison.
You should use shutil to copy files. It is platform aware and you avoid problems like this.
But you already discovered that.
You would be better served using with to open and close files. Then the files are opened and closed automatically. It is more idiomatic:
with open(doc, 'rb') as fin, open(fn_out, 'wb') as fout:
    fout.write(fin.read())  # the ENTIRE file is read with .read()
If potentially you are dealing with a large file, read and write in chunks:
with open(doc, 'rb') as fin, open(fn_out, 'wb') as fout:
    while True:
        chunk = fin.read(1024)
        if chunk:
            fout.write(chunk)
        else:
            break
Note the 'rb' and 'wb' arguments to open. Since you are clearly running this under Windows, binary mode prevents the file's bytes from being interpreted as text and its line endings from being translated.
You should also use os.path.join rather than newpath + '\\' +fileName type operation.
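For instance (reusing the asker's variable names):

import os

out_name = os.path.join(newpath, fileName)  # instead of newpath + '\\' + fileName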