Inconsistent file behavior

Inconsistent file behavior - python

I'm trying to track down a Python UnicodeDecodeError in the following log line:
10.210.141.123 - - [09/Nov/2011:14:41:04 -0800] "gfR\x15¢\x09ì|Äbk\x0F[×ÐÖà\x11CEÐÌy\x5C¿DÌj\x08Ï ®At\x07å!;f>\x08éPW¤\x1C\x02ö*6+\x5C\x15{,ªIkCRA\x22 xþP9â\x13h\x01¢è´\x1DzõWiË\x5C\x10sòÊ¨R)¶²\x1F8äl¾¢{ÆNw\x08÷#ï" 400 166 0.000 "-" "-"
I opened the entire log file in Vim, and then yanked the line into a new file so I could test just the one line. However, my parsing script works OK with the new file - it doesn't throw a UnicodeDecodeError. I don't understand why the one file would generate an error and the other one would not, when they are (on the surface) identical.
Here's what I tried: running enca to determine the file encoding, which complained that it Cannot determine (or understand) your language preferences. file -i says that both files are Regular files. I also deleted every other line in the original log file and still got the error in one file and no error in the other. I tried deleting
set encoding=utf-8
from my .vimrc, writing the file again, and I still got the error in one file and not in the other.
The logs are nginx logs. Nginx has this note in their release notes:
*) Change: now the 0x00-0x1F, '"' and '\' characters are escaped as \xXX
in an access_log.
Thanks to Maxim Dounin.
My Python script has with open('log_file') as f and the error comes up when I try to call json.dumps on a dict.
How can I track this down?

Your question: How can I track this down?
Answer:
(1) Show us the full text of the error message that you got -- without knowing what encoding that you were trying to use, we can't tell you anything. A traceback and a snippet of code that reads the file and reproduces the error would also be handy.
(2) Write a tiny Python script to find the line in the file and then do:
print repr(the_line) # Python 2.X
print ascii(the_line) # Python 3.x
and copy/paste the result into an edit of your question, so that we can see unambiguously what is in the line.
(3) It does look like random gibberish except for the  but do tell us whether you expect that line to be text (if so, in what human language?).

Related

Strange behavior when trying to create and write to a text file on macOS [duplicate]

This question already has answers here:
Convert UTF-8 with BOM to UTF-8 with no BOM in Python
(7 answers)
Closed last year.
I'm opening a plain text file, parsing it, and adding different lines to existing, empty string variables. I add these variables into a new variable that is a multi-line fstring. Trying to write the data to a new text file is not behaving as expected.
Reading the original file works fine. Text is properly parsed, variables populated.
The multi-line fstring variable seems fine. Prints normally. Even tried formatting it different ways which I show below.
When writing to a new file, that's where the strangeness starts. I've tried 2 ways:
Straight coding the open function with w or w+
Adding the above to a function and using that inside main()
The file is saved to disk with the correct name. Trying to double-click open in Finder produces nothing. Right-click to open produces nothing. Trying to move to trash with command+delete gives an error:
It sounds like the file goes to trash, but as the file disappears from the folder a new one is created with the same name in its place.
If I try to open in TextMate via File > Open, it opens as a blank file with no errors.
Since I can't get rid of the file, I have to delete the directory and create the directory again with the same name, or force delete in Terminal using rm. Restarting the system does not help. Relaunching Finder does nothing. Saving text files from other apps works fine. Directory is chmod 755.
If I copy an existing text file into the output directory, rename it to what the file is expected to be named, and let python overwrite the contents, it doesn't work either. The file modification date changes (and I see the file "blink" in Finder) but the contents remain the same. However, the file is not corrupted and opens normally.
If I do the same but delete the text inside of the copied file first, then run the script, python writes no data to the file, I can't open it by double-clicking on it, and I get error -43 again with the odd non-trashing behavior.
The strangest thing is this: if I add another with open() at the end of the script, and open the file that was just created and supposedly written to, and print its contents, the contents print. It's like when the script ends the file contents are being removed or its being corrupted somehow. Tried to close the file inside the script even though it's not needed, but same behavior persists.
Code:
Here's the code for writing:
FORMAT='utf-8'
OUTPUT_DIR = '/Path/To/SaveFolder'
# as a function
def write_to_file(content, fpath, name):
the_file = os.path.join(fpath, name)
with open(the_file, 'w+', encoding=FORMAT) as t:
t.write(content)
def main():
print(f" Writing File...\n")
filename = f"{pcode}_{author}_{title}_text.txt"
write_to_file(multiline_var, OUTPUT_DIR, filename)
# or hard coded in main()
def main():
print(f" Writing File...\n")
filename = f"{pcode}_{author}_{title}_text.txt"
the_file = os.path.join(OUTPUT_DIR, filename)
with open(the_file, 'w+', encoding=FORMAT) as t:
t.write(multiline_var)
I have tried using w w+ wt and wt+ and with and without encoding='utf-8'
Here is an example of multi-line fstring variable:
# using triple quotes
multiline_var = f"""
[PROJ-{pcode}] {full_title} by {author}
{description}
{URL}
{DIVIDER_1}
{TEXT_BLURB}
Some text here and then {SOME_MORE_TEXT}"
{DIVIDER_1}
{SOME_LINK}
"""
# or inside parens
multiline_var = (
f"[PROJ-{pcode}] {full_title} by {author}\n"
f"{description}\n\n"
f"{URL}\n"
f"{DIVIDER_1}\n"
f"{TEXT_BLURB}\n\n"
f"Some text here and then {SOME_MORE_TEXT}\n"
f"{DIVIDER_1}\n\n"
f"{SOME_LINK}"
)
Using exiftool on the text file shows the following, so it looks the data is there but must be corrupted:
File Size : 1797 bytes
File Modification Date/Time : 2021:12:31 15:55:39-05:00
File Access Date/Time : 2021:12:31 15:58:13-05:00
File Inode Change Date/Time : 2021:12:31 15:55:39-05:00
File Permissions : -rw-r--r--
File Type : TXT
File Type Extension : txt
MIME Type : text/plain
MIME Encoding : utf-8
Byte Order Mark : No
Newlines : Unix LF
Line Count : 55
Word Count : 181
Not sure what I'm doing wrong. VScode shows no syntax errors in the script. There are no errors in Terminal when running the script. Have I made some simple mistake in the above code? Maybe the fstring variable is causing a problem?

Thanks to #bnaecker for leading me to the solution to this problem.
It appeared that when creating/writing to a text file with a long name, Python can corrupt it. Not sure why, as I save long names for images with Python image libraries all the time. Using a short name like "MyFile.txt" it worked just fine, but that was a red herring.
I have updated this post with my journey to the final solution for using the long names that are needed for my project, though I'm not sure why the problem exists.
First Attempts:
So far creating using a short name and then renaming to a long one.... attempts have failed. I did notice that python is locking the file it creates and never unlocks it. Not sure if this is the problem. Setting chflags with os.system('chflags nouchg') command does not work, not even with sudo, and not even in the Terminal doing it manually.
Using os.rename() in Python corrupts the file
Using os.system('mv oldFile.txt newFile.txt') corrupts the file
Manually using mv command in Terminal corrupts the file
Manually changing the filename in the Finder does not (wtf?)
I kept looking for workarounds but nothing did the job.
Round 2:
Progress!
After much tinkering, I discovered a hidden character inside the file. I ran cat /path/longfilename.txt in Terminal, selected and copied the output and pasted into VScode. Here is what I saw:
Somehow a hidden character is getting into the project code number.
Pasting it into a Unicode search engine it came up as a ZERO WIDTH NO-BREAK SPACE also known in Unicode as EF BB BF. However, when pasting this symbol into TextMate it shows up as <U+FEFF> which is?...
The Byte Order Mark!
Opening a normal utf-8 text file in a hex editor also shows the files starting with EFBBBF for the BOM.
Now, the text file being read and parsed at first has no blank lines to start the file, so I added a line break, and also tried adding some spaces. This time when writing the file I could open it, however, after sending it to the trash, the same behavior occurred and the file was broken again. It seems that because other corrupted versions were in the trash, it added the symbol back to the file name for some reason.
So what appears to be happening, for whatever reason, when Python opens the text file I'm parsing that has no line break at the top, it seems to be grabbing the BOM from the file and adding that to the first variable which is grabbing the first line of the text file. Since that text is a number code that starts the file name, the BOM symbol is being added to the file name as well as the code inside the text file.
Just... wow
The Current Solution:
I have to leave a blank line at the start of the text file that I'm opening and parsing and a simple line break won't do it. I have no idea why this is. I added some spaces for good measure because randomly the BOM would be added to the variable and filename again. So far (knock on wood) as long as the first line of that initial file has some spaces and then a line break, and previous corrupted files have been deleted from the trash, a long file name can be used for all the files I'm creating and writing to without any problems.
This corruption even persists if I remove the encoding flag from both of the open functions I'm using (one to read and parse, the other to create and write).
If anyone knows why this is happening, please share. I've never seen it mentioned before. I'm not sure if it's a python 3.8 bug, a mac OS bug, the way TextMate wrote the original file, or a combination of these.
Correct Solution:
Thanks to #tripleee for the proper way to handle this, as I don't remember seeing this before, though I haven't been using python for very long.
In order to ignore the BOM, reading in the text file to be parsed with an encoding='utf-8-sig' does the job. Seems to be why it exists. :)
Problem solved.

How to detect a file has been truncated while reading

I'm reading lines from a group of files (log files) following them as they are written using pyinotify.
I'm opening and reading the files with python native methods:
file = open(self.file_path, 'r')
# ... later
line = file.readline()
This is generally stable and can handle the file being deleted and re-created. pyinotify will notify the unlink and subsequent link.
However some log files are not being deleted. Instead they are being truncated and new content written to the beginning of the same file.
I'm having trouble reliably detecting when this has occurred since pyinotify will simply report only a write. The only evidence I currently get is that pyinotify reports a write and readline() returns an empty string. BUT, it is possible that two subsiquent writes could trigger the same behavior.
I have thought of comparing a file's size to file.tell() but according to the documentation tell produces an opaque number and it appears this can't be trusted to be a number of bytes.
Is there a simple way to detect a file has been truncated while reading from it?
Edit:
Truncating a file can be simulated with simple shell commands:
echo hello > test.log
echo hello >> test.log
# Truncate test.log
echo goodbye > test.log
To compliment this, a simple python script can be used to confirm that file.tell() does not reduce when the file is truncated:
foo = open('./test.log', 'r')
line = foo.readline()
while line != '':
print(foo.tell())
print(line)
line = foo.readline()
# Put a breakpoint on the following line and
# truncate the file before it executes
print(foo.tell())

Use os.lseek(file.fileno(),0,os.SEEK_CUR) to obtain a byte offset without moving the file pointer. You can’t really use the regular file interface to find out, not least because it may have buffered text (that no longer exists) that it hasn’t made visible to Python yet. If the file is not a byte stream (e.g., the default open in Python 3), it could even be in the middle of a multibyte character and be unable to proceed even if the file immediately grew back past your file offset.

Python subprocess.call with multiline string EOF

I've hit a issue that I don't really understand how to overcome. I'm trying to create a subprocess in python to run another python script. Not too difficult. The issue is I'm unable to get around is EOF error when a python file includes a super long string.
Here's an example of what my files look like.
Subprocess.py:
### call longstr.py from the primary pyfile
subprocess.call(['python longstr.py'], shell = True)
Longstr.py
### called from subprocess.py
### the actual string is a lot longer; this is an example to illustrate how the string is formatted
lngstr = """here is a really long
string (which has many n3w line$ and "characters")
that are causing the python file to state the file is ending early
"""
print lngstr
Printer error in terminal
SyntaxError: EOF while scanning triple-quoted string literal
As a work around, I tried to remove all linebreaks as well as all spaces to see if it was due to it being multi-line. That still returned the same result.
My assumption is that when the subprocess is running and the shell is doing something with the file contents, when the new line is reached the shell itself is freaking out and that's what's terminating the process; not the file.
What is the correct workaround for having subprocess run a file like this?
Thank you for your help.

Answering my own question here; my problem was that I didn't file.close() before trying to execute a subprocess.call.
If you encounter this problem, and are working with recently written files this could be your issue too. Thank you to everyone who read or responded to this thread.

Python not inserting an EOF character after file close?

I'm having a strange problem where some Python code that prints to a file is not inserting an EOF character. Basically, the Python script generates runscripts to later be submitted as jobs on a cluster. I essentially wrote the entire runscript between """'s, allowing for variables to be plugged in (to vary some parameters in my simulation). I write the runscripts using the
with open(file_name, 'w') as runscrpt:
runscrpt.write("""ENTIRE_FILE_CONTENTS_HERE""")
syntax. I can give the actual code if necessary but it's not much more than above. Despite the script running fine and generating all of my runsripts, whenever I submitted them nothing happened. It took me a long time to figure out why, but it's because they're missing an EOF character. I can fix it by, for example, opening one, adding some trailing whitespace or a blank line somewhere in vim, and resaving the file.
Why isn't Python inserting the EOF character, and is there a better way to fix this than manually making trivial edits to all the files with vim?

Sounds like you mean there is no EOL (not EOF!) at the end, because that's what diff will typically tell you. Just add a newline at the end of the write (make sure there is a newline before the final """ terminator, or write a separate newline explicitly).
with open(file_name, 'w') as runscript:
runscript.write("""ENTIRE_FILE_CONTENTS_HERE\n""")
(As a bonus, I added the missing vowel.)

Why doesn't my bash script read lines from a file when called from a python script?

I am trying to write a small program in bash and part of it needs to be able to get some values from a txt file where the different files are separated by a line, and then either add each line to a variable or add each line to one array.
So far I have tried this:
FILE=$"transfer_config.csv"
while read line
do
MYARRAY[$index]="$line"
index=$(($index+1))
done < $FILE
echo ${MYARRAY[0]}
This just produces a blank line though, and not what was on the first line of the config file.
I am not returned with any errors which is why I am not too sure why this is happening.
The bash script is called though a python script using os.system("$HOME/bin/mcserver_config/server_transfer/down/createRemoteFolder"), but if I simply call it after the python program has made the file which the bash script reads, it works.
I am almost 100% sure it is not an issue with the directories, because pwd at the top of the bash script shows it in the correct directory, and the python program is also creating the data file in the correct place.
Any help is much appreciated.
EDIT:
I also tried the subprocess.call("path_to_script", shell=True) to see if it would make a difference, I know it is unlikely but it didn't.

I suspect that when calling the bash script from python, having just created the file, you are not really finished with that file: you should either explicitly close the file or use a with construct.
Otherwise, the written data is still in any buffer (from the file object, or in the OS, or wherever). Only closing (or at least flushing) the file makes sure the data is indeed in the file.
BTW, instead of os.system, you should use the subprocess module...

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.