Which newline character is in my CSV? - python

We receive a .tar.gz file from a client every day and I am rewriting our import process using SSIS. One of the first steps in my process is to unzip the .tar.gz file which I achieve via a Python script.
After unzipping we are left with a number of CSV files which I then import into SQL Server. As an aside, I am loading using the CozyRoc DataFlow Task Plus.
Most of my CSV files load without issue but I have five files which fail. By reading the log I can see that the process is reading the Header and First line as though there is no HeaderRow Delimiter (i.e. it is trying to import the column header as ColumnHeader1ColumnValue1
I took one of these CSVs, copied the top 5 rows into Excel, used Text-To-Columns to delimit the data then saved that as a new CSV file.
This version imported successfully.
That makes me think that somehow the original CSV isn't using {CR}{LF} as the row delimiter but I don't know how to check. Any suggestions?

I ended up using the suggestion commented by #vahdet because I already had notepad++ installed. I can't find the same option in EmEditor but it may exist
For those who are curious, the files are using {LF} which is consistent with the other files. My investigation continues...

Seeing that you have EmEditor, you can use EmEditor to find the eol character in two ways:
Use View > Character Code Value... at the end of a line to display a dialog box showing information about the character at the current position.
Go to View > Marks and turn on Newline Characters and CR and LF with Different Marks to show the eol while editing. LF is displayed with a down arrow while CRLF is a right angle.
Some other things you could try checking for are: file encoding, wrong type of data for a field and an inconsistent number of columns.

Related

Strange behavior when trying to create and write to a text file on macOS [duplicate]

This question already has answers here:
Convert UTF-8 with BOM to UTF-8 with no BOM in Python
(7 answers)
Closed last year.
I'm opening a plain text file, parsing it, and adding different lines to existing, empty string variables. I add these variables into a new variable that is a multi-line fstring. Trying to write the data to a new text file is not behaving as expected.
Reading the original file works fine. Text is properly parsed, variables populated.
The multi-line fstring variable seems fine. Prints normally. Even tried formatting it different ways which I show below.
When writing to a new file, that's where the strangeness starts. I've tried 2 ways:
Straight coding the open function with w or w+
Adding the above to a function and using that inside main()
The file is saved to disk with the correct name. Trying to double-click open in Finder produces nothing. Right-click to open produces nothing. Trying to move to trash with command+delete gives an error:
It sounds like the file goes to trash, but as the file disappears from the folder a new one is created with the same name in its place.
If I try to open in TextMate via File > Open, it opens as a blank file with no errors.
Since I can't get rid of the file, I have to delete the directory and create the directory again with the same name, or force delete in Terminal using rm. Restarting the system does not help. Relaunching Finder does nothing. Saving text files from other apps works fine. Directory is chmod 755.
If I copy an existing text file into the output directory, rename it to what the file is expected to be named, and let python overwrite the contents, it doesn't work either. The file modification date changes (and I see the file "blink" in Finder) but the contents remain the same. However, the file is not corrupted and opens normally.
If I do the same but delete the text inside of the copied file first, then run the script, python writes no data to the file, I can't open it by double-clicking on it, and I get error -43 again with the odd non-trashing behavior.
The strangest thing is this: if I add another with open() at the end of the script, and open the file that was just created and supposedly written to, and print its contents, the contents print. It's like when the script ends the file contents are being removed or its being corrupted somehow. Tried to close the file inside the script even though it's not needed, but same behavior persists.
Code:
Here's the code for writing:
FORMAT='utf-8'
OUTPUT_DIR = '/Path/To/SaveFolder'
# as a function
def write_to_file(content, fpath, name):
the_file = os.path.join(fpath, name)
with open(the_file, 'w+', encoding=FORMAT) as t:
t.write(content)
def main():
print(f" Writing File...\n")
filename = f"{pcode}_{author}_{title}_text.txt"
write_to_file(multiline_var, OUTPUT_DIR, filename)
# or hard coded in main()
def main():
print(f" Writing File...\n")
filename = f"{pcode}_{author}_{title}_text.txt"
the_file = os.path.join(OUTPUT_DIR, filename)
with open(the_file, 'w+', encoding=FORMAT) as t:
t.write(multiline_var)
I have tried using w w+ wt and wt+ and with and without encoding='utf-8'
Here is an example of multi-line fstring variable:
# using triple quotes
multiline_var = f"""
[PROJ-{pcode}] {full_title} by {author}
{description}
{URL}
{DIVIDER_1}
{TEXT_BLURB}
Some text here and then {SOME_MORE_TEXT}"
{DIVIDER_1}
{SOME_LINK}
"""
# or inside parens
multiline_var = (
f"[PROJ-{pcode}] {full_title} by {author}\n"
f"{description}\n\n"
f"{URL}\n"
f"{DIVIDER_1}\n"
f"{TEXT_BLURB}\n\n"
f"Some text here and then {SOME_MORE_TEXT}\n"
f"{DIVIDER_1}\n\n"
f"{SOME_LINK}"
)
Using exiftool on the text file shows the following, so it looks the data is there but must be corrupted:
File Size : 1797 bytes
File Modification Date/Time : 2021:12:31 15:55:39-05:00
File Access Date/Time : 2021:12:31 15:58:13-05:00
File Inode Change Date/Time : 2021:12:31 15:55:39-05:00
File Permissions : -rw-r--r--
File Type : TXT
File Type Extension : txt
MIME Type : text/plain
MIME Encoding : utf-8
Byte Order Mark : No
Newlines : Unix LF
Line Count : 55
Word Count : 181
Not sure what I'm doing wrong. VScode shows no syntax errors in the script. There are no errors in Terminal when running the script. Have I made some simple mistake in the above code? Maybe the fstring variable is causing a problem?
Thanks to #bnaecker for leading me to the solution to this problem.
It appeared that when creating/writing to a text file with a long name, Python can corrupt it. Not sure why, as I save long names for images with Python image libraries all the time. Using a short name like "MyFile.txt" it worked just fine, but that was a red herring.
I have updated this post with my journey to the final solution for using the long names that are needed for my project, though I'm not sure why the problem exists.
First Attempts:
So far creating using a short name and then renaming to a long one.... attempts have failed. I did notice that python is locking the file it creates and never unlocks it. Not sure if this is the problem. Setting chflags with os.system('chflags nouchg') command does not work, not even with sudo, and not even in the Terminal doing it manually.
Using os.rename() in Python corrupts the file
Using os.system('mv oldFile.txt newFile.txt') corrupts the file
Manually using mv command in Terminal corrupts the file
Manually changing the filename in the Finder does not (wtf?)
I kept looking for workarounds but nothing did the job.
Round 2:
Progress!
After much tinkering, I discovered a hidden character inside the file. I ran cat /path/longfilename.txt in Terminal, selected and copied the output and pasted into VScode. Here is what I saw:
Somehow a hidden character is getting into the project code number.
Pasting it into a Unicode search engine it came up as a ZERO WIDTH NO-BREAK SPACE also known in Unicode as EF BB BF. However, when pasting this symbol into TextMate it shows up as <U+FEFF> which is?...
The Byte Order Mark!
Opening a normal utf-8 text file in a hex editor also shows the files starting with EFBBBF for the BOM.
Now, the text file being read and parsed at first has no blank lines to start the file, so I added a line break, and also tried adding some spaces. This time when writing the file I could open it, however, after sending it to the trash, the same behavior occurred and the file was broken again. It seems that because other corrupted versions were in the trash, it added the symbol back to the file name for some reason.
So what appears to be happening, for whatever reason, when Python opens the text file I'm parsing that has no line break at the top, it seems to be grabbing the BOM from the file and adding that to the first variable which is grabbing the first line of the text file. Since that text is a number code that starts the file name, the BOM symbol is being added to the file name as well as the code inside the text file.
Just... wow
The Current Solution:
I have to leave a blank line at the start of the text file that I'm opening and parsing and a simple line break won't do it. I have no idea why this is. I added some spaces for good measure because randomly the BOM would be added to the variable and filename again. So far (knock on wood) as long as the first line of that initial file has some spaces and then a line break, and previous corrupted files have been deleted from the trash, a long file name can be used for all the files I'm creating and writing to without any problems.
This corruption even persists if I remove the encoding flag from both of the open functions I'm using (one to read and parse, the other to create and write).
If anyone knows why this is happening, please share. I've never seen it mentioned before. I'm not sure if it's a python 3.8 bug, a mac OS bug, the way TextMate wrote the original file, or a combination of these.
Correct Solution:
Thanks to #tripleee for the proper way to handle this, as I don't remember seeing this before, though I haven't been using python for very long.
In order to ignore the BOM, reading in the text file to be parsed with an encoding='utf-8-sig' does the job. Seems to be why it exists. :)
Problem solved.

python pandas.DataFrame.to_csv newline issue

Greeting Dear community.
I need to write a python pandas.DataFrame into csv
I try to use something like this:
dfPRR.to_csv(prrDumpName,index=False,quotechar="'",quoting=csv.QUOTE_ALL)
It works fine in some sample but for other sample with long string. I encounter the issue that one record breaks into 2 or 3 different line.
what I want my output file:
'RcdLn','GrpPIR','w_id','fwf_id','part_typ','l_id','head_num','site_num','filename'
'2','0','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'1100','1','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'2198','2','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'3296','3','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
Instead what I get...the file breaking into two seperate line::
'RcdLn','GrpPIR','w_id','fwf_id','part_typ','l_id','head_num','site_num','filename'
'2','0','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'1100','1','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'2198','2','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'3296','3','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
Is there an option to tell to_csv use a specific record delimitor ?
I do not see that option in documentation of to_csv
What my goal is to create a csv and then a loader program will load the csv
As for now, the loader program cannot load the file when this happens. As it is not able to tell the record is finish or not..?
I see other sample file that the record does not break into 2 or 3 lines when the string is not as long. This is the desired behavior.
How I can enforce this ??

Python not inserting an EOF character after file close?

I'm having a strange problem where some Python code that prints to a file is not inserting an EOF character. Basically, the Python script generates runscripts to later be submitted as jobs on a cluster. I essentially wrote the entire runscript between """'s, allowing for variables to be plugged in (to vary some parameters in my simulation). I write the runscripts using the
with open(file_name, 'w') as runscrpt:
runscrpt.write("""ENTIRE_FILE_CONTENTS_HERE""")
syntax. I can give the actual code if necessary but it's not much more than above. Despite the script running fine and generating all of my runsripts, whenever I submitted them nothing happened. It took me a long time to figure out why, but it's because they're missing an EOF character. I can fix it by, for example, opening one, adding some trailing whitespace or a blank line somewhere in vim, and resaving the file.
Why isn't Python inserting the EOF character, and is there a better way to fix this than manually making trivial edits to all the files with vim?
Sounds like you mean there is no EOL (not EOF!) at the end, because that's what diff will typically tell you. Just add a newline at the end of the write (make sure there is a newline before the final """ terminator, or write a separate newline explicitly).
with open(file_name, 'w') as runscript:
runscript.write("""ENTIRE_FILE_CONTENTS_HERE\n""")
(As a bonus, I added the missing vowel.)

Cannot read in files

I have a small problem with reading in my file. My code:
import csv as csv
import numpy
with open("train_data.csv","rb") as training:
csv_file_object = csv.reader(training)
header = csv_file_object.next()
data = []
for row in csv_file_object:
data.append(row)
data = numpy.array(data)
I get the error no such file "train_data.csv", so I know the problem lies with the location. But whenever I specify the pad like this: open("C:\Desktop...etc) it doesn't work either. What am I doing wrong?
If you give the full file path, your script should work. Since it is not, it must be that you have escape characters in your path. To fix this, use a raw-string to specify the file path:
# Put an 'r' at the start of the string to make it a raw-string.
with open(r"C:\path\to\file\train_data.csv","rb") as training:
Raw strings do not process escape characters.
Also, just a technical fact, not giving the full file path causes Python to look for the file in the directory that the script is launched from. If it is not there, an error is thrown.
When you use open() and Windows you need to deal with the backslashes properly.
Option 1.) Use the raw string, this will be the string prefixed with an r.
open(r'C:\Users\Me\Desktop\train_data.csv')
Option 2.) Escape the backslashes
open('C:\\Users\\Me\\Desktop\\train_data.csv')
Option 3.) Use forward slashes
open('C:/Users/Me/Desktop/train_data.csv')
As for finding the file you are using, if you just do open('train_data.csv') it is looking in the directory you are running the python script from. So, if you are running it from C:\Users\Me\Desktop\, your train_data.csv needs to be on the desktop as well.

Parsing large (20GB) text file with python - reading in 2 lines as 1

I'm parsing a 20Gb file and outputting lines that meet a certain condition to another file, however occasionally python will read in 2 lines at once and concatenate them.
inputFileHandle = open(inputFileName, 'r')
row = 0
for line in inputFileHandle:
row = row + 1
if line_meets_condition:
outputFileHandle.write(line)
else:
lstIgnoredRows.append(row)
I've checked the line endings in the source file and they check out as line feeds (ascii char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some python limitation here? The position in the file of the first anomaly is around the 4GB mark.
Quick google search for "python reading files larger than 4gb" yielded many many results. See here for such an example and another one which takes over from the first.
It's a bug in Python.
Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().
In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF."
Oddly, there is an almost exact copy of this function in Perl source code:
http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668
The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?]
At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.
And the work-around:
But note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().
The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
The code you've posted looks fine by itself, so I would suspect a bug in your Python build.
FWIW, the snippet would be a little cleaner if it used enumerate:
inputFileHandle = open(inputFileName, 'r')
for row, line in enumerate(inputFileHandle):
if line_meets_condition:
outputFileHandle.write(line)
else:
lstIgnoredRows.append(row)

Categories