Why do different download methods result in a different display? - python

When I download the file from the web with Firefox,
http://quotes.money.163.com/service/lrb_000559.html
it looks fine in Excel.
When I download the file with my Python code,
from urllib.request import urlopen
url="http://quotes.money.163.com/service/lrb_000559.html"
html=urlopen(url)
outfile=open("g:\\000559.csv","w")
outfile.write(html.read().decode("gbk"))
outfile.close()
it looks strange when I open it in Excel: one line is filled with the proper content, the next line is blank, and so on. You can try it on your PC.
Why do different download methods result in a different display?

My guess is that line endings are changed when decoding and writing the result in python. Try using a binary file instead. Off the top of my head, I think it would go something like this:
outfile=open("g:\\000559.csv","wb")
outfile.write(html.read())

Add a 'b' flag to the file open, i.e. change this:
outfile=open("g:\\000559.csv","w")
To this:
outfile=open("g:\\000559.csv","wb")
Explanation: the original file already ends each line with \r\n. In text mode on Windows, Python converts every \n to \r\n on write, so you end up with an extra carriage return at the end of every line (\r\r\n).
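To see the effect without needing the original download, here is a minimal sketch. It assumes a made-up two-line CSV string and writes to a temporary file. Passing newline="\r\n" explicitly imitates what text mode does by default on Windows, so the result should be the same on any platform. The helper name write_and_read_back is hypothetical.

```python
import os
import tempfile

# Text that already uses Windows line endings, like the decoded download.
data = "a,b\r\nc,d\r\n"


def write_and_read_back(text, **open_kwargs):
    """Write text in text mode, then return the raw bytes that landed on disk."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", encoding="gbk", **open_kwargs) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# newline="\r\n" turns every "\n" into "\r\n" on write (the Windows default).
translated = write_and_read_back(data, newline="\r\n")

# newline="" disables newline translation entirely.
untranslated = write_and_read_back(data, newline="")

print(translated)    # each line now ends with \r\r\n
print(untranslated)  # line endings are unchanged
```

So for the question's code, either write the raw bytes with "wb", or keep the decode step and open the output file with newline="" so Python leaves the existing \r\n alone.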

Related

Strange behavior when trying to create and write to a text file on macOS [duplicate]

This question already has answers here:
Convert UTF-8 with BOM to UTF-8 with no BOM in Python
(7 answers)
Closed last year.
I'm opening a plain text file, parsing it, and adding different lines to existing, empty string variables. I add these variables into a new variable that is a multi-line f-string. Writing the data to a new text file is not behaving as expected.
Reading the original file works fine. Text is properly parsed, variables populated.
The multi-line f-string variable seems fine. It prints normally. I even tried formatting it in different ways, which I show below.
When writing to a new file, that's where the strangeness starts. I've tried 2 ways:
Straight coding the open function with w or w+
Adding the above to a function and using that inside main()
The file is saved to disk with the correct name. Trying to double-click open in Finder produces nothing. Right-click to open produces nothing. Trying to move to trash with command+delete gives an error (error -43).
It sounds like the file goes to trash, but as the file disappears from the folder a new one is created with the same name in its place.
If I try to open in TextMate via File > Open, it opens as a blank file with no errors.
Since I can't get rid of the file, I have to delete the directory and create the directory again with the same name, or force delete in Terminal using rm. Restarting the system does not help. Relaunching Finder does nothing. Saving text files from other apps works fine. Directory is chmod 755.
If I copy an existing text file into the output directory, rename it to what the file is expected to be named, and let python overwrite the contents, it doesn't work either. The file modification date changes (and I see the file "blink" in Finder) but the contents remain the same. However, the file is not corrupted and opens normally.
If I do the same but delete the text inside of the copied file first, then run the script, python writes no data to the file, I can't open it by double-clicking on it, and I get error -43 again with the odd non-trashing behavior.
The strangest thing is this: if I add another with open() at the end of the script, open the file that was just created and supposedly written to, and print its contents, the contents print. It's like when the script ends the file contents are being removed or it's being corrupted somehow. I tried closing the file inside the script even though it's not needed, but the same behavior persists.
Code:
Here's the code for writing:
import os
FORMAT = 'utf-8'
OUTPUT_DIR = '/Path/To/SaveFolder'
# as a function
def write_to_file(content, fpath, name):
    the_file = os.path.join(fpath, name)
    with open(the_file, 'w+', encoding=FORMAT) as t:
        t.write(content)
def main():
    print(f" Writing File...\n")
    filename = f"{pcode}_{author}_{title}_text.txt"
    write_to_file(multiline_var, OUTPUT_DIR, filename)
# or hard coded in main()
def main():
    print(f" Writing File...\n")
    filename = f"{pcode}_{author}_{title}_text.txt"
    the_file = os.path.join(OUTPUT_DIR, filename)
    with open(the_file, 'w+', encoding=FORMAT) as t:
        t.write(multiline_var)
I have tried using w, w+, wt, and wt+, with and without encoding='utf-8'.
Here is an example of the multi-line f-string variable:
# using triple quotes
multiline_var = f"""
[PROJ-{pcode}] {full_title} by {author}
{description}
{URL}
{DIVIDER_1}
{TEXT_BLURB}
Some text here and then {SOME_MORE_TEXT}
{DIVIDER_1}
{SOME_LINK}
"""
# or inside parens
multiline_var = (
f"[PROJ-{pcode}] {full_title} by {author}\n"
f"{description}\n\n"
f"{URL}\n"
f"{DIVIDER_1}\n"
f"{TEXT_BLURB}\n\n"
f"Some text here and then {SOME_MORE_TEXT}\n"
f"{DIVIDER_1}\n\n"
f"{SOME_LINK}"
)
Using exiftool on the text file shows the following, so it looks like the data is there but must be corrupted:
File Size : 1797 bytes
File Modification Date/Time : 2021:12:31 15:55:39-05:00
File Access Date/Time : 2021:12:31 15:58:13-05:00
File Inode Change Date/Time : 2021:12:31 15:55:39-05:00
File Permissions : -rw-r--r--
File Type : TXT
File Type Extension : txt
MIME Type : text/plain
MIME Encoding : utf-8
Byte Order Mark : No
Newlines : Unix LF
Line Count : 55
Word Count : 181
Not sure what I'm doing wrong. VScode shows no syntax errors in the script. There are no errors in Terminal when running the script. Have I made some simple mistake in the above code? Maybe the fstring variable is causing a problem?
Thanks to @bnaecker for leading me to the solution to this problem.
It appeared that when creating/writing to a text file with a long name, Python can corrupt it. Not sure why, as I save long names for images with Python image libraries all the time. Using a short name like "MyFile.txt" it worked just fine, but that was a red herring.
I have updated this post with my journey to the final solution for using the long names that are needed for my project, though I'm not sure why the problem exists.
First Attempts:
So far, creating the file with a short name and then renaming it to a long one has failed every time. I did notice that Python is locking the file it creates and never unlocks it. Not sure if this is the problem. Setting chflags with an os.system('chflags nouchg') command does not work, not even with sudo, and not even when doing it manually in the Terminal.
Using os.rename() in Python corrupts the file
Using os.system('mv oldFile.txt newFile.txt') corrupts the file
Manually using mv command in Terminal corrupts the file
Manually changing the filename in the Finder does not (wtf?)
I kept looking for workarounds but nothing did the job.
Round 2:
Progress!
After much tinkering, I discovered a hidden character inside the file. I ran cat /path/longfilename.txt in Terminal, selected and copied the output and pasted into VScode. Here is what I saw:
Somehow a hidden character is getting into the project code number.
Pasting it into a Unicode search engine, it came up as ZERO WIDTH NO-BREAK SPACE, whose UTF-8 encoding is the byte sequence EF BB BF. However, when pasting this symbol into TextMate it shows up as <U+FEFF> which is?...
The Byte Order Mark!
Opening a normal utf-8 text file in a hex editor also shows the files starting with EFBBBF for the BOM.
Now, the text file being read and parsed at first has no blank lines to start the file, so I added a line break, and also tried adding some spaces. This time when writing the file I could open it, however, after sending it to the trash, the same behavior occurred and the file was broken again. It seems that because other corrupted versions were in the trash, it added the symbol back to the file name for some reason.
So what appears to be happening, for whatever reason, when Python opens the text file I'm parsing that has no line break at the top, it seems to be grabbing the BOM from the file and adding that to the first variable which is grabbing the first line of the text file. Since that text is a number code that starts the file name, the BOM symbol is being added to the file name as well as the code inside the text file.
Just... wow
The Current Solution:
I have to leave a blank line at the start of the text file that I'm opening and parsing and a simple line break won't do it. I have no idea why this is. I added some spaces for good measure because randomly the BOM would be added to the variable and filename again. So far (knock on wood) as long as the first line of that initial file has some spaces and then a line break, and previous corrupted files have been deleted from the trash, a long file name can be used for all the files I'm creating and writing to without any problems.
This corruption even persists if I remove the encoding flag from both of the open functions I'm using (one to read and parse, the other to create and write).
If anyone knows why this is happening, please share. I've never seen it mentioned before. I'm not sure if it's a Python 3.8 bug, a macOS bug, the way TextMate wrote the original file, or a combination of these.
Correct Solution:
Thanks to @tripleee for the proper way to handle this. I don't remember seeing it before, though I haven't been using Python for very long.
To ignore the BOM, read the text file to be parsed with encoding='utf-8-sig'. That seems to be why it exists. :)
Problem solved.
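For anyone landing here later, the difference between the two codecs can be reproduced in a few lines. This is only a sketch with an invented two-line input file (the project code 12345 is a placeholder), not the original script. It also shows that str.strip() does not remove U+FEFF, because Python does not treat that character as whitespace.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Simulate an editor that saves UTF-8 with a BOM at the start of the file.
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "12345\nSome title\n".encode("utf-8"))

    # Plain utf-8 keeps the BOM as the first character of the first line.
    with open(path, encoding="utf-8") as f:
        first_plain = f.readline().strip()

    # utf-8-sig silently drops a leading BOM if one is present.
    with open(path, encoding="utf-8-sig") as f:
        first_sig = f.readline().strip()
finally:
    os.remove(path)

print(repr(first_plain))  # the invisible U+FEFF is still in front of the digits
print(repr(first_sig))    # just the digits
```

Any file name built from first_plain would carry the invisible character along, which matches the behavior described above.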

Write to an HTML file with Python

I have a couple of graphs I need to display in my browser offline. MPLD3 outputs the HTML as a string, and I need to make an HTML file containing that string. What I'm doing right now is:
tohtml = mpld3.fig_to_html(fig, mpld3_url='/home/pi/webpage/mpld3.js',
d3_url='/home/pi/webpage/d3.js')
print(tohtml)
Html_file = open("graph.html","w")
Html_file.write(tohtml)
Html_file.close();
tohtml is the variable where the HTML string is stored. I've printed this string to the terminal and then pasted it into an empty HTML file, and I get my desired result. However, when I run my code, I get an empty file named graph.html.
It seems like you may be reinventing the wheel here. Have you tried something like,
mpld3_url='/home/pi/webpage/mpld3.js'
d3_url='/home/pi/webpage/d3.js'
with open('graph.html', 'w') as fileobj:
    mpld3.save_html(fig, fileobj, d3_url=d3_url, mpld3_url=mpld3_url)
Note: this is untested. I'm just going off the mpld3.save_html documentation and prior knowledge of Python I/O streams.
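If you would rather keep the string-based approach, a with block is a reasonable way to make sure the file is flushed and closed before anything else reads it. The sketch below does not diagnose the empty file from the question. It uses a placeholder HTML string in place of the mpld3 output and a temporary directory in place of the real output folder, both of which are assumptions for illustration.

```python
import os
import tempfile

# Placeholder standing in for the string returned by mpld3.fig_to_html(fig, ...).
html_string = "<html><body><p>placeholder graph</p></body></html>"

with tempfile.TemporaryDirectory() as out_dir:
    # An absolute path avoids surprises about the current working directory.
    out_path = os.path.join(out_dir, "graph.html")

    # The file is flushed and closed when the with block exits.
    with open(out_path, "w", encoding="utf-8") as html_file:
        html_file.write(html_string)

    # Read it back to confirm the content reached the disk.
    with open(out_path, encoding="utf-8") as html_file:
        round_trip = html_file.read()

print(round_trip == html_string)
```

A relative name like "graph.html" is resolved against the current working directory of the process, so it is worth checking that you are looking at the same graph.html the script wrote.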

Pandas Output File not separating into different lines

I have this:
with open(str(ssis_txt_file_names_only[a]) + '.dts', 'w', encoding='utf16') as file:
    whatever = whatever.replace("\n","")
    print(whatever)
    file.write(str(whatever))
When I do a print(whatever), all of the text appears on one line instead of being broken up. Does anyone know what might be the cause?
Currently, my output looks like this:
>N</IsConnectionProperty> <Flags> 0</Flags> </AdapterProperty> <AdapterProperty>
What I want is this:
>N</IsConnectionProperty>
<Flags> 0</Flags>
</AdapterProperty>
<AdapterProperty>
Shouldn't the \n be doing this?
Your line whatever = whatever.replace("\n","") is replacing all linebreaks with nothing, so that's the culprit.
To your issue in the comments: Notepad doesn't recognize a bare \n as a line break; it needs the full Windows-style \r\n. Chances are that if you comment out the .replace line and open the file in another editor, you'll see the line breaks. Alternatively, if you make the line read whatever = whatever.replace("\n","\r\n"), it should display as expected in Notepad.
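Both points can be checked with a short sketch. The XML fragment is a shortened stand-in for the real data, and the temporary file replaces the .dts path from the question. Instead of calling .replace("\n", "\r\n") by hand, this variant passes newline="\r\n" to open(), which asks Python to write every \n as \r\n regardless of platform. That is a swapped-in alternative, not what the answer above literally proposes.

```python
import os
import tempfile

xml_text = "<Flags> 0</Flags>\n</AdapterProperty>\n<AdapterProperty>\n"

# What the question's code does: every line break is deleted.
collapsed = xml_text.replace("\n", "")

fd, path = tempfile.mkstemp(suffix=".dts")
os.close(fd)
try:
    # newline="\r\n" makes Python write each "\n" as "\r\n" on any platform.
    with open(path, "w", encoding="utf16", newline="\r\n") as f:
        f.write(xml_text)

    # newline="" on read returns the line endings exactly as stored.
    with open(path, "r", encoding="utf16", newline="") as f:
        written = f.read()
finally:
    os.remove(path)

print(repr(collapsed))
print(repr(written))
```

One caveat: on Windows, text mode already turns \n into \r\n by default, so doing the manual replace and then writing in default text mode would produce \r\r\n.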

Python getting unrecognizable characters after reading data from file

I'm using Python to recreate a program that was written in Fortran 95. The program opens a binary file containing only float numbers and reads a specific value. It works just fine in Fortran: when I execute the code, I get 284.69, for example.
However, when I try to do the same in Python, reading the entire first line of the file, I get characters like these:
Y{�C�x�Cz~�C�x�C�j�C�r�C�v�Ch�Ck�CVx�C
Here is how I open the file and read the values:
f = open(args.model_files[0], "r").readlines()
print str(f[0])
I can't provide a file as an example because it is too big, but I can confirm that it contains only float numbers.
I would like to at least understand what type of characters I'm getting, or what I'm doing wrong when opening the file. Any suggestion is welcome.
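One possible reading of the output above: opening the file with "r" makes Python interpret raw bytes as text, so the floats show up as gibberish. A binary file of floats is normally read in "rb" mode and decoded with the struct module. The sketch below assumes 4-byte little-endian floats with no Fortran record markers, which is a guess and not something the question confirms. The repeated "C" in the garbled output is at least consistent with that guess, since values around 284 have 0x43 (ASCII "C") as their high byte in that layout. The sample values other than 284.69 are invented.

```python
import os
import struct
import tempfile

# Build a small stand-in file; the real file layout is an assumption.
values = (284.69, 285.5, 290.25)
fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(struct.pack("<3f", *values))

    # "rb" returns bytes and performs no text decoding at all.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

# Interpret every 4 bytes as one little-endian single-precision float.
count = len(raw) // 4
floats = struct.unpack("<%df" % count, raw[:count * 4])

print(raw)     # the same kind of byte soup as in the question
print(floats)  # the numbers, up to single-precision rounding
```

If the Fortran program wrote the file with sequential unformatted I/O, each record is usually wrapped in length markers, and those would have to be skipped before unpacking.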

Parse log file in python

I have a log file that has lines that look like this:
"1","2546857-23541","f_last","user","4:19 P.M.","11/02/2009","START","27","27","3","c2546857-23541",""
Each line in the log has 12 double-quoted sections, and the 7th one holds whatever the user typed into the chat window:
"22","2546857-23541","f_last","john","4:38 P.M.","11/02/2009","
What's up","245","47","1","c2546857-23541",""
This string also shows the issue I'm having; There are areas in the chat log where the text the user typed is on a new line in the log file instead of the same line like the first example.
So basically I want the lines in the second example to look like the first example.
I've tried using Find/Replace in N++ and I am able to find each "orphaned" line but I was unable to make it join the line above it.
Then I thought of making a python file to automate it for me, but I'm kind of stuck about how to actually code it.
Python errors out at this line running unutbu's code
"1760","4746880-00129","bwhiteside","tom","11:47 A.M.","12/10/2009","I do not see ^"refresh your knowledge
^" on the screen","422","0","0","c4746871-00128",""
The csv module is smart enough to recognize when a quoted item is not finished (and thus must contain a newline character).
import csv
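with open('data.log',"r") as fin:
    with open('data2.log','w') as fout:
        reader=csv.reader(fin,delimiter=',', quotechar='"', escapechar='^')
        # doublequote=False needs an escapechar, or embedded quotes raise csv.Error
        writer=csv.writer(fout, delimiter=',', escapechar='^',
                          doublequote=False, quoting=csv.QUOTE_ALL)
        for row in reader:
            # join text that was split across lines inside the 7th field
            row[6]=row[6].replace('\n',' ')
            writer.writerow(row)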
If your data is valid CSV, you can use Python's csv.reader class. It should work just fine with your sample data. It may not work correctly depending on what an embedded double quote looks like from the source system. See: http://docs.python.org/library/csv.html#module-contents.
Unless I'm misunderstanding the problem, you simply need to read in the file and remove any newline characters that occur between double-quote characters.
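As a quick check of that claim, here is a small sketch that feeds the two sample lines from the question through csv.reader using in-memory streams instead of files. The extra .strip() on the 7th field is my own addition to drop the leading space left by the replaced newline. The sample has no embedded quotes, so the escapechar handling is left out here.

```python
import csv
import io

# The two sample records from the question; the second one spans two lines.
raw = (
    '"1","2546857-23541","f_last","user","4:19 P.M.","11/02/2009",'
    '"START","27","27","3","c2546857-23541",""\n'
    '"22","2546857-23541","f_last","john","4:38 P.M.","11/02/2009","\n'
    'What\'s up","245","47","1","c2546857-23541",""\n'
)

# csv.reader keeps reading lines until the open quote is closed.
reader = csv.reader(io.StringIO(raw, newline=""), delimiter=",", quotechar='"')
rows = []
for row in reader:
    row[6] = row[6].replace("\n", " ").strip()
    rows.append(row)

# Write the repaired rows back out with every field quoted.
out = io.StringIO(newline="")
writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)

print(out.getvalue())
```

With real files, the csv documentation recommends opening them with newline='' so the module can handle line endings itself.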
