Pandas Output File not separating into different lines - python

I have this:
with open(str(ssis_txt_file_names_only[a]) + '.dts', 'w', encoding='utf16') as file:
whatever = whatever.replace("\n","")
print(whatever)
file.write(str(whatever))
When I do a print(whatever) all of the text appears on 1 line instead of broken up. Do anyone know what might be the cause?
Currently, my output looks like this:
>N</IsConnectionProperty> <Flags> 0</Flags> </AdapterProperty> <AdapterProperty>
What I want is this:
>N<I/IsConnectionProperty>
<Flags> 0</Flags>
</AdapterProperty>
<AdapterProperty>
Shouldn't the \n be doing this?

Your line whatever = whatever.replace("\n","") is replacing all linebreaks with nothing, so that's the culprit.
To your issue in the comments, Notepad doesn't recognize \n only as a linebreak; it needs the full Windows-style \r\n. Chances are if you open it in another editor, you'll see the linebreaks if you comment out the .replace line. Alternatively, if you make the line read whatever = whatever.replace("\n","\r\n"), it should display as expected in Notepad.

Related

Spaces are different in the same sentence/string

I have some stuff in an Excel spreadsheet, which is loaded into a webpage, where the content is displayed. However, what I have noticed is, that some of the content has weird formatting, i.e. a sudden line shift or something.
Then I just tried to copy the text from the spreadsheet, and pasting it into Notepad++, and enabled "Show White Space and Tab", and then the output was this:
The second line is the one directly copied from the spreadsheet, where the first one is just where I copied the string into a variable in Python, printed it, and then copied the output from the output console.
And as you can see the first line has all dots for space, where the other misses some dots. And I have an idea that that is what is doing this trickery, especially because it's at those place the line shift happens.
I have tried to just do something like:
import pandas as pd
data = pd.read_excel("my_spreadsheet.xlsx")
data["Strings"] = [str(x).replace(" ", " ") for x in data["Strings"]]
data.to_excel("my_spreadsheet.xlsx", index=False)
But that didn't change anything, as if I copied it straight from the output console.
So yeah, is there any easy way to make spaces the same type of spaces, or do I have to do something else ?
I think you would need to figure out which exact character is being used there.
You can load the file and print out the characters one by one together with the character code to figure out what's what.
See the code example below. I added some code to skip alphanumeric characters to reduce the actual output somewhat...
with open("filename.txt") as infile:
text = infile.readlines()
def print_ordinal(text: str, skip_alphanum: bool=True):
for line in text:
for character in line:
if not(skip_alphanum and character.isalnum()):
print(f"{character} - {ord(character)}")
print_ordinal(text)

Write new line at the end of a file

I am working with a numpy array in python. I want to print the array and its properties to a txt output. I want the text output to end with a blank line. How can I do this?
I have tried:
# Create a text document of the output
with open("demo_numpy.txt","w") as text:
text.write('\n'.join(map(str, [a,shape,size,itemsize,ndim,dtype])) + '\n')
And also:
# Create a text document of the output
with open("demo_numpy.txt","w") as text:
text.write('\n'.join(map(str, [a,shape,size,itemsize,ndim,dtype])))
text.write('\n')
However, when I open the file in GitHub desktop, I still get the indication that the last line of the file is "dtype"
When you do "\n".join( ... ) you will get a string of the following form:
abc\ndef\nghi\nhjk
-- in other words, it won't end with \n.
If your code writes another \n then your string will be of the form
abc\ndef\nghi\nhjk\n
But that does not put a blank line at the end of your file because textfiles are supposed to have lines that end in \n. That is what the Posix standard says.
So you need another \n so that the last two lines of your file are
hjk\n
\n
Python will not choke if you ask it to read a textfile where the final trailing \n is missing. But it also won't treat a single trailing \n in a textfile as a blank line. It would not surprise me to learn that GitHub does likewise.
This was solved using the Python 3.x print function, which automatically inserts a new line at the end of each print statement.
Here is the code:
with open("demo_numpy.txt","w") as text:
print(a, file = text)
text.close()
Note- apparently it is more appropriate to use the print function rather than .write when dealing with string values as opposed to binary files.

why different download way result in different display?

When i down the file on the web with my firefox,
http://quotes.money.163.com/service/lrb_000559.html
it looks fine in my EXCEL.
When i down the file with my python code,
from urllib.request import urlopen
url="http://quotes.money.163.com/service/lrb_000559.html"
html=urlopen(url)
outfile=open("g:\\000559.csv","w")
outfile.write(html.read().decode("gbk"))
outfile.close()
it looks stange, when open it with my EXCEL,there is one line filled with proper content ,and one line filled with blank ,you can try it in your pc.
Why will different download way result in different display ?
My guess is that line endings are changed when decoding and writing the result in python. Try using a binary file instead. Off the top of my head, I think it would go something like this:
outfile=open("g:\\000559.csv","wb")
outfile.write(html.read())
Add a 'b' flag to the file open, i.e. change this:
outfile=open("g:\\000559.csv","w")
To this:
outfile=open("g:\\000559.csv","wb")
Explanation here. The original file had a \r\n, and Python is converting the \n to \r\n, meaning you have an extra carriage return at the end of every line (\r\r\n).

File reading and regex - Python

I read a file which has a line : Fixes: Saurabh Likes python
I want to remove the Fixes: part of above line. I am employing regex for that
but the snippet below returns output like
Saurabh Likes python\r
I am wondering where \r is coming from. I tried all strip options for removing it like rstrip(), lstrip(), etc. But nothing worked. Could anybody suggest me the way to get rid of \r.
patternFixes ='\s*'+'Fixes'+':'+'\s*'
matchFixes= re.search(patternFixes,line, re.IGNORECASE)
if matchFixes:
patternCompiled = re.compile(patternFixes)
line=patternCompiled.sub("", line)
#line=line.lstrip()
relevantInfo = relevantInfo+line
continue
Thanks in advance!
-Saurabh
Suggestion to get rid of \r:
I suppose you have opened your file using open(filename). Following the manual of open:
If mode is omitted, it defaults to 'r'. ... In addition to the
standard fopen() values mode may be 'U' or 'rU'. Python is usually
built with universal newlines support; supplying 'U' opens the file as
a text file, but lines may be terminated by any of the following: the
Unix end-of-line convention '\n', the Macintosh convention '\r', or
the Windows convention '\r\n'. All of these external representations
are seen as '\n' by the Python program.
So, in short, please try to open your file using 'rU' and see if the \r vanishes:
with open(filename, "rU") as f:
# do your stuff here.
...
Does the \r vanish in your output?
Of course your code looks rather clunky, but other have already commented on this part.
You probably opened the file in binary mode (open(filename, "rb") or something like that). Don't do this if you're working with text files.
Use open(filename) instead. Now Python will automatically normalize all newlines to \n, regardless of the current platform.
Also, why not simply patternFixes = r'\s*Fixes:\s*'? Why all the +es?
Then, you're doing a lot of unnecessary stuff like recompiling a regex over and over.
So, my suggestion (which does the same thing as your code (plus the file handling):
r = re.compile(r'\s*Fixes:\s*')
with open(filename) as infile:
relevantInfo = "".join(r.sub("", line) for line in infile if "Fixes:" in line)
>>> import re
>>> re.sub('Fixes:\s*', '', 'Fixes: Saurabh Likes python')
'Saurabh Likes python'
No '\r'
>>> re.sub('\s*'+'Fixes'+':'+'\s*', '', 'Fixes: Saurabh Likes python')
'Saurabh Likes python'
No '\r' again
can you provide more details on how to reproduce?
EDIt cannot reproduce with your code neither
>>> line = 'Fixes: Saurabh Likes python'
>>> patternFixes ='\s*'+'Fixes'+':'+'\s*'
>>> matchFixes= re.search(patternFixes,line, re.IGNORECASE)
>>> if matchFixes:
... patternCompiled = re.compile(patternFixes)
... line=patternCompiled.sub("", line)
... print line
... line=line.lstrip()
... print line
...
Saurabh Likes python
Saurabh Likes python
>>>
The '\r' is a carriage return -- http://en.wikipedia.org/wiki/Carriage_return, and it's being picked up from your file.
I will note that if all the lines you need to 'fix' actually DO start with "Fixes: " and that's all you want to change, you could just do something like:
line = line[line.find('Fixes: ')+7:-1]
Saves you all the regex stuff. Not sure on performance, though. And this SHOULD kill your '\r's at the same time.

Parse log file in python

I have a log file that has lines that look like this:
"1","2546857-23541","f_last","user","4:19 P.M.","11/02/2009","START","27","27","3","c2546857-23541",""
Each line in the log as 12 double quote sections and the 7th double quote section in the string comes from where the user typed something into the chat window:
"22","2546857-23541","f_last","john","4:38 P.M.","11/02/2009","
What's up","245","47","1","c2546857-23541",""
This string also shows the issue I'm having; There are areas in the chat log where the text the user typed is on a new line in the log file instead of the same line like the first example.
So basically I want the lines in the second example to look like the first example.
I've tried using Find/Replace in N++ and I am able to find each "orphaned" line but I was unable to make it join the line above it.
Then I thought of making a python file to automate it for me, but I'm kind of stuck about how to actually code it.
Python errors out at this line running unutbu's code
"1760","4746880-00129","bwhiteside","tom","11:47 A.M.","12/10/2009","I do not see ^"refresh your knowledge
^" on the screen","422","0","0","c4746871-00128",""
The csv module is smart enough to recognize when a quoted item is not finished (and thus must contain a newline character).
import csv
with open('data.log',"r") as fin:
with open('data2.log','w') as fout:
reader=csv.reader(fin,delimiter=',', quotechar='"', escapechar='^')
writer=csv.writer(fout, delimiter=',',
doublequote=False, quoting=csv.QUOTE_ALL)
for row in reader:
row[6]=row[6].replace('\n',' ')
writer.writerow(row)
If you data is valid CSV you can use Python's csv.reader class. It should work just fine with your sample data. It may not work correctly depending an what an embeded double-quote looks like from the source system. See: http://docs.python.org/library/csv.html#module-contents.
Unless I'm misunderstanding the problem. You simply need to read in the file and remove any newline characters that occur between double quote characters.

Categories