File reading and regex - Python

I am reading a file that contains the line: Fixes: Saurabh Likes python
I want to remove the Fixes: part of that line. I am using a regex for this,
but the snippet below returns output like
Saurabh Likes python\r
I am wondering where the \r is coming from. I tried all the strip options for removing it, such as rstrip() and lstrip(), but nothing worked. Could anybody suggest a way to get rid of the \r?
patternFixes ='\s*'+'Fixes'+':'+'\s*'
matchFixes= re.search(patternFixes,line, re.IGNORECASE)
if matchFixes:
    patternCompiled = re.compile(patternFixes)
    line=patternCompiled.sub("", line)
    #line=line.lstrip()
    relevantInfo = relevantInfo+line
    continue
Thanks in advance!
-Saurabh

Suggestion to get rid of \r:
I suppose you opened your file using open(filename). According to the documentation for open:
If mode is omitted, it defaults to 'r'. ... In addition to the
standard fopen() values mode may be 'U' or 'rU'. Python is usually
built with universal newlines support; supplying 'U' opens the file as
a text file, but lines may be terminated by any of the following: the
Unix end-of-line convention '\n', the Macintosh convention '\r', or
the Windows convention '\r\n'. All of these external representations
are seen as '\n' by the Python program.
So, in short, please try to open your file using 'rU' and see if the \r vanishes:
with open(filename, "rU") as f:
    # do your stuff here.
    ...
Does the \r vanish in your output? (The 'U' flag is a Python 2 feature; Python 3 text mode does this translation by default, and 'U' was removed in Python 3.11.)
Of course your code looks rather clunky, but others have already commented on this part.

You probably opened the file in binary mode (open(filename, "rb") or something like that). Don't do this if you're working with text files.
Use open(filename) instead. In Python 3, text mode normalizes all newlines to \n regardless of the current platform (in Python 2, add the 'U' flag to get the same behavior everywhere).
Also, why not simply patternFixes = r'\s*Fixes:\s*'? Why all the +es?
Then, you're doing a lot of unnecessary stuff, like recompiling the regex over and over.
So, my suggestion (which does the same thing as your code, plus the file handling):
import re

r = re.compile(r'\s*Fixes:\s*')
with open(filename) as infile:
    relevantInfo = "".join(r.sub("", line) for line in infile if "Fixes:" in line)
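To see the difference between binary and text mode for yourself, here is a minimal Python 3 sketch. The sample file name and its contents are made up for the illustration; only open() and its standard modes are used.

```python
import os
import tempfile

# Create a small sample file with Windows (CRLF) line endings.
# The file name and contents are made up for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"Fixes: Saurabh Likes python\r\n")

# Binary mode returns the bytes exactly as stored, including the '\r'.
with open(path, "rb") as f:
    raw_line = f.readline()

# Text mode (the default) translates '\r\n' to '\n' while reading.
with open(path) as f:
    text_line = f.readline()

print(repr(raw_line))
print(repr(text_line))
```

If the first repr shows a \r and the second does not, the stray character is coming from how the file was opened rather than from the regex.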

>>> import re
>>> re.sub('Fixes:\s*', '', 'Fixes: Saurabh Likes python')
'Saurabh Likes python'
No '\r'
>>> re.sub('\s*'+'Fixes'+':'+'\s*', '', 'Fixes: Saurabh Likes python')
'Saurabh Likes python'
No '\r' again
Can you provide more details on how to reproduce this?
EDIT: I cannot reproduce it with your code either:
>>> line = 'Fixes: Saurabh Likes python'
>>> patternFixes ='\s*'+'Fixes'+':'+'\s*'
>>> matchFixes= re.search(patternFixes,line, re.IGNORECASE)
>>> if matchFixes:
...     patternCompiled = re.compile(patternFixes)
...     line=patternCompiled.sub("", line)
...     print line
...     line=line.lstrip()
...     print line
...
Saurabh Likes python
Saurabh Likes python
>>>

The '\r' is a carriage return -- http://en.wikipedia.org/wiki/Carriage_return, and it's being picked up from your file.
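I will note that if all the lines you need to 'fix' actually DO start with "Fixes: " and that's all you want to change, you could just do something like:
line = line[line.find('Fixes: ')+7:-1]
Saves you all the regex stuff. Not sure on performance, though. Note that the [:-1] only drops the last character of the line, so this kills the '\r' only when it is the final character; if the line ends in '\r\n', the '\r' is still there and you need something like rstrip('\r\n') instead.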
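A slightly more defensive variant of the no-regex idea, as a sketch: strip_fixes_prefix is a hypothetical helper name, and it assumes the prefix is spelled exactly "Fixes:". It does not depend on how many characters the line ending has.

```python
def strip_fixes_prefix(line):
    """Return the text after 'Fixes:' with surrounding whitespace removed.

    Hypothetical helper for illustration; lines without the prefix are
    returned with only their trailing line ending removed.
    """
    before, sep, after = line.partition("Fixes:")
    if not sep:
        return line.rstrip("\r\n")
    # strip() with no arguments also removes '\r' and '\n'.
    return after.strip()

print(repr(strip_fixes_prefix("Fixes: Saurabh Likes python\r\n")))
```

Because strip() treats '\r' as whitespace, the same call works for '\n', '\r\n' and bare '\r' endings.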

Related

Pandas Output File not separating into different lines

I have this:
with open(str(ssis_txt_file_names_only[a]) + '.dts', 'w', encoding='utf16') as file:
    whatever = whatever.replace("\n","")
    print(whatever)
    file.write(str(whatever))
When I do a print(whatever), all of the text appears on one line instead of being broken up. Does anyone know what might be the cause?
Currently, my output looks like this:
>N</IsConnectionProperty> <Flags> 0</Flags> </AdapterProperty> <AdapterProperty>
What I want is this:
>N</IsConnectionProperty>
<Flags> 0</Flags>
</AdapterProperty>
<AdapterProperty>
Shouldn't the \n be doing this?

Your line whatever = whatever.replace("\n","") is replacing all linebreaks with nothing, so that's the culprit.
To your issue in the comments: Notepad (at least in older versions of Windows) doesn't recognize a bare \n as a line break; it needs the full Windows-style \r\n. Chances are that if you open the file in another editor, you'll see the line breaks once you comment out the .replace line. Alternatively, if you make the line read whatever = whatever.replace("\n","\r\n"), it should display as expected in Notepad.
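One thing to watch with the explicit "\r\n" replacement: in text mode on Windows, Python already translates each "\n" to "\r\n" when writing, so the replacement can end up stored as "\r\r\n". A minimal Python 3 sketch that avoids this by passing newline="" to open, which turns that translation off. The sample text and the file name are made up for the illustration.

```python
import os
import tempfile

# Sample text standing in for the `whatever` variable from the question.
whatever = "<Flags> 0</Flags>\n</AdapterProperty>\n<AdapterProperty>\n"

path = os.path.join(tempfile.mkdtemp(), "example.dts")  # made-up file name

# newline="" turns off Python's own newline translation on write, so the
# explicit "\r\n" is stored as-is on every platform.
with open(path, "w", encoding="utf16", newline="") as f:
    f.write(whatever.replace("\n", "\r\n"))

# Read back without translation to check what was stored.
with open(path, "r", encoding="utf16", newline="") as f:
    stored = f.read()

print(repr(stored))
```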

Why python2 shows \r (Raw escaped) and python3 does not?

I was getting a path error (No file or directory found) for hours. After hours of debugging, I realised that Python 2 left an invisible '\r' at the end of each line.
The input: (trainval.txt)
Images/K0KKI1.jpg Labels/K0KKI1.xml
Images/2KVW51.jpg Labels/2KVW51.xml
Images/MMCPZY.jpg Labels/MMCPZY.xml
Images/LCW6RB.jpg Labels/LCW6RB.xml
The code I used to debug the error
with open('trainval.txt', "r") as lf:
    for line in lf.readlines():
        print ((line),repr(line))
        img_file, anno = line.strip("\n").split(" ")
        print(repr(img_file), repr(anno))
Python2 output:
("'Images/K0KKI1.jpg'", "'Labels/K0KKI1.xml\\r'")
('Images/2KVW51.jpg Labels/2KVW51.xml\r\n', "'Images/2KVW51.jpg Labels/2KVW51.xml\\r\\n'")
("'Images/2KVW51.jpg'", "'Labels/2KVW51.xml\\r'")
('Images/MMCPZY.jpg Labels/MMCPZY.xml\r\n', "'Images/MMCPZY.jpg Labels/MMCPZY.xml\\r\\n'")
("'Images/MMCPZY.jpg'", "'Labels/MMCPZY.xml\\r'")
('Images/LCW6RB.jpg Labels/LCW6RB.xml\r\n', "'Images/LCW6RB.jpg Labels/LCW6RB.xml\\r\\n'")
("'Images/LCW6RB.jpg'", "'Labels/LCW6RB.xml\\r'")
Python3 output:
Images/K0KKI1.jpg Labels/K0KKI1.xml
'Images/K0KKI1.jpg Labels/K0KKI1.xml\n'
'Images/K0KKI1.jpg' 'Labels/K0KKI1.xml'
Images/2KVW51.jpg Labels/2KVW51.xml
'Images/2KVW51.jpg Labels/2KVW51.xml\n'
'Images/2KVW51.jpg' 'Labels/2KVW51.xml'
Images/MMCPZY.jpg Labels/MMCPZY.xml
'Images/MMCPZY.jpg Labels/MMCPZY.xml\n'
'Images/MMCPZY.jpg' 'Labels/MMCPZY.xml'
Images/LCW6RB.jpg Labels/LCW6RB.xml
'Images/LCW6RB.jpg Labels/LCW6RB.xml\n'
'Images/LCW6RB.jpg' 'Labels/LCW6RB.xml'
As annoying as it was, it was that small '\r' that caused the path error. I could not see it in my console until I wrote the script above. My question is: why is this '\r' even there? I did not create it. Something somewhere added it. It would be helpful if someone could tell me what this '\r' is for, why it appears in Python 2 and not in Python 3, and how to avoid bugs caused by it.

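The difference comes from how the two versions open text files. Python 3 reads text files in universal newlines mode by default and translates '\r\n' to '\n'; Python 2's open(..., "r") does not do this translation on non-Windows platforms, so the '\r' survives.
The issue here is that your file is in Windows text format and contains a carriage return before each linefeed. A quick & generic fix would be to change: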
img_file, anno = line.strip("\n").split(" ")
to just:
img_file, anno = line.split()
Without arguments str.split is very smart:
it splits according to any kind of whitespace (linefeed, space, carriage return, tab)
it removes empty fields (no need for strip after all)
So use that cross-platform, Python-version-agnostic form unless you need a really specific split operation, and your problems will be history.
As an aside, don't do for line in lf.readlines(): but just for line in lf:. It reads and yields the lines one by one, which is handy when the file is big, because you don't hold the whole file in memory.
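A small sketch of the two approaches side by side, using one line from the question's trainval.txt with the '\r\n' ending that Python 2 returned:

```python
# A line as Python 2 would return it from a CRLF file on Linux.
line = "Images/K0KKI1.jpg Labels/K0KKI1.xml\r\n"

# Stripping only "\n" leaves the carriage return on the last field.
img_file, anno = line.strip("\n").split(" ")
print(repr(anno))

# split() with no arguments splits on any whitespace run and drops
# leading/trailing whitespace, so the "\r" disappears.
img_file, anno_clean = line.split()
print(repr(anno_clean))
```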

Find TM superscript in python 2 using regex

My text file includes "SSS™" as one of its words and I am trying to find it using regular expression. My problem is with finding ™ superscript. My code is:
import re
path='G:\python_code\A.txt'
f_general=open(path, 'r')
special=re.findall(r'\U2122',f_general.read())
print(special)
but it doesn't print anything. How can I fix it?

It may have to do with the encoding of your file. Try this:
import re
path = r"g:\python_code\A.txt"
f_general=open(path, "r", encoding="UTF-16")
data = f_general.read()
special=re.findall(chr(8482), data)
print(special)
print(chr(8482))
Note that I'm using the decimal code point of the trademark sign (8482, i.e. U+2122). This is the site I use:
https://www.ascii.cl/htmlcodes.htm
So, open the file you have in Notepad, do a Save As, choose the Unicode encoding, and this should all work. Working with characters outside ASCII can be a hassle. I am using Python 3.6; the encoding argument to open and chr(8482) do not work in Python 2.x.
Note that when chr(8482) is printed on your command line it will probably show up as just a T; at least that is what I get on Windows.
update
Try this for Python 2; it should capture the word before the trademark sign:
import re
with open(r"g:\python_code\A.txt", "rb") as f:
    data = f.read().decode("UTF-16")
# In Python 2, chr() only accepts 0-255, so use unichr() for U+2122.
regex = re.compile(u"\\S+" + unichr(8482))
match = re.search(regex, data)
if match:
    print (match.group(0))
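For comparison, a minimal Python 3 sketch of the same search. It assumes the file is saved as UTF-16, as suggested above; the sample file and its text are made up so the example runs on its own.

```python
import os
import re
import tempfile

# Write a small UTF-16 sample file; the name and text are made up.
path = os.path.join(tempfile.mkdtemp(), "A.txt")
with open(path, "w", encoding="utf-16") as f:
    f.write("Our product SSS\u2122 is new.\n")

# "\u2122" is the trademark sign; \S+ captures the word in front of it.
pattern = re.compile("\\S+\u2122")

with open(path, encoding="utf-16") as f:
    matches = pattern.findall(f.read())

print(matches)
```

If the real file uses a different encoding (UTF-8, for instance), the encoding argument has to match it, otherwise the trademark sign will not decode to U+2122.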

Is it possible to export a list to .txt such that line breaks can be read by notepad?

For accessibility reasons, I wonder whether it is possible, in Python, to export a list to .txt such that the line breaks can be read by Notepad. Below is example code whose output is read correctly in Notepad++ but not in Notepad. In Notepad++ each entry of the list is on a separate line; in Notepad all entries are on the same line.
string =['str1 123','str2 234','str3 345']
outF = open("outp.txt", "w")
for item in string:
    outF.write("%s\n" % item)
outF.close()

Windows uses a carriage return followed by a line feed (\r\n) to indicate line breaks, and that is the only line ending recognized by Windows Notepad.
Example:
In [7]: s = ['hello', 'world']
In [8]: with open('test.txt', 'w') as f:
   ...:     for item in s:
   ...:         f.write('%s\r\n' % item)
Linux-based systems use a line feed alone to indicate line breaks, and old Mac OS versions used just a carriage return. An editor like Notepad++ can be configured to recognize all of these, while Notepad cannot.

I'll flesh out the comment's answer a little bit. Notepad only recognizes a carriage return followed by a line feed (\r\n) as a line break. Therefore, write both characters when the file needs to be readable in Notepad.
So do:
outF.write("%s\r\n" % item)
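In Python 3 there is also a way to keep writing "\n" in the code and let open do the conversion, through its newline parameter. A minimal sketch, using the list from the question and a temporary output path chosen only for the illustration:

```python
import os
import tempfile

items = ['str1 123', 'str2 234', 'str3 345']
path = os.path.join(tempfile.mkdtemp(), "outp.txt")

# With newline="\r\n", every "\n" written is stored as "\r\n",
# whatever platform the script runs on (Python 3 only).
with open(path, "w", newline="\r\n") as out_file:
    for item in items:
        out_file.write("%s\n" % item)

with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

This avoids the doubled "\r\r\n" that can appear when "\r\n" is written explicitly to a text-mode file on Windows.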

How can I detect DOS line breaks in a file?

I have a bunch of files. Some have Unix line endings, many have DOS ones. I'd like to test each file to see if it is DOS-formatted before I switch the line endings.
How would I do this? Is there a flag I can test for? Something similar?

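Python can automatically detect which newline convention is used in a file, thanks to the "universal newline mode" (the 'U' flag in Python 2; in Python 3 it is the default for text mode and the flag was eventually removed), and you can access Python's guess through the newlines attribute of file objects: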
f = open('myfile.txt', 'U')
f.readline() # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)
This gives the newline ending of the first line (Unix, DOS, etc.), if any.
As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
Reference: http://docs.python.org/2/library/functions.html#open
If you just want to convert a file, you can simply do:
with open('myfile.txt', 'U') as infile:
    text = infile.read() # Automatic ("Universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text) # Writes newlines for the platform running the program
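The same check in Python 3, as a minimal sketch: no 'U' flag is needed, because default text mode already behaves this way. The sample file is created on the fly and its name is made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")  # made-up sample file
with open(path, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

# Default text mode in Python 3 already uses universal newlines.
with open(path) as f:
    first_line = f.readline()
    detected = f.newlines

print(repr(first_line), repr(detected))
```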

You could search the string for \r\n. That's DOS style line ending.

(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:
print open('myfile.txt', 'U').read()
That is, Python's "universal" file reader will automatically recognize all the different end-of-line markers, translating them to "\n".
http://docs.python.org/library/functions.html#open
(Thanks handle!)

As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:
if "\r\n" in open("/path/file.txt","rb").read():
    print "DOS line endings found"
Edit: simplified as per John Machin's comment (no need to use regular expressions).
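In Python 3 the same idea needs a bytes literal, since reading in "rb" mode returns bytes. A minimal sketch; has_dos_line_endings is a hypothetical helper name and the two sample files are created just for the demonstration.

```python
import os
import tempfile

def has_dos_line_endings(path):
    """Return True if the file contains at least one CRLF pair."""
    with open(path, "rb") as f:
        # Reading bytes means the needle must be bytes too in Python 3.
        return b"\r\n" in f.read()

tmp_dir = tempfile.mkdtemp()
dos_path = os.path.join(tmp_dir, "dos.txt")
unix_path = os.path.join(tmp_dir, "unix.txt")
with open(dos_path, "wb") as f:
    f.write(b"one\r\ntwo\r\n")
with open(unix_path, "wb") as f:
    f.write(b"one\ntwo\n")

print(has_dos_line_endings(dos_path), has_dos_line_endings(unix_path))
```

Like the original one-liner, this reads the whole file into memory, which is fine for small files.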

DOS line breaks are \r\n, Unix ones are only \n. So just search for \r\n.

Using grep & bash:
grep -c -m 1 $'\r$' file
echo $'\r\n\r\n' | grep -c $'\r$' # test
echo $'\r\n\r\n' | grep -c -m 1 $'\r$'

You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide. This is faster and uses less memory for large text files, but it does not detect mixed newline endings.
In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.
def get_newline(filename):
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'
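A usage sketch for Python 3, showing the detected value passed to the newline parameter of open when the file is rewritten. The sample file, its contents and the edit applied to it are made up for the illustration.

```python
import os
import tempfile

def get_newline(filename):
    # Same function as above, repeated so this sketch runs on its own.
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'

path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # made-up sample file
with open(path, "wb") as f:
    f.write(b"alpha\r\nbeta\r\n")

newline = get_newline(path)

# Read with translation to "\n", edit, then write back using the
# detected convention so the file keeps its original line endings.
with open(path) as f:
    text = f.read()
with open(path, "w", newline=newline) as f:
    f.write(text.replace("beta", "gamma"))

with open(path, "rb") as f:
    result = f.read()
print(result)
```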
