Replacing \n while keeping \r\n intact - python

I have a huge CSV file (196,244 lines) that contains \n characters in places other than the line endings. I want to remove those stray \n characters but keep every \r\n intact.
I tried line.replace, but it does not seem to recognize \r\n, so next I tried a regex:
with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            line = re.sub("(?<!\r)\n", " ", line)
            fixed.write(line)
But it does not keep \r\n; it removes every line break. I can't do this in Notepad++ because it crashes on this file.

The regex never sees the \r characters. When a file is opened in text mode (r), Python normalizes every line break, including \r\n, to a single LF (\n), so the lookbehind (?<!\r) always succeeds and every \n is replaced. To keep the line breaks exactly as they are in the input, read the file in binary mode by adding b. You then also need the b prefix on the regex pattern and on the replacement.
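A small sketch of that normalization, using a throwaway temporary file (the file name and sample bytes are made up for illustration): the same data comes back with \r\n collapsed to \n in text mode, but unchanged in binary mode.

```python
import os
import tempfile

# Hypothetical sample data: one stray LF and two real CRLF line endings.
SAMPLE = b"a\nb\r\nc\r\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as fh:
        fh.write(SAMPLE)
    with open(sample_path, "r") as fh:   # text mode: line breaks are normalized
        text_view = fh.read()
    with open(sample_path, "rb") as fh:  # binary mode: bytes come back untouched
        raw_view = fh.read()

print(repr(text_view))
print(repr(raw_view))
```

In the text-mode view there is no \r left for a lookbehind to find, which is why every \n gets replaced.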
You can use
import re

with open(filetoread, "rb") as inf:
    with open(filetowrite, "wb") as fixed:
        fixed.write(re.sub(b"(?<!\r)\n", b" ", inf.read()))
Now the whole file is read into a single bytes object (with inf.read()), so the regex sees the original line breaks and replaces only the \n characters that are not preceded by \r.
Pay attention to:
"rb" to read the file in
"wb" to write the file out
re.sub(b"(?<!\r)\n", b" ", inf.read()) uses the b prefix on both literals, and inf.read() reads the whole file contents into a single value.
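If the file is too large to hold in memory comfortably, the same substitution could be applied chunk by chunk. The following is only a sketch of one possible approach, not part of the answer above: the function name copy_replacing_lone_lf and the chunk size are assumptions. The one subtlety is a \r\n pair that gets split across two chunks.

```python
import io
import re

LONE_LF = re.compile(rb"(?<!\r)\n")


def copy_replacing_lone_lf(src, dst, chunk_size=1 << 20):
    """Copy binary stream src to dst, replacing LF not preceded by CR with a space."""
    prev_ended_with_cr = False
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if prev_ended_with_cr and chunk.startswith(b"\n"):
            # This LF completes a CRLF that was split across two chunks: keep it.
            dst.write(b"\n")
            body = chunk[1:]
        else:
            body = chunk
        dst.write(LONE_LF.sub(b" ", body))
        prev_ended_with_cr = chunk.endswith(b"\r")


# Tiny in-memory demonstration; chunk_size=4 forces a CRLF to straddle two chunks.
source = io.BytesIO(b"foo\nbar\r\nbaz\r\n")
target = io.BytesIO()
copy_replacing_lone_lf(source, target, chunk_size=4)
result = target.getvalue()
print(result)
```

With real files, src and dst would be the handles opened with "rb" and "wb".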

When you open a file with a plain open() call, the TextIOWrapper it returns presents every kind of line ending (\n, \r, and \r\n) as a simple \n.
Explicitly setting newline="\r\n" should allow you to read and write the newlines the way you expect:
with open(path_src, newline="\r\n") as fh_src:
    with open(path_dest, "w", newline="\r\n") as fh_dest:
        for line in fh_src:  # file-likes are iterable by-lines
            # assumes every line, including the last, ends with \r\n
            fh_dest.write(line[:-2].replace("\n", " "))
            fh_dest.write("\n")  # newline="\r\n" translates this to \r\n on write
Content example:
>>> with open("test.data", "wb") as fh:
...     fh.write(b"""foo\nbar\r\nbaz\r\n""")
...
14
>>> with open("test.data", newline="\r\n") as fh:
...     for line in fh:
...         print(repr(line))
...
'foo\nbar\r\n'
'baz\r\n'
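One detail worth checking when writing with newline="\r\n": per the documented behavior of open(), every "\n" written is translated to "\r\n", so writing a literal "\r\n" produces "\r\r\n". A small sketch with a temporary file (the file name is illustrative):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name
    with open(demo_path, "w", newline="\r\n") as fh:
        fh.write("one\n")    # the "\n" is translated to "\r\n"
        fh.write("two\r\n")  # the "\n" is translated again, leaving a stray "\r"
    with open(demo_path, "rb") as fh:
        written = fh.read()

print(written)
```

Writing "\n" (or opening the destination with newline="" and writing "\r\n" yourself) avoids the doubled carriage return.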

Related

replace ^M(control M character) in a text file in python

The file is like this:
This line has control character ^M this is bad
I will try it
I want to remove control M characters in the file, and create a new file like this using Python
This line has control character this is bad
I will try it
I tried the methods I found on Stack Overflow, using string replacement like this:
line.replace("\r", "r")
and
line.replace("\r\n", "r")
Here is part of the code snippet:
with open(file_path, "r") as input_file:
    lines = input_file.readlines()
new_lines = []
for line in lines:
    new_line = line.replace("\r", "")
    new_lines.append(new_line)
new_file_name = "replace_control_char.dat"
new_file_path = os.path.join(here, data_dir, new_file_name)
with open(new_file_path, "w") as output_file:
    for line in new_lines:
        output_file.write(line)
However, the new file I got is:
This line has control character
this is bad
I will try it
"This line has control character" and " this is bad" are not on the same line. I expected that removing the control-M character would put these two phrases on the same line.
Can someone help me solve this issue?
Thanks,
Arthur
You cannot rely on text mode in this case.
Python's text mode treats a lone \r as a line ending (on classic Macintosh systems the line terminator was \r alone, even though Windows officially uses \r\n). Text mode converts each lone \r to \n and drops a \r that is followed by \n, so it destroys the information you need.
Because universal newlines are the default, this code fails on Unix/Linux as well. Python behaves the same on all platforms:
Python doesn’t depend on the underlying operating system’s notion of text files; all the processing is done by Python itself, and is therefore platform-independent.
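As a quick illustration of why the replace in text mode has nothing left to remove, here is a sketch that writes the sample content as raw bytes to a temporary file (the file name is made up) and reads it back in text mode. The lone \r has already become a line break by the time the lines reach your code:

```python
import os
import tempfile

# The sample from the question, with ^M written as the byte \r.
RAW = b"This line has control character \r this is bad\r\nI will try it\r\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "control_m.dat")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(RAW)
    with open(path, "r") as fh:  # default text mode, universal newlines
        text_lines = fh.readlines()

for text_line in text_lines:
    print(repr(text_line))
```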
If you want to remove those, you have to use binary mode.
with open(file_path, "rb") as input_file:
    contents = input_file.read().replace(b"\r", b"")
with open(file_path, "wb") as output_file:
    output_file.write(contents)
That code removes all \r characters (including those in line terminators). That works, but if your aim is to remove only stray \r characters and preserve the line endings, another method is required.
One way to do it is to use a regular expression, which can accept binary (bytes) as well:
re.sub(rb"\r([^\n])", rb"\1", contents)
That regular expression removes a \r only if it is followed by a character other than \n, effectively preserving CR+LF Windows end-of-line sequences.
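A possible variant, offered as a sketch rather than as the answer's own method, uses a negative lookahead instead of a capture group. Because the lookahead consumes nothing, it also removes a \r at the very end of the data and handles two stray \r characters in a row. The function name strip_stray_cr is hypothetical.

```python
import re


def strip_stray_cr(data: bytes) -> bytes:
    """Remove every CR that is not immediately followed by LF."""
    return re.sub(rb"\r(?!\n)", b"", data)


sample = b"bad \r line\r\nlast\r"
cleaned = strip_stray_cr(sample)
print(cleaned)
```

The capture-group pattern needs some character after the \r in order to match, so it leaves a trailing \r in place and removes only one of two adjacent stray \r characters.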

to change a text file containing multiline strings

I have a text file consisting of multiline strings (hundreds of lines, actually). Each string starts with an '&' sign. I want to change the file so that only the first 300 characters of each string remain in the new file. How can I do this using Python?
You can read the file and loop over the lines to do what you want. Strings are easily sliceable in Python, so you can take the first 300 characters and write them to another file.
with open(path, "r") as file, open(newPath, "w") as newFile:
    for line in file:
        newLine = line[:300]  # first 300 characters
        newFile.write(newLine)
Hope this is what you meant
You could do something like this:
# Open output file in append mode
with open('output.txt', 'a') as out_file:
    # Open input file in read mode
    with open("input.txt", "r") as in_file:
        for line in in_file:
            # Take first 300 characters from line
            # (slicing also works when the line is shorter than 300 characters)
            new_line = line[0:300]
            # Write new line to output
            # (You might need to add '\n' for new lines)
            out_file.write(new_line)
            print(new_line)
You can use the string method split to split the file contents on "&", then use slices to keep only the first 300 characters of each piece.
with open("oldFile.txt", "rt") as old_file, open("newFile.txt", "wt") as new_file:
    for line in old_file.read().split("&"):
        if line:  # skip the empty piece before the first "&"
            new_file.write("&{}\n".format(line[:300]))
This version preserves the line endings (\n) within your strings.
If you want to remove the line endings in each individual string, you can use replace:
with open("oldFile.txt", "rt") as old_file, open("newFile.txt", "wt") as new_file:
    for line in old_file.read().split("&"):
        if line:  # skip the empty piece before the first "&"
            new_file.write("&{}\n".format(line.replace("\n", "")[:300]))
Note that your new file will end with an empty line.
Also, depending on the size of your file, you may prefer a generator-based version over split, which loads the whole file content into memory as a list of strings.
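A generator version might look like the sketch below. The name iter_records, the chunk size, and the in-memory sample are assumptions made for illustration. It reads the file in fixed-size chunks and yields each "&"-delimited string as soon as it is complete, so only one chunk and one partial record are held at a time. Note that it silently skips empty records (for example two "&" in a row).

```python
import io


def iter_records(handle, delimiter="&", chunk_size=64 * 1024):
    """Yield the delimiter-separated records of a text stream, one at a time."""
    buffer = ""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; keep the tail for later.
        *complete, buffer = buffer.split(delimiter)
        for record in complete:
            if record:
                yield record
    if buffer:
        yield buffer  # the final record has no trailing delimiter


# In-memory demonstration; a tiny chunk_size exercises the buffering logic.
source = io.StringIO("&abc\ndef&ghi&jkl")
records = list(iter_records(source, chunk_size=3))
truncated = ["&" + record[:300] for record in records]
print(truncated)
```

With a real file you would pass the handle returned by open("oldFile.txt", "rt") and write each truncated record to the output file inside the loop.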

Replace an arrow character, repeating headers and blank lines in text file and paste the data cleanly in Excel sheet

My attempt to remove the arrow character, blank lines and headers from this text file is below.
I am trying to drop the arrow character and blank lines and write the result to a new file, MICnew.txt, but my code doesn't do it: nothing changes in the new file.
Please help, Thanks so much
I have attached sample file as well.
import re
with open('MIC.txt') as oldfile, open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(re.sub(r'[^\x00-\x7f]',r' ',line))
with open('MICnew.txt','r+') as file:
    for line in file:
        if not line.isspace():
            file.write(line)
You can't read from and write to the same file simultaneously. When you open a file with mode r+, the I/O pointer is initially at the beginning but reading will push it to the end (as explained in this answer). So in your case, you read the first line of the file, which moves the pointer to the end of the file. Then you write out that line (unless it's all whitespace) but crucially, the pointer stays at the end. That means on the next iteration of the loop you will have reached the end of the file and your program stops.
To avoid this, read in all the contents of the file first, then loop over them and write out what you want:
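from pathlib import Path

file_data = Path('MICnew.txt').read_text()
with open('MICnew.txt', 'w') as out_handle:  # THIS WILL OVERWRITE THE FILE!
    for line in file_data.splitlines():
        if line.strip():  # skip empty and whitespace-only lines
            out_handle.write(line + '\n')  # splitlines() drops the line ending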
But that double loop is a bit clumsy and you can instead combine the two steps into one:
import re

with open('MIC.txt', errors='ignore') as oldfile, \
        open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        clean_line = re.sub(r'[^\x00-\x7f]', ' ', line.strip('\x0c'))
        if not clean_line.isspace():
            newfile.write(clean_line)
To drop bytes that cannot be decoded, the file is opened with errors='ignore', which omits the improperly encoded characters. The sample file also contains a number of rogue form feed characters (ASCII code 12, or \x0c in hex), so the code strips them explicitly from the ends of each line.
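The per-line cleanup can be tried without the sample file. The sketch below wraps the same steps in a generator (the name clean_lines and the sample lines are made up for illustration) and runs it on a short list of strings:

```python
import re

NON_ASCII = re.compile(r'[^\x00-\x7f]')


def clean_lines(lines):
    """Yield lines with form feeds stripped and non-ASCII characters blanked,
    skipping lines that end up containing only whitespace."""
    for line in lines:
        clean_line = NON_ASCII.sub(' ', line.strip('\x0c'))
        if not clean_line.isspace():
            yield clean_line


# Hypothetical input: a form feed, an arrow (U+2192), and two blank lines.
sample = ["\x0cName \u2192 Value\n", "\n", "   \n", "A 1\n"]
cleaned = list(clean_lines(sample))
print(cleaned)
```

The arrow is replaced by a single space, and the two whitespace-only lines are dropped.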

How to write to a file with newline characters and avoid empty lines

I'm trying to write encoded data to a file and separate each run with a newline character. However, when doing this there is an empty line between each run -- as shown below.
Using .rstrip()/.strip() only really works when reading the file -- and obviously this cannot be used directly when writing to the file as it would write all the data to a single line.
cFile = open('compFile', 'w')
for i in range(num_lines):
    line = validLine()
    cFile.write(line + "\n")
cFile.close()
cFile = open('compFile', 'r')
for line in cFile:
    print(line)
# Empty space output:
023

034

045

# Desired output:
023
034
045
I think you already have what you want if you look at the text file itself.
Be aware that Python also reads the \n at the end of each line of your file, and that print() adds a newline after whatever it prints.
In your case, that means your file looks like
023\n
034\n
045\n
When printing, you first read 023\n, and then print() appends another \n to it.
That gives you the 023\n\n you see in your console. But the file contains what you want.
If you just want to print without a line break, you can use
import sys
sys.stdout.write('.')
You could use
for i in range(num_lines):
    line = validLine()
    cFile.write(line.strip() + "\n")
    #                ^^^^^
cFile.close()
Off-topic, but consider using a with statement as well.
Using .rstrip()/.strip() only really works when reading the file -- and obviously this cannot be used directly when writing to the file as it would write all the data to a single line.
This is a misconception. Using .rstrip() is exactly the correct tool if you need to write a series of strings, some of which may have a newline character attached:
with open('compFile', 'w') as cFile:
    for i in range(num_lines):
        line = validLine().rstrip("\n")  # remove possible newline
        cFile.write(line + "\n")
Note that if all your lines already have a newline attached, you don't have to add more newlines. Just write the string directly to the file, no stripping needed:
with open('compFile', 'w') as cFile:
    for i in range(num_lines):
        line = validLine()  # line with "\n" newline already present
        cFile.write(line)  # no need to add a newline anymore
Next, you are reading lines with newlines from your file and then printing them with print(). By default, print() adds another newline, so you end up with double-spaced lines: your input file contains 023\n034\n045\n, but printing each line ('023\n', then '034\n', then '045\n') adds a newline afterwards, so you write 023\n\n034\n\n045\n\n to stdout.
Either strip that newline when printing, or tell print() to not add a newline of its own by giving it an empty end parameter:
with open('compFile', 'r') as cFile:
    for line in cFile:
        print(line, end='')
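The other option, stripping the newline before printing, can be sketched as follows. The list of lines stands in for what iterating over the file would produce, and the output is captured in a StringIO only so that it can be inspected:

```python
import io
from contextlib import redirect_stdout

# Stand-in for the lines read from the file, each with its trailing newline.
lines = ["023\n", "034\n", "045\n"]

captured = io.StringIO()
with redirect_stdout(captured):
    for line in lines:
        print(line.rstrip("\n"))  # print() supplies the single newline itself

output = captured.getvalue()
print(repr(output))
```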

Convert \r text to \n so readlines() works as intended

In Python, you can read a file and load its lines into a list by using
f = open('file.txt','r')
lines = f.readlines()
Each individual line is delimited by \n, but if a line contains \r, that is not treated as a new line. I need to convert every \r to \n and get the correct list of lines.
If I do .split('\r') on each line, I'll get lists inside the list.
I thought about opening the file, replacing all \r with \n, closing the file, reading it in again and then using readlines(), but this seems wasteful.
How should I implement this?
f = open('file.txt','rU')
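This opens the file with Python's universal newline support, so \r is treated as an end-of-line. (In Python 3, universal newlines are the default in text mode, and the 'U' flag was removed in Python 3.11, so a plain open('file.txt') behaves this way.)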
If that's a concern, open the file in binary mode and convert the line endings yourself:
with open(filename, "rb") as f:
    # binary mode returns bytes, so the arguments to replace must be bytes too
    s = f.read().replace(b'\r\n', b'\n').replace(b'\r', b'\n')
lines = s.split(b'\n')  # a list of bytes objects
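On Python 3, two built-in behaviors already treat \r as a line break: the default text mode of open(), and str.splitlines(). A short sketch with a temporary file (the file name and contents are invented for the example):

```python
import os
import tempfile

# Hypothetical content mixing all three line-ending styles.
RAW = b"first\rsecond\r\nthird\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")
    with open(path, "wb") as fh:
        fh.write(RAW)
    with open(path) as f:  # default text mode uses universal newlines
        lines = f.readlines()

# For a string already in memory, splitlines() also splits on \r.
split_lines = RAW.decode("ascii").splitlines()

print(lines)
print(split_lines)
```

readlines() keeps a normalized \n at the end of each line, while splitlines() drops the line endings altogether.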
