Python - Substituting regex matches in byte file

Struggling to automate a text file cleanup for some subsequent data analysis. I have a tab-delimited text file where I need to remove instances of \t" (remove the " but keep the tab).
I then want to remove instances of \n where the character before is not \r, i.e. \r\n is OK but x\n is not. I have the first part working but not the second part; any help appreciated. I appreciate there are probably far better ways to do this, given that I'm writing and then reopening the file in byte mode simply because I can't seem to detect \r in 'r' mode.
import re
import sys
import time
originalFile = '14-09 - Copy.txt'
amendedFile = '14-09 - amended.txt'
with open(originalFile, 'r') as content_file:
    content = content_file.read()
content = content.replace('\t\"','\t')
with open(amendedFile,'w') as f:
    f.write(content)
with open(amendedFile, 'rb') as content_file:
    content = content_file.read()
content = re.sub(b"(?<!\r)\n","", content)
with open(amendedFile,'wb') as f:
    f.write(content)
print("Done")
For clarity and completeness, the Python 2 code below identifies the positions that I'm interested in (I'm just looking to automate their removal now), i.e.
\r\nText should equal \r\nText
\t\nText should equal \tText
Text\nText should equal TextText
import re
import sys
import time
with open('14-09 - Copy.txt', 'rb') as content_file:
    content = content_file.read()
newLinePos = [m.start() for m in re.finditer('\n', content)]
for line in newLinePos:
    if (content[line-1]) != '\r':
        print (repr(content[line-20:line]))
Thanks as always!

You probably want to use ([^\r])\n as your pattern, and then substitute \1 to keep the character before. Because the pattern is a bytes object, the replacement has to be bytes as well.
So your line would be
content = re.sub(b"([^\r])\n", br"\1", content)
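For what it's worth, here is a minimal sketch that does both clean-up steps on the raw bytes in one pass. It assumes the whole file fits in memory, and the function name `clean_tsv_bytes` is made up for illustration. Staying in binary mode matters because text mode translates line endings on read and on write, which hides the \r characters the question is trying to detect. The lookbehind form from the question also covers a \n at the very start of the data, which a pattern that consumes the preceding character cannot match.

```python
import re

def clean_tsv_bytes(data: bytes) -> bytes:
    """Drop a quote that directly follows a tab, then drop bare LF bytes."""
    # Step 1: remove the double quote in every tab-quote pair, keeping the tab.
    data = data.replace(b'\t"', b'\t')
    # Step 2: delete each \n that is not preceded by \r; \r\n pairs stay intact.
    # The replacement is bytes because the pattern is bytes.
    return re.sub(rb'(?<!\r)\n', b'', data)

sample = b'a\t"b\r\nc\t\nd\ne\r\n'
print(clean_tsv_bytes(sample))
```

To apply it to a file, read with open(name, 'rb'), pass the bytes through the function, and write the result with open(other_name, 'wb').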

Related

Python: Reading a file by using \n as the newline character. File also contains \r\n

I'm looking at a .CSV-file that looks like this:
Hello\r\n
my name is Alex\n
Hello\r\n
my name is John?\n
I'm trying to open the file with the newline-Character defined as '\n':
with open(outputfile, encoding="ISO-8859-15", newline='\n') as csvfile:
I get:
line1 = 'Hello'
line2 = 'my name is Alex'
line3 = 'Hello'
line4 = 'my name is John'
My desired result is:
line1 = 'Hello\r\nmy name is Alex'
line2 = 'Hello\r\nmy name is John'
Do you have any suggestions on how to fix this?
Thank you in advance!
I'm sure your answers are completely correct and technically advanced.
Sadly the CSV-File is not at all RFC 4180 compliant.
Therefore I'm going with the following solution and will correct my temporary characters "||" afterwards:
with open(outputfile_corrected, 'w') as correctedfile_handle:
    with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
        csvfile_content = csvfile.read()
        csvfile_content_new = csvfile_content.replace('\r\n', '||')
        correctedfile_handle.write(csvfile_content_new)
(Someone commented this, but answer has been deleted)
From documentation of the built-in function open in the standard library:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
A file object by itself cannot distinguish a '\r\n' that is part of the data (as in your case) from the '\n' separator; that is the job of the bytes decoder. So one option is to write your own decoder and use its associated encoding for your text file. But this is a bit tedious, and for small files it is much easier to take a more straightforward approach with the re module. The solution proposed by @Martijn Pieters should be used to iterate over large files.
import re
with open('data.csv', 'tr', encoding="ISO-8859-15", newline='') as f:
    file_data = f.read()
# Approach 1:
lines1 = re.split(r'(?<!\r)\n', file_data)
if not lines1[-1]:
    lines1.pop()
# Approach 2:
lines2 = re.findall(r'(?:.+?(?:\r\n)?)+', file_data)
# Approach 3:
iterator_lines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+', file_data))
assert lines1 == lines2 == list(iterator_lines3)
print(lines1)
If we need '\n' at the end of each line:
# Approach 1:
nlines1 = re.split(r'(?<!\r\n)(?<=\n)', file_data)
if not nlines1[-1]:
    nlines1.pop()
# Approach 2:
nlines2 = re.findall(r'(?:.+?(?:\r\n)?)+\n?', file_data)
# Approach 3:
iterator_nlines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+\n', file_data))
assert nlines1 == nlines2 == list(iterator_nlines3)
print(nlines1)
Results:
['Hello\r\nmy name is Alex', 'Hello\r\nmy name is John']
['Hello\r\nmy name is Alex\n', 'Hello\r\nmy name is John\n']
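To see the newline settings from the documentation quote side by side, here is a small sketch that decodes the same bytes four times. It uses an in-memory io.BytesIO instead of a real file, and the helper name `read_lines` is hypothetical. The sample bytes mirror the file from the question.

```python
import io

RAW = b"Hello\r\nmy name is Alex\nHello\r\nmy name is John?\n"

def read_lines(newline):
    """Decode RAW as text with the given newline setting and return its lines."""
    stream = io.TextIOWrapper(io.BytesIO(RAW), encoding="ISO-8859-15", newline=newline)
    return stream.readlines()

# None translates every ending to '\n'; '' splits on every ending but keeps it;
# '\n' and '\r\n' split only on that exact terminator and keep it.
for setting in (None, "", "\n", "\r\n"):
    print(repr(setting), read_lines(setting))
```

None of the four settings yields "split on \n only when it is not part of \r\n", which is why the regular expressions above (or a re-joining step) are needed.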
Python's line splitting algorithm can't do what you want; lines that end in \r\n also end in \r. At most you can set the newline argument to either '\n' or '' and re-join lines if they end in \r\n instead of \n. You can use a generator function to do that for you:
def collapse_CRLF(fileobject):
    buffer = []
    for line in fileobject:
        if line.endswith('\r\n'):
            buffer.append(line)
        else:
            yield ''.join(buffer) + line
            buffer = []
    if buffer:
        yield ''.join(buffer)
then use this as:
with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
    for line in collapse_CRLF(csvfile):
        ...  # handle each re-joined line
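As a quick check of the re-joining idea, the sketch below runs a generator of that shape over an in-memory io.StringIO instead of a real file. The lower-case name `collapse_crlf` and the sample text are mine; newline='' is passed so that the original line endings survive iteration.

```python
import io

def collapse_crlf(lines):
    """Join physical lines ending in CRLF onto the next line that ends differently."""
    buffer = []
    for line in lines:
        if line.endswith('\r\n'):
            buffer.append(line)          # the logical line continues
        else:
            yield ''.join(buffer) + line
            buffer = []
    if buffer:                           # input ended on a CRLF line
        yield ''.join(buffer)

text = "Hello\r\nmy name is Alex\nHello\r\nmy name is John?\n"
# newline='' keeps '\r\n' and '\n' untranslated while iterating
records = list(collapse_crlf(io.StringIO(text, newline='')))
print(records)
```

Each record keeps its embedded \r\n and its trailing \n; strip the final \n afterwards if it is not wanted.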
However, if this is CSV file, then you really want to use the csv module. It handles files with a mix of \r\n and \n endings for you as it knows how to preserve bare newlines in RFC 4180 CSV files, already:
import csv
with open(outputfile, encoding="ISO-8859-15", newline='') as inputfile:
    reader = csv.reader(inputfile)
Note that in a valid CSV file, \r\n is the delimiter between rows, and \n is valid in column values. So if you did not want to use the csv module here for whatever reason, you'd still want to use newline='\r\n'.

Python writing apostrophes to file

I'm converting a downloaded Facebook Messenger conversation from JSON to a text file using Python. I've converted the JSON to text and it's all looking fine. I need to strip the unnecessary information and reverse the order of the messages, then save the output to a file, which I've done. However, when I am formatting the messages with Python, when I look at the output file, sometimes instead of an apostrophe, there's â instead.
My Python isn't great as I normally work with Java, so there's probably a lot of things I could improve. If someone could suggest some better tags for this question, I'd also be very appreciative.
Example of apostrophe working: You're not making them are you?
Example of apostrophe not working: Itâs just a button I discovered
What is causing this to happen, and why does it not happen every time there is an apostrophe?
Here is the script:
#!/usr/bin/python3
import datetime
def main():
    input_file = open('messages.txt', 'r')
    output_file = open('results.txt', 'w')
    content_list = []
    sender_name_list = []
    time_list = []
    line = input_file.readline()
    while line:
        line = input_file.readline()
        if "sender_name" in line:
            values = line.split("sender_name")
            sender_name_list.append(values[1][1:])
        if "timestamp_ms" in line:
            values = line.split("timestamp_ms")
            time_value = values[1]
            timestamp = int(time_value[1:])
            time = datetime.datetime.fromtimestamp(timestamp / 1000.0)
            time_truncated = time.replace(microsecond=0)
            time_list.append(time_truncated)
        if "content" in line:
            values = line.split("content")
            content_list.append(values[1][1:])
    content_list.reverse()
    sender_name_list.reverse()
    time_list.reverse()
    for x in range(1, len(content_list)):
        output_file.write(sender_name_list[x])
        output_file.write(str(time_list[x]))
        output_file.write("\n")
        output_file.write(content_list[x])
        output_file.write("\n\n")
    input_file.close()
    output_file.close()
if __name__ == "__main__":
    main()
Edit:
The answer to the question was adding
import codecs
input_file = codecs.open('messages.txt', 'r', 'utf-8')
output_file = codecs.open('results.txt','w', 'utf-8')
Without seeing the incoming data it's hard to be sure, but I suspect that instead of an apostrophe (Unicode U+0027 ' APOSTROPHE), you've got a curly equivalent (U+2019 ’ RIGHT SINGLE QUOTATION MARK) in there that is being interpreted as old-fashioned ASCII.
Instead of
output_file = open('results.txt', 'w')
try
import codecs
output_file = codecs.open('results.txt','w', 'utf-8')
You may also need the equivalent on your input file.
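In Python 3 the built-in open accepts an encoding argument, so codecs is not strictly required. The sketch below first reproduces the garbling by decoding UTF-8 bytes with Windows-1252, then shows that naming the encoding on both sides round-trips the text. That the platform default here was cp1252 is an assumption on my part, and the temporary file path is only for illustration.

```python
import os
import tempfile

text = "It\u2019s just a button I discovered\n"

# Reading UTF-8 bytes with the wrong decoder produces the garbled characters:
# U+2019 is three bytes in UTF-8, and cp1252 turns each byte into its own character.
garbled = text.encode("utf-8").decode("cp1252")

# Naming the encoding on both write and read round-trips the text unchanged.
path = os.path.join(tempfile.mkdtemp(), "results.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

print(repr(garbled))
print(restored == text)
```

A plain U+0027 apostrophe is a single byte that decodes identically under ASCII, cp1252 and UTF-8, which would explain why only some apostrophes in the conversation were affected.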

How to read a specific paragraph from multiple folders and files

I have a list that contains directories and filenames that I want to open, read a paragraph from and save that paragraph to a list.
The problem is that I don't know how to "filter" the paragraph out from the files and insert into my list.
My code so far.
import os
from glob import iglob
rr = []
file_list = [f for f in iglob('**/README.md', recursive=True) if os.path.isfile(f)]
for f in file_list:
    with open(f,'rt') as fl:
        lines = fl.read()
        rr.append(lines)
print(rr)
The format of the file I'm trying to read from. The text between the paragraph start and the new paragraph is what I'm looking for
There is text above this paragraph
## Required reading
* line
* line
* line
/n
### Supplementary reading
There is text below this paragraph
When I run the code I get all the lines from the files as expected.
You need to learn how your imported text is structured. How are the paragraphs separated? If the separator looks like '\n\n', could you split your text on '\n\n' and take the index of the paragraph you want?
text = 'paragraph one text\n\nparagraph two text\n\nparagraph three text'.split('\n\n')[1]
print(text)
>>> 'paragraph two text'
The other option, as someone else mentioned, is regular expressions (aka RegEx), which you can import using
import re
RegEx is used to find patterns in text.
Go to https://pythex.org/, grab a sample of one of the documents, and experiment with finding the pattern that matches the paragraph you want.
Learn more about RegEx here
https://regexone.com/references/python
Solved my problem with string slicing.
Basically, I just scan the text for a start string and an end string and slice out what lies between them. These slices then get appended to a list and written into a file.
for f in file_list:
    with open(f, 'rt') as fl:
        lines = fl.read()
        lines = lines[lines.find('## Required reading'):lines.find('## Supplementary reading')]
        lines = lines[lines.find('## Required reading'):lines.find('### Supplementary reading')]
        lines = lines[lines.find('## Required reading'):lines.find('## Required reading paragraph')]
        rr.append(lines)
But I still have "## Required reading" in my list and in my file so I run a second read/write method.
def removeHashTag():
    global line
    f = open("required_reading.md", "r")
    lines = f.readlines()
    f.close()
    f = open("required_reading.md", "w")
    for line in lines:
        if line != "## Required reading" + "\n":
            f.write(line)
    f.close()
removeHashTag()
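An alternative sketch that gets the section body in one step, without the heading and without a second pass over the output file. It assumes every section ends at the next Markdown heading or at the end of the file; the function name `extract_section` is hypothetical. It returns None when the heading is missing, whereas str.find returns -1 in that case, which silently turns into a slice index.

```python
import re

def extract_section(markdown, heading="## Required reading"):
    """Return the text under `heading`, up to the next Markdown heading."""
    pattern = re.compile(
        r"^" + re.escape(heading) + r"[ \t]*\n(.*?)(?=^#{1,6} |\Z)",
        flags=re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(markdown)
    # None signals that the heading does not occur in this file.
    return match.group(1).strip() if match else None

sample = (
    "There is text above this paragraph\n"
    "## Required reading\n"
    "* line\n"
    "* line\n"
    "* line\n"
    "\n"
    "### Supplementary reading\n"
    "There is text below this paragraph\n"
)
print(extract_section(sample))
```

Inside the loop over file_list this would replace the slicing lines, for example by appending extract_section(fl.read()) to rr when the result is not None.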

Trying to do a Find/Replace Across Several Lines in Several Text Files

I have several blocks of text that look like this:
steps:
- class: pipe.steps.extract.Extract
conf:
unzip_patterns:
- .*EstimatesDaily_RealEstate_Q.*_{FD_YYYYMMDD}.*
id: extract
- class: pipe.steps.validate.Validate
conf:
schema_def:
fields:
I want to replace this block of text with this:
global:
global:
schema_def:
fields:
The catch here is that the text crosses several lines in each text file. Maybe there is an easy workaround for this, not sure. More troublesome is that I don't always have '- .*EstimatesDaily_RealEstate_Q.*_{FD_YYYYMMDD}.*'. Sometimes the text is '- .*EstimatesDaily_RealEstate_Y.*_{FD_YYYYMMDD}.*' or it could be '- .*EstimatesDaily_RealEstate_EAP_Nav.*_{FD_YYYYMMDD}.*'. One thing that is always the same in each block is that it starts with ' steps:' and ends with ' fields:'.
My sample code looks like this:
import glob
import re
path = 'C:\\Users\\ryans\\OneDrive\\Desktop\\output\\*.yaml'
regex = re.compile("steps:.*fields:", re.DOTALL)
print(regex)
replace = """global:
global:
schema_def:
fields:"""
for fname in glob.glob(path):
    #print(str(fname))
    with open(fname, 'r+') as f:
        text = re.sub(regex, replace, '')
        f.seek(0)
        f.write(text)
        f.truncate()
Of course, my example isn't simple.
Since you're doing a general replacement of things between strings, I'd say this calls for a regular expression [EDIT: Sorry, I see you've since replaced your string "replace" statements with regexp code]. So if your file is "myfile.txt", try this:
>>> import re
>>> f = open('myfile.txt', 'r')
>>> content = f.read()
>>> f.close()
>>> replacement = ' global:\n global:\n schema_def:\n fields:'
>>> print(re.sub(r"(\ssteps\:)(.*?)(\sfields\:)", replacement, content, flags=re.DOTALL))
The output here should be the original contents of "myfile.txt" with all of the substitutions.
Instead of editing files directly, the usual convention in Python is to just copy what you need from a file, change it, and write everything back to a new file. It's less error prone this way, and should be fine unless you're dealing with an astronomically huge amount of content. So you could replace the last line I have here with something like this:
>>> newcontent = re.sub(r"(\ssteps\:)(.*?)(\sfields\:)", replacement, content, flags=re.DOTALL)
>>> f = open('newfile.txt', 'w')
>>> f.write(newcontent)
>>> f.close()
Regex is probably the best answer here and will make this simple. Your mileage will vary with my example regex. Make it as tight as you need to make sure you only replace what you need to and don't get false positives.
import glob
import re
# re.DOTALL means "." matches across newlines!
regex = re.compile("steps:.*?fields:", flags=re.DOTALL)
replace = """global:
global:
schema_def:
fields:"""
def do_replace(fname):
    with open(fname, 'r') as f:
        text = f.read()
    with open(fname, 'w') as f:
        # count=1 replaces only the first matching block in each file
        f.write(regex.sub(replace, text, count=1))
# path is the glob pattern defined in the question
for fname in glob.glob(path):
    print(str(fname))
    do_replace(fname)
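To try the pattern without touching any files, here is a sketch that runs the same non-greedy, DOTALL substitution on an in-memory string. The sample YAML and its indentation are invented for illustration, and the replacement text is kept flat as it appears above; adjust both to the real files.

```python
import re

# Non-greedy so the match stops at the first "fields:" after "steps:".
BLOCK = re.compile(r"steps:.*?fields:", flags=re.DOTALL)
REPLACEMENT = "global:\nglobal:\nschema_def:\nfields:"

sample = (
    "name: demo\n"
    "steps:\n"
    "- class: pipe.steps.extract.Extract\n"
    "  id: extract\n"
    "- class: pipe.steps.validate.Validate\n"
    "  conf:\n"
    "    schema_def:\n"
    "      fields:\n"
    "      - a\n"
)

# count=1 limits the substitution to the first block found.
result = BLOCK.sub(REPLACEMENT, sample, count=1)
print(result)
```

Everything before "steps:" and after "fields:" is left untouched, so checking the printed result against a real file is a cheap way to catch false positives before overwriting anything.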

Splitting lines in python based on some character

Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
Problem which I am getting:
Output is like :
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated...!!!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter= counter -1
        counter= counter +1
    file_open2.write(i)
You can try something like this:
with open("abc.txt") as f:
    data=f.read().replace("\r\n","") #replace the newlines with ""
    #the newline can be "\n" in your system instead of "\r\n"
    ans=filter(None,data.split("!")) #split the data at '!', then filter out empty lines
    for x in ans:
        print("!"+x) #or write to some other file
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. This is almost the lines you want to write -- The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
A few other points, when opening files, it's nice to use a context manager -- This makes sure that the file is closed properly:
with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile','w') as fout:
    fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print(data.replace('+0013!', "+0013\n!"))
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:
import re
outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4
After we have a match, we strip out the new lines from the match, and write it to the file.
Let's try to add a \n before every "!" (after dropping the existing line breaks), then let Python splitlines :-) :
[line for line in file_read.replace("\n", "").replace("!", "\n!").splitlines() if line]
I would actually implement this as a generator so that you can work on the data stream rather than on the entire content of the file. This is quite memory-friendly when working with huge files.
>>> def split_on_stream(it,sep="!"):
        prev = ""
        for line in it:
            line = (prev + line.strip()).split(sep)
            for parts in line[:-1]:
                yield parts
            prev = line[-1]
        yield prev
>>> with open("test.txt") as fin:
        for parts in split_on_stream(fin):
            print(parts)
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
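Note that the output above has lost the leading "!". Here is a variant sketch of the same streaming idea that puts the separator back and skips empty pieces. The name `iter_records` is made up, and it assumes the data starts with "!", since anything before the first separator would also get one prepended.

```python
def iter_records(lines, sep="!"):
    """Yield sep-prefixed records from an iterable of arbitrarily wrapped lines."""
    pending = ""
    for line in lines:
        # Glue the unfinished tail of the previous line onto this one.
        pieces = (pending + line.rstrip("\r\n")).split(sep)
        for piece in pieces[:-1]:
            if piece:                 # skip the empty piece before a leading sep
                yield sep + piece
        pending = pieces[-1]          # may continue on the next line
    if pending:
        yield sep + pending

wrapped = [
    "!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1\n",
    "2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,562\n",
]
for record in iter_records(wrapped):
    print(record)
```

Because it only needs an iterable of lines, an open file object can be passed in directly and the records written out one at a time.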
