I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?
The nearest equivalent to Tcl's string map would be str.translate, but unfortunately it can only map single characters. So a regexp is needed to get a similarly compact solution. This can be done with look-behind/look-ahead assertions, but the \r's have to be removed first:
import re
oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
\tThis would keep paragraphs separated.
\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
output:
This would keep paragraphs separated. This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
When, in the course of human events, it becomes necessary for one people
I doubt whether this is as efficient as the Tcl code, though.
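For comparison, here is a rough sketch of how Tcl's string map could be emulated in Python with a single regex pass. The helper name string_map and the sample text are made up for illustration; it relies on the fact that regex alternatives are tried left to right, so earlier keys win, much as the order of the pairs matters in the Tcl call.

```python
import re

def string_map(mapping, text):
    """Rough stand-in for Tcl's `string map` (hypothetical helper, not a stdlib function)."""
    # Alternatives are tried left to right at each position, so keys listed
    # earlier in the mapping take priority over later ones.
    pattern = re.compile("|".join(re.escape(key) for key in mapping))
    return pattern.sub(lambda match: mapping[match.group(0)], text)

# Same pairs, in the same order, as the Tcl example in the question.
unwrap_map = {"\r": "", "\n\n": "\n\n", "\n\t": "\n\t", "\n": " "}

sample = "When, in the course\r\nof human events,\n\nnext paragraph\n\tindented one\n"
print(repr(string_map(unwrap_map, sample)))
```

Because the "\n\n" pair consumes both newlines at once, the second newline of a paragraph break is never seen as a single newline, so no stray space appears at the start of the next paragraph.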
UPDATE:
I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my Tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my Python equivalent:
import re
with open('gutenberg.txt') as stream:
    oldtext = stream.read()
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the Tcl version is more efficient. However, the output from the Python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
You can use a regular expression with a look-ahead search:
import re
text = """
...
"""
newtext = re.sub(r"\n(?=[^\n\t])", " ", text)
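That replaces any newline that is followed by a character other than a newline or a tab with a space. Note that the second newline of a blank-line paragraph break is followed by ordinary text, so it is replaced too and each new paragraph then starts with a space; adding a look-behind, as in (?<!\n)\n(?![\n\t]), avoids that.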
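To make the difference between the two patterns in this thread concrete, here is a small comparison on a made-up sample string (the variable names are only for illustration):

```python
import re

sample = "line one\nline two\n\nnew paragraph\n\tindented paragraph\n"

# Look-ahead only: a newline followed by anything except newline or tab.
lookahead_only = re.sub(r"\n(?=[^\n\t])", " ", sample)

# Look-behind plus negative look-ahead: also skips the second newline of a pair.
with_lookbehind = re.sub(r"(?<!\n)\n(?![\n\t])", " ", sample)

print(repr(lookahead_only))
print(repr(with_lookbehind))
```

The first pattern leaves the final newline alone but puts a space in front of "new paragraph"; the second keeps the blank line intact but turns the final newline into a space, because a negative look-ahead also succeeds at the end of the string.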
I use the following script when I want to do this:
import sys
import os
filename, extension = os.path.splitext(sys.argv[1])
with open(filename + extension, encoding='utf-8-sig') as file, \
        open(filename + "_unwrapped" + extension, 'w', encoding='utf-8-sig') as output:
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)
A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
The last line in the file may not have a trailing linefeed and therefore gets copied directly.
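The four rules above can also be sketched as a pure function that works on a list of lines, which makes them easy to try out without touching the file system. The function name unwrap_lines and the sample data are assumptions for illustration; the empty-input guard is an addition that the script itself does not have.

```python
def unwrap_lines(lines):
    """Apply the per-line rules to a list of lines and return the joined text."""
    if not lines:
        return ""  # added guard: the star-unpacking below needs at least one line
    *body, last = lines
    out = []
    for line in body:
        if line == "\n":
            # Blank line: restore the linefeed removed from the previous line.
            out.append("\n\n")
        elif line.startswith("\t"):
            # Tab-indented paragraph start: keep it on its own line.
            out.append("\n" + line[:-1] + " ")
        else:
            # Ordinary line: swap the trailing linefeed for a space.
            out.append(line[:-1] + " ")
    out.append(last)  # the last line is copied unchanged
    return "".join(out)

sample = ["a\n", "b\n", "\n", "c\n", "\td\n", "e"]
result = unwrap_lines(sample)
print(repr(result))
```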
I am trying to figure out how to write a simple regex that would highlight newline characters only if they appear at the beginning or end of some data while preserving the newline.
In the below example, line 1 and line 14 both are new lines. Those are the only two lines I am trying to highlight as they appear at the beginning and end of the data.
import regex as re
from colorama import Fore, Back
def red(s):
    return Back.RED + s + Back.RESET
with open('/tmp/1.py', 'r') as f:
    data = f.read()
print(
    re.sub(r'(^\n|\n$)', red(r'\1'), data)
)
In the open expression, data is the same content as the example posted above.
In the above example, this is the result I am getting:
As one can see, the red highlight is missing on line 1 and is spanning all the way in line 14. What I would like is for the color to appear only once per new line character.
You can actually use your regex, but without the "multiline" flag. Then it will treat the whole string as one unit and you will get the matches you want.
^\n|\n$
Here you can see that there are two matches. And if you delete the newlines at the start or at the end, the matches will disappear. The multiline flag is set or disabled at the end of the regex line. You could do that in your language too.
https://regex101.com/r/pSRHPU/2
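As a rough illustration of what the multiline flag changes, the sketch below lists the match positions of the same pattern with and without re.MULTILINE. It uses the standard re module rather than the regex package from the question, and the helper name spans and the sample string are made up.

```python
import re

def spans(pattern, text, flags=0):
    """Return the (start, end) positions of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text, flags)]

# A leading newline, a blank line in the middle, and a trailing newline.
data = "\nfoo\n\nbar\n"

default_spans = spans(r"^\n|\n$", data)
multiline_spans = spans(r"^\n|\n$", data, re.MULTILINE)

print(default_spans)    # only the very first and very last newline
print(multiline_spans)  # also the newlines around the blank line
```

With the flag set, ^ and $ match at every line boundary, so the newlines around the blank line in the middle are picked up as well.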
After reading all the comments, and suggestions, and combining a subset of them all, I finally have a working version. For anyone that is interested:
One issue I cannot overcome without writing an OS-specific check is that an extra newline is added on Windows.
A couple of highlights that were picked up:
A \n cannot be colored, so replace it with a colored space followed by a newline.
I have not tested this, but by getting rid of the group replacement, it may be possible to apply this to bytes as well.
Windows support can be attained with init() from colorama.
import regex as re
from colorama import Back, init
init() # for windows
def red(s):
    return Back.RED + s + Back.RESET
with open('/tmp/1.py', 'r') as f:
    data = f.read()
first_line = re.sub(r'\A\n', red(' ') + '\n', data)
last_line = re.sub(r'\n\Z', '\n' + red(' '), first_line)
print(last_line)
OSX/Linux
Windows
I found a way that seems to allow you to match the start/end of the whole string. See the "Permanent Start of String and End of String Anchors" part from https://www.regular-expressions.info/anchors.html
\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string.
I created a demo here https://regex101.com/r/n2DAWh/1
Regex is: (\A\n|\n\Z)
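One case where the permanent anchors behave differently from ^ and $ even without the multiline flag is a string that ends in two newlines, because $ also matches just before a final newline. The following sketch, using the standard re module and a made-up sample string, shows the difference:

```python
import re

def spans(pattern, text):
    """Return the (start, end) positions of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# Leading newline plus two trailing newlines.
data = "\nfoo\n\n"

dollar_spans = spans(r"^\n|\n$", data)    # $ also matches before the last newline
anchor_spans = spans(r"\A\n|\n\Z", data)  # \Z matches only at the very end

print(dollar_spans)
print(anchor_spans)
```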
I want to compare two text files in Python, and return the lines that are different. My attempt uses difflib, but I'm open to other suggestions. I need to get the lines that are different, as well as the lines that appear in one file but not the other. Order is somewhat important, but if a good solution exists that doesn't take order into consideration, I can let go of that.
The problem is that one file has lines that have multiple trailing characters \t and \n, while the other doesn't; I don't want to consider that as a difference. For other files, the first file has only \n and the other file has \t characters at the end. The lines contain elements that are separated by tabs or spaces, so those are important; I just don't care for the trailing characters \t and \n.
My solution:
from difflib import Differ
with open(file_path) as actual:
    with open(test_file_path) as test:
        differ = Differ()
        for line in differ.compare(actual.readlines(), test.readlines()):
            if line.startswith('-'):
                log.error('EXPECTED: {}'.format(line[2:]))
            if line.startswith('+'):
                log.error('TEST FILE: {}'.format(line[2:]))
I expect the output to show EXPECTED and TEST FILE lines when there's a difference, and just EXPECTED or just TEST FILE when one contains a line the other doesn't. Right now, I'm seeing a lot of the following types of errors:
00:02:40: ERROR EXPECTED: Issuer Type OBal Net WAC OTerm WAM Age GrossCpn HighRemTerm Grp
00:02:40: ERROR TEST FILE: Issuer Type OBal Net WAC OTerm WAM Age GrossCpn HighRemTerm Grp
As you can see (if you highlight it), the first line contains a number of spaces after 'Grp' and the other line doesn't. I want to consider these two lines the same.
I've tried to explicitly specify the tabs and line breaks:
actual_file = actual.readlines()
expected_file = []
for line in actual_file:
    if line[-1] == '\n':
        expected_file.append(line.rstrip('\n').rstrip('\t') + '\n')
    else:
        expected_file.append(line.rstrip('\t'))
However, it (a) slows the process down quite a bit, and (b) is required for every file type in a different way, since some files have trailing tabs followed by line breaks, some have just line breaks, and some have nothing at all. If there's no better way, I can strip every line of every trailing tab and linebreak, but it seems like a lot of processing power (I have to run a lot of files) for something that seems fairly easy to resolve.
Take a look at str.rstrip(). The Python 2 documentation lists it under the string module here: https://docs.python.org/2/library/string.html#string.rstrip; in Python 3 the module-level function is gone, so call it as a method on the string itself.
rstrip() should do exactly what you need by stripping whitespace off the end of a string, while leaving \t and \n characters before the end alone.
Check it out:
>>> s = "This \t is \t a \t line \t\t\t\n\n\n"
>>> print(s)
This is a line
>>>
>>> s = s.rstrip()
>>> s
'This \t is \t a \t line'
>>> print(s)
This is a line
>>>
Hope this helps!
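Applied to the diff in the question, the idea might look like the sketch below: strip trailing whitespace from every line of both inputs before handing them to Differ. The function name and the sample lists are made up for illustration, and the "?" hint lines that Differ can emit are simply dropped here.

```python
from difflib import Differ

def diff_ignoring_trailing_whitespace(expected_lines, test_lines):
    """Return only the '-' and '+' lines of a diff, ignoring trailing whitespace."""
    expected = [line.rstrip() for line in expected_lines]
    test = [line.rstrip() for line in test_lines]
    return [
        entry
        for entry in Differ().compare(expected, test)
        if entry.startswith(("-", "+"))
    ]

# Lines that differ only in trailing tabs, spaces or newlines compare equal.
same = diff_ignoring_trailing_whitespace(["a\tb\t\t\n", "Grp   \n"], ["a\tb\n", "Grp"])

# A line present in only one input is still reported.
missing = diff_ignoring_trailing_whitespace(["x\n", "y\n"], ["x\n"])

print(same)
print(missing)
```

Tabs and spaces inside a line are left untouched, so differences in the separators between elements are still reported.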
I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term,replace_term]]
This is not producing the right input. If I print the detergent I get
['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'],['peter','Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term,replace_term,file_content) written further below in the content is replacing it to be
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I had the regexes inside the script I would write them inside r'...', but I am not sure what the issues are when reading them from a file.
There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Evaluate it at the interactive prompt and you'll see it is displayed the same way:
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax is \1 (or \g<1>).
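A minimal sketch of the whole round trip, assuming the pairs are stored with \1 instead of $1 and without a backslash before the ? in the replacement (the replacement is not a regex, so the ? needs no escaping there). The function name apply_pairs and the in-memory list that stands in for the lines of regexes.csv are made up for illustration.

```python
import re

def apply_pairs(pair_lines, text):
    """Apply each 'search,replace' line to text in order and return the result."""
    for line in pair_lines:
        search_term, replace_term = line.strip().split(",", 1)
        text = re.sub(search_term, replace_term, text)
    return text

# Stand-in for the lines read from the file; raw strings mimic the file content.
pair_lines = [r"<\?xml([^>]*?)>,<?XML\1>", "peter,Peter"]

result = apply_pairs(pair_lines, '<?xml version="1.0"> peter')
print(result)
```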
I have data in the following form in a file:
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established</text\u003e\n______<sha1\u003eqwjfowt5my8t6yuszdb88k2ehskjuh0</sha1\u003e\n____</revision\u003e\n__</page\u003e\n__<page\u003e\n____<title\u003ePortal:Tropical_cyclones/Anniversaries/August_22</title\u003e\n____<ns\u003e100</ns\u003e\n____<id\u003e7957689</id\u003e\n____<revision\u003e\n______<id\u003e446349886</id\u003e\n______<timestamp\u003e2011-08-23T17:38:19Z</timestamp\u003e\n______<contributor\u003e\n________<username\u003eLightbot</username\u003e\n________<id\u003e7178666</id\u003e\n______</contributor\u003e\n______<comment\u003eDelink_non-obscure_units._Conversions._Report_bugs_to_[[User_talk:Lightmouse>.
The delimiter in the above file is a tab (\t), i.e. string1 is separated from abc:string2 by \t. Similarly for the rest of the strings.
Now I want to retain just letters, digits, /, :, . and _ within the strings that are enclosed in <>. I want to get rid of all characters apart from the specified ones in the strings that are enclosed in <>.
Is there some way by which I may achieve this using linux commands or python? I want to replace all the unwanted characters by an underscore.
<string1> abc:string2 <http://yago-knowledge.org/resource/wikicategory_Sports_clubs_established_text_u003e_n_______sha1_u003eqwjfowt5my8t6yuszdb88k2ehskjuh0_sha1_u003e_n_____revision_u003e_n___/page_u003e_n___page_u003e_n_____title_u003ePortal:Tropical_cyclones/Anniversaries/August_22_/title_u003e_n_____ns_u003e100_/ns_u003e_n_____id_u003e7957689_/id_u003e_n_____revision_u003e_n_______id_u003e446349886_/id_u003e_n_______timestamp_u003e2011-08-23T17:38:19Z_/timestamp_u003e_n_______contributor_u003e_n_________username_u003eLightbot_/username_u003e_n_________id_u003e7178666_/id_u003e_n_______/contributor_u003e_n_______comment_u003eDelink_non-obscure_units._Conversions._Report_bugs_to___User_talk:Lightmouse>.
Is there some way by which I may achieve this?
You can probably achieve this just with UNIX tools and some crazy regular expression, but I would write a small Python script for this:
Open two files (input and output) with open()
Iterate over the input file line by line: for line in input_file:
Split the line at tab: for part in line.split('\t'):
Check if a part is enclosed in <>: if part.startswith('<') and part.endswith('>'):
Filter the characters between the brackets with a regular expression, replacing each unwanted one with an underscore: filtered_part = '<' + re.sub(r'[^a-zA-Z0-9/:._]', '_', part[1:-1]) + '>'
Join the filtered parts back together: filtered_line = '\t'.join(filtered_parts)
Write the filtered line to the output file: output_file.write(filtered_line + '\n')
Following this outline, it should be easy for you to write a working script.
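The outline above could be sketched roughly as follows. The function names and the shortened sample line are made up for illustration. Since the last field in the question's data ends in ">." rather than ">", this sketch treats everything between the leading "<" and the last ">" of a field as the enclosed part, which is an assumption about the data.

```python
import re

UNWANTED = re.compile(r"[^a-zA-Z0-9/:._]")

def clean_field(field):
    """Replace unwanted characters inside the outer <...> of a field with '_'."""
    end = field.rfind(">")
    if not field.startswith("<") or end == -1:
        return field  # not enclosed in <>, leave untouched
    inner = field[1:end]
    # Keep the outer brackets and anything after the closing one (e.g. a final '.').
    return "<" + UNWANTED.sub("_", inner) + field[end:]

def clean_line(line):
    """Clean every tab-separated field of one line."""
    return "\t".join(clean_field(field) for field in line.split("\t"))

sample = "<string1>\tabc:string2\t" + r"<http://example.org/a</text\u003e\n__<b>."
cleaned = clean_line(sample)
print(cleaned)
```

To process a file, one could call clean_line(line.rstrip("\n")) for each input line and write the result plus "\n" to the output file.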
I want to convert a Python multiline string to a single line. If I open the string in Vim, I can see ^M at the start of each line. How do I process the string to put it all on a single line, with tab separation between the original lines? For example, in Vim it looks like:
Serialnumber
^MName Rick
^MAddress 902, A.street, Elsewhere
I would like it to be something like:
Serialnumber \t Name \t Rick \t Address \t 902, A.street,......
where each string is in one line. I tried
somestring.replace(r'\r','\t')
But it doesn't work. Also, once the string is in a single line if I wanted a newline(UNIX newline?) at the end of the string how would I do that?
Deleted my previous answer because I realized it was wrong and I needed to test this solution.
Assuming that you are reading this from the file, you can do the following:
f = open('test.txt', 'r')
lines = f.readlines()
mystr = '\t'.join([line.strip() for line in lines])
As ep0 said, the ^M represents '\r', which is the carriage return character in Windows. It is surprising that you would have ^M at the beginning of each line, since the Windows newline sequence is \r\n. Having ^M at the beginning of the line indicates that your file contains \n\r instead.
Regardless, the code above makes use of a list comprehension to loop over each of the lines read from test.txt. For each line in lines, we call str.strip() to remove any whitespace and non-printing characters from the ENDS of each line. Finally, we call '\t'.join() on the resulting list to insert tabs.
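If the text is already in a string rather than a file, a similar result can be sketched with str.splitlines(), which splits on \n and \r alike. The function name to_single_line and the sample string are made up for illustration; empty pieces, such as the ones between \n and \r, are dropped, and a single Unix newline is appended at the end.

```python
def to_single_line(text):
    """Join the non-empty lines of text with tabs and end with one newline."""
    parts = [part.strip() for part in text.splitlines()]
    return "\t".join(part for part in parts if part) + "\n"

sample = "Serialnumber\n\rName Rick\n\rAddress 902, A.street, Elsewhere"
single = to_single_line(sample)
print(repr(single))
```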
You can replace "\r" characters by "\t". Strings are immutable, so keep the value that replace() returns:
my_string = my_string.replace("\r", "\t")
I use splitlines() to detect all types of lines, and then join everything together. This way you don't have to guess to replace \r or \n etc.
"".join(somestring.splitlines())
This is hard-coded, but it works.
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
lst = list(poem)
text = ''
for ch in lst:
    text += ch
print(text)
lines = text.split("\n")
joined = ""
for line in lines:
    joined += line + " "
single_line = joined[:-2]
print(single_line)
This occurs because of how Vim interprets CR (carriage return), used by Windows to delimit new lines. You should use just one editor (I personally prefer Vim). Read this: VIM ^M
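This trick can also be useful when the text contains the literal two characters backslash and n instead of a real newline: write "\n" as a raw string, so that replace() looks for those two characters. Like: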
my_string = my_string.replace(r"\n", "\t")
this should do the work:
def flatten(multiline):
    lst = multiline.split('\n')
    flat = ''
    for line in lst:
        flat += line.replace(' ', '')+' '
    return flat
This should do the job:
string = """Name Rick
Address 902, A.street, Elsewhere"""
single_line = string.replace("\n", "\t")