Remove one comma using a Python script - python

I have a CSV file with a line that looks something like this:
,,,,,,,,,,
That's 10 commas. I wish to remove only the last (i.e. the 10th) comma, so that the line changes to:
,,,,,,,,,
Has anyone had any experience dealing with a case like this? I also use the vim text editor, so any help with a Python script or vim text-editing commands would be appreciated.

Removing last comma in current line in vim:
:s/,$//
The same for lines n through m:
:n,ms/,$//
The same for whole file:
:%s/,$//
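For a scripted alternative (a sketch, not part of the original answer), the same `:%s/,$//` substitution can be written in Python; the sample text here is made up:

```python
import re

# Made-up stand-in for the file contents: lines that may end with a comma.
text = ",,,,,,,,,,\nfoo,bar,\nbaz"

# Equivalent of vim's :%s/,$// -- remove one trailing comma on every line.
# flags=re.MULTILINE makes $ match at the end of each line, not just the string.
cleaned = re.sub(r",$", "", text, flags=re.MULTILINE)
print(cleaned)
```

The first line drops from 10 commas to 9, and lines without a trailing comma are left alone.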

This will do it in the simplest case; once you've updated your question with what you're looking for, I'll update the code.
commas = ",,,,,,,,,,,"
print(commas.replace("," * 10, "," * 9))
If you want to remove the last comma on any given line you can do:
import re

commas = """,,,,,,,,,,
,,,,,,,,,,"""
print(re.sub(r',$', '', commas, flags=re.MULTILINE))
Note that re.MULTILINE must be passed as the flags keyword argument; passed positionally it would be interpreted as the count argument.
And if, in any file, you want to take a line that is just 10 commas and make it 9 commas:
import re

commas = ",,,,,,,,,,\n,,,,,,,,,,\n,,,,,,,,,,"
print(re.sub(r'^,{10}$', ',' * 9, commas, flags=re.MULTILINE))

I would use:
sed -i 's/,$//' file.csv

I really love the vim normal command. So if you want to remove the last "column" from this CSV file, I'd do it like this:
:%normal $F,D
In other words, execute on every line of the file (%) the following commands (normal):
$ - go to the end of the line;
F, - move the cursor back to the previous comma on the line;
D - delete from the cursor to the end of the line.
Also, this can be used with ranges (from line 1 to 20):
:1,20normal $F,D
But if the lines contain nothing but commas, with no data between them, you can simply do:
:%normal $X
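If you'd rather drop the last column in a script, here is a minimal Python sketch using the csv module (the data and in-memory files are made up for illustration):

```python
import csv
import io

# In-memory stand-ins for the input and output files (hypothetical data
# with a trailing empty column on every row).
src = io.StringIO("a,b,\nc,d,\n")
out = io.StringIO()

writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(src):
    writer.writerow(row[:-1])  # drop the last column, like $F,D does per line

print(out.getvalue())
```

With real files you would pass file objects opened with newline='' instead of the StringIO stand-ins.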

regex to highlight new line characters in the beginning and end

I am trying to figure out how to write a simple regex that would highlight newline characters only if they appear at the beginning or end of some data while preserving the newline.
In the below example, line 1 and line 14 both are new lines. Those are the only two lines I am trying to highlight as they appear at the beginning and end of the data.
import regex as re
from colorama import Fore, Back

def red(s):
    return Back.RED + s + Back.RESET

with open('/tmp/1.py', 'r') as f:
    data = f.read()

print(
    re.sub(r'(^\n|\n$)', red(r'\1'), data)
)
In the open expression, data is the same content as the example posted above.
In the above example, this is the result I am getting (screenshot omitted): the red highlight is missing on line 1 and spans the whole of line 14. What I would like is for the color to appear only once per newline character.
You can actually use your regex, but without the "multiline" flag. Then the whole string is treated as one line, and you will match your desired output.
^\n|\n$
Here you can see that there are two matches, and if you delete the newlines at the front or the end, the matches disappear. On regex101 the multiline flag is set or cleared at the end of the regex line; you can do the same in your language.
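A small sketch of the difference the flag makes (the sample string is made up; it contains an inner blank line as well as a leading and trailing newline):

```python
import re

data = "\nline one\n\nline two\n"

# With MULTILINE, ^ and $ match at every line boundary, so the newlines
# around the inner blank line match too.
with_m = re.findall(r"^\n|\n$", data, flags=re.MULTILINE)

# Without it, ^ anchors to the start of the whole string and $ to its end,
# so only the outermost newlines match.
without_m = re.findall(r"^\n|\n$", data)

print(len(with_m), len(without_m))
```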
https://regex101.com/r/pSRHPU/2
After reading all the comments and suggestions, and combining a subset of them, I finally have a working version. For anyone who is interested:
One issue I cannot overcome without writing an OS-specific check is the extra newline being added on Windows.
A couple of highlights that were picked up:
you cannot color a \n, so replace it with a space plus a newline;
I have not tested this, but by getting rid of the group replacement it may be possible to apply this to bytes as well;
Windows support can be attained with init() in colorama.
import regex as re
from colorama import Back, init

init()  # for windows

def red(s):
    return Back.RED + s + Back.RESET

with open('/tmp/1.py', 'r') as f:
    data = f.read()

first_line = re.sub(r'\A\n', red(' ') + '\n', data)
last_line = re.sub(r'\n\Z', '\n' + red(' '), first_line)
print(last_line)
(Screenshots of the result on OSX/Linux and on Windows omitted.)
I found a way that seems to allow you to match the start/end of the whole string. See the "Permanent Start of String and End of String Anchors" part from https://www.regular-expressions.info/anchors.html
\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string.
I created a demo here https://regex101.com/r/n2DAWh/1
Regex is: (\A\n|\n\Z)
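A quick Python sketch of those anchors (the sample string is made up). Because \A and \Z ignore re.MULTILINE, only the outermost newlines match even with the flag set:

```python
import re

# Leading newline, an inner blank line, and a trailing newline.
data = "\nfirst line\n\nlast line\n"

# \A and \Z always anchor to the whole string, so the newlines around
# the inner blank line do not match.
matches = re.findall(r"\A\n|\n\Z", data, flags=re.MULTILINE)
print(len(matches))
```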

Replacing a numeric string with another in a .txt file [duplicate]

This question already has answers here:
How to convert a string to a number if it has commas in it as thousands separators?
(10 answers)
Closed 3 years ago.
My .txt has a lot of numbers divided into two columns, but they are written the Brazilian way (meaning the number 3.41 is written as 3,41)... I know how to read each column; I just need to change every comma in the .txt to a dot, but I have no idea how to do that...
There are three ways I thought of to solve the problem:
Change every comma into a dot and overwrite the previous .txt;
Write another .txt with another name, but with every comma changed into a dot;
Import every string (that should be a float) from the .txt and use replace to change the "," into a ".".
If you can help me with one of the first two ways, that would be better, especially the first one.
(I have only imported numpy and don't know how to use other libraries yet, so if you could help me with the code I would really appreciate that; sorry about the bad English, love ya)
Try this:
with open('input.txt') as input_f, open('output.txt', 'w') as output_f:
    for line in input_f.readlines():
        output_f.write(line.replace(',', '.'))
for input.txt:
1,2,3,4,5
10,20,30,40
the output will be:
1.2.3.4.5
10.20.30.40
While your question is tagged python, here's a super-simple non-pythonic way, using the sed command-line utility.
This will replace all commas (,) with dots (.) in your textfile, overwriting the original file:
sed -e 's/,/./g' -i yourtext.txt
Or, if you want the output in a different file:
sed -e 's/,/./g' yourtext.txt > newfile.txt
umlaute's answer is good, but if you insist on doing this in Python you can use fileinput, which supports in-place replacement:
import fileinput

with fileinput.FileInput(filename, inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(',', '~').replace('.', ',').replace('~', '.'), end='')
Note that with inplace=True, standard output is redirected into the file, so each transformed line must be printed rather than discarded.
This example assumes your data already contains both .'s and ,'s, so it uses the tilde as an interim character while swapping the two. If you have ~'s in your data, feel free to swap that out for another uncommon character.
If you're working with a csv, be careful not to replace your column delimiter character. In this case, you'll probably want to use regex replace instead to ensure that each comma replaced is surrounded by digits: r'\d,\d'
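As a sketch of that idea (not from the answer above), lookarounds can replace a comma only when a digit sits on each side, so a delimiter next to text survives; the sample line and its semicolon delimiter are made up:

```python
import re

# Hypothetical line: Brazilian decimal commas, semicolon as the field delimiter.
line = "3,41;10,5;abc"

# (?<=\d) and (?=\d) match the comma only between two digits, without
# consuming the digits, so runs like "1,2,3" are all converted too.
fixed = re.sub(r"(?<=\d),(?=\d)", ".", line)
print(fixed)
```

Note this cannot distinguish a decimal comma from a comma delimiter that happens to sit between digits; for a comma-delimited CSV you would need more context than one character on each side.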

Most efficient way to delete needless newlines in Python

I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?
The nearest equivalent to the tcl string map would be str.translate, but unfortunately it can only map single characters. So it would be necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r's have to be replaced first:
import re
oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
\tThis would keep paragraphs separated.
\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
output:
This would keep paragraphs separated. This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
When, in the course of human events, it becomes necessary for one people
I doubt whether this is as efficient as the tcl code, though.
UPDATE:
I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my python equivalent:
import re

with open('gutenberg.txt') as stream:
    oldtext = stream.read()

newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
You can use a regular expression with a look-ahead search:
import re
text = """
...
"""
newtext = re.sub(r"\n(?=[^\n\t])", " ", text)
That will replace any new line that is not followed by a newline or a tab with a space.
I use the following script when I want to do this:
import sys
import os

filename, extension = os.path.splitext(sys.argv[1])

with open(filename + extension, encoding='utf-8-sig') as file, \
     open(filename + "_unwrapped" + extension, 'w', encoding='utf-8-sig') as output:
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)
A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
The last line in the file may not have a trailing linefeed and therefore gets copied directly.
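The same four rules can be sketched as a function (the name unwrap is made up) so they can be tested on a string instead of files:

```python
def unwrap(text):
    # Apply the four rules above to a whole string at once.
    *lines, last = text.splitlines(keepends=True)
    out = []
    for line in lines:
        if line == "\n":
            out.append("\n\n")                   # blank line: restore the break
        elif line[0] == "\t":
            out.append("\n" + line[:-1] + " ")   # tab-indented paragraph start
        else:
            out.append(line[:-1] + " ")          # join wrapped line with a space
    out.append(last)                             # last line copied as-is
    return "".join(out)

print(unwrap("one\ntwo\n\nthree\n"))
```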

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term, replace_term]]
This is not producing the right result. If I print detergent I get
[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter', 'Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say,
<? xml ........>
a command re.sub(search_term, replace_term, file_content) further below in the code replaces it with
<\? XML$1>
So, the $1 is not recovering the first capture group of the first regex in the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script, I would write them inside r'...', but I am not sure what the issues are when reading from a file.
There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try printing it; you'll see it is displayed the same way ...
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax is \1.

replace a string with regular expression in python

I have been learning regular expressions for a while but I still find them confusing sometimes.
I am trying to replace all the
self.assertRaisesRegexp(SomeError,'somestring'):
to
self.assertRaiseRegexp(SomeError,somemethod('somestring'))
How can I do it? I am assuming the first step is to fetch 'somestring', modify it to somemethod('somestring'), and then replace the original 'somestring'.
here is your regular expression
#f is going to be your file in string form
re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
this will grab anything that matches and replace it accordingly. It also makes sure the quotation marks line up correctly by setting a backreference named quote.
there is no need to iterate over the file here either: re.sub scans the whole string at once, and the leading "(?m)" puts the expression in multiline mode. I have tested this expression and it works as expected!
test
>>> print f
this is some
multi line example that self.assertRaisesRegexp(SomeError,'somestring'):
and so on. there self.assertRaisesRegexp(SomeError,'somestring'): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,'somestring'): okay
im done now
>>> print re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
this is some
multi line example that self.assertRaisesRegexp(SomeError,somemethod('somestring')):
and so on. there self.assertRaisesRegexp(SomeError,somemethod('somestring')): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,somemethod('somestring')): okay
im done now
A better tool for this particular task is sed:
$ sed -i 's/\(self.assertRaisesRegexp\)(\(.*\),\(.*\))/\1(\2,somemethod(\3))/' *.py
sed will take care of the file I/O, renaming files, etc.
If you already know how to do the file manipulation, and iterating over lines in each file, then the python re.sub line will look like:
new_line = re.sub(r"(self\.assertRaisesRegexp)\((.*),(.*)\)",
                  r"\1(\2,somemethod(\3))",
                  old_line)
