I want Notepad++'s wonderful feature "Delete Surplus blank lines" in Python.
Say I have a file like this:
A


B


C



D

I want

A

B

C

D
What is the pythonic way of doing this?
Here is what I tried:

A = ['a','\n','\n','\n','a','b','\n','\n','C','\n','\n','\n','\n','\n','\n','D']
B = []
count = 0
for l in range(0, len(A)):
    if A[l] == '\n':
        count = count + 1
    else:
        count = 0
    if count > 1:
        if A[l+1] == '\n':
            continue
        else:
            B.append('\n')
    else:
        if A[l] != '\n':
            B.append(A[l])
print B
Make sure there are never more than two consecutive newlines (\n\n), e.g.:
import re
print re.sub(r'\n{3,}', '\n\n', your_string)
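For instance, on the sample string from the question (a quick demo; only runs of three or more newlines are rewritten, so single blank lines are left alone):

import re

s = 'a\n\n\nab\n\nC\n\n\n\n\n\nD'
print(re.sub(r'\n{3,}', '\n\n', s))  # every run of 3+ newlines becomes exactly 2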
And, using itertools.groupby for large files:
from itertools import groupby
with open('your_file') as fin:
    for has_value, lines in groupby(fin, lambda L: bool(L.strip())):
        if not has_value:
            print
            continue
        for line in lines:
            print line,
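If you are on Python 3, here is a sketch of the same idea (print is a function there, so the trailing-comma trick becomes end=''):

from itertools import groupby

with open('your_file') as fin:
    for has_value, lines in groupby(fin, key=lambda line: bool(line.strip())):
        if not has_value:
            print()              # collapse each run of blank lines to a single one
            continue
        for line in lines:
            print(line, end='')  # line already ends with '\n'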
Here is a one-liner:
In [35]: A=['a','\n','\n','\n','a','b','\n','\n','C','\n','\n','\n','\n','\n','\n','D']
In [36]: B = [A[0]] + [A[i] for i in range(1, len(A)) if A[i] != '\n' or A[i-1] != '\n']
In [37]: B
Out[37]: ['a', '\n', 'a', 'b', '\n', 'C', '\n', 'D']
It basically omits newlines that follow other newlines.
Is this what you are looking for?
>>> def delete_surplus_blank_lines(text):
...     while '\n\n\n' in text:
...         text = text.replace('\n\n\n', '\n\n')
...     return text
>>> text = 'a\n\n\nab\n\nC\n\n\n\n\n\nD'
>>> print(text)
a


ab

C





D
>>> print(delete_surplus_blank_lines(text))
a

ab

C

D
>>>
A more efficient implementation (based on ideas from NPE) would be:
def delete_surplus_blank_lines(text):
    return text[:2] + ''.join(text[index] for index in range(2, len(text))
                              if text[index-2:index+1] != '\n\n\n')
A one-liner of that function is fairly easy to create with a lambda (note that a lambda body is an expression, so there is no return):
delete_surplus_blank_lines = lambda text: text[:2] + ''.join(text[index] for index in range(2, len(text)) if text[index-2:index+1] != '\n\n\n')
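A quick check:

>>> delete_surplus_blank_lines('a\n\n\n\n\nb')
'a\n\nb'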
You have a file, so let's define a function called clean_up to clean up the file you give:
def clean_up(file_name, blanks=1):
    # Read everything first: writing to a file while iterating over the same handle is unreliable.
    with open(file_name) as f:
        lines = f.readlines()
    blank = 0
    with open(file_name, 'w') as f:
        for line in lines:
            if line == "\n":
                blank += 1
                if blank <= blanks:
                    f.write(line)
            else:
                blank = 0
                f.write(line)
Now this will rewrite your file and make sure there are no more than blanks blank lines in a row!
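A usage sketch (notes.txt is a hypothetical file name):

clean_up('notes.txt')            # allow at most one blank line in a row
clean_up('notes.txt', blanks=2)  # allow runs of up to two blank lines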
import os

def countlines(start, lines=0, header=True, begin_start=None):
    if header:
        print('{:>10} |{:>10} | {:<20}'.format('ADDED', 'TOTAL', 'FILE'))
        print('{:->11}|{:->11}|{:->20}'.format('', '', ''))
    for thing in os.listdir(start):
        thing = os.path.join(start, thing)
        if os.path.isfile(thing):
            if thing.endswith('.py'):
                with open(thing, 'r') as f:
                    newlines = f.readlines()
                    newlines = list(filter(lambda l: l.replace(' ', '') not in ['\n', '\r\n'], newlines))
                    newlines = list(filter(lambda l: not l.startswith('#'), newlines))
                    newlines = len(newlines)
                    lines += newlines
                    if begin_start is not None:
                        reldir_of_thing = '.' + thing.replace(begin_start, '')
                    else:
                        reldir_of_thing = '.' + thing.replace(start, '')
                    print('{:>10} |{:>10} | {:<20}'.format(
                        newlines, lines, reldir_of_thing))
    for thing in os.listdir(start):
        thing = os.path.join(start, thing)
        if os.path.isdir(thing):
            lines = countlines(thing, lines, header=False, begin_start=start)
    return lines

countlines(r'/Documents/Python/')
If we take a standard Python file, main.py, with 4 lines of code in it, this counts 5. How do I fix that?
How do I properly set up the filter so that it does not count blank lines and comments?
1. You can modify your first filter condition: strip the line, and then check that it isn't empty.
lambda l: l.replace(' ', '') not in ['\n', '\r\n']
becomes
lambda l: l.strip()
2. filter takes any iterable, so no need to convert it to lists every time - this is a waste because it forces two sets of iterations - one when you create the list, another when you filter it a second time. You could remove the calls to list() and only do it once after all your filtering is done. You can also use filter on the file handle itself, since the file handle f is an iterable that yields lines from the file in every iteration. This way, you only iterate over the entire file once.
newlines = filter(lambda l: l.strip(), f)
newlines = filter(lambda l: not l.strip().startswith('#'), newlines)
num_lines = len(list(newlines))
Note that I renamed the last variable, because a variable name should describe what it holds.
3. You can combine both your filter conditions into a single lambda
lambda l: l.strip() and not l.strip().startswith('#')
or, if you have Python 3.8+,
lambda l: (l1 := l.strip()) and not l1.startswith('#')
This makes my point #2 about avoiding the intermediate list() calls moot:
num_lines = len(list(filter(lambda l: (l1 := l.strip()) and not l1.startswith('#'), f)))
With the following input, this gives the correct line count:
file.py:
print("Hello World")
# This is a comment
# The next line is blank
print("Bye")
>>> with open('file.py') as f:
...     num_lines = len(list(filter(lambda l: (l1 := l.strip()) and not l1.startswith('#'), f)))
...     print(num_lines)
...
2
I have a .txt file which contains some words, e.g.:

bye

bicycle

bi
cyc
le

and I want to return a list which contains all the words in the file. I have tried some code which actually works, but I think it takes a lot of time to execute for bigger files. Is there a way to make this code more efficient?
lst1 = []
with open('file.txt', 'r') as f:
    for line in f:
        if line == '\n':  # blank line
            lst1.append(line)
        else:
            lst1.append(line.replace('\n', ''))  # drop the newline so the letters of one word join up

str1 = ''.join(lst1)
lst_fin = str1.split()
expected output:
lst_fin = ['bye', 'bicycle', 'bicycle']
I don't know if this is more efficient, but at least it's an alternative... :)
with open('file.txt') as f:
    words = f.read().replace('\n\n', '|').replace('\n', '').split('|')
    print(words)
...or, if you don't want to insert a character like '|' (which could already be there) into the data, you could also do:

with open('file.txt') as f:
    words = f.read().split('\n\n')
    words = [w.replace('\n', '') for w in words]
    print(words)
result is the same in both cases:
# ['bye', 'bicycle', 'bicycle']
EDIT:
I think I have another approach. However, it requires the file not to start with a blank line, if I understand correctly...

with open('file.txt') as f:
    res = []
    current_elmnt = next(f).strip()
    for line in f:
        if line.strip():
            current_elmnt += line.strip()
        else:
            res.append(current_elmnt)
            current_elmnt = ''
    if current_elmnt:
        res.append(current_elmnt)  # keep the last word, which no blank line terminates
    print(res)
Perhaps you want to give it a try...
You can use the iter function with a sentinel of '' instead:
with open('file.txt') as f:
    lst_fin = list(iter(lambda: ''.join(iter(map(str.strip, f).__next__, '')), ''))
Demo: https://repl.it/#blhsing/TalkativeCostlyUpgrades
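To unpack what that one-liner does, here is the same logic written out as a sketch: the inner iter(callable, sentinel) pulls stripped lines until it hits an empty one (the end of a word), and the outer loop repeats that until the file is exhausted:

def words_from(f):
    words = []
    while True:
        # read stripped lines until a blank line ('' sentinel) ends the word
        word = ''.join(iter(map(str.strip, f).__next__, ''))
        if not word:  # file exhausted
            break
        words.append(word)
    return words

with open('file.txt') as f:
    print(words_from(f))  # ['bye', 'bicycle', 'bicycle']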
You could use this (I don't know about its efficiency):
lst = []
s = ''
with open('tp.txt', 'r') as file:
    l = file.readlines()
    for i in l:
        if i == '\n':
            lst.append(s)
            s = ''
        elif i == l[-1]:
            s += i.rstrip()
            lst.append(s)
        else:
            s += i.rstrip()
print(lst)
I've written some code that can parse a string into tuples as such:
s = '30M3I5X'
l = []
num = ""
for c in s:
    if c in '0123456789':
        num = num + c
        print(num)
    else:
        l.append([int(num), c])
        num = ""
print(l)
That is,
'30M3I5X'
becomes
[[30, 'M'], [3, 'I'], [5, 'X']]
That part works just fine. I'm struggling now, however, with figuring out how to get the values from the first column of a tab-separated-value file to become my new 's', i.e., for a file that looks like:
# File Example #
30M3I45M2I20M I:AAC-I:TC
50M3X35M2I20M X:TCC-I:AG
There would somehow be a loop incorporated to take only the first column, producing
[[30, 'M'],[3, 'I'],[45, 'M'],[2, 'I'],[20, 'M']]
[[50, 'M'],[3, 'X'],[35, 'M'],[2, 'I'],[20, 'M']]
without having to use
import csv
or any other module.
Thanks so much!
Just open the path to the file and iterate through the records?
def fx(s):
    l = []
    num = ""
    for c in s:
        if c in '0123456789':
            num = num + c
            print(num)
        else:
            l.append([int(num), c])
            num = ""
    return l

with open(fp) as f:
    for record in f:
        s, _ = record.split('\t')
        l = fx(s)
        # process l here ...
The following code would serve your purpose:
rows = ['30M3I45M2I20M I:AAC-I:TC', '50M3X35M2I20M X:TCC-I:AG']
for row in rows:
    words = row.split(' ')
    print(words[0])
    l = []
    num = ""
    for c in words[0]:
        if c in '0123456789':
            num = num + c
        else:
            l.append([int(num), c])
            num = ""  # reset so the next number starts fresh
    print(l)
Change row.split(' ') to row.split('\t') or any other separator as the need arises.
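As an aside, the OP preferred to avoid imports, but if the standard-library re module were acceptable, the number/letter tokenizing could be a one-liner (a sketch):

import re

s = '30M3I5X'
print([[int(n), c] for n, c in re.findall(r'(\d+)(\D)', s)])
# [[30, 'M'], [3, 'I'], [5, 'X']]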
Something like this should do what you're looking for:
filename = r'\path\to\your\file.txt'
with open(filename, 'r') as infile:
    for row in infile:
        elements = row.split()
        # processing goes here
elements[0] contains the string that is the first column of data in the file.
Edit:
to end up with a list of the lists of processed data:
result = []
filename = r'\path\to\your\file.txt'
with open(filename, 'r') as infile:
    for row in infile:
        elements = row.split()
        # processing goes here
        result.append(l)  # l is the result of your processing
So this is what ended up working for me--took bits and pieces from everyone, thank you all!
Note: I know it's a bit verbose, but since I'm new, it helps me keep track of everything :)
# Define the parser function
def col1parser(col1):
    l = []
    num = ""
    for c in col1:
        if c in '0123456789':
            num = num + c
        else:
            l.append([int(num), c])
            num = ""
    return l

# Open file, run function on column 1
filename = r'filepath.txt'
with open(filename, 'r') as infile:
    for row in infile:
        elements = row.split()
        col1 = elements[0]
        l = col1parser(col1)
        print(l)
In test.txt:
1 a
2 b
3 c
4 a
5 d
6 c
I want to remove every row whose letter appears more than once and save the rest in test2.txt:
2 b
5 d
I tried to start with the code below.
file1 = open('../test.txt').read().split('\n')
#file2 = open('../test2.txt', "w")
word = set()
for line in file1:
    if line:
        sline = line.split('\t')
        if sline[1] not in word:
            print sline[0], sline[1]
            word.add(sline[1])
#file2.close()
The result from this code was:
1 a
2 b
3 c
5 d
Any suggestions?
You can use collections.OrderedDict here:
>>> from collections import OrderedDict
>>> with open('abc') as f:
...     dic = OrderedDict()
...     for line in f:
...         v, k = line.split()
...         dic.setdefault(k, []).append(v)
...
Now dic looks like:
OrderedDict([('a', ['1', '4']), ('b', ['2']), ('c', ['3', '6']), ('d', ['5'])])
Now we only need those keys which contain only 1 item in the list.
>>> for k, v in dic.iteritems():
...     if len(v) == 1:
...         print v[0], k
...
2 b
5 d
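On Python 3.7+, plain dicts preserve insertion order, so the same idea works without OrderedDict (a sketch; iteritems becomes items):

with open('test.txt') as f:
    dic = {}
    for line in f:
        v, k = line.split()
        dic.setdefault(k, []).append(v)

for k, v in dic.items():
    if len(v) == 1:
        print(v[0], k)  # prints: 2 b / 5 d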
What you're doing is just making sure every second item (the letter) gets printed only once, which is obviously not what you say you want.
You must split your code into two halves: one that reads the file and gathers statistics about letter counts, and one that prints only those letters with count == 1.
Converting your original code (I just made it a little simpler, and stripped the trailing newline so the letter compares cleanly):
file1 = open('../test.txt')
words = {}
for line in file1:
    if line.strip():
        line_num, letter = line.strip().split('\t')
        if letter not in words:
            words[letter] = [1, line_num]
        else:
            words[letter][0] += 1

for letter, (count, line_num) in words.iteritems():
    if count == 1:
        print line_num, letter
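For what it's worth, the counting half can also be delegated to collections.Counter (a sketch, Python 3 syntax):

from collections import Counter

with open('../test.txt') as f:
    rows = [line.split() for line in f if line.strip()]

counts = Counter(letter for _, letter in rows)
for line_num, letter in rows:
    if counts[letter] == 1:
        print(line_num, letter)  # prints: 2 b / 5 d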
I tried to keep it as similar to your style as possible:
file1 = open('../test.txt').read().split('\n')
word = set()
test = []
duplicate = []
sin_duple = []
num_lines = 0
num_duplicates = 0
for line in file1:
    if line:
        sline = line.split(' ')
        test.append(" ".join([sline[0], sline[1]]))
        if sline[1] not in word:
            word.add(sline[1])
            num_lines = num_lines + 1
        else:
            sin_duple.append(sline[1])
            duplicate.append(" ".join([sline[0], sline[1]]))
            num_lines = num_lines + 1
            num_duplicates = num_duplicates + 1
for i in range(0, num_lines + 1):
    for item in test:
        for j in range(0, num_duplicates):
            #print((str(i) + " " + str(sin_duple[j])))
            if item == (str(i) + " " + str(sin_duple[j])):
                test.remove(item)
file2 = open("../test2.txt", 'w')
for item in test:
    file2.write("%s\n" % item)
file2.close()
How about some Pandas:
import pandas as pd

a = pd.read_csv("test.txt", sep="\t", header=None, names=["num", "letter"])
# keep=False drops every row whose letter occurs more than once
b = a.drop_duplicates(subset="letter", keep=False)
b.to_csv("test2.txt", sep="\t", header=False, index=False)
I have a text file which does not conform to standards, but I know the (start, end) positions of each column value.
Sample text file:
#     #   #   #
Techy Inn Val NJ
Found the positions of # using this code:
f = open('sample.txt', 'r')
i = 0
positions = []
for line in f:
    if line.find('#') != -1:
        print line
        for each in line:
            i += 1  # positions end up 1-based because i is incremented before the check
            if each == '#':
                positions.append(i)
1 7 11 15 => Positions
So far, so good! Now, how do I fetch the values from each row based on the positions I fetched? I am trying to construct an efficient loop but any pointers are greatly appreciated guys! Thanks (:
Here's a way to read fixed width fields using regexp
>>> import re
>>> s="Techy Inn Val NJ"
>>> var1,var2,var3,var4 = re.match("(.{5}) (.{3}) (.{3}) (.{2})",s).groups()
>>> var1
'Techy'
>>> var2
'Inn'
>>> var3
'Val'
>>> var4
'NJ'
>>>
Off the top of my head:
f = open(.......)
header = f.next()  # get first line
posns = [i for i, c in enumerate(header + "#") if c == '#']
for line in f:
    fields = [line[posns[k]:posns[k+1]] for k in xrange(len(posns) - 1)]
Update with tested, fixed code:
import sys

f = open(sys.argv[1])
header = f.next()  # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#'] + [-1]
print posns
for line in f:
    posns[-1] = len(line)
    fields = [line[posns[k]:posns[k+1]].rstrip() for k in xrange(len(posns) - 1)]
    print fields
Input file:
#      #  #
Foo    BarBaz
123456789abcd
Debug output:
'#      #  #\n'
[0, 7, 10, -1]
['Foo', 'Bar', 'Baz']
['1234567', '89a', 'bcd']
Robustification notes:
This solution caters for any old rubbish (or nothing) after the last # in the header line; it doesn't need the header line to be padded out with spaces or anything else.
The OP needs to consider whether it's an error if the first character of the header is not #.
Each field has trailing whitespace stripped; this automatically removes a trailing newline from the rightmost field (and doesn't run amok if the last line is not terminated by a newline).
Final(?) update: Leapfrogging @gnibbler's suggestion to use slice(): set up the slices once before looping.
import sys

f = open(sys.argv[1])
header = f.next()  # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#']
print posns
slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
print slices
for line in f:
    fields = [line[sl].rstrip() for sl in slices]
    print fields
Adapted from John Machin's answer
>>> header = "#     #   #   #"
>>> row = "Techy Inn Val NJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techy ', 'Inn ', 'Val ', 'NJ']
You can also write the last line like this
>>> [row[i:j] for i,j in zip(posns, posns[1:]+[None])]
For the other example you give in the comments, you just need to have the correct header
>>> header = "#       #     #     #"
>>> row = "Techiyi Iniin Viial NiiJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techiyi ', 'Iniin ', 'Viial ', 'NiiJ']
>>>
OK, to be a little different and to give the generalized solution asked for in the comments, I use the header line with a generator function instead of slices. Additionally, I allow the first column to be a comment (by not putting a field name in the first column) and multi-character field names instead of only '#'.
The downside is that one-character fields cannot have header names; they can only be marked with '#' in the header line (a '#' is always considered the beginning of a field, as in the previous solutions, even right after letters in the header).
sample="""
HOTEL CAT ST DEP ##
Test line Techy Inn Val NJ FT FT
"""
data=sample.splitlines()[1:]
def fields(header,line):
previndex=0
prevchar=''
for index,char in enumerate(header):
if char == '#' or (prevchar != char and prevchar == ' '):
if previndex or header[0] != ' ':
yield line[previndex:index]
previndex=index
prevchar = char
yield line[previndex:]
header,dataline = data
print list(fields(header,dataline))
Output
['Techy Inn ', 'Val ', 'NJ ', 'FT ', 'F', 'T']
One practical use of this is parsing fixed-field-length data without knowing the lengths: use as the header a copy of a data line that has all fields present and no comment, with spaces inside field values replaced by something else (such as '_') and single-character field values replaced by '#'.
Header from sample line:
'          Techy_Inn Val NJ FT ##'
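For instance, reusing the fields() generator above with that header (a quick check):

header   = '          Techy_Inn Val NJ FT ##'
dataline = 'Test line Techy Inn Val NJ FT FT'
print(list(fields(header, dataline)))
# ['Techy Inn ', 'Val ', 'NJ ', 'FT ', 'F', 'T']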
def parse(your_file):
    first_line = your_file.next().rstrip()
    slices = []
    start = None
    for e, c in enumerate(first_line):
        if c != '#':
            continue
        if start is None:
            start = e
            continue
        slices.append(slice(start, e))
        start = e
    if start is not None:
        slices.append(slice(start, None))
    for line in your_file:
        parsed = [line[s] for s in slices]
        yield parsed
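A usage sketch (Python 2, to match the .next() call above; sample.txt is the file from the question):

f = open('sample.txt')
for row in parse(f):
    print row
f.close()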
import re

f = open('sample.txt', 'r')
pos = [m.span() for m in re.finditer(r'#\s*', f.next())]
pos[-1] = (pos[-1][0], None)
for line in f:
    print [line[i:j].strip() for i, j in pos]
f.close()
How about this?
with open('somefile', 'r') as source:
    line = source.next()
    # each field is one '#' marker plus the gap that follows it
    sizes = [len(part) + 1 for part in line.split("#")[1:]]
    positions = [(sum(sizes[:x]), sum(sizes[:x+1])) for x in xrange(len(sizes))]
    for line in source:
        fields = [line[start:end] for start, end in positions]
Is this what you're looking for?