Exclude empty lines and comment lines - python

import os
def countlines(start, lines=0, header=True, begin_start=None):
    if header:
        print('{:>10} |{:>10} | {:<20}'.format('ADDED', 'TOTAL', 'FILE'))
        print('{:->11}|{:->11}|{:->20}'.format('', '', ''))
    for thing in os.listdir(start):
        thing = os.path.join(start, thing)
        if os.path.isfile(thing):
            if thing.endswith('.py'):
                with open(thing, 'r') as f:
                    newlines = f.readlines()
                    newlines = list(filter(lambda l: l.replace(' ', '') not in ['\n', '\r\n'], newlines))
                    newlines = list(filter(lambda l: not l.startswith('#'), newlines))
                    newlines = len(newlines)
                    lines += newlines
                    if begin_start is not None:
                        reldir_of_thing = '.' + thing.replace(begin_start, '')
                    else:
                        reldir_of_thing = '.' + thing.replace(start, '')
                    print('{:>10} |{:>10} | {:<20}'.format(
                        newlines, lines, reldir_of_thing))
    for thing in os.listdir(start):
        thing = os.path.join(start, thing)
        if os.path.isdir(thing):
            lines = countlines(thing, lines, header=False, begin_start=start)
    return lines
countlines(r'/Documents/Python/')
If we take the standard Python file .main.py, it contains 4 lines of code, but the script counts 5. How can I fix this?
How do I set up the filter properly so that it does not count empty lines and comment lines?

1. You can modify your first filter condition: strip the line, and then check that it isn't empty.
lambda l: l.replace(' ', '') not in ['\n', '\r\n']
becomes
lambda l: l.strip()
2. filter accepts any iterable, so there is no need to convert the result to a list after every step. Doing so is wasteful because it forces an extra pass over the data: once to build the list, and again to filter it a second time. You can drop the intermediate calls to list() and build a list only once, after all the filtering is done. You can also apply filter to the file handle itself, since the file handle f is an iterable that yields one line of the file per iteration. This way, you iterate over the file only once.
newlines = filter(lambda l: l.strip(), f)
newlines = filter(lambda l: not l.strip().startswith('#'), newlines)
num_lines = len(list(newlines))
Note that I renamed the last variable, because a variable name should describe what it holds.
3. You can combine both filter conditions into a single lambda:
lambda l: l.strip() and not l.strip().startswith('#')
or, if you have Python 3.8+,
lambda l: (l1 := l.strip()) and not l1.startswith('#')
This makes point 2 above, about avoiding the intermediate lists, moot:
num_lines = len(list(filter(lambda l: (l1 := l.strip()) and not l1.startswith('#'), f)))
With the following input, this gives the correct line count:
file.py:
print("Hello World")
# This is a comment
# The next line is blank

print("Bye")
>>> with open('file.py') as f:
...     num_lines = len(list(filter(lambda l: (l1 := l.strip()) and not l1.startswith('#'), f)))
...     print(num_lines)
Out: 2
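Putting the points above together, a per-file counter might look like the sketch below. The name count_code_lines is made up for illustration, and it assumes that a comment is any line whose first non-blank character is #. It does not try to handle docstrings or trailing comments.

```python
def count_code_lines(path):
    """Count lines that are neither blank nor comment-only (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        # Strip each line once, then keep it only if something is left
        # and what is left does not start with '#'.
        stripped = (line.strip() for line in f)
        return sum(1 for s in stripped if s and not s.startswith("#"))
```

Inside countlines, the two filter lines and the len() call could then be replaced by a single call such as newlines = count_code_lines(thing).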

Related

Why is my code writing something twice while reading a file

I'm working on code that sends emails to the people listed in a text file.
This is the text file:
X,y#gmail.com
Z,v#gmail.com
This is my code:
with open("mail_list.txt","r",encoding ="utf-8") as file:
    a = file.read()
b = a.split("\n")
d = []
for i in b:
    c = i.split(",")
    d.append(c)
for x in d:
    for y in x:
        print(x[0])
        print(x[1])
The output should be:
X
y#gmail.com
Z
v#gmail.com
But instead it is:
X
y#gmail.com
X
y#gmail.com
Z
v#gmail.com
Z
v#gmail.com
Why is that?
How can I fix it?
You're iterating over the columns in every row, but not using the column value:
for x in d:
    for y in x:
        print(y)
Have a look at this solution, which I think is cleaner than the current one. Rather than splitting the whole file on line breaks yourself, let readlines() return the lines already split on \n (line break), and then use each line as you need. Strip each line before splitting so the trailing newline does not end up in the second field.
lines = []
with open('mail_list.txt') as f:
    lines = f.readlines()
for line in lines:
    info = line.strip().split(',')
    print(info[0])
    print(info[1])
You only need to iterate over the list d.
with open("mail_list.txt", "r", encoding ="utf-8") as file:
    a = file.read()
b = a.split("\n")
d = []
for i in b:
    c = i.split(",")
    d.append(c)
for x in d:
    print(x[0])
    print(x[1])
To make it simpler, you can read your file line by line and process it at the same time. The strip() method removes leading and trailing whitespace (spaces at the beginning, spaces or the end-of-line character at the end).
with open("mail_list.txt", "r", encoding ="utf-8") as file:
    for line in file:
        line_s = line.split(",")
        print(line_s[0])
        print(line_s[1].strip())
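If the file may contain blank lines or stray spaces, one option is to let the standard csv module do the splitting. This is only a sketch and not part of the answers above: read_contacts is a hypothetical name, and the two-column layout is assumed from the sample file.

```python
import csv

def read_contacts(path):
    """Return a list of (name, address) pairs from a two-column file."""
    contacts = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            contacts.append((row[0].strip(), row[1].strip()))
    return contacts
```

Each row is visited exactly once, so every name and address is produced once, which avoids the doubled output from the nested loop.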

Fastest way to convert files into lists?

I have a .txt file which contains some words:
e.g.
bye

bicycle

bi
cyc
le
and I want to return a list which contains all the words in the file. I have tried some code which actually works, but I think it takes a lot of time to execute for bigger files. Is there a way to make this code more efficient?
lst1 = []
with open('file.txt', 'r') as f:
    for line in f:
        if line == '\n':  # blank line
            lst1.append(line)
        else:
            lst1.append(line.replace('\n', ''))  # the most efficient way I found to concatenate the letters of a word
str1 = ''.join(lst1)
lst_fin = str1.split()
expected output:
lst_fin = ['bye', 'bicycle', 'bicycle']
I don't know if this is more efficient, but at least it's an alternative... :)
with open('file.txt') as f:
    words = f.read().replace('\n\n', '|').replace('\n', '').split('|')
print(words)
...or, if you don't want to insert a character like '|' (which could already be there) into the data, you could also do
with open('file.txt') as f:
    words = f.read().split('\n\n')
words = [w.replace('\n', '') for w in words]
print(words)
The result is the same in both cases:
# ['bye', 'bicycle', 'bicycle']
EDIT:
I think I have another approach. However, if I understand correctly, it requires that the file does not start with a blank line.
with open('file.txt') as f:
    res = []
    current_elmnt = next(f).strip()
    for line in f:
        if line.strip():
            current_elmnt += line.strip()
        else:
            res.append(current_elmnt)
            current_elmnt = ''
    if current_elmnt:  # flush the last word, which has no blank line after it
        res.append(current_elmnt)
print(res)
Perhaps you want to give it a try...
You can use the iter function with a sentinel of '' instead:
with open('file.txt') as f:
    lst_fin = list(iter(lambda: ''.join(iter(map(str.strip, f).__next__, '')), ''))
Demo: https://repl.it/#blhsing/TalkativeCostlyUpgrades
You could use this (I don't know about its efficiency):
lst = []
s = ''
with open('tp.txt', 'r') as file:
    l = file.readlines()
    for i in l:
        if i == '\n':
            lst.append(s)
            s = ''
        elif i == l[-1]:
            s += i.rstrip()
            lst.append(s)
        else:
            s += i.rstrip()
print(lst)
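Another way to express the same grouping, offered here as a sketch rather than a benchmarked improvement, is itertools.groupby: it splits the stream of lines into runs of blank and non-blank lines, and each non-blank run is joined into one word. The name join_blocks is hypothetical.

```python
from itertools import groupby

def join_blocks(lines):
    """Join consecutive non-blank lines into one word per block."""
    stripped = (line.strip() for line in lines)
    # groupby yields alternating runs; bool(s) is True for non-blank lines,
    # so only the runs whose key is True are kept and joined.
    return ["".join(group)
            for non_blank, group in groupby(stripped, key=bool)
            if non_blank]
```

With a file, this would be called as join_blocks(f) inside a with open('file.txt') as f: block. It reads the file line by line, and it tolerates leading blank lines and several blank lines in a row.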

"IndexError: list index out of range" when reading file

Just started learning Python and I'm struggling with this a little.
I'm opening a txt file that will be variable in length and I need to iterate over a user definable amount of lines at a time. When I get to the end of the file I receive the error in the subject field. I've also tried the readlines() function and a couple of variations on the "if" statement that causes the problem. I just can't seem to get the code to find EOF.
Hmm, as I write this, I'm thinking ... do I need to add "EOF" to the list and just look for that? Is that the best solution, to find a custom EOF?
My code snippet goes something like:
### variables defined outside of scapy PacketHandler ##
x = 0
B = 0
##########
with open('dict.txt') as f:
    lines = list(f)
global x
global B
B = B + int(sys.argv[3])
while x <= B:
    while y <= int(sys.argv[2]):
        if lines[x] != "":
            #...do stuff...
            # Scapy send packet Dot11Elt(ID="SSID",info="%s" % (lines[x].strip())
            # ....more code...
            x = x + 1
Let’s say you need to read X lines at a time, put it in a list and process it:
with open('dict.txt') as f:
    enoughLines = True
    while enoughLines:
        lines = []
        for i in range(X):
            l = f.readline()
            if l != '':
                lines.append(l)
            else:
                enoughLines = False
                break
        if enoughLines:
            pass  # do what has to be done with the list "lines"
        else:
            break
# do what needs to be done with the list "lines", which has fewer than X lines in it
Try a for ... in loop. You have created your list; now iterate through it.
with open('dict.txt') as f:
    lines = list(f)
for item in lines:  # each item here is an item in the list you created
    print(item)
This way you go through each line of your text file and don't have to worry about where it ends.
Edit:
You can do this as well:
with open('dict.txt') as f:
    for row in f:
        print(row)
The following function returns a generator that yields the next n lines of a file each time:
def iter_n(obj, n):
    iterator = iter(obj)
    while True:
        result = []
        try:
            while len(result) < n:
                result.append(next(iterator))
        except StopIteration:
            if len(result) == 0:
                return  # nothing left; a bare raise here becomes a RuntimeError on Python 3.7+
        yield result
Here is how you can use it:
>>> with open('test.txt') as f:
...     for three_lines in iter_n(f, 3):
...         print(three_lines)
...
['first line\n', 'second line\n', 'third line\n']
['fourth line\n', 'fifth line\n', 'sixth line\n']
['seventh line\n']
Contents of test.txt:
first line
second line
third line
fourth line
fifth line
sixth line
seventh line
Note that, because the file does not have a multiple of 3 lines, the last value returned is not 3 lines, but just the rest of the file.
Because this solution uses a generator, it doesn't require that the full file be read into memory (into a list), but iterates over it as needed.
In fact, the above function can iterate over any iterable object, like lists, strings, etc.:
>>> for three_numbers in iter_n([1, 2, 3, 4, 5, 6, 7], 3):
...     print(three_numbers)
...
[1, 2, 3]
[4, 5, 6]
[7]
>>> for three_chars in iter_n("1234567", 3):
...     print(three_chars)
...
['1', '2', '3']
['4', '5', '6']
['7']
If you want to get n lines in a list, use itertools.islice, yielding each list:
from itertools import islice
def yield_lists(f,n):
    with open(f) as f:
        for sli in iter(lambda : list(islice(f,n)),[]):
            yield sli
If you want to use explicit loops, you don't need a while loop at all. Iterate over the file with for, and in an inner loop over range(n-1) call next on the file object with a default value of an empty string. If you get an empty string, break out of the inner loop; otherwise append the line. Then yield each list:
def yield_lists(f,n):
    with open(f) as f:
        for line in f:
            temp = [line]
            for i in range(n-1):
                line = next(f,"")
                if not line:
                    break
                temp.append(line)
            yield temp
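The same batching idea can be written so that it accepts any iterable rather than a file name, which makes it easy to try without a real dict.txt. This is a sketch with a hypothetical name (batches); the in-memory file in the demo is a stand-in for the asker's file.

```python
import io
from itertools import islice

def batches(iterable, n):
    """Yield lists of up to n items; the last list may be shorter."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return  # the source is exhausted
        yield batch

# Demo: io.StringIO stands in for an open text file.
fake_file = io.StringIO("a\nb\nc\nd\ne\n")
for batch in batches(fake_file, 2):
    print([line.strip() for line in batch])
```

Because the code never indexes into a list with a counter, there is no position that can run past the end of the file, so the IndexError from the question cannot occur.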

How do I count the occurrences of characters of a partition in python?

I have a large file containing sequences; I want to analyze only the last set of characters, which happen to be of variable length. In each line I would like to take the first character and last character of each set in a text file and count the total instances of those characters.
Here is an example of the data in the file:
-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV
-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV
I want to take the first character after the "H3" and the last character (P and V in the first line of the example, N and V in the second).
The output for these two lines should be:
first Counter({'N': 1, 'P': 1})
last Counter({'V': 2})
This is what I have done so far:
f = open("C:/CDRH3.txt", "r")
from collections import Counter
grab = 1
for line in f:
    line=line.rstrip()
    left,sep,right=line.partition(" H3 ")
    if sep:
        AminoAcidsFirst = right[:grab]
        AminoAcidsLast = right[-grab:]
print ("first ",Counter(line[:] for line in AminoAcidsFirst))
print ("last ",Counter(line[:] for line in AminoAcidsLast))
f.close()
This prints the counts of only the last line of data which looks like:
first Counter({'N': 1})
last Counter({'V': 1})
How do I count all these characters in all lines in the file?
Notes:
Printing (AminoAcidsFirst) or (AminoAcidsLast) shows the desired characters for all the lines, one per line, but I can't count them or write them to a file. Writing to a new file only writes the characters from the last line of the original file.
Thanks!
No need for Counter: simply grab the last token after splitting and count the first and last characters:
first_counter = {}
last_counter = {}
for line in f:
    line=line.split()[-1] # grab the last token
    first_counter[line[0]] = first_counter.get(line[0], 0) + 1
    last_counter[line[-1]] = last_counter.get(line[-1], 0) + 1
print("first ", first_counter)
print("last ", last_counter)
OUTPUT
first {'P': 1, 'N': 1}
last {'V': 2}
Create 2 empty lists and append to them in each loop iteration, like so:
f = open("C:/CDRH3.txt", "r")
from collections import Counter
grab = 1
AminoAcidsFirst = []
AminoAcidsLast = []
for line in f:
    line=line.rstrip()
    left,sep,right=line.partition(" H3 ")
    if sep:
        AminoAcidsFirst.append(right[:grab])
        AminoAcidsLast.append(right[-grab:])
print ("first ",Counter(line[:] for line in AminoAcidsFirst))
print ("last ",Counter(line[:] for line in AminoAcidsLast))
f.close()
Here:
Creation of the empty lists:
AminoAcidsFirst = []
AminoAcidsLast = []
Appending in each loop iteration:
AminoAcidsFirst.append(right[:grab])
AminoAcidsLast.append(right[-grab:])
Two important things I would like to point out:
1. Never reveal the path of a file on your computer; this applies especially if you are from the scientific community.
2. Your code can be more pythonic using the with...as approach.
And now the program:
from collections import Counter
filePath = "C:/CDRH3.txt"
AminoAcidsFirst, AminoAcidsLast = [], [] # important! these should be lists
with open(filePath, 'rt') as f: # rt not r. Explicit is better than implicit
    for line in f:
        line = line.rstrip()
        left, sep, right = line.partition(" H3 ")
        if sep:
            AminoAcidsFirst.append( right[0] ) # really no need of extra grab=1 variable
            AminoAcidsLast.append( right[-1] ) # better than right[-grab:]
print ("first ",Counter(AminoAcidsFirst))
print ("last ",Counter(AminoAcidsLast))
Don't do line.split()[-1], because the sep verification is important.
OUTPUT
first  Counter({'P': 1, 'N': 1})
last  Counter({'V': 2})
Note: Data files can get really large, and you might run into memory issues or a hanging computer. So, might I suggest a lazy read? The following is a more robust program:
from collections import Counter
filePath = "C:/CDRH3.txt"
AminoAcidsFirst, AminoAcidsLast = [], [] # important! these should be lists
def chunk_read(fileObj, linesCount = 100):
    # readlines() treats its argument as a size hint (roughly that many
    # characters of whole lines), not as an exact number of lines
    while True:
        lines = fileObj.readlines(linesCount)
        if not lines: # end of file
            break
        yield lines
with open(filePath, 'rt') as f: # rt not r. Explicit is better than implicit
    for aChunk in chunk_read(f):
        for line in aChunk:
            line = line.rstrip()
            left, sep, right = line.partition(" H3 ")
            if sep:
                AminoAcidsFirst.append( right[0] ) # really no need of extra grab=1 variable
                AminoAcidsLast.append( right[-1] ) # better than right[-grab:]
print ("first ",Counter(AminoAcidsFirst))
print ("last ",Counter(AminoAcidsLast))
If you put statements at the bottom of or after your for loop to print AminoAcidsFirst and AminoAcidsLast, you will see that on each iteration you are just assigning a new value. Your intent should be to collect, contain or accumulate these values, before feeding them to collections.Counter.
s = ['-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV', '-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV']
An immediate fix for your code would be to accumulate the characters:
import collections
grab = 1
AminoAcidsFirst = ''
AminoAcidsLast = ''
for line in s:
    line=line.rstrip()
    left,sep,right=line.partition(" H3 ")
    if sep:
        AminoAcidsFirst += right[:grab]
        AminoAcidsLast += right[-grab:]
print ("first ",collections.Counter(AminoAcidsFirst))
print ("last ",collections.Counter(AminoAcidsLast))
Another approach would be to produce the characters on demand. Define a generator function that will yield the things you want to count:
def f(iterable):
    for thing in iterable:
        left, sep, right = thing.partition(' H3 ')
        if sep:
            yield right[0]
            yield right[-1]
Then feed that to collections.Counter:
z = collections.Counter(f(s))
Or using a file as the data source:
with open('myfile.txt') as f1:
    # lines is a generator expression
    # that produces stripped lines
    lines = (line.strip() for line in f1)
    z = collections.Counter(f(lines))
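If the first and last characters need to be tallied separately, as in the expected output of the question, one possible variation is to update two Counter objects in a single pass. This is a sketch under that assumption; count_ends and its marker parameter are hypothetical names.

```python
from collections import Counter

def count_ends(lines, marker=" H3 "):
    """Return (first_counts, last_counts) for the text after marker."""
    first, last = Counter(), Counter()
    for line in lines:
        _, sep, right = line.rstrip().partition(marker)
        if sep and right:  # skip lines without the marker or with nothing after it
            first[right[0]] += 1
            last[right[-1]] += 1
    return first, last
```

It accepts any iterable of strings, so it can be called with the list s above or with an open file object, and nothing but the two counters is kept in memory.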

I'm making a python script that strips whitespace off of the end of lines, but it is only working on the first line

Like I said in the title, my script only seems to work on the first line.
Here is my script:
#!/usr/bin/python
import sys
def main():
    a = sys.argv[1]
    f = open(a,'r')
    lines = f.readlines()
    w = 0
    for line in lines:
        spot = 0
        cp = line
        for char in reversed(cp):
            x = -1
            if char == ' ':
                del line[x]
                w += 0
            if char != '\n' or char != ' ':
                lines[spot] = line
                spot += 1
                break
            x += 1
    f.close()
    f = open(a,'w')
    f.writelines(lines)
    print("White Space deleted: "+str(w))
if __name__ == "__main__":
    main()
I'm not too experienced when it comes to loops.
The following script does the same thing as your program, more compactly:
import fileinput
deleted = 0
for line in fileinput.input(inplace=True):
    stripped = line.rstrip()
    deleted += len(line) - len(stripped) - 1 # don't count the newline
    print(stripped)
print("Whitespace deleted: {}".format(deleted))
Here str.rstrip() removes all whitespace from the end of a line (newlines, spaces and tabs).
The fileinput module takes care of handling sys.argv for you, opening files one by one if you name more than one file.
With inplace=True, the output of print() inside the loop is written back to the file, and print() adds the newline back on to the end of each stripped line.
Just use rstrip:
f = open(a,'r')
lines = f.readlines()
f.close()
f = open(a,'w')
for line in lines:
    f.write(line.rstrip()+'\n')
f.close()
rstrip() is probably what you want to use to achieve this.
>>> 'Here is my string '.rstrip()
'Here is my string'
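To also report how much whitespace was removed, the line ending can be taken off first so that it is not counted. The helper below is a sketch that works on a list of strings and touches no file; strip_trailing is a hypothetical name.

```python
def strip_trailing(lines):
    """Return (cleaned_lines, removed_count) for a list of text lines."""
    cleaned, removed = [], 0
    for line in lines:
        body = line.rstrip("\r\n")   # drop the line ending only
        stripped = body.rstrip()     # then drop trailing spaces and tabs
        removed += len(body) - len(stripped)
        cleaned.append(stripped + "\n")
    return cleaned, removed
```

The cleaned list can then be written back with f.writelines(cleaned).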
A more compact way to iterate backwards over strings is
>>> for c in 'Thing'[::-1]:
...     print(c)
...
g
n
i
h
T
[::-1] is slice notation. Slice notation can be represented as [start:stop:step]. In my example, a step of -1 means it will step from the back, one index at a time. [x:y:z] will start at index x, stop at y-1, and go forward by z places each step.
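A few more slices of the same string, as a quick illustration of start, stop and step:

```python
word = "Thing"

print(word[::-1])   # gnihT  (whole string, stepping backwards)
print(word[1:4])    # hin    (indices 1, 2 and 3; the stop index is excluded)
print(word[::2])    # Tig    (every second character from the start)
print(word[-1])     # g      (a negative index counts from the end)
```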
