list comprehension not obeying if statement - python

This stub of my program
with open(fi) as f:
    lines = f.readlines()
lines = [line.split('!')[0] for line in lines if not line.startswith('!') and line.strip()]  # Removing line and inline comments
for line in lines:
    print line, len(line), bool(not line.startswith('!') and line.strip())
gives me output such as
conduction-guess double_array optional 68 True
valence character optional 68 True
valence-guess double_array optional 68 True
68 False
saturated-bond-length-factor double required 68 True
Shouldn't the line whose bool value is False not be included? I'm lost.
I thought it might be short-circuiting, but flipping the expression doesn't help either.
To clarify, I want my list to be a list of the lines that end with 'True' in the above section of code.

The first not line.startswith('!') and line.strip() is operating on a different set of line values than the second one. The list comprehension replaces lines with a list of the first !-delimited field of each string in the original lines, and it is this new lines that the code then prints out. For example, if lines originally contains the string " !foo", that string passes the conditional in the comprehension, but only the part before the ! - a single space - is saved into the new lines list, and a single space will not make line.strip() truthy.
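To make the mismatch concrete, here is a small sketch (with a made-up raw list standing in for the file's lines) showing the filter passing a line whose stored remainder is just a space, and one way to make the test and the stored value agree:

```python
# Inline stand-in for the file's lines (hypothetical sample, not the question's data)
raw = [" !foo", "a ! comment", "!pure comment", "data line"]

# The filter tests the ORIGINAL line, but stores only the part before '!':
wrong = [line.split('!')[0] for line in raw
         if not line.startswith('!') and line.strip()]
# " !foo" passes the filter, yet only " " is stored

# Split first, then filter the values that will actually be kept:
parts = [line.split('!')[0] for line in raw]
right = [p for p in parts if not p.startswith('!') and p.strip()]

print(wrong)  # [' ', 'a ', 'data line']
print(right)  # ['a ', 'data line']
```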

You could see something like that if you had a line in your file which had an exclamation point somewhere other than in the first column. e.g.:
line = " ! "

line.strip() returns a new string, it's not a boolean.
Updated code:
with open(fi) as f:
    lines = [line.strip() for line in f]
# remove exclamation points
lines = [line.split('!')[0] for line in lines
         if not line.startswith('!')]
# 3rd col will always be True
for line in lines:
    print line, len(line), bool(not line.startswith('!'))

Related

Python: To combine lines of a text file and skip certain records

I have an input file like the one below (please note that there may or may not be blank lines in the file):
11111*Author Name
22222*Date
11111 01 Var-1
11111 02 Var-2
11111 02 Var-3
Rules to be used:
If asterisk(*) is present at position # 6 of a record then skip the record.
First 6 bytes are sequence number which can be spaces as well. However, the first six bytes whether space or number can be ignored.
Only combine the records where asterisk is not present at position # 6.
Only consider data starting from position 7 in the input file up to positon 72.
Add comma as shown below
Expected Output
01,Var-1,02,Var-2,02,Var-3
Below is the code I was trying to use to print the records. However, I was not able to get a comma (,) after each field, and some fields were prefixed with spaces. Can someone please help?
with open("D:/Desktop/Files/Myfile.txt","r") as file_in:
    for lines in file_in:
        if "*" not in lines:
            lines_new = " ".join(lines.split())
            lines_fin = lines_new.replace(' ',',')
            print(lines_fin,end=' ')
Assuming you just want to print them one after another (they will still be on separate lines):
with open("D:/Desktop/Files/Myfile.txt", "r") as file_in:
    for line in file_in:
        if line == "\n":  # skip empty lines
            continue
        if line[5] == "*":  # skip if asterisk at 6th position
            continue
        line = line.strip()  # remove leading and trailing whitespace
        line = line.replace(' ', ',')  # replace remaining spaces with commas
        print(line, ',')
If you just want them all combined then a better way to do it would be:
with open("D:/Desktop/Files/Myfile.txt", "r") as f:
    all_lines = f.readlines()
all_lines = [line.strip().replace(" ", ",") for line in all_lines if line != "\n" and line[5] != "*"]
all_lines = ",".join(all_lines)
I haven't tested this, so it may have typos!
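As a quick check, here is that approach run against the question's sample data, with io.StringIO standing in for the file and a [6:72] slice added to drop the sequence numbers per the question's rules (without the slice, the sequence numbers would remain in the output):

```python
import io

# question's sample data, standing in for Myfile.txt
sample = """11111*Author Name
22222*Date
11111 01 Var-1
11111 02 Var-2
11111 02 Var-3
"""

f = io.StringIO(sample)
all_lines = [line[6:72].strip().replace(" ", ",")
             for line in f
             if line != "\n" and line[5] != "*"]
result = ",".join(all_lines)
print(result)  # 01,Var-1,02,Var-2,02,Var-3
```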
I think a regex solution would be elegant. You would need to handle the 72-character limit on the length of the data, but that should not be a problem.
import re

pattern = r'[\s\d]{6}(.+)'
out = []
with open('combinestrings.txt', 'r') as infile:
    for line in infile:
        result = re.findall(pattern, line)
        if result:
            out.append(','.join(result[0].split(' ')))
print(','.join(out))
output:
01,Var-1,02,Var-2,02,Var-3
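One way to fold the 72-column limit mentioned above into the pattern itself is to anchor it and cap the captured data at 66 characters (positions 7 through 72). This anchored variant is my assumption, not the answer's exact regex:

```python
import re

# anchored: 6 leading digits/spaces, then up to 66 data characters (columns 7-72)
pattern = re.compile(r'^[\s\d]{6}(.{1,66})')

out = []
for line in ["11111*Author Name", "11111 01 Var-1", "11111 02 Var-2"]:
    m = pattern.match(line)
    if m:  # lines with '*' in column 6 fail to match and are skipped
        out.append(','.join(m.group(1).split()))
result = ','.join(out)
print(result)  # 01,Var-1,02,Var-2
```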
I would use Python's pathlib as it has some useful capabilities for handling paths and reading text files.
To join items together it is useful to put the items you want to join in a Python list and then use the join method on the list.
I have also changed the logic of how you are splitting the data. When a line is kept, the first 6 characters are always removed, so these can be sliced off. Doing that first makes the split on whitespace cleaner, as you get exactly the two items you are seeking.
There seemed to be a requirement to truncate the data if it was longer than 72 characters so I limited the line of data to 72 characters.
This is what my test code looked like:
from pathlib import Path

data_file = Path("D:/Desktop/Files/Myfile.txt")
field_size = 72

def combine_file_contents(filename):
    combined_data = []
    for line in filename.read_text().splitlines():
        if line and line[5] != "*":
            combined_data.extend(line[6:field_size].split())
    return ','.join(combined_data)

if __name__ == '__main__':
    expected_output = "01,Var-1,02,Var-2,02,Var-3"
    output_data = combine_file_contents(data_file)
    print("New Output: ", output_data)
    print("Expected Output:", expected_output)
    assert output_data == expected_output
This gave the following output when I ran with the test data from the question:
New Output: 01,Var-1,02,Var-2,02,Var-3
Expected Output: 01,Var-1,02,Var-2,02,Var-3

Return value in a quite nested for-loop

I want nested loops to test whether all elements match the condition and then to return True. Example:
There's a given text file: file.txt, which includes lines of this pattern:
aaa:bb3:3
fff:cc3:4
Letters, semicolon, alphanumeric, semicolon, integer, newline.
Generally, I want to test whether all lines match this pattern. However, in this function I would like to check only whether the first column contains nothing but letters.
def opener(file):
    #Opens a file and creates a list of lines
    fi=open(file).read().splitlines()
    import string
    res = True
    for i in fi:
        #Checks whether any characters in the first column is not a letter
        if any(j not in string.ascii_letters for j in i.split(':')[0]):
            res = False
        else:
            continue
    return res
However, the function returns False even when all characters in the first column are letters. I would like to ask you for an explanation, too.
Your file contains a newline after its last line, so your code also checks an empty line after your last data line. That empty line does not fulfill your test, which is why you get False no matter the input:
aaa:bb3:3
fff:cc3:4
empty line that does not start with only letters
You can fix it if you special-case empty lines when they occur at the end. If an empty line appears between filled lines, you return False as well:
with open("t.txt", "w") as f:
    f.write("""aaa:bb3:3
fff:cc3:4
""")
import string

def opener(file):
    letters = string.ascii_letters
    # Opens a file and creates a list of lines
    with open(file) as fi:
        res = True
        empty_line_found = False
        for i in fi:
            if i.strip():  # only check line if not empty
                if empty_line_found:  # we had an empty line and now a filled line: error
                    return False
                # Checks whether any characters in the first column is not a letter
                if any(j not in letters for j in i.strip().split(':')[0]):
                    return False  # immediately exit - no need to test the rest of the file
            else:
                empty_line_found = True
    return res  # or True

print(opener("t.txt"))
Output:
True
If you use
# example with a file that contains an empty line between data lines - NOT ok
with open("t.txt", "w") as f:
    f.write("""aaa:bb3:3

fff:cc3:4
""")
or
# example for a file that contains an empty line after the data - which is ok
with open("t.txt", "w") as f:
    f.write("""aaa:bb3:3
ff2f:cc3:4
""")
you get: False
Colonoscopy
ASCII, and UNICODE, both define character 0x3A as COLON. This character looks like two dots, one over the other: :
ASCII, and UNICODE, both define character 0x3B as SEMICOLON. This character looks like a dot over a comma: ;
You were consistent in your use of the colon in your example: fff:cc3:4 and you were consistent in your use of the word semicolon in your descriptive text: Letters, semicolon, alphanumeric, semicolon, integer, newline.
I'm going to assume you meant colon (':') since that is the character you typed. If not, you should change it to a semicolon (';') everywhere necessary.
Your Code
Here is your code, for reference:
def opener(file):
    #Opens a file and creates a list of lines
    fi=open(file).read().splitlines()
    import string
    res = True
    for i in fi:
        #Checks whether any characters in the first column is not a letter
        if any(j not in string.ascii_letters for j in i.split(':')[0]):
            res = False
        else:
            continue
    return res
Your Problem
The problem you asked about was the function always returning False. The example you gave included a blank line between the first example and the second. I would caution you to watch out for spaces or tabs in those blank lines. You can fix this by explicitly catching blank lines and skipping over them:
for i in fi:
    if i.isspace():
        # skip blank lines
        continue
Some Other Problems
Now here are some other things you might not have noticed:
You provided a nice comment in your function. That should have been a docstring:
def opener(file):
    """ Opens a file and creates a list of lines.
    """
You import string in the middle of your function. Don't do that. Move the import up to the top of the module:
import string  # at top of file

def opener(file):  # not at top of file
You opened the file with open() and never closed it. This is exactly why the with keyword was added to python:
with open(file) as infile:
    fi = infile.read().splitlines()
You opened the file, read its entire contents into memory, then split it into lines
discarding the newlines at the end. All so that you could split it by colons and ignore
everything but the first field.
It would have been simpler to just call readlines() on the file:
with open(file) as infile:
    fi = infile.readlines()

res = True
for i in fi:
It would have been even easier and even simpler to just iterate on the file directly:
with open(file) as infile:
    res = True
    for i in infile:
It seems like you are building up towards checking the entire format you gave at the beginning. I suspect a regular expression would be (1) easier to write and maintain; (2) easier to understand later; and (3) faster to execute. Both now, for this simple case, and later when you have more rules in place:
import logging
import re

# pattern for "letters, colon, alphanumeric, colon, integer" - adjust as your rules grow
valid_line = re.compile(r'[A-Za-z]+:[A-Za-z0-9]+:\d+\s*$')

def check_file(infile):
    bad_lines = 0
    for line in infile:
        if line.isspace():
            continue
        if valid_line.match(line):
            continue
        logging.warning(f"Bad line: {line}")
        bad_lines += 1
    return bad_lines == 0
Your names are bad. Your function includes the names file, fi, i, j, and res. The only one that barely makes sense is file.
Considering that you are asking people to read your code and help you find a problem, please, please use better names. If you just replaced those names with file (same), infile, line, ch, and result the code gets more readable. If you restructured the code using standard Python best practices, like with, it gets even more readable. (And has fewer bugs!)
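As a sketch of that regex idea on the sample lines from the question (the pattern below is my reading of "letters, colon, alphanumeric, colon, integer", not code from the question):

```python
import re

# one valid line: letters, ':', alphanumerics, ':', integer
valid_line = re.compile(r'[A-Za-z]+:[A-Za-z0-9]+:\d+$')

checks = {line: bool(valid_line.match(line))
          for line in ["aaa:bb3:3", "fff:cc3:4", "ff2f:cc3:4"]}
print(checks)  # the first column of "ff2f:cc3:4" contains a digit, so it fails
```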

extract the dimensions from the head lines of text file

Please see the following attached image showing the format of the text file. I need to extract the dimensions of the data matrix indicated by the first line in the file, here 49 * 70 * 1 for the case shown in the image. Note that the length of the name "gd_fac" can vary. How can I extract these numbers as integers? I am using Python 3.6.
Specification is not very clear. I am assuming that the information you want will always be in the first line, and always be in parenthesis. After that:
with open(filename) as infile:
    line = infile.readline()
string = line[line.find('(') + 1:line.find(')')]
lst = string.split('x')
This will create the list lst = ['49', '70', '1']. Note that the values are still strings; apply int() to each to get integers.
What is happening here:
First I open the file (you will need to replace filename with the name of your file, as a string). The with ... as ... structure ensures that the file is closed after use. Then I read the first line. After that, I select only the part of that line that falls after the open paren ( and before the close paren ). Finally, I break the string into parts, with the character x as the separator. This creates a list of the values in the first line of the file that fall between the parentheses and are separated by x.
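To finish the job of getting integers, the pieces can be passed through int(). A minimal sketch, assuming a first line of the form shown in the image (the literal below is hypothetical):

```python
# hypothetical first line, following the "name(AxBxC)" format from the question
line = "gd_fac(49x70x1)"

inner = line[line.find('(') + 1:line.find(')')]
dims = [int(n) for n in inner.split('x')]
print(dims)  # [49, 70, 1]
```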
Since you have mentioned that the length of 'gd_fac' can vary, the best solution is a regular expression.
import re

with open("a.txt") as fh:
    for line in fh:
        if '(' in line and ')' in line:
            dimension = re.findall(r'.*\((.*)\)', line)[0]
            break
print dimension
Output:
'49x70x1'
What this does is look for "gd_fac"; then, if it's there, it removes all the unneeded stuff and replaces it with just what you want.
with open('test.txt', 'r') as infile:
    for line in infile:
        if "gd_fac" in line:
            line = line.replace("gd_fac", "")
            line = line.replace("x", "*")
            line = line.replace("(", "")
            line = line.replace(")", "")
            print(line)
            break
OUTPUT: "49*70*1"

Python: Extract single line from file

Very new, please be nice and explain slowly and clearly. Thanks :)
I've tried searching how to extract a single line in python, but all the responses seem much more complicated (and confusing) than what I'm looking for. I have a file, it has a lot of lines, I want to pull out just the line that starts with #.
My file.txt:
"##STUFF"
"##STUFF"
#DATA 01 02 03 04 05
More lines here
More lines here
More lines here
My attempt at a script:
file = open("file.txt", "r")
splitdata = []
for line in file:
    if line.startswith['#'] = data
        splitdata = data.split()
print splitdata
#expected output:
#splitdata = [#DATA, 1, 2, 3, 4, 5]
The error I get:
line.startswith['#'] = data
TypeError: 'builtin_function_or_method' object does not support item assignment
That seems to mean it doesn't like my "= data", but I'm not sure how to tell it that I want to take the line that starts with # and save it separately.
Correct the if statement and the indentation,
for line in file:
    if line.startswith('#'):
        print line
Although you're relatively new, you should start learning to use list comprehensions. Here is an example of how you can use one in your situation. The comments below explain the details, in the corresponding order.
splitdata = [line.split() for line in file if line.startswith('#')]
# defines splitdata as a list, because the comprehension is wrapped in []
# makes a for loop to iterate through the file
# checks whether the line "startswith" a '#'
# note: you should call functions/methods using (), not []
# splits the line at spaces if the if statement returns True
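For illustration, running that comprehension over the question's sample contents (inlined here as a string) shows that the result is a list of lists, and that split() keeps every field as a string ('01' rather than 1):

```python
# the question's file contents, inlined for the example
sample = '''"##STUFF"
"##STUFF"
#DATA 01 02 03 04 05
More lines here
'''

splitdata = [line.split() for line in sample.splitlines()
             if line.startswith('#')]
print(splitdata)  # [['#DATA', '01', '02', '03', '04', '05']]
```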
An if condition expects a predicate expression, not an assignment.
if line.startswith('#'):
For reference, help(str.startswith) shows:
startswith(...)
    S.startswith(prefix[, start[, end]]) -> bool
    Return True if S starts with the specified prefix, False otherwise.
    With optional start, test S beginning at that position.
    With optional end, stop comparing S at that position.
    prefix can also be a tuple of strings to try.

re.sub emptying list

def process_dialect_translation_rules():
    # Read in lines from the text file specified in sys.argv[1], stripping away
    # excess whitespace and discarding comments (lines that start with '##').
    f_lines = [line.strip() for line in open(sys.argv[1], 'r').readlines()]
    f_lines = filter(lambda line: not re.match(r'##', line), f_lines)
    # Remove any occurrences of the pattern '\s*<=>\s*'. This leaves us with a
    # list of lists. Each 2nd level list has two elements: the value to be
    # translated from and the value to be translated to. Use the sub function
    # from the re module to get rid of those pesky asterisks.
    f_lines = [re.split(r'\s*<=>\s*', line) for line in f_lines]
    f_lines = [re.sub(r'"', '', elem) for elem in line for line in f_lines]
This function should take the lines from a file and perform some operations on them, such as removing any lines that begin with ##. Another operation I wish to perform is to remove the quotation marks around the words in each line. However, when the final line of this script runs, f_lines becomes an empty list. What happened?
Requested lines of original file:
## English-Geek Reversible Translation File #1
## (Moderate Geek)
## Created by Todd WAreham, October 2009
"TV show" <=> "STAR TREK"
"food" <=> "pizza"
"drink" <=> "Red Bull"
"computer" <=> "TRS 80"
"girlfriend" <=> "significant other"
In Python, multiple for loops in a list comprehension are handled from left to right, not from right to left, so your last expression should read:
[re.sub(r'"', '', elem) for line in f_lines for elem in line]
It doesn't lead to an error as it is, since list comprehensions leak the loop variable, so line is still in scope from the previous expression. If that line then is an empty string you get an empty list as result.
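A small demonstration of the ordering, using str.replace in place of re.sub (my simplification) on data shaped like the result of the re.split step:

```python
# shaped like f_lines after the re.split step
f_lines = [['"TV show"', '"STAR TREK"'], ['"food"', '"pizza"']]

# outer loop first, then inner: this flattens the pairs
flat = [elem.replace('"', '') for line in f_lines for elem in line]
print(flat)  # ['TV show', 'STAR TREK', 'food', 'pizza']

# to keep the pair structure, nest the comprehension instead
pairs = [[elem.replace('"', '') for elem in line] for line in f_lines]
print(pairs)  # [['TV show', 'STAR TREK'], ['food', 'pizza']]
```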
Your basic problem is that you have chosen an over-complicated way of doing things, and come unstuck. Use the simplest tool that will get the job done. You don't need filter, map, lambda, readlines, and all of those list comprehensions (one will do). Using re.match instead of startswith is overkill. So is using re.sub where str.replace would do the job.
import sys

with open(sys.argv[1]) as f:
    d = {}
    for line in f:
        line = line.strip()
        if not line: continue               # empty line
        if line.startswith('##'): continue  # comment line
        parts = line.split('<=>')
        assert len(parts) == 2              # or print an error message ...
        key, value = [part.strip('" ') for part in parts]
        assert key not in d                 # or print an error message ...
        d[key] = value
Bonus extra: You get to check for dodgy lines and duplicate keys.
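Applied to a few lines of the sample file (with io.StringIO standing in for the file named by sys.argv[1]), the loop above builds the translation dict like so:

```python
import io

# a few lines from the question's sample file
sample = '''## English-Geek Reversible Translation File #1
"TV show" <=> "STAR TREK"
"food" <=> "pizza"
'''

d = {}
for line in io.StringIO(sample):
    line = line.strip()
    if not line or line.startswith('##'):
        continue
    key, value = [part.strip('" ') for part in line.split('<=>')]
    d[key] = value
print(d)  # {'TV show': 'STAR TREK', 'food': 'pizza'}
```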
