def process_dialect_translation_rules():
    # Read in lines from the text file specified in sys.argv[1], stripping away
    # excess whitespace and discarding comments (lines that start with '##').
    f_lines = [line.strip() for line in open(sys.argv[1], 'r').readlines()]
    f_lines = filter(lambda line: not re.match(r'##', line), f_lines)
    # Remove any occurrences of the pattern '\s*<=>\s*'. This leaves us with a
    # list of lists. Each 2nd level list has two elements: the value to be
    # translated from and the value to be translated to. Use the sub function
    # from the re module to get rid of those pesky quotation marks.
    f_lines = [re.split(r'\s*<=>\s*', line) for line in f_lines]
    f_lines = [re.sub(r'"', '', elem) for elem in line for line in f_lines]
This function should take the lines from a file and perform some operations on them, such as removing any lines that begin with ##. Another operation I wish to perform is to remove the quotation marks around the words in each line. However, when the final line of this script runs, f_lines becomes an empty list. What happened?
Requested lines of original file:
## English-Geek Reversible Translation File #1
## (Moderate Geek)
## Created by Todd WAreham, October 2009
"TV show" <=> "STAR TREK"
"food" <=> "pizza"
"drink" <=> "Red Bull"
"computer" <=> "TRS 80"
"girlfriend" <=> "significant other"
In Python, multiple for loops in a list comprehension are handled from left to right, not from right to left, so your last expression should read:
[re.sub(r'"', '', elem) for line in f_lines for elem in line]
It doesn't lead to an error as it is under Python 2, since list comprehensions there leak the loop variable, so line is still in scope from the previous expression (under Python 3, comprehensions get their own scope, and this line would raise a NameError instead). If that leaked line happens to be an empty string, you get an empty list as the result.
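A quick way to convince yourself of the left-to-right rule (a standalone illustration, not from the original question):

# Comprehension clauses nest left to right, exactly like nested for loops:
pairs = [(x, y) for x in 'ab' for y in (1, 2)]

pairs2 = []
for x in 'ab':          # first clause = outer loop
    for y in (1, 2):    # second clause = inner loop
        pairs2.append((x, y))

assert pairs == pairs2  # both give [('a', 1), ('a', 2), ('b', 1), ('b', 2)]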
Your basic problem is that you have chosen an over-complicated way of doing things, and come unstuck. Use the simplest tool that will get the job done. You don't need filter, map, lambda, readlines, and all of those list comprehensions (one will do). Using re.match instead of startswith is overkill. So is using re.sub where str.replace would do the job.
import sys

d = {}
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if not line: continue               # empty line
        if line.startswith('##'): continue  # comment line
        parts = line.split('<=>')
        assert len(parts) == 2              # or print an error message ...
        key, value = [part.strip('" ') for part in parts]
        assert key not in d                 # or print an error message ...
        d[key] = value
Bonus extra: You get to check for dodgy lines and duplicate keys.
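To see what strip('" ') does on a single line from the sample file (a small illustration):

parts = '"TV show" <=> "STAR TREK"'.split('<=>')
key, value = [part.strip('" ') for part in parts]
print((key, value))   # ('TV show', 'STAR TREK')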
I am checking the position of semicolons in text files. I have length-delimited text files having thousands of rows which look like this:
AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;
I am using the following code to check the correct position of the semicolons. If a semicolon is missing where I would expect it, a statement is printed:
import glob

path = r'C:\path\*.txt'
for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
        for count, line in enumerate(content):
            if (line[2:3] != ";"
                    or line[4:5] != ";"
                    or line[10:11] != ";"
                    # really a lot of continuing entries like these
                    or line[14:15] != ";"
                    ):
                print("\nSemicolon expected, but not found!\nrow:", count+1, "\n", fname, "\n", line)
The code works. No error is thrown, and it detects the faulty data row.
My problem now is that I have a lot of semicolons to check, and really a lot of continuing entries like
or line[xx:xx]!=";"
I think this is inefficient in two respects:
It is visually not nice to have this many lines of code. I think it could be shortened.
It is logically inefficient to have this many chained or checks. I think it could be made more efficient, probably decreasing the runtime.
I am looking for an efficient solution that:
Improves the readability
Most importantly: reduces the runtime (as I think the way it is written now is inefficient, with all the or statements)
I only want to check whether there are semicolons where I expect them, where I need them. I do not care about any additional semicolons in the data fields.
Just going off of what you've written:
filename = ...
with open(filename) as file:
    lines = file.readlines()

delimiter_indices = (2, 4, 10, 14)  # The indices in any given line where you expect to see semicolons.

for line_num, line in enumerate(lines):
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")
If the line doesn't have at least 15 characters, this will raise an exception. Also, lines like ;;;;;;;;;;;;;;; are technically valid.
EDIT: Assuming you have an input file that looks like:
AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;
(Note: the blank line at the end)
My provided solution works fine. I do not see any exceptions or Semicolon expected on line #... outputs.
If your input file ends with two blank lines, this will raise an exception. If your input file contains a blank line somewhere in the middle, this will also raise an exception. If you have lines in your file that are less than 15 characters long (not counting the last line), this will raise an exception.
You could simply say that every line must meet two criteria to be considered valid:
The current line must be at least 15 characters long (or max(delimiter_indices) + 1 characters long).
All characters at delimiter indices in the current line must be semicolons.
Code:
for line_num, line in enumerate(lines):
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)
    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")
EDIT: My bad, I ruined the short-circuit evaluation for the sake of readability. The following should work:
is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")
If the length of the line is not correct, the second half of the expression will not be evaluated due to short-circuit evaluation, which should prevent the IndexError.
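A tiny illustration of that short-circuiting (hypothetical line, just for demonstration):

delimiter_indices = (2, 4, 10, 14)
line = "AB;2"   # shorter than 15 characters, so line[14] would raise IndexError

# The all(...) part is never evaluated when the length check fails:
is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and (all(line[index] == ';' for index in delimiter_indices))
print(is_valid_line)   # False, and no IndexError was raised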
EDIT:
Since you have so many files with so many lines and so many semicolons per line, you could do the max(delimiter_indices) calculation before the loop to avoid having to calculate that value for each line. It may not make a big difference, but you could also just iterate over the file object directly (which yields the next line each iteration), as opposed to loading the entire file into memory before you iterate via lines = file.readlines(). This isn't really required, and it's not as cute as using all or any, but I decided to turn the has_correct_semicolons expression into an actual loop that iterates over the delimiter indices - that way your error message can be a bit more explicit, pointing to the offending index of the offending line. Also, there's a separate error message for when a line is too short.
import glob
import os

delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1

for path in glob.glob(r"C:\path\*.txt"):
    filename = os.path.basename(path)  # glob.glob yields plain strings, not Path objects
    print(filename.center(32, "-"))
    with open(path) as file:
        for line_num, line in enumerate(file):
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue
            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break
            if not has_correct_semicolons:
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")

print("All files done")
If you just want to validate the structure of the lines, you can use a regex that is easy to maintain if your requirement changes:
import re

with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)
If you don't actually care about the content and only want to check the position of the ;, you can simplify the regex to: r".{2};.;.{5};.{3};"
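If you want to check both patterns locally, here is a small self-contained test using the sample rows from the question:

import re

rows = ["AB;2;43234;343;", "CD;4;41234;443;", "FE53234;543;", "FE;5;53;34;543;"]
strict = re.compile(r"[A-Z]{2};\d;\d{5};\d{3};")   # validates the content too
loose = re.compile(r".{2};.;.{5};.{3};")           # only checks the ; positions
for row in rows:
    print(row, bool(strict.match(row)), bool(loose.match(row)))
# "FE53234;543;" fails both patterns; "FE;5;53;34;543;" fails the strict one
# but happens to pass the dot version, because its characters at positions
# 2, 4, 10 and 14 are all semicolons.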
I'm going to create a script which will take the keys and values from a dictionary and use them for replacement in a set of files.
I need to replace "foo" with "foo1" if "target_value" is found in the file. There are many different foo's, so I guess a dictionary will be suitable for that.
I started from the simple things:
import fileinput

with fileinput.FileInput(filelist, inplace=True) as file:
    for line in file:
        if line.find('target_value'):
            print(line.replace("foo", "foo1"), end='')
For some reason this script just ignores the line.find check and the last line replaces everything.
Could you help?
Python's str.find method returns -1 if the value is not found, so you need something like:
import fileinput

with fileinput.FileInput(filelist, inplace=True) as file:
    for line in file:
        if line.find('target_value') > -1:
            line = line.replace("foo", "foo1")
        print(line, end='')
The trouble with str.find is that it returns the index at which "target_value" occurs in line, which can be any integer from 0 to len(line) - len("target_value"). That is, unless "target_value" doesn't occur in line; then str.find returns -1, but the value of bool(-1) is True. In fact, the only time bool(line.find('target_value')) is False is when "target_value" is the first part of line.
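To make the return values concrete (a small demonstration):

line = "prefix target_value suffix"
print(line.find("target_value"))                    # 7, a truthy value
print("target_value first".find("target_value"))    # 0, the only falsy hit
print("no match here".find("target_value"))         # -1, and bool(-1) is True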
There are a couple options:
with fileinput.FileInput(filelist, inplace=True) as file:
    for line in file:
        if line.find('target_value') != -1:
            line = line.replace("foo", "foo1")
        print(line, end='')  # with inplace=True, every line must be printed back
Or:
with fileinput.FileInput(filelist, inplace=True) as file:
    for line in file:
        if 'target_value' in line:
            line = line.replace("foo", "foo1")
        print(line, end='')
The second option is more readable and tends to perform better when line is long and "target_value" doesn't occur in the beginning of line.
>>> timeit('"target_value" in s', setup='s = "".join("foobar baz" for _ in range(100))+"target_value"+ "".join("foobar baz" for _ in range(100))')
0.20444475099975534
>>> timeit('s.find("target_value") != -1', setup='s = "".join("foobar baz" for _ in range(100))+"target_value"+ "".join("foobar baz" for _ in range(100))')
0.30517548999978317
Instead of .find(), you can use if 'target_value' in line: which is more expressive and does not involve return value conventions.
If you have multiple target keywords (and multiple replacements per target), you could build your dictionary like this:
replacements = {'target_value1': [('foo', 'foo1'), ('bar', 'bar1')],
                'target_value2': [('oof', 'oof'), ('rab', 'rab2')],
                'target_value3': [('foobar', 'foosy'), ('barfoo', 'barry')]}
Then find which target value(s) are present and perform the corresponding replacements:
with open(fileName, 'r') as f:
    content = f.read()

for target, maps in replacements.items():      # go through target keywords
    if target not in content: continue         # check target (whole file)
    for fromString, toString in maps:          # perform replacements for target
        content = content.replace(fromString, toString)

# you probably want to save the new content at this point
with open(fileName, 'w') as f:
    f.write(content)
Note that, in this example, I assumed that the target keyword flags the whole file (not each line). If the target keyword is specific to each line, you will need to break the content down into lines and place the logic inside a loop over the lines, performing the replacements on a line-by-line basis.
You don't actually need the replacement data to be a dictionary (this second example with line level single target replacements):
replacements = [['target_value1', ('foo', 'foo1'), ('bar', 'bar1')],
                ['target_value2', ('oof', 'oof'), ('rab', 'rab2')],
                ['target_value3', ('foobar', 'foosy'), ('barfoo', 'barry')]]

with open(fileName, 'r') as f:
    lines = f.read().split("\n")

for i, line in enumerate(lines):           # for each line
    for target, *maps in replacements:     # go through target keywords
        if target not in line: continue    # check target (line level)
        for fromString, toString in maps:  # perform replacements on line
            lines[i] = lines[i].replace(fromString, toString)
        break                              # only process one keyword per line

# to save changes
with open(fileName, 'w') as f:
    f.write("\n".join(lines))
I want to read a text file and copy text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new in Python and this is as far as I got:
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a=[0]
b=0
for i,line in enumerate(searchlines):
if '~~~~~~~~~~~~~' in line:
b=b+1
if '~~~~~~~~~~~~~' not in line:
if 's1mb4d' in line:
break
a.insert(b,line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array in which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' separators is not constant; that's why I use 'b', so I can put lines of text into one array index until a new '~~~~~~~~~~~~~' is found.
I check for '~~~~~~~~~~~~~'; if found, I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once it's found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add text to the array.
But things didn't go well. Only 1 line of the entire text between those '~~~~~~~~~~~~~' is being copied to each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~
You could use a regex; give this a try:
import re

input_text = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = []
for line in input_text:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
What it does: it reads line by line and collects all runs of characters other than '~'; a line consisting only of '~' is ignored, and every line with text is appended to your a list.
And just because we can, here it is as a one-liner (excluding the import and the source list, of course):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In Python, the solution to a large part of problems is often to find the right function from the standard library that does the job. Here you should try using split instead; it should be way easier.
If I understand your goal correctly, you can do it like this:
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~')
The first line joins your list of lines into a single string, and the second one cuts that big string every time it encounters the '~~' sequence.
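Applied to the example file, that looks like this (a sketch with the file's lines inlined; note that splitting on ten ~ characters leaves a '~~~' remnant in each chunk, since the file's separator lines are thirteen ~ long, so you may prefer to split on the full separator):

searchlines = ['~~~~~~~~~~~~~\n', 'Text123asdasd\n', 'asdasdjfjfjf\n',
               '~~~~~~~~~~~~~\n', '123abc\n', '321bca\n', 'gjjgfkk\n',
               '~~~~~~~~~~~~~\n']
joined_lines = ''.join(searchlines)
print(joined_lines.split('~~~~~~~~~~'))
# ['', '~~~\nText123asdasd\nasdasdjfjfjf\n', '~~~\n123abc\n321bca\ngjjgfkk\n', '~~~\n']
print(joined_lines.split('~~~~~~~~~~~~~\n'))
# ['', 'Text123asdasd\nasdasdjfjfjf\n', '123abc\n321bca\ngjjgfkk\n', '']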
I tried to clean it up to the best of my knowledge, try this and let me know if it works. We can work together on this!:)
with open("textfile.txt", "r",encoding='utf8') as f:
searchlines = f.readlines()
a = []
currentline = ''
for i,line in enumerate(searchlines):
currentline += line
if '~~~~~~~~~~~~~' in line:
a.append(currentline)
elif 's1mb4d' in line:
break
Some notes:
You can use elif for your break function
Append will automatically add the next iteration to the end of the array
currentline will continue to accumulate text on each line as long as the line doesn't contain 's1mb4d' or the ~~~ separator, which I think is what you want
import re

s = ['']
with open('path\\to\\sample.txt') as f:
    for l in f:
        a = l.strip().split("\n")
        s += a

a = []
for line in s:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
print(a)
[['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ( '~' * 13) to be specific) ...
then you could accomplish this for relatively normal sized files using just
#!/usr/bin/python
## (Should be #!/usr/bin/env python; but StackOverflow's syntax highlighter?)

separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
    results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
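For instance (a small illustration of the two constructions):

print('~' * 13)                       # ~~~~~~~~~~~~~
print('xx%sxx' % 'YY')                # xxYYxx
print(repr('\n%s\n' % ('~' * 13)))    # '\n~~~~~~~~~~~~~\n'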
If you really want any line of any number of ~ characters to serve as a separator than you'll want to use the .split() method from the regular expressions module rather than the .split() method provided by the built-in string objects.
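That variant might look like this (a sketch, assuming a separator line is one or more ~ characters on a line of its own):

import re

text = "first chunk\n~~~\nsecond chunk\n~~~~~~~~~~~~~\nthird chunk"
results = re.split(r'\n~+\n', text)   # any run of ~ on its own line separates chunks
print(results)   # ['first chunk', 'second chunk', 'third chunk']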
Note that this snippet of code will return all of the text between your separator lines, including any newlines they include. There are other snippets of code which can filter those out. For example given our previous results:
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method, but I prefer the join/split combination.) In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results (which we're arbitrarily naming each), performing our transformation on it, and the resulting list is being bound back to the name results. I highly recommend learning and getting comfortable with list comprehensions if you're going to learn Python; they're commonly used and can be a bit exotic compared to the syntax of many other programming and scripting languages.
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python 3.6 installed under MacOS X using Homebrew ... but just be forewarned.)
Very new, please be nice and explain slowly and clearly. Thanks :)
I've tried searching how to extract a single line in python, but all the responses seem much more complicated (and confusing) than what I'm looking for. I have a file, it has a lot of lines, I want to pull out just the line that starts with #.
My file.txt:
"##STUFF"
"##STUFF"
#DATA 01 02 03 04 05
More lines here
More lines here
More lines here
My attempt at a script:
file = open("file.txt", "r")
splitdata = []
for line in file:
if line.startswith['#'] = data
splitdata = data.split()
print splitdata
#expected output:
#splitdata = [#DATA, 1, 2, 3, 4, 5]
The error I get:
line.startswith['#'] = data
TypeError: 'builtin_function_or_method' object does not support item assignment
That seems to mean it doesn't like my "= data", but I'm not sure how to tell it that I want to take the line that starts with # and save it separately.
Correct the if statement and the indentation:
for line in file:
    if line.startswith('#'):
        print line
Although you're relatively new, you should start learning to use list comprehensions. Here is an example of how you can use one for your situation; I explained the details in the comments below, in the corresponding order.
splitdata = [line.split() for line in file if line.startswith('#')]
# defines splitdata as a list because the comprehension is wrapped in []
# makes a for loop to iterate through file
# checks if the line "startswith" a '#'
# note: you should call functions/methods using () not []
# splits the line at spaces if the if statement returns True
An if condition expects a predicate (an expression), not an assignment.
if line.startswith('#'):
startswith(...)
S.startswith(prefix[, start[, end]]) -> bool
Return True if S starts with the specified prefix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
prefix can also be a tuple of strings to try.
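The tuple form is handy when several prefixes should count as a match (a quick illustration):

line = "#DATA 01 02 03 04 05"
print(line.startswith('#'))           # True
print(line.startswith(('##', '#')))   # True; any prefix in the tuple may match
print(line.startswith('#', 1))        # False; comparison starts at index 1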
This stub of my program
with open(fi) as f:
    lines = f.readlines()
lines = [line.split('!')[0] for line in lines if not line.startswith('!') and line.strip()] # Removing line and inline comments
for line in lines:
    print line, len(line), bool(not line.startswith('!') and line.strip())
gives me output such as
conduction-guess double_array optional 68 True
valence character optional 68 True
valence-guess double_array optional 68 True
68 False
saturated-bond-length-factor double required 68 True
Shouldn't the line whose bool value is False not be included? I'm lost.
I thought it might be short-circuiting, but flipping the expression doesn't help either.
To clarify, I want my list to be a list of the lines that end with 'True' in the above section of code.
The first not line.startswith('!') and line.strip() is operating on a different set of line values than the second one. The list comprehension causes lines to be replaced with a list of the first !-delimited field of each string in the original value of lines, and it is this new lines that the code then prints out. For example, if lines originally contains the string " !foo", this string will pass the conditional in the comprehension, but only the part before the ! — a single space — will be saved into the new lines list, and a single space will not cause line.strip() to return true.
You could see something like that if you had a line in your file which had an exclamation point somewhere other than in the first column. e.g.:
line = " ! "
line.strip() returns a new string; it's not a boolean.
Updated code:
with open(fi) as f:
    lines = [ line.strip() for line in f ]

# remove exclamation points
lines = [ line.split('!')[0] for line in lines
          if not line.startswith('!')
        ]

# 3rd col will always be True
for line in lines:
    print line, len(line), bool(not line.startswith('!'))