Python: index out of range on cash flow - python

Having trouble with code that should read comma-separated values out of a .txt file, sort them into arrays based on sign, and then plot the data.
Here is the code, followed by two .txt files; the first one works, but the second one doesn't.
#check python is working
print "hello world"
#import ability to plot and use matrices
import matplotlib.pylab as plt
import numpy as np
#declare variables
posdata=[]
negdata=[]
postime=[]
negtime=[]
interestrate=.025
#open file
f = open('/Users/zacharygastony/Desktop/CashFlow_2.txt','r')
data = f.readlines()
#split data into arrays
for y in data:
    w = y.split(",")
    if float(w[1])>0:
        postime.append(int(w[0]))
        posdata.append(float(w[1]))
    else:
        negtime.append(int(w[0]))
        negdata.append(float(w[1]))
print "Inflow Total: ", posdata
print "Inflow Time: ", postime
print "Outflow Total: ", negdata
print "Outflow Time: ", negtime
#plot the data
N = len(postime)
M = len(negtime)
ind = np.arange(N+M)    # the x locations for the groups
width = 0.35            # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(ind, posdata+negdata, width, color='r')
# add some labels
ax.set_ylabel('Cash Amount')
ax.set_title('Cash Flow Diagram')
ax.set_xlabel('Time')
plt.plot(xrange(0,M+N))
plt.show()
.txt 1______
0,3761.97
1,-1000
2,-1000
3,-1000
4,-1000
.txt 2______
0,1000
1,-1000
2,1000
3,-1000
My error is as follows:
>>> runfile('/Users/zacharygastony/cashflow.py', wdir=r'/Users/zacharygastony')
hello world
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/zacharygastony/anaconda/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/Users/zacharygastony/cashflow.py", line 24, in <module>
if float(w[1])>0:
IndexError: list index out of range

One error that I can spot is with "if float(w[1])>0:" -- it should take into account that w[1] could be a set of two values separated by a space. Here is how w would look for the second file if its contents all end up on a single line: "['0', '1000 1', '-1000 2', '1000 3', '-1000\n']". So w[1] would be "1000 1", and taking a float of this value would be an error. So, if you really want to access the second element, one way is to split it using the default space delimiter and pick the first one (or the second one), something like: "if float((w[1].split())[0])>0:".

Without having your actual files (or, better, an SSCCE that demonstrates the same problem), there's no way to be exactly sure what's going wrong. When I run your code (just changing the hardcoded pathname) with your exact data, everything works fine.
But if the line if float(w[1])>0: is raising an IndexError, clearly w has only 0 or 1 elements.
Since w came from w= y.split(","), that means that y didn't have any commas in it.
Since y is each line from your file, one of the lines doesn't have any commas in it.
Which line? Well, none of them in the example you gave.
Most likely, your real file has something like a blank line at the end, so w ends as the single-element list [''].
Or… maybe that 2______ is actually a header line at the top of your file, in which case w will end up as ['2______'].
Or the actual file you're running against is a longer, hand-edited file, where you've made a typo somewhere, like 4.1000 instead of 4,1000.
Or…
To actually figure out the problem instead of just guessing, you will need to debug things, using a debugger or an interactive visualizer, or just adding print statements to log all the intermediate values:
for y in data:
    print(y)
    w = y.split(",")
    print(w)
    w1 = w[1]
    print(w1)
    f = float(w1)
    print(f)
    if f > 0:
        # ...
So, your actual problem is blank lines at the end of the file. How can you deal with that?
You can skip over blank lines, or skip over lines without enough commas, or just handle the exception and continue on.
For example, let's skip over blank lines. Note that readlines leaves the newline characters on the end, so the lines won't actually be blank; they'll be '\n' or maybe, depending on your platform and Python version, something else like '\r\n'. But really, you probably want to skip over a line with nothing but spaces too, right? So let's just call strip on it, and if the result is empty, skip the line:
for y in data:
    if not y.strip():
        continue
    w = y.split(",")
If you'd prefer to preprocess things, you can:
data = f.readlines()
data = [line for line in data if line.strip()]
The problem with this is that, on top of reading in the whole file, searching for newlines to split on, and building up a big list (all of which you were already doing just by calling readlines), you're now going over the whole list again and building up another list, all before you even get started. There's no reason to do that.
You can just iterate over a file, without ever calling readlines on it, which will grab the lines as you need them.
And you can use a generator expression instead of a list comprehension to "preprocess" without actually doing the work up-front. So:
data = (line for line in f if line.strip())
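Putting the pieces together, here is a minimal sketch of the parsing loop with the blank-line guard folded in (same pathname and variable names as the question; the one tweak is that the bars are placed at the actual time values rather than at np.arange positions):
import matplotlib.pyplot as plt

posdata, postime = [], []
negdata, negtime = [], []

with open('/Users/zacharygastony/Desktop/CashFlow_2.txt') as f:
    for y in f:
        if not y.strip():
            continue  # skip blank or whitespace-only lines
        w = y.split(",")
        t, amount = int(w[0]), float(w[1])
        if amount > 0:
            postime.append(t)
            posdata.append(amount)
        else:
            negtime.append(t)
            negdata.append(amount)

fig, ax = plt.subplots()
ax.bar(postime + negtime, posdata + negdata, 0.35, color='r')
ax.set_ylabel('Cash Amount')
ax.set_xlabel('Time')
ax.set_title('Cash Flow Diagram')
plt.show()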

Related

Efficient way to check expected semicolon positions in a length-delimited text file, combining many "or" statements

I am checking the position of semicolons in text files. I have length-delimited text files with thousands of rows that look like this:
AB;2;43234;343;
CD;4;41234;443;
FE53234;543;
FE;5;53;34;543;
I am using the following code to check the correct position of the semicolons. If a semicolon is missing where I would expect it, a statement is printed:
import glob

path = r'C:\path\*.txt'
for fname in glob.glob(path):
    print("Checking file", fname)
    with open(fname) as f:
        content = f.readlines()
    for count, line in enumerate(content):
        if (line[2:3] != ";"
                or line[4:5] != ";"
                or line[10:11] != ";"
                # really a lot of continuing entries like these
                or line[14:15] != ";"
                ):
            print("\nSemicolon expected, but not found!\nrow:", count + 1, "\n", fname, "\n", line)
The code works. No error is thrown and it detects the data row.
My problem now is that I have a lot of semicolons to check and I have really a lot of continuing entries like
or line[xx:xx]!=";"
I think this is inefficient in two respects:
It is visually not nice to have this many code lines. I think it could be shortened.
It is logically inefficient to have this many separate or checks. I think it could be made more efficient, probably decreasing the runtime.
I am looking for an efficient solution that:
Improves the readability
Most importantly: reduces the runtime (as I think the way it is written now is inefficient, with all the or statements)
I only want to check whether there are semicolons where I expect them, where I need them. I do not care about any additional semicolons in the data fields.
Just going off of what you've written:
filename = ...
with open(filename) as file:
    lines = file.readlines()

delimiter_indices = (2, 4, 10, 14)  # The indices in any given line where you expect to see semicolons.

for line_num, line in enumerate(lines):
    if any(line[index] != ";" for index in delimiter_indices):
        print(f"{filename}: Semicolon expected on line #{line_num}")
If the line doesn't have at least 15 characters, this will raise an exception. Also, lines like ;;;;;;;;;;;;;;; are technically valid.
EDIT: Assuming you have an input file that looks like:
AB;2;43234;343;
CD;4;41234;443;
FE;5;53234;543;
FE;5;53;34;543;
(Note: the blank line at the end)
My provided solution works fine. I do not see any exceptions or Semicolon expected on line #... outputs.
If your input file ends with two blank lines, this will raise an exception. If your input file contains a blank line somewhere in the middle, this will also raise an exception. If you have lines in your file that are less than 15 characters long (not counting the last line), this will raise an exception.
You could simply say that every line must meet two criteria to be considered valid:
The current line must be at least 15 characters long (or max(delimiter_indices) + 1 characters long).
All characters at delimiter indices in the current line must be semicolons.
Code:
for line_num, line in enumerate(lines):
    is_long_enough = len(line) >= (max(delimiter_indices) + 1)
    has_correct_semicolons = all(line[index] == ';' for index in delimiter_indices)
    if not (is_long_enough and has_correct_semicolons):
        print(f"{filename}: Semicolon expected on line #{line_num}")
EDIT: My bad, I ruined the short-circuit evaluation for the sake of readability. The following should work:
is_valid_line = (len(line) >= (max(delimiter_indices) + 1)) and all(line[index] == ';' for index in delimiter_indices)
if not is_valid_line:
    print(f"{filename}: Semicolon expected on line #{line_num}")
If the length of the line is not correct, the second half of the expression will not be evaluated due to short-circuit evaluation, which should prevent the IndexError.
EDIT:
Since you have so many files, with so many lines and so many semicolons per line, you could do the max(delimiter_indices) calculation once before the loop to avoid recalculating that value for each line. It may not make a big difference, but you could also iterate over the file object directly (which yields the next line on each iteration) instead of loading the entire file into memory via lines = file.readlines(). This isn't strictly required, and it's not as cute as using all or any, but I turned the has_correct_semicolons expression into an actual loop over the delimiter indices; that way the error message can be more explicit, pointing to the offending index of the offending line. There is also a separate error message for when a line is too short.
import glob
import os

delimiter_indices = (2, 4, 10, 14)
max_delimiter_index = max(delimiter_indices)
min_line_length = max_delimiter_index + 1

for path in glob.glob(r"C:\path\*.txt"):
    filename = os.path.basename(path)  # glob.glob yields plain path strings
    print(filename.center(32, "-"))
    with open(path) as file:
        for line_num, line in enumerate(file):
            is_long_enough = len(line) >= min_line_length
            if not is_long_enough:
                print(f"{filename}: Line #{line_num} is too short")
                continue
            has_correct_semicolons = True
            for index in delimiter_indices:
                if line[index] != ";":
                    has_correct_semicolons = False
                    break
            if not has_correct_semicolons:
                print(f"{filename}: Semicolon expected on line #{line_num}, character #{index}")

print("All files done")
If you just want to validate the structure of the lines, you can use a regex that is easy to maintain if your requirement changes:
import re

with open(fname) as f:
    for row, line in enumerate(f, 1):
        if not re.match(r"[A-Z]{2};\d;\d{5};\d{3};", line):
            print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)
If you don't actually care about the content and only want to check the position of the ;, you can simplify the regex to: r".{2};.;.{5};.{3};"
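If runtime matters across thousands of rows, you can also compile the pattern once outside the loops; this keeps the pattern definition in one place and skips the per-call cache lookup that re.match does. A sketch (untested against your real data), reusing the question's path and message:
import glob
import re

# Compile once; only the semicolon positions matter, not the field contents.
pattern = re.compile(r".{2};.;.{5};.{3};")

for fname in glob.glob(r"C:\path\*.txt"):
    print("Checking file", fname)
    with open(fname) as f:
        for row, line in enumerate(f, 1):
            if not pattern.match(line):
                print("\nSemicolon expected, but not found!\nrow:", row, "\n", fname, "\n", line)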

Reading the nth line of a text file in Python, determined from a list

I have a function gen_rand_index that generates a random group of numbers in list format, such as [3,1] or [3,2,1]
I also have a text file that reads something like this:
red $1
green $5
blue $6
How do I write a function so that once Python generates this list of numbers, it automatically reads those lines of the text file? So if it generated [2,1], instead of printing [2,1] I would get "green $5, red $1", i.e. the second line of the text file followed by the first line.
I know that you can do print(line[2]) and commands like that, but this won't work in my case, because each time I want to read a different, randomly chosen line; it is not a set line I want to read each time.
row = str(result[gen_rand_index])  # result[gen_rand_index] gives me the random list of numbers
file = open("Foodinventory.txt", 'r')
for line in file:
    print(line[row])
file.close()
I have this so far, but I am getting this error:
invalid literal for int() with base 10: '[4, 1]'
I also have gotten
TypeError: string indices must be integers
but I have tried replacing str with int and many things like that, but I'm thinking the way I'm approaching this is wrong. Can anyone help me? (I have only been coding for a couple of days now, so I apologize in advance if this question is really basic.)
Okay, let us first get some stuff out of the way
Whenever you access something from a list, the thing you put inside the square brackets [] should be an integer, e.g. [5]. This tells Python that you want the 5th element. It cannot be ["5"], because "5" in this case would be treated as a string.
Therefore the line row = str(result[gen_rand_index]) should actually just be row = ... without the call to str. This is why you got the TypeError about indices needing to be integers.
Secondly, as per your description gen_rand_index would return a list of numbers.
So going by that, why don't you try this:
indices_to_pull = gen_rand_index()
file_handle = open("Foodinventory.txt", 'r')
file_contents = file_handle.readlines()  # If the file is small and simple, this works fine
answer = []
for index in indices_to_pull:
    answer.append(file_contents[index - 1])
Explanation
We get the indices of the file lines from gen_rand_index
We read the entire file into memory using readlines()
Then we get the lines we want. Remember to subtract 1, as the list is indexed from 0
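For instance, in a hypothetical run where gen_rand_index() returned [2, 1] and Foodinventory.txt held the three lines from the question, answer would end up as ['green $5\n', 'red $1\n'], which you can join into the requested output:
print(", ".join(line.strip() for line in answer))  # prints: green $5, red $1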
The error you are getting is because you're trying to index a string variable (line) with a string index (row). Presumably row will contain something like '[2,3,1]'.
However, even if row was a numerical index, you're not indexing what you think you're indexing. The variable line is a string, and it contains (on any given iteration) one line of the file. Indexing this variable will give you a single character. For example, if line contains green $5, then line[2] will yield 'e'.
It looks like your intent is to index into a list of strings, which represent all the lines of the file.
If your file is not overly large, you can read the entire file into a list of lines, and then just index that array:
with open('file.txt') as fp:
    lines = fp.readlines()
print(lines[2])
In this case, lines[2] will yield the string 'blue $6\n'.
To discard the trailing newline, use lines[2].strip() instead.
I'll go line by line and raise some issues.
row = str(result[gen_rand_index]) #result[gen_rand_index] gives me the random list of numbers
Are you sure it is gen_rand_index and not gen_rand_index()? If gen_rand_index is a function, you should call the function. In the code you have, you are not calling the function, instead you are using the function directly as an index.
file = open("Foodinventory.txt", 'r')
for line in file:
print(line[row])
file.close()
The correct python idiom for opening a file and reading line by line is
with open("Foodinventory.txt.", "r") as f:
for line in f:
...
This way you do not have to close the file; the with clause does this for you automatically.
Now, what you want to do is to print the lines of the file that correspond to the elements in your variable row. So what you need is an if statement that checks if the line number you just read from the file corresponds to the line number in your array row.
with open("Foodinventory.txt", "r") as f:
for i, line in enumerate(f):
if i == row[i]:
print(line)
But this is wrong: it would work only if your list's elements are ordered. That is not the case in your question. So let's think a little bit. You could iterate over your file multiple times, and each time you iterate over it, print out one line. But this will be inefficient: it will take time O(nm) where n==len(row) and m == number of lines in your file.
A better solution is to read all the lines of the file and save them to an array, then print the corresponding indices from this array:
with open("Foodinventory.txt", "r") as f:
    arr = list(f)
for i in row:
    print(arr[i - 1])  # lists are zero-indexed
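Putting that together, a minimal sketch that wraps the lookup in a function (assuming, as described in the question, that gen_rand_index() returns a list of 1-based line numbers such as [2, 1]):
def lines_for_indices(filename, indices):
    # Return the 1-based lines of filename selected by indices.
    with open(filename) as f:
        lines = f.readlines()
    return [lines[i - 1].strip() for i in indices]

print(", ".join(lines_for_indices("Foodinventory.txt", gen_rand_index())))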

Issues reading two text files and calculating values

I have two text files with numbers that I want to do some very easy calculations on (for now). I thought I would go with Python. I have two file readers for the two text files:
with open('one.txt', 'r') as one:
    one_txt = one.readline()
    print(one_txt)

with open('two.txt', 'r') as two:
    two_txt = two.readline()
    print(two_txt)
Now to the fun (and, for me, hard) part. I would like to loop through all the numbers in the second text file and, for each, subtract the corresponding number in the first text file.
I have done this (extending the code above):
with open('two.txt') as two_txt:
    for line in two_txt:
        print line
I don't know how to proceed now, because I think the second text file needs some parsing so I can get the numbers I want out of it. The text file (two.txt) looks like this:
Start,End
2432009028,2432009184,
2432065385,2432066027,
2432115011,2432115211,
2432165329,2432165433,
2432216134,2432216289,
2432266528,2432266667,
I want to loop through this, ignore the Start,End (first line), and then pick only the first value before each comma, so the result would be:
2432009028
2432065385
2432115011
2432165329
2432216134
2432266528
I would then subtract from these the corresponding values in one.txt (which contains numbers only and no strings whatsoever) and print the results.
There are many ways to do string operations, and I feel lost; for instance, I don't know whether reading everything into memory is a good approach or not.
Any examples on how to solve this problem would be very appreciated (I am open to different solutions)!
Edit: Forgot to point out that one.txt has values without any commas, like this:
102582
205335
350365
133565
Something like this:
with open('one.txt', 'r') as one, open('two.txt', 'r') as two:
    next(two)  # skip the Start,End header line in two.txt
    for line_one, line_two in zip(one, two):
        one_val = int(line_one)                 # one.txt: one number per line
        two_val = int(line_two.split(",")[0])   # two.txt: value before the first comma
        print(two_val - one_val)
Try this:
onearray = []
file = open("one.txt", "r")
for line in file:
    onearray.append(int(line.replace("\n", "")))
file.close()

twoarray = []
file = open("two.txt", "r")
for line in file:
    if line != "Start,End\n":
        twoarray.append(int(line.split(",")[0]))
file.close()

for i in range(0, len(onearray)):
    print(twoarray[i] - onearray[i])
It should do the job!
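Another option, sketched here under the same file layout, is the csv module, which takes care of the comma splitting and makes the header skip explicit:
import csv

with open("one.txt") as one, open("two.txt") as two:
    reader = csv.reader(two)
    next(reader)  # skip the Start,End header row
    for line_one, row_two in zip(one, reader):
        # row_two looks like ['2432009028', '2432009184', '']; take the first field
        print(int(row_two[0]) - int(line_one))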

Trouble sorting a list with python

I'm somewhat new to Python. I'm trying to sort a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up road). Also, the entries are all on one line, separated by spaces. So I need to use 2 arguments: one for the input file and one for the output file. The result should be sorted with the numbers first and then the words, without the special characters, each on its own line. I've been looking at loads of list functions but am having some trouble putting this together, as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys

try:
    infilename = sys.argv[1] #outfilename = sys.argv[2]
except:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)

ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file, exactly as the file is written, and not sorted at all. The goal is to take a file (arg1.txt), sort it, and write a new file (arg2.txt); both names will be command-line arguments. I used print in this case to speed up the editing, but I need to have it write to a file. That's why the output-file lines are commented out, but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
You may not even need the key function any more since numbers are placed before letters by default. I don't see anything in your code to remove special characters though.
Since they are on the same line, you don't really need readlines:
with open('some.txt') as f:
    data = f.read()  # now data = "item 1 item2 etc..."
You can use re to filter out unwanted characters:
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition is maybe overkill:
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print sorted(my_list)
However, you will need to do more to force the numbers to sort first, maybe something like:
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers, key=lambda x: int(re.sub("[^0-9]", "", x))) + sorted(strings)
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
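For completeness, here is a sketch that combines these pieces: read the single line, strip non-alphanumeric symbols with re, sort purely numeric tokens (numerically) ahead of the words, and write one entry per line to the output file. It keeps the question's Python 2 style and command-line usage:
#!/usr/bin/python
import re
import sys

try:
    infilename, outfilename = sys.argv[1], sys.argv[2]
except IndexError:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)

with open(infilename) as ifile:
    # Keep only letters and digits in each space-separated token
    words = [re.sub("[^A-Za-z0-9]", "", w) for w in ifile.read().split()]

numbers = sorted((w for w in words if w.isdigit()), key=int)
strings = sorted(w for w in words if not w.isdigit())

with open(outfilename, 'w') as ofile:
    ofile.write('\n'.join(numbers + strings) + '\n')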

Printing elements out of a list

I have a certain check to be done and if the check satisfies, I want the result to be printed. Below is the code:
import string
import codecs
import sys

y = sys.argv[1]
list_1 = []
f = 1.0
x = 0.05
write_in = open("new_file.txt", "w")
write_in_1 = open("new_file_1.txt", "w")

ligand_file = open(y, "r")  # Open the receptor.txt file
ligand_lines = ligand_file.readlines()  # Read all the lines into the array
ligand_lines = map(string.strip, ligand_lines)  # Remove the newline character from all the pdb file names
ligand_file.close()

ligand_file = open("unique_count_c_from_ac.txt", "r")  # Open the receptor.txt file
ligand_lines_1 = ligand_file.readlines()  # Read all the lines into the array
ligand_lines_1 = map(string.strip, ligand_lines_1)  # Remove the newline character from all the pdb file names
ligand_file.close()

s = []
for i in ligand_lines:
    for j in ligand_lines_1:
        j = j.split()
        if i == j[1]:
            print j
The above code works great, but when I print j, it prints like ['351', '342'], whereas I am expecting 351 342 (with one space in between). Since it is more of a Python question, I have not included the input files (basically they are just numbers).
Can anyone help me?
Cheers,
Chavanak
To convert a list of strings to a single string with spaces in between the list's items, use ' '.join(seq).
>>> ' '.join(['1','2','3'])
'1 2 3'
You can replace ' ' with whatever string you want in between the items.
Mark Rushakoff seems to have solved your immediate problem, but there are some other improvements that could be made to your code.
Always use context managers (with open(filename, mode) as f:) for opening files rather than relying on close getting called manually.
Don't bother reading a whole file into memory very often. Looping over some_file.readlines() can be replaced with looping over some_file directly.
For example, you could have used map(string.strip, ligand_file), or better yet [line.strip() for line in ligand_file]
Don't choose names to include the type of the object they refer to. This information can be found other ways.
For example, the code you posted can be simplified to something along the lines of
import sys

some_real_name = sys.argv[1]
other_file = "unique_count_c_from_ac.txt"

# Read the inner file into memory once; a bare file object would be
# exhausted after the first pass of the outer loop.
with open(other_file, "r") as ligand_2:
    split_lines = [line.split() for line in ligand_2]

with open(some_real_name, "r") as ligand_1:
    for line_1 in ligand_1:
        # Take care of the trailing newline
        line_1 = line_1.strip()
        for numbers in split_lines:
            if line_1 == numbers[1]:
                # If the second number from this line matches the number that is
                # in the user's file, print all the numbers from this line
                print ' '.join(numbers)
which is more reliable and I believe more easily read.
Note that the algorithmic performance of this is far from ideal because of the nested loops. Depending on your needs, this could potentially be improved, but since I don't know exactly what data you need to extract, I can't tell you whether it can be.
The time this takes currently in my code and yours is O(nmq), where n is the number of lines in one file, m is the number of lines in the other, and q is the length of lines in unique_count_c_from_ac.txt. If two of these are fixed/small, then you have linear performance. If two can grow arbitrarily (I sort of imagine n and m can?), then you could look into improving your algorithm, probably using sets or dicts.
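To make that last suggestion concrete, here is a set-based sketch in the same Python 2 style: each membership test is O(1) on average, so the scan becomes O(n + m) instead of O(nm):
import sys

# The numbers from the user's file, one per line
with open(sys.argv[1]) as f:
    wanted = set(line.strip() for line in f)

with open("unique_count_c_from_ac.txt") as f:
    for line in f:
        numbers = line.split()
        # Assumes each data line has at least two whitespace-separated fields
        if len(numbers) > 1 and numbers[1] in wanted:
            print ' '.join(numbers)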
