Writing to a multi-dimensional array with split - Python

I am trying to use Python to parse a text file (whose path is stored in the variable trackList) containing times and titles. It looks like this:
00:04:45 example text
00:08:53 more example text
12:59:59 the last bit of example text
My regular expression (rem) works, and I am able to split each string (i) into two parts correctly (separating the time from the text), but I am unable to add the lists that the split returns to the larger array I created earlier (sLines) using .extend().
import re

f = open(trackList)
count = 0
sLines = [[0 for x in range(0)] for y in range(34)]
line = []
for i in f:
    count += 1
    line.append(i)
    rem = re.match(r"\A\d\d:\d\d:\d\d\W", line[count-1])
    if rem:
        sLines[count-1].extend(line[count-1].split(' ', 1))
    else:
        print("error on line: " + str(count))
That code should go through each line in the file trackList, test whether the line is as expected, and if so, separate the time from the text and save the result as an array inside an array, at the index one less than the current line number; if not, it should print an error pointing me to the line.
I use array[count-1] because Python lists are zero-indexed and file lines are not.
I use .extend() because I want both elements of the smaller array added to the larger array in the same iteration of the parent for loop.
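For reference, a minimal demonstration of what .extend() does with the two-part split, using the first sample line above:
>>> row = []
>>> row.extend("00:04:45 example text".split(' ', 1))
>>> row
['00:04:45', 'example text']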

So, you have some pretty confusing code there.
For instance doing:
[0 for x in range(0)]
Is a really fancy way of initializing an empty list:
>>> [] == [0 for x in range(0)]
True
Also, how do you know the matrix should be 34 lines long? You're also confusing yourself by calling your line 'i' in your for loop; that name is usually reserved as shorthand for an index, which you'd expect to be a numeric value. Appending i to line and then re-referencing it as line[count-1] is redundant when you already have the line itself in i.
Your overall code can be simplified to something like this:
import re

# load the file and extract the lines
f = open(trackList)
lines = f.readlines()
f.close()

# compile the expression once (faster inside loops)
expr = re.compile(r'^(\d\d:\d\d:\d\d)\s*(.*)$')

sLines = []

# loop over the lines, collecting both the index (i) and the line (line)
for i, line in enumerate(lines):
    result = expr.match(line)

    # validate the line
    if not result:
        print("error on line: " + str(i+1))
        # add an invalid entry to the matrix
        sLines.append([])  # or whatever you want as your invalid line
        continue

    # add the two captured groups to the matrix
    sLines.append(list(result.groups()))
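Fed the three sample lines from the question, sLines should come out as:
>>> sLines
[['00:04:45', 'example text'],
 ['00:08:53', 'more example text'],
 ['12:59:59', 'the last bit of example text']]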


I want Python to skip a list if it doesn't have more than 3 parts (space-separated) while reading a file line

I'm making Python read a file, turning every line of it into a space-separated list. Currently I'm facing an issue where I want Python to read a list only if its content has more than 3 parts.
I'm trying if int(len(list[3])) == 3: and then reading the 3 parts of the list, but the program is giving the error
IndexError: list index out of range
which is usually given when I access something that doesn't exist, but the line of code that reads the 3rd part shouldn't run on a list without 3+ parts.
You are probably looking for:
if len(list) > 3:
    # read list

You don't need to convert len() to an int - it already is an int.
list[3] gives you back the fourth element of the list; you need to pass the whole list object to the len() function.
== 3 will only catch counts equal to 3, while you wanted all counts above 3.
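Putting those points together, a minimal sketch of the whole read loop (the filename data.txt is just a placeholder):

with open('data.txt') as f:
    for line in f:
        parts = line.split()  # space-separated parts of the line
        if len(parts) > 3:
            # only lines with more than 3 parts get processed
            print(parts)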
I think it is this:
def get_file_matrix(file: str):
    with open(file, 'r') as arq:
        lines = arq.readlines()
    # sanitizing
    lines_clear = list(map(lambda line: line.strip(), lines))
    lines_split = list(map(lambda line: line.split(' '), lines_clear))
    # keep only lines with more than 3 parts
    lines_filtered = list(filter(lambda line: len(line) > 3, lines_split))
    return lines_filtered

r = get_file_matrix('test.txt')
print(r)

How to split chunks of xy data into lists between isalpha() and newline \n

So I've got a cleaned-up datafile of number strings, representing coordinates for polygons. I've had experience assigning one polygon's data in a datafile to a column and plotting it in numpy/matplotlib, but for this I have to plot multiple polygons from one datafile, separated by headers. The data isn't evenly sized either; every header has several lines of data in two columns, but not the same number of lines.
i.e. I've used .readlines() to go from:
# Title of the polygons
# a comment on the datasource
# A comment on polygon projection
Poly Two/a bit
(331222.6210000003, 672917.1531000007)
(331336.0946000004, 672911.7816000003)
(331488.4949000003, 672932.4191999994)
##etc
Poly One
[(331393.15660000034, 671982.1392999999), (331477.28839999996, 671959.8816), (331602.10170000046, 671926.8432999998), (331767.28160000034, 671894.7273999993), (331767.28529999964, 671894.7267000005), (##etc)]
to:
PolyOneandTwo
319547.04899999965,673790.8118999992
319553.2614000002,673762.4122000001
319583.4143000003,673608.7760000005
319623.6182000004,673600.1608000007
319685.3598999996,673600.1608000007
##etc
PolyTwoandabit
319135.9966000002,673961.9215999991
319139.7357999999,673918.9201999996
319223.0153000001,673611.6477000006
319254.6040000003,673478.1133999992
##etc etc
PolyOneHundredFifty
##etc
My code so far involves cleaning the original dataset up to make it like you see above:
data_easting = []
data_northing = []
County = open('counties.dat', 'r')
for line in County.readlines():
    if line.lstrip().startswith('#'):
        print('Comment line ignored and leading whitespace removed')
        continue
    line = line.replace('/', 'and').replace(' ', '').replace('[', '').replace(']', '').replace('),(', '\n')
    line = line.strip('()\n')
    print(line)
    if line.isalpha():
        print('Skipped header: ' + line)
        continue
I've been using isalpha() to ignore the headers for each polygon so far, and I was planning on using if line == '\n': continue and line.split(',') to ignore the newlines between blocks and begin splitting the Easting and Northing lists. I've already got the numpy and matplotlib section of the code (not shown) sorted to plot 1 polygon, but I don't know how to implement it to plot multiple arrays/multiple polygons.
I realised early on though that if I tried to assign all the data to the 2 x and y lists, that would just make one large unending polygon that will make a spaghetti mess of my plot as imaginary lines will be drawn to connect them up.
I want to use the isalpha() section to instead identify and make a Dictionary or List of the polygon names, and attach an array for each polygon datablock to that, but I'm not sure of how to implement it (or if you even can). I'm also not certain how to make it stop loading data into a list at the end of a polygon datablock (maybe if line == '\n': break? but how to make it start and stop again 149 more times for each other chunk?).
To make it a bit more difficult, there is 150 polygons with x and y data in this file, so making 150 x and y lists for each individual polygon and writing specific code for each wouldn't be very efficient.
So, how do I basically do:
if line.isalpha():
    # (assign to a Counties dictionary or a list as PolyOne, PolyTwo, ... PolyOneHundredFifty)
    # (a way of getting the data between the header and newline into a separate list)
    # (a way to relate that PolyOne data list of x and y to the dictionary "PolyOne")
if line == '\n':
    # (break? continue?)
    # (then restart and repeat for PolyTwo, ... PolyOneHundredFifty)
line.split(',')
data_easting.append(x)   # x1, x2, ... x150?
data_northing.append(y)  # y1, y2, ... y150?
Is there a way of doing what I intend? How would I go about that without pandas?
Thanks for your time.
Parsing the raw data/file:
When you encounter a line/block like the second in your example,
>>> s = '''[(331393.15660000034, 671982.1392999999), (331477.28839999996, 671959.8816), (331602.10170000046, 671926.8432999998), (331767.28160000034, 671894.7273999993), (331767.28529999964, 671894.7267000005)]'''
it can be converted directly to a 2-d numpy array using ast.literal_eval, which is a safe way to convert text to a Python object - in this case a list of tuples.
>>> import numpy as np
>>> import ast
>>> if s.startswith('['):
...     # print(ast.literal_eval(s))
...     array = np.array(ast.literal_eval(s))
...
>>> array
array([[331393.1566, 671982.1393],
       [331477.2884, 671959.8816],
       [331602.1017, 671926.8433],
       [331767.2816, 671894.7274],
       [331767.2853, 671894.7267]])
>>> array.shape
(5, 2)
For the blocks that resemble the first in your (raw) example, accumulate each line as a tuple of floats in a list; when you reach the next block, make an array of that list and reset it. I put this all in a generator function which yields blocks as 2-d arrays.
import ast
import numpy as np

def parse(lines_read):
    data = []
    for line in lines_read:
        if line.startswith('#'):
            continue
        elif line.startswith('('):
            # strip the parentheses and trailing newline, then split on the comma
            n, e = line[1:-2].split(',')
            data.append((float(n), float(e)))
        elif line.startswith('['):
            array = np.array(ast.literal_eval(line))
            yield array
        else:
            # a header line ends the previous block
            if data:
                array = np.array(data)
                data = []
                yield array
    # don't drop a final block that isn't followed by a header
    if data:
        yield np.array(data)
Used like this
>>> for block in parse(f.readlines()):
...     print(block)
...     print('*******************')
...
[[331222.621  672917.1531]
 [331336.0946 672911.7816]
 [331488.4949 672932.4192]]
*******************
[[331393.1566 671982.1393]
 [331477.2884 671959.8816]
 [331602.1017 671926.8433]
 [331767.2816 671894.7274]
 [331767.2853 671894.7267]]
*******************
>>>
If you need to select the northing or easting columns separately, consult the Numpy docs.
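If the goal is a single figure with all the polygons, here is a rough sketch using the parse() generator above (it assumes matplotlib and counties.dat, the file from the question; whether column 0 is easting or northing depends on the order parsed above):

import matplotlib.pyplot as plt

with open('counties.dat') as f:
    for block in parse(f.readlines()):
        # block is an (N, 2) array: plot one coordinate column against the other
        plt.plot(block[:, 0], block[:, 1])
plt.show()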
Parsing with two regular expressions. This operates on the whole file read as a string: s = fileobject.read(). It needs to go over the file twice and does not preserve the block order.
import re, ast
import numpy as np

pattern1 = re.compile(r'(\n\([^)]+\))+')
pattern2 = re.compile(r'^\[[^]]+\]', flags=re.MULTILINE)

for m in pattern1.finditer(s):
    block = m.group().strip().split('\n')
    data = []
    for line in block:
        line = line[1:-1]
        n, e = map(float, line.split(','))
        data.append((n, e))
    print(np.array(data))
    print('****')

for m in pattern2.finditer(s):
    print(np.array(ast.literal_eval(m.group())))
    print('****')

Indexing error after removing line from 2D array

I am facing a 'list index out of range' error when trying to iterate a for-loop over a table I've created from a CSV extract, but I cannot figure out why, even after trying many different methods.
Here is a step-by-step description of how the error happens:
I'm removing the first line of an imported CSV file, as this
line contains the columns' names but no data. The CSV has the following structure.
columnName1, columnName2, columnName3, columnName4
This, is, some, data
I, have, in, this
very, interesting, CSV, file
After storing the CSV in a first array called oldArray, I want to populate a newArray that will get all values from oldArray but not the first line, which is the column name line, as previously
mentioned. My newArray should then look like this.
This, is, some, data
I, have, in, this
very, interesting, CSV, file
To create this newArray, I'm using the following code with the append() function.
tempList = []
newArray = []
for i in range(len(oldArray)):
    if i > 0:  # my ugly way of skipping line 0...
        for j in range(len(oldArray[0])):
            tempList.append(oldArray[i][j])
        newArray.append(tempList)
        tempList = []
I also stored the columns in their own separate list.
i = 0
for i in range(len(oldArray[0])):
    my_columnList[i] = oldArray[0][i]
And the error comes up next: I now want to populate a treeview table from this newArray, using a for-loop and insert (in a function). But I always get the 'list index out of range' error and I cannot figure out why.
def populateTable(my_tree, newArray, my_columnList):
    i = 0
    for i in range(len(newArray)):
        my_tree.insert('', 'end', text=newArray[i][0], values=(newArray[i][1:len(newArray[0])]))
        # (I'm using the text option to bypass treeview's column 0 problem)
    return my_tree
Error message:
File "(...my working directory...)", line 301, in populateTable
    my_tree.insert(parent='', index='end', text=data[i][0], values=(data[i][1:len(data[0])]))
IndexError: list index out of range
Using that same function with different datasets and columns worked fine, but not for this newArray.
I'm fairly certain that the error comes strictly from this newArray and is not linked to another parameter.
I've tested the validity of the columns list, of the CSV import in oldArray through some print() functions, and everything seems normal - values, row dimension, column dimension.
This is a great mystery to me...
Thank you all very much for your help and time.
You can find the problem in your error message:
File "(...my working directory...)", line 301, in populateTable
    my_tree.insert(parent='', index='end', text=data[i][0], values=(data[i][1:len(data[0])]))
IndexError: list index out of range
It means an index is out of range on line 301, in data[i][0] or data[i][1:len(data[0])]:
either i is beyond len(data), or data[i] has no element 0.
My guess is there is some empty list in data (maybe data[-1]?).
If data[i] is [], then data[i][0] tries to access an item that does not exist.
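One quick way to test that guess before calling insert (a sketch; newArray is whatever list you pass in):

for i, row in enumerate(newArray):
    if len(row) < 2:
        print('suspicious row', i, ':', row)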
There is no problem in your "ugly" way of skipping line 0, but I recommend having a look at this way:
new_array = old_array.copy()
new_array.remove(new_array[0])
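A slice does the same copy-and-skip in one step, if you prefer:

new_array = old_array[1:]  # copies everything except row 0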
Now, for fixing your issue: it looks like you have a problem with the indexing. When you use a for loop over the range of the length of an array, you use normal indexing, which starts from one, while you defined your i variable to be zero. To make it simple,
len(oldArray[0])
is equal to 4, so using it in the for loop is just like saying
for i in range(4):
To fix this you can either subtract 1 from the length of the old array or just set the i variable to 1 at the start:
i = 1
for i in range(len(oldArray[0])):
    my_columnList[i] = oldArray[0][i]
or
i = 0
for i in range(len(oldArray[0])-1):
    my_columnList[i] = oldArray[0][i]
This mistake is also repeated in your populateTree function, so in the same way your code would be:
def populateTree(my_tree, newArray, my_columnList):
    i = 0
    for i in range(len(newArray)-1):
        my_tree.insert('', 'end', text=newArray[i][0], values=(newArray[i][1:len(newArray[0])]))
        # (I'm using the text option to bypass treeview's column 0 problem)
    return my_tree

Using an if statement to pass variables through to further functions in Python

I am a biologist who is just trying to use Python to automate a ton of calculations, so I have very little experience.
I have a very large array that contains values that are formatted into two columns of observations. Sometimes the observations will be the same between the columns:
v1,v2
x,y
a,b
a,a
x,x
In order to save time and effort I wanted to make an if statement that just prints 0 if the two columns are the same and then moves on. If the values are the same there is no need to run those instances through the downstream analyses.
This is what I have so far, just to test out the if statement. It has yet to recognize any instances where the columns are equivalent.
Script:
mylines = []
with open('xxxx', 'r') as myfile:
    for myline in myfile:
        mylines.append(myline)  # reads the data into the two-column format mentioned above

rang = len(open('xxxxx', 'r').readlines())  # returns the number of lines in the file
for x in range(1, rang):
    li = mylines[x]          # the selected row, as defined by x and the number of lines in the file
    spit = li.split(',', 2)  # splits the selected values so they can be accessed separately
    print(spit[0])  # first value
    print(spit[1])  # second value
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Output:
192Alhe52
192Alhe52
Issue ##should be 0
188Alhe48
192Alhe52
Issue
191Alhe51
192Alhe52
Issue
How do I get Python to recognize that certain observations are actually equal?
When you read the values and store them in the array, you may be storing '\n' as well, which is the line-break character, so your array actually looks like this:
print(mylines)
['x,y\n', 'a,b\n', 'a,a\n', 'x,x\n']
To work around this issue, use strip(), which removes this character and any stray blank spaces at the end of the string that would also affect the comparison:
mylines.append(myline.strip())
You shouldn't use rang = len(open('xxxxx', 'r').readlines()), because it reads the whole file again; use the list you already built:
rang = len(mylines)
There is a more readable, Pythonic way to replicate your for loop:
for li in mylines[1:]:
    spit = li.split(',')
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Or even:
for spit in (li.split(',') for li in mylines[1:]):
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Both iterate over the array mylines starting from the second element, skipping the header row.
Also, if you're interested in python packages, you should have a look at pandas. Assuming you have a csv file:
import pandas as pd

df = pd.read_csv('xxxx')
for i, elements in df.iterrows():
    if elements['v1'] == elements['v2']:
        print('Equal')
    else:
        print('Different')
will do the trick. If you need to modify values and write another file
df.to_csv('nameYouWant')
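As a side note, the row-by-row loop can be replaced with a vectorized comparison, which is usually much faster on large tables (a sketch using the same df as above):

# boolean Series: True where the two columns match
same = df['v1'] == df['v2']
print(same.sum(), 'equal rows out of', len(df))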
For one, your issue with the equals test might be because iterating over lines like this also yields the newline character. There is a string function that can get rid of that: .strip(). Also, your second argument to split is 2, which splits your row into at most three parts - but that probably doesn't show here. You can avoid having to parse it yourself by using the csv module, as your file presumably is CSV:
import csv

with open("yourfile.txt") as file:
    reader = csv.reader(file)
    next(reader)  # skip header
    for first, second in reader:
        print(first)
        print(second)
        if first == second:
            print(0)
        else:
            print("Issue")

Reading multiple files and arrays

I need to read the values from text files into an array, Z. This works fine using just a single file, ChiTableSingle, but when I try to use multiple files it fails. It seems to read the lines correctly and produces Z, but gives Z[0] as just [], and then I get the error 'setting an array element with a sequence'.
This is my current code:
import os
import numpy as np
import matplotlib.pyplot as plt

rootdir = r'C:\users\documents\ChiGrid'
fileNameTemplate = r'C:\users\documents\ContourPlots\Plot{0:02d}.png'
for subdir, dirs, files in os.walk(rootdir):
    for count, file in enumerate(files):
        fh = open(os.path.join(subdir, file), 'r')
        #fh = open("ChiTableSingle.txt")
        print 'file is ' + str(file)
        Z = []
        for line in fh.readlines():
            y = [value for value in line.split()]
            Z.append(y)
        print Z[0][0]
        fh.close()
        plt.figure()  # create a new figure window
        Temp = open('TempValues.txt', 'r')
        lineTemp = Temp.readlines()
        for i in range(0, len(lineTemp)):
            lineTemp[i] = [float(lineTemp[i])]
        Grav = open('GravValues2.txt', 'r')
        lineGrav = Grav.readlines()
        for i in range(0, len(lineGrav)):
            lineGrav[i] = [float(lineGrav[i])]
        X, Y = np.meshgrid(lineTemp, lineGrav)  # create a 2-D grid from the xlist, ylist values
        plt.contour(X, Y, Z, [1, 2, 3], colors='k', linestyles='solid')
        plt.savefig(fileNameTemplate.format(count), format='png')
        plt.clf()
The first thing I noticed is that your list comprehension y = [value for ...] is only going to return a list of strings (from the split() function), so you will want to convert them to a numeric format at some point before trying to plot it.
In addition, if the files you are reading in are simply white-space delimited tables of numbers, you should consider using numpy.loadtxt(fh) since it takes care of splitting and type conversion and returns a 2-d numpy.array. You can also add comment text that it will ignore if the line starts with the regular python comment character (e.g. # this line is a comment and will be ignored).
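For example, a minimal sketch of that suggestion (assuming the files really are plain whitespace-delimited tables of numbers):

import numpy as np

Z = np.loadtxt('ChiTableSingle.txt')  # 2-d float array; lines starting with '#' are skipped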
Just another thought: I would be careful about using variable names that are the same as a Python built-in (e.g. the word file in this case). Once you redefine it as something else, the previous definition is gone.
