Grouping data by a certain rank in Python - python

I have a text file that looks like this. The numbers in each segment are x and y coordinates.
I want to read only the records with rank=1 and store their coordinates in an x list and a y list. So I need to read and save the rank and the number of points: once the program knows the number of points, it knows how many coordinates it has to read and store.
I already have the following code, but I am stuck: I don't know how to tell the program to read that number of points before the next segment starts.
file = "/Users/yuval/Desktop/test1.txt"
x = []
y = []
with open(file, "r") as f:
    for lines in f:
        line = lines.split()
        if(line[0] == "segment"):
            rank = int(line[3])
            points = int(line[5])

After your first if block, you can use an additional if block to append to your lists when rank==1.
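x = []
y = []
rank = None
with open(file, "r") as f:
    for lines in f:
        line = lines.strip().split()
        if not line:
            continue  # skip blank lines so line[0] cannot raise IndexError
        if(line[0] == "segment"):
            # header line: remember the rank of the segment that follows
            rank = int(line[3])
            points = int(line[5])
            continue
        if rank==1:
            x.append(float(line[0]))
            y.append(float(line[1]))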
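The question also asks how to use the number of points to read exactly that many coordinate lines. A minimal sketch of that idea follows. The header layout ("segment <id> rank <r> points <n>") is an assumption inferred from the indices line[3] and line[5] in the code above, and the function name read_rank and the sample data are made up for illustration.

```python
from itertools import islice


def read_rank(lines, wanted_rank=1):
    """Collect x/y coordinates from segments whose rank equals wanted_rank."""
    x, y = [], []
    it = iter(lines)
    for raw in it:
        fields = raw.split()
        if not fields or fields[0] != "segment":
            continue  # skip blank lines and anything outside a segment
        rank = int(fields[3])    # assumed position of the rank value
        points = int(fields[5])  # assumed position of the point count
        # Consume exactly `points` coordinate lines from the same iterator.
        for coord in islice(it, points):
            if rank == wanted_rank:
                px, py = coord.split()[:2]
                x.append(float(px))
                y.append(float(py))
    return x, y


# Hypothetical sample data in the assumed layout.
sample = [
    "segment 0 rank 1 points 2",
    "1.0 2.0",
    "3.0 4.0",
    "segment 1 rank 2 points 1",
    "9.0 9.0",
    "segment 2 rank 1 points 1",
    "5.0 6.0",
]
xs, ys = read_rank(sample)
print(xs, ys)
```

Because islice pulls from the same iterator as the outer loop, the outer loop resumes at the next segment header. An open file object can be passed instead of the list.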

Related

Selecting line from file by using "startswith" and "next" commands

I have a file from which I want to create a list ("timestep") from the numbers which appear after each line "ITEM: TIMESTEP" so:
timestep = [253400, 253500, .. etc]
Here is the sample of the file I have:
ITEM: TIMESTEP
253400
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
ITEM: TIMESTEP
253500
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
To do this I tried to use "startswith" and "next" together, but it didn't work. Is there another way to do it? Here is the code I'm trying to use:
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.split()
        if line[0].startswith("ITEM: TIMESTEP"):
            timestep.append(next(line))
print(timestep)
The logic is to decide whether or not to append the current line to timestep. So what you need is a variable that tells you to append the current line whenever that variable is True.
timestep = []
append_to_list = False # decision variable
with open(file, 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip() # remove "\n" from line
        if line.startswith("ITEM"):
            # Update append_to_list
            if line == 'ITEM: TIMESTEP':
                append_to_list = True
            else:
                append_to_list = False
        else:
            # append to list if line doesn't start with "ITEM" and append_to_list is True
            if append_to_list:
                timestep.append(line)
print(timestep)
output:
['253400', '253500']
First - I don't like this approach, because it doesn't scale. You can only get the single line that immediately follows nicely; anything more gets ugly.
But you asked, so ... for x in lines will create an iterator over lines and use that to keep the position. You don't have access to that iterator, so next will not be the next element you're expecting. But you can make your own iterator and use that:
lines_iter = iter(lines)
for line in lines_iter:
    # whatever was here
    timestep.append(next(lines_iter))
However, if you ever want to scale it, for is not a good way to iterate over a file like this, because you want to know what is in the next or previous line. I would suggest using while:
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
i = 0
while i < len(lines):
    if lines[i].startswith("ITEM: TIMESTEP"):
        i += 1
        # collect every line up to the next "ITEM: " header (or end of file)
        while i < len(lines) and not lines[i].startswith("ITEM: "):
            timestep.append(lines[i].strip())
            i += 1
    else:
        i += 1
This way you can extend it for different types of ITEMS of variable length.
So the problem with your code is subtle. You have a list lines which you iterate over, but you can't call next on a list.
Instead, turn it into an explicit iterator and you should be fine:
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
lines_iter = iter(lines)
for line in lines_iter:
    line = line.strip() # removes the newline
    if line.startswith("ITEM: TIMESTEP"):
        # the second argument to next() prevents an error when
        # ITEM: TIMESTEP appears as the last line in the file
        value = next(lines_iter, None)
        if value is not None:
            timestep.append(value.strip())
print(timestep)
I'm also not sure why you included line.split, which seems to be incorrect (in any case line.split()[0].startswith('ITEM: TIMESTEP') can never be true, since the split will separate ITEM: and TIMESTEP into separate elements of the resulting list.)
For a more robust answer, consider grouping your data based on when the line begins with ITEM.
def process_file(f):
    ITEM_MARKER = 'ITEM: '
    item_title = '(none)'
    values = []
    for line in f:
        if line.startswith(ITEM_MARKER):
            if values:
                yield (item_title, values)
            item_title = line[len(ITEM_MARKER):].strip() # strip off the marker
            values = []
        else:
            values.append(line.strip())
    if values:
        yield (item_title, values)
This will let you pass in the whole file and will lazily produce a set of values for each ITEM: <whatever> group. Then you can aggregate in some reasonable way.
with open(file, 'r') as f:
    groups = process_file(f)
    aggregations = {}
    for name, values in groups:
        aggregations.setdefault(name, []).extend(values)
print(aggregations['TIMESTEP']) # this is what you want
You can use enumerate to help with index referencing. We check whether the string ITEM: TIMESTEP is in the previous line and, if so, add the integer on the current line to our timestep list. The i > 0 test keeps the first line from being compared with lines[-1], the last line of the file.
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    if i > 0 and "ITEM: TIMESTEP" in lines[i-1]:
        timestep.append(int(line.strip()))
print(timestep)
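The "look at the previous line" idea can also be written without indices by pairing each line with its successor. This is only a sketch of an alternative, not taken from any of the answers; the in-memory sample stands in for the real file and is shortened from the data in the question.

```python
import io

# Shortened stand-in for the dump file shown in the question.
sample = io.StringIO(
    "ITEM: TIMESTEP\n253400\nITEM: NUMBER OF ATOMS\n378\n"
    "ITEM: TIMESTEP\n253500\nITEM: NUMBER OF ATOMS\n378\n"
)
lines = [line.strip() for line in sample]

# zip(lines, lines[1:]) yields (previous, current) pairs of adjacent lines.
timestep = [
    int(cur)
    for prev, cur in zip(lines, lines[1:])
    if prev == "ITEM: TIMESTEP"
]
print(timestep)
```

Like the enumerate version, this only captures the single line that follows each header.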

How to increase the speed of CSV data matching?

I have a script that parses two CSV files and compares the first column of one file with the second column of the other. The problem is that the files are big, so the process takes some time to finish. How can I improve the speed? I tried to use yield from lines before the for loop, but then I have to convert lines[1:] to list(lines[1:]), so it gains nothing.
def pk():
    with open('way/to/first.csv') as csv_file:
        lines = csv_file.readlines()
        full_list = []
        for line in lines[1:]:
            array = line.split(',')
            list_pk = array[0].replace('"', '')
            full_list.append(list_pk)
        return full_list
def fk():
    with open('way/to/second.csv') as csv_file:
        lines = csv_file.readlines()
        full_list = []
        for line in lines[1:]:
            array = line.split(',')
            list_fk = array[1].replace('"', '')
            full_list.append(list_fk)
        return full_list
def res():
    f = fk()
    p = pk()
    for i in f:
        if i not in p:
            raise AssertionError(f'{i} not found')
Try using Python's "set difference" to find the elements in set A that do not have a match in set B:
def res():
    fset = set(fk())
    pset = set(pk())
    print('items in F that are missing from P:')
    print(fset - pset)
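The speed-up comes from the lookup itself: "i not in p" scans a list from the start each time, while membership tests and differences on a set use hashing. Below is a self-contained sketch of the same comparison using the csv module. The helper name column and the in-memory sample files are made up for illustration; the column positions (first column of one file, second column of the other) follow the question.

```python
import csv
import io


def column(csv_file, index):
    """Return the set of values found in one column, skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header line, as lines[1:] does in the question
    return {row[index] for row in reader if len(row) > index}


# Hypothetical file contents standing in for first.csv and second.csv.
first = io.StringIO('"id","name"\n"1","a"\n"2","b"\n')
second = io.StringIO('"x","fk"\n"p","1"\n"q","3"\n')

primary_keys = column(first, 0)
foreign_keys = column(second, 1)

# Every foreign key without a matching primary key, found in one pass.
missing = foreign_keys - primary_keys
print(sorted(missing))
```

To keep the original behaviour of failing on a mismatch, one option is to raise AssertionError when missing is non-empty.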

Import textfile - list index out of range

infile = open("/Users/name/Downloads/points.txt", "r")
line = infile.readline()
while line != "":
    line = infile.readline()
    wordlist = line.split()
    x_co = float(wordlist[0])
    y_co = float(wordlist[1])
I looked around but actually didn't find something helpful for my problem.
I have a .txt file with x (first column) and y (second column) coordinates (see picture).
I want every x and y coordinate separated but when I run my code I always get an ERROR:
x_co = float(wordList[0])
IndexError: list index out of range
Thanks for helping!
filename = "/Users/name/Downloads/points.txt"
with open(filename) as infile:
    for line in infile:
        wordlist = line.split()
        x_co = float(wordlist[0])
        y_co = float(wordlist[1])
The with statement closes the file automatically.
For more such idiomatic ways in Python, read this
A better way to do it:
infile = open("/Users/name/Downloads/points.txt", "r")
for line in infile:
    if line.strip():  # a blank line is "\n", which is truthy, so test the stripped text
        wordlist = line.split()
        x_co = float(wordlist[0])
        y_co = float(wordlist[1])
infile.close()
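It may help to see why the loop in the question fails. It calls readline() again before splitting, so the first line is skipped, and at the end of the file readline() returns "", whose split() is an empty list, so wordlist[0] raises IndexError. A blank line in the file has the same effect. The sketch below guards against both cases; the function name read_points and the sample lines are hypothetical.

```python
def read_points(lines):
    """Return two lists of floats, ignoring lines with fewer than two fields."""
    xs, ys = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or incomplete line: nothing to convert
        xs.append(float(fields[0]))
        ys.append(float(fields[1]))
    return xs, ys


# Made-up sample including a blank line and an empty string, as at end of file.
sample = ["1.5 2.5\n", "\n", "3.0 4.0\n", ""]
x_co, y_co = read_points(sample)
print(x_co, y_co)
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.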

Python: How to read a text file containing co-ordinates in row-column format into x-y co-ordinate arrays?

I have a text file with numbers stored in the following format:
1.2378 4.5645
6.789 9.01234
123.43434 -121.0212
... and so on.
I wish to read these values into two arrays, one for x co-ordinates and the other for y co-ordinates. Like, so
x[0] = 1.2378
y[0] = 4.5645
x[1] = 6.789
y[1] = 9.01234
... and so on.
How should I go about reading the text file and storing values?
One method:
x, y = [], []
for l in f:
    row = l.split()
    x.append(float(row[0]))
    y.append(float(row[1]))
where f is the file object (from open() for instance). The float() calls turn the text fields into numbers.
You could also use the csv library:
import csv
x, y = [], []
with open('filename', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))
And you can also use zip to make it more succinct (though possibly less readable):
x, y = zip(*[l.split() for l in f])
where f is the file object, or
import csv
x, y = zip(*csv.reader(f, delimiter=' '))
again where f is the file object. Note that the last two methods load the entire file into memory: the * unpacking has to consume every row before zip is called, so passing a generator expression instead of a list does not avoid that.
Read it line by line, and split each line using split:
with open('f.txt') as f:
    for line in f:
        x, y = line.split()
        #do something meaningful with x and y
Or, if you don't mind storing the whole list in your computer's memory:
with open('f.txt') as f:
    coordinates = [tuple(line.split()) for line in f]
And if you want to store the xs and ys in separate variables:
xes = []
ys = []
with open('f.txt') as f:
    for line in f:
        x, y = line.split()
        xes.append(x)
        ys.append(y)
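Note that split() returns strings, so the values need float() before they can be used as numbers. A small sketch of the zip transposition with that conversion, using the three sample rows from the question held in a list instead of a file:

```python
# The sample rows from the question; a real script would read them from the file.
rows = ["1.2378 4.5645", "6.789 9.01234", "123.43434 -121.0212"]

# Convert every field to float, giving one (x, y) tuple per row.
pairs = [tuple(float(value) for value in row.split()) for row in rows]

# zip(*pairs) transposes the rows into one tuple of x values and one of y values.
x, y = zip(*pairs)
print(x[0], y[0])
print(x[1], y[1])
```

Here x and y are tuples; wrap them in list() if they need to be modified afterwards.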

looping until the number of cells changed is very small

This is a repost because I'm getting weird results. I'm trying to run a simulation loop in a cellular automaton program that changes land use codes based on each cell's adjacent neighbors. I import a text file that creates a dictionary with cell id as key and land use code as value. I also import a text file with each cell's adjacent neighbors. The first time I run the code, 7509 cells change land use based on their neighbors' land uses. If I comment out the part that reads the dictionary text file and run it again, around 5,000 cells change. Run it again and even fewer change, and so on. What I would like to do is run this in a loop until only 0.0001 of the total cells change, and then break out of the loop.
I've tried a while loop, but it's not giving me the results I'm looking for. After the first run, the count is correct at 7509. After that the count is 28,476 over and over again. I don't understand why this is happening, because the count should go back to zero. Can anyone tell me what I'm doing wrong? Here's the code:
import sys, string, csv
#Creating a dictionary of FID: LU_Codes from external txt file
text_file = open("H:\SWAT\NC\FID_Whole_Copy.txt", "rb")
#Lines = text_file.readlines()
FID_GC_dict = dict()
reader = csv.reader(text_file, delimiter='\t')
for line in reader:
    FID_GC_dict[line[0]] = int(line[1])
text_file.close()
#Importing neighbor list file for each FID value
Neighbors_file = open("H:\SWAT\NC\Pro_NL_Copy.txt","rb")
Entries = Neighbors_file.readlines()
Neighbors_file.close()
Neighbors_List = map(string.split, Entries)
#print Neighbors_List
#creates a list of the current FID
FID = [x[0] for x in Neighbors_List]
gridList = []
for nlist in Neighbors_List:
    row = []
    for item in nlist:
        row.append(FID_GC_dict[item])
    gridList.append(row)
#print gridList
#Calculate when to end of one sweep
tot_cells = len(FID)
end_sim = tot_cells
p = 0.0001
#Performs cellular automata rules on land use grid codes
while (end_sim > tot_cells*p):
    i = iter(FID)
    count = 0
    for glist in gridList:
        Cur_FID = i.next()
        Cur_GC = glist[0]
        glist.sort()
        lr_Value = glist[-1]
        if lr_Value < 6:
            tie_LR = glist.count(lr_Value)
            if tie_LR >= 4 and lr_Value > Cur_GC:
                FID_GC_dict[Cur_FID] = lr_Value
                #print "The updated gridcode for FID ", Cur_FID, "is ", FID_GC_dict[Cur_FID]
                count += 1
    end_sim = count
    print end_sim
Thanks for any help....again! :(
I fixed the code so that the simulations stop after the number of cells changed is less than 0.0001 of total cells. I put the while loop in the wrong place. If anyone is interested, here's the revised code for land use cellular automata.
import sys, string, csv
#Creating a dictionary of FID: LU_Codes from external txt file
text_file = open("H:\SWAT\NC\FID_Whole_Copy.txt", "rb")
#Lines = text_file.readlines()
FID_GC_dict = dict()
reader = csv.reader(text_file, delimiter='\t')
for line in reader:
    FID_GC_dict[line[0]] = int(line[1])
text_file.close()
#Importing neighbor list file for each FID value
Neighbors_file = open("H:\SWAT\NC\Pro_NL_Copy.txt","rb")
Entries = Neighbors_file.readlines()
Neighbors_file.close()
Neighbors_List = map(string.split, Entries)
#print Neighbors_List
#creates a list of the current FID
FID = [x[0] for x in Neighbors_List]
#print FID
#Calculate when to end the simulations (negligible change in land use)
tot_cells = len(FID)
end_sim = tot_cells
p = 0.0001
#Performs cellular automata rules on land use grid codes
while (end_sim > tot_cells*p):
    #Rebuild the grid from the updated dictionary on every pass
    gridList = []
    for nlist in Neighbors_List:
        row = []
        for item in nlist:
            row.append(FID_GC_dict[item])
        gridList.append(row)
    #print gridList
    i = iter(FID)
    count = 0
    for glist in gridList:
        Cur_FID = i.next()
        Cur_GC = glist[0]
        glist.sort()
        lr_Value = glist[-1]
        if lr_Value < 6:
            tie_LR = glist.count(lr_Value)
            if tie_LR >= 4 and lr_Value > Cur_GC:
                FID_GC_dict[Cur_FID] = lr_Value
                print "The updated gridcode for FID ", Cur_FID, "is ", FID_GC_dict[Cur_FID]
                count += 1
    end_sim = count
    print count
I don't know which type of cellular automaton you are programming, so this is just a guess, but cellular automata usually work by updating a whole phase at once, ignoring the updated values until the phase is finished.
When I had unexpected results for simple cellular automata, it was because I forgot to apply the phase to a backup grid and was applying it directly to the grid I was working on.
What I mean is that you should have 2 grids, let's call them grid1 and grid2, and do something like
init grid1 with data
while number of generations < total generations needed
    calculate grid2 as the next generation of grid1
    grid1 = grid2 (you replace the real grid with the buffer)
Altering values of grid1 directly will lead to different results, because you will mostly be changing neighbours of cells that still have to be updated before the current phase has finished.
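The two-grid idea can be sketched in Python 3 as follows. This is an illustration under assumptions, not the asker's program: the rule (adopt the highest code among a cell and its neighbours when that code is below 6, occurs at least 4 times, and exceeds the cell's current code) is paraphrased from the code in the question, and the names step, codes and neighbors plus the tiny five-cell grid are hypothetical.

```python
def step(codes, neighbors, code_limit=6, min_ties=4):
    """Compute the next generation from `codes` without modifying it."""
    new_codes = dict(codes)  # the buffer ("grid2"); reads always use `codes`
    changed = 0
    for cell, adjacent in neighbors.items():
        values = [codes[cell]] + [codes[n] for n in adjacent]
        highest = max(values)
        if (highest < code_limit
                and values.count(highest) >= min_ties
                and highest > codes[cell]):
            new_codes[cell] = highest
            changed += 1
    return new_codes, changed


# Hypothetical grid: cell "a" is surrounded by four cells with code 3.
codes = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 3}
neighbors = {"a": ["b", "c", "d", "e"], "b": ["a"], "c": ["a"], "d": ["a"], "e": ["a"]}

threshold = 0.0001 * len(codes)
changed = len(codes)
generations = 0
while changed > threshold:
    # Replace the real grid with the buffer only after the whole sweep.
    codes, changed = step(codes, neighbors)
    generations += 1
print(generations, codes["a"])
```

Because step only reads from the old dictionary, the order in which cells are visited does not affect the result of a generation.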
