Iterate through a loop to change a conditional statement, python - python

New to programming in general. This is my code
for b in range(LengthSpread):
    Strip = ReadSpread[b].rstrip('\n')
    SplitData = ReadSpread[b].split(",")
    PlotID = SplitData[1]
    PlotIDnum = float(PlotID)
    if PlotIDnum == 1:
        List = SplitData
        print List
        OpenBlank.writelines('%s\n\n\n\n\n' % List)
Ultimately I want to find data based on changing each PlotIDnum in the overall dataset. How would I change the number in the conditional if statement without editing it by hand each time? Possibly using a for loop or a while loop. I can't wrap my mind around it.
This is an example of the inputdata
09Tree #PlotID PlotID
1 1 Tree
2 1 Tree
3 2 Tree
4 2 Tree
6 4 Tree
7 5 Tree
8 5 Tree
9 5 Tree
I want my output to be organized by plotID#, and place each output in either a new spreadsheet or have each unique dataset in a new tab
Thanks for any help

I'm not sure how exactly you would like to organize your files, but maybe you could use the plot ID as part of the file name (or name of the tab or whatever). This way you don't even need the extra loop, for example:
for b in range(length_spread):
    data = read_spread[b].rstrip('\n')
    splitted = data.split(',')
    plot_id = splitted[1] # Can keep it as a string
    filename = 'plot_id_' + plot_id + '.file_extension'
    spreadsheet = some_open_method(filename, option='append')
    spreadsheet.writelines('%s\n\n\n\n\n' % splitted)
    spreadsheet.close_method()
Perhaps you could also make use of the with statement:
with some_open_method(filename) as spreadsheet:
    spreadsheet.writelines('%s\n\n\n\n\n' % splitted)
This ensures (if your file-object supports this) that the file is properly closed even if your program encounters an exception during writing to the file.
If you want to use some kind of extra loop, I think this is the simplest case, assuming you know all the plot IDs beforehand:
all_ids = [1, 2, 4, 5]
# Note: using plot_id as integer now
for plot_id in all_ids:
    filename = 'plot_id_%i.file_extension' % plot_id
    spreadsheet = some_open_method(filename, option='write')
    for b in range(length_spread):
        data = read_spread[b].rstrip('\n')
        splitted = data.split(',')
        if plot_id == int(splitted[1]):
            spreadsheet.writelines('%s\n\n\n\n\n' % splitted)
    spreadsheet.close_method()
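If the plot IDs are not known beforehand, one option is to collect the rows per ID in a dictionary in a single pass. This is only a minimal sketch in Python 3 syntax, assuming comma-separated lines with the plot ID in the second column; group_by_plot_id and the sample rows are hypothetical names made up for illustration:

```python
from collections import defaultdict

def group_by_plot_id(lines):
    """Return {plot_id: [row, ...]} for comma-separated lines."""
    groups = defaultdict(list)
    for line in lines:
        row = line.rstrip('\n').split(',')
        # Assumption: the second column holds the plot ID
        groups[row[1]].append(row)
    return dict(groups)

# Hypothetical sample rows in "tree,plot_id,label" form
sample = ['1,1,Tree\n', '2,1,Tree\n', '3,2,Tree\n', '6,4,Tree\n']
groups = group_by_plot_id(sample)

# Visit the IDs in numeric order; each key could name an output file or tab
for plot_id in sorted(groups, key=int):
    print(plot_id, groups[plot_id])
```

Each key of the resulting dictionary could then be used to build a file or tab name, as in the first example above.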

Related

Discover different lines across similar files

I have a text file with many tens of thousands short sentences like this:
go to venice
come back from grece
new york here i come
from belgium to russia and back to spain
I run a tagging algorithm which produces a tagged output of this sentence file:
go to <place>venice</place>
come back from <place>grece</place>
<place>new york</place> here i come
from <place>belgium</place> to <place>russia</place> and back to <place>spain</place>
The algorithm runs over the input multiple times and produces each time slightly different tagging. My goal is to identify those lines where those differences occur. In other words, print all utterances for which the tagging differs across N results files.
For example N=10, I get 10 tagged files. Suppose line 1 is tagged all the time the same for all 10 tagged files - do not print it. Suppose line 2 is tagged once this way and 9 times other way - print it. And so on.
For N=2 it is easy: I just run diff. But what should I do if I have N=10 results?
If you have the tagged files - just create a counter for each line of how many times you've seen it:
# use defaultdict for convenience
from collections import defaultdict
# start counting at 0
counter_dict = defaultdict(lambda: 0)
tagged_file_names = ['tagged1.txt', 'tagged2.txt', ...]
# add all lines of each file to dict
for file_name in tagged_file_names:
    with open(file_name) as f:
        # use enumerate to maintain order
        # produces (LINE_NUMBER, LINE CONTENT) tuples (hashable)
        for line_with_number in enumerate(f.readlines()):
            counter_dict[line_with_number] += 1
# print all values that do not repeat in all files (in same location)
for key, value in counter_dict.iteritems():
    if value < len(tagged_file_names):
        print "line number %d: [%s] only repeated %d times" % (
            key[0], key[1].strip(), value
        )
Walkthrough:
First of all, we create a data structure that lets us count our entries, which are numbered lines. This data structure is a collections.defaultdict with a default value of 0 - which is the count of newly added lines (increased to 1 with each add).
Then, we create the actual entry using a tuple, which is hashable, so it can be used as a dictionary key, and which is by default deeply comparable to other tuples. This means (1, "lolz") is equal to (1, "lolz") but different from (1, "not lolz") or (2, "lolz") - so it fits our use of deep-comparing lines to account for content as well as position.
Now all that's left to do is add all entries using a straightforward for loop and see what keys (which correspond to numbered lines) appear in all files (that is - their value is equal to the number of tagged files provided).
Example:
reut#tHP-EliteBook-8470p:~/python/counter$ cat tagged1.txt
123
abc
def
reut#tHP-EliteBook-8470p:~/python/counter$ cat tagged2.txt
123
def
def
reut#tHP-EliteBook-8470p:~/python/counter$ ./difference_counter.py
line number 1: [abc] only repeated 1 times
line number 1: [def] only repeated 1 times
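An alternative that avoids counting is to walk the N taggings in parallel and report every position where they disagree. This is a rough sketch in Python 3 syntax, not the code above; it assumes all files have the same number of lines (zip stops at the shortest input), and differing_lines and the in-memory sample lists are hypothetical stand-ins for the real files:

```python
def differing_lines(taggings):
    """taggings: list of N lists of lines, one list per tagged file.

    Returns (line_number, variants) for every line that is not tagged
    identically in all N inputs. Line numbers start at 0.
    """
    result = []
    for line_number, versions in enumerate(zip(*taggings)):
        variants = set(v.strip() for v in versions)
        if len(variants) > 1:
            result.append((line_number, sorted(variants)))
    return result

# Hypothetical contents of three tagged files
tagged1 = ['123\n', 'abc\n', 'def\n']
tagged2 = ['123\n', 'def\n', 'def\n']
tagged3 = ['123\n', 'abc\n', 'def\n']

diffs = differing_lines([tagged1, tagged2, tagged3])
for number, variants in diffs:
    print(number, variants)
```

With real files, each list could be obtained with open(name).readlines().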
If you compare all of them to the first text, then you can get a list of all texts that are different. This might not be the quickest way, but it would work.
import difflib
n1 = '1 2 3 4 5 6'
n2 = '1 2 3 4 5 6'
n3 = '1 2 4 5 6 7'
l = [n1, n2, n3]
# keep only the texts that differ from the first one
m = [x for x in l if x != l[0]]
# unified_diff expects two sequences of strings, so split into tokens
for other in m:
    diff = difflib.unified_diff(l[0].split(), other.split(), lineterm='')
    print '\n'.join(diff)

Python: Assign column ids and perform calc depending on column position and id

I'm iterating over a file with only a few hundred lines. The data on each line is tab delimited and is essentially (actually around 50 entries per line but you get the idea)
ID_value ctrl_val_A ctrl_val_B ctrl_val_C cond1_val_A cond1_val_B cond1_val_C cond2_val_A cond2_val_B cond2_val_C
On every line I want to perform a simple calculation for each of the condx_val_y (cond/ctrl). The trick is I only want to calculate using the relevant control (either A, B or C).
I'm not sure of the best (most pythonic) way to do this. I have been pushing the line into a list with line.split('\t') but perhaps list comprehension isn't the best way to go..
I'm sure there is a simple solution, I'm just not using the right search terms or something. Any help would be massively appreciated!!!
Thanks in advance.
Your items are aligned in groups of three, so you can use the modulus operator (%) to select the right value for a condition.
values = [Avalue, Bvalue, Cvalue]
0 % 3 = 0 # values[0 % 3] => select A
1 % 3 = 1 # select B
2 % 3 = 2 # select C
3 % 3 = 0 # select A again
4 % 3 = 1 # select B again...
# ...
Example code:
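#!/usr/bin/env python2.7
# encoding: utf-8
'''
'''
FNAME = ...
for line in open(FNAME):
    # drop the trailing newline so the last field stays clean
    row = line.rstrip('\n').split('\t')
    values = row[1:4]     # the three controls: A, B, C
    conditions = row[4:]  # cond1_A, cond1_B, cond1_C, cond2_A, ...
    for i, cond in enumerate(conditions):
        v = values[i % 3]  # the control matching this condition
        # use cond and v to calculate your value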
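To make the calculation step concrete, here is a small sketch in Python 3 syntax that divides each condition value by its matching control. It assumes the fields are numeric and that the layout is exactly ID, three controls, then conditions in A, B, C order; ratios_for_line and the sample line are hypothetical:

```python
def ratios_for_line(line, n_controls=3):
    """Return (row_id, [cond / matching control, ...]) for one tab-delimited line."""
    fields = line.rstrip('\n').split('\t')
    row_id = fields[0]
    controls = [float(x) for x in fields[1:1 + n_controls]]
    conditions = [float(x) for x in fields[1 + n_controls:]]
    # i % n_controls cycles 0, 1, 2, 0, 1, 2, ... and so picks control A, B, C in turn
    ratios = [cond / controls[i % n_controls]
              for i, cond in enumerate(conditions)]
    return row_id, ratios

# Hypothetical line: ID, ctrl A/B/C, cond1 A/B/C, cond2 A/B/C
line = 'gene1\t2.0\t4.0\t5.0\t4.0\t4.0\t10.0\t1.0\t8.0\t2.5\n'
row_id, ratios = ratios_for_line(line)
print(row_id, ratios)
```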

Vector data from a file

I am just starting out with Python. I have some fortran and some Matlab skills, but I am by no means a coder. I need to post-process some output files.
I can't figure out how to read each value into the respective variable. The data looks something like this:
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
...
In my limited Matlab notation, I would like to assign the following variables of the form tN1(i,j) something like this:
tN1(1,1)=5097600; tN1(1,2)=5443200; tN2(1,3)=8467200; #time between 'h' and 'N#'
hmN1(1,1)=2348.13; hmN1(1,2)=2348.12; hmN2(1,3)=2348.11; #value in 2nd column
hsN1(1,1)=2348.35; hsN1(1,2)=2348.36; hsN2(1,3)=2348.39; #value in 3rd column
I will have about 30 sets, or tN1(1:30,1:j); hmN1(1:30,1:j);hsN1(1:30,1:j)
I know it may not seem like it, but I have been trying to figure this out for 2 days now. I am trying to learn this on my own and it seems I am missing something fundamental in my understanding of python.
I wrote a simple script which does what you ask. It creates three dictionaries: t, hm and hs. These will have the N values as keys.
import csv
import re
path = 'vector_data.txt'
# Using the <with func as obj> syntax handles the closing of the file for you.
with open(path) as in_file:
    # Use the csv package to read csv files
    csv_reader = csv.reader(in_file, delimiter=' ')
    # Create empty dictionaries to store the values
    t = dict()
    hm = dict()
    hs = dict()
    # Iterate over all rows
    for row in csv_reader:
        # Get the <n> and <t_i> values by using regular expressions, only
        # save the integer part (hence [1:] and [1:-1])
        n = int(re.findall('N[0-9]+', row[0])[0][1:])
        t_i = int(re.findall('h.+N', row[0])[0][1:-1])
        # Cast the other values to float
        hm_i = float(row[1])
        hs_i = float(row[2])
        # Try to append the values to an existing list in the dictionaries.
        # If that fails, new lists are added to the dictionaries.
        try:
            t[n].append(t_i)
            hm[n].append(hm_i)
            hs[n].append(hs_i)
        except KeyError:
            t[n] = [t_i]
            hm[n] = [hm_i]
            hs[n] = [hs_i]
Output:
>> t
{1: [5097600, 5443200], 2: [8467200]}
>> hm
{1: [2348.13, 2348.12], 2: [2348.11]}
>> hs
{1: [2348.35, 2348.36], 2: [2348.39]}
(remember that Python uses zero-indexing)
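The same idea can be written without the try/except by letting collections.defaultdict create the lists on first use. This is a sketch in Python 3 syntax under the assumption that every data line starts with a label of the form h<time>N<set> followed by whitespace-separated numbers; parse_lines and LABEL_RE are hypothetical names:

```python
import re
from collections import defaultdict

# Assumed label layout: 'h', the time, 'N', the set number
LABEL_RE = re.compile(r'h(\d+)N(\d+)')

def parse_lines(lines):
    """Return three dicts (t, hm, hs) keyed by the set number N."""
    t, hm, hs = defaultdict(list), defaultdict(list), defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or short lines
        match = LABEL_RE.match(fields[0])
        if match is None:
            continue  # skip lines that do not carry a data label
        n = int(match.group(2))
        t[n].append(int(match.group(1)))
        hm[n].append(float(fields[1]))
        hs[n].append(float(fields[2]))
    return dict(t), dict(hm), dict(hs)

sample = [
    'h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438',
    'h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651',
    'h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627',
]
t, hm, hs = parse_lines(sample)
print(t)
print(hm)
print(hs)
```

Splitting on arbitrary whitespace with line.split() also tolerates columns separated by more than one space, which a csv reader with a single-space delimiter would not.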
Thanks for all your comments. Suggested readings led to other things which helped. Here is what I came up with:
if len(line) >= 45:
    if line[0:45] == " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS": #! indicates data to follow, after 4 lines of junk text
        for i in range (0,4):
            junk = file.readline()
        for i in range (0,int(nobs)):
            line = file.readline()
            sline = line.split()
            obsname.append(sline[0])
            hm.append(sline[1])
            hs.append(sline[2])

For-loop to count differences in lines with python

I have a file filled with lines like this (this is just a small bit of the file):
9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae
The number refers to a cluster, and then it goes 'Genus' 'Species' 'Family'.
What I want to do is write a program that will look through each line and report back to me: a list of the different genera in each cluster, and how many of each of those genera are in the cluster. So I'm interested in cluster number and the first 'word' in each line.
My trouble is that I'm not sure how to get this information. I think I need to use a for-loop, starting at lines that begin with '0'. The output would be a file that looks something like:
Cluster 0: Brucella(2) # Lists cluster, followed by genera in cluster with number, something like that.
Cluster 1: Streptomyces(2)
Cluster 2: Mycobacterium(1)
etc.
Eventually I want to do the same kind of count with the Families in each cluster, and then Genera and Species together. Any thoughts on how to start would be greatly appreciated!
I thought this would be a fun little toy project, so I wrote a little hack to read in an input file like yours from stdin, count and format the output recursively and spit out output that looks a little like yours, but with a nested format, like so:
Cluster 0:
    Brucella(2)
        melitensis(1)
            Brucellaceae(1)
        neotomae(1)
            Brucellaceae(1)
    Streptomyces(1)
        neotomae(1)
            Brucellaceae(1)
Cluster 1:
    Streptomyces(2)
        geysiriensis(1)
            Streptomycetaceae(1)
        minutiscleroticus(1)
            Streptomycetaceae(1)
Cluster 2:
    Mycobacterium(1)
        phocaicum(1)
            Mycobacteriaceae(1)
Cluster 7:
    Mycobacterium(2)
        gastri(1)
            Mycobacteriaceae(1)
        kansasii(1)
            Mycobacteriaceae(1)
Cluster 9:
    Hyphomicrobium(2)
        facile(2)
            Hyphomicrobiaceae(2)
Cluster 10:
    Streptomyces(2)
        niger(1)
            Streptomycetaceae(1)
        olivaceiscleroticus(1)
            Streptomycetaceae(1)
I also added some junk data to my table so that I could see an extra entry in Cluster 0, separated from the other two... The idea here is that you should be able to see a top level "Cluster" entry and then nested, indented entries for genus, species, family... it shouldn't be hard to extend for deeper trees, either, I hope.
# Sys for stdio stuff
import sys
# re for the re.split -- this can go if you find another way to parse your data
import re
# A global (shame on me) for storing the data we're going to parse from stdin
data = []
# read lines from standard input until it's empty (end-of-file)
for line in sys.stdin:
    # Split lines on spaces (gobbling multiple spaces for robustness)
    # and trim whitespace off the beginning and end of input (strip)
    entry = re.split("\s+", line.strip())
    # Throw the array into my global data array, it'll look like this:
    # [ "0", "Brucella", "melitensis", "Brucellaceae" ]
    # A lot of this code assumes that the first field is an integer, what
    # you call "cluster" in your problem description
    data.append(entry)
# Sort, first key is expected to be an integer, and we want a numerical
# sort rather than a string sort, so convert to int, then sort by
# each subsequent column. The lambda is a function that returns a tuple
# of keys we care about for each item
data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))
# Our recursive function -- we're basically going to treat "data" as a tree,
# even though it's not.
# parameters:
#   start - an integer telling us what line to begin working from so we needn't
#           walk the whole tree each time to figure out where we are.
#   super - An array that captures where we are in the search. This array
#           will have more elements in it as we deepen our traversal of the "tree"
#           Initially, it will be []
#           In the next ply of the tree, it will be [ '0' ]
#           Then something like [ '0', 'Brucella' ] and so on.
#   data - The global data structure -- this never mutates after the sort above,
#          I could have just used the global directly
def groupedReport(start, super, data):
    # Figure out what ply we're on in our depth-first traversal of the tree
    depth = len(super)
    # Count entries in the super class, starting from "start" index in the array:
    count = 0
    # For the few records in the data file that match our "super" exactly, we count
    # occurrences.
    if depth != 0:
        for i in range(start, len(data)):
            if (data[i][0:depth] == data[start][0:depth]):
                count = count + 1
            else:
                break; # We can stop counting as soon as a match fails,
                       # because of the way our input data is sorted
    else:
        count = len(data)
    # At depth == 1, we're reporting about clusters -- this is the only piece of
    # the algorithm that's not truly abstract, and it's only for presentation
    if (depth == 1):
        sys.stdout.write("Cluster " + super[0] + ":\n")
    elif (depth > 0):
        # Every other depth: indent with 4 spaces for every ply of depth, then
        # output the unique field we just counted, and its count
        sys.stdout.write((' ' * ((depth - 1) * 4)) +
                         data[start][depth - 1] + '(' + str(count) + ')\n')
    # Recursion: we're going to figure out a new depth and a new "super"
    # and then call ourselves again. We break out on depth == 4 because
    # of one other assumption (I lied before about the abstract thing) I'm
    # making about our input data here. This could
    # be made more robust/flexible without a lot of work
    subsuper = None
    substart = start
    for i in range(start, start + count):
        record = data[i] # The original record from our data
        newdepth = depth + 1
        if (newdepth > 4): break;
        # array splice creates a new copy
        newsuper = record[0:newdepth]
        if newsuper != subsuper:
            # Recursion here!
            groupedReport(substart, newsuper, data)
            # Track our new "subsuper" for subsequent comparisons
            # as we loop through matches
            subsuper = newsuper
        # Track position in the data for next recursion, so we can start on
        # the right line
        substart = substart + 1
# First call to groupedReport starts the recursion
groupedReport(0, [], data)
If you make my Python code into a file like "classifier.py", then you can run your input.txt file (or whatever you call it) through it like so:
cat input.txt | python classifier.py
Most of the magic of the recursion, if you care, is implemented using slices of arrays and leans heavily on the ability to compare array slices, as well as the fact that I can order the input data meaningfully with my sort routine. You may want to convert your input data to all-lowercase, if it is possible that case inconsistencies could yield mismatches.
It is easy to do.
Create an empty dict {} to store your result; let's call it 'result'.
Loop over the data line by line.
Split the line on spaces to get 4 elements as per your structure: cluster, genus, species, family.
Increment the count of each genus inside its cluster key as it is found in the current loop; the count has to be set to 1 on the first occurrence, though.
result = { '0': { 'Brucella': 2} ,'1':{'Streptomyces':2}..... }
Code:
my_data = """9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae"""
result = {}
for line in my_data.split("\n"):
    cluster,genus,species,family = line.split(" ")
    result.setdefault(cluster,{}).setdefault(genus,0)
    result[cluster][genus] += 1
print(result)
{'10': {'Streptomyces': 2}, '1': {'Streptomyces': 2}, '0': {'Brucella': 2}, '2': {'Mycobacterium': 1}, '7': {'Mycobacterium': 2}, '9': {'Hyphomicrobium': 2}}
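To get from such a dictionary to the "Cluster N: Genus(count)" layout asked for in the question, the counts can be formatted with the clusters sorted numerically. A possible sketch in Python 3 syntax, using collections.Counter instead of setdefault; genus_report is a hypothetical helper and the sample is a subset of the data above:

```python
from collections import Counter, defaultdict

def genus_report(lines):
    """Return one 'Cluster N: Genus(count), ...' string per cluster."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # The cluster is converted to int so that 10 sorts after 2
        counts[int(fields[0])][fields[1]] += 1
    report = []
    for cluster in sorted(counts):
        genera = ', '.join('%s(%d)' % (genus, n)
                           for genus, n in sorted(counts[cluster].items()))
        report.append('Cluster %d: %s' % (cluster, genera))
    return report

sample = [
    '10 Streptomyces niger Streptomycetaceae',
    '0 Brucella neotomae Brucellaceae',
    '0 Brucella melitensis Brucellaceae',
    '2 Mycobacterium phocaicum Mycobacteriaceae',
]
report = genus_report(sample)
for row in report:
    print(row)
```

Counting families instead would only require changing the field index from 1 to 3.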

Matching strings for multiple data set in Python

I am working in Python and I need to match the strings of several data files. First I used pickle to unpack my files and then I placed them into lists. I only want to match strings that have the same conditions. These conditions are indicated at the end of the string.
My working script looks approximately like this:
import pickle
f = open("data_a.dat")
list_a = pickle.load( f )
f.close()
f = open("data_b.dat")
list_b = pickle.load( f )
f.close()
f = open("data_c.dat")
list_c = pickle.load( f )
f.close()
f = open("data_d.dat")
list_d = pickle.load( f )
f.close()
for a in list_a:
    for b in list_b:
        for c in list_c:
            for d in list_d:
                if a.GetName()[12:] in b.GetName():
                    if a.GetName()[12:] in c.GetName():
                        if a.GetName()[12:] in d.GetName():
                            "do whatever"
This seems to work fine for these 2 lists. The problems begin when I try to add 8 or 9 more data files for which I also need to match the same conditions. The script simply won't finish and gets stuck. I appreciate your help.
Edit: Each of the lists contains histograms named after the parameters that were used to create them. The name of the histograms contains these parameters and their values at the end of the string. In the example I did it for 2 data sets, now I would like to do it for 9 data sets without using multiple loops.
Edit 2. I just expanded the code to reflect more accurately what I want to do. Now if I try to do that for 9 lists, it does not only look horrible, but it also doesn't work.
Off the top of my head:
import pickle
files = ["data_a.dat", "data_b.dat", "data_c.dat"]
sets = []
for name in files:
    # open each file in turn (not the same one every time)
    with open(name, "rb") as f:
        sets.append(set(pickle.load(f)))
intersection = sets[0].intersection(*sets[1:])
EDIT: Well I overlooked your mapping to x.GetName()[12:], but you should be able to reduce your problem to set logic.
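One way the set logic could take the name mapping into account is to build one set of name suffixes per data file and intersect those. This is only a sketch in Python 3 syntax: it uses plain strings as hypothetical stand-ins for the values returned by GetName(), and it assumes that every name carries a sample prefix of exactly 12 characters and that matching means the suffixes are identical:

```python
def common_suffixes(name_lists, prefix_len=12):
    """Return the name suffixes (after prefix_len characters) found in every list."""
    suffix_sets = [set(name[prefix_len:] for name in names)
                   for names in name_lists]
    return set.intersection(*suffix_sets)

# Hypothetical histogram names; 'datasampleN-' is 12 characters long
names_a = ['datasample1-par1=1-par2=1', 'datasample1-par1=1-par2=2']
names_b = ['datasample2-par1=1-par2=2', 'datasample2-par1=2-par2=2']
names_c = ['datasample3-par1=1-par2=2']

common = common_suffixes([names_a, names_b, names_c])
print(common)
```

Each set is built in one pass over its list, so adding more data files adds work linearly instead of adding another nested loop.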
Here is a small piece of code you can draw inspiration from. The main idea is the use of a recursive function.
For simplicity's sake, I assume the data is already loaded into lists, but you can load it from files first:
data_files = [
    'data_a.dat',
    'data_b.dat',
    'data_c.dat',
    'data_d.dat',
    'data_e.dat',
]
lists = [pickle.load(open(f)) for f in data_files]
And because I don't really get the details of what you need to do, my goal here is to find the matches on the first four characters:
def do_wathever(string):
    print "I have match the string '%s'" % string
lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]
positions = [0 for i in range(len(lists))]
def recursive_match(positions, lists):
    strings = map(lambda p, l: l[p], positions, lists)
    match = True
    searched_string = strings.pop(0)[:4]
    for string in strings:
        if searched_string not in string:
            match = False
            break
    if match:
        do_wathever(searched_string)
    # increment positions:
    new_positions = positions[:]
    lists_len = len(lists)
    for i, l in enumerate(reversed(lists)):
        max_position = len(l)-1
        list_index = lists_len - i - 1
        current_position = positions[list_index]
        if max_position > current_position:
            new_positions[list_index] += 1
            break
        else:
            new_positions[list_index] = 0
            continue
    return new_positions, not any(new_positions)
search_is_finished = False
while not search_is_finished:
    positions, search_is_finished = recursive_match(positions, lists)
Of course you can optimize a lot of things here, this is draft code, but take a look at the recursive function, this is a major concept.
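The position-stepping loop above appears to visit every combination of one item per list, which is what itertools.product from the standard library yields. A shorter sketch of the same search in Python 3 syntax, as a swapped-in technique rather than the code above; find_matches is a hypothetical name:

```python
from itertools import product

def find_matches(lists, prefix_len=4):
    """Return the prefixes of first-list items contained in one item of every other list."""
    matches = []
    # product(*lists) yields every tuple with one item taken from each list
    for combo in product(*lists):
        prefix = combo[0][:prefix_len]
        if all(prefix in other for other in combo[1:]):
            matches.append(prefix)
    return matches

lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]
matches = find_matches(lists)
print(matches)
```

This still examines every combination, so the running time grows with the product of the list lengths; it tidies the code but does not by itself make nine large lists tractable.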
In the end I ended up using the map built in function. I realize now I should have been even more explicit than I was (which I will do in the future).
My data files are histograms with 5 parameters, some with 3 or 4. Something like this,
par1=["list with some values"]
par2=["list with some values"]
par3=["list with some values"]
par4=["list with some values"]
par5=["list with some values"]
I need to examine the behavior of the quantity plotted for each possible combination of the values of the parameters. In the end, I get a data file with ~300 histograms each identified in their name with the corresponding values of the parameters and the sample name. It looks something like,
datasample1-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample1-"permutation of the above values"
...
datasample9-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample9-"permutation of the above values"
So I get 300 histograms for each of the 9 data files, but luckily all of these histograms are created in the same order. Hence I can pair all of them just by using the built-in map function. I unpack the data files, put each one into a list and then use the map function to pair each histogram with its corresponding configuration in the other data samples.
for lst in map(None, data1_histosli, data2_histosli, ...data9_histosli):
    do_something(lst)
This solves my problem. Thank you to all for your help!
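Note that map(None, ...) only works in Python 2, where it pads shorter lists with None. In Python 3 the same pairing can be sketched with zip, or with itertools.zip_longest when the padding behaviour is wanted. The list names and contents below are hypothetical placeholders for the unpacked histogram lists:

```python
from itertools import zip_longest

# Hypothetical stand-ins for three of the unpacked histogram lists
data1 = ['h1_cfgA', 'h1_cfgB', 'h1_cfgC']
data2 = ['h2_cfgA', 'h2_cfgB', 'h2_cfgC']
data3 = ['h3_cfgA', 'h3_cfgB', 'h3_cfgC']

# zip pairs the i-th item of every list and stops at the shortest list
paired = list(zip(data1, data2, data3))

# zip_longest pads missing items with None, like Python 2's map(None, ...)
padded = list(zip_longest(data1, data2[:2], data3))

for group in paired:
    print(group)
```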
