Vector data from a file - python

I am just starting out with Python. I have some fortran and some Matlab skills, but I am by no means a coder. I need to post-process some output files.
I can't figure out how to read each value into the respective variable. The data looks something like this:
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
...
In my limited Matlab notation, I would like to assign the following variables of the form tN1(i,j) something like this:
tN1(1,1)=5097600; tN1(1,2)=5443200; tN2(1,3)=8467200; #time between 'h' and 'N#'
hmN1(1,1)=2348.13; hmN1(1,2)=2348.12; hmN2(1,3)=2348.11; #value in 2nd column
hsN1(1,1)=2348.35; hsN1(1,2)=2348.36; hsN2(1,3)=2348.39; #value in 3rd column
I will have about 30 sets, or tN1(1:30,1:j); hmN1(1:30,1:j);hsN1(1:30,1:j)
I know it may not seem like it, but I have been trying to figure this out for 2 days now. I am trying to learn this on my own and it seems I am missing something fundamental in my understanding of python.

I wrote a simple script which does what you ask. It creates three dictionaries, t, hm and hs, each keyed by the N value.
import csv
import re
path = 'vector_data.txt'
# Using the <with func as obj> syntax handles the closing of the file for you.
with open(path) as in_file:
    # Use the csv package to read csv files
    csv_reader = csv.reader(in_file, delimiter=' ')
    # Create empty dictionaries to store the values
    t = dict()
    hm = dict()
    hs = dict()
    # Iterate over all rows
    for row in csv_reader:
        # Get the <n> and <t_i> values by using regular expressions, only
        # save the integer part (hence [1:] and [1:-1])
        n = int(re.findall('N[0-9]+', row[0])[0][1:])
        t_i = int(re.findall('h.+N', row[0])[0][1:-1])
        # Cast the other values to float
        hm_i = float(row[1])
        hs_i = float(row[2])
        # Try to append the values to an existing list in the dictionaries.
        # If that fails, new lists are added to the dictionaries.
        try:
            t[n].append(t_i)
            hm[n].append(hm_i)
            hs[n].append(hs_i)
        except KeyError:
            t[n] = [t_i]
            hm[n] = [hm_i]
            hs[n] = [hs_i]
Output:
>> t
{1: [5097600, 5443200], 2: [8467200]}
>> hm
{1: [2348.13, 2348.12], 2: [2348.11]}
>> hs
{1: [2348.35, 2348.36], 2: [2348.39]}
(remember that Python uses zero-indexing)
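The same grouping can be sketched with collections.defaultdict and a single regular expression, which avoids the try/except. This is only a minimal sketch: the sample lines are copied from the question, and the names SAMPLE, NAME_RE and parse_observations are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample lines, copied from the question.
SAMPLE = """\
h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438
h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651
h8467200N2 2348.11 2348.39 -0.2813 20.0 -5.627
"""

# One pattern captures both the time and the N index from the first column.
NAME_RE = re.compile(r"h(\d+)N(\d+)")


def parse_observations(lines):
    """Group time, hm and hs values by the N index found in column one."""
    t, hm, hs = defaultdict(list), defaultdict(list), defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        match = NAME_RE.fullmatch(fields[0])
        if match is None:
            continue  # skip names that do not look like h<time>N<n>
        n = int(match.group(2))
        t[n].append(int(match.group(1)))
        hm[n].append(float(fields[1]))
        hs[n].append(float(fields[2]))
    return dict(t), dict(hm), dict(hs)


t, hm, hs = parse_observations(SAMPLE.splitlines())
print(t)
print(hm)
print(hs)
```

Passing an open file object instead of SAMPLE.splitlines() should work the same way, since the function only iterates over lines.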

Thanks for all your comments. Suggested readings led to other things which helped. Here is what I came up with:
if len(line) >= 45:
    if line[0:45] == " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS": #! indicates data to follow, after 4 lines of junk text
        for i in range (0,4):
            junk = file.readline()
        for i in range (0,int(nobs)):
            line = file.readline()
            sline = line.split()
            obsname.append(sline[0])
            hm.append(sline[1])
            hs.append(sline[2])
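For readers who want to run the idea above on its own, here is a self-contained sketch. It assumes the layout described in the snippet (a marker line, four junk lines, then nobs data rows). The sample text, the function name read_observations, and the conversion to float are assumptions added for illustration, not part of the original script.

```python
import io

MARKER = " FIT OF SIMULATED EQUIVALENTS TO OBSERVATIONS"

# Hypothetical file contents: marker line, 4 junk lines, then the data rows.
SAMPLE = (
    MARKER + "\n"
    "junk 1\njunk 2\njunk 3\njunk 4\n"
    "h5097600N1 2348.13 2348.35 -0.2219 20.0 -4.438\n"
    "h5443200N1 2348.12 2348.36 -0.2326 20.0 -4.651\n"
)


def read_observations(handle, nobs, junk_lines=4):
    """Return (obsname, hm, hs) lists for the block that follows MARKER."""
    obsname, hm, hs = [], [], []
    for line in handle:
        if not line.startswith(MARKER):
            continue
        for _ in range(junk_lines):
            next(handle)  # discard the junk lines after the marker
        for _ in range(nobs):
            fields = next(handle).split()
            obsname.append(fields[0])
            hm.append(float(fields[1]))
            hs.append(float(fields[2]))
        break  # only the first block is read in this sketch
    return obsname, hm, hs


obsname, hm, hs = read_observations(io.StringIO(SAMPLE), nobs=2)
print(obsname, hm, hs)
```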

Related

Using an if statement to pass through variables to further functions for python

I am a biologist that is just trying to use python to automate a ton of calculations, so I have very little experience.
I have a very large array that contains values that are formatted into two columns of observations. Sometimes the observations will be the same between the columns:
v1,v2
x,y
a,b
a,a
x,x
In order to save time and effort I wanted to make an if statement that just prints 0 if the two columns are the same and then moves on. If the values are the same there is no need to run those instances through the downstream analyses.
This is what I have so far just to test out the if statement. It has yet to recognize any instances where the columns are equivalent.
Script:
mylines=[]
with open('xxxx','r') as myfile:
    for myline in myfile:
        mylines.append(myline) ##reads the data into the two column format mentioned above
rang=len(open('xxxxx','r').readlines()) ##returns the number of lines in the file
for x in range(1, rang):
    li = mylines[x] ##selected row as defined by x and the number of lines in the file
    spit = li.split(',',2) ##splits the selected values so they can be accessed separately
    print(spit[0]) ##first value
    print(spit[1]) ##second value
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Output:
192Alhe52
192Alhe52
Issue ##should be 0
188Alhe48
192Alhe52
Issue
191Alhe51
192Alhe52
Issue
How do I get python to recognize that certain observations are actually equal?
When you read the lines and store them in the list, each one still ends with '\n', the newline character, so your list actually looks like this:
print(mylines)
['x,y\n', 'a,b\n', 'a,a\n', 'x,x\n']
To work around this, use strip(), which removes that character along with any leading or trailing whitespace that would also affect the comparison:
mylines.append(myline.strip())
You shouldn't use rang=len(open('xxxxx','r').readlines()), because it reads the file a second time. The lines are already in mylines:
rang=len(mylines)
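To see the effect of the trailing newline described above, here is a minimal sketch that uses an in-memory stand-in for the file (io.StringIO and the sample contents are assumptions for illustration):

```python
import io

# Hypothetical in-memory stand-in for the two-column file.
fake_file = io.StringIO("v1,v2\nx,y\na,a\n")
lines = list(fake_file)

raw = lines[2].split(",")            # second field keeps its newline
clean = lines[2].strip().split(",")  # newline removed before splitting

print(repr(raw[1]), raw[0] == raw[1])
print(repr(clean[1]), clean[0] == clean[1])
```

Printing with repr() makes the hidden '\n' visible, which is a quick way to diagnose comparisons that fail unexpectedly.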
There is a more readable, Pythonic way to write your for loop:
for li in mylines[1:]:
    spit = li.split(',')
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Or even
for spit in (li.split(',') for li in mylines[1:]):
    if spit[0] == spit[1]:
        print(0)
    else:
        print('Issue')
Both iterate over the list mylines starting from the second element (index 1), which skips the header line.
Also, if you're interested in python packages, you should have a look at pandas. Assuming you have a csv file:
import pandas as pd
df = pd.read_csv('xxxx')
for i, elements in df.iterrows():
    if elements['v1'] == elements['v2']:
        print('Equal')
    else:
        print('Different')
will do the trick. If you need to modify values and write another file
df.to_csv('nameYouWant')
For one, your issue with the equals test might be because iterating over lines like this also yields the newline character. There is a string function that can get rid of that, .strip(). Also, your argument to split is 2, which splits your row into three groups - but that probably doesn't show here. You can avoid having to parse it yourself when using the csv module, as your file presumably is that:
import csv
with open("yourfile.txt") as file:
    reader = csv.reader(file)
    next(reader) # skip header
    for first, second in reader:
        print(first)
        print(second)
        if first == second:
            print(0)
        else:
            print("Issue")

Python How do I sum the data values

Basically currently my program reads the Data file (electric info), sums the values up, and after summing the values, it changes all the negative numbers to 0, and keeps the positive numbers as they are. The program does this perfectly. This is the code I currently have:
import csv
from datetime import timedelta
from collections import defaultdict
def convert(item):
    try:
        return float(item)
    except ValueError:
        return 0
sums = defaultdict(list)
def daily():
    lista = []
    with open('Data.csv', 'r') as inp:
        reader = csv.reader(inp, delimiter = ';')
        headers = next(reader)
        for line in reader:
            mittaus = max(0,sum([convert(i) for i in line[1:-2]]))
            lista.append((line[0], mittaus))
            #print(line[0],mittaus) ('#'only printing to check that it works ok)
daily()
My question is: How can I save the data to lists, so I can use them, and add all the values per day, so should look something like this:
1.1.2016;358006
2.1.2016;39
3.1.2016;0 ...
8.1.2016;239143
After having these in a list (to save later on to a new data file), it should calculate the cumulative values straight after, and should look like this:
1.1.2016;358006
2.1.2016;358045
3.1.2016;358045...
8.1.2016;597188
Having done this, it should be ready to write the data to a new csv file.
A small peek at what's in the data file: https://pastebin.com/9HxwcixZ [It's actually delimited with ';', not with ' ' as in the pastebin]
The data file: https://files.fm/u/yuf4bbuk
I have clarified the questions, so you might have seen me ask before. These should be done without external libraries. I hope to find some help.
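One possible shape for the daily totals and the running total, using only the standard library, is sketched below. It assumes each row has already been reduced to a (date, value) pair, where the value is a clamped row total like mittaus above. The sample rows are invented so that the results match the first three lines of the expected output, and the real timestamp format in Data.csv is not known here.

```python
from itertools import accumulate

# Hypothetical (date, value) rows; each value stands for one already-clamped
# row total like `mittaus` in the question.
rows = [
    ("1.1.2016", 358000.0),
    ("1.1.2016", 6.0),
    ("2.1.2016", 39.0),
    ("3.1.2016", 0.0),
]

# Sum the values per day; dicts keep insertion order (Python 3.7+).
daily = {}
for date, value in rows:
    daily[date] = daily.get(date, 0.0) + value

# Running total over the daily sums, in the same order as the dates.
dates = list(daily)
cumulative = list(accumulate(daily.values()))

for date, total in zip(dates, cumulative):
    print("{};{:.0f}".format(date, total))
```

The (date, total) pairs could then be written out with csv.writer using delimiter=';'.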

Add the value of two different specific csv columns

So I currently get a .csv file that looks like this:
HostType,Number
Windows_Desktop,84
Linux_Desktop,12
Windows_Desktop,60
Linux_Desktop,7
I am trying to write a script that performs a function based on the total value. So I have two global variables:
WINDOWS = 0
LINUX = 0
I am trying to make it so that the function adds the two Window_Desktop numbers together, and Linux_Desktop numbers together. So something like..
def count_function():
    global WINDOWS
    global LINUX
    count_file = open('counts.csv', 'rb')
    reader = csv.reader(count_file)
    WINDOWS = float(row[2]) + float(otherrow[2])
    LINUX = float(row[2]) + float(otherrow[2])
(I know this is very wrong syntax, just a brief example of what I am trying to figure out)
But I don't know how to specify column and row I want to add together. They are always in the same place. Windows is always 2 and 4, Linux is always in 3 and 5. So I don't need to regex them by any means. I am just trying to figure out how to do Row 2 Column 2 + Row 4 Column 2.
Basically, I am ultimately trying to do something like:
if WINDOWS < 80:
    some_function()
Although I have that part figured out, it's getting the numbers to add up that I can't seem to figure out, no matter how many times I bash my head.
You need to identify the type of thing you are collecting by analyzing the contents of the first column. Since you are collecting Windows and Linux totals, you can use a dictionary to collect these data.
Try this version:
import csv
from collections import defaultdict
data = defaultdict(float) # the default value of a key
                          # that doesn't exist yet is 0.0
with open('yourfile.csv') as f:
    reader = csv.reader(f)
    next(reader) # This will skip the header
    for row in reader:
        data[row[0].split('_')[0].strip()] += float(row[1])
if data['Windows'] < 80:
    print('Do stuff')
for key, value in data.items():
    print('Value for {} is {}'.format(key, value))
I would highly recommend using the Pandas package. It is very useful for working with csv files.
import pandas as pd
df = pd.read_csv("/Users/daddy30000/Dropbox/Stackoverflow/16_06_20_example.csv")
windows = df.loc[df['HostType'] == 'Windows_Desktop', 'Number'].sum()
linux = df.loc[df['HostType'] == 'Linux_Desktop', 'Number'].sum()
print(windows)
>>> 144
print(linux)
>>> 19
Note that I am assuming all your Windows rows have the same spelling, 'Windows_Desktop'. You use two different spellings in the example.
One way you can do it is like so:
with open("/tmp/foo.txt", 'r') as input_file:
    next(input_file)  # skip the header line
    counts = {}
    for line in input_file:
        split_line = line.split(",")
        device = split_line[0]
        counts[device] = int(split_line[1]) + (counts.get(device) or 0)
print(counts) ## prints {'Windows_Desktop': 144, 'Linux_Desktop': 19}
There are many ways, but this one doesn't require imports or downloading anything new to Python
For such a small dataset, I'd read the whole thing into memory and use indices (slightly different from yours) to directly access the proper rows and columns. I also see no need for using global variables (or why you're using float instead of int):
import csv
def count_desktops(filename):
    with open(filename, newline='') as count_file:
        data = list(csv.reader(count_file))
    windows = float(data[1][1]) + float(data[3][1])
    linux = float(data[2][1]) + float(data[4][1])
    return windows, linux
windows, linux = count_desktops('counts.csv')
if windows < 80:
    some_function()
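The grouping approach from the answers above can also be sketched with csv.DictReader and collections.Counter, so the columns are addressed by header name rather than by position. The in-memory sample and the function name count_hosts are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical in-memory copy of the file shown in the question.
SAMPLE = """\
HostType,Number
Windows_Desktop,84
Linux_Desktop,12
Windows_Desktop,60
Linux_Desktop,7
"""


def count_hosts(csv_file):
    """Sum the Number column per HostType."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["HostType"]] += int(row["Number"])
    return totals


totals = count_hosts(io.StringIO(SAMPLE))
print(totals["Windows_Desktop"], totals["Linux_Desktop"])
```

Unlike the fixed-index version, this does not depend on the rows always appearing in the same order.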

Python script for transforming and sorting columns in ascending order, decimal cases

I wrote a script in Python removing tabs/blank spaces between two columns of strings (x,y coordinates) plus separating the columns by a comma and listing the maximum and minimum values of each column (2 values for each the x and y coordinates). E.g.:
100000.00 60000.00
200000.00 63000.00
300000.00 62000.00
400000.00 61000.00
500000.00 64000.00
became:
100000.00,60000.00
200000.00,63000.00
300000.00,62000.00
400000.00,61000.00
500000.00,64000.00
100000.00 500000.00 60000.00 64000.00
This is the code I used:
import string
input = open(r'C:\coordinates.txt', 'r')
output = open(r'C:\coordinates_new.txt', 'wb')
s = input.readline()
while s <> '':
    s = input.readline()
    liste = s.split()
    x = liste[0]
    y = liste[1]
    output.write(str(x) + ',' + str(y))
    output.write('\n')
    s = input.readline()
input.close()
output.close()
I need to change the above code to also transform the coordinates from two decimal to one decimal values and each of the two new columns to be sorted in ascending order based on the values of the x coordinate (left column).
I started by writing the following but not only is it not sorting the values, it is placing the y coordinates on the left and the x on the right. In addition I don't know how to transform the decimals since the values are strings and the only function I know is using %f and that needs floats. Any suggestions to improve the code below?
import string
input = open(r'C:\coordinates.txt', 'r')
output = open(r'C:\coordinates_sorted.txt', 'wb')
s = input.readline()
while s <> '':
    s = input.readline()
    liste = string.split(s)
    x = liste[0]
    y = liste[1]
    output.write(str(x) + ',' + str(y))
    output.write('\n')
    sorted(s, key=lambda x: x[o])
    s = input.readline()
input.close()
output.close()
thanks!
First, try to format your code according to PEP8—it'll be easier to read. (I've done the cleanup in your post already).
Second, Tim is right in that you should try to learn how to write your code as (idiomatic) Python not just as if translated directly from its C equivalent.
As a starting point, I'll post your 2nd snippet here, refactored as idiomatic Python:
# there is no need to import the `string` module; `.strip()` and `.split()` are
# built-in methods of strings (i.e. objects of type `str`).
# read in the data as a list of pairs of raw (i.e. unparsed) coordinates in
# string form, skipping blank lines:
with open(r'C:\coordinates.txt') as in_file:
    coords_raw = [line.split() for line in in_file if line.strip()]
# convert the raw list into a list of pairs (2-tuples) containing the parsed
# (i.e. float not string) data:
coord_pairs = [(float(x_raw), float(y_raw)) for x_raw, y_raw in coords_raw]
coord_pairs.sort()  # you want to sort the entire data set, not just values on
                    # individual lines as in your original snippet
# build a list of all x and y values we have (this could be done in one line
# using some `zip()` hackery, but I'd like to keep it readable (for you at
# least)):
all_xs = [x for x, y in coord_pairs]
all_ys = [y for x, y in coord_pairs]
# compute min and max:
x_min, x_max = min(all_xs), max(all_xs)
y_min, y_max = min(all_ys), max(all_ys)
# NOTE: the above section performs well for small data sets; for large ones, you
# should combine the 4 lines in a single for loop so as to NOT have to read
# everything to memory and iterate over the data 6 times.
# write everything out, with one decimal place as requested
with open(r'C:\coordinates_sorted.txt', 'w') as out_file:
    # here, we're doing 3 things in one line:
    # * iterate over all coordinate pairs and convert the pairs to the string
    #   form
    # * join the string forms with a newline character
    # * write the result of the join+iterate expression to the file
    out_file.write('\n'.join('%.1f,%.1f' % (x, y) for x, y in coord_pairs))
    out_file.write('\n\n')
    out_file.write('%.1f %.1f %.1f %.1f' % (x_min, x_max, y_min, y_max))
with open(...) as <var_name> gives you guaranteed closing of the file handle as with try-finally; also, it's shorter than open(...) and .close() on separate lines. Also, with can be used for other purposes, but is commonly used for dealing with files. I suggest you look up how to use try-finally as well as with/context managers in Python, in addition to everything else you might have learned here.
Your code looks more like C than like Python; it is quite unidiomatic. I suggest you read the Python tutorial to find some inspiration. For example, iterating using a while loop is usually the wrong approach. The string module is deprecated for the most part, <> should be !=, you don't need to call str() on an object that's already a string...
Then, there are some errors. For example, sorted() returns a sorted version of the iterable you're passing - you need to assign that to something, or the result will be discarded. But you're calling it on a string, anyway, which won't give you the desired result. You also wrote x[o] where you clearly meant x[0].
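The point about sorted() returning a new list can be checked with a small sketch. The coordinate pairs below are made up for illustration and are not read from the asker's file.

```python
# Hypothetical coordinate pairs, deliberately out of order.
pairs = [(300000.00, 62000.00), (100000.00, 60000.00), (200000.00, 63000.00)]

sorted(pairs, key=lambda p: p[0])         # result is discarded, pairs unchanged
by_x = sorted(pairs, key=lambda p: p[0])  # result is kept in a new list

# Format each pair with one decimal place, as the question asks.
lines = ["{:.1f},{:.1f}".format(x, y) for x, y in by_x]
print("\n".join(lines))
```

Calling pairs.sort(key=...) instead would sort the list in place and return None.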
You should be using something like this:
with open(r'C:\coordinates.txt') as infile:
    values = []
    for line in infile:
        values.append([float(v) for v in line.split()])
values.sort()
with open(r'C:\coordinates_sorted.txt', 'w') as outfile:
    for value in values:
        outfile.write("{:.1f},{:.1f}\n".format(*value))

Why doesn't this return the average of the column of the CSV file?

def averager(filename):
    f=open(filename, "r")
    avg=f.readlines()
    f.close()
    avgr=[]
    final=""
    x=0
    i=0
    while i < range(len(avg[0])):
        while x < range(len(avg)):
            avgr+=str((avg[x[i]]))
            x+=1
        final+=str((sum(avgr)/(len(avgr))))
        clear(avgr)
        i+=1
    return final
The error I get is:
File "C:\Users\konrad\Desktop\exp\trail3.py", line 11, in averager
avgr+=str((avg[x[i]]))
TypeError: 'int' object has no attribute '__getitem__'
x is just an integer, so you can't index it.
So, this:
x[i]
Should never work. That's what the error is complaining about.
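A minimal sketch of the difference, using a made-up list of rows in place of the file contents: indexing the integer fails, while indexing the row first and then the column works.

```python
# Hypothetical rows standing in for the split lines of the CSV file.
rows = [["1", "2", "3"], ["4", "5", "6"]]
x, i = 1, 2

try:
    rows[x[i]]       # tries to index the integer x, which is not possible
except TypeError as exc:
    error = str(exc)

value = rows[x][i]   # row x first, then column i within that row
print(error)
print(value)
```

The exact wording of the TypeError message differs between Python versions, but the cause is the same.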
UPDATE
Since you asked for a recommendation on how to simplify your code (in a below comment), here goes:
Assuming your CSV file looks something like:
-9,2,12,90...
1423,1,51,-12...
...
You can read the file in like this:
with open(<filename>, 'r') as file_reader:
    file_lines = file_reader.read().split('\n')
Notice that I used .split('\n'). This causes the file's contents to be stored in file_lines as, well, a list of the lines in the file.
So, assuming you want the ith column to be summed, this can easily be done with a comprehension (the if line test skips empty lines, such as the one left after a trailing newline):
ith_col_sum = sum(float(line.split(',')[i]) for line in file_lines if line)
So then to average it all out you could just divide the sum by the number of non-empty lines:
average = ith_col_sum / sum(1 for line in file_lines if line)
Others have pointed out the root cause of your error. Here is a different way to write your method:
import csv
def csv_average(filename, column):
    """ Returns the average of the values in
    column for the csv file """
    column_values = []
    with open(filename) as f:
        reader = csv.reader(f)
        for row in reader:
            column_values.append(float(row[column]))
    return sum(column_values) / len(column_values)
Let's pick through this code:
def averager(filename):
averager as a name is not as clear as it could be. How about averagecsv, for example?
f=open(filename, "r")
avg=f.readlines()
avg is poorly named. It isn't the average of everything! It's a bunch of lines. Call it csvlines for example.
f.close()
avgr=[]
avgr is poorly named. What is it? Names should be meaningful, otherwise why give them?
final=""
x=0
i=0
while i < range(len(avg[0])):
while x < range(len(avg)):
As mentioned in comments, you can replace these with for loops, as in for i in range(len(avg[0])):. This saves you from needing to declare and increment the variable in question.
avgr+=str((avg[x[i]]))
Huh? Let's break this line down.
The poorly named avg is our lines from the csv file.
So, we index into avg by x, okay, that would give us the line number x. But... x[i] is meaningless, since x is an integer, and integers don't support array access. I guess what you're trying to do here is... split the file into rows, then the rows into columns, since it's csv. Right?
So let's ditch the code. You want something like this, using the split http://docs.python.org/2/library/stdtypes.html#str.split function:
totalaverage = 0
for col in range(len(csvlines[0].split(","))):
    average = 0
    for row in range(len(csvlines)):
        average += int(csvlines[row].split(",")[col])
    totalaverage += average/len(csvlines)
return totalaverage
BUT wait! There's more! Python has a built in csv parser that is safer than splitting by ,. Check it out here: http://docs.python.org/2/library/csv.html
In response to OP asking how he should go about this in one of the comments, here is my suggestion:
import csv
from collections import defaultdict
with open('numcsv.csv') as f:
    reader = csv.reader(f)
    numbers = defaultdict(list) # so each column starts with an empty list we can append to
    for row in reader:
        for column, value in enumerate(row,start=1):
            numbers[column].append(float(value)) # convert the value to a float: 1. the number may be a float and 2. when we calc the average we need float division
# simple comprehension to print the averages: %d = integer, %f = float. items() goes over key,value pairs
print('\n'.join(["Column %d had average of: %f" % (i,sum(column)/(len(column))) for i,column in numbers.items()]))
Producing
>>>
Column 1 had average of: 2.400000
Column 2 had average of: 2.000000
Column 3 had average of: 1.800000
For a file:
1,2,3
1,2,3
3,2,1
3,2,1
4,2,1
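A shorter variant of the same column-wise idea is sketched below, transposing the rows with zip(*rows) and using statistics.mean. The in-memory sample mirrors the file shown above, and the function name column_averages is made up for illustration.

```python
import csv
import io
from statistics import mean

# In-memory copy of the example file shown above.
SAMPLE = "1,2,3\n1,2,3\n3,2,1\n3,2,1\n4,2,1\n"


def column_averages(csv_file):
    """Return the average of each column, in column order."""
    # Parse every non-empty row into floats.
    rows = [[float(v) for v in row] for row in csv.reader(csv_file) if row]
    # zip(*rows) turns the list of rows into an iterable of columns.
    return [mean(col) for col in zip(*rows)]


averages = column_averages(io.StringIO(SAMPLE))
print(averages)
```

Note that zip stops at the shortest row, so this sketch assumes every row has the same number of fields.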
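Here are two methods. The first one just gets the average for the line (what your code above looks like it's doing). The second gets the average for a column (which is what your question asked). Note that both average lengths in characters (of lines and of fields respectively), not the numeric values.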
''' This just gets the avg for a line'''
def averager(filename):
    f=open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0
    for i in range(len(avg)):
        count += len(avg[i])
    return count/float(len(avg))
''' This gets the avg for all "columns"
char is what we split on , ; | (etc)
'''
def averager2(filename, char):
    f=open(filename, "r")
    avg = f.readlines()
    f.close()
    count = 0 # count of items
    total = 0 # sum of all the lengths
    for i in range(len(avg)):
        cols = avg[i].split(char)
        count += len(cols)
        for j in range(len(cols)):
            total += len(cols[j].strip()) # Remove line endings
    return total/float(count)
