Reading file string into an array (In a pythonic way) - python

I'm reading lines from a file to then work with them. Each line is composed solely of float numbers.
I have pretty much everything sorted out up to converting the lines into arrays.
I basically do (pseudo-Python code):
lines = file.readlines()
for line in lines:
    values = line.split(' ')  # or whatever separator
    array = np.array(values)
    # ...and then iterate over every value, casting it to float:
    # newarray[i] = float(array[i])
This works, but it seems a bit counterintuitive and unpythonic. I wanted to know if there is a better way to handle the input from a file so that I end up with an array full of floats.

Quick answer:
arrays = []
for line in open(your_file):  # no need for readlines() if you don't want to store the lines
    # use a list comprehension to build your array on the fly
    new_array = np.array([float(i) for i in line.split(' ')])
    arrays.append(new_array)
If you often process this kind of data, the csv module will help.
import csv
arrays = []
# declare the format of your csv file and Python will turn each line into
# a list for you
parser = csv.reader(open(your_file), delimiter=' ')
for l in parser:
    arrays.append(np.array([float(i) for i in l]))
If you feel wild, you can even make this completely declarative:
import csv
parser = csv.reader(open(your_file), delimiter=' ')
make_array = lambda row: np.array([float(i) for i in row])
arrays = [make_array(row) for row in parser]
And if you really want your colleagues to hate you, you can make it a one-liner (NOT PYTHONIC AT ALL :-):
arrays = [np.array([float(i) for i in r]) for r in csv.reader(open(your_file), delimiter=' ')]
Stripping all the boilerplate and flexibility, you can end up with a clean and quite readable one-liner. I wouldn't use it because I like the refactoring potential of using csv, but it can be good enough. It's a grey zone here, so I wouldn't say it's Pythonic, but it's definitely handy.
arrays = [np.array([float(i) for i in l.split()]) for l in open(your_file)]

If you want a numpy array and each row in the text file has the same number of values:
import numpy
a = numpy.loadtxt('data.txt')
Without numpy:
import csv
with open('data.txt') as f:
    arrays = list(csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONNUMERIC))
Or just:
with open('data.txt') as f:
    arrays = [list(map(float, line.split())) for line in f]  # list() needed on Python 3, where map() is lazy

How about the following:
import numpy as np
arrays = []
for line in open('data.txt'):
    arrays.append(np.array([float(val) for val in line.rstrip('\n').split(' ') if val != '']))

One possible one-liner:
a_list = [list(map(float, line.split(' '))) for line in a_file]
Note that I used map() here instead of a nested list comprehension to aid readability (the list() wrapper is needed on Python 3, where map() returns an iterator).
If you want a numpy array:
an_array = np.array([list(map(float, line.split(' '))) for line in a_file])

I would use regular expressions:
import re
all_lines = ''.join(file.readlines())
new_array = np.array(re.findall(r'[\d.E+-]+', all_lines), float)
new_array = np.reshape(new_array, (m, n))
First merge the lines into one long string, and then extract only the expressions corresponding to floats (r'[\d.E+-]+' covers scientific notation; you can also use r'[\d.]+' for plain float expressions).
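A minimal, self-contained sketch of that idea (the sample values and the (2, 3) shape are made up for illustration):
import re
import numpy as np

text = "1.5 -2.25 3.0E+2\n4.0 5.5 -6.75E-1\n"  # stand-in for the file contents
new_array = np.array(re.findall(r'[\d.E+-]+', text), float)
new_array = np.reshape(new_array, (2, 3))
print(new_array)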

Related

Write a large array to a text file with specific format (as it is)

I need to write several NxM arrays generated with numpy to ascii files in the following format:
tab separated values, row per line, no brackets, the dimension of the array (columns rows) in the header.
I have tried something like this:
import numpy as np
mat = np.random.random((73,92))
dim = mat.shape
header = "{col}\t{row}\n".format(row = dim[0], col = dim[1])
content = str(mat).replace(' [', '').replace('[', '').replace(']', '').replace(' ', '\t')
mat_PM = open('./output_file.dat', "w+")
mat_PM.write(header)
mat_PM.write(content)
mat_PM.close()
which returns a file with a summarized array, i.e. most of the values are elided with "...".
How do we print the full array?
Any help is much appreciated.
Numpy has a function to write arrays to text files with lots of formatting options - see the docs for numpy.savetxt and the user guide intro to saving and loading numpy objects.
In your case, something like the following should work:
with open('./output_file.dat', "w+") as mat_PM:
    mat_PM.write(header)
    np.savetxt(mat_PM, mat, fmt="%9f", delimiter="\t")
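Since the header here is a single line of plain text, np.savetxt can also write it for you via its header parameter; a sketch (comments='' suppresses the leading "# " that savetxt would otherwise prepend):
# header was built above as "{col}\t{row}\n"; savetxt adds its own newline
np.savetxt('./output_file.dat', mat, fmt="%9f", delimiter="\t",
           header=header.rstrip('\n'), comments='')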
I would use Python's str.join() method in order to manually create the content string (instead of using str() on the matrix):
content = "\n".join(["\t".join([str(value) for value in row]) for row in mat])
Or the same technique but without using list comprehensions:
row_strings = []
for row in mat:
    values = []
    for value in row:
        values.append(str(value))
    row_strings.append("\t".join(values))
content = "\n".join(row_strings)
If instead you want to use numpy's built in string method, you can configure numpy to return the whole array (as explained in this answer):
np.set_printoptions(threshold=np.inf, linewidth=np.inf)
content = str(mat).replace(' [', '').replace('[', '').replace(']', '').replace(' ', '\t')
Hope this is useful :)

Read in lines from text file as separate arrays

I have a text file containing lines of strings that resemble an array format. I initially had a list of numpy arrays, and wrote them into the file like this, where each array holds about 5 floats:
import numpy as np
parameters = [np.array(...), np.array(...), ...]
with open('params.txt', 'w') as f:
    for param in parameters:
        f.write(str(param) + '\n')
Now I'd like to read them back out, as a list of separate arrays. I'm having issues with this however -- below is what I'm trying to do:
parameters = []
with open('params.txt', 'r') as f:
    for line in f:
        parameters.append(np.array(line))
But now when I later try to index elements in these arrays and use list comprehension, like: [params[2] for params in parameters], I get this error: IndexError: too many indices for array.
I have also tried reading them out with line.split(','), but this didn't give me what I wanted and just messed up the formatting further. How can I accomplish this?
The format of my text file:
[242.1383, 131.087, 1590.853, 1306.09, 783.979]
[7917.102, 98.12, 21.43, 13.1383, 6541.33]
[823.74, 51.31, 9622.434, 974.11, 980.177]
...
What I want:
parameters = [np.array([242.1383, 131.087, 1590.853, 1306.09, 783.979]), np.array([7917.102, 98.12, 21.43, 13.1383, 6541.33]), np.array([823.74, 51.31, 9622.434, 974.11, 980.177]), ...]
I figured out a slightly simpler way to accomplish this without having to worry about all the string parsing, using regex:
import re
parameters = []
with open('params.txt', 'r') as f:
    for line in f:
        values = [float(value) for value in re.findall(r'\d+\.?\d*', line)]
        parameters.append(np.array(values))
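Since each line already looks like a Python list literal, another option (an aside, not from the original answers) is ast.literal_eval, which avoids writing any parsing code:
import ast
import numpy as np

parameters = []
with open('params.txt', 'r') as f:
    for line in f:
        # each line is e.g. "[242.1383, 131.087, ...]", which is valid list syntax
        parameters.append(np.array(ast.literal_eval(line)))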
Are you looking for something like this?
parameters = []
for line in f.readlines():
    y = [value for value in line.split()]
    parameters.append(y)
It would be easier if I knew what the text file looked like -- if you would show the format of the text file you're trying to read from.

Python - Read in Comma Separated File, Create Two lists

New to Python here and I'm trying to learn/figure out the basics. I'm trying to read in a file in Python that has comma separated values, one to a line. Once read in, these values should be separated into two lists, one list containing the value before the "," on each line, and the other containing the value after it.
I've played around with it for quite a while, but I just can't seem to get it.
Here's what I have so far...
with open ("mid.dat") as myfile:
data = myfile.read().replace('\n',' ')
print(data)
list1 = [x.strip() for x in data.split(',')]
print(list1)
list2 = ?
List 1 creates a list, but it's not correct. List 2, I'm not even sure how to tackle.
PS - I have searched other similar threads on here, but none of them seem to address this properly. The file in question is not a CSV file, and needs to stay as a .dat file.
Here's a sample of the data in the .dat file:
113.64,889987.226
119.64,440987774.55
330.43,446.21
Thanks.
Split each line on the comma, then index into the resulting list:
list1 = []
list2 = []
with open ("mid.dat") as myfile:
    for line in myfile:
        parts = line.rstrip().split(",")
        list1.append(parts[0])
        list2.append(parts[1])
Python's rstrip() method strips all kinds of trailing whitespace by default, so it removes the trailing newline "\n" too.
If you want to use only the standard library, you can use the csv module.
import csv
with open("mid.dat") as myfile:
csv_records = csv.reader(myfile)
list1 = []
list2 = []
for row in csv_records:
list1.append(row[0])
list2.append(row[1])
You could try this, which creates lists of floats rather than strings, however:
from ast import literal_eval
with open("mid.dat") as f:
    list1, list2 = map(list, zip(*map(literal_eval, f.readlines())))
This can be simplified if you don't mind having list1 and list2 as tuples.
The list(zip(*my_2d_list)) pattern is a pretty common way of transposing 2D lists using only built-in functions. It's useful in this scenario because it's easy to obtain a list (call this result) of tuples, one per line in the file (where result[0] would be the first tuple, and result[n] the nth), and then transpose result (call this resultT) such that resultT[0] holds all the 'left values' and resultT[1] the 'right values'.
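A quick illustration of the transpose pattern on its own, using the sample values from the question:
rows = [(113.64, 889987.226), (119.64, 440987774.55), (330.43, 446.21)]
transposed = list(zip(*rows))
# transposed[0] == (113.64, 119.64, 330.43)            <- the left values
# transposed[1] == (889987.226, 440987774.55, 446.21)  <- the right values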
Let's keep it very simple:
list1 = []
list2 = []
with open ("mid.dat") as myfile:
    for line in myfile:
        x1, x2 = map(float, line.split(','))
        list1.append(x1)
        list2.append(x2)
print(list1)
print(list2)
You could do this with pandas:
import pandas as pd
df = pd.read_csv('mid.dat', header=None, names=['List 1', 'List 2'])
If your data is in some other text format, a corresponding read function also exists in the pandas package. Pandas is a very powerful tool for data such as yours.
After doing so you can split your data into two independent dataframes.
list1 = df['List 1']
list2 = df['List 2']
I would stick to a dataframe because data manipulation and analysis is much easier within the pandas framework.
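Note that df['List 1'] is a pandas Series rather than a plain Python list; if you do need actual lists, convert explicitly:
list1 = df['List 1'].tolist()
list2 = df['List 2'].tolist()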
Here is my suggestion: short and readable, without any additional packages to install:
with open ("mid.dat") as myfile:
    listOfLines = [line.rstrip().split(',') for line in myfile]
list1 = [line[0] for line in listOfLines]
list2 = [line[1] for line in listOfLines]
Note: I used rstrip() to remove the end-of-line character.
Following is a solution obtained by correcting your own attempt:
with open("test.csv", "r") as myfile:
datastr = myfile.read().replace("\n",",")
datalist = datastr.split(",")
list1 = []; list2=[]
for i in range(len(datalist)-1): # ignore empty last item of list
if i%2 ==0:
list1.append(datalist[i])
else:
list2.append(datalist[i])
print(list1)
print(list2)
Output:
['113.64', '119.64', '330.43']
['889987.226', '440987774.55', '446.21']

TypeError in for loop

I'm having trouble with some code, where I have a text file with 633,986 tuples, each with 3 values (example: the first line is -0.70,0.34,1.05). I want to create an array where I take the magnitude of the 3 values in the tuple, so for elements a,b,c, I want magnitude = sqrt(a^2 + b^2 + c^2).
However, I'm getting an error in my code. Any advice?
import math
fname = '\\pathname\\GerrysTenHz.txt'
open(fname, 'r')
Magn1 = []
for i in range(0, 633986):
    Magn1[i] = math.sqrt((fname[i,0])^2 + (fname[i,1])^2 + (fname[i,2])^2)
which raises: TypeError: string indices must be integers, not tuple
You need to open the file properly (use the open file object and the csv module to parse the comma-separated values), read each row and convert the strings into float numbers, then apply the correct formula:
import math, csv
fname = '\\pathname\\GerrysTenHz.txt'
magn1 = []
with open(fname, 'rb') as inputfile:  # 'rb' is for Python 2; on Python 3 use open(fname, newline='')
    reader = csv.reader(inputfile)
    for row in reader:
        magn1.append(math.sqrt(sum(float(c) ** 2 for c in row)))
which can be simplified with a list comprehension to:
import math, csv
fname = '\\pathname\\GerrysTenHz.txt'
with open(fname, 'rb') as inputfile:
    reader = csv.reader(inputfile)
    magn1 = [math.sqrt(sum(float(c) ** 2 for c in row)) for row in reader]
The with statement assigns the open file object to inputfile and makes sure it is closed again when the code block is done.
We add up the squares of the column values with sum(), which is fed a generator expression that converts each column to float() before squaring it.
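As an aside (not part of the original answer): on Python 3.8+, math.hypot accepts any number of coordinates, so the per-row formula can also be written as:
magn1 = [math.hypot(*(float(c) for c in row)) for row in reader]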
You need to use the lines of the file and the csv module (as Martijn Pieters points out) to examine each value. This can be done with a list comprehension and a with statement:
with open(fname) as f:
    reader = csv.reader(f)
    magn1 = [math.sqrt(sum(float(i)**2 for i in row)) for row in reader]
Just make sure you import csv (and math) as well.
To explain the issues you're having (there are quite a few), I'll walk through a more drawn-out way to do this.
You need to use what open returns: open takes a string and returns a file object.
f = open(fname)
I'm assuming the range in your for loop is supposed to be the number of lines in the file. You can instead iterate over the lines of the file one by one:
for line in f:
Then, to get the numbers on each line, use the str.split method to split the line on the commas:
x, y, z = line.split(',')
Convert all three to floats so you can do math with them:
x, y, z = float(x), float(y), float(z)
Then use the ** operator to raise to a power, and take the sqrt of the sum of the three numbers:
n = math.sqrt(x**2 + y**2 + z**2)
Finally, use the append method to add to the back of the list:
Magn1.append(n)
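Putting those steps together (a sketch assuming the same file layout and path as in the question):
import math

fname = '\\pathname\\GerrysTenHz.txt'
Magn1 = []
with open(fname) as f:
    for line in f:
        x, y, z = (float(v) for v in line.split(','))
        Magn1.append(math.sqrt(x**2 + y**2 + z**2))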
Let's look at fname. That's a string. So if you try to subscript it (i.e., fname[i, 0]), you should use an integer, and you'll get back the character at index i. Since you're using [i, 0] as the string indices, you're passing a tuple. That's no integer!
Really, you should be reading a line from the file, then doing things with that. So:
with open(fname, 'r') as f:  # you're also opening the file and doing nothing with it
    for line in f:
        print('doing something with %s' % line)

Slice specific characters in CSV using python

I have data in tab delimited format that looks like:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
I am only interested in the first 3 characters of each entry (i.e. 0/0 and 0/1). I figured the best way to do this would be to use re.match together with numpy's genfromtxt. This example is as far as I have gotten:
import re
csvfile = 'home/python/batch1.hg19.table'
from numpy import genfromtxt
data = genfromtxt(csvfile, delimiter="\t", dtype=None)
for i in data[1]:
    m = re.match('[0-9]/[0-9]', i)
    if m:
        print m.group(0),
    else:
        print "NA",
This works for a single row of the data, but I am having a hard time figuring out how to expand it to every row of the input file.
Should I make it a function and apply it to each row separately, or is there a more pythonic way to do this?
Unless you really want to use NumPy, try this:
file = open('home/python/batch1.hg19.table')
for line in file:
    for cell in line.split('\t'):
        print(cell[:3])
Which just iterates through each line of the file, tokenizes the line using the tab character as the delimiter, then prints the slice of the text you are looking for.
Numpy is great when you want to load in an array of numbers.
The format you have here is too complicated for numpy to recognize, so you just get an array of strings. That's not really playing to numpy's strength.
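That said, if you do stay with numpy, one aside (not from the original answer): casting a string array to a fixed-width string dtype truncates each entry, which happens to do exactly this slicing:
import numpy as np

data = np.array([['0/0:23:-1.03', '0/1:34:-1.01'],
                 ['---:23:-1.03', '0/0:74:-1.02']])  # made-up sample entries
print(data.astype('U3'))  # keeps only the first 3 characters of each entry
# [['0/0' '0/1']
#  ['---' '0/0']]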
Here's a simple way to do it without numpy:
import re

result = []
with open(csvfile, 'r') as f:
    for line in f:
        row = []
        for text in line.split('\t'):
            match = re.search('([0-9]/[0-9])', text)
            if match:
                row.append(match.group(1))
            else:
                row.append("NA")
        result.append(row)
print(result)
yields
# [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]
on this data:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
---:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
It's pretty easy to parse the whole file without regular expressions:
for line in open('yourfile').read().split('\n'):
    for token in line.split('\t'):
        print token[:3] if token else 'NA'
I haven't written Python in a while, but I would probably write it like this:
file = open("home/python/batch1.hg19.table")
for line in file:
    columns = line.split("\t")
    for column in columns:
        print column[:3]
file.close()
Of course if you need to validate the first three characters, you'll still need the regex.
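For example, a quick sketch reusing the question's pattern (the token value is made up):
import re

token = "0/0:23:-1.03,-7.94,-83.75:69.15"
prefix = token[:3]
print(prefix if re.match(r'[0-9]/[0-9]$', prefix) else "NA")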
