I'm having a bit of trouble with some data stored in a text file that I need for regression analysis in Python.
The data are stored in a format that looks like this:
2104,3,399900 1600,3,329900 2400,3,369000 ....
I need to do some analysis, like finding the mean:
(2104+1600+...)/number of data
I think the appropriate step is to store the data in arrays, but I have no idea how. I can think of two ways to do it. The first is to use three arrays, one per column:
a=[2104 1600 2400 ...] b=[3 3 3 ...] c=[399900 329900 36000 ...]
The second way is to store one array per row:
a=[2104 3 399900], b=[1600 3 329900] and so on.
Which one is better?
Also, how do I write code so the data can be stored into an array? I tried something like this:
with open("file.txt", "r") as ins:
    array = []
    for line in ins:
        array.append(line.strip(',."\'?!*:'))
Is that correct?
You could use:
with open('data.txt') as data:
    substrings = data.read().split()

values = [[int(x) for x in substring.split(',')] for substring in substrings]
average = sum(a for a, b, c in values) / len(values)
print(average)
With this data.txt:
2104,3,399900 1600,3,329900 2400,3,369000
2105,3,399900 1601,3,329900 2401,3,369000
it outputs:
2035.16666667
Using pandas and numpy (plus io.StringIO from the standard library) you can get the data into an array as follows:
In [37]: data = "2104,3,399900 1600,3,329900 2400,3,369000"
In [38]: d = pd.read_csv(io.StringIO(data), sep=',| ', header=None, index_col=None, engine="python")
In [39]: d.values.reshape(3, d.shape[1] // 3)
Out[39]:
array([[ 2104, 3, 399900],
[ 1600, 3, 329900],
[ 2400, 3, 369000]])
Instead of having multiple arrays a, b, c, ... you could store your data as an array of arrays (a two-dimensional array). For example:
[[2104,3,399900],
[1600,3,329900],
[2400,3,369000]...]
This way you don't have to deal with dynamically naming your arrays. Whether you store your data as 3 arrays of length n or n arrays of length 3 is up to you; I would prefer the second way. To read the data into your array, use the split() function, which splits the input into a list. So in your case:
with open("file.txt", "r") as ins:
    tmp = ins.read().split(" ")
    array = [i.split(",") for i in tmp]
>>> array
[['2104', '3', '399900'], ['1600', '3', '329900'], ['2400', '3', '369000']]
Edit:
To find the mean e.g. for the first element in each list you could do the following:
arraymean = sum([int(i[0]) for i in array]) / len(array)
The 0 in i[0] selects the first element of each list. Note that this code uses a list comprehension.
Also, this code stores the values in the array as strings, hence the cast to int when computing the mean. If you want to store the data as int directly, just edit the file-reading part:
array = [[int(j) for j in i.split(",")] for i in tmp]
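If you want the mean of every column at once rather than just the first, zip(*rows) transposes the rows into columns (a small sketch using the sample values from the file):

```python
# Rows parsed from the file, already cast to int as above.
rows = [[2104, 3, 399900], [1600, 3, 329900], [2400, 3, 369000]]

# zip(*rows) yields one tuple per column, so each mean is over a column.
col_means = [sum(col) / len(col) for col in zip(*rows)]
print(col_means)  # column means: ~2034.67, 3.0, ~366266.67
```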
This is a quick solution without error checking (using a list comprehension, PEP 202). If your file has a consistent format you can do the following:
import numpy as np
a = np.array([np.array(i.split(",")).astype("float") for i in open("example.txt").read().split(" ")])
Should you print it:
print(a)
print("Mean of column 0: ", np.mean(a[:, 0]))
You'll obtain the following:
[[ 2.10400000e+03 3.00000000e+00 3.99900000e+05]
[ 1.60000000e+03 3.00000000e+00 3.29900000e+05]
[ 2.40000000e+03 3.00000000e+00 3.69000000e+05]]
Mean of column 0: 2034.66666667
Notice how, in the code snippet, the "," is specified as the separator inside each triplet, and the space " " as the separator between triplets. This is the exact content of the file I used as an example:
2104,3,399900 1600,3,329900 2400,3,369000
Related
I want to make a two-dimensional array of string values from my CSV data document, but I have trouble with the indexes.
My data (in a CSV document):
1.alquran,tunjuk,taqwa,perintah,larang,manfaat 2.taqwa,ghaib,allah,malaikat,surga,neraka,rasul,iman,ibadah,manfaat,taat,ridha
3.taqwa,alquran,hadist,kitab,allah,akhirat,ciri
My current function:
def ubah(kata):
    a = []
    for i in range(0, 19):
        a.append([kata.values[i, j] for j in range(0, 13)])
    return a
and the desired result is:
[['alquran','tunjuk','taqwa','perintah','larang','manfaat'],
 ['taqwa','ghaib','allah','malaikat','surga','neraka','rasul','iman','ibadah','manfaat','taat','ridha'],
 ['taqwa','alquran','hadist','kitab','allah','akhirat','ciri']]
You can modify your function as:
def ubah(kata):
    a = []
    line = kata.split("\n")  # will create a list of rows
    for i in range(len(line)):
        a.append(line[i].split(","))  # will add the separated values
    return a

df = open("data.csv", 'r')
kata = df.read()
dataarray = ubah(kata)  # calling the function
print(dataarray)
The above program gives the result you want:
[['alquran', 'tunjuk', 'taqwa', 'perintah', 'larang', 'manfaat'], ['taqwa', 'ghaib', 'allah', 'malaikat', 'surga', 'neraka', 'rasul', 'iman', 'ibadah', 'manfaat', 'taat', 'ridha '], ['taqwa', 'alquran', 'hadist', 'kitab', 'allah', 'akhirat', 'ciri']]
Hope this helps.
Please remove values[i,j] from the for loop and replace it with j:
for i in range(0, 19):
    a.append([kata[j] for j in range(0, 3)])
How can I import an array into Python (numpy.array) from a file, and how should that file be written in the first place if it doesn't already exist?
For example, save out a matrix to a file then load it back.
Check out the numpy example list. Here is the entry on .loadtxt():
>>> from numpy import *
>>>
>>> data = loadtxt("myfile.txt") # myfile.txt contains 4 columns of numbers
>>> t,z = data[:,0], data[:,3] # data is 2D numpy array
>>>
>>> t,x,y,z = loadtxt("myfile.txt", unpack=True) # to unpack all columns
>>> t,z = loadtxt("myfile.txt", usecols = (0,3), unpack=True) # to select just a few columns
>>> data = loadtxt("myfile.txt", skiprows = 7) # to skip 7 rows from top of file
>>> data = loadtxt("myfile.txt", comments = '!') # use '!' as comment char instead of '#'
>>> data = loadtxt("myfile.txt", delimiter=';') # use ';' as column separator instead of whitespace
>>> data = loadtxt("myfile.txt", dtype = int) # file contains integers instead of floats
Another option is numpy.genfromtxt, e.g.:
import numpy as np
data = np.genfromtxt("myfile.dat",delimiter=",")
This will make data a numpy array with as many rows and columns as are in your file.
(I know the question is old, but I think this might be good as a reference for people with similar questions)
If you want to load data from an ASCII/text file (which has the benefit of being more or less human-readable and easy to parse in other software), numpy.loadtxt is probably what you want:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
If you just want to quickly save and load numpy arrays/matrices to and from a file, take a look at numpy.save and numpy.load:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html
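As a minimal sketch of that round trip (the file name here is just an illustration):

```python
import numpy as np

matrix = np.array([[2104, 3, 399900],
                   [1600, 3, 329900]])

np.save("matrix.npy", matrix)    # writes a binary .npy file
loaded = np.load("matrix.npy")   # reads it back, shape and dtype preserved

print(np.array_equal(matrix, loaded))  # True
```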
In Python, storing a bare Python list as a numpy.array, saving it out to a file, loading it back, and converting it back to a list takes a few conversion tricks. The confusion arises because Python lists are not at all the same thing as numpy.arrays:
import numpy as np
foods = ['grape', 'cherry', 'mango']
filename = "./outfile.dat.npy"
np.save(filename, np.array(foods))
z = np.load(filename).tolist()
print("z is: " + str(z))
This prints:
z is: ['grape', 'cherry', 'mango']
Which is stored on disk as the filename: outfile.dat.npy
The important methods here are the tolist() and np.array(...) conversion functions.
Have a look at the SciPy cookbook. It should give you an idea of some basic methods for importing/exporting data.
If you save/load the files from your own Python programs, you may also want to consider the pickle module (cPickle on Python 2).
I have a file which has scientific data expressed in scientific notation, in the following format, where the numbers always appear in groups of six rows:
4.748257444721457E-004
-4.058788876602824E-006
-1.494658656964534E-004
4.686186383664201E-006
3.840708360798801E-006
-3.237680480600605E-005
-3.237680480600605E-005
5.290165586028430E-005
-1.312378015891650E-005
-9.693957759497108E-006
3.184131106435972E-005
I am trying to construct a matrix (in Python) from the above, in the form shown below:
4.748257444721457E-004 -3.237680480600605E-005
-4.058788876602824E-006 5.290165586028430E-005
-1.494658656964534E-004 -1.312378015891650E-005
4.686186383664201E-006 -9.693957759497108E-006
3.840708360798801E-006 3.184131106435972E-005
I am wondering if constructing such a matrix is possible in Python (so it can then be used for further mathematical analysis)?
I tried this approach: convert the numbers to decimal, build the matrix with a bash script, then load it into Python as text. However, that conversion loses floating-point precision, so I am thinking of a loop approach to generate this matrix instead. The real data set has 54x54 values.
This may help you:
import numpy as np

matrix = []
row = []
with open("input.txt") as f:
    for line in f:
        line = line.rstrip('\n')
        if not line:
            if len(row) != 0:
                matrix.append(row)
                row = []
            continue
        row.append(float(line))
    if len(row) != 0:
        matrix.append(row)

matrix = np.asarray(matrix).T
print(matrix)
It prints:
[[ 4.74825744e-04 -3.23768048e-05]
[ -4.05878888e-06 -3.23768048e-05]
[ -1.49465866e-04 5.29016559e-05]
[ 4.68618638e-06 -1.31237802e-05]
[ 3.84070836e-06 -9.69395776e-06]
[ 3.84070836e-06 3.18413111e-05]]
You can also print the matrix by iterating through rows and columns.
for row in range(matrix.shape[0]):
    for column in range(matrix.shape[1]):
        print(matrix[row][column], end=' ')
    print()
It prints:
0.000474825744472 -3.2376804806e-05
-4.0587888766e-06 -3.2376804806e-05
-0.000149465865696 5.29016558603e-05
4.68618638366e-06 -1.31237801589e-05
3.8407083608e-06 -9.6939577595e-06
3.8407083608e-06 3.18413110644e-05
You can also print with a given precision in scientific format by replacing the inner print call:
print('{:.4e}'.format(matrix[row][column]), end=' ')
The input file, input.txt, contains:
4.748257444721457e-004
-4.058788876602824e-006
-1.494658656964534e-004
4.686186383664201e-006
3.840708360798801e-006
3.840708360798801e-006
-3.237680480600605e-005
-3.237680480600605e-005
5.290165586028430e-005
-1.312378015891650e-005
-9.693957759497108e-006
3.184131106435972e-005
I need to write a series of matrices out to a plain text file from Python. All my matrices are in float format, so the simple file.write() and file.writelines() do not work. Is there a conversion method I can employ that doesn't have me looping through all the lists (a matrix is a list of lists in my case) converting the individual values?
I should clarify that it needn't look like a matrix, just the associated values in an easy-to-parse list, as I will be reading them back in later. All on one line may actually make this easier!
m = [[1.1, 2.1, 3.1], [4.1, 5.1, 6.1], [7.1, 8.1, 9.1]]
file.write(str(m))
If you want more control over the format of each value:
def format(value):
    return "%.3f" % value

formatted = [[format(v) for v in r] for r in m]
file.write(str(formatted))
The following works for me:
with open(fname, 'w') as f:
    f.writelines(','.join(str(j) for j in i) + '\n' for i in matrix)
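Since the question mentions reading the values back in later, the reverse is symmetric: split each line on commas and cast back to float. A small sketch (the file name is illustrative):

```python
fname = "matrix.txt"  # hypothetical file name
matrix = [[1.1, 2.1, 3.1], [4.1, 5.1, 6.1]]

# Write: one comma-separated row per line, as above.
with open(fname, 'w') as f:
    f.writelines(','.join(str(j) for j in i) + '\n' for i in matrix)

# Read back: split each line on commas and cast to float
# (float() tolerates the trailing newline on the last field).
with open(fname) as f:
    restored = [[float(x) for x in line.split(',')] for line in f]

print(restored == matrix)  # True
```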
Why not use pickle?
import pickle  # cPickle on Python 2

with open("test.pckl", "wb") as pckl_file:
    pickle.dump([1, 2, 3], pckl_file)
for row in matrix:
    file.write(" ".join(map(str, row)) + "\n")
This works for me and writes the output in matrix format.
import pickle
# write object to file
a = ['hello', 'world']
pickle.dump(a, open('delme.txt', 'wb'))
# read object from file
b = pickle.load(open('delme.txt', 'rb'))
print(b)  # ['hello', 'world']
At this point you can look at the file 'delme.txt' with vi
vi delme.txt
1 (lp0
2 S'hello'
3 p1
4 aS'world'
5 p2
6 a.