If I have a txt file with the contents as such:
4 2 45 21
0 92 12 2
345 9 3 4
1 2 39 93
Is there a quick and easy way to turn this into a matrix of int?
Right now, I have accessed the file this way:
file = open(testFile, 'r')
data = []
for row in file:
    data.append(row)
This stores the data as an array where each line is a string. Instead of going through and converting the data types and then turning it into a matrix, is there a way I can immediately store this data in matrix form as ints as I read it in?
A simple way of doing this is:
>>> for row in file:
...     data.append([int(x) for x in row.split()])
...
>>> data
[[4, 2, 45, 21], [0, 92, 12, 2], [345, 9, 3, 4], [1, 2, 39, 93]]
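For completeness, here is a minimal self-contained sketch of the same idea that also closes the file via a context manager. The helper name read_int_matrix and the temporary sample file are assumptions made for illustration; they are not part of the original question.

```python
import os
import tempfile


def read_int_matrix(path):
    """Read whitespace-separated integers, one list per non-empty line."""
    with open(path) as handle:
        return [[int(token) for token in line.split()]
                for line in handle if line.strip()]


# Write the sample data from the question to a temporary file (illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("4 2 45 21\n0 92 12 2\n345 9 3 4\n1 2 39 93\n")
    sample_path = tmp.name

matrix = read_int_matrix(sample_path)
os.remove(sample_path)
print(matrix)
```

Skipping blank lines is a design choice here, so that a trailing newline does not produce an empty row.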
IMO, this is the most pythonic way
for a nested list:
text = """4 2 45 21
0 92 12 2
345 9 3 4
1 2 39 93"""
[[*map(int, line.split())] for line in text.split('\n')]
Out[16]: [[4, 2, 45, 21], [0, 92, 12, 2], [345, 9, 3, 4], [1, 2, 39, 93]]
If you're okay with having your data stored as a numpy.ndarray, you can use numpy's genfromtxt() with the dtype argument set to int:
from io import StringIO  # Python 3; on Python 2 use "from StringIO import StringIO"
import numpy as np
text = """4 2 45 21
0 92 12 2
345 9 3 4
1 2 39 93"""
a = np.genfromtxt(StringIO(text), dtype=int) #replace the arg with your filename
print(a)
#[[ 4 2 45 21]
# [ 0 92 12 2]
# [345 9 3 4]
# [ 1 2 39 93]]
An alternative is to use loadtxt() instead of genfromtxt(), as @Zhiya pointed out in the comments.
a = np.loadtxt(StringIO(text), dtype=int)
As per this post, both functions are basically the same except that genfromtxt() provides more options for dealing with missing data.
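To illustrate the missing-data point, here is a small sketch with a comma-separated input that has one empty field. The sample text and the choice of 0 as the fill value are assumptions for illustration only.

```python
from io import StringIO

import numpy as np

# Hypothetical input: the middle value of the second row is missing.
text_with_gap = "1,2,3\n4,,6"

# genfromtxt treats the empty field as missing and substitutes filling_values.
filled = np.genfromtxt(StringIO(text_with_gap), delimiter=",",
                       dtype=int, filling_values=0)
print(filled)
```

For clean, fully populated files like the one in the question, either function should give the same result.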
I have a list of variables, and I want to use the name of this list as a column name in a dataframe. The list stress and its elements keep changing.
stress = ['M13', 'M14', 'M15', 'M16', 'M17', 'M18']
outputlist = [13, 14, 15, 16, 17, 18] ### obtained from analysis
resultdf[stress] = outputlist ### I want to name the column same as list name.
I want something like this given below.
print(resultdf)
stress
0 13
1 14
2 15
3 16
4 17
5 18
This raises an error when I attempt it, because the whole list of values ends up in the column header. How can I achieve this?
The column name just needs to be a string. You are trying to use a variable as the column name. Instead, write:
resultdf["stress"] = outputlist
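As a minimal sketch of this point: when the column name has to change between runs, it can be kept in a separate string variable. The variable column_name and the empty starting frame are assumptions for illustration; the original code does not show how resultdf is created.

```python
import pandas as pd

outputlist = [13, 14, 15, 16, 17, 18]   # values from the question

# Hypothetical: hold the desired column name as a string, not as the list itself.
column_name = "stress"

resultdf = pd.DataFrame(index=range(len(outputlist)))
resultdf[column_name] = outputlist      # a string key creates exactly one column
print(resultdf)
```

Passing the list stress itself as the key is interpreted as a list of column labels, which is why the values end up in the header position.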
This might be what you're looking for, although I'm not sure what the result data looks like:
>>> stress = ['M13', 'M14', 'M15', 'M16', 'M17', 'M18']
>>> data = [[1,2,3,4,5,6], [7,8,9,10,11,12], [13,14,15,16,17,18], [19,20,21,22,23,24], [25,26,27,28,29,30], [31,32,33,34,35,36]]
>>> result = {x: y for x,y in zip(stress, data)}
>>> result
{'M13': [1, 2, 3, 4, 5, 6], 'M14': [7, 8, 9, 10, 11, 12], 'M15': [13, 14, 15, 16, 17, 18], 'M16': [19, 20, 21, 22, 23, 24], 'M17': [25, 26, 27, 28, 29, 30], 'M18': [31, 32, 33, 34, 35, 36]}
Then you can convert the dictionary to a DataFrame:
>>> import pandas as pd
>>> d = pd.DataFrame(result)
>>> d
M13 M14 M15 M16 M17 M18
0 1 7 13 19 25 31
1 2 8 14 20 26 32
2 3 9 15 21 27 33
3 4 10 16 22 28 34
4 5 11 17 23 29 35
5 6 12 18 24 30 36
Edit (based on your update)
If you literally just want a single column with the variable as the name, put it in quotes:
>>> d = pd.DataFrame({'stress': outputlist})
>>> d
stress
0 13
1 14
2 15
3 16
4 17
5 18
I have a large 2-dimensional matrix (36 x 72) and I want to select a small sub-matrix from it by using indexes.
The matrix looks like this:
[ [312, 113, 525, 543, ...] ,
[...],
[...],
... ].
And I print the shape like this:
print(array(matrix).shape)
(36, 72)
But when I try to print out the small matrix like this
print(matrix[6:9][9])
The error is "IndexError: list index out of range"
Then I tried
print(matrix[6:9,9])
It showed "TypeError: list indices must be integers, not tuple"
Then I tried
print(matrix[6:9][8:9])
I get the empty list. But when I tried
print(matrix[9][9])
It did give out some number.
With numpy arrays, you can use quite convenient indexing methods; parts of this numpy feature are referred to as fancy indexing.
Let's try that with a small example 2D-array:
import numpy as np
a=np.arange(48).reshape(6, 8)
print(a)
#[[ 0 1 2 3 4 5 6 7]
# [ 8 9 10 11 12 13 14 15]
# [16 17 18 19 20 21 22 23]
# [24 25 26 27 28 29 30 31]
# [32 33 34 35 36 37 38 39]
# [40 41 42 43 44 45 46 47]]
If you now want to index e.g. rows 2 and 3 and columns 3 to 6, you can simply write that down in slices, no matter if by constants or variables:
r1 = 2; r2 = 4
print(a[r1:r2, 3:7])
#[[19 20 21 22]
# [27 28 29 30]]
You might want to read further here: https://docs.scipy.org/doc/numpy/user/basics.indexing.html
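To see why the attempts in the question fail on a plain nested list, here is a short sketch without numpy. The 36x72 matrix built below is a hypothetical stand-in filled with running numbers.

```python
# Hypothetical 36x72 nested list standing in for the matrix in the question.
matrix = [[r * 72 + c for c in range(72)] for r in range(36)]

rows = matrix[6:9]            # a list of only 3 rows, so rows[9] does not exist
try:
    rows[9]
    chained_failed = False
except IndexError:
    chained_failed = True

empty = matrix[6:9][8:9]      # slices the 3-row list again, past its end

# Slice the rows first, then slice each row for the wanted columns.
sub = [row[8:10] for row in matrix[6:9]]
print(sub)
```

The second slice in matrix[6:9][8:9] is applied to the list of rows, not to the columns, which is why a list comprehension (or a numpy array) is needed for column selection.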
Here's an example. I have a 3x3 matrix, named 'a' and I want to select the top left 2x2 matrix named 'c'.
>>> import numpy as np # importing numpy
>>> a=np.matrix('1 2 3;4 5 6;7 8 9') # creating an example matrix, named a
>>> a
matrix([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> b=[[a.item(0,0),a.item(0,1)],[a.item(1,0),a.item(1,1)]] # creating a list, with 1,1 1,2 2,1 and 2,2 indices of a. remember, in math indexing starts from 1 but in most programming languages, it starts from 0
>>> b
[[1, 2], [4, 5]]
>>> c=np.matrix(b) # creating an numpy matrix object from b which is a part of a
>>> c
matrix([[1, 2],
[4, 5]])
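The same top-left 2x2 block can also be taken with a single slice instead of picking items one by one. This is a sketch using a plain ndarray rather than np.matrix, which is an assumption on my part and not what the answer above uses.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Rows 0..1 and columns 0..1: the stop index of a slice is exclusive.
c = a[:2, :2]
print(c)
```

Note that a slice like this is a view on a; call c.copy() if the sub-matrix should be independent of the original.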
noe = Mx1[2:4, 2:4]  # This is the solution. Indexing starts at 0, so index 2 is the third row/column.
# Mx1[row_start:row_stop, col_start:col_stop]; be careful: rows come first, then columns.
# It looks confusing, but it works.
noe = [[ 8750.  8750.]
 [ 8750. 70000.]]
Mx1 = [[ 8750.  8750. -8750. -8750.]
 [ 8750.  8750. -8750. -8750.]
 [-8750. -8750.  8750.  8750.]
 [-8750. -8750.  8750. 70000.]]
I was wondering if there was a way to pass a list of values/corresponding values that I am remapping in my dataframe. I am using a truncated version of my dataset and would rather not have to update the code one by one, there are about 30 different unique QFundMaster variables in my data.
QFundMaster NPS
0 3 1
1 5 2
2 10 3
3 23 9
4 26 1
The code I am using to remap the data is as follows:
df['Fund'] = df['QFundMaster'] \
.map({3: 'Fund1'\
,5: 'Fund2'\
,10: 'Fund3'\
,23: 'Fund4'\
,26: 'Fund5'})
The code works perfectly fine, but I am after a way to pass a list of values/new values so I don't have to edit the code one by one, and to make it more efficient. Any help in the right direction would be appreciated. Thanks!
print(df.Fund)
0 Fund1
1 Fund2
2 Fund3
3 Fund4
4 Fund5
Name: Fund, dtype: object
If you have the list of old_values and new_values, you could do:
import pandas as pd
data = [[3, 1],
[5, 2],
[10, 3],
[23, 9],
[26, 1]]
df =pd.DataFrame(data=data, columns=['QFundMaster', 'NPS'])
old_values = [3, 5, 10, 23, 26]
new_values = ['Fund1', 'Fund2', 'Fund3', 'Fund4', 'Fund5']
df['Fund'] = df['QFundMaster'].map(dict(zip(old_values, new_values)))
print(df)
Output
QFundMaster NPS Fund
0 3 1 Fund1
1 5 2 Fund2
2 10 3 Fund3
3 23 9 Fund4
4 26 1 Fund5
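If the new labels follow a pattern, the mapping itself can be generated rather than typed out. The sketch below assumes that the funds are simply numbered in ascending order of their QFundMaster code; that naming rule is an assumption for illustration and may not match the real data.

```python
import pandas as pd

df = pd.DataFrame({"QFundMaster": [3, 5, 10, 23, 26],
                   "NPS": [1, 2, 3, 9, 1]})

# Assumed rule: the n-th smallest code becomes "Fund<n>".
codes = sorted(df["QFundMaster"].unique().tolist())
mapping = {code: "Fund%d" % rank for rank, code in enumerate(codes, start=1)}

df["Fund"] = df["QFundMaster"].map(mapping)
print(df)
```

Any code that is absent from the mapping comes out as NaN after map(), so it is worth checking for missing values when the mapping is built by hand.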
I have a file testforce.dat that contains values arranged in 9 columns and 3 rows, with the following column headers:
p1 p2 p3 f1 f2 f3 r1 r2 r3
18 5 27 20 21 8 14 12 25
9 26 23 1 4 10 7 16 24
19 22 15 13 17 6 11 2 3
I have 100 files of this kind.
I now want to calculate for the file force_00000.dat the vector g = [sum(p1*f1), sum(p2*f2), sum(p3*f3)] but for the next file force_00001.dat the vector should use other columns h = [sum(p1*r1), sum(p2*r2), sum(p3*r3)].
At the moment I am using the glob function to read my files into arrays. It puts every row into one array.
I am not sure how to get my alternating array multiplication done and would appreciate any suggestions :)
import numpy as np
import glob

i = 100
for x in range(0, int(i)):
    ## turns x into a string and pads it with leading "0"s to a fixed number of digits
    y = str(x).zfill(5)
    ## the structure of the force file name is "force_[00000-00099]"
    files = sorted(glob.glob('.//results/force/force_%s.dat' % y))
    column_names = ('#position')
    print(files)
    ## loads the file data into arrays
    arrays = [np.loadtxt(filename) for filename in files]
    print(arrays)
Edit: I tested the load of the first file with:
b = np.array(arrays)
print(b.shape)
And I get (1,3,9) for the shape of my generated array.
Edit2: I had the idea to use "usecols" and then multiply the desired values:
xposition = [np.loadtxt(filename, usecols=(0, 1, 2)) for filename in files]
xforce1 = [np.loadtxt(filename, usecols=(3, 4, 5)) for filename in files]
print(xposition)
print(xforce1)
xp = np.asarray(xposition)
xf1 = np.asarray(xforce1)
print(xp)
g = np.multiply(xp, xf1)
print(g)
This generated the following output:
[[[ 360. 105. 216.]
[ 9. 104. 230.]
[ 247. 374. 90.]]]
which means I have (p11 and f11 being the values of the first row, p21 from second row...)
[[[p11*f11 p12*f12 p13*f13]
[p21*f21 p22*f22 p23*f23]
[p31*f31 p32*f32 p33*f33]]]
which suggests I am almost done for at least one file. The desired g(g1,g2,g3) should look like:
p11*f11+p21*f21+p31*f31= g1
p12*f12+p22*f22+p32*f32= g2
p13*f13+p23*f23+p33*f33= g3
Sorry if that is a totally newbie question, but I am not so familiar with python yet :)
For the issue with the alternating values, I was thinking about using an if statement that checks whether "i" in the loop is an even number.
loadtxt returns an array. [loadtxt(name) for name in filenames] produces a list of arrays, one array per name. np.array([...]) produces an array from that list. If the individual arrays are all the same size, the resulting array will be 3d.
If you need to treat every other file differently you could access them as a set with indexing
arr[::2,...]
arr[1::2,...]
To multiply the 2 sets of columns from your example file:
In [558]: txt=b"""p1 p2 p3 f1 f2 f3 r1 r2 r3
...: 18 5 27 20 21 8 14 12 25
...: 9 26 23 1 4 10 7 16 24
...: 19 22 15 13 17 6 11 2 3"""
In [560]: arr = np.loadtxt(txt.splitlines(),skiprows=1,dtype=int)
In [561]: arr
Out[561]:
array([[18, 5, 27, 20, 21, 8, 14, 12, 25],
[ 9, 26, 23, 1, 4, 10, 7, 16, 24],
[19, 22, 15, 13, 17, 6, 11, 2, 3]])
In [562]: arr[:, 0:3]*arr[:, 3:6]
Out[562]:
array([[360, 105, 216],
[ 9, 104, 230],
[247, 374, 90]])
In [563]: arr[:, 0:3]*arr[:, 6:9]
Out[563]:
array([[252, 60, 675],
[ 63, 416, 552],
[209, 44, 45]])
If arr were a 3d array from loading multiple files,
arr1 = arr[::2,...]
arr2 = arr[1::2,...]
arr1[:,:,0:3] * arr1[:,:,3:6]
etc
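Putting the pieces together for a single file, here is a sketch that picks the f columns for even file numbers and the r columns for odd ones, then sums over the rows. The function name force_vector and the even/odd rule are assumptions based on the question's description of force_00000.dat and force_00001.dat.

```python
import numpy as np

# The sample file contents from the question, already loaded as an array.
arr = np.array([[18,  5, 27, 20, 21,  8, 14, 12, 25],
                [ 9, 26, 23,  1,  4, 10,  7, 16, 24],
                [19, 22, 15, 13, 17,  6, 11,  2,  3]])


def force_vector(data, file_number):
    """Sum p*f over rows for even file numbers, p*r for odd ones."""
    positions = data[:, 0:3]
    other = data[:, 3:6] if file_number % 2 == 0 else data[:, 6:9]
    return (positions * other).sum(axis=0)


g = force_vector(arr, 0)   # as for force_00000.dat
h = force_vector(arr, 1)   # as for force_00001.dat
print(g, h)
```

Summing with axis=0 adds up the products down each column, which gives the g1, g2, g3 layout described in the question.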
If I have a data frame df (indexed by integer)
BBG.KABN.S BBG.TKA.S BBG.CON.S BBG.ISAT.S
index
0 -0.004881 0.008011 0.007047 -0.000307
1 -0.004881 0.008011 0.007047 -0.000307
2 -0.005821 -0.016792 -0.016111 0.001028
3 0.000588 0.019169 -0.000307 -0.001832
4 0.007468 -0.011277 -0.003273 0.004355
and I want to iterate through each element individually (by row and column). I know I need to use .iloc(row,column), but do I need to create 2 for loops (one for row and one for column), and how would I do that?
I guess it would be something like:
for col in rollReturnRandomDf.keys():
    for row in rollReturnRandomDf.iterrows():
        item = df.iloc(col,row)
But I am unsure of the exact syntax.
Maybe try using df.values.ravel().
import pandas as pd
import numpy as np
# data
# =================
df = pd.DataFrame(np.arange(25).reshape(5,5), columns='A B C D E'.split())
Out[72]:
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
# np.ravel
# =================
df.values.ravel()
Out[74]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24])
for item in df.values.ravel():
    pass  # do something with item
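If the row and column labels are needed alongside each value, two nested loops over positions with .iloc (square brackets, not parentheses) also work. This is a sketch on a small hypothetical frame, not the asker's data.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame for illustration.
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list("ABC"))

visited = []
for row_pos in range(df.shape[0]):
    for col_pos in range(df.shape[1]):
        # .iloc takes integer positions: [row, column]
        value = df.iloc[row_pos, col_pos]
        visited.append((df.index[row_pos], df.columns[col_pos], int(value)))

print(visited)
```

Element-wise loops like this are slow on large frames, so the ravel() approach or a vectorised operation is usually preferable when the labels are not needed.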