I'm working in Python and currently I have a list which looks like
['001 2.4600 0.46 2.36E+003 86.66 16.77 0.33 1.32E+003 74.41 17.61 0.40 2.21E+003 87.39 22.07',
'002 10.310 0.38 2.95E+002 76.88 4.53 0000 000000000 00000 0000 0.34 2.62E+002 97.36 4.41',
'003 74.840 0.63 5.07E+002 64.63 4.03 0.57 4.15E+002 61.96 3.99 0.63 5.43E+002 64.67 5.16',
...
and so on, with quite a few more elements. Each element of the list is a string containing numbers separated by spaces; e.g., as above, the first element has 001, 2.4600, 0.46 and so on.
I want to turn each element of the list into a row of an array, so that I end up with one large array holding all the information that is currently stored as space-separated numbers inside strings.
I'm sure I can use the built-in array module to do this, but I just can't figure out how.
Any ideas? Hope the question is clear.
Assuming you want floats in the final list of lists, try this:
>>> data = ['001 2.4600 0.46 2.36E+003 86.66 16.77 0.33 1.32E+003 74.41 17.61 0.40 2.21E+003 87.39 22.07', '002 10.310 0.38 2.95E+002 76.88 4.53 0000 000000000 00000 0000 0.34 2.62E+002 97.36 4.41', '003 74.840 0.63 5.07E+002 64.63 4.03 0.57 4.15E+002 61.96 3.99 0.63 5.43E+002 64.67 5.16']
>>> [list(map(float, row.split())) for row in data]
[[1.0, 2.46, 0.46, 2360.0, 86.66, 16.77, 0.33, 1320.0, 74.41, 17.61, 0.4, 2210.0, 87.39, 22.07], [2.0, 10.31, 0.38, 295.0, 76.88, 4.53, 0.0, 0.0, 0.0, 0.0, 0.34, 262.0, 97.36, 4.41], [3.0, 74.84, 0.63, 507.0, 64.63, 4.03, 0.57, 415.0, 61.96, 3.99, 0.63, 543.0, 64.67, 5.16]]
map just says 'apply this function (float()) to everything in this list (the result of split(), which is a list of strings)'. In Python 3 it returns an iterator, so we have to ask for the list() of it. It's often better to use a for loop or list comprehension instead of map, but in this case it's handy.
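For comparison, the equivalent list comprehension produces exactly the same result:
>>> [[float(x) for x in row.split()] for row in data]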
Your idea of using the array module is probably bogus, as an array.array object is, essentially, a list with a constrained data type. You cannot use vectorized operations on it, and an array.array is strictly a 1D object.
That said, you possibly want to use the numpy module, whose array object is a multidimensional array on which you can operate at your will.
# idiomatic manner of importing numpy
import numpy as np

data = ['1 2 3.', '4. 5 8']
arraydata = np.array([[float(n) for n in row.split()] for row in data])
print(arraydata)
# [[ 1.  2.  3.]
#  [ 4.  5.  8.]]
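Once the data is in a numpy array you get 2D indexing and elementwise arithmetic for free, for example:
print(arraydata[:, 0])   # first column of every row: [ 1.  4.]
print(arraydata * 2)     # elementwise doubling of the whole array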
Hopefully I understood correctly:
res = []
for row in my_list:
    res.append(list(map(float, row.split())))
This gives you res as a matrix (a list of lists) of float values, one row per input string.
Assuming your data is stored in a list called data, you could use
data = [[float(el) for el in string.split()] for string in data]
Note that float, not int, is needed here, since the strings contain values like 2.4600; split() with no argument also copes with runs of spaces.
I have a 1D array and want to find the correlation between the first 2 elements and their index positions, then the first 3 elements, and so on, until all elements are included.
I can do it with numpy in a loop; here is my code:
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7,4,8,-2,-3,-4,-6,-2,0,10,0,5,8])
correl = np.zeros(data.shape)
for i in range(1, data.shape[0]):
    correl[i] = np.corrcoef(data[0: i+1], np.arange(i+1))[0, 1]
print(correl)
and the result is:
[ 0. -1. -0.397 0. 0.607 0.799 0.88 0.64 0.581 0.556
0.574 0.322 0.078 -0.02 -0.237 -0.383 -0.489 -0.572 -0.614 -0.634
-0.568 -0.59 -0.573 -0.533]
I wonder how I can do this in numpy without a loop, i.e. be smarter/more efficient.
Any ideas?
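One possible loop-free approach is to expand the Pearson correlation into running sums, which np.cumsum can compute for every prefix at once. A minimal sketch, derived from the standard Pearson formula rather than from corrcoef itself:
import numpy as np

data = np.array([10, 5, 8, 9, 15, 22, 26, 11, 15, 16, 18, 7, 4, 8,
                 -2, -3, -4, -6, -2, 0, 10, 0, 5, 8], dtype=float)

n = np.arange(1, data.size + 1)        # prefix lengths 1..N
t = np.arange(data.size, dtype=float)  # the index series 0..N-1

# running sums that appear in the Pearson formula
sx, sx2 = np.cumsum(data), np.cumsum(data ** 2)
st, st2 = np.cumsum(t), np.cumsum(t ** 2)
sxt = np.cumsum(data * t)

num = n * sxt - sx * st
den = np.sqrt((n * sx2 - sx ** 2) * (n * st2 - st ** 2))

with np.errstate(invalid='ignore'):    # the length-1 prefix gives 0/0
    correl = num / den
correl[0] = 0.0                        # match the loop's initial value
print(correl)
The length-1 prefix has zero variance, so its entry is set to 0 explicitly to match the loop version above.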
I am a bit new to Python.
I am trying to convert a dataframe to a list after changing the datatype of a particular column to integer. The funny thing is that, once converted to a list, the column still contains floats.
There are three columns in the dataframe; the first two are float and I want the last to be integer, but it still comes out as float.
If I change all columns to integer, then the list is created with integers.
0 1.53 3.13 0.0
1 0.58 2.83 0.0
2 0.28 2.69 0.0
3 1.14 2.14 0.0
4 1.46 3.39 0.0
... ... ... ...
495 2.37 0.93 1.0
496 2.85 0.52 1.0
497 2.35 0.39 1.0
498 2.96 1.68 1.0
499 2.56 0.16 1.0
Above is the dataframe. Below, the last column is converted:
#convert last column to integer datatype
data[6] = data[6].astype('int64')
display(data.dtypes)
Below, the dataframe is converted to a list:
#Turn DF to list
data_to_List = data.values.tolist()
data_to_List
# below is what is shown now
[[1.53, 3.13, 0.0],
[0.58, 2.83, 0.0],
[0.28, 2.69, 0.0],
[1.14, 2.14, 0.0],
[3.54, 0.75, 1.0],
[3.04, 0.15, 1.0],
[2.49, 0.15, 1.0],
[2.27, 0.39, 1.0],
[3.65, 1.5, 1.0],
I want the last column to be just 0 or 1 and not 0.0 or 1.0
Yes, you are correct: pandas converts the ints back to floats when you use data.values, because .values returns a single numpy array with one common dtype for all columns.
You can convert your float to int by using the below list comprehension:
data_to_List = [[x[0],x[1],int(x[2])] for x in data.values.tolist()]
print(data_to_List)
[[1.53, 3.13, 0],
[0.58, 2.83, 0],
[0.28, 2.69, 0],
[1.14, 2.14, 0],
[1.46, 3.39, 0]]
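An alternative sketch (assuming the same three-column frame) is to skip .values entirely and iterate the rows, so each column keeps its own dtype and the integer column is never promoted to float:
data_to_List = [list(row) for row in data.itertuples(index=False)]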
I made a dataframe and set the column names using np.arange(). However, instead of exact numbers it (sometimes) sets them to numbers like 0.30000000000000004.
I tried both rounding the entire dataframe and using np.around() on the np.arange() output, but neither seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns=np.arange(0, 1+stepT, stepT),
                    index=np.around(np.arange(0, 1+stepS, stepS), decimals=3)).round(3)
Is there any function that will let me have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
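For example:
np.set_printoptions(precision=3)
print(np.array([0.30000000000000004]))  # [0.3]  -- array display is truncated
print(0.30000000000000004)              # 0.30000000000000004 -- the scalar is not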
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0 1 2 3 4 5 6 7 8 9 10
In [701]: df.columns
Out[701]:
Float64Index([ 0.0, 0.1, 0.2,
0.30000000000000004, 0.4, 0.5,
0.6000000000000001, 0.7000000000000001, 0.8,
0.9, 1.0],
dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
'0.100',
'0.200',
'0.300',
'0.400',
'0.500',
'0.600',
'0.700',
'0.800',
'0.900',
'1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
0.000 0.100 0.200 0.300 0.400 0.500 0.600 0.700 0.800 0.900 1.000
0 0 1 2 3 4 5 6 7 8 9 10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
'0.800', '0.900', '1.000'],
dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
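On Python 3.6+, the same formatting can be written with an f-string, equivalent to the '%.3f' formatting above:
alist = [f'{i:.3f}' for i in np.arange(0, 1 + stepT, stepT)]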
I'd like to take the average of one vector based on grouping information in another vector. The two vectors are the same length. I've created a minimal example below based on averaging predictions for each user. How do I do that in NumPy?
>>> pred
[ 0.99 0.23 0.11 0.64 0.45 0.55 0.76 0.72 0.97 ]
>>> users
['User2' 'User3' 'User2' 'User3' 'User0' 'User1' 'User4' 'User4' 'User4']
A 'pure numpy' solution might use a combination of np.unique and np.bincount:
import numpy as np
pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97]
users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4',
'User4', 'User4']
# assign integer indices to each unique user name, and get the total
# number of occurrences for each name
unames, idx, counts = np.unique(users, return_inverse=True, return_counts=True)
# now sum the values of pred corresponding to each index value
sum_pred = np.bincount(idx, weights=pred)
# finally, divide by the number of occurrences for each user name
mean_pred = sum_pred / counts
print(unames)
# ['User0' 'User1' 'User2' 'User3' 'User4']
print(mean_pred)
# [ 0.45 0.55 0.55 0.435 0.81666667]
If you have pandas installed, DataFrames have some very nice methods for grouping and summarizing data:
import pandas as pd
df = pd.DataFrame({'name':users, 'pred':pred})
print(df.groupby('name').mean())
# pred
# name
# User0 0.450000
# User1 0.550000
# User2 0.550000
# User3 0.435000
# User4 0.816667
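If you then need the result back as plain numpy arrays rather than a DataFrame, the grouped result exposes them directly (.to_numpy() requires pandas 0.24+):
means = df.groupby('name')['pred'].mean()
print(means.index.to_numpy())  # the user names
print(means.to_numpy())        # the per-user averages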
If you want to stick to numpy, the simplest is to use np.unique and np.bincount:
>>> pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])
>>> users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
... 'User4', 'User4', 'User4'])
>>> unq, idx, cnt = np.unique(users, return_inverse=True, return_counts=True)
>>> avg = np.bincount(idx, weights=pred) / cnt
>>> unq
array(['User0', 'User1', 'User2', 'User3', 'User4'],
dtype='|S5')
>>> avg
array([ 0.45 , 0.55 , 0.55 , 0.435 , 0.81666667])
A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a solution similar to the vectorized one proposed by Jaime, but with a cleaner interface and more tests:
import numpy_indexed as npi
npi.group_by(users).mean(pred)
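The package is available on PyPI, typically installed with pip install numpy-indexed.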
I have a function which creates a NumPy array from a data file. I want to then get the maximum value in the array and the index of that value:
import numpy as np
def dostuff():
    # open .txt file into lists
    # copy lists into numpy array
    # nested for loops and values copied into numpy array called a
    print(a)
    print(np.max(a))
    print(np.argmax(a))

dostuff()
Running this gives:
[[ 0.64 0.47 0.22 0.1 0.05 0.02]
[ 2.19 9.13 10.68 6.44 3.36 1.77]
[ 1.84 8.81 12.6 8.31 4.45 2.35]]
2.35
0
Clearly something has gone wrong with np.max() and np.argmax(). This can be shown with the following code:
def test():
    a = np.array([[0.64, 0.47, 0.22, 0.1, 0.05, 0.02],
                  [2.19, 9.13, 10.68, 6.44, 3.36, 1.77],
                  [1.84, 8.81, 12.6, 8.31, 4.45, 2.35]])
    print(a)
    print(np.max(a))
    print(np.argmax(a))

test()
This gives:
[[ 0.64 0.47 0.22 0.1 0.05 0.02]
[ 2.19 9.13 10.68 6.44 3.36 1.77]
[ 1.84 8.81 12.6 8.31 4.45 2.35]]
12.6
14
...which is what I would have expected. I have no idea why these two (apparently) identical arrays give different results. Does anyone know what I may have done wrong?