Given an example dataframe:
import pandas as pd
import numpy as np
values = np.array([
[0, 0.5, 1, 0, 0, 3],
[1, 0, 0, 1, 1, 0 ],
[0, 0.5, 0, 0, 2, 1],
[0, 0, 0, 0, 4, 0],
])
indexes = ['a', 'b', 'c', 'd']
columns = ['ab', 'bc', 'cd', 'de', 'ef', 'fg']
df = pd.DataFrame(index=indexes, columns=columns, data=values)
print(df)
which looks like:
ab bc cd de ef fg
a 0.0 0.5 1.0 0.0 0.0 3.0
b 1.0 0.0 0.0 1.0 1.0 0.0
c 0.0 0.5 0.0 0.0 2.0 1.0
d 0.0 0.0 0.0 0.0 4.0 0.0
desired output:
ab bc cd de ef fg
a 0.0 0.5 1.0 0.0 0.0 3.0
b 1.0 0.0 0.0 1.0 1.0 0.0
c 0.0 0.5 0.0 0.0 2.0 1.0
d 0.0 0.0 0.0 0.0 4.0 0.0
e NaN NaN NaN NaN 7.0 4.0
Is it somehow possible to add a row that displays only the sums of the last two columns (below their respective columns, of course)?
Thanks for your attention!
Edit: ahh, thanks for the clarification. You create a new row and assign it the sum of the last two columns. The iloc indexer has the format [row, col], so we want : (all rows) but only the final two columns (-2:). The sum() returns a Series indexed by those two column labels, so pandas aligns it on assignment and fills the remaining columns of the new row with NaN.
df.loc['e'] = df.iloc[:,-2:].sum()
Result:
>>> df
ab bc cd de ef fg
a 0.0 0.5 1.0 0.0 0.0 3.0
b 1.0 0.0 0.0 1.0 1.0 0.0
c 0.0 0.5 0.0 0.0 2.0 1.0
d 0.0 0.0 0.0 0.0 4.0 0.0
e NaN NaN NaN NaN 7.0 4.0
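For clarity, the right-hand side of the assignment is a Series indexed only by the last two column labels (shown here as computed on the original four-row frame), which is why the remaining columns of row 'e' come out as NaN after label alignment:
s = df.iloc[:, -2:].sum()   # on the original four-row df
print(s)
# ef    7.0
# fg    4.0
# dtype: float64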
Old answer:
I assume you mean the last two rows...
You can use pd.concat here
pd.concat([df,df.iloc[-2,:] + df.iloc[-1:]])
Result:
>>> pd.concat([df,df.iloc[-2,:] + df.iloc[-1:]])
ab bc cd de ef fg
a 0.0 0.5 1.0 0.0 0.0 3.0
b 1.0 0.0 0.0 1.0 1.0 0.0
c 0.0 0.5 0.0 0.0 2.0 1.0
d 0.0 0.0 0.0 0.0 4.0 0.0
d 0.0 0.5 0.0 0.0 6.0 1.0
You can use loc
df.loc['sum_c_d'] = df.iloc[-2:].sum()
ab bc cd de ef fg
a 0 0.5 1 0 0 3
b 1 0.0 0 1 1 0
c 0 0.5 0 0 2 1
d 0 0.0 0 0 4 0
sum_c_d 0 0.5 0 0 6 1
Related
I've spent a little while on this and got an answer, but it seems a little convoluted, so I'm curious whether people have a better solution.
Given a list, I want a table indicating all the possible combinations of its elements.
import pandas as pd
from itertools import combinations

sample_list = ['a', 'b', 'c', 'd']

(pd.concat(
    [pd.DataFrame(
         [dict.fromkeys(c, 1) for c in combinations(sample_list, j)]
     )
     for j in range(len(sample_list) + 1)]
 )
 .fillna(0)
 .reset_index(drop=True)
)
With the result, as desired:
a b c d
0 0.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0
3 0.0 0.0 1.0 0.0
4 0.0 0.0 0.0 1.0
5 1.0 1.0 0.0 0.0
6 1.0 0.0 1.0 0.0
7 1.0 0.0 0.0 1.0
8 0.0 1.0 1.0 0.0
9 0.0 1.0 0.0 1.0
10 0.0 0.0 1.0 1.0
11 1.0 1.0 1.0 0.0
12 1.0 1.0 0.0 1.0
13 1.0 0.0 1.0 1.0
14 0.0 1.0 1.0 1.0
15 1.0 1.0 1.0 1.0
For learning purposes I would like to know better solutions.
Thanks
Check the code below. Every subset of the list corresponds to a 0/1 indicator vector over its elements, so enumerating all binary vectors of length len(sample_list) with itertools.product enumerates all combinations:
import itertools
import pandas as pd
sample_list = ['a', 'b', 'c', 'd']
pd.DataFrame(list(itertools.product([0, 1], repeat=len(sample_list))), columns=sample_list)
Output:
    a  b  c  d
0   0  0  0  0
1   0  0  0  1
2   0  0  1  0
3   0  0  1  1
4   0  1  0  0
5   0  1  0  1
6   0  1  1  0
7   0  1  1  1
8   1  0  0  0
9   1  0  0  1
10  1  0  1  0
11  1  0  1  1
12  1  1  0  0
13  1  1  0  1
14  1  1  1  0
15  1  1  1  1
The rows come out in binary-counting order rather than grouped by combination size, but the same 16 combinations are present.
Hello,
I'm a newbie using Python.
First of all, I want to normalize my data. There is no problem when I normalize "datalatih" (the training data), but an error occurs when I try to normalize "datauji" (the test data). I already use a different variable to normalize each of them.
Here is my data:
df = pd.read_csv("datalatihnodummy.csv", sep=';')
where datalatih is rows [:6] and datauji is rows [6:].
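Presumably the split looks something like this (an assumption, since the question doesn't show it):
datalatih = df[:6]   # training rows
datauji = df[6:]     # test rows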
Here is my code that works:
minperfeature = []
maxperfeature = []
for i in range(len(data.columns)):
    minperfeature.append(min(data[data.columns[i]]))
    maxperfeature.append(max(data[data.columns[i]]))
print(minperfeature)
print(maxperfeature)
datanormalisasi = datalatih
for i in range(len(datalatih.index)):
    for j in range(len(datalatih.columns)):
        datanormalisasi.loc[i, datalatih.columns[j]] = (datanormalisasi.loc[i, datalatih.columns[j]] - minperfeature[j]) / (maxperfeature[j] - minperfeature[j])
datanormalisasi
[12, 17, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[48, 135, 623, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
(normalized datalatih, one row per sample, 20 feature columns)
0  0.638889  0.652542  0.409165  0.0  1.0  0.0  1.0  1.0  1.0  0.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0
1  0.000000  0.305085  0.409165  1.0  1.0  0.0  1.0  1.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  1.0  1.0  1.0
2  0.527778  0.500000  0.274959  1.0  1.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0  1.0  0.0  1.0  0.0
3  0.666667  0.042373  0.016367  0.0  1.0  0.0  1.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  1.0
4  0.277778  0.000000  0.000000  0.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  1.0  1.0
5  1.000000  0.025424  0.018003  1.0  0.0  0.0  1.0  0.0  0.0  1.0  1.0  0.0  1.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0
Here is the code that errors:
datanormalisasiUji = datauji
for i in range(len(datauji.index)):
    for j in range(len(datauji.columns)):
        datanormalisasiUji.loc[i, datauji.columns[j]] = (datanormalisasiUji.loc[i, datauji.columns[j]] - minperfeature[j]) / (maxperfeature[j] - minperfeature[j])
datanormalisasiUji
The result was:
'the label [0] is not in the [index]'
I don't know where the error is in my code; I already tried to solve the problem by using different variables, but still can't.
Does anyone know how to solve it? Thanks in advance ^^
Let's say that you want to normalize the columns: [u'Umur', u'ALT/SGOT', u'AST/SGPT', u'Anoreksia', u'Mual'] since they contain numerical values.
For the min-max normalization use this:
df = pd.read_csv("datalatihnodummy.csv", sep=';')
df_new = df.iloc[:,1:6]
df_new.head(3)
Umur ALT/SGOT AST/SGPT Anoreksia Mual
0 35 94 262 0 1
1 12 53 262 1 1
2 31 76 180 1 1
results = (df_new - df_new.min()) / (df_new.max() - df_new.min())
results
       Umur  ALT/SGOT  AST/SGPT  Anoreksia  Mual
0  0.638889  0.652542  0.409165        0.0   1.0
1  0.000000  0.305085  0.409165        1.0   1.0
2  0.527778  0.500000  0.274959        1.0   1.0
3  0.666667  0.042373  0.016367        0.0   1.0
4  0.277778  0.000000  0.000000        0.0   1.0
5  1.000000  0.025424  0.018003        1.0   0.0
6  0.166667  1.000000  1.000000        1.0   1.0
7  0.833333  0.432203  0.000000        0.0   0.0
Note the parentheses around df_new - df_new.min(); without them, operator precedence would subtract df_new.min() / (df_new.max() - df_new.min()) from the raw values instead of normalizing them.
Explanation: pandas is smart; by typing df_new.min() it computes the minimum of every column at once.
df_new.min()
Umur 12
ALT/SGOT 17
AST/SGPT 12
Anoreksia 0
Mual 0
dtype: int64
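As for the error in the question itself: datauji = df[6:] keeps the original row labels 6, 7, ..., so the loop's datauji.loc[0, ...] looks up label 0, which doesn't exist. A minimal sketch of a fix, assuming the split from the question (note also that datanormalisasiUji = datauji merely aliases the frame; use .copy() to avoid mutating the original):
datauji = df[6:].reset_index(drop=True)   # relabel the rows as 0, 1, ...
datanormalisasiUji = datauji.copy()

for i in range(len(datauji.index)):
    for j in range(len(datauji.columns)):
        datanormalisasiUji.loc[i, datauji.columns[j]] = (
            datauji.loc[i, datauji.columns[j]] - minperfeature[j]
        ) / (maxperfeature[j] - minperfeature[j])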
Suppose we are given a DataFrame like the following one:
import pandas as pd
import numpy as np
a = ['a', 'b']
b = ['i', 'ii']
mi = pd.MultiIndex.from_product([a,b], names=['first', 'second'])
A = pd.DataFrame(np.zeros([3,4]), columns=mi)
first a b
second i ii i ii
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
I would like to create a new column iii under every first-level column and assign it the values of a new array (of matching size). I tried the following, to no avail.
A.loc[:,pd.IndexSlice[:,'iii']] = np.arange(6).reshape(3,-1)
The result should look like this:
a b
i ii iii i ii iii
0 0.0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 2.0 0.0 0.0 3.0
2 0.0 0.0 4.0 0.0 0.0 5.0
Since you have a MultiIndex in the columns (a .loc assignment with an IndexSlice can only target existing labels, not create new ones), I recommend creating the additional DataFrame to append, then concatenating it back:
appenddf = pd.DataFrame(np.arange(6).reshape(3, -1),
                        index=A.index,
                        columns=pd.MultiIndex.from_product([A.columns.levels[0], ['iii']]))
appenddf
a b
iii iii
0 0 1
1 2 3
2 4 5
A = pd.concat([A, appenddf], axis=1).sort_index(level=0, axis=1)
A
first a b
second i ii iii i ii iii
0 0.0 0.0 0 0.0 0.0 1
1 0.0 0.0 2 0.0 0.0 3
2 0.0 0.0 4 0.0 0.0 5
Another workable solution:
for i, x in enumerate(A.columns.levels[0]):
    A[x, 'iii'] = np.arange(6).reshape(3, -1)[:, i]
A
first a b a b
second i ii i ii iii iii
0 0.0 0.0 0.0 0.0 0 1
1 0.0 0.0 0.0 0.0 2 3
2 0.0 0.0 0.0 0.0 4 5
# here I did not add sort_index, so the new columns are appended
# at the end rather than regrouped under a and b
I am working with a large array of 1s and need to systematically zero out sections of it. The large array is composed of many smaller arrays; for each smaller array I need to replace its upper and lower triangles with 0s. For example, take an array with 5 sub-arrays indicated by the index value (all sub-arrays have the same number of columns):
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
4 1.0 1.0 1.0
I want each group of rows to be modified in its upper and lower triangle such that the resulting matrix is:
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
At the moment I am using only numpy to achieve this resulting array, but I think I can speed it up using pandas grouping. In reality my dataset is very large, almost 500,000 rows long. The numpy code is below:
import numpy as np

candidateLengths = np.array([1, 2, 3, 4, 5])
centroidLength = 3

smallPaths = [min(l, centroidLength) for l in candidateLengths]
# These are the k values of zeros to delete, to be used in np.tri
k_vals = list(map(lambda smallPath: centroidLength - smallPath, smallPaths))

maskArray = np.ones((np.sum(candidateLengths), centroidLength))
startPos = 0
endPos = 0
for canNo, canLen in enumerate(candidateLengths):
    a = np.ones((canLen, centroidLength))
    a *= np.tri(*a.shape, dtype=bool, k=k_vals[canNo])  # np.bool is removed in modern NumPy; use bool
    b = np.fliplr(np.flipud(a))
    c = a * b
    endPos = startPos + canLen
    maskArray[startPos:endPos, :] = c
    startPos = endPos
print(maskArray)
When I run this on my real dataset it takes nearly 5-7 seconds to execute. I think this is down to the massive for loop. How can I use pandas groupings to achieve higher speed? Thanks
New Answer
def tris(n, m):
    # n x m banded block of ones: zeros in the upper-right and
    # lower-left triangles
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]

idx = np.append(df.index.values, -1)                    # index values plus a sentinel
w = np.append(-1, np.flatnonzero(idx[:-1] != idx[1:]))  # last position of each index run
c = np.diff(w)                                          # length of each run
df * np.vstack([tris(n, 3) for n in c])                 # one banded block per group
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
Old Answer
I define some helper triangle functions
def tris(n, m):
    if n < m:
        a = np.tri(m, n, dtype=int).T
    else:
        a = np.tri(n, m, dtype=int)
    return a * a[::-1, ::-1]

def tris_df(df):
    n, m = df.shape
    return pd.DataFrame(tris(n, m), df.index, df.columns)
Then
df * df.groupby(level=0, group_keys=False).apply(tris_df)
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 0.0
1 0.0 1.0 1.0
2 1.0 0.0 0.0
2 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
3 1.0 1.0 0.0
3 0.0 1.0 1.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
4 1.0 1.0 0.0
4 1.0 1.0 1.0
4 0.0 1.0 1.0
4 0.0 0.0 1.0
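For reference, a minimal sketch of how the example frame can be rebuilt so either answer runs end to end (the construction is assumed; the question only shows the printed frame):
import numpy as np
import pandas as pd

# 15 rows of ones with a repeated index: group g (g = 0..4) has g+1 rows
df = pd.DataFrame(np.ones((15, 3)),
                  index=np.repeat(np.arange(5), [1, 2, 3, 4, 5]))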
I've got a series of measurements in a 2D array such as
T mu1 mu2 mu3 a b c d e
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 3.0 0.0 0.0 0.0 0.0 0.0
where T, mu1, mu2 and mu3 are the 4 axes of the variables I control (independent variables). a, b, c, d and e are the measurements I've made (dependent variables).
I would like to convert this 2D array into a 5D array in numpy. By specifying T, mu1, mu2 and mu3 (or at least their 4 indexes) I want to be able to retrieve the corresponding a, b, c, d and e values.
Is there a straightforward way to reshape this kind of array by specifying what columns the axes correspond to? The MultiIndex in Pandas seemed to smartly organize it in a table, but seems ill-suited for high dimensional arrays. I won't necessarily know ahead of time what the shape of the ndarray should be, but it seems to me that based on the values it should be possible to reshape the array properly. The increment values for each axis might also be different, but they will always be uniform.
My current idea involves ignoring the mu1, mu2 and mu3 columns, and stacking sets of T data into a 3D array. From there I would stack sets of 3D mu1 data into a 4D array, and repeat the process with mu2 and mu3. This seems like a tedious process that should have a simple solution though.
First, let's make some fake data:
# an N x 5 array containing a regular mesh representing the stimulus params
stim_params = np.mgrid[:2, :3, :4, :5, :6].reshape(5, -1).T
# an N x 3 array representing the output values for each simulation run
output_vals = np.arange(720 * 3).reshape(720, 3)
# shuffle the rows for a bit of added realism
shuf = np.random.permutation(stim_params.shape[0])
stim_params = stim_params[shuf]
output_vals = output_vals[shuf]
Now you can use np.lexsort to get the set of indices that will sort the rows of your 2D array of simulation parameters such that the values in each column are in ascending order. Having done that, you can apply these indices to the rows of simulation output values.
# get the number of unique values for each stimulus parameter
params_shape = tuple(np.unique(col).shape[0] for col in stim_params.T)
# get the set of row indices that will sort the stimulus parameters in ascending
# order, starting with the final column
idx = np.lexsort(stim_params[:, ::-1].T)
# sort and reshape the stimulus parameters:
sorted_params = stim_params[idx].T.reshape((5,) + params_shape)
# sort and reshape the output values
sorted_output = output_vals[idx].T.reshape((3,) + params_shape)
I find that the hardest part is often just trying to wrap your head around what all the different dimensions of the outputs correspond to:
# array of stimulus parameters, with dimensions (n_params, p1, p2, p3, p4, p5)
print(sorted_params.shape)
# (5, 2, 3, 4, 5, 6)
# to check that the sorting worked as expected, we can look at the values of the
# 5th parameter when all the others are held constant at 0:
print(sorted_params[4, 0, 0, 0, 0, :])
# [0 1 2 3 4 5]
# ... and the 1st parameter when we hold all the others constant:
print(sorted_params[0, :, 0, 0, 0, 0])
# [0, 1]
# ... now let the 1st and 2nd parameters covary:
print(sorted_params[:2, :, :, 0, 0, 0])
# [[[0 0 0]
# [1 1 1]]
# [[0 1 2]
# [0 1 2]]]
Hopefully you get the idea. The same indexing logic applies to the sorted simulation outputs:
# array of outputs, with dimensions (n_outputs, p1, p2, p3, p4, p5)
print(sorted_output.shape)
# (3, 2, 3, 4, 5, 6)
# the first output variable whilst holding the first 4 simulation parameters
# constant at 0:
print(sorted_output[0, 0, 0, 0, 0, :])
# [ 0 3 6 9 12 15]
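To close the loop on the original question, retrieving a..e for given parameter values (rather than indexes) can be done with np.searchsorted on each axis. A sketch, assuming every queried value lies exactly on the grid; lookup is a hypothetical helper, not part of the answer above:
def lookup(sorted_params, sorted_output, values):
    # map each physical parameter value to its grid index along its axis,
    # then pull all output variables at that grid point
    idx = tuple(
        np.searchsorted(np.unique(sorted_params[d]), v)
        for d, v in enumerate(values)
    )
    return sorted_output[(slice(None),) + idx]

# e.g. all 3 output variables at the grid point (1, 2, 3, 4, 5)
print(lookup(sorted_params, sorted_output, (1, 2, 3, 4, 5)))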