How to feed random numbers as indices to pandas data frame? - python

I'm trying to get a random sample from two pandas DataFrames. If (random) rows 2, 5, 8 are selected from frame A, then the same rows 2, 5, 8 must be selected from frame B. My approach is to first generate a random sample and then use it to index the rows of both frames. How can I do that? The code should look like this:
idx = list(random.sample(range(X_train.shape[0]),5000))
lgstc_reg[i].fit(X_train[idx,:], y_train[idx,:])
However, running the code gives an error.

Use iloc:
indexes = [2,5,8] # in your case this is the randomly generated list
A.iloc[indexes]
B.iloc[indexes]
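Applied to the question's training code, a minimal sketch (assuming X_train and y_train are pandas objects of equal length; numpy's Generator replaces the random module here):
import numpy as np

rng = np.random.default_rng(0)
idx = rng.choice(X_train.shape[0], size=5000, replace=False)

# positional indexing selects the same rows from both frames
lgstc_reg[i].fit(X_train.iloc[idx], y_train.iloc[idx])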
An alternative way to sample consistently is to set a random seed and then sample:
random_seed = 42
A.sample(3, random_state=random_seed)
B.sample(3, random_state=random_seed)
The sampled DataFrames will have the same index (assuming A and B share the same index).

Hope this helps!
>>> df1
value ID
0 3 2
1 4 2
2 7 8
3 8 8
4 11 8
>>> df2
value distance
0 3 0
1 4 0
2 7 1
3 8 0
4 11 0
I have two data frames. I want to select random rows of df1 along with the corresponding rows of df2.
First I create selection_index, a list of random row labels of df1, using pandas' built-in sample method. Then I use this index to locate those rows in df1 and df2 with the help of another built-in method, loc.
>>> selection_index = df1.sample(2).index
>>> selection_index
Int64Index([3, 1], dtype='int64')
>>> df1.loc[selection_index]
value ID
3 8 8
1 4 2
>>> df2.loc[selection_index]
value distance
3 8 0
1 4 0
>>>
In your case, this would become something like:
idx = X_train.sample(5000).index
lgstc_reg[i].fit(X_train.loc[idx], y_train.loc[idx])
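If the draw needs to be reproducible across runs, sample accepts a seed, e.g.:
idx = X_train.sample(5000, random_state=42).index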

Related

python pandas new column with order of values

I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like a new column with 2,1,4,0,3,5, so that if I look at a row I get that row's value and its position in the sorted order of the whole list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank method:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
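For example, a small illustration:
s = pd.Series([4, 3, 3, 9])
s.rank().tolist()                # [3.0, 1.5, 1.5, 4.0] -- ties share the average rank
s.rank(method='first').tolist()  # [3.0, 1.0, 2.0, 4.0] -- ties broken by order of appearance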

Filtering after Multi Indexing in pandas iterable indexing

I want to make a subset of my DataFrame using pandas (or any other Python library) with hierarchical indexing, in a way that works regardless of the number of rows in one of the columns.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv(address)
trajectory frame x y
1 1 447,956 2,219
1 2 447,839 2,327
1 3 449,183 1,795
1 4 450,444 1,833
1 5 448,514 1,708
1 6 451,532 1,832
1 7 448,471 1,759
1 8 450,028 2,097
1 9 448,215 2,203
1 10 449,311 2,063
1 11 451,745 1,76
1 12 450,827 2,264
1 13 448,991 2,208
1 14 452,829 3,106
1 15 448,688 1,77
1 16 449,844 1,951
1 17 450,044 1,991
1 18 449,835 1,901
1 19 450,793 3,49
1 20 449,618 2,354
2 1 445.936 7.219
2 2 442.879 3.327
3 1 441.283 9.795
4 1 447.956 2.219
4 3 447.839 2.327
4 6 449.183 1.795
In this DataFrame there are 4 columns: 'trajectory', 'frame', 'x' and 'y'. The number of trajectories can differ from one DataFrame to another. Each trajectory can have multiple frames between 1 and 20, either sequential from 1-20 or with some frames missing. Each frame has its own value in the 'x' and 'y' columns.
My aim is to create a new DataFrame containing only those trajectories for which the 'frame' value is present in all 20 rows. As the number of rows in the 'trajectory' and 'frame' columns varies, I would like code that can be used in such conditions.
df_1 = df.set_index(['trajectory','frame'], drop=False)
Here, I did hierarchical indexing using 'trajectory' and 'frame', and then I found that trajectories number 1 and 6 have 20 frames in them. So I could manually select them using the following code:
df_1_subset = df_1[(df_1['trajectory'] == 1) | (df_1['trajectory'] == 6)]
However, I have multiple csv files, and in each DataFrame the trajectories that have 20 rows in the 'frame' column will be different, so I would have to do this manually every time. I think there must be a better way, but I just cannot seem to find it. I am very new to coding and I would really appreciate anybody's help. Thank you very much in advance.
If you need to filter the trajectory level for 1 or 6, use Index.get_level_values with Index.isin:
df_1_subset = df_1[df_1.index.get_level_values('trajectory').isin([1, 6])]
If you need to filter the trajectory level for 1 and the frame level for 6, select with DataFrame.loc and a tuple:
df_1_subset = df_1.loc[(1, 6)]
Alternative:
df_1_subset = df_1.loc[(df_1.index.get_level_values('trajectory') == 1) |
                       (df_1.index.get_level_values('frame') == 6)]
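Since the question ultimately asks to keep every trajectory that has all 20 frames without hard-coding the trajectory numbers, here is a sketch using the same isin idiom (assuming df_1 is the hierarchically indexed frame built above):
counts = df_1.groupby(level='trajectory').size()   # frames per trajectory
complete = counts[counts == 20].index               # trajectories with all 20 frames
df_1_subset = df_1[df_1.index.get_level_values('trajectory').isin(complete)]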

vectorise nested iterations by using groupby methods

I have written code to iterate through a dataset that has a demarcation column. This column consists of a value shared by all equally demarked rows. The code iterates through each demarcated section, with a nested loop iterating through each line to find the nearest neighbour for each row within its respective demarcated block.
import pandas as pd
import numpy as np
Create a df with XYZ and Section demark
p=5
df = pd.DataFrame(np.random.randn(100, 3), columns=list('XYZ'))
df2 = df.sort_values('Z')
df2 = df2.reset_index(drop=True)
df2['Section_demark'] = (df2.index/p).astype('int')
df2.head(15)
X Y Z Section_demark
0 -1.125526 -0.249091 -2.505444 0
1 0.710114 1.357477 -2.195904 0
2 -0.580319 -0.997311 -2.031280 0
3 1.311526 -0.268590 -1.741079 0
4 0.481450 0.448904 -1.546278 0
5 -1.820224 -0.846628 -1.392700 1
6 0.528618 0.418862 -1.388170 1
7 0.360560 -0.309429 -1.319548 1
8 -0.369107 -1.290528 -1.233815 1
9 0.139063 0.045076 -1.209820 1
10 0.049387 1.087300 -1.188375 2
11 0.678247 -1.191882 -1.172214 2
12 -0.976294 -0.752081 -1.092286 2
13 0.875952 0.319304 -1.079185 2
14 0.469730 -0.329548 -1.044178 2
Function for Euclidean distance:
def eucl_d(item_id):
    # subtract row `item_id` from every row of df3
    a = df3.sub(df3.iloc[item_id], axis=1)
    # squared Euclidean distance of each row from row `item_id`
    b = np.sum(np.square(a), axis=1)
    return b
Iterate through the section demarks, and within each Section_demark iterate through the lines to find the nearest neighbour: isolate the row nearest to the current row as a series, take that series' name (its index label), and compile a list from it. Then read the list back into df2, creating a new column with the nearest neighbour's index number as the value:
s = 0
elements = []
while s < (len(df2) / p):
    df3 = df2[df2['Section_demark'] == s]
    r = 0
    while r < p:
        df4 = df3.copy()
        df4['dist'] = eucl_d(r)
        df4 = df4.sort_values('dist')
        ser = df4.iloc[1]        # nearest row other than the row itself
        elements.append(ser.name)
        r = r + 1
    s = s + 1
df2["NNIX"] = elements
df2.head(10)
X1 Y1 Z1 NNIX
0 0.002299 1.284195 -1.604009 1
1 -0.444305 0.346856 -2.396538 0
2 -0.490741 -1.416682 -1.423573 3
3 0.203635 -0.676841 -1.596332 2
4 0.002299 1.284195 -1.604009 1
5 -0.314330 0.036554 -1.153127 6
6 -0.387839 0.129000 -1.235331 5
7 -0.314330 0.036554 -1.153127 6
8 -0.059477 -0.205260 -1.136376 7
9 0.717980 0.130665 -1.040372 8
I would like to replace the last section of iteration with a groupby command and use aggregate or apply to run the eucl_d function, but it eludes me.
I can get df2 grouped by running this:
grouped = df2.groupby('Section_demark')
It's the second step that is giving me trouble.
I was thinking:
grouped.agg(eucl_d(item_id))
But I don't know how to specify the item_id for eucl_d(item_id).
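One possible sketch (illustrative, with a hypothetical helper named nearest_neighbors) that replaces the nested loops with groupby.apply: compute all pairwise squared distances within each group at once and pick, for every row, the label of the nearest other row, which reproduces the iloc[1]-after-sorting logic above:
def nearest_neighbors(group):
    # pairwise squared Euclidean distances within one Section_demark
    xyz = group[['X', 'Y', 'Z']].values
    dist = np.square(xyz[:, None, :] - xyz[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)   # exclude each row from its own neighbours
    # index label of the nearest other row, aligned with the group's index
    return pd.Series(group.index[dist.argmin(axis=1)], index=group.index)

df2['NNIX'] = df2.groupby('Section_demark', group_keys=False).apply(nearest_neighbors)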

Re-shaping pandas data frame using shape or pivot_table (stack each row)

I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do. Suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the column identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do it with pd.stack() or pd.pivot_table(), but I have read the documentation and cannot figure out how. Instead of appending all columns to the end of the next, I just want to append pairs (or actually triplets) of values from each row.
Just to add some more flesh to the bones of what I want to do:
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this to re-stacked into this form (where column labels have been changed again, to the same for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just make a for-loop applying the following logic to each row:
row.reshape(df.shape[1] // 3, 3)
But then you would have to compute each row individually, and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs, or actually triplets, of values), and then stack those row-stacks for the entire data frame, preferably on the entire data frame at once (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1, 3),
                      index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
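The two-column pairs layout from the question's first example follows the same pattern; each row of six values is split into three pairs, so the index is repeated three times:
result2 = pd.DataFrame(df.values.reshape(-1, 2),
                       index=df.index.repeat(3), columns=list('XY'))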

Pandas Multi-Column Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any index (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to make the selection using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator-based boolean method would be equivalent to:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully, you can see that this would be a really useful 'template' for operating on large DataFrames, with the boolean indexing defined in boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of df's columns with text. The definition of df using random numbers was simply given as a starter, if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example, all columns are filled with ints 0, 1, 2 or 3.
Lets define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns from rows where certain other columns all contain the value 2.
You can then get the result with:
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Let's check the filter columns for those rows:
df[list(filt)].iloc[[43, 98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[list(filt)].iloc[[44, 99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
I'm starting to like Pandas more and more.
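As an aside, a short alternative sketch that builds the combined mask directly from the dict, comparing each column to its required value and reducing with logical AND (this sidesteps the row-wise apply):
mask = np.logical_and.reduce([df[col] == val for col, val in filt.items()])
result = df.loc[mask, req]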
