I am trying to build an algorithm for finding the number of clusters. I need to assign random points from a data set as the initial means.
I first tried the following code:
mu=random.sample(df,10)
It gave an index out of range error.
I then converted the DataFrame to a NumPy array and did:
mu=random.sample(np.array(df).tolist(),10)
Instead of giving me 10 values as means, it gives me 10 arrays of values.
How can I get 10 values from the dataframe to initialise the means for 10 clusters?
Use numpy.random.choice:
df.iloc[np.random.choice(np.arange(len(df)), 10, replace=False)]
Or numpy.random.permutation:
df.loc[np.random.permutation(df.index)[:10]]
a b c
11 2 9 9
1 7 7 0
16 5 1 8
15 0 8 2
17 1 5 4
19 5 0 9
10 7 7 0
8 4 4 3
6 6 2 4
14 7 6 2
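For reproducible draws, the random state can be seeded first. A minimal sketch using NumPy's `default_rng` generator (the dataframe here is made-up dummy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(20), 'b': range(20, 40)})

rng = np.random.default_rng(0)                      # seeded generator, reproducible
idx = rng.choice(len(df), size=10, replace=False)   # 10 distinct row positions
mu = df.iloc[idx]
print(mu)
```

Because `replace=False`, every drawn position is unique, so no row is picked twice.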
I think you need DataFrame.sample:
mu = df.sample(10)
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(20,3)), columns=list('abc'))
print (df)
a b c
0 8 8 3
1 7 7 0
2 4 2 5
3 2 2 2
4 1 0 8
5 4 0 9
6 6 2 4
7 1 5 3
8 4 4 3
9 7 1 1
10 7 7 0
11 2 9 9
12 3 2 5
13 8 1 0
14 7 6 2
15 0 8 2
16 5 1 8
17 1 5 4
18 2 8 3
19 5 0 9
mu = df.sample(10)
print (mu)
a b c
11 2 9 9
1 7 7 0
8 4 4 3
5 4 0 9
2 4 2 5
19 5 0 9
13 8 1 0
14 7 6 2
0 8 8 3
9 7 1 1
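Since the goal is 10 initial means for clustering, the sampled rows can be converted to a plain NumPy array, one mean vector per row. A sketch assuming pandas >= 0.24 for `to_numpy` (on older versions `.values` does the same):

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(20, 3)), columns=list('abc'))

# each sampled row becomes one initial mean vector for one cluster
mu = df.sample(10, random_state=0).to_numpy()
print(mu.shape)  # (10, 3): 10 means, 3 features each
```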
Consider a DataFrame with only one column named values.
data_dict = {'values': [5, 4, 3, 8, 6, 1, 2, 9, 2, 10]}
df = pd.DataFrame(data_dict)
display(df)
The output will look something like:
values
0 5
1 4
2 3
3 8
4 6
5 1
6 2
7 9
8 2
9 10
I want to generate a new column that will have the trailing high value of the previous column.
Expected Output:
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
Right now I am using a for loop over df.iterrows() and calculating the value at each row, which makes the code very slow.
Can anyone share a vectorized approach to speed this up?
Use .cummax:
df["trailing_high"] = df["values"].cummax()
print(df)
Output
values trailing_high
0 5 5
1 4 5
2 3 5
3 8 8
4 6 8
5 1 8
6 2 8
7 9 9
8 2 9
9 10 10
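The same running maximum can also be written with an expanding window, which generalizes to other trailing statistics (mean, min, etc.). This is an equivalent sketch, not part of the original answer; note that `expanding().max()` returns floats, so the result is cast back to int here:

```python
import pandas as pd

df = pd.DataFrame({'values': [5, 4, 3, 8, 6, 1, 2, 9, 2, 10]})

# expanding().max() computes the running maximum, same result as cummax()
df['trailing_high'] = df['values'].expanding().max().astype(int)
print(df['trailing_high'].tolist())  # [5, 5, 5, 8, 8, 8, 8, 9, 9, 10]
```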
I have a dataframe as below
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
I want to multiply every 3rd column, starting from the 2nd column, in the last 2 rows by 5, to get the output below.
How can I accomplish this?
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 10 3 4 25 6 7 40 9
1 10 3 4 25 6 7 40 9
I am able to select the cells I need with df.iloc[-2:,1::3],
which results in the frame below, but I am not able to proceed further.
B E H
2 5 8
2 5 8
I know that I can select the same cells with loc instead of iloc, and then the calculation is straightforward, but I am not able to figure it out.
The column names and cell values cannot be used, since they change (the df here is just dummy data).
You can assign back to the same selection of rows/columns:
df.iloc[-2:,1::3] = df.iloc[-2:,1::3].mul(5)
#alternative
#df.iloc[-2:,1::3] = df.iloc[-2:,1::3] * 5
print (df)
A B C D E F G H I
0 1 2 3 4 5 6 7 8 9
1 1 2 3 4 5 6 7 8 9
2 1 2 3 4 5 6 7 8 9
3 1 2 3 4 5 6 7 8 9
4 1 2 3 4 5 6 7 8 9
5 1 10 3 4 25 6 7 40 9
6 1 10 3 4 25 6 7 40 9
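Because `.iloc` supports assignment, the same update can also be written in place with an augmented assignment; a minimal sketch on dummy data matching the question:

```python
import numpy as np
import pandas as pd

# 7 identical rows of 1..9, columns A..I (dummy data as in the question)
df = pd.DataFrame(np.tile(np.arange(1, 10), (7, 1)), columns=list('ABCDEFGHI'))

# multiply every 3rd column starting at the 2nd, in the last 2 rows only
df.iloc[-2:, 1::3] *= 5
print(df.tail(2))
```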
I want to randomly set the IsShade column (the output) to 1, such that within each group the value 1 is assigned exactly D times (the Shading column, e.g. 2, 5 or 3 times) out of the group's E rows (the Total column, e.g. 6, 8 or 5 rows).
There are 1 million rows in the dataset; a sample input is shown below.
Input:
In[1]:
Sr Series Parallel Shading Total Cell
0 0 3 2 2 6 1
1 1 3 2 2 6 2
2 2 3 2 2 6 3
3 3 3 2 2 6 4
4 4 3 2 2 6 5
5 5 3 2 2 6 6
6 6 4 2 5 8 1
7 7 4 2 5 8 2
8 8 4 2 5 8 3
9 9 4 2 5 8 4
10 10 4 2 5 8 5
11 11 4 2 5 8 6
12 12 4 2 5 8 7
13 13 4 2 5 8 8
14 14 5 1 3 5 1
15 15 5 1 3 5 2
16 16 5 1 3 5 3
17 17 5 1 3 5 4
18 18 5 1 3 5 5
Any help or Python code to achieve this would be appreciated. Thank you.
Example Expected Output:
Out[1]:
Sr Series Parallel Shading Total Cell IsShade
0 0 3 2 2 6 1 0
1 1 3 2 2 6 2 0
2 2 3 2 2 6 3 1
3 3 3 2 2 6 4 0
4 4 3 2 2 6 5 0
5 5 3 2 2 6 6 1
6 6 4 2 5 8 1 1
7 7 4 2 5 8 2 0
8 8 4 2 5 8 3 1
9 9 4 2 5 8 4 1
10 10 4 2 5 8 5 0
11 11 4 2 5 8 6 0
12 12 4 2 5 8 7 1
13 13 4 2 5 8 8 1
14 14 5 1 3 5 1 0
15 15 5 1 3 5 2 1
16 16 5 1 3 5 3 0
17 17 5 1 3 5 4 1
18 18 5 1 3 5 5 1
You can create a new column by using .groupby and randomly selecting x rows per group with .sample, where x is the integer in the Shading column. From there, I returned True or False and converted to an integer with .astype(int) (True becomes 1, False becomes 0):
s = df['Series'].ne(df['Series'].shift()).cumsum() #s is a unique identifier group
df['IsShade'] = (df.groupby(s, group_keys=False)
.apply(lambda x: x['Shading'].sample(x['Shading'].iloc[0])) > 0)
df['IsShade'] = df['IsShade'].fillna(False).astype(int)
df
Out[1]:
Sr Series Parallel Shading Total Cell IsShade
0 0 3 2 2 6 1 0
1 1 3 2 2 6 2 0
2 2 3 2 2 6 3 0
3 3 3 2 2 6 4 0
4 4 3 2 2 6 5 1
5 5 3 2 2 6 6 1
6 6 4 2 5 8 1 1
7 7 4 2 5 8 2 1
8 8 4 2 5 8 3 0
9 9 4 2 5 8 4 0
10 10 4 2 5 8 5 1
11 11 4 2 5 8 6 1
12 12 4 2 5 8 7 1
13 13 4 2 5 8 8 0
14 14 5 1 3 5 1 1
15 15 5 1 3 5 2 0
16 16 5 1 3 5 3 0
17 17 5 1 3 5 4 1
18 18 5 1 3 5 5 1
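If each contiguous block of rows carries a distinct `Series` value (as in the sample data), a shorter alternative is a per-group random permutation with `transform`. This is a sketch under that assumption, not a drop-in replacement when `Series` values repeat across separate blocks:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Series':  [3] * 6 + [4] * 8 + [5] * 5,
    'Shading': [2] * 6 + [5] * 8 + [3] * 5,
})

# within each group, a random permutation of 0..len-1 compared against the
# group's Shading value places exactly that many 1s at random positions
df['IsShade'] = (
    df.groupby('Series')['Shading']
      .transform(lambda s: np.random.permutation(len(s)) < s.iloc[0])
      .astype(int)
)
print(df.groupby('Series')['IsShade'].sum())  # matches Shading: 2, 5, 3
```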
Actually, I thought this would be very easy. I have a pandas data frame with, let's say, 100 columns, and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:,0:30]
df_2 = df.iloc[:,77:99]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
Is there an easier way?
Use numpy.r_ to concatenate indices:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
Sample:
df = pd.DataFrame(np.random.randint(10, size=(5,15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
df_1 = df.iloc[:, 0:3]
df_2 = df.iloc[:, 7:9]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
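Note that the `join_axes` argument was removed from `pd.concat` in pandas 1.0; on recent versions the same alignment can be expressed with `reindex`. A minimal sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(10, size=(5, 15)))

df_1 = df.iloc[:, 0:3]
df_2 = df.iloc[:, 7:9]
# reindex replaces the removed join_axes=[df_1.index] argument
df2 = pd.concat([df_1, df_2], axis=1).reindex(df_1.index)
print(df2.columns.tolist())  # [0, 1, 2, 7, 8]
```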
I have a data frame as shown below:
import pandas as pd
Data = pd.DataFrame({'L1': [1,2,3,4,5], 'L2': [6,7,3,5,6], 'ouptput':[10,11,12,13,14]})
Data
Yields,
L1 L2 ouptput
0 1 6 10
1 2 7 11
2 3 3 12
3 4 5 13
4 5 6 14
I want to loop through the data, removing n values from the 'ouptput' column of Data, where n = [1,2,3,4], and assigning the result to a new data frame 'Test_Data'. For example, if I assign n = 2 the function should produce
Test_Data - iteration 1 as
L1 L2 ouptput
0 1 6
1 2 7
2 3 3 12
3 4 5 13
4 5 6 14
Test_Data - iteration 2 as
L1 L2 ouptput
0 1 6 10
1 2 7 11
2 3 3
3 4 5
4 5 6 14
Likewise it should produce a data frame with 2 values removed from the 'ouptput' column each time. It should produce a new output (a new combination) every time; no output should be repeated. I should also have control over the number of iterations: for example, 5C3 has 10 possible combinations, but I should be able to stop after 8 iterations.
This is not a great solution, but it will probably help you achieve what you are looking for:
import pandas as pd
Data = pd.DataFrame({'L1': [1,2,3,4,5], 'L2': [6,7,3,5,6], 'output':[10,11,12,13,14]})
num_iterations = 1
num_values = 3
for i in range(num_iterations):
    tmp_data = Data.copy()
    # blank out the next num_values entries of 'output' on each iteration
    tmp_data.loc[i * num_values:num_values * (i + 1) - 1, 'output'] = ''
    print(tmp_data)
This gives you a concatenated dataframe with every combination, using pd.concat and itertools.combinations:
from itertools import combinations
import pandas as pd
def mask(df, col, idx):
d = df.copy()
d.loc[list(idx), col] = ''
return d
n = 2
pd.concat({c: mask(Data, 'ouptput', c) for c in combinations(Data.index, n)})
L1 L2 ouptput
0 1 0 1 6
1 2 7
2 3 3 12
3 4 5 13
4 5 6 14
2 0 1 6
1 2 7 11
2 3 3
3 4 5 13
4 5 6 14
3 0 1 6
1 2 7 11
2 3 3 12
3 4 5
4 5 6 14
4 0 1 6
1 2 7 11
2 3 3 12
3 4 5 13
4 5 6
1 2 0 1 6 10
1 2 7
2 3 3
3 4 5 13
4 5 6 14
3 0 1 6 10
1 2 7
2 3 3 12
3 4 5
4 5 6 14
4 0 1 6 10
1 2 7
2 3 3 12
3 4 5 13
4 5 6
2 3 0 1 6 10
1 2 7 11
2 3 3
3 4 5
4 5 6 14
4 0 1 6 10
1 2 7 11
2 3 3
3 4 5 13
4 5 6
3 4 0 1 6 10
1 2 7 11
2 3 3 12
3 4 5
4 5 6
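To stop after a fixed number of combinations (e.g. 8 of the 10 possible pairs), `itertools.islice` can cap the iterator before it is exhausted. A sketch reusing the same `mask` helper; since `combinations` never repeats an index pair, no output is repeated:

```python
from itertools import combinations, islice
import pandas as pd

Data = pd.DataFrame({'L1': [1, 2, 3, 4, 5],
                     'L2': [6, 7, 3, 5, 6],
                     'ouptput': [10, 11, 12, 13, 14]})

def mask(df, col, idx):
    # blank out the given index positions in one column of a copy
    d = df.copy()
    d.loc[list(idx), col] = ''
    return d

n, max_iterations = 2, 8
# islice caps the combinations iterator at max_iterations
frames = {c: mask(Data, 'ouptput', c)
          for c in islice(combinations(Data.index, n), max_iterations)}
result = pd.concat(frames)
print(len(frames))  # 8
```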