How to sample a grouped dataframe in pandas - python

My question is about sampling a grouped dataframe in pandas. I grouped a dataset using df.groupby, and the grouped dataset looks like this, where each bikeid has several trips:
bikeid tripid A B C
0 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
1 4 a4 b4 c4
5 a5 b5 c5
2 6 ..............
7 ..............
3 8 ..............
9 ..............
What I want to do is build a sample by picking one bikeid out of every 3 bikeids. It should look like:
bikeid tripid A B C
0 1 a1 b1 c1
  2 a2 b2 c2
  3 a3 b3 c3
3 8 a8 b8 c8
  9 a9 b9 c9
6 ...
9 ...
...
However, when I tried grouped_new = grouped.sample(frac=0.3),
it produced a sample based on each single trip, not on bikeid.
Can anyone help me out with this? Thank you!

Assuming you have a MultiIndex, consider using df.reindex() as shown below:
import pandas as pd
import numpy as np

# your original dataframe
df = pd.read_json('{"bikeid": {"0":0,"1":0,"2":0,"3":1,"4":1,"5":2,"6":2,"7":3,"8":3},"tripid": {"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8,"8":9},"A":{"0":"a1","1":"a2","2":"a3","3":"a4","4":"a5","5":"a6","6":"a7","7":"a8","8":"a9"},"B":{"0":"b1","1":"b2","2":"b3","3":"b4","4":"b5","5":"b6","6":"b7","7":"b8","8":"b9"},"C":{"0":"c1","1":"c2","2":"c3","3":"c4","4":"c5","5":"c6","6":"c7","7":"c8","8":"c9"}}')
df.set_index(['bikeid', 'tripid'], inplace=True)
# df looks like the following
A B C
bikeid tripid
0 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
1 4 a4 b4 c4
5 a5 b5 c5
2 6 a6 b6 c6
7 a7 b7 c7
3 8 a8 b8 c8
9 a9 b9 c9
# define the bikeid labels you want to keep (every third bike)
index_labels = np.arange(0, 100, 3)
# do the indexing
df.reindex(index_labels, level='bikeid')
# returns the following
A B C
bikeid tripid
0 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
3 8 a8 b8 c8
9 a9 b9 c9
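If, instead of taking every third bike, the goal is a random sample of whole bikes, a common pattern is to sample the unique bikeid values and filter with isin. This is a sketch with invented data, not part of the answer above:

```python
import pandas as pd

# small frame mirroring the question's layout (values invented)
df = pd.DataFrame({
    'bikeid': [0, 0, 0, 1, 1, 2, 2, 3, 3],
    'tripid': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'A': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9'],
})

# sample ~30% of the *bikes*, then keep every trip of the sampled bikes
sampled_ids = pd.Series(df['bikeid'].unique()).sample(frac=0.3, random_state=0)
sample = df[df['bikeid'].isin(sampled_ids)]
```

Unlike grouped.sample(frac=0.3), which samples trips within every group, this keeps all trips of the selected bikes.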


Compare each result to the previous result after grouping

I have a df like this:
A B C D E F
2 a1 a2 a3 a4 100
2 a1 b2 c3 a4 100 # note
2 b1 b2 b3 b4 100
2 c1 c2 c3 c4 100
1 a1 a2 a3 a4 120
2 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
I want to compare the values of column F between rows where columns B-E match, based on column A, like so:
A B C D E F diff
2 a1 a2 a3 a4 100 120/100
2 a1 b2 c3 a4 100 # note 150/100
2 b1 b2 b3 b4 100 130/100
2 c1 c2 c3 c4 100 110/100
1 a1 a2 a3 a4 120 80/120
1 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
Since the first line (where A is 2) has the same B-E values as the line where A is 1, I compute 120/100.
What I've tried:
df.groupby(['B', 'C', 'D', 'E']) - this groups the data, but I don't know how I could apply the logic of comparing against the previous value based on column A. Or maybe there is a simpler way of achieving this.
Use DataFrameGroupBy.shift with Series.div:
df['d'] = df.groupby(['B', 'C', 'D', 'E'])['F'].shift(-1).div(df['F'])
print(df)
A B C D E F d
0 2 a1 a2 a3 a4 100 1.200000
1 2 a1 b2 c3 a4 100 1.500000
2 2 b1 b2 b3 b4 100 1.300000
3 2 c1 c2 c3 c4 100 1.100000
4 1 a1 a2 a3 a4 120 0.666667
5 2 a1 b2 c3 a4 150 NaN
6 1 b1 b2 b3 b4 130 NaN
7 1 c1 c2 c3 c4 110 NaN
8 0 a1 a2 a3 a4 80 NaN
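The answer can be run end to end; here is a self-contained version with the frame reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [2, 2, 2, 2, 1, 2, 1, 1, 0],
    'B': ['a1', 'a1', 'b1', 'c1', 'a1', 'a1', 'b1', 'c1', 'a1'],
    'C': ['a2', 'b2', 'b2', 'c2', 'a2', 'b2', 'b2', 'c2', 'a2'],
    'D': ['a3', 'c3', 'b3', 'c3', 'a3', 'c3', 'b3', 'c3', 'a3'],
    'E': ['a4', 'a4', 'b4', 'c4', 'a4', 'a4', 'b4', 'c4', 'a4'],
    'F': [100, 100, 100, 100, 120, 150, 130, 110, 80],
})

# within each group of identical B..E rows, divide the next F by the current F
df['d'] = df.groupby(['B', 'C', 'D', 'E'])['F'].shift(-1).div(df['F'])
```

The last row of every B-E group has no "next" value, so it gets NaN, as in the printed output above.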

Combine multiple columns and rows into one column

I am trying to combine multiple columns and rows into one column. The order goes from left to right in row 1, then moves to row 2 and follows the same sequence. Here is an example:
Current df:
Shop A B C D E
1 A1 B1 C1 D1 E1
2 A2 B2 C2 D2 E2
3 A3 B3 C3 D3 E3
4 A4 B4 C4 D4 E4
5 A5 B5 C5 D5 E5
Desired output:
Combined
A1
B1
C1
D1
E1
A2
B2
C2
.
.
.
D5
E5
Any help would be appreciated!! Thanks in advance!!
If Shop is the index name (not a column), use DataFrame.stack with DataFrame.reset_index and Series.to_frame:
df = df.stack().reset_index(drop=True).to_frame('Combined')
print (df)
Combined
0 A1
1 B1
2 C1
3 D1
4 E1
5 A2
6 B2
7 C2
8 D2
9 E2
10 A3
11 B3
12 C3
13 D3
14 E3
15 A4
16 B4
17 C4
18 D4
19 E4
20 A5
21 B5
22 C5
23 D5
24 E5
Otherwise, if Shop is a column, use DataFrame.set_index as the first step:
df = df.set_index('Shop').stack().reset_index(drop=True).to_frame('Combined')
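As a quick check, here is the set_index/stack route on a two-row version of the example (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Shop': [1, 2],
    'A': ['A1', 'A2'],
    'B': ['B1', 'B2'],
    'C': ['C1', 'C2'],
})

# stack walks each row left to right, then moves down to the next row
out = df.set_index('Shop').stack().reset_index(drop=True).to_frame('Combined')
```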
You can use transpose and then melt:
df = df.set_index("Shop")  # skip this step if Shop is already the index
df.transpose().melt()
Output:
Shop value
0 1 A1
1 1 B1
2 1 C1
3 1 D1
4 1 E1
5 2 A2
6 2 B2
7 2 C2
8 2 D2
9 2 E2
10 3 A3
...
And of course you can now drop 'Shop' if you'd like to:
df.transpose().melt().drop("Shop", axis=1)
Output:
value
0 A1
1 B1
2 C1
3 D1
4 E1
5 A2
6 B2
7 C2
8 D2
9 E2
10 A3
...
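A NumPy-based alternative (not from the answers above): row-major ravel also walks left to right and then down, so the same column can be produced directly from the underlying array:

```python
import pandas as pd

df = pd.DataFrame({
    'Shop': [1, 2],
    'A': ['A1', 'A2'],
    'B': ['B1', 'B2'],
})

# .set_index() removes Shop from the values; ravel() flattens in row-major order
combined = pd.DataFrame({'Combined': df.set_index('Shop').to_numpy().ravel()})
```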

Concatenating Pandas DataFrames with overlapping indexes but never overlapping values

I have two DataFrames of arbitrary shape of the type:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 NaN
3 A3 NaN NaN
4 A4 NaN NaN
and
A B C
2 NaN NaN C2
3 NaN B3 C3
4 NaN B4 C4
5 A5 B5 C5
6 A6 B6 C6
The two DataFrames have overlapping indexes. Where there is an overlap, for a given column, there is a non-NaN in one DataFrame, and a NaN in the other. How can I concatenate these such that I can achieve a DataFrame with all values and no NaNs:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
My proposed solution is:
df3 = pd.concat([pd.concat([df1[col].dropna(), df2[col].dropna()]) for col in df1.columns], axis=1)
However, ideally I would not work column-by-column.
Use combine_first:
df = df1.combine_first(df2)
print(df)
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
Using df.fillna() and pd.concat() with dropna() (DataFrame.append was removed in pandas 2.0):
pd.concat([df1.fillna(df2), df2]).dropna()
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
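For reference, here is the combine_first answer run on the exact frames from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
                    'B': ['B0', 'B1', 'B2', np.nan, np.nan],
                    'C': ['C0', 'C1', np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 'A5', 'A6'],
                    'B': [np.nan, 'B3', 'B4', 'B5', 'B6'],
                    'C': ['C2', 'C3', 'C4', 'C5', 'C6']},
                   index=[2, 3, 4, 5, 6])

# df1 values win; NaNs are filled from df2, and the indexes are unioned
df = df1.combine_first(df2)
```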

Transform a dataframe into a column of another dataframe - pandas, Python

I need to merge two tables together in a particular way.
The first one looks like this:
Out[7]:
A B C D
a 1 4 4 4
b 4 5 5 5
c 3 6 6 6
d 8 7 7 7
The second one looks like this:
Out[8]:
T1 T2 C2 D2 S2
a A4 B4 C4 D4 A
a .............. B
a .............. C
a .............. D
b A5 B5 C5 D5 A
b .............. B
..
..
c A6 B6 C6 D6 A
...
...
d A7 B7 C7 D7 A
d .............. B
...
I'm looking to do the following:
Out[8]:
T1 T2 C2 D2 S2 M
a A4 B4 C4 D4 A 1
a A2 X4 C3 B8 B 4
a S7 N4 R3 S7 C 4
.
.
.
b A5 B5 C5 D5 A 4
b F4 F6 C8 A5 B 5
....
.....
c A6 B6 C6 D6 C 6
.....
......
d A7 B7 C7 D7 D 7
......
.......
GOAL: looking at the row labels shared by the two dataframes (a, b, c, d), make a new column M that takes the value from the first dataframe whose COLUMN NAME matches the row's value in column S2 of the second dataframe. It's not a diagonal.
I think you need:
df = (df2.set_index('S2', append=True)
.rename_axis((None, None))
.join(df1.stack().rename('M'))
.reset_index(level=1)
.rename(columns={'level_1':'S2'}))
print(df)
  S2  T1  T2  C2  D2  M
a  A  A4  B4  C4  D4  1
b  B  A5  B5  C5  D5  5
c  C  A6  B6  C6  D6  6
d  D  A7  B7  C7  D7  7
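Since the frames in the question are partly elided, here is the same chain on a minimal invented pair of frames (two rows, two columns):

```python
import pandas as pd

# invented miniature versions of df1 and df2 from the question
df1 = pd.DataFrame({'A': [1, 4], 'B': [4, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'T1': ['A4', 'A5'], 'S2': ['A', 'B']}, index=['a', 'b'])

# pair each row label with its S2 value, then look that pair up in stacked df1
res = (df2.set_index('S2', append=True)
          .rename_axis([None, None])
          .join(df1.stack().rename('M'))
          .reset_index(level=1)
          .rename(columns={'level_1': 'S2'}))
```

df1.stack() turns df1 into a Series keyed by (row label, column name), so the join performs exactly the "row label + S2 value" lookup described in the goal.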

Python pandas: how to conditionally replace values in a DataFrame based on another array

I have a DataFrame as follows. Both columns contain Member_IDs, indicating which Member_ID is connected with which other Member_ID:
col1 col2
1 3
1 4
1 5
2 3
2 4
3 1
3 2
3 5
4 1
4 2
5 1
5 3
I have calculated how many Member_IDs each Member_ID is connected with. For example, Member_ID 1 is connected with 3 Member_IDs. If a Member_ID has 3 or more connections, we put "a" in front of the Member_ID; otherwise we put "b". So Member_ID 1 gets the label "a1".
Likewise, I have calculated the labels for every Member_ID; the label array is below:
member_ID No_of_con Label
1 3 a1
2 2 b2
3 3 a3
4 2 b4
5 2 b5
Now I have to replace the first DataFrame's values by referring to the label array. The DataFrame is big, so using for loops is not efficient. How can I achieve this in a simpler way using pandas? I'm expecting the result below:
col1 col2
a1 a3
a1 b4
a1 b5
b2 a3
b2 b4
a3 a1
a3 b2
a3 b5
b4 a1
b4 b2
b5 a1
b5 a3
We can stack, map and unstack:
In [9]: d1.stack().map(d2.set_index('member_ID')['Label']).unstack()
Out[9]:
col1 col2
0 a1 a3
1 a1 b4
2 a1 b5
3 b2 a3
4 b2 b4
5 a3 a1
6 a3 b2
7 a3 b5
8 b4 a1
9 b4 b2
10 b5 a1
11 b5 a3
Or you can try this:
df2.set_index('member_ID',inplace=True)
df1.apply(lambda x: x.map(df2['Label']))
col1 col2
0 a1 a3
1 a1 b4
2 a1 b5
3 b2 a3
4 b2 b4
5 a3 a1
6 a3 b2
7 a3 b5
8 b4 a1
9 b4 b2
10 b5 a1
11 b5 a3
You can use pd.DataFrame.replace with a pd.Series, which acts like a dictionary keyed by its index:
d1.replace(d2.set_index('member_ID').Label)
col1 col2
0 a1 a3
1 a1 b4
2 a1 b5
3 b2 a3
4 b2 b4
5 a3 a1
6 a3 b2
7 a3 b5
8 b4 a1
9 b4 b2
10 b5 a1
11 b5 a3
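All three approaches produce the same frame; here is a compact check using the replace-based version against the stack/map/unstack version (d1 shortened to four rows):

```python
import pandas as pd

d1 = pd.DataFrame({'col1': [1, 1, 2, 3], 'col2': [3, 4, 3, 5]})
d2 = pd.DataFrame({'member_ID': [1, 2, 3, 4, 5],
                   'Label': ['a1', 'b2', 'a3', 'b4', 'b5']})

mapping = d2.set_index('member_ID')['Label']
out = d1.replace(mapping)               # Series acts as an index -> value dict
alt = d1.stack().map(mapping).unstack() # same result via stack/map/unstack
```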
