Consider the following dataset:
df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'],
                                 ['b', 4, 5, 6, 'T'],
                                 ['b', 9, 9, 39, 'T'],
                                 ['c', 16, 17, 18, 'N']]),
                  columns=['id', 'A', 'B', 'C', 'Active'])
id A B C Active
a 1 2 3 T
b 4 5 6 T
b 9 9 39 T
c 16 17 18 N
I need to augment each row of each group (id) with the rows where Active = T, which means:
a 1 2 3 a 1 2 3
b 4 5 6 a 1 2 3
b 9 9 39 a 1 2 3
a 1 2 3 b 4 5 6
b 4 5 6 b 4 5 6
b 9 9 39 b 4 5 6
a 1 2 3 b 9 9 39
b 4 5 6 b 9 9 39
b 9 9 39 b 9 9 39
a 1 2 3 c 16 17 18
b 9 9 39 c 16 17 18
b 4 5 6 c 16 17 18
I have an idea that I could not implement.
First, make a new dataset by filtering the data: take all rows where the Active column equals 'T' and save them in a new df.
df_t = df[df['Active'] == 'T']
Then, for each row of df, append a vector from the df_t dataset, which means (in pseudocode):
for sample in df:
    for t in df_t:
        df_new = sample + t  # join the vectors of df and df_t
        Df_new = concat(df_new, Df_new)
I really appreciate your comments and suggestions on how to implement my idea!
You want the Cartesian (cross) product of df and df_t. You can do it with a bit of a hack like this:
df['cross'] = 1
df_t['cross'] = 1
df_new = pd.merge(df, df_t.drop('Active', axis=1), on='cross').drop('cross', axis=1)
Putting it all together:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'],
                                 ['b', 4, 5, 6, 'T'],
                                 ['b', 9, 9, 39, 'T'],
                                 ['c', 16, 17, 18, 'N']]),
                  columns=['id', 'A', 'B', 'C', 'Active'])
df_t = df[df['Active'] == 'T'].copy()  # copy so adding 'cross' does not modify a slice of df
df['cross'] = 1
df_t['cross'] = 1
df_new = pd.merge(df, df_t.drop('Active', axis=1), on='cross').drop('cross', axis=1)
results in:
>>> df_new
id_x A_x B_x C_x Active id_y A_y B_y C_y
0 a 1 2 3 T a 1 2 3
1 a 1 2 3 T b 4 5 6
2 a 1 2 3 T b 9 9 39
3 b 4 5 6 T a 1 2 3
4 b 4 5 6 T b 4 5 6
5 b 4 5 6 T b 9 9 39
6 b 9 9 39 T a 1 2 3
7 b 9 9 39 T b 4 5 6
8 b 9 9 39 T b 9 9 39
9 c 16 17 18 N a 1 2 3
10 c 16 17 18 N b 4 5 6
11 c 16 17 18 N b 9 9 39
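As an aside: on pandas 1.2 or newer the dummy-key hack is no longer needed, because merge supports how='cross' directly. A minimal sketch of the same result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([['a', 1, 2, 3, 'T'],
                                 ['b', 4, 5, 6, 'T'],
                                 ['b', 9, 9, 39, 'T'],
                                 ['c', 16, 17, 18, 'N']]),
                  columns=['id', 'A', 'B', 'C', 'Active'])
df_t = df[df['Active'] == 'T']

# Cartesian product without a dummy key (pandas >= 1.2)
df_new = pd.merge(df, df_t.drop('Active', axis=1), how='cross')
print(df_new)
```

The overlapping column names still get the usual _x/_y suffixes, exactly as in the output above.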
I have a pandas DataFrame (df1) like:
0 1 2 3 4 5
0 a b c d e f
1 1 4 7 10 13 16
2 2 5 8 11 14 17
3 3 6 9 12 15 18
and I want to generate a DataFrame like (df2):
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Additional information about the given df:
The shape of the given df is unknown: b = df1.shape -> b = [n, m].
It is a given fact that the width of df1 is divisible by 3.
I tried stack, melt and wide_to_long. Using stack, the order of the rows is lost; the rows should behave as shown in the exemplary df2. I would really appreciate any help.
Kind regards Hans
Use np.vstack and np.hsplit:
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
Another example:
>>> df
0 1 2 3 4 5 6 7 8
0 a b c d e f g h i
1 1 4 7 10 13 16 19 22 25
2 2 5 8 11 14 17 20 23 26
3 3 6 9 12 15 18 21 24 27
>>> pd.DataFrame(np.vstack(np.hsplit(df, df.shape[1] // 3)))
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
8 g h i
9 19 22 25
10 20 23 26
11 21 24 27
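The same split-and-stack can be sketched in pure pandas with iloc slices, which avoids the round-trip through a NumPy array (np.vstack flattens mixed columns into a single object array, so per-column dtypes are lost):

```python
import pandas as pd

df = pd.DataFrame([list('abcdef'),
                   [1, 4, 7, 10, 13, 16],
                   [2, 5, 8, 11, 14, 17],
                   [3, 6, 9, 12, 15, 18]])

width = 3
# slice the frame into width-sized column blocks, align the labels, then stack
parts = [df.iloc[:, i:i + width].set_axis(range(width), axis=1)
         for i in range(0, df.shape[1], width)]
out = pd.concat(parts, ignore_index=True)
print(out)
```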
You can use DataFrame.append:
a = df[df.columns[: len(df.columns) // 3 + 1]]
b = df[df.columns[len(df.columns) // 3 + 1 :]]
b.columns = a.columns
df_out = a.append(b).reset_index(drop=True)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
4 d e f
5 10 13 16
6 11 14 17
7 12 15 18
EDIT: To handle unknown widths:
dfs = []
for i in range(0, len(df.columns), 3):
    dfs.append(df[df.columns[i : i + 3]])
    dfs[-1].columns = df.columns[:3]
df_out = pd.concat(dfs)
print(df_out)
Prints:
0 1 2
0 a b c
1 1 4 7
2 2 5 8
3 3 6 9
0 d e f
1 10 13 16
2 11 14 17
3 12 15 18
0 g h i
1 19 22 25
2 20 23 26
3 21 24 27
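Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same two-part split can be written with pd.concat, sketched here for the 6-column example:

```python
import pandas as pd

df = pd.DataFrame([list('abcdef'),
                   [1, 4, 7, 10, 13, 16],
                   [2, 5, 8, 11, 14, 17],
                   [3, 6, 9, 12, 15, 18]])

a = df[df.columns[:3]]
b = df[df.columns[3:]]
b.columns = a.columns  # align labels so the halves stack instead of padding NaNs
df_out = pd.concat([a, b], ignore_index=True)
print(df_out)
```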
If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
Or you can also use pd.merge (not a very clean solution, though):

In [1297]: df['tmp'] = 1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', axis=1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
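The dummy tmp key is arguably unnecessary: merging on the index alone gives the same row alignment, with the left frame's columns first. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'F': [2, 5], 'Y': [3, 6]})
df1 = pd.DataFrame({'B': [7, 10], 'C': [8, 11], 'T': [9, 12]})

# align rows by index; left-frame columns come first, preserving their order
new_df = pd.merge(df, df1, left_index=True, right_index=True)
print(new_df)
```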
I have a df1 and df2 as follows:
df1:
a b c
0 1 2 4
1 6 12 24
2 7 14 28
3 4 8 16
4 3 6 12
df2:
a b c
0 7 8 9
1 10 11 12
How can I insert df2 into df1, but after the second row? My desired output will look like this:
a b c
0 1 2 4
1 6 12 24
2 7 8 9
3 10 11 12
4 7 14 28
5 4 8 16
6 3 6 12
Thank you.
Use concat with the first DataFrame split by DataFrame.iloc:
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=False)
print (df)
a b c
0 1 2 4
1 6 12 24
0 7 8 9
1 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
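If you want the clean 0..6 index shown in the desired output rather than the original labels, pass ignore_index=True to the same concat; a runnable sketch with the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 6, 7, 4, 3],
                    'b': [2, 12, 14, 8, 6],
                    'c': [4, 24, 28, 16, 12]})
df2 = pd.DataFrame({'a': [7, 10], 'b': [8, 11], 'c': [9, 12]})

# splice df2 in after the second row and renumber the result 0..n-1
df = pd.concat([df1.iloc[:2], df2, df1.iloc[2:]], ignore_index=True)
print(df)
```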
Here is another way using np.r_:
df2.index = range(len(df1), len(df1) + len(df2))  # re-index df2 to start where df1 ends
final = pd.concat((df1, df2))                     # concat
final.iloc[np.r_[0, 1, df2.index, 2:len(df1)]]    # select the desired row order with iloc
# equivalently: final.iloc[np.r_[0:2, df2.index, 2:len(df1)]]
a b c
0 1 2 4
1 6 12 24
5 7 8 9
6 10 11 12
2 7 14 28
3 4 8 16
4 3 6 12
My original CSV file looks like this
1, 9
2, 8
3, 9
14, 7
15, 6
19, 8
20, 9
21, 3
I grouped the table by runs of consecutive integers in column A with:
for grp, val in df.groupby((df['A'].diff() - 1).fillna(0).cumsum()):
    print(val)
Resulting table:
A B
1 1 9
2 2 8
3 3 9
A B
14 14 7
15 15 6
A B
19 19 8
20 20 9
21 21 3
In practice the B values are very long ID numbers, but insignificant as numbers. How can I create a new column C that will show patterns in each of the three groups by assigning a simple value to each ID, and the same simple value for each duplicate in a group?
Desired output:
A B C
1 1 9 1
2 2 8 2
3 3 9 1
A B C
14 14 7 1
15 15 6 2
A B C
19 19 8 1
20 20 9 2
21 21 3 3
Thanks
You are close:
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : pd.Series(pd.factorize(x)[0]+1)).values
df
Out[105]:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
3 14 7 1
4 15 6 2
5 19 8 1
6 20 9 2
7 21 3 3
Or using category
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : x.astype('category').cat.codes+1).values
df
Out[110]:
A B C
0 1 9 2
1 2 8 1
2 3 9 2
3 14 7 2
4 15 6 1
5 19 8 2
6 20 9 3
7 21 3 1
If you need a for loop:
for x, df1 in df.groupby((df.A.diff()-1).fillna(0).cumsum()):
    print(df1.assign(C=pd.factorize(df1.B)[0]+1))
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
Let's try:
df.columns = ['A','B']
g = df.groupby(df.A.diff().ne(1).cumsum())
df['C'] = g['B'].transform(lambda x: pd.factorize(x)[0] + 1)
for n,g in g:
print(g)
Output:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
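Putting the grouping key and the per-group factorization together in one runnable sketch (data reconstructed from the CSV above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 14, 15, 19, 20, 21],
                   'B': [9, 8, 9, 7, 6, 8, 9, 3]})

# a new group starts wherever the gap in A is not exactly 1
key = df['A'].diff().ne(1).cumsum()
# within each group, give each distinct B the next small integer, in order of appearance
df['C'] = df.groupby(key)['B'].transform(lambda s: pd.factorize(s)[0] + 1)
print(df)
```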
Note that withColumn is a Spark DataFrame method, not a pandas one; if you are on PySpark, it adds a new column to the DataFrame, to which you could assign an index value.
consider the dataframe df
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such that if I grouped by column 'A' I'd pull the first position from each group, then cycle back and get the second position from each group if any rows remain, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to count values within groups first, then sort_values and reindex by the Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
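The same idea fits in one line if you take a stable argsort of the cumcount and index positionally, which also works when the frame does not have the default RangeIndex; a sketch:

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('aaaaabbbbccc'), B=range(12)))

# rank each row within its group, then stably sort the rows by that rank
order = df.groupby('A').cumcount().argsort(kind='mergesort')
out = df.iloc[order]
print(out)
```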
Here's a NumPy approach -
def approach1(g, v):
    # Inputs: 1D arrays of the groupby and value columns
    id_arr2 = np.ones(v.size, dtype=int)
    # positions where a new group starts
    sf = np.flatnonzero(g[1:] != g[:-1]) + 1
    # at each group start, jump the running counter back down so that
    # the cumulative sum restarts the within-group rank
    id_arr2[sf[0]] = -sf[0] + 1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:] + 1
    # stable sort by within-group rank gives the interleaved order
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or, using df.reindex from @jezrael's post (this works here because the frame has the default RangeIndex, so the positional indices returned by approach1 coincide with the index labels):
df.reindex(approach1(df.A.values, df.B.values))