Data-frame manipulation in Python

I have a csv file with two columns of a and b as below:
a b
601 1
602 2
603 3
604 4
605 5
606 6
I want to read and save data in a new csv file as below:
s id
601 1
602 1
603 1
604 2
605 2
606 2
I have tried this code:
import pandas as pd

data = pd.read_csv('./dataset/test4.csv')
chunks = []
i = 0
while i < 6:
    chunks.append(data['a'].iloc[i:i+3])
    i += 3
df = pd.DataFrame(chunks)
print(df)
with this output:
0 1 2 3 4 5
a 601.0 602.0 603.0 NaN NaN NaN
a NaN NaN NaN 604.0 605.0 606.0
First I need to save the list in a dataframe with the following result:
0 1 2 3 4 5
601.0 602.0 603.0 604.0 605.0 606.0
and then save it in a csv file. However, I've got stuck on the first part.
Thanks for your help.

Assuming every 3 items in a constitute a group in b, just do a little integer division on the index.
data['b'] = (data.index // 3 + 1)
data
a b
0 601 1
1 602 1
2 603 1
3 604 2
4 605 2
5 606 2
Saving to CSV is straightforward - all you have to do is call df.to_csv(...).
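For example (a minimal sketch; the output filename here is just a placeholder):
data.to_csv('./dataset/output.csv', index=False)  # index=False keeps the DataFrame's row index out of the file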
Division by index is fine as long as you have a monotonically increasing integer index. Otherwise, you can use np.arange (on MaxU's recommendation):
import numpy as np

data['b'] = np.arange(len(data)) // 3 + 1
data
a b
0 601 1
1 602 1
2 603 1
3 604 2
4 605 2
5 606 2

By using your output:
df.stack().unstack()
Out[115]:
0 1 2 3 4 5
a 601.0 602.0 603.0 604.0 605.0 606.0
Data Input
df
0 1 2 3 4 5
a 601.0 602.0 603.0 NaN NaN NaN
a NaN NaN NaN 604.0 605.0 606.0

In [45]: df[['a']].T
Out[45]:
0 1 2 3 4 5
a 601 602 603 604 605 606
or
In [39]: df.set_index('b').T.rename_axis(None, axis=1)
Out[39]:
1 2 3 4 5 6
a 601 602 603 604 605 606

Related

Rows not sorted after concat

In the code below I'm attempting to reorder a portion of a dataframe and then join it with another portion. The code will sort but when I attempt to run the last line it returns the unsorted frame. Can anyone help with this?
Code
copied = frame[frame['PLAYVAL'].isin([3,4])].copy()
copied_col = copied['PLAY_EVT']
copied = copied.drop(columns=['PLAY_EVT'],axis=1)
copied = copied.sort_values(['TIME_ELAPSED','SHOTVAL'],ascending=[True,True]).copy()
result = pd.concat([copied_col,copied],axis=1)
Frame
   PLAY_EVT  TIME_ELAPSED  INFO  SHOTVAL
0         1           132  1of2        2
1         2           132  2of2        3
2         3           342  3of3        6
3         4           342  2of3        5
4         5           342  1of3        4
5         6           786  2of2        3
6         7           786  1of2        2
Expected Outcome
   PLAY_EVT  TIME_ELAPSED  INFO  SHOTVAL
0         1           132  1of2        2
1         2           132  2of2        3
2         3           342  1of3        4
3         4           342  2of3        5
4         5           342  3of3        6
5         6           786  1of2        2
6         7           786  2of2        3
Figured it out. It was the indexes: concatenation aligns rows by index labels, so after sorting, the rows were matched back up by their original index. Resetting the index on both pieces fixes it.
copied = frame[frame['PLAYVAL'].isin([3,4])]
copied_col = copied['PLAY_EVT'].reset_index(drop=True)
copied = copied.drop(columns=['PLAY_EVT'],axis=1)
copied = copied.sort_values(['TIME_ELAPSED','SHOTVAL'],ascending=[True,True]).reset_index(drop=True)
result = pd.merge(copied_col, copied, left_index=True, right_index=True)
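For reference, once both pieces share the same reset index, the original pd.concat approach lines them up correctly as well (a small sketch, not part of the original answer):
# concat with axis=1 aligns on index labels, so the reset 0..n-1 indexes now match row by row
result = pd.concat([copied_col, copied], axis=1)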

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframes, df1 and df2. I need to find df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I'm having issues, is to map df3 to df1 and conduct a conditional check to see if df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
An alternative is casting the boolean mask to integer, which converts True/False to 1/0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forgot to add quotes around sum_column:
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)

Remove reversed duplicates from a data frame

Can anyone suggest a good solution to remove reversed duplicates from a data frame?
My data looks like this, where the first and second columns are reversed duplicates.
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
TRINITY_DN16813_c0_g1_i4 TRINITY_DN16813_c0_g1_i3 96.104 231 9 0 190 420 429 199 2.979999999999999e-104 377
I need to keep only one row, the one where the third column has the higher value:
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
These are the results when I use series.isin():
TRINITY_DN28139_c0_g1_i2 TRINITY_DN28139_c0_g1_i5 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN28139_c0_g1_i5 TRINITY_DN28139_c0_g1_i2 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN25313_c0_g1_i6 TRINITY_DN25313_c0_g1_i5 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25313_c0_g1_i5 TRINITY_DN25313_c0_g1_i6 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25502_c0_g1_i3 TRINITY_DN25502_c0_g1_i4 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN25502_c0_g1_i4 TRINITY_DN25502_c0_g1_i3 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN28726_c0_g1_i2 TRINITY_DN28726_c0_g1_i1 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN28726_c0_g1_i1 TRINITY_DN28726_c0_g1_i2 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN27942_c0_g1_i7 TRINITY_DN27942_c0_g1_i6 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i1 TRINITY_DN25118_c0_g1_i2 99.964 2770 1 0 81 2850 204 2973 0.0 5110
TRINITY_DN27942_c0_g1_i6 TRINITY_DN27942_c0_g1_i7 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i2 TRINITY_DN25118_c0_g1_i1 99.964 2770 1 0 204 2973 81 2850 0.0 5110
TRINITY_DN28502_c1_g1_i9 TRINITY_DN28502_c1_g1_i7 99.963 2678 1 0 1928 4605 2021 4698 0.0 4940
TRINITY_DN28502_c1_g1_i7 TRINITY_DN28502_c1_g1_i9 99.963 2678 1 0 2021 4698 1928 4605 0.0 4940
TRINITY_DN25619_c0_g1_i1 TRINITY_DN25619_c0_g1_i8 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN25619_c0_g1_i8 TRINITY_DN25619_c0_g1_i1 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN23022_c0_g1_i5 TRINITY_DN23022_c0_g1_i1 99.962 2622 1 0 1 2622 1 2622 0.0 4837
Try this one. It's completely in pandas (so it should be faster).
This also corrects bugs in my previous answer, but the concept of taking the labels as a pair remains the same.
In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
Get only max values per duplicated result:
In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]
If you need the names to be in separate columns:
In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
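For context, here is a self-contained sketch of the same idea on the two sample rows from the question (only the first three columns are included, and the integer column labels 0, 1, 2 are an assumption since the data has no header):
import pandas as pd

df = pd.DataFrame({
    0: ['TRINITY_DN16813_c0_g1_i3', 'TRINITY_DN16813_c0_g1_i4'],
    1: ['TRINITY_DN16813_c0_g1_i4', 'TRINITY_DN16813_c0_g1_i3'],
    2: [96.491, 96.104],
})

# build an order-independent key so (a, b) and (b, a) land in the same group
df['pair'] = df[[0, 1]].apply(lambda x: '-'.join(sorted(x)), axis=1)

# keep only the row with the highest value in column 2 for each pair
dfd = df.loc[df.groupby('pair')[2].idxmax()].drop(columns='pair')
print(dfd)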
Use series.isin() to find the same entries in both columns and drop duplicates:
df=df.sort_values('col3',ascending=False)
df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]
Here col1 is the first column and col2 is the second.
Output:
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.49 228 8 0 202 429 417 190 0.00 377
The problem is that the labels in column 0 and column 1 must be taken as a pair, so an isin alone would not work.
First, a list of label pairs is needed to compare against (forward in the code). Given that (a,b) is the same as (b,a), all instances are just replaced by (a,b).
Then all labels that are duplicated are renamed in the order a,b, even if the higher row is b,a. This is necessary for the grouping step later.
In [293]: df['pair'] = df[[0, 1]].apply(lambda x: tuple(sorted((x[0], x[1]))), axis=1)
Then, to account for the value of column 2 (the third column from the left), the original data is grouped and the minimum of each group is kept. These are the rows to be removed.
In [297]: dfi = df.set_index(['pair',2])
In [298]: to_drop = df.groupby([0,1])[2].min().reset_index().set_index([0,1,2]).index
In [299]: dfi['drop'] = dfi.index.isin(to_drop)
In [300]: dfr = dfi.reset_index()
Rows are dropped by the index number where the 'drop' column is True.
The temporary 'drop' column is also removed.
In [301]: df_dropped = dfr.drop(np.where(dfr['drop'])[0], axis=0).drop('drop', axis=1)
In [302]: df_dropped
Out[302]:
0 1 2 3 4 5 6 7 8 9 10 11
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.050000e-104 377

How to concatenate rows while preserving all rows and have one result value per group

I am trying to generate a unique group-value for each observation made up of the contents of a column concatenated together, while keeping all the rows intact.
I have observations that can be grouped on a specific column (column A below). I want to create a unique value per group made of the content of each row of this group, but keeping the rows untouched.
I have tried solutions provided here and here, but these solutions collapse the results, leaving one row per group, whereas I wish to keep all rows.
import pandas as pd
d = {'A': [1, 2, 3, 3, 4, 5, 5, 6],
'B': [345, 366, 299, 455, 879, 321, 957, 543]}
df = pd.DataFrame(d)
print(df)
A B
0 1 345
1 2 366
2 3 299
3 3 455
4 4 879
5 5 321
6 5 957
7 5 689
8 6 543
df['B'] = df['B'].astype(str)
df['B_concat'] = df.groupby(['A'])['B'].apply('/'.join)
print(df)
A B B_concat
0 1 345 NaN
1 2 366 345
2 3 299 366
3 3 455 299/455
4 4 879 879
5 5 321 321/957/689
6 5 957 543
7 5 689 NaN
8 6 543 NaN
Units in the same group should have the same B_concat value.
A B B_concat
0 1 345 345
1 2 366 366
2 3 299 299/455
3 3 455 299/455
4 4 879 879
5 5 321 321/957/689
6 5 957 321/957/689
7 5 689 321/957/689
8 6 543 543
Use GroupBy.transform to return a Series the same size as the original DataFrame, so the result can be assigned to a new column:
df['B'] = df['B'].astype(str)
df['B_concat'] = df.groupby(['A'])['B'].transform('/'.join)
A one-line solution would be:
df['B_concat'] = df['B'].astype(str).groupby(df['A']).transform('/'.join)
print (df)
A B B_concat
0 1 345 345
1 2 366 366
2 3 299 299/455
3 3 455 299/455
4 4 879 879
5 5 321 321/957
6 5 957 321/957
7 6 543 543
Or:
df['B_concat'] = df.groupby(['A'])['B'].transform(lambda x: '/'.join(x.astype(str)))

how to sort a column and group them on pandas?

I am new to pandas. I am trying to sort a column and group the values by their numbers.
df = pd.read_csv("12Patients150526 mutations-ORIGINAL.txt", sep="\t", header=0)
samp=df["SAMPLE"]
samp
Out[3]:
0 11
1 2
2 9
3 1
4 8
5 2
6 1
7 3
8 10
9 4
10 5
..
53157 12
53158 3
53159 2
53160 10
53161 2
53162 3
53163 4
53164 11
53165 12
53166 11
Name: SAMPLE, dtype: int64
#sorting
grp=df.sort(samp)
This code does not work. Can somebody help me with my problem, please?
How can I sort and group the values by their numbers?
To sort df based on a particular column, use df.sort() and pass the column name as a parameter.
import pandas as pd
import numpy as np
# data
# ===========================
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,10,1000), columns=['SAMPLE'])
df
SAMPLE
0 6
1 1
2 4
3 4
4 8
5 4
6 6
7 3
.. ...
992 3
993 2
994 1
995 2
996 7
997 4
998 5
999 4
[1000 rows x 1 columns]
# sort
# ======================
df.sort('SAMPLE')
SAMPLE
310 1
710 1
935 1
463 1
462 1
136 1
141 1
144 1
.. ...
174 9
392 9
386 9
382 9
178 9
772 9
890 9
307 9
[1000 rows x 1 columns]
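Note that DataFrame.sort was removed in pandas 0.20; in current versions the equivalent call is sort_values, and the grouping part of the question is handled by groupby (a brief sketch of both):
# modern replacement for df.sort('SAMPLE')
df_sorted = df.sort_values('SAMPLE')

# group rows by their SAMPLE value, e.g. count how many rows each sample has
counts = df.groupby('SAMPLE').size()
print(counts)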
