Remove reversed duplicates from a data frame - python

Can anyone suggest a good solution to remove reversed duplicates from a data frame?
My data looks like this, where the first and second columns are reversed duplicates.
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
TRINITY_DN16813_c0_g1_i4 TRINITY_DN16813_c0_g1_i3 96.104 231 9 0 190 420 429 199 2.979999999999999e-104 377
I need to keep only one row, the one where the third column has the higher value:
TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.049999999999999e-104 377
These are the results when I use series.isin():
TRINITY_DN28139_c0_g1_i2 TRINITY_DN28139_c0_g1_i5 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN28139_c0_g1_i5 TRINITY_DN28139_c0_g1_i2 99.971 3465 1 0 1 3465 1 3465 0.0 6394
TRINITY_DN25313_c0_g1_i6 TRINITY_DN25313_c0_g1_i5 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25313_c0_g1_i5 TRINITY_DN25313_c0_g1_i6 99.97 3315 1 0 1 3315 1 3315 0.0 6117
TRINITY_DN25502_c0_g1_i3 TRINITY_DN25502_c0_g1_i4 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN25502_c0_g1_i4 TRINITY_DN25502_c0_g1_i3 99.96799999999999 3078 1 0 1 3078 1 3078 0.0 5679
TRINITY_DN28726_c0_g1_i2 TRINITY_DN28726_c0_g1_i1 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN28726_c0_g1_i1 TRINITY_DN28726_c0_g1_i2 99.96600000000001 5805 2 0 1 5805 1 5805 0.0 10709
TRINITY_DN27942_c0_g1_i7 TRINITY_DN27942_c0_g1_i6 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i1 TRINITY_DN25118_c0_g1_i2 99.964 2770 1 0 81 2850 204 2973 0.0 5110
TRINITY_DN27942_c0_g1_i6 TRINITY_DN27942_c0_g1_i7 99.964 2760 1 0 1 2760 1 2760 0.0 5092
TRINITY_DN25118_c0_g1_i2 TRINITY_DN25118_c0_g1_i1 99.964 2770 1 0 204 2973 81 2850 0.0 5110
TRINITY_DN28502_c1_g1_i9 TRINITY_DN28502_c1_g1_i7 99.963 2678 1 0 1928 4605 2021 4698 0.0 4940
TRINITY_DN28502_c1_g1_i7 TRINITY_DN28502_c1_g1_i9 99.963 2678 1 0 2021 4698 1928 4605 0.0 4940
TRINITY_DN25619_c0_g1_i1 TRINITY_DN25619_c0_g1_i8 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN25619_c0_g1_i8 TRINITY_DN25619_c0_g1_i1 99.963 2715 1 0 1 2715 1 2715 0.0 5009
TRINITY_DN23022_c0_g1_i5 TRINITY_DN23022_c0_g1_i1 99.962 2622 1 0 1 2622 1 2622 0.0 4837

Try this one. It's completely in pandas (so it should be faster).
This also corrects bugs in my previous answer, but the concept of taking the labels as a pair remains the same.
In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
Get only max values per duplicated result:
In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]
If you need the names to be in separate columns:
In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
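Put together, a minimal runnable sketch of this approach (the input filename here is just a placeholder; the blast-style table is assumed to be whitespace-separated with no header, so the columns are numbered 0, 1, 2, ...):
import pandas as pd

# Read the table without a header so columns are 0, 1, 2, ... (placeholder filename).
df = pd.read_csv('blast_results.tsv', sep=r'\s+', header=None)

# Build an order-independent key from the first two columns.
df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

# Keep only the row with the highest value in column 2 within each pair.
dfd = df.loc[df.groupby('pair')[2].idxmax()]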

Use series.isin() to find same entries in both columns and drop duplicates:
df=df.sort_values('col3',ascending=False)
df.loc[df['col1'].isin(df['col2']).drop_duplicates().index]
Where col1 is the first column and col2 is the second
Output:
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.49 228 8 0 202 429 417 190 0.00 377

The problem is that the labels in column 0 and column 1 must be taken as a pair, so an isin alone will not work.
First, a list of label pairs is needed to compare against (forward in the code). Given that (a,b) is the same as (b,a), all instances will just be replaced by (a,b).
Then all labels that are duplicated are renamed in the order a,b even if the higher-valued row is b,a. This is necessary for the grouping step later.
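(The pairing function l used below is not shown in this excerpt; a plausible reconstruction, consistent with the sorted-pair idea above, would be:)
# Hypothetical definition of the helper `l`: return the two labels as an
# alphabetically sorted tuple, so (a, b) and (b, a) map to the same pair.
def l(row):
    return tuple(sorted((row[0], row[1])))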
In [293]: df['pair'] = df[[0, 1]].apply(l, axis=1)
Then, to account for the value of column 2 (the third column from the left), the original data is grouped and the min of each group is kept. These will be the rows to be removed.
In [297]: dfi = df.set_index(['pair',2])
In [298]: to_drop = df.groupby([0,1])[2].min().reset_index().set_index([0,1,2]).index
In [299]: dfi['drop'] = dfi.index.isin(to_drop)
In [300]: dfr = dfi.reset_index()
Rows are dropped by the index number where the 'drop' column is True.
The temporary 'drop' column is also removed.
In [301]: df_dropped = dfr.drop(np.where(dfr['drop'])[0], axis=0).drop('drop', axis=1)
In [302]: df_dropped
Out[302]:
0 1 2 3 4 5 6 7 8 9 10 11
0 TRINITY_DN16813_c0_g1_i3 TRINITY_DN16813_c0_g1_i4 96.491 228 8 0 202 429 417 190 3.050000e-104 377

Related

Rule based sorting for each subset in a pandas dataframe

I have a sorting rule for a column of a dataframe I am working on. The rule is that the positions of two consecutive rows are swapped 50% of the time if the ratio of the values of a specific column in those two rows is within a defined range. Here is the code:
import random

def randomized_sort(df):
    """
    :param df: dataframe
    :return: sorted dataframe based on the condition
    """
    length = len(df) if len(df) % 2 == 0 else len(df) - 1
    for i in range(0, length, 2):
        if random.random() < 0.5:
            if (0.7 < (df.iloc[i, :].weight) / (df.iloc[i + 1, :].weight) < 1.3):
                a, b = df.iloc[i, :].copy(), df.iloc[i + 1, :].copy()
                df.iloc[i, :], df.iloc[i + 1, :] = b, a
    return df
However, I have a new dataframe in which I have to perform this operation on each subset/group. Please see the data below. The above operation needs to be done for each subset grouped by the order column.
How can this be done?
From your question it is not clear what you mean by subset/group.
Assuming, you want to treat each unique value in the order column as its own subset/group, you could simply filter your DataFrame for a given order value and process it with your method.
Afterwards, you can then concatenate all your individual DataFrames back together.
Example with dummy DataFrame:
df = pd.DataFrame()
number_of_rows = 20
df["order"]=[random.randint(0,3) for x in range(number_of_rows)]
df["weight"]=[random.randint(300,900) for x in range(number_of_rows)]
df.sort_values(by="order",inplace=True)
index  order  weight
0      0      629
1      0      842
3      0      326
5      0      533
6      0      621
17     1      772
11     1      333
10     1      399
18     1      369
19     1      380
7      1      414
4      1      800
2      1      640
8      1      670
14     2      411
15     2      862
16     2      888
9      2      526
12     3      345
13     3      430
Now filter the DataFrame for the subset with order value 1:
df[df["order"]==1]
index  order  weight
17     1      772
11     1      333
10     1      399
18     1      369
19     1      380
7      1      414
4      1      800
2      1      640
8      1      670
And then run your method with this subset DataFrame:
subset_df = df[df["order"]==1].copy()
sorted_df = randomized_sort(subset_df)
sorted_df
index  order  weight
17     1      772
11     1      333
10     1      369
18     1      399
19     1      380
7      1      414
4      1      800
2      1      640
8      1      670
Now, do this in a loop for every subset:
ordered_subsets = sorted(df.order.unique())
overall_sorted_df=pd.DataFrame()
for order_value in ordered_subsets:
    subset_df = df[df["order"]==order_value].copy()
    sorted_df = randomized_sort(subset_df)
    overall_sorted_df = pd.concat([overall_sorted_df,sorted_df])
overall_sorted_df
index  order  weight
0      0      842
1      0      629
3      0      326
5      0      533
6      0      621
17     1      772
11     1      333
10     1      399
18     1      369
19     1      414
7      1      380
4      1      640
2      1      800
8      1      670
14     2      411
15     2      862
16     2      888
9      2      526
12     3      345
13     3      430
Hope that helps!
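As a side note, the whole loop can also be written more compactly with groupby (a sketch, reusing the randomized_sort function from the question):
# Apply randomized_sort to each "order" group and concatenate the results.
overall_sorted_df = df.groupby("order", group_keys=False).apply(randomized_sort)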

How to map pandas Groupby dataframe with sum values to another dataframe using non-unique column

I have two pandas dataframes, df1 and df2, where I need to find df1['seq'] by doing a groupby on df2 and taking the sum of the column df2['sum_column']. Below are sample data and my current solution.
df1
id code amount seq
234 3 9.8 ?
213 3 18
241 3 6.4
543 3 2
524 2 1.8
142 2 14
987 2 11
658 3 17
df2
c_id name role sum_column
1 Aus leader 6
1 Aus client 1
1 Aus chair 7
2 Ned chair 8
2 Ned leader 3
3 Mar client 5
3 Mar chair 2
3 Mar leader 4
grouped = df2.groupby('c_id')['sum_column'].sum()
df3 = grouped.reset_index()
df3
c_id sum_column
1 14
2 11
3 11
The next step, where I am having issues, is to map df3 to df1 and conduct a conditional check to see if df1['amount'] is greater than df3['sum_column'].
df1['seq'] = np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')[sum_column]), 1, 0)
Printing out df1['code'].map(df3.set_index('c_id')['sum_column']), I get only NaN values.
Does anyone know what I am doing wrong here?
Expected results:
df1
id code amount seq
234 3 9.8 0
213 3 18 1
241 3 6.4 0
543 3 2 0
524 2 1.8 0
142 2 14 1
987 2 11 0
658 3 17 1
The solution can be simplified by removing .reset_index() for df3 and passing the Series to map:
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)
An alternative is casting the boolean mask to integer, converting True, False to 1, 0:
df1['seq'] = (df1['amount'] > df1['code'].map(s)).astype(int)
print (df1)
id code amount seq
0 234 3 9.8 0
1 213 3 18.0 1
2 241 3 6.4 0
3 543 3 2.0 0
4 524 2 1.8 0
5 142 2 14.0 1
6 987 2 11.0 0
7 658 3 17.0 1
You forgot to add quotes around sum_column:
df1['seq']=np.where(df1['amount'] > df1['code'].map(df3.set_index('c_id')['sum_column']), 1, 0)
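For reference, a minimal end-to-end sketch of the accepted approach, with the sample frames rebuilt from the data shown above (only the columns needed here):
import numpy as np
import pandas as pd

# Rebuild the sample frames from the question.
df1 = pd.DataFrame({'id': [234, 213, 241, 543, 524, 142, 987, 658],
                    'code': [3, 3, 3, 3, 2, 2, 2, 3],
                    'amount': [9.8, 18, 6.4, 2, 1.8, 14, 11, 17]})
df2 = pd.DataFrame({'c_id': [1, 1, 1, 2, 2, 3, 3, 3],
                    'sum_column': [6, 1, 7, 8, 3, 5, 2, 4]})

# Sum per c_id, map onto df1['code'], and compare against the amount.
s = df2.groupby('c_id')['sum_column'].sum()
df1['seq'] = np.where(df1['amount'] > df1['code'].map(s), 1, 0)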

get the new value based on the last row and checking the ID

The current dataframe:
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 100
111 1/3/2018 500
111 1/4/2018 400
111 1/5/2018 0
222 4/1/2018 2000 200
222 4/2/2018 100
222 4/3/2018 700
222 4/4/2018 0
222 4/5/2018 0
222 4/6/2018 1000
222 4/7/2018 0
This is the dataframe I am trying to get. Basically, I am trying to fill the Start Value for each row. As you can see, every ID has a start value on the first day; the next day's start value = the previous day's start value - the previous day's payment.
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 1000 100
111 1/3/2018 900 500
111 1/4/2018 400 400
111 1/5/2018 0 0
222 4/1/2018 2000 200
222 4/2/2018 1800 100
222 4/3/2018 1700 700
222 4/4/2018 1000 0
222 4/5/2018 1000 0
222 4/6/2018 1000 1000
222 4/7/2018 0 0
Right now, I use Excel with this formula.
Start Value = if(ID in this row == ID in last row, last row's start value - last row's payment, Start Value)
It works well; I am wondering if I can do it in Python/Pandas. Thank you.
We can use groupby with shift + cumsum. ffill sets up the initial value for all rows under the same ID; then we just deduct the cumulative payment made before each row from that initial value to get the remaining value at that point:
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True))
Out[61]:
0 1000.0
1 1000.0
2 900.0
3 400.0
4 0.0
5 2000.0
6 1800.0
7 1700.0
8 1000.0
9 1000.0
10 1000.0
11 0.0
Name: StartValue, dtype: float64
Assign it back by adding inplace=True:
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True),inplace=True)
df
Out[63]:
ID Date StartValue Payment
0 111 1/1/2018 1000.0 0
1 111 1/2/2018 1000.0 100
2 111 1/3/2018 900.0 500
3 111 1/4/2018 400.0 400
4 111 1/5/2018 0.0 0
5 222 4/1/2018 2000.0 200
6 222 4/2/2018 1800.0 100
7 222 4/3/2018 1700.0 700
8 222 4/4/2018 1000.0 0
9 222 4/5/2018 1000.0 0
10 222 4/6/2018 1000.0 1000
11 222 4/7/2018 0.0 0
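For readability, the same logic can also be broken into steps with groupby().transform() (a sketch, assuming as in the sample that only the first row of each ID carries a StartValue):
# First StartValue per ID, broadcast to every row of that ID.
first_start = df.groupby('ID')['StartValue'].transform('first')
# Payments made before each row: shift by one row, then accumulate.
paid_before = df.groupby('ID')['Payment'].transform(lambda s: s.shift().cumsum())
# Fill only the missing StartValue rows; the first row per ID keeps its original value.
df['StartValue'] = df['StartValue'].fillna(first_start - paid_before)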

Data-frame manipulation in python

I have a csv file with two columns, a and b, as below:
a b
601 1
602 2
603 3
604 4
605 5
606 6
I want to read and save data in a new csv file as below:
s id
601 1
602 1
603 1
604 2
605 2
606 2
I have tried this code:
import pandas as pd

data=pd.read_csv('./dataset/test4.csv')
list=[]
i=0
while(i<6):
    list.append(data['a'].iloc[i:i+3])
    i+=3
df = pd.DataFrame(list)
print(df)
with this output:
0 1 2 3 4 5
a 601.0 602.0 603.0 NaN NaN NaN
a NaN NaN NaN 604.0 605.0 606.0
First I need to save the list in a dataframe with the following result:
0 1 2 3 4 5
601.0 602.0 603.0 604.0 605.0 606.0
and then save it in a csv file. However, I've gotten stuck on the first part.
Thanks for your help.
Assuming every 3 items in a constitute a group in b, just do a little integer division on the index.
data['b'] = (data.index // 3 + 1)
data
a b
0 601 1
1 602 1
2 603 1
3 604 2
4 605 2
5 606 2
Saving to CSV is straightforward - all you have to do is call df.to_csv(...).
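For example (the output filename and renamed headers here are just illustrative):
# Rename to the requested headers and write out without the index.
data.rename(columns={'a': 's', 'b': 'id'}).to_csv('output.csv', index=False)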
Division by index is fine as long as you have a monotonically increasing integer index. Otherwise, you can use np.arange (on MaxU's recommendation):
data['b'] = np.arange(len(data)) // 3 + 1
data
a b
0 601 1
1 602 1
2 603 1
3 604 2
4 605 2
5 606 2
Using your output:
df.stack().unstack()
Out[115]:
0 1 2 3 4 5
a 601.0 602.0 603.0 604.0 605.0 606.0
Data Input
df
0 1 2 3 4 5
a 601.0 602.0 603.0 NaN NaN NaN
a NaN NaN NaN 604.0 605.0 606.0
In [45]: df[['a']].T
Out[45]:
0 1 2 3 4 5
a 601 602 603 604 605 606
or
In [39]: df.set_index('b').T.rename_axis(None, axis=1)
Out[39]:
1 2 3 4 5 6
a 601 602 603 604 605 606

Python Pandas operate on row

Hi, my dataframe looks like:
Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264
Simply, I need to add another column called '_id' as a concatenation of Store, Dept, and Date, like "1_1_2010-02-05". I assumed I could do it through df['id'] = df['Store'] +'' +df['Dept'] +'_'+df['Date'], but it turned out not to work.
Similarly, I also need to add a new column as the log of Sales. I tried df['logSales'] = math.log(df['Sales']); again, it did not work.
You can first convert the integer columns to strings before concatenating with +:
In [25]: df['id'] = df['Store'].astype(str) +'_' +df['Dept'].astype(str) +'_'+df['Date']
In [26]: df
Out[26]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1_1_2010-02-05
1 1 1 2010-02-12 449 1_1_2010-02-12
2 1 1 2010-02-19 455 1_1_2010-02-19
3 1 1 2010-02-26 154 1_1_2010-02-26
4 1 1 2010-03-05 29 1_1_2010-03-05
5 1 1 2010-03-12 239 1_1_2010-03-12
6 1 1 2010-03-19 264 1_1_2010-03-19
For the log, you'd better use the numpy function. It is vectorized (math.log can only work on single scalar values):
In [34]: df['logSales'] = np.log(df['Sales'])
In [35]: df
Out[35]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1_1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1_1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1_1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1_1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1_1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1_1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1_1_2010-03-19 5.575949
Summarizing the comments: for a dataframe of this size, using apply will not differ much in performance from using vectorized functions (working on the full column), but it will when your real dataframe becomes larger.
Apart from that, I think the above solution also has simpler syntax.
In [153]:
import pandas as pd
import io
temp = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
1,1,2010-02-26,154
1,1,2010-03-05,29
1,1,2010-03-12,239
1,1,2010-03-19,264"""
df = pd.read_csv(io.StringIO(temp))
df
Out[153]:
Store Dept Date Sales
0 1 1 2010-02-05 245
1 1 1 2010-02-12 449
2 1 1 2010-02-19 455
3 1 1 2010-02-26 154
4 1 1 2010-03-05 29
5 1 1 2010-03-12 239
6 1 1 2010-03-19 264
[7 rows x 4 columns]
In [154]:
# apply a lambda function row-wise, you need to convert store and dept to strings in order to build the new string
df['id'] = df.apply(lambda x: str(str(x['Store']) + ' ' + str(x['Dept']) +'_'+x['Date']), axis=1)
df
Out[154]:
Store Dept Date Sales id
0 1 1 2010-02-05 245 1 1_2010-02-05
1 1 1 2010-02-12 449 1 1_2010-02-12
2 1 1 2010-02-19 455 1 1_2010-02-19
3 1 1 2010-02-26 154 1 1_2010-02-26
4 1 1 2010-03-05 29 1 1_2010-03-05
5 1 1 2010-03-12 239 1 1_2010-03-12
6 1 1 2010-03-19 264 1 1_2010-03-19
[7 rows x 5 columns]
In [155]:
import math
# now apply log to sales to create the new column
df['logSales'] = df['Sales'].apply(math.log)
df
Out[155]:
Store Dept Date Sales id logSales
0 1 1 2010-02-05 245 1 1_2010-02-05 5.501258
1 1 1 2010-02-12 449 1 1_2010-02-12 6.107023
2 1 1 2010-02-19 455 1 1_2010-02-19 6.120297
3 1 1 2010-02-26 154 1 1_2010-02-26 5.036953
4 1 1 2010-03-05 29 1 1_2010-03-05 3.367296
5 1 1 2010-03-12 239 1 1_2010-03-12 5.476464
6 1 1 2010-03-19 264 1 1_2010-03-19 5.575949
[7 rows x 6 columns]
