How to split dataframe into multiple dataframes based on column-name?

How to split dataframe into multiple dataframes based on column-name? - python

I have a dataframe with columns like this:
['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0',
't_energy1', 't_energy2']
And I have a code which returns the average of three columns with the same name:
# Takes in a dataframe with three columns and returns a dataframe with one column of their means as integers
def average_column(dataframe):
dataframe = dataframe.copy() # To avoid SettingWithCopyWarning
# Create new column name without integers
temp = dataframe.columns.tolist()[0]
col_name = temp.rstrip(temp[2:-1])
dataframe[col_name] = dataframe.mean(axis=1) # Add column to the dataframe (axis=1 means the mean() is applied row-wise)
mean_df = dataframe.iloc[: , -1:] # Isolated column of the mean by selecting all rows (:) for the last column (-1:)
print("Original:\n{}\nAverage columns:\n{}".format(dataframe, mean_df))
return mean_df.astype(float)
This function gives me this output:
Original:
t_dance0 t_dance1 t_dance2 dance
0 0.549 0.623 0.5190 0.563667
1 0.871 0.702 0.4160 0.663000
2 0.289 0.328 0.2340 0.283667
3 0.886 0.947 0.8260 0.886333
4 0.724 0.791 0.7840 0.766333
... ... ... ... ...
Average columns:
dance
0 0.563667
1 0.663000
2 0.283667
3 0.886333
4 0.766333
... ...
I asked this question about how I can split it into unique and duplicate columns. Which led me to this code:
# Function that splits dataframe into two separate dataframes, one with all unique
columns and one with all duplicates
def sub_dataframes(dataframe):
# Extract common prefix -> remove trailing digits
cols = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
# Split columns
unq_cols = cols[cols == 1].index
dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)] # All columns from dataframe that is not in unq_cols
return dataframe[unq_cols], dataframe[dup_cols]
unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate columns:\n\n{}".format(unq_df, dup_df))
Which gives me this output:
Unique columns:
id
0 22352
1 106534
2 23608
3 8655
4 49670
... ...
Duplicate columns:
t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2
0 292720 293760.0 292733.0 0.549 0.623 0.5190
1 213760 181000.0 245973.0 0.871 0.702 0.4160
2 157124 130446.0 152450.0 0.289 0.328 0.2340
3 127896 176351.0 166968.0 0.886 0.947 0.8260
4 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ...
2828 70740 262400.0 220680.0 0.224 0.609 0.7110
2829 252226 222400.0 214973.0 0.526 0.623 0.4820
2830 269146 251560.0 172760.0 0.551 0.756 0.7820
2831 344764 425613.0 249652.0 0.473 0.572 0.8230
2832 210955 339869.0 304124.0 0.112 0.523 0.0679
I have tried to combine these functions in another function that takes in a dataframe and returns the dataframe with all duplicate columns replaced by their mean, but I have trouble with splitting the dups_df into smaller dataframes. Is there a simpler way I can do this?
An example on the desired output:
Original:
total_tracks t_dur0 t_dur1 t_dur2 t_dance0 t_dance1 t_dance2 \
0 4 292720 293760.0 292733.0 0.549 0.623 0.5190
1 12 213760 181000.0 245973.0 0.871 0.702 0.4160
2 59 157124 130446.0 152450.0 0.289 0.328 0.2340
3 8 127896 176351.0 166968.0 0.886 0.947 0.8260
4 17 210320 226253.0 211880.0 0.724 0.791 0.7840
... ... ... ... ... ... ... ...
After function:
total_tracks popularity duration dance
0 4 21 293071.000000 0.563667
1 12 14 213577.666667 0.663000
2 59 41 146673.333333 0.283667
3 8 1 157071.666667 0.886333
4 17 47 216151.000000 0.766333
... ... ... ...

Use wide_to_long for reshape original DataFrame first and then aggregate mean:
cols = ['total_tracks']
df1 = (pd.wide_to_long(df,
stubnames=['t_dur','t_dance'],
i=cols,
j='tmp')
.reset_index()
.drop('tmp', 1)
.groupby(cols, as_index=False)
.mean())
print (df1)
total_tracks t_dur t_dance
0 4 293071.000000 0.563667
1 8 157071.666667 0.886333
2 12 213577.666667 0.663000
3 17 216151.000000 0.766333
4 59 146673.333333 0.283667
Details:
cols = ['total_tracks']
print(pd.wide_to_long(df,
stubnames=['t_dur','t_dance'],
i=cols,
j='tmp'))
t_dur t_dance
total_tracks tmp
4 0 292720.0 0.549
12 0 213760.0 0.871
59 0 157124.0 0.289
8 0 127896.0 0.886
17 0 210320.0 0.724
4 1 293760.0 0.623
12 1 181000.0 0.702
59 1 130446.0 0.328
8 1 176351.0 0.947
17 1 226253.0 0.791
4 2 292733.0 0.519
12 2 245973.0 0.416
59 2 152450.0 0.234
8 2 166968.0 0.826
17 2 211880.0 0.784

Related

Pandas: use apply to create 2 new columns

I have a dataset where col a represent the number of total values in values e,i,d,t which are in string format separated by a "-"
a e i d t
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
I want to create 8 new columns, 4 representing the SUM of (e-i-d-t), 4 the product.
For example:
def funct_two_outputs(E, I, d, t, d_calib = 50):
return E+i+d+t, E*i*d*t
OUT first 2 values:
SUM_0, row0 = 40+0.5+30+1 SUM_1 = 80+0.3+32+1
The sum and product are example functions substituting my functions which are a bit more complicated.
I have written out a function **expand_on_col ** that creates separates all the e,i,d,t values into new columns:
def expand_on_col (df_, col_to_split = "namecol", sep='-', prefix="this"):
'''
Pass a df indicating on which col you want to split,
return a df with the col split with a prefix.
'''
df1 = df_[col_to_split].str.split(sep,expand=True).add_prefix(prefix)
df1 = pd.concat([df_,df1], axis=1).replace(np.nan, '-')
return df1
Now i need to create 4 new columsn that are the sum of eidt, and 4 that are the prodct.
Example output for SUM:
index a e i d t a-0 e-0 e-1 e-2 e-3 i-0 i-1 i-2 i-3 d-0 d-1 d-2 d-3 t-0 t-1 t-2 t-3 sum-0 sum-1 sum-2 sum-3
0 0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
1 1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
2 3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
3 5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
If i run the code with funct_one_output(only returns sum) it works, but wit the funct_two_outputs(suma and product) I get an error.
Here is the code:
import pandas as pd
def expand_on_col (df_, col_to_split = "namecol", sep='-', prefix="this"):
'''
Pass a df indicating on which col you want to split,
return a df with the col split with a prefix.
'''
df1 = df_[col_to_split].str.split(sep,expand=True).add_prefix(prefix)
df1 = pd.concat([df_,df1], axis=1).replace(np.nan, '-')
return df1
def funct_two_outputs(E, I, d, t, d_calib = 50): #the function i want to pass
return E+i+d+t, E*i*d*t
def funct_one_outputs(E, I, d, t, d_calib = 50): #for now i can olny use this one, cant use 2 return values.
return E+i+d+t
for col in columns:
df = expand_on_col (df_=df, col_to_split = col, sep='-', prefix=f"{col}-")
cols_ = df.columns.drop(columns)
df[cols_]= df[cols_].apply(pd.to_numeric, errors="coerce")
df["a"] = df["a"].apply(pd.to_numeric, errors="coerce")
df.reset_index(inplace=True)
for i in range (max(df["a"])):
name_1, name_2 = f"sum-{i}", f"mult-{i}"
df[name_1] = df.apply(lambda row: funct_one_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
#if i try and fill 2 outputs it wont work
df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
OUT:
ValueError Traceback (most recent call last)
<ipython-input-306-85157b89d696> in <module>()
68 df[name_1] = df.apply(lambda row: funct_one_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
69 #if i try and fill 2 outputs it wont work
---> 70 df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
71
72
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __setitem__(self, key, value)
3039 self._setitem_frame(key, value)
3040 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3041 self._setitem_array(key, value)
3042 else:
3043 # set column
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in _setitem_array(self, key, value)
3074 )[1]
3075 self._check_setitem_copy()
-> 3076 self.iloc._setitem_with_indexer((slice(None), indexer), value)
3077
3078 def _setitem_frame(self, key, value):
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
1751 if len(ilocs) != len(value):
1752 raise ValueError(
-> 1753 "Must have equal len keys and value "
1754 "when setting with an iterable"
1755 )
ValueError: Must have equal len keys and value when setting with an iterable

Don't Use apply
If you can help it
s = pd.to_numeric(
df[['e', 'i', 'd', 't']]
.stack()
.str.split('-', expand=True)
.stack()
)
sums = s.sum(level=[0, 2]).rename('Sum')
prods = s.prod(level=[0, 2]).rename('Prod')
sums_prods = pd.concat([sums, prods], axis=1).unstack()
sums_prods.columns = [f'{o}-{i}' for o, i in sums_prods.columns]
df.join(sums_prods)
a e i d t Sum-0 Sum-1 Sum-2 Sum-3 Prod-0 Prod-1 Prod-2 Prod-3
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0

Transform dataframe value to range value in Python 3

I have a dataframe with the values:
3.05
35.97
49.11
48.80
48.02
10.61
25.69
6.02
55.36
0.42
47.87
2.26
54.43
8.85
8.75
14.29
41.29
35.69
44.27
1.08
I want transform the value into range and give new value to each value.
From the df we know the min value is 0.42 and the max value is 55.36.
From range min to max, I want divide to 4 group which is:
0.42 - 14.15 transform to 1
14.16 - 27.88 transform to 2
27.89 - 41.61 transform to 3
41.62 - 55.36 transform to 4
so the result I expected is
1
3
4
4
4
1
2
1
4
1
4
1
4
1
1
2
3
3
4
1

This is normally called binning, but pandas calls it cut. Sample code is below:
import pandas as pd
# Create a list of numbers, with a header called "nums"
data_list = [('nums', [3.05, 35.97, 49.11, 48.80, 48.02, 10.61, 25.69, 6.02, 55.36, 0.42, 47.87, 2.26, 54.43, 8.85, 8.75, 14.29, 41.29, 35.69, 44.27, 1.08])]
# Create the labels for the bin
bin_labels = [1,2,3,4]
# Create the dataframe object using the data_list
df = pd.DataFrame.from_items(data_list)
# Define the scope of the bins
bins = [0.41, 14.16, 27.89, 41.62, 55.37]
# Create the "bins" column using the cut function using the bins and labels
df['bins'] = pd.cut(df['nums'], bins=bins, labels=bin_labels)
This creates a dataframe which has the following structure:
print(df)
nums bins
0 3.05 1
1 35.97 3
2 49.11 4
3 48.80 4
4 48.02 4
5 10.61 1
6 25.69 2
7 6.02 1
8 55.36 4
9 0.42 1
10 47.87 4
11 2.26 1
12 54.43 4
13 8.85 1
14 8.75 1
15 14.29 2
16 41.29 3
17 35.69 3
18 44.27 4
19 1.08 1

You could construct a function like the following to have full control over the process:
def transform(l):
l2 = []
for i in l:
if 0.42 <= i <= 14.15:
l2.append(1)
elif i <= 27.8:
l2.append(2)
elif i <= 41.61:
l2.append(3)
elif i <= 55.36:
l2.append(4)
return(l2)
df['nums'] = transform(df['nums'])

what is the best way to merge pandas.Dataframe with pandas.Series based on df.columns and Series.index names?

im facing the following problem and i dont know what is the cleanest/smartest way to solve it.
I have a dataframe called wfm that contains the input for my simulation
wfm.head()
Out[51]:
OPN Vin Vout_ref Pout_ref CFN ... Cdclink Cdm L k ron
0 6 350 750 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
1 7 400 800 92000 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
2 8 350 900 80500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
3 9 450 750 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
4 10 450 900 103500 1 ... 0.00012 0.00012 0.000131 -0.37 0.001
[5 rows x 13 columns]
then every simulation loop I receive 2 Series outputs_rms and outputs_avg that look like this:
outputs_rms outputs_avg
Out[53]: Out[54]:
time.rms 0.057751 time.avg 5.78E-02
Vi_dc.voltage.rms 400 Vi_dc.voltage.avg 4.00E+02
Vi_dc.current.rms 438.333188 Vi_dc.current.avg 3.81E+02
Vi_dc.power.rms 175333.2753 Vi_dc.power.avg 1.53E+05
Am_in.current.rms 438.333188 Am_in.current.avg 3.81E+02
Cdm.voltage.rms 396.614536 Cdm.voltage.avg 3.96E+02
Cdm.current.rms 0.213185 Cdm.current.avg -5.14E-05
motor_phU.current.rms 566.035833 motor_phU.current.avg -5.67E+02
motor_phU.voltage.rms 296.466083 motor_phU.voltage.avg -9.17E-02
motor_phV.current.rms 0.061024 motor_phV.current.avg 2.58E-02
motor_phV.voltage.rms 1.059341 motor_phV.voltage.avg -1.24E-09
motor_phW.current.rms 566.005071 motor_phW.current.avg 5.67E+02
motor_phW.voltage.rms 297.343876 motor_phW.voltage.avg 9.17E-02
S_ULS.voltage.rms 305.017804 S_ULS.voltage.avg 2.65E+02
S_ULS.current.rms 358.031053 S_ULS.current.avg -1.86E+02
S_UHS.voltage.rms 253.340047 S_UHS.voltage.avg 1.32E+02
S_UHS.current.rms 438.417985 S_UHS.current.avg 3.81E+02
S_VLS.voltage.rms 295.509073 S_VLS.voltage.avg 2.64E+02
S_VLS.current.rms 0 S_VLS.current.avg 0.00E+00
S_VHS.voltage.rms 152.727975 S_VHS.voltage.avg 1.32E+02
S_VHS.current.rms 0.061024 S_VHS.current.avg -2.58E-02
S_WLS.voltage.rms 509.388666 S_WLS.voltage.avg 2.64E+02
S_WLS.current.rms 438.417985 S_WLS.current.avg 3.81E+02
S_WHS.voltage.rms 619.258959 S_WHS.voltage.avg 5.37E+02
S_WHS.current.rms 357.982417 S_WHS.current.avg -1.86E+02
Cdclink.voltage.rms 801.958092 Cdclink.voltage.avg 8.02E+02
Cdclink.current.rms 103.73088 Cdclink.current.avg 2.08E-05
Am_out.current.rms 317.863371 Am_out.current.avg 1.86E+02
Vo_dc.voltage.rms 800 Vo_dc.voltage.avg 8.00E+02
Vo_dc.current.rms 317.863371 Vo_dc.current.avg -1.86E+02
Vo_dc.power.rms 254290.6969 Vo_dc.power.avg -1.49E+05
CFN 1 CFN 1.00E+00
OPN 6 OPN 6.00E+00
dtype: float64 dtype: float64
then my goal is to place outputs_rms and outputs_avg on the right line of wfm, based on 'CFN' and 'OPN' values.
what is your suggestions?
thanks
Riccardo

Suppose that you create these series as outputs output_rms_1, output_rms_2, etc.,
than the series can be combined in one dataframe
import pandas as pd
dfRms = pd.DataFrame([output_rms_1, output_rms_2, output_rms_3])
Next output, say output_rms_10, can simply be added by using:
dfRms = dfRms.append(output_rms_10, ignore_index=True)
Finally, when all outputs are joined into one Dataframe,
you can merge the original wfm with the output, i.e.
result = pd.merge(wfm, dfRms, on=['CFN', 'OPN'], how='left')
Similarly for avg.

Problem to implement count, groupby, np.repeat and agg with pandas

I have similar dataframe pandas:
df = pd.DataFrame({'x': np.random.rand(61800), 'y':np.random.rand(61800), 'z':np.random.rand(61800)})
I need to work out my dataset for the following result:
extract = df.assign(count=np.repeat(range(10),10)).groupby('count',as_index=False).agg(['mean','min', 'max'])
But if i use np.repeat(range(150),150)) i received this error:

This doesn't work because the .assign you're performing needs to have enough values to fit the original dataframe:
In [81]: df = pd.DataFrame({'x': np.random.rand(61800), 'y':np.random.rand(61800), 'z':np.random.rand(61800)})
In [82]: df.assign(count=np.repeat(range(10),10))
ValueError: Length of values does not match length of index
In this case, everything works fine if we do 10 groups repeated 6,180 times:
In [83]: df.assign(count=np.repeat(range(10),6180))
Out[83]:
x y z count
0 0.781364 0.996545 0.756592 0
1 0.609127 0.981688 0.626721 0
2 0.547029 0.167678 0.198857 0
3 0.184405 0.484623 0.219722 0
4 0.451698 0.535085 0.045942 0
... ... ... ... ...
61795 0.783192 0.969306 0.974836 9
61796 0.890720 0.286384 0.744779 9
61797 0.512688 0.945516 0.907192 9
61798 0.526564 0.165620 0.766733 9
61799 0.683092 0.976219 0.524048 9
[61800 rows x 4 columns]
In [84]: extract = df.assign(count=np.repeat(range(10),6180)).groupby('count',as_index=False).agg(['mean','min', 'max'])
In [85]: extract
Out[85]:
x y z
mean min max mean min max mean min max
count
0 0.502338 0.000230 0.999546 0.501603 0.000263 0.999842 0.503807 0.000113 0.999826
1 0.500392 0.000059 0.999979 0.499935 0.000012 0.999767 0.500114 0.000230 0.999811
2 0.498377 0.000023 0.999832 0.496921 0.000003 0.999475 0.502887 0.000028 0.999828
3 0.504970 0.000637 0.999680 0.500943 0.000256 0.999902 0.497370 0.000257 0.999969
4 0.501195 0.000290 0.999992 0.498617 0.000149 0.999779 0.497895 0.000022 0.999877
5 0.499476 0.000186 0.999956 0.503227 0.000308 0.999907 0.504688 0.000100 0.999756
6 0.495488 0.000378 0.999606 0.499893 0.000119 0.999740 0.495924 0.000031 0.999556
7 0.498443 0.000005 0.999417 0.495728 0.000262 0.999972 0.501255 0.000087 0.999978
8 0.494110 0.000014 0.999888 0.495197 0.000074 0.999970 0.493215 0.000166 0.999718
9 0.496333 0.000365 0.999307 0.502074 0.000110 0.999856 0.499164 0.000035 0.999927

Parsing and create a new df with conditions

I need some help with python and pandas.
I actually have a dataframe with in the column seq1_id al the seq_id of sequences of the species 1 and the column 2 for the sequences of the sp2.
I actually passed a filter on those sequences and got two dataframes (one with all sequences of sp 1 passed through the filter) and (one with all sequences of sp2 passed through the filter).
Then I have 3 dataframes.
Because in a pairs, one seq can pass the filter while the other does not, it is important to keep only paired genes which are keeping on the two previous filtering, so what I need to do is actually to parse my first df such this one:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
and check row by row if (ex the first row) seq1_A is present in the df2 and if seq8_B is also present in the df3, then keep this row in the df1 and add it in a new df4.
Here is an example with output wanted:
first df:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
df2 (sp1) (seq3_A is absent)
Seq_1.id
seq1_A
seq2_A
seq4_A
df3 (sp2) (Seq11_B is absent)
Seq_2.id
seq8_B
Seq9_B
Seq10_B
Then because Seq11_B and seq3_A are not present, the df4 (output) would be:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
candidates_0035=pd.read_csv("candidates_genes_filtering_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_filtering_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')
df4 =dN_dS[dN_dS['seq1_id'].isin(candidates_0042['gene'])&dN_dS['seq2_id'].isin(candidates_0035['gene'])]
and I got an empty output, only with columns names but it should not be like that.
Here are the data if you cant to test the code on it :
df1:
Unnamed: 0 seq1_id seq2_id dN dS Dist_third_pos Dist_brute Length_seq_1 Length_seq_2 GC_content_seq1 GC_content_seq2 GC Mean_length
0 0 g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104 0.23600000000000002 0.142 535.0 1024.0 49.1588785046729 51.171875 50.165376752336456 535.0
1 1 g45594.t1_0035_0035 g1464.t1_0042_0042 0.5208761055250978 5.430485421797574 0.7120000000000001 0.489 246.0 222.0 47.967479674796756 44.594594594594604 46.28103713469567 222.0
2 2 g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867 0.262 0.139 895.0 749.0 56.312849162011176 57.67690253671562 56.994875849363396 749.0
3 3 g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516 26.834927363887587 0.5760000000000001 0.433 597.0 633.0 37.85594639865997 39.810426540284354 38.83318646947217 597.0
4 4 g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165 0.1181241496387666 0.1 0.069 450.0 461.0 45.111111111111114 44.90238611713666 45.006748614123886 450.0
5 5 g1005.t1_0035_0035 g11524.t1_0042_0042 0.3528036631463747 19.32549458735676 0.71 0.512 3177.0 3804.0 39.06200818382121 52.944269190325976 46.0031386870736 3177.0
6 6 g28456.t1_0035_0035 g31669.t1_0042_0035 0.4608959702286786 26.823981621115166 0.6859999999999999 0.469 516.0 591.0 49.224806201550386 53.46869712351946 51.346751662534935 516.0
7 7 g6202.t1_0035_0035 g193.t1_0042_0042 0.4679458383555545 17.81312422445775 0.66 0.462 804.0 837.0 41.91542288557214 47.67025089605735 44.79283689081474 804.0
8 8 g60667.t1_0035_0035 g14327.t1_0042_0042 0.046056273155280165 0.13320612138898 0.122 0.067 348.0 408.0 56.89655172413793 55.392156862745104 56.1443542934415 348.0
9 9 g30148.t1_0035_0042 g37790.t1_0042_0035 0.05631607180881047 0.19747150378706246 0.12300000000000001 0.08800000000000001 405.0 320.0 59.012345679012356 58.4375 58.72492283950618 320.0
10 10 g24481.t1_0035_0035 g37405.t1_0042_0035 0.2151957757290965 0.15106487998618026 0.135 0.17600000000000002 270.0 276.0 51.111111111111114 51.44927536231884 51.28019323671497 270.0
11 11 g33270.t1_0035_0035 g21201.t1_0042_0035 0.2773062983971916 21.13839474189674 0.6940000000000001 0.401 297.0 357.0 54.882154882154886 50.42016806722689 52.65116147469089 297.0
12 12 EOG090X03YJ_0035_0035_1 EOG090X03YJ_0042_0042_1 0.5402471721616758 19.278839157918302 0.7070000000000001 0.488 1321.0 1719.0 38.53141559424678 43.92088423502036 41.22614991463357 1321.0
13 13 g13075.t1_0035_0042 g504.t1_0042_0035 0.3317504066721263 4.790120127840871 0.65 0.38799999999999996 372.0 408.0 59.40860215053763 51.470588235294116 55.43959519291587 372.0
14 14 g1026.t1_0035_0035 g7716.t1_0042_0042 0.21445770772761286 13.92799368027682 0.626 0.344 336.0 315.0 38.095238095238095 44.444444444444436 41.26984126984127 315.0
15 15 g18238.t1_0035_0042 g35401.t1_0042_0035 0.3889830456691637 20.33679494952895 0.6759999999999999 0.44799999999999995 320.0 366.0 50.9375 49.453551912568315 50.19552595628416 320.0
df2:
Unnamed: 0 gene scaf_name start end cov_depth GC
179806 g13600.t1_0042_0042 scaffold_6556 1 1149 2.42361684558216 0.528846153846154
315037 g34744.t1_0042_0035 scaffold_8076 17 765 3.49803921568627 0.386138613861386
317296 g35222.t1_0042_0035 scaffold_9018 1 614 93.071661237785 0.41
183513 g14327.t1_0042_0042 scaffold_9358 122 529 3.3184165232357996 0.36
328164 g37790.t1_0042_0035 scaffold_16356 1 320 2.73125 0.436241610738255
326617 g37405.t1_0042_0035 scaffold_14890 1 341 1.3061224489795902 0.36898395721925104
188515 g15510.t1_0042_0042 scaffold_20183 1 276 137.326086956522 0.669354838709677
184561 g14562.t1_0042_0042 scaffold_10427 1 494 157.993927125506 0.46145940390544704
290684 g30982.t1_0042_0035 scaffold_3800 440 940 174.499839537869 0.39823008849557506
179993 g13632.t1_0042_0042 scaffold_6654 29 1114 3.56506849315068 0.46153846153846206
181670 g13942.t1_0042_0042 scaffold_7830 1 811 5.307028360049321 0.529411764705882
196148 g20290.t1_0042_0035 scaffold_1145 2707 9712 78.84112231766741 0.367283950617284
313624 g34464.t1_0042_0035 scaffold_7610 1 480 7.740440324449589 0.549019607843137
303133 g32700.t1_0042_0035 scaffold_5119 1735 2373 118.436578171091 0.49074074074074103
df3:
Unnamed: 0 gene scaf_name start end cov_depth GC
428708 g66097.t1_0035_0035 scaffold_306390 1 695 32.2431654676259 0.389880952380952
342025 g50055.t1_0035_0035 scaffold_188566 15 954 7.062893081761009 0.351129363449692
214193 g28436.t1_0035_0042 scaffold_231066 1 842 25.9774346793349 0.348837209302326
400337 g60667.t1_0035_0035 scaffold_261197 309 656 15.873529411764698 0.353846153846154
224023 g30148.t1_0035_0042 scaffold_263686 10 414 23.2072538860104 0.34108527131782895
184987 g24481.t1_0035_0035 scaffold_65047 817 1593 27.7840552416824 0.533898305084746
249413 g34492.t1_0035_0035 scaffold_106432 1 511 3.2482544608223396 0.368318122555411
249418 g34493.t1_0035_0035 scaffold_106432 547 1230 3.2482544608223396 0.368318122555411
12667 g1120.t1_0035_0042 scaffold_2095 2294 2794 47.864745898359295 0.56203288490284
252797 g35042.t1_0035_0035 scaffold_108853 274 1276 20.269592476489 0.32735426008968604
255878 g36112.t1_0035_0042 scaffold_437464 1 540 74.8252551020408 0.27884615384615397
40058 g4082.t1_0035_0042 scaffold_11195 579 1535 33.4396168320219 0.48487467588591204
271053 g39343.t1_0035_0042 scaffold_590976 1 290 19.6666666666667 0.38636363636363596
89911 g10947.t1_0035_0035 scaffold_21433 1735 2373 32.4222503160556 0.408571428571429

This should do it:
df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id'])&df1['Seq_2.id'].isin(df3['Seq_2.id'])]
df4
# Seq_1.id Seq_2.id
#0 seq1_A seq8_B
#1 seq2_A Seq9_B
EDIT
You must have made a permutation, this doesn't return empty:
df4 = dN_dS[(dN_dS['seq1_id'].isin(candidates_0035['gene']))&(dN_dS['seq2_id'].isin(candidates_0042['gene']))]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to split dataframe into multiple dataframes based on column-name? - python

Related

Pandas: use apply to create 2 new columns

Transform dataframe value to range value in Python 3

what is the best way to merge pandas.Dataframe with pandas.Series based on df.columns and Series.index names?

Problem to implement count, groupby, np.repeat and agg with pandas

Parsing and create a new df with conditions

Categories

Resources