Calculate mean values from pandas dataframe - python

I am trying to find a good way to calculate mean values from values in a dataframe. It contains measured data from an experiment and is imported from an excel sheet. The columns contain the time passed by, electric current and the corresponding voltage.
The current is changed in steps and then held for some time (the current values vary a little, so they are not exactly the same within each step). Now I want to calculate the mean voltage for each current step. Since it takes some time for the voltage to stabilize after a step, I also want to leave out the first few voltage values after each step.
Currently I am doing this with loops, but I was wondering whether there is a nicer way using the groupby function (or others, maybe).
Just say if you need more details or clarification.
Example of data:
s [A] [V]
0 6.0 -0.001420 0.780122
1 12.0 -0.002484 0.783297
2 18.0 -0.001478 0.785870
3 24.0 -0.001256 0.793559
4 30.0 -0.001167 0.806086
5 36.0 -0.000982 0.815364
6 42.0 -0.003038 0.825018
7 48.0 -0.001174 0.831739
8 54.0 0.000478 0.838861
9 60.0 -0.001330 0.846086
10 66.0 -0.001456 0.851556
11 72.0 0.000764 0.855950
12 78.0 -0.000916 0.859778
13 84.0 -0.000916 0.859778
14 90.0 -0.001445 0.863569
15 96.0 -0.000287 0.864303
16 102.0 0.000056 0.865080
17 108.0 -0.001119 0.865642
18 114.0 -0.000843 0.866434
19 120.0 -0.000997 0.866809
20 126.0 -0.001243 0.866964
21 132.0 -0.002238 0.867180
22 138.0 -0.001015 0.867177
23 144.0 -0.000604 0.867505
24 150.0 0.000507 0.867571
25 156.0 -0.001569 0.867525
26 162.0 -0.001569 0.867525
27 168.0 -0.001131 0.866756
28 174.0 -0.001567 0.866884
29 180.0 -0.002645 0.867240
.. ... ... ...
242 1708.0 24.703866 0.288902
243 1714.0 26.469208 0.219226
244 1720.0 26.468838 0.250437
245 1726.0 26.468681 0.254972
246 1732.0 26.468173 0.271525
247 1738.0 26.468260 0.247282
248 1744.0 26.467666 0.296894
249 1750.0 26.468085 0.247300
250 1756.0 26.468085 0.247300
251 1762.0 26.467808 0.261096
252 1768.0 26.467958 0.259615
253 1774.0 26.467828 0.260871
254 1780.0 28.232325 0.185291
255 1786.0 28.231697 0.197642
256 1792.0 28.231170 0.172802
257 1798.0 28.231103 0.170685
258 1804.0 28.229453 0.184009
259 1810.0 28.230816 0.181833
260 1816.0 28.230913 0.188348
261 1822.0 28.230609 0.178440
262 1828.0 28.231144 0.168507
263 1834.0 28.231144 0.168507
264 1840.0 8.813723 0.641954
265 1846.0 8.814301 0.652373
266 1852.0 8.818517 0.651234
267 1858.0 8.820255 0.637536
268 1864.0 8.821443 0.628136
269 1870.0 8.823643 0.636616
270 1876.0 8.823297 0.635422
271 1882.0 8.823575 0.622253
Output:
s [A] [V]
0 303.000000 -0.000982 0.857416
1 636.000000 0.879220 0.792504
2 699.000000 1.759356 0.752446
3 759.000000 3.519479 0.707161
4 816.000000 5.278372 0.669020
5 876.000000 7.064800 0.637848
6 939.000000 8.828799 0.611196
7 999.000000 10.593054 0.584402
8 1115.333333 12.357359 0.556127
9 1352.000000 14.117167 0.528826
10 1382.000000 15.882287 0.498577
11 1439.000000 17.646748 0.468379
12 1502.000000 19.410817 0.437342
13 1562.666667 21.175572 0.402381
14 1621.000000 22.939826 0.365724
15 1681.000000 24.704600 0.317134
16 1744.000000 26.468235 0.256047
17 1807.000000 28.231037 0.179606
18 1861.000000 8.819844 0.638190
The current approach:
import math
import pandas as pd

df = df[['s', '[A]', '[V]']]

# Looping over the rows to separate current points
b = df['[A]'].iloc[0]
start = 0
segments = []  # renamed from 'list', which shadows the built-in
for index, row in df.iterrows():
    if not math.isclose(row['[A]'], b, abs_tol=1e-02):
        b = row['[A]']
        segments.append(df.iloc[start:index])
        start = index
segments.append(df.iloc[start:])

# Deleting the first few points after each current change
list_b = []
for l in segments:
    list_b.append(l.iloc[3:])

# Calculating mean values for each current point
list_c = []
for l in list_b:
    list_c.append(l.mean())
result = pd.DataFrame(list_c)

Does this help?
df.groupby(['Columnname', 'Columnname2']).mean()
You may need to create intermediate dataframes for each step. Can you provide an example of the output you want?
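A groupby-based way to do this: since the current only drifts slightly within a step, a new step can be detected from the jump between consecutive '[A]' values, and a cumulative sum of those jumps gives a group label. A minimal sketch, assuming the same df as above (the 0.01 A threshold and the 3 skipped rows mirror the loop version; tune both to your data):

import pandas as pd

# A new group starts whenever the current jumps by more than the tolerance
# (this compares consecutive rows, not the first value of each step).
step = df['[A]'].diff().abs().gt(1e-2).cumsum()

# Drop the first 3 rows of each step, then take the mean per step
# (steps with fewer than 4 rows come out as NaN)
result = (df.groupby(step)
            .apply(lambda g: g.iloc[3:].mean())
            .reset_index(drop=True))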

Related

Create new column with multiple values in Python

I have a dataframe which has the names of stations and the links to the measured values of each station for 2 days:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
Getting the data from the json isn't a big problem.
But then I realized that the json link for each station has multiple values from different times, and I don't know how to add these values for each time to a specific station.
I tried to get all the values from the json, but I can't tell which values belong to which station, because there are just too many.
Does anyone have a solution for me?
The dataframe I would like to have should look like this:
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608
Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import requests
import json
import pandas as pd

Define the function to be used by apply:
def get_link(x):
    global accum_df
    r = requests.get(x['Link'])
    if r.status_code == 200:
        # parse the JSON payload and tag each row with its station name
        ldf = pd.DataFrame(json.loads(r.text))
        ldf['station'] = x['Station']
        accum_df = pd.concat([accum_df, ldf])
    else:
        print(r.status_code)  # handle the error
    return None

Apply it:
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM
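Mutating a global from inside apply works, but pd.concat inside the loop re-copies the accumulated frame on every call. A sketch of the same idea that collects the per-station frames in a list and concatenates once at the end (assuming the same df with 'Station' and 'Link' columns):

import requests
import pandas as pd

frames = []
for _, row in df.iterrows():
    r = requests.get(row['Link'])
    if r.status_code == 200:
        ldf = pd.DataFrame(r.json())  # same payload as json.loads(r.text)
        ldf['station'] = row['Station']
        frames.append(ldf)
    else:
        print(r.status_code)  # handle the error

accum_df = pd.concat(frames, ignore_index=True)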

Mean of values in some columns with Pandas/Numpy

I started with Pandas and Numpy a couple of months ago and I've already learned quite a lot thanks to all the threads here. But now I can't find what I need.
For work, I have created an Excel sheet that calculates some figures to be used for re-ordering inventory. To practice, and maybe actually use it, I wanted to try to replicate the functionality in Python. Later I might want to add some more sophisticated calculations with the help of Scikit-learn.
So far I've managed to load a csv with sales figures from our ERP into a dataframe and calculate the mean and std. The calculations have been done on a subset of the data because I don't know how to apply calculations only to specific columns. The csv also contains, for example, product codes and lead times, and these should not be used for the average and std calculations. I'm also not sure yet how to merge this subset back into the original dataframe.
The reason why I didn't hardcode the column names is that the ERP reports the sales numbers over the past x months, so the order of the columns will change throughout the year and I want to keep them in chronological order.
My data from the csv looks like:
"code","leadtime","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec"
"001.002",60,299,821,351,614,246,957,968,939,125,368,727,231
"001.002",25,340,274,733,575,904,953,614,268,638,960,617,757
"001.002",130,394,327,435,767,377,699,424,951,972,717,317,264
What I've done so far works fine (though it can probably be done much more easily/efficiently):
import numpy as np
import timeit
import csv
import pandas as pd

sd = 1
csv_in = "data_in.csv"
csv_out = "data_out.csv"

# Use Pandas
df = pd.read_csv(csv_in, dtype={'code': str})

# Get the number of columns and subtract 2 for code and leadtime
cols = df.shape[1] - 2

# Create a subset and count the columns
df_subset = df.iloc[:, -cols:]
subset_cols = df_subset.shape[1]

# Add columns for std dev and average
df_subset = df_subset.assign(mean=df_subset.mean(axis=1),
                             stddev=df_subset.std(axis=1, ddof=0))

# Add columns for min and max values based on mean +/- std multiplied by factor sd
df_subset = df_subset.assign(minSD=df_subset['mean'].sub(df_subset['stddev'] * sd),
                             maxSD=df_subset['mean'].add(df_subset['stddev'] * sd))
df_subset
Which gives me:
jan feb mar apr may jun jul aug sep oct nov dec mean stddev minSD maxSD
0 299 821 351 614 246 957 968 939 125 368 727 231 553.833333 304.262998 249.570335 858.096332
1 340 274 733 575 904 953 614 268 638 960 617 757 636.083333 234.519530 401.563804 870.602863
2 394 327 435 767 377 699 424 951 972 717 317 264 553.666667 242.398203 311.268464 796.064870
However, for my next calculation I'm stuck again:
I want to calculate the average over the values from the "month" columns, using only the values that satisfy >= minSD and <= maxSD.
So for row 0, I'm looking for the value (299+821+351+614+368+727)/6 = 530
How can I achieve this?
I've tried this, but it doesn't seem to work:
df_subset = df_subset.assign(avgwithSD=df_subset.iloc[:,0:subset_cols].values(where(df_subset.values>=df_subset['minSD'] & df_subset.values>=df_subset['maxSD'])).mean(axis=1))
Some help would be very welcome. Thanks.
EDIT: With help I ended up using this to get further with my program
import numpy as np
import timeit
import csv
import pandas as pd

# sd will determine if range will be SD1 or SD2
sd = 1

# file to use
csv_in = "data_in.csv"
csv_out = "data_out.csv"

# Function to select the month values within the range between minSD and maxSD
# (the mean itself is taken afterwards with np.nanmean)
def CalcMeanSD(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]

# Use Pandas
df = pd.read_csv(csv_in, dtype={'code': str})

# Define the month/data columns and cast them to float values
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')

# Add columns for stddev and mean. Based on these values set the new range between minSD and maxSD
df['stddev'] = df.loc[:, months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)

# Add a column with the mean of the new range
df['avgwithSD'] = np.nanmean(df.apply(CalcMeanSD, axis=1), axis=1)
df
Result is:
code leadtime jan feb mar apr may jun jul aug sep oct nov dec stddev mean minSD maxSD avgwithSD
0 001.002 60 299.0 821.0 351.0 614.0 246.0 957.0 968.0 939.0 125.0 368.0 727.0 231.0 304.262998 553.833333 249.570335 858.096332 530.000000
1 001.002 25 340.0 274.0 733.0 575.0 904.0 953.0 614.0 268.0 638.0 960.0 617.0 757.0 234.519530 636.083333 401.563804 870.602863 655.666667
2 001.002 130 394.0 327.0 435.0 767.0 377.0 699.0 424.0 951.0 972.0 717.0 317.0 264.0 242.398203 553.666667 311.268464 796.064870 495.222222
3 001.002 90 951.0 251.0 411.0 469.0 359.0 220.0 192.0 250.0 818.0 768.0 937.0 128.0 292.572925 479.500000 186.927075 772.072925 365.000000
4 001.002 35 228.0 400.0 46.0 593.0 61.0 293.0 5.0 203.0 850.0 506.0 37.0 631.0 264.178746 321.083333 56.904588 585.262079 281.833333
5 001.002 10 708.0 804.0 208.0 380.0 531.0 125.0 500.0 773.0 354.0 238.0 805.0 215.0 242.371773 470.083333 227.711560 712.455106 451.833333
6 001.002 14 476.0 628.0 168.0 946.0 29.0 324.0 3.0 400.0 981.0 467.0 459.0 571.0 295.814225 454.333333 158.519109 750.147558 436.625000
7 001.002 14 92.0 906.0 18.0 537.0 57.0 399.0 544.0 977.0 909.0 687.0 881.0 459.0 333.154577 538.833333 205.678756 871.987910 525.200000
8 001.002 90 487.0 634.0 5.0 918.0 158.0 447.0 713.0 459.0 465.0 643.0 482.0 672.0 233.756447 506.916667 273.160220 740.673113 555.777778
9 001.002 130 741.0 43.0 976.0 461.0 35.0 321.0 434.0 8.0 330.0 32.0 896.0 531.0 326.216782 400.666667 74.449885 726.883449 415.400000
EDIT:
Instead of your original code:
# first part:
months_cols = df.columns[2:]
df.loc[:, months_cols] = df.loc[:, months_cols].astype('float64')
df['stddev'] = df.loc[:,months_cols].std(axis=1, ddof=0)
df['mean'] = df.loc[:, months_cols].mean(axis=1)
df['minSD'] = df['mean'].sub(df['stddev'] * sd)
df['maxSD'] = df['mean'].add(df['stddev'] * sd)
# second part: (the one that doesn't work for you)
def calc_mean_per_row_by_condition(row):
    months_ = row[2:14]
    min_SD = row[-2]
    max_SD = row[-1]
    return months_[(months_ >= min_SD) & (months_ <= max_SD)]

df['avgwithSD'] = np.nanmean(df.apply(calc_mean_per_row_by_condition, axis=1), axis=1)
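The row-wise apply works, but the same masked mean can be computed fully vectorized: DataFrame.where keeps the values inside the band and replaces the rest with NaN, which mean() then skips. A sketch assuming the months_cols, minSD and maxSD columns defined above:

months = df.loc[:, months_cols]
inside = months.ge(df['minSD'], axis=0) & months.le(df['maxSD'], axis=0)
df['avgwithSD'] = months.where(inside).mean(axis=1)  # mean() skips NaN by default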

Parsing and create a new df with conditions

I need some help with Python and pandas.
I have a dataframe whose column seq1_id holds the sequence IDs of species 1 and whose column seq2_id holds those of species 2.
I applied a filter to these sequences and got two more dataframes (one with all the species 1 sequences that passed the filter, and one with all the species 2 sequences that passed).
So I have 3 dataframes.
Because within a pair one sequence can pass the filter while the other does not, I need to keep only the pairs where both genes survived the two previous filtering steps. So what I need to do is parse my first df, such as this one:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
and check row by row if (e.g. for the first row) seq1_A is present in df2 and seq8_B is also present in df3; if so, keep this row from df1 and add it to a new df4.
Here is an example with output wanted:
first df:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
seq3_A Seq10_B
seq4_A Seq11_B
df2 (sp1) (seq3_A is absent)
Seq_1.id
seq1_A
seq2_A
seq4_A
df3 (sp2) (Seq11_B is absent)
Seq_2.id
seq8_B
Seq9_B
Seq10_B
Then, because Seq11_B and seq3_A are not present, the df4 (output) would be:
Seq_1.id Seq_2.id
seq1_A seq8_B
seq2_A Seq9_B
candidates_0035=pd.read_csv("candidates_genes_filtering_0035",sep='\t')
candidates_0042=pd.read_csv("candidates_genes_filtering_0042",sep='\t')
dN_dS=pd.read_csv("dn_ds.out_sorted",sep='\t')
df4 =dN_dS[dN_dS['seq1_id'].isin(candidates_0042['gene'])&dN_dS['seq2_id'].isin(candidates_0035['gene'])]
and I got an empty output, with only the column names, but it should not be like that.
Here are the data if you want to test the code on it:
df1:
Unnamed: 0 seq1_id seq2_id dN dS Dist_third_pos Dist_brute Length_seq_1 Length_seq_2 GC_content_seq1 GC_content_seq2 GC Mean_length
0 0 g66097.t1_0035_0035 g13600.t1_0042_0042 0.10455938989199982 0.3122332927029104 0.23600000000000002 0.142 535.0 1024.0 49.1588785046729 51.171875 50.165376752336456 535.0
1 1 g45594.t1_0035_0035 g1464.t1_0042_0042 0.5208761055250978 5.430485421797574 0.7120000000000001 0.489 246.0 222.0 47.967479674796756 44.594594594594604 46.28103713469567 222.0
2 2 g50055.t1_0035_0035 g34744.t1_0042_0035 0.08040473491714645 0.4233916132491867 0.262 0.139 895.0 749.0 56.312849162011176 57.67690253671562 56.994875849363396 749.0
3 3 g34020.t1_0035_0035 g12096.t1_0042_0042 0.4385191689737516 26.834927363887587 0.5760000000000001 0.433 597.0 633.0 37.85594639865997 39.810426540284354 38.83318646947217 597.0
4 4 g28436.t1_0035_0042 g35222.t1_0042_0035 0.055299811368483165 0.1181241496387666 0.1 0.069 450.0 461.0 45.111111111111114 44.90238611713666 45.006748614123886 450.0
5 5 g1005.t1_0035_0035 g11524.t1_0042_0042 0.3528036631463747 19.32549458735676 0.71 0.512 3177.0 3804.0 39.06200818382121 52.944269190325976 46.0031386870736 3177.0
6 6 g28456.t1_0035_0035 g31669.t1_0042_0035 0.4608959702286786 26.823981621115166 0.6859999999999999 0.469 516.0 591.0 49.224806201550386 53.46869712351946 51.346751662534935 516.0
7 7 g6202.t1_0035_0035 g193.t1_0042_0042 0.4679458383555545 17.81312422445775 0.66 0.462 804.0 837.0 41.91542288557214 47.67025089605735 44.79283689081474 804.0
8 8 g60667.t1_0035_0035 g14327.t1_0042_0042 0.046056273155280165 0.13320612138898 0.122 0.067 348.0 408.0 56.89655172413793 55.392156862745104 56.1443542934415 348.0
9 9 g30148.t1_0035_0042 g37790.t1_0042_0035 0.05631607180881047 0.19747150378706246 0.12300000000000001 0.08800000000000001 405.0 320.0 59.012345679012356 58.4375 58.72492283950618 320.0
10 10 g24481.t1_0035_0035 g37405.t1_0042_0035 0.2151957757290965 0.15106487998618026 0.135 0.17600000000000002 270.0 276.0 51.111111111111114 51.44927536231884 51.28019323671497 270.0
11 11 g33270.t1_0035_0035 g21201.t1_0042_0035 0.2773062983971916 21.13839474189674 0.6940000000000001 0.401 297.0 357.0 54.882154882154886 50.42016806722689 52.65116147469089 297.0
12 12 EOG090X03YJ_0035_0035_1 EOG090X03YJ_0042_0042_1 0.5402471721616758 19.278839157918302 0.7070000000000001 0.488 1321.0 1719.0 38.53141559424678 43.92088423502036 41.22614991463357 1321.0
13 13 g13075.t1_0035_0042 g504.t1_0042_0035 0.3317504066721263 4.790120127840871 0.65 0.38799999999999996 372.0 408.0 59.40860215053763 51.470588235294116 55.43959519291587 372.0
14 14 g1026.t1_0035_0035 g7716.t1_0042_0042 0.21445770772761286 13.92799368027682 0.626 0.344 336.0 315.0 38.095238095238095 44.444444444444436 41.26984126984127 315.0
15 15 g18238.t1_0035_0042 g35401.t1_0042_0035 0.3889830456691637 20.33679494952895 0.6759999999999999 0.44799999999999995 320.0 366.0 50.9375 49.453551912568315 50.19552595628416 320.0
df2:
Unnamed: 0 gene scaf_name start end cov_depth GC
179806 g13600.t1_0042_0042 scaffold_6556 1 1149 2.42361684558216 0.528846153846154
315037 g34744.t1_0042_0035 scaffold_8076 17 765 3.49803921568627 0.386138613861386
317296 g35222.t1_0042_0035 scaffold_9018 1 614 93.071661237785 0.41
183513 g14327.t1_0042_0042 scaffold_9358 122 529 3.3184165232357996 0.36
328164 g37790.t1_0042_0035 scaffold_16356 1 320 2.73125 0.436241610738255
326617 g37405.t1_0042_0035 scaffold_14890 1 341 1.3061224489795902 0.36898395721925104
188515 g15510.t1_0042_0042 scaffold_20183 1 276 137.326086956522 0.669354838709677
184561 g14562.t1_0042_0042 scaffold_10427 1 494 157.993927125506 0.46145940390544704
290684 g30982.t1_0042_0035 scaffold_3800 440 940 174.499839537869 0.39823008849557506
179993 g13632.t1_0042_0042 scaffold_6654 29 1114 3.56506849315068 0.46153846153846206
181670 g13942.t1_0042_0042 scaffold_7830 1 811 5.307028360049321 0.529411764705882
196148 g20290.t1_0042_0035 scaffold_1145 2707 9712 78.84112231766741 0.367283950617284
313624 g34464.t1_0042_0035 scaffold_7610 1 480 7.740440324449589 0.549019607843137
303133 g32700.t1_0042_0035 scaffold_5119 1735 2373 118.436578171091 0.49074074074074103
df3:
Unnamed: 0 gene scaf_name start end cov_depth GC
428708 g66097.t1_0035_0035 scaffold_306390 1 695 32.2431654676259 0.389880952380952
342025 g50055.t1_0035_0035 scaffold_188566 15 954 7.062893081761009 0.351129363449692
214193 g28436.t1_0035_0042 scaffold_231066 1 842 25.9774346793349 0.348837209302326
400337 g60667.t1_0035_0035 scaffold_261197 309 656 15.873529411764698 0.353846153846154
224023 g30148.t1_0035_0042 scaffold_263686 10 414 23.2072538860104 0.34108527131782895
184987 g24481.t1_0035_0035 scaffold_65047 817 1593 27.7840552416824 0.533898305084746
249413 g34492.t1_0035_0035 scaffold_106432 1 511 3.2482544608223396 0.368318122555411
249418 g34493.t1_0035_0035 scaffold_106432 547 1230 3.2482544608223396 0.368318122555411
12667 g1120.t1_0035_0042 scaffold_2095 2294 2794 47.864745898359295 0.56203288490284
252797 g35042.t1_0035_0035 scaffold_108853 274 1276 20.269592476489 0.32735426008968604
255878 g36112.t1_0035_0042 scaffold_437464 1 540 74.8252551020408 0.27884615384615397
40058 g4082.t1_0035_0042 scaffold_11195 579 1535 33.4396168320219 0.48487467588591204
271053 g39343.t1_0035_0042 scaffold_590976 1 290 19.6666666666667 0.38636363636363596
89911 g10947.t1_0035_0035 scaffold_21433 1735 2373 32.4222503160556 0.408571428571429
This should do it:
df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id'])&df1['Seq_2.id'].isin(df3['Seq_2.id'])]
df4
# Seq_1.id Seq_2.id
#0 seq1_A seq8_B
#1 seq2_A Seq9_B
EDIT
You must have swapped the two candidate frames; this doesn't return empty:
df4 = dN_dS[(dN_dS['seq1_id'].isin(candidates_0035['gene']))&(dN_dS['seq2_id'].isin(candidates_0042['gene']))]
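For reference, the same isin filter reproduced on the toy frames from the question (the frames are constructed inline here just to make the snippet self-contained):

import pandas as pd

df1 = pd.DataFrame({'Seq_1.id': ['seq1_A', 'seq2_A', 'seq3_A', 'seq4_A'],
                    'Seq_2.id': ['seq8_B', 'Seq9_B', 'Seq10_B', 'Seq11_B']})
df2 = pd.DataFrame({'Seq_1.id': ['seq1_A', 'seq2_A', 'seq4_A']})
df3 = pd.DataFrame({'Seq_2.id': ['seq8_B', 'Seq9_B', 'Seq10_B']})

# keep only rows where both members of the pair passed their filter
df4 = df1[df1['Seq_1.id'].isin(df2['Seq_1.id']) & df1['Seq_2.id'].isin(df3['Seq_2.id'])]
print(df4)
#   Seq_1.id Seq_2.id
# 0   seq1_A   seq8_B
# 1   seq2_A   Seq9_B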

compare 2 dataframe with pandas

This is the first time I am using pandas and I do not really know how to deal with my problem.
In fact I have 2 data frames:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
and
[78793 rows x 2 columns]
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column is the cluster ID and each cluster contains several sequence IDs.
What I need to do is first split all my clusters (in R it would be like: liste=split(x = data$V2, f = data$V1) ),
and then create a function which displays the most similar pair of sequences within each cluster.
Here is an example:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the 3rd column holds the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in 'blast' with sequences from two different clusters. In other words: in this solution the cluster ID of a pairing is determined from only one of the two sequence IDs.
Including cluster information and pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings which have the same substring in their seqid, the most readable way is to add columns holding these values:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec values can be filtered out the same way as the equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here - apart from the assumption above - is that if there are several exactly equal max values, only one of them is taken into account.
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly (np.int was removed from recent NumPy releases, so use the built-in int):
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
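Putting the steps above together, a runnable sketch of the whole pipeline under the same assumption (column names as in the question, blast and cluster already loaded with pandas.read_table):

data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
data = data[data['qseqid'] != data['sseqid']]  # drop self-pairings
data['qspec'] = [s.split('_')[1] for s in data['qseqid'].values]
data['sspec'] = [s.split('_')[1] for s in data['sseqid'].values]
data = data[data['qspec'] != data['sspec']]    # drop pairings within the same spec
result = data.loc[data.groupby('cluster_name')['pident'].idxmax()]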
This merges the dataframes first on sseqid, then on qseqid, and then returns results_df. Any rows with a 100% match are filtered out first. Let me know if this works. You can then order by cluster name. (DataFrame.append was removed in pandas 2.0; pd.concat does the same job.)
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names', right_on='sseqid')
results_df = pd.concat([results_df, cluster.merge(blast, left_on='seq_names', right_on='qseqid')])

Iterating over pandas rows to get minimum

Here is my dataframe:
Date cell tumor_size(mm)
25/10/2015 113 51
22/10/2015 222 50
22/10/2015 883 45
20/10/2015 334 35
19/10/2015 564 47
19/10/2015 123 56
22/10/2014 345 36
13/12/2013 456 44
What I want to do is compare the sizes of the tumors detected on different days. Let's take cell 222 as an example: I want to compare its size with cells detected on earlier days only, e.g. I will not compare it with cell 883, because they were detected on the same day, nor with cell 113, because it was detected later.
As my dataset is quite large, I have to iterate over the rows. If I explain it in a non-pythonic way:
for cell 222:
get_size_distance(absolute value):
(50 - 35 = 15), (50 - 47 = 3), (50 - 56 = 6), (50 - 36 = 14), (44 - 36 = 8)
get_minimum = 3; I got this value when comparing with 564, so I will name it as a pair for cell 222
Then do the same for cell 883
The resulting output should look like this:
Date cell tumor_size(mm) pair size_difference
25/10/2015 113 51 222 1
22/10/2015 222 50 123 6
22/10/2015 883 45 456 1
20/10/2015 334 35 345 1
19/10/2015 564 47 456 3
19/10/2015 123 56 456 12
22/10/2014 345 36 456 8
13/12/2013 456 44 NaN NaN
I will really appreciate your help
It's not pretty, but I believe it does the trick
from datetime import datetime
import pandas as pd

a = pd.read_clipboard()

# Cut off the last row since it had a faulty date. You can skip this.
df = a.copy().iloc[:-1]

# Convert to dates and order by date (assign the result; sort_values is not in-place).
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df = df.sort_values('Date', ascending=False)

# Rename the column
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})

# These will be our lists of pairs and size differences.
pairs = []
diffs = []

# Loop over all unique dates
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum
    for row in df.loc[df.Date == date].itertuples():
        # If no earlier cells are available, use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Otherwise take the row with the lowest absolute difference.
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])

df['pair'] = pairs
df['size_difference'] = diffs
returns:
Date cell tumor_size pair size_difference
0 2015-10-25 113 51 222.0 1.0
1 2015-10-22 222 50 564.0 3.0
2 2015-10-22 883 45 564.0 2.0
3 2015-10-20 334 35 345.0 1.0
4 2015-10-19 564 47 345.0 11.0
5 2015-10-19 123 56 345.0 20.0
6 2014-10-22 345 36 NaN NaN
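For a larger frame, the nested loops can be replaced by a single NumPy broadcast over all pairs. A sketch under the same assumptions as above (Date already parsed to datetimes and the size column renamed to tumor_size); note it builds an n x n matrix, so it fits thousands of rows, not millions:

import numpy as np

sizes = df['tumor_size'].to_numpy(dtype=float)
dates = df['Date'].to_numpy()

diff = np.abs(sizes[:, None] - sizes[None, :])   # all pairwise size differences
diff[dates[:, None] <= dates[None, :]] = np.inf  # only allow strictly earlier dates

best = diff.argmin(axis=1)                       # index of the closest earlier cell
valid = ~np.isinf(diff.min(axis=1))              # rows with at least one earlier cell
df['pair'] = np.where(valid, df['cell'].to_numpy()[best], np.nan)
df['size_difference'] = np.where(valid, diff.min(axis=1), np.nan)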
