Problem merging list and dataframe in Python

I have CSV files that I want to merge with a list of objects of a class I wrote.
The CSV has a 'sector' field and further fields with information about that sector.
The list elements are instances of a class with the fields name, x and y, where x, y is the location that belongs to that name.
This is how I built the list (I generated it from a CSV file as well, in which each antenna appears many times with different parameters, so I extracted only the ones I need):
# ant_file is the CSV with all the antennas, ant_list_name is the list with
# only antennas name and ant_list_tot is the list with the name and also x,y
# fields
for rowA in range(size_ant_file):
    rec = ant_file.iloc[rowA]['name']
    if rec not in ant_lis_name:
        ant_lis_name.append(rec)
        A = Antenna(ant_file.iloc[rowA]['name'], ant_file.iloc[rowA]['x'],
                    ant_file.iloc[rowA]['y'])
        ant_list_tot.append(A)
print(antenna_list)
[Antenna(name='UWE33', x=34.9, y=31.9), Antenna(name='UTN00', x=34.8,
y=32.1), Antenna(name='UWE02', x=34.8, y=32.1)]
I tried to do it with a double for loop:
@dataclass
class Antenna:
    name: str
    x: float
    y: float

# records is the csv file and antenna_list is the list of type Antenna
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
The resulting CSV file is only partly right. Towards the end there are rows where every field is correct except x and y, which are 0, and there are extra rows that have x and y values but none of the original fields.
It looks like the rows are shifted by a large offset, but I can't understand why.
I checked that there are no missing values.
Example:
records.csv at the beginning (date, hour and user_id are random numbers and are not important):
sector date hour user_id x y
abc 1.1.19 20:00 123 0 0
dfs 5.8.17 12:40 876 0 0
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
antenna_list in the form (name, x, y) (here too, x and y are random numbers and are not important):
antenna_list[0] = (abc,12,16)
antenna_list[1] = (dfs,6,20)
antenna_list[2] = (ngh,13,98)
antenna_list[3] = (yjt,18,41)
the result I want to see is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 13 98
yjt 10.10.16 17:18 492 18 41
abc 6.8.16 22:10 985 12 16
dfs 7.1.15 19:15 542 6 20
but the real result is:
sector date hour user_id x y
abc 1.1.19 20:00 123 12 16
dfs 5.8.17 12:40 876 6 20
ngh 6.9.19 08:12 962 0 0
yjt 10.10.16 17:18 492 0 0
abc 6.8.16 22:10 985 0 0
dfs 7.1.15 19:15 542 0 0
13 98
18 41
12 16
6 20
TIA

If you save antenna_list as two dicts,
antenna_dict_x = {'abc':12, 'dfs':6, 'ngh':13, 'yjt':18}
antenna_dict_y = {'abc':16, 'dfs':20, 'ngh':98, 'yjt':41}
then creating two columns should be an easy map,
data['x']=data['sector'].map(antenna_dict_x)
data['y']=data['sector'].map(antenna_dict_y)
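If the antennas are already held as a list of Antenna objects, the two dictionaries can be built from that list rather than typed by hand. The following is a minimal sketch, assuming the dataclass from the question and a small stand-in `records` frame that holds only the 'sector' column (sample values are the ones from the question's example):

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Antenna:
    name: str
    x: float
    y: float


antenna_list = [Antenna("abc", 12, 16), Antenna("dfs", 6, 20),
                Antenna("ngh", 13, 98), Antenna("yjt", 18, 41)]

# Stand-in for the records CSV; only the 'sector' column matters here.
records = pd.DataFrame({"sector": ["abc", "dfs", "ngh", "yjt", "abc", "dfs"]})

# One lookup dictionary per coordinate, keyed by antenna name.
antenna_dict_x = {a.name: a.x for a in antenna_list}
antenna_dict_y = {a.name: a.y for a in antenna_list}

# map() looks each sector up by value, so the row index plays no part.
records["x"] = records["sector"].map(antenna_dict_x)
records["y"] = records["sector"].map(antenna_dict_y)
print(records)
```

Sectors that have no matching antenna come out as NaN rather than raising an error, which makes unmatched rows easy to spot.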

So if you do:
import pandas as pd
class Antenna():
    def __init__(self, name, x, y):
        self.name = name
        self.x = x
        self.y = y

antenna_list = [Antenna('abc',12,16), Antenna('dfs',6,20), Antenna('ngh',13,98), Antenna('yjt',18,41)]
records = pd.read_csv('something.csv')
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is what you were expecting. Also, if you do:
import pandas as pd
from dataclasses import dataclass
@dataclass
class Antenna:
    name: str
    x: float
    y: float

antenna_list = [Antenna('abc',12,16), Antenna('dfs',6,20), Antenna('ngh',13,98), Antenna('yjt',18,41)]
records = pd.read_csv('something.csv')
for index in range(len(records)):
    rec = records.iloc[index]['sector']
    for i in range(len(antenna_list)):
        if rec == antenna_list[i].name:
            lat = antenna_list[i].x
            lon = antenna_list[i].y
            records.at[index, 'x'] = lat
            records.at[index, 'y'] = lon
            break
print(records)
you get:
sector date hour user_id x y
0 abc 1.1.19 20:00 123 12 16
1 dfs 5.8.17 12:40 876 6 20
2 ngh 6.9.19 8:12 962 13 98
3 yjt 10.10.16 17:18 492 18 41
4 abc 6.8.16 22:10 985 12 16
5 dfs 7.1.15 19:15 542 6 20
Which is, again, what you were expecting. You did not post how you created the antenna list, but I assume that is where your error is.
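One possible explanation for the shifted rows, which the question does not confirm, is that the index of records is not 0..n-1 (for example after filtering or concatenating). The loop reads with iloc, which is positional, but writes with at, which is label based; writing to a label that does not exist appends a new row. The sketch below reproduces that symptom on a hypothetical three-row frame with labels 10, 11, 12 and shows that resetting the index avoids it:

```python
import pandas as pd

# Hypothetical frame whose index labels are 10, 11, 12 instead of 0, 1, 2.
records = pd.DataFrame(
    {"sector": ["abc", "dfs", "ngh"], "x": [0, 0, 0], "y": [0, 0, 0]},
    index=[10, 11, 12],
)
coords = {"abc": (12, 16), "dfs": (6, 20), "ngh": (13, 98)}

broken = records.copy()
for pos in range(len(broken)):
    sector = broken.iloc[pos]["sector"]      # read by position
    broken.at[pos, "x"] = coords[sector][0]  # write by label: 0, 1, 2 do not exist yet
    broken.at[pos, "y"] = coords[sector][1]
print(broken)  # original rows keep x = y = 0; three extra rows hold only x and y

# With a fresh 0..n-1 index, positions and labels coincide again.
fixed = records.reset_index(drop=True)
for pos in range(len(fixed)):
    sector = fixed.iloc[pos]["sector"]
    fixed.at[pos, "x"] = coords[sector][0]
    fixed.at[pos, "y"] = coords[sector][1]
print(fixed)
```

If this is the cause, calling reset_index(drop=True) right after loading or filtering, or using a value-based lookup such as map, should remove the extra rows.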

Related

Replace blank value in dataframe based on another column condition

I have many blanks in a merged data set and I want to fill them with a condition.
My current code looks like this
import pandas as pd
import csv
import numpy as np
pd.set_option('display.max_columns', 500)
# Read all files into pandas dataframes
Jan = pd.read_csv(r'C:\~\Documents\Jan.csv')
Feb = pd.read_csv(r'C:\~\Documents\Feb.csv')
Mar = pd.read_csv(r'C:\~\Documents\Mar.csv')
Jan=pd.DataFrame({'Department':['52','5','56','70','7'],'Item':['2515','254','818','','']})
Feb=pd.DataFrame({'Department':['52','56','765','7','40'],'Item':['2515','818','524','','']})
Mar=pd.DataFrame({'Department':['7','70','5','8','52'],'Item':['45','','818','','']})
all_df_list = [Jan, Feb, Mar]
appended_df = pd.concat(all_df_list)
df = appended_df
df.to_csv(r"C:\~\Documents\SallesDS.csv", index=False)
Data set:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7
40
7 45
70
5 818
8
52
What I want is to fill the empty cells in Item with the value that corresponds to the Department column.
So if Department is 52 and Item is empty, it should be filled with 2515.
If Department is 7 and Item is empty, it should be filled with 45.
The result should look like this:
df
Department Item
52 2515
5 254
56 818
70
7 50
52 2515
56 818
765 524
7 45
40
7 45
70
5 818
8
52 2515
I tried the following methods but none of them worked.
1
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(52)), 'Item'] = 2515
df.loc[(df['Item'].isna()) & (df['Department'].str.contains(7)), 'Item'] = 45
2
df["Item"] = df["Item"].fillna(df["Department"])
df = df.replace({"Item":{"52":"2515", "7":"45"}})
Both either return an error or do not work.
Answer:
Hi I have used the below code and it worked
b = [52]
df.Item=np.where(df.Department.isin(b),df.Item.fillna(2515),df.Item)
a = [7]
df.Item=np.where(df.Department.isin(a),df.Item.fillna(45),df.Item)
Hope it helps someone who faces the same issue.
The following solution first creates a map of each department and its maximum corresponding item (assuming there is one), and then matches that item to a department with a blank item. Note that in your data frame, the empty items are an empty string ("") and not NaN.
Create a map:
values = df.groupby('Department').max()
values['Item'] = values['Item'].apply(lambda x: np.nan if x == "" else x)
values = values.dropna().reset_index()
Department Item
0 5 818
1 52 2515
2 56 818
3 7 45
4 765 524
Then use df.apply():
df['Item'] = df.apply(lambda x: values[values['Department'] == x['Department']]['Item'].values if x['Item'] == "" else x['Item'], axis=1)
In this case, the new values will have brackets around them. They can be removed with str.replace():
df['Item'] = df['Item'].astype(str).str.replace(r'\[|\'|\'|\]', "", regex=True)
The result:
Department Item
0 52 2515
1 5 254
2 56 818
3 70
4 7 45
0 52 2515
1 56 818
2 765 524
3 7 45
4 40
0 7 45
1 70
2 5 818
3 8
4 52 2515
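The same department-to-item lookup can also be applied without a row-wise apply and without the bracket clean-up. This is only a sketch on a shortened, hypothetical version of the data; it assumes that blanks are empty strings and, unlike the max-based map above, takes the first non-blank item seen for each department:

```python
import pandas as pd

# Shortened stand-in for the merged data; blanks are empty strings.
df = pd.DataFrame({
    "Department": ["52", "5", "7", "52", "7", "40"],
    "Item": ["2515", "254", "45", "", "", ""],
})

# Turn empty strings into missing values so fillna can see them.
items = df["Item"].mask(df["Item"] == "")

# First non-missing item per department (NaN if a department has none).
lookup = items.groupby(df["Department"]).first()

# Fill only the missing items from the lookup; leave the rest blank.
df["Item"] = items.fillna(df["Department"].map(lookup)).fillna("")
print(df)
```

Departments that never have an item, such as 40 here, stay blank.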

Appending a dictionary to a dataframe as a new column

I'm very new to Python and was hoping to get some help. I am following an online example where the author creates a dictionary, adds some data to it and then appends this to his original dataframe.
When I follow the code the data in the dictionary doesn't get appended to the dataframe and as such I can't continue with the example.
The author's code is as follows:
from collections import defaultdict
won_last = defaultdict(int)
for index,row in data.iterrows():
    home_team = row['HomeTeam']
    visitor_team = row['AwayTeam']
    row['HomeLastWin'] = won_last[home_team]
    row['VisitorLastWin'] = won_last[visitor_team]
    results.ix[index]=row
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']
When I run this code I get the error message (note that the name of the dataframe is different but apart from that nothing has changed)
AttributeError Traceback (most recent call last)
<ipython-input-46-d31706a5f745> in <module>
4 row['HomeLastWin'] = won_last[home_team]
5 row['VisitorLastWin'] = won_last[visitor_team]
----> 6 data.ix[index]=row
7 won_last[home_team] = row['HomeWin']
8 won_last[visitor_team] = not row['HomeWin']
~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5137 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5138 return self[name]
-> 5139 return object.__getattribute__(self, name)
5140
5141 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'ix'
If I change the row data.ix[index]=row to data.loc[index]=row the code runs ok but nothing happens to my dataframe
Below is an example of the dataset I am working with
Div Date Time HomeTeam AwayTeam FTHG FTAG FTR HomeWIn
E0 12/09/2020 12:30 Fulham Arsenal 0 3 A FALSE
E0 12/09/2020 15:00 Crystal Palace Southampton 1 0 H FALSE
E0 12/09/2020 17:30 Liverpool Leeds 4 3 H TRUE
E0 12/09/2020 20:00 West Ham Newcastle 0 2 A TRUE
E0 13/09/2020 14:00 West Brom Leicester 0 3 A FALSE
and below is the dataset of the example I am working through with the columns added
Date Visitor Team VisitorPts Home Team HomePts HomeWin
20 01/11/2013 Milwaukee 105 Boston 98 FALSE
21 01/11/2013 Miami Heat 100 Brooklyn 101 TRUE
22 01/11/2013 Clevland 84 Charlotte 90 TRUE
23 01/11/2013 Portland 113 Denver 98 FALSE
24 01/11/2013 Dallas 91 Houston 113 TRUE
HomeLastWin VisitorLastWIn
FALSE FALSE
FALSE FALSE
FALSE TRUE
FALSE FALSE
TRUE TRUE
Thanks
Jon
Could you please try this?
The data is stored as dataset_stack.csv.
from collections import defaultdict
won_last = defaultdict(int)
# Load the Pandas libraries with alias 'pd'
import pandas as pd
# Read data from file 'dataset_stack.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("dataset_stack.csv")
results=pd.DataFrame(data=data)
#print(results)
# Preview the first 5 lines of the loaded data
#data.head()
for index,row in data.iterrows():
    home_team = row['HomeTeam']
    visitor_team = row['VisitorTeam']
    row['HomeLastWin'] = won_last[home_team]
    row['VisitorLastWin'] = won_last[visitor_team]
    #results.ix[index]=row
    #results.loc[index]=row
    #add new column directly to dataframe instead of adding it to row & appending to dataframe
    results['HomeLastWin']=won_last[home_team]
    results['VisitorLastWin']=won_last[visitor_team]
    results.append(row, ignore_index=True)
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']
print(results)
Output:
Date VisitorTeam VisitorPts HomeTeam HomePts HomeWin \
0 1/11/2013 Milwaukee 105 Boston 98 False
1 1/11/2013 Miami Heat 100 Brooklyn 101 True
2 1/11/2013 Clevland 84 Charlotte 90 True
3 1/11/2013 Portland 113 Denver 98 False
4 1/11/2013 Dallas 91 Houston 113 True
HomeLastWin VisitorLastWin
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
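The all-zero output above is what this loop would be expected to produce: assigning a scalar to results['HomeLastWin'] overwrites the whole column on every pass, and the value returned by results.append is discarded. One alternative, sketched here on a made-up four-game frame (team names reused from the example, results invented purely for illustration), is to collect the per-row values in lists and attach them as columns after the loop:

```python
from collections import defaultdict

import pandas as pd

# Hypothetical schedule in which teams appear more than once.
data = pd.DataFrame({
    "HomeTeam": ["Boston", "Brooklyn", "Boston", "Miami Heat"],
    "VisitorTeam": ["Miami Heat", "Boston", "Brooklyn", "Boston"],
    "HomeWin": [False, True, True, False],
})

won_last = defaultdict(bool)  # teams not seen yet count as "did not win"
home_last_win, visitor_last_win = [], []

for row in data.itertuples(index=False):
    # Record what each team did in its previous game...
    home_last_win.append(won_last[row.HomeTeam])
    visitor_last_win.append(won_last[row.VisitorTeam])
    # ...then update the lookup with the outcome of this game.
    won_last[row.HomeTeam] = bool(row.HomeWin)
    won_last[row.VisitorTeam] = not row.HomeWin

# Attach the collected values as new columns in one step.
data["HomeLastWin"] = home_last_win
data["VisitorLastWin"] = visitor_last_win
print(data)
```

Because the lists are filled in row order, the new columns line up with the original rows without any row-level assignment inside the loop.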

Pandas: use apply to create 2 new columns

I have a dataset where column a represents the number of values in columns e, i, d, t, which are strings separated by a "-":
a e i d t
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
I want to create 8 new columns, 4 representing the SUM of (e-i-d-t), 4 the product.
For example:
def funct_two_outputs(E, I, d, t, d_calib = 50):
    return E+I+d+t, E*I*d*t
OUT first 2 values:
SUM_0, row0 = 40+0.5+30+1 SUM_1 = 80+0.3+32+1
The sum and product are example functions substituting my functions which are a bit more complicated.
I have written a function expand_on_col that separates all the e, i, d, t values into new columns:
def expand_on_col (df_, col_to_split = "namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split,
    return a df with the col split with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep,expand=True).add_prefix(prefix)
    df1 = pd.concat([df_,df1], axis=1).replace(np.nan, '-')
    return df1
Now I need to create 4 new columns that are the sum of e, i, d, t, and 4 that are the product.
Example output for SUM:
index a e i d t a-0 e-0 e-1 e-2 e-3 i-0 i-1 i-2 i-3 d-0 d-1 d-2 d-3 t-0 t-1 t-2 t-3 sum-0 sum-1 sum-2 sum-3
0 0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
1 1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
2 3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
3 5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
If I run the code with funct_one_outputs (which only returns the sum) it works, but with funct_two_outputs (sum and product) I get an error.
Here is the code:
import pandas as pd
import numpy as np

def expand_on_col (df_, col_to_split = "namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split,
    return a df with the col split with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep,expand=True).add_prefix(prefix)
    df1 = pd.concat([df_,df1], axis=1).replace(np.nan, '-')
    return df1

def funct_two_outputs(E, I, d, t, d_calib = 50): # the function I want to pass
    return E+I+d+t, E*I*d*t

def funct_one_outputs(E, I, d, t, d_calib = 50): # for now I can only use this one, can't use 2 return values
    return E+I+d+t

for col in columns:
    df = expand_on_col (df_=df, col_to_split = col, sep='-', prefix=f"{col}-")
cols_ = df.columns.drop(columns)
df[cols_]= df[cols_].apply(pd.to_numeric, errors="coerce")
df["a"] = df["a"].apply(pd.to_numeric, errors="coerce")
df.reset_index(inplace=True)
for i in range (max(df["a"])):
    name_1, name_2 = f"sum-{i}", f"mult-{i}"
    df[name_1] = df.apply(lambda row: funct_one_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
    # if I try to fill 2 outputs it won't work
    df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
OUT:
ValueError Traceback (most recent call last)
<ipython-input-306-85157b89d696> in <module>()
68 df[name_1] = df.apply(lambda row: funct_one_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
69 #if i try and fill 2 outputs it wont work
---> 70 df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
71
72
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __setitem__(self, key, value)
3039 self._setitem_frame(key, value)
3040 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3041 self._setitem_array(key, value)
3042 else:
3043 # set column
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in _setitem_array(self, key, value)
3074 )[1]
3075 self._check_setitem_copy()
-> 3076 self.iloc._setitem_with_indexer((slice(None), indexer), value)
3077
3078 def _setitem_frame(self, key, value):
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
1751 if len(ilocs) != len(value):
1752 raise ValueError(
-> 1753 "Must have equal len keys and value "
1754 "when setting with an iterable"
1755 )
ValueError: Must have equal len keys and value when setting with an iterable
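The ValueError arises because apply returns a single Series of tuples here, which cannot be assigned to two columns. One way to keep apply, sketched below on a tiny hypothetical frame that contains only the "-0" columns, is to pass result_type="expand" so that each returned tuple is spread over separate columns:

```python
import pandas as pd

# Hypothetical frame with already-split, numeric "-0" columns.
df = pd.DataFrame({
    "e-0": [40.0, 40.0],
    "i-0": [0.5, 0.1],
    "d-0": [30.0, 18.0],
    "t-0": [1.0, 1.0],
})


def funct_two_outputs(E, I, d, t):
    # Stand-in for the real function: returns (sum, product).
    return E + I + d + t, E * I * d * t


# result_type="expand" turns each returned tuple into one column per element,
# so the result is a two-column DataFrame that matches the two target names.
df[["sum-0", "mult-0"]] = df.apply(
    lambda row: funct_two_outputs(row["e-0"], row["i-0"], row["d-0"], row["t-0"]),
    axis=1,
    result_type="expand",
)
print(df)
```

In the question's loop the same keyword would go into the second df.apply call, with the column names built from the loop variable.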
Don't use apply if you can help it:
s = pd.to_numeric(
df[['e', 'i', 'd', 't']]
.stack()
.str.split('-', expand=True)
.stack()
)
sums = s.sum(level=[0, 2]).rename('Sum')
prods = s.prod(level=[0, 2]).rename('Prod')
sums_prods = pd.concat([sums, prods], axis=1).unstack()
sums_prods.columns = [f'{o}-{i}' for o, i in sums_prods.columns]
df.join(sums_prods)
a e i d t Sum-0 Sum-1 Sum-2 Sum-3 Prod-0 Prod-1 Prod-2 Prod-3
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
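The level= argument of Series.sum and Series.prod was removed in pandas 2.0, so on current versions the same reshaping can be written with groupby(level=...). The sketch below assumes a cut-down two-row, two-value version of the frame from the question:

```python
import pandas as pd

# Cut-down version of the question's frame: two values per cell.
df = pd.DataFrame({
    "e": ["40-80", "40-40"],
    "i": ["0.5-0.3", "0.1-0.1"],
    "d": ["30-32", "18-18"],
    "t": ["1-1", "1-2"],
})

# Long format: index levels are (row, column name, position within the cell).
s = pd.to_numeric(
    df[["e", "i", "d", "t"]]
    .stack()
    .str.split("-", expand=True)
    .stack()
)

# Aggregate over the column-name level, keeping row and position.
grouped = s.groupby(level=[0, 2])
sums = grouped.sum().rename("Sum")
prods = grouped.prod().rename("Prod")

# Back to wide format: one Sum-k and one Prod-k column per position k.
sums_prods = pd.concat([sums, prods], axis=1).unstack()
sums_prods.columns = [f"{o}-{i}" for o, i in sums_prods.columns]
out = df.join(sums_prods)
print(out)
```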

compare 2 dataframe with pandas

This is the first time I have used pandas and I do not really know how to deal with my problem.
I have 2 data frames:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
[78793 rows x 2 columns]
and
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and each cluster contains several sequence IDs.
What I need to do first is split all my clusters (in R it would be something like: liste=split(x = data$V2, f = data$V1) ).
Then I need to create a function which displays the most similar pair of sequences within each cluster.
Here is an example:
Let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the 3rd column holds the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is :
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account sequences paired with themselves.
If someone could give me some clues it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in 'blast' between sequences from two different clusters. In other words, in this solution the cluster ID of a pairing is determined by only one of the two sequence IDs.
Combine the cluster information and the pairing information into one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings which have the same substrings in their seqid, the most readable way would be to add data columns with these data:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec-values can be filtered the same way like it was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption mentioned above, is that if several pairings share exactly the same maximum value, only one of them is returned.
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly (np.int has been removed from NumPy, so the built-in int is used):
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
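The merge, self-pair filter and idxmax steps can be tried end to end on a toy version of the question's seq1..seq6 example. This is only a sketch: the two small frames are hypothetical stand-ins for the real files, and the qspec/sspec filter is left out because the toy IDs have no suffix:

```python
import pandas as pd

# Toy stand-ins for the 'cluster' and 'blast' tables.
cluster = pd.DataFrame({
    "cluster_name": [1, 1, 1, 2, 2],
    "seq_names": ["seq1", "seq2", "seq3", "seq5", "seq6"],
})
blast = pd.DataFrame({
    "qseqid": ["seq1", "seq1", "seq1", "seq2", "seq5", "seq5"],
    "sseqid": ["seq1", "seq2", "seq3", "seq3", "seq5", "seq6"],
    "pident": [100.0, 90.0, 56.0, 70.0, 100.0, 89.0],
})

# Attach the cluster ID of the query sequence to every pairing.
data = cluster.merge(blast, left_on="seq_names", right_on="qseqid")

# Drop pairings of a sequence with itself.
data = data[data["qseqid"] != data["sseqid"]]

# For each cluster, keep the row with the highest similarity.
best = data.loc[data.groupby("cluster_name")["pident"].idxmax()]
print(best[["cluster_name", "qseqid", "sseqid", "pident"]])
```

On this toy input the best pair is seq1 vs seq2 for cluster 1 and seq5 vs seq6 for cluster 2, matching the result asked for in the question.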
This merges the dataframes based first on sseqid, then on qseqid, and then returns results_df. Any with 100% match are filtered out. Let me know if this works. You can then order by cluster name.
blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names',right_on='sseqid')
results_df = pandas.concat([results_df, cluster.merge(blast, left_on='seq_names',right_on='qseqid')])

Iterating over pandas rows to get minimum

Here is my dataframe:
Date cell tumor_size(mm)
25/10/2015 113 51
22/10/2015 222 50
22/10/2015 883 45
20/10/2015 334 35
19/10/2015 564 47
19/10/2015 123 56
22/10/2014 345 36
13/12/2013 456 44
What I want to do is compare the size of the tumors detected on the different days. Let's consider the cell 222 as an example; I want to compare its size to different cells but detected on earlier days e.g. I will not compare its size with cell 883, because they were detected on the same day. Or I will not compare it with cell 113, because it was detected later on.
As my dataset is too large, I have to iterate over the rows. To explain it in a non-pythonic way:
for the cell 222:
get_size_distance (absolute value):
(50 - 35 = 15), (50 - 47 = 3), (50 - 56 = 6), (50 - 36 = 14), (50 - 44 = 6)
get_minimum = 3; I got this value when I compared it with 564, so I will name it as the pair for cell 222
Then do the same for cell 883
The resulting output should look like this:
Date cell tumor_size(mm) pair size_difference
25/10/2015 113 51 222 1
22/10/2015 222 50 123 6
22/10/2015 883 45 456 1
20/10/2015 334 35 345 1
19/10/2015 564 47 456 3
19/10/2015 123 56 456 12
22/10/2014 345 36 456 8
13/12/2013 456 44 NaN NaN
I will really appreciate your help
It's not pretty, but I believe it does the trick
from datetime import datetime
import pandas as pd

a = pd.read_clipboard()
# Cut off last row since it was a faulty date. You can skip this.
df = a.copy().iloc[:-1]
# Convert to dates and order just in case (not really needed I guess).
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df = df.sort_values('Date', ascending=False)
# Rename column
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})
# These will be our lists of pairs and size differences.
pairs = []
diffs = []
# Loop over all unique dates
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum
    for row in df.loc[df.Date == date].itertuples():
        # If no cells earlier are available use nans.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Take lowest absolute value and fill in otherwise
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])
df['pair'] = pairs
df['size_difference'] = diffs
returns:
Date cell tumor_size pair size_difference
0 2015-10-25 113 51 222.0 1.0
1 2015-10-22 222 50 564.0 3.0
2 2015-10-22 883 45 564.0 2.0
3 2015-10-20 334 35 345.0 1.0
4 2015-10-19 564 47 345.0 11.0
5 2015-10-19 123 56 345.0 20.0
6 2014-10-22 345 36 NaN NaN
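For comparison, the same pairing can be expressed without explicit Python loops by pairing every row with every other row through a cross merge and then filtering. This is a sketch only, on the first four rows of the question's data with the size column renamed to tumor_size; a cross merge needs pandas 1.2 or later and memory quadratic in the number of rows, so it may not suit a very large dataset:

```python
import pandas as pd

# First four rows of the question's data.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["25/10/2015", "22/10/2015", "22/10/2015", "20/10/2015"],
        format="%d/%m/%Y",
    ),
    "cell": [113, 222, 883, 334],
    "tumor_size": [51, 50, 45, 35],
})

# Every row paired with every row; right-hand columns get the "_other" suffix.
pairs = df.merge(df, how="cross", suffixes=("", "_other"))

# Keep only partners detected on a strictly earlier day.
pairs = pairs[pairs["Date_other"] < pairs["Date"]].copy()
pairs["size_difference"] = (pairs["tumor_size"] - pairs["tumor_size_other"]).abs()

# For each cell, keep the partner with the smallest size difference.
best = pairs.loc[
    pairs.groupby("cell")["size_difference"].idxmin(),
    ["cell", "cell_other", "size_difference"],
].rename(columns={"cell_other": "pair"})

# Cells with no earlier partner get NaN through the left merge.
result = df.merge(best, on="cell", how="left")
print(result)
```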
