Comparing values in different pairs of columns in Pandas - python
I would like to count how many times column A has the same value as B and as C. Similarly, I would like to count how many times A2 has the same value as B2 and as C2.
I have this dataframe:
,A,B,C,A2,B2,C2
2018-12-01,7,0,8,17,17,17
2018-12-02,0,0,8,20,18,18
2018-12-03,9,8,8,17,17,18
2018-12-04,8,8,8,17,17,18
2018-12-05,8,8,8,17,17,17
2018-12-06,9,8,8,15,17,17
2018-12-07,8,9,9,17,17,16
2018-12-08,0,0,0,17,17,17
2018-12-09,8,0,0,17,20,18
2018-12-10,8,8,8,17,17,17
2018-12-11,8,8,9,17,17,17
2018-12-12,8,8,8,17,17,17
2018-12-13,8,8,8,17,17,17
2018-12-14,8,8,8,17,17,17
2018-12-15,9,9,9,17,17,17
2018-12-16,12,0,0,17,19,17
2018-12-17,11,9,9,17,17,17
2018-12-18,8,9,9,17,17,17
2018-12-19,8,9,8,17,17,17
2018-12-20,9,8,8,17,17,17
2018-12-21,9,9,9,17,17,17
2018-12-22,10,9,0,17,17,17
2018-12-23,10,11,10,17,17,17
2018-12-24,10,10,8,17,19,17
2018-12-25,7,10,10,17,17,18
2018-12-26,10,0,10,17,19,17
2018-12-27,9,10,8,18,17,17
2018-12-28,9,9,9,17,17,17
2018-12-29,10,10,12,18,17,17
2018-12-30,10,0,10,16,19,17
2018-12-31,11,8,8,19,17,16
I expect the following values:
A with B = 14
A with C = 14
A2 with B2 = 14
A2 with C2 = 14
I have done this:
ia = 0
for i in range(0, len(dfr_h_max1)):
    if dfr_h_max1['A'][i] == dfr_h_max1['B'][i]:
        ia = ia + 1

ib = 0
for i in range(0, len(dfr_h_max1)):
    if dfr_h_max1['A'][i] == dfr_h_max1['C'][i]:
        ib = ib + 1
In order to take advantage of pandas, this is one possible solution:
import numpy as np
dfr_h_max1['que'] = np.where((dfr_h_max1['A'] == dfr_h_max1['B']), 1, 0)
After that I could sum all the elements in the new column 'que'.
Another possibility might be to work with boolean comparisons directly, but unfortunately I do not yet know enough about that.
Any other more efficient or elegant solutions?
The primary calculation you need here is, for example, dfr_h_max1['A'] == dfr_h_max1['B'] - as you've done in your edit. That gives you a Series of True/False values based on the equality of each pair of items in the two series. Since True evaluates to 1 and False evaluates to 0, the .sum() is the count of how many Trues there were - and hence, how many matches.
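As a minimal sketch of just that step on the question's frame (matches and count are names introduced here for illustration):

matches = dfr_h_max1['A'] == dfr_h_max1['B']  # Boolean Series, one entry per row
count = matches.sum()                         # True counts as 1, so this is the number of matching rows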
Put that in a loop and add the required "text" for the output you want:

mains = ('A', 'A2')                 # the main columns
comps = (['B', 'C'], ['B2', 'C2'])  # columns to compare each main with

for main, pair in zip(mains, comps):
    for col in pair:
        print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')
        # or without f-strings, do:
        # print(main, 'with', col, '=', (dfr_h_max1[main] == dfr_h_max1[col]).sum())
Output:
A with B = 14
A with C = 14
A2 with B2 = 21
A2 with C2 = 20
Btw, (df[main] == df[comp]).sum(), which uses Series.sum(), can also be written as sum(df[main] == df[comp]) using Python's built-in sum().
In case you have more than two "triplets" of columns (not just A & A2), change the mains and comps to this, so that it works on all triplets:
mains = dfr_h_max1.columns[::3]        # main columns (A's), in steps of 3
comps = zip(dfr_h_max1.columns[1::3],  # offset by 1 column (B's),
            dfr_h_max1.columns[2::3])  # offset by 2 columns (C's),
                                       # in steps of 3
(Or even using the column names / starting letter.)
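If you'd rather collect the results in a data structure than print them, a small sketch along the same lines (counts is a name introduced here):

counts = {(main, col): int((dfr_h_max1[main] == dfr_h_max1[col]).sum())
          for main, pair in zip(mains, comps) for col in pair}
# e.g. counts[('A', 'B')] == 14 for the data above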
Related
Python script to sum values according to conditions in a loop
I need to sum the value contained in a column (column 9) if a condition is satisfied: the condition is that it needs to be a pair of individuals (column 1 and column 3), whether they are repeated or not. My input file is made this way:

Sindhi_HGDP00171 0 Tunisian_39T 0 1 120437718 147097266 3.02 7.111
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 3.468
Sindhi_HGDP00183 1 Sindhi_HGDP00206 2 1 242708729 244766624 7.41 4.468
IBS_HG01768 2 Moroccan_MRA46 1 1 34186193 36027711 30.46 3.108
IBS_HG01710 1 Sardinian_HGDP01065 2 1 246117191 249120684 7.53 3.258
IBS_HG01768 2 Moroccan_MRA46 2 1 34186193 37320967 43.4 4.418

Therefore, for instance, I would need the value of column 9 for each pair to be summed. Some of these pairs appear multiple times; in that case I would need the sum of the column 9 values between IBS_HG01768 and Moroccan_MRA46, and between Sindhi_HGDP00183 and Sindhi_HGDP00206. Some of these pairs are not repeated, but I still need them to appear in the final results. What I have managed so far is to sum by group (population), so I sum the column 9 values by pair of populations, like Sindhi and Tunisian for instance. I need to do the sum by pairs of individuals. My script is this:

import pandas as pd
import numpy as np
import itertools

# define column names
cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']

# load data (the file needs to be in the same folder as the script)
data = pd.read_csv("./Roma_Ref_All_sorted.txt", sep='\t', names=cols)

# remove the sample ID from the ID1/ID2 columns and place it in two dedicated columns
data[['ID1', 'ID1_samples']] = data['ID1'].str.split('_', expand=True)
data[['ID2', 'ID2_samples']] = data['ID2'].str.split('_', expand=True)

# get the group list from both ID columns...
groups_id1 = list(data.ID1.unique())
groups_id2 = list(data.ID2.unique())
groups = list(set(groups_id1 + groups_id2))

# ... and all the possible pairs
group_pairs = [i for i in itertools.combinations(groups, 2)]

# subset the pairs involving Roma
group_pairs_roma = [x for x in group_pairs
                    if ('Roma' in x[0] and x[0] != 'Romanian')
                    or ('Roma' in x[1] and x[1] != 'Romanian')]

# prepare the output df
result = pd.DataFrame(columns=['ID1', 'ID2', 'IBD_sum'])

# loop over all the possible pairs and compute the sum of IBD length
for idx, group_pair in enumerate(group_pairs_roma):
    id1 = group_pair[0]
    id2 = group_pair[1]
    ibd_sum = round(data.loc[((data['ID1'] == id1) & (data['ID2'] == id2)) |
                             ((data['ID1'] == id2) & (data['ID2'] == id1)),
                             'IBDLENGTH'].sum(), 3)
    result.loc[idx, ['ID1', 'ID2', 'IBD_sum']] = [id1, id2, ibd_sum]

# save results
result.to_csv("./groups_pairs_sum_IBD.txt", sep='\t', index=False)

My current output is something like this:

ID1     ID2       IBD_sum
Sindhi  IBS       3.275
Sindhi  Moroccan  74.201
Sindhi  Sindhi    119.359

While I need something like:

ID1                 ID2                   IBD_sum
Sindhi_individual1  Moroccan_individual1  3.275
Sindhi_individual2  Moroccan_individual2  5.275
Sindhi_individual3  IBS_individual1       4.275

I have tried substituting lines in my code, writing

groups_id1 = list(data.ID1_samples.unique())
groups_id2 = list(data.ID2_samples.unique())

and later

ibd_sum = round(data.loc[((data['ID1_samples'] == id1) & (data['ID2_samples'] == id2)) |
                         ((data['ID1_samples'] == id2) & (data['ID2_samples'] == id1)),
                         'IBDLENGTH'].sum(), 3)

which in theory should work, because I set the individuals as pairs instead of the populations as pairs, but the output was empty. What could I do to edit the code for what I need?
I have solved the problem on my own, using the R language instead. This is the code:

library(dplyr)  # needed for %>%, group_by and summarise

ibd <- read.delim("input.txt", sep = '\t')
ibd_sum_indv <- ibd %>%
    group_by(ID1, ID2) %>%
    summarise(SIBD = sum(IBDLENGTH), NIBD = n()) %>%
    ungroup()
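For reference, a pandas sketch of the same grouped aggregation (my own translation, not part of the original answer; it assumes the IDs are left unsplit so that each ID1/ID2 pair identifies a pair of individuals, and it uses named aggregation, which needs pandas 0.25+):

import pandas as pd

# column names as defined in the question's script
cols = ['ID1', 'HAP1', 'ID2', 'HAP2', 'CHR', 'STARTPOS', 'ENDPOS', 'LOD', 'IBDLENGTH']
ibd = pd.read_csv("input.txt", sep='\t', names=cols)

# group by the full ID pair and aggregate, mirroring the dplyr pipeline
ibd_sum_indv = (ibd.groupby(['ID1', 'ID2'], as_index=False)
                   .agg(SIBD=('IBDLENGTH', 'sum'),
                        NIBD=('IBDLENGTH', 'size')))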
Find indices of ndarray compared with ndarray
I have two unsorted ndarrays with the following structure:

a1 = np.array([[0,4,2,3],
               [0,2,5,6],
               [2,3,7,4],
               [6,0,9,8],
               [9,0,6,7]])
a2 = np.array([[3,4,2],
               [0,6,9]])

I would like to find all the indices of a1 where each a2 row occurs, and also the positions of its values inside that a1 row:

result = [[0, [3,1,2]],
          [2, [1,3,0]],
          [3, [1,0,2]],
          [4, [1,2,0]]]

In this example, a2[0] is in a1 at positions 0 and 2, with its values at positions 3,1,2 and 1,3,0 within those rows. a2[1] is at positions 3 and 4, with its values at positions 1,0,2 and 1,2,0. Each a2 row appears exactly twice in a1. a1 has at least 1 million rows and a2 around 10,000, so the algorithm should also be quite fast (if possible). So far, I was thinking about this approach:

big_res = []
for r in xrange(len(a2)):
    big_indices = np.argwhere(a1 == a2[r])
    small_res = []
    for k in xrange(2):
        small_indices = [i for i in a2[r] if i in a1[big_indices[k]]]
        np.append(small_res, small_indices)
    combined_res = [[big_indices[0], small_res[0]], [big_indices[1], small_res[1]]]
    np.append(big_res, combined_res)
Using numpy_indexed (disclaimer: I am its author), what I think of as the hard part can be written efficiently as follows:

import numpy_indexed as npi

a1s = np.sort(a1, axis=1)
a2s = np.sort(a2, axis=1)
matches = np.array([npi.indices(a2s, np.delete(a1s, i, axis=1), missing=-1)
                    for i in range(4)])
rows, cols = np.argwhere(matches != -1).T
a1idx = cols
a2idx = matches[rows, cols]
# result.shape = [len(a2), 2]
result = npi.group_by(a2idx).split_array_as_array(a1idx)

This only gives you the matches efficiently, not the relative orders. But once you have the matches, computing the relative orders should be simple to do in linear time.

EDIT: and some code of questionable density to get your relative orderings:

order = npi.indices(
    (np.indices(a1.shape)[0].flatten(), a1.flatten()),
    (np.repeat(result.flatten(), 3), np.repeat(a2, 2, axis=0).flatten())
).reshape(-1, 2, 3) - result[..., None] * 4
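If numpy_indexed is not at hand, the matching step alone can also be sketched in plain Python with a dictionary keyed on sorted rows; this is my own illustration of the same idea (linear in len(a1), but with much larger constant factors than the vectorized version), assuming a1 and a2 as defined in the question:

import numpy as np
from collections import defaultdict

# map each sorted a2 row to its row number
lookup = {tuple(sorted(row)): i for i, row in enumerate(a2)}

positions = defaultdict(list)
for j, row in enumerate(a1):
    # every 3-element subset of the 4-element row, obtained by dropping one element
    keys = {tuple(sorted(np.delete(row, k))) for k in range(4)}
    for key in keys:
        if key in lookup:
            positions[lookup[key]].append(j)  # a1 row indices, grouped per a2 row

positions[i] then holds the two a1 row indices for a2[i]; the within-row orderings can be recovered from those rows afterwards.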
Advanced groupby column creation in pandas DataFrame
I have a catalogue of groups of galaxies in a DataFrame, compact, which consists mainly of a group id ('CG', int), a magnitude ('R', negative float) and a morphology ('Morph', string, for example 'S' or 'E'). I'm trying to construct a second pandas DataFrame with the following properties of the groups:

- 'Morph' of the object having the lowest 'R' in the group
- Difference between the second lowest and the lowest 'R' in the group
- Difference between the lowest 'R' in the group and the R of the group, defined as -2.5*log10(sum(10**(-0.4*R)))
- Proportions of objects having a given 'Morph' (one column for 'S', one for other morphologies, for example) in the group, NOT COUNTING THE ONE HAVING THE LOWEST 'R'

I'm having trouble with the last one; could you help me write it? The other ones work but, as a secondary question, I would like to know whether I'm doing this cleanly or whether there is a better way. Here is my code (with a line for my last column which works but doesn't give exactly what I want, and an attempt in comments which doesn't work):

GroupBy = compact.sort_values('R').groupby('CG', as_index=False)
R2 = GroupBy.head(2).groupby('CG', as_index=False).last().R
R1 = GroupBy.first().sort_values('CG').R
DeltaR12 = R2 - R1
MorphCen = GroupBy.first().sort_values('CG').Morph
Group = GroupBy.first().sort_values('CG').CG
RGroup = GroupBy.apply(lambda x: -2.5*np.log10((10**(-0.4*x.R)).sum()))
DeltaR1gr = R1 - RGroup

# Works, but counts the object with lowest R:
PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S'].shape[0]/x.shape[0])

# Tries to leave aside the lowest R, but doesn't work:
# PropS = GroupBy.apply(lambda x: 1.0*x.loc[x['Morph'] == 'S' &
#                                           x['R'] > x['R'].min()].shape[0]/x.shape[0])
# PropRed = same as PropS, but for 'Morph' != 'S'

CompactML = pd.DataFrame([Group, MorphCen, DeltaR12, DeltaR1gr]).transpose()
CompactML.columns = ['CG', 'MorphCen', 'DeltaR12', 'DeltaR1gr']
First, it's nice if you provide actual data or create some fake data. Below I have created some fake data with 5 different integer CG groups, 2 types of morphology (S and E) and random negative numbers for 'R'. I have then redone all your aggregations in a custom function that computes each of the 4 requested aggregations in one line each and returns the results as a Series, which makes each output a column of the aggregated DataFrame.

# create fake data
df = pd.DataFrame({'CG': np.random.randint(0, 5, 100),
                   'Morph': np.random.choice(['S', 'E'], 100),
                   'R': np.random.rand(100) * -100})

print(df.head())

   CG Morph          R
0   3     E -72.377887
1   2     E -26.126565
2   0     E  -4.428494
3   0     E  -2.055434
4   4     E -93.341489

# define custom aggregation function
def my_agg(x):
    x = x.sort_values('R')
    morph = x.head(1)['Morph'].values[0]
    diff = x.iloc[0]['R'] - x.iloc[1]['R']
    diff2 = -2.5*np.log10(sum(10**(-0.4*x['R'])))
    prop = (x['Morph'].iloc[1:] == 'S').mean()
    return pd.Series([morph, diff, diff2, prop],
                     index=['morph', 'diff', 'diff2', 'prop'])

# apply custom agg function
df.groupby('CG').apply(my_agg)

   morph       diff      diff2      prop
CG
0      E  -1.562630 -97.676934  0.555556
1      S  -3.228845 -98.398337  0.391304
2      S  -6.537937 -91.092164  0.307692
3      E  -0.023813 -99.919336  0.500000
4      E -11.943842 -99.815734  0.705882
So, here is the final code, thanks to Ted Petrou:

# define custom aggregation function
def my_agg(x):
    x = x.sort_values('R')
    morph = x.head(1)['Morph'].values[0]
    diff = x.iloc[1]['R'] - x.iloc[0]['R']
    diff2 = x.iloc[0]['R'] + 2.5*np.log10(sum(10**(-0.4*x['R'])))
    prop = (x['Morph'].iloc[1:] == 'S').mean()
    return pd.Series([morph, diff, diff2, prop],
                     index=['MorphCen', 'DeltaR12', 'DeltaRGrp1', 'PropS'])

# apply custom agg function
compact.groupby('CG').apply(my_agg)
Pandas: Index of last non-equal row
I have a pandas data frame F with a sorted index I. I am interested in knowing about the last change in one of the columns, let's say A. In particular, I want to construct a series with the same index as F, namely I, whose value at i is j, where j is the greatest index value less than i such that F[A][j] != F[A][i]. For example, consider the following frame:

   A
1  5
2  5
3  6
4  2
5  2

The desired series would be:

1  NaN
2  NaN
3    2
4    3
5    3

Is there a pandas/numpy idiomatic way to construct this series?
Try this:

df['B'] = np.nan
last = np.nan
for index, row in df.iterrows():
    if index == 0:
        continue
    # when the value changes, the previous row is the last non-equal one
    if df['A'].iloc[index] != df['A'].iloc[index - 1]:
        last = index - 1
    df.loc[index, 'B'] = last

This will create a new column with the results (it assumes a default zero-based integer index, since iloc is used with the row labels). I believe that changing the rows as you pass through them is not a good idea; after that, you can simply replace a column and delete the other if you wish.
np.argmax or pd.Series.argmax on Boolean data can help you find the first (or, in this case, last) True value. You still have to loop over the series in this solution, though.

# Initiate source data
F = pd.DataFrame({'A': [5,5,6,2,2]}, index=list('fobni'))

# Initiate resulting Series to NaN
result = pd.Series(np.nan, index=F.index)

for i in range(1, len(F)):
    value_at_i = F['A'].iloc[i]
    values_before_i = F['A'].iloc[:i]
    # Get differences as a Boolean Series
    # (keeping the original index)
    diffs = (values_before_i != value_at_i)
    if diffs.sum() == 0:
        continue
    # Reverse the Series of differences,
    # then find the index of the first True value
    j = diffs[::-1].argmax()
    result.iloc[i] = j
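A vectorized alternative is also possible; this is my own sketch, not from the answers above: mark the rows where A changes, record the previous row's index label there, and forward-fill (using F, pd and np as defined above):

prev_label = pd.Series(F.index, index=F.index).shift()  # index label of the previous row
change = F['A'].ne(F['A'].shift())  # True wherever the value differs from the row before
change.iloc[0] = False              # the first row has no predecessor
result = prev_label.where(change).ffill()  # carry the last change point forward

At each change point this records the previous row's label, and forward-filling carries it through the unchanged rows, reproducing the desired series.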
Finding the position of a subsequence in a sequence
If T1 is this:

T1 = pd.DataFrame(data={'val': ['B','D','E','A','D','B','A','E','A','D','B']})

and P is this:

P = pd.DataFrame(data={'val': ['E','A','D','B']})

how do I get the positions of P within T1? In terms of min and max, I would like to see this returned:

min  max
  3    6
  8   11

If these dataframes were represented as SQL tables, I could use this SQL method, translated to pandas:

DECLARE @Items INT = (SELECT COUNT(*) FROM #P);

SELECT MIN(t.KeyCol) AS MinKey, MAX(t.KeyCol) AS MaxKey
FROM dbo.T1 AS t
INNER JOIN #P AS p
    ON p.Val = t.Val
GROUP BY t.KeyCol - p.KeyCol
HAVING COUNT(*) = @Items;

This SQL solution is from Pesomannen's reply to http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2
Well, you can always do a workaround like this:

t1 = ''.join(T1.val)
p = ''.join(P.val)
start, res = 0, []
while True:
    try:
        res.append(t1.index(p, start))
        start = res[-1] + 1
    except ValueError:
        break

to get the starting indices, and then figure out the ending indices by arithmetic and access the dataframe using iloc. Note that this relies on each val being a single character. You should use 0-based indexing (not 1-based, as you do in the example).
Granted, this doesn't utilize P, but may serve your purposes.

groups = T1.groupby(T1.val).groups
pd.DataFrame({'min': [min(x) for x in groups.values()],
              'max': [max(x) for x in groups.values()]},
             index=groups.keys())

yields

   max  min
E    7    2
B   10    0
D    9    1
A    8    3

[4 rows x 2 columns]
I think I've worked it out by following the same approach as the SQL solution - a type of relational division (i.e. match up on the values, group by the differences in the key columns and select the group that has a count equal to the size of the subsequence):

import pandas as pd

T1 = pd.DataFrame(data={'val': ['B','D','E','A','D','B','A','E','A','D','B']})
# use the index to create a new column that's going to be the key (zero based)
T1 = T1.reset_index()

# do the same for the subsequence that we want to find within T1
P = pd.DataFrame(data={'val': ['E','A','D','B']})
P = P.reset_index()

# join on the val column
J = T1.merge(P, on=['val'], how='inner')

# group by the difference in key columns, calculating the min, max and count of the T1 key
FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg(['min', 'max', 'count'])

# the final result is where the count is the size of the subsequence - in this case 4
FullResult[FullResult['count'] == 4]

Really enjoying using pandas!