How do you look up values in a range - Python
I have two data frames and would like to return values in a range (-1, 0, +1). One of the data frames contains the Ids I would like to look up, and the other contains Ids and values. For example, I want to look up 99, 55, and 117 in the other data frame and return 100 99 98, 56 55 54, and 118 117 116. As you can see, it gets the rows at -1 and +1 around each Id I look up. There is a better example below.
df = pd.DataFrame([[99],[55],[117]],columns = ['Id'])
df2 = pd.DataFrame([[100,1,2,4,5,6,8],
[87,1,6,20,22,23,34],
[99,1,12,13,34,45,46],
[64,1,10,14,29,32,33],
[55,1,22,13,23,33,35],
[66,1,6,7,8,9,10],
[77,1,2,3,5,6,8],
[811,1,2,5,6,8,10],
[118,1,7,8,22,44,56],
[117,1,66,44,47,87,91]],
columns = ['Id', 'Num1','Num2','Num3','Num4','Num5','Num6'])
I would like my result to look something like this:
results = pd.DataFrame([[87,1,6,20,22,23,34],
[99,1,12,13,34,45,46],
[64,1,10,14,29,32,33],
[64,1,10,14,29,32,33],
[55,1,22,13,23,33,35],
[66,1,6,7,8,9,10],
[118,1,7,8,22,44,56],
[117,1,66,44,47,87,91]],
columns = ['Id', 'Num1','Num2','Num3','Num4','Num5','Num6'])
import pandas as pd
import numpy as np

# indexes of the rows in df2 whose Id appears in df
ind = df2[df2['Id'].isin(df['Id'])].index
# build the (-1, 0, +1) neighborhood around each matched index, then flatten
aaa = np.array([[ind[i]-1, ind[i], ind[i]+1] for i in range(len(ind))]).ravel()
# keep only positions inside df2's index range
aaa = aaa[(aaa <= df2.index.values[-1]) & (aaa >= 0)]
df_test = df2.loc[aaa, :].reset_index(drop=True)
print(df_test)
Output
Id Num1 Num2 Num3 Num4 Num5 Num6
0 87 1 6 20 22 23 34
1 99 1 12 13 34 45 46
2 64 1 10 14 29 32 33
3 64 1 10 14 29 32 33
4 55 1 22 13 23 33 35
5 66 1 6 7 8 9 10
6 118 1 7 8 22 44 56
7 117 1 66 44 47 87 91
Here, ind collects the indexes of the rows in df2 whose Id appears in df.
The list comprehension builds the (index-1, index, index+1) range for each of those indexes; wrapping the result in np.array and calling ravel() concatenates the ranges into one flat array. Next, aaa is overwritten to drop any element outside df2's index range (below 0 or above the maximum index).
The final selection is done with loc.
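For reference, the same index expansion can be written without the Python-level list comprehension by using NumPy broadcasting. This is only a sketch of an equivalent approach; it assumes df, df2 and the imports above, plus a default RangeIndex on df2:

hits = df2.index[df2['Id'].isin(df['Id'])].to_numpy()
# add -1, 0, +1 to every matched index in one broadcast, then flatten
neighbors = (hits[:, None] + np.array([-1, 0, 1])).ravel()
# keep only positions inside df2's index range
neighbors = neighbors[(neighbors >= 0) & (neighbors <= df2.index[-1])]
print(df2.loc[neighbors].reset_index(drop=True))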
Update 17.12.2022
If you need duplicate rows:
df = pd.DataFrame([[99], [55], [117], [117]], columns=['Id'])
lim_ind = df2.index[-1]
def my_func(i):
    a = df2[df2['Id'].isin([i])].index.values
    a = np.array([a - 1, a, a + 1]).ravel()
    a = a[(a >= 0) & (a <= lim_ind)]
    return a
qqq = [my_func(i) for i in df['Id']]
fff = np.array([df2.loc[qqq[i]].values for i in range(len(qqq))], dtype=object)
fff = np.vstack(fff)
result = pd.DataFrame(fff, columns=df2.columns)
print(result)
Output
Id Num1 Num2 Num3 Num4 Num5 Num6
0 87 1 6 20 22 23 34
1 99 1 12 13 34 45 46
2 64 1 10 14 29 32 33
3 64 1 10 14 29 32 33
4 55 1 22 13 23 33 35
5 66 1 6 7 8 9 10
6 118 1 7 8 22 44 56
7 117 1 66 44 47 87 91
8 118 1 7 8 22 44 56
9 117 1 66 44 47 87 91
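The same duplicate-preserving lookup can also be sketched with iloc slicing and pd.concat instead of assembling the index arrays by hand (again assuming df, df2 as above and a default RangeIndex on df2, so labels and positions coincide):

rows = [df2.iloc[max(j - 1, 0):j + 2]  # slicing past the last row is safe with iloc
        for i in df['Id']
        for j in df2.index[df2['Id'] == i]]
result = pd.concat(rows, ignore_index=True)
print(result)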
Related
Pandas drop multiple in range value using isin
Given a df:

    a
0    1
1    2
2    1
3    7
4   10
5   11
6   21
7   22
8   26
9   51
10  56
11  83
12  82
13  85
14  90

I would like to drop rows if the value in column a is not within these multiple ranges: (10-15), (25-30), (50-55), (80-85). These ranges are built from lbot and ltop:

lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]

I am thinking this can be achieved via pandas isin:

df[df['a'].isin(list(zip(lbot, ltop)))]

But it returns an empty df instead. The expected output is:

    a
   10
   11
   26
   51
   83
   82
   85
You can use numpy broadcasting to create a boolean mask where, for each row, it returns True if the value is within any of the ranges, and then filter df with it:

out = df[((df[['a']].to_numpy() >= lbot) & (df[['a']].to_numpy() <= ltop)).any(axis=1)]

Output:

    a
4  10
5  11
8  26
9  51
11 83
12 82
13 85
Create the values in a flattened list comprehension with range:

df = df[df['a'].isin([z for x, y in zip(lbot, ltop) for z in range(x, y+1)])]
print (df)

    a
4  10
5  11
8  26
9  51
11 83
12 82
13 85

Or use np.concatenate for a flattened list of ranges:

df = df[df['a'].isin(np.concatenate([range(x, y+1) for x, y in zip(lbot, ltop)]))]
A method that uses between():

df[pd.concat([df['a'].between(x, y) for x, y in zip(lbot, ltop)], axis=1).any(axis=1)]

Output:

    a
4  10
5  11
8  26
9  51
11 83
12 82
13 85
If your values in the two lists are sorted, a method that doesn't require any loop would be to use pandas.cut and check that you obtain the same group when cutting on the two lists:

# group based on lower bound
id1 = pd.cut(df['a'], bins=lbot+[float('inf')], labels=range(len(lbot)), right=False) # include lower bound

# group based on upper bound
id2 = pd.cut(df['a'], bins=[0]+ltop, labels=range(len(ltop)))

# ensure groups are identical
df[id1.eq(id2)]

Output:

    a
4  10
5  11
8  26
9  51
11 83
12 82
13 85

Intermediate groups:

     a  id1  id2
0    1  NaN    0
1    2  NaN    0
2    1  NaN    0
3    7  NaN    0
4   10    0    0
5   11    0    0
6   21    0    1
7   22    0    1
8   26    1    1
9   51    2    2
10  56    2    3
11  83    3    3
12  82    3    3
13  85    3    3
14  90    3  NaN
Venn Diagram for each row in DataFrame
I have a set of data that looks like this:

    Exp #  ID  Q1  Q2  All IDs  Q1 unique  Q2 unique  Overlap  Unnamed: 8
0       1  58  32  58       58         14         40       18          18
1       2  55  38  44       55         28         34       10          10
2       4  95  69  83       95         37         51       32          32
3       5  92  68  84       92         31         47       37          37
4       6   0   0   0        0          0          0        0           0
5       7  71  52  65       71         27         40       25          25
6       8  84  69  69       84         39         39       30          30
7      10  65  35  63       65         17         45       18          18
8      11  90  72  72       90         39         39       33          33
9      14  88  84  80       88         52         48       32          32
10     17  89  56  75       89         30         49       26          26
11     19  83  56  70       83         32         46       24          24
12     20  94  72  83       93         35         46       37          37
13     21  73  57  56       73         38         37       19          19

For each Exp #, I want to make a Venn diagram with the values Q1 unique, Q2 unique, and Overlap. I have tried a couple of things; the code below has gotten me the closest:

from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import csv
import pandas as pd

val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)

exp_num = val_tab['Exp #']
cols = ['Q1 unique','Q2 unique', 'Overlap']
df = pd.DataFrame()
df['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)

exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)

for a in exp_no:
    plt.figure(figsize=(4,4))
    plt.title(a)
    for b in combined:
        v = venn2(subsets=(b), set_labels = ('Q1', 'Q2'), set_colors=('purple','skyblue'), alpha=0.7)
        v.get_label_by_id('A').set_text('Q1')
        c = venn2_circles(subsets=(b))
        plt.show()
        plt.savefig(a + 'output.png')

This generates a DataFrame:

    Exp #  combined
0       1  14,40,18
1       2  28,34,10
2       4  37,51,32
3       5  31,47,37
4       6     0,0,0
5       7  27,40,25
6       8  39,39,30
7      10  17,45,18
8      11  39,39,33
9      14  52,48,32
10     17  30,49,26
11     19  32,46,24
12     20  35,46,37
13     21  38,37,19

However, I think I run into the issue when I export the combined column into a list:

['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']

After this I get the error:

numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')

How should I proceed from here? I would like 13 separate Venn diagrams, and to export each of them into a separate .png file.
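The direct cause of the error is that combined holds strings like '14,40,18', while venn2 expects numeric subset sizes. A minimal sketch of one fix (assuming the exp_no and combined lists built above) is to parse each string into a tuple of ints and draw one figure per experiment instead of nesting the loops:

for a, b in zip(exp_no, combined):
    subsets = tuple(int(x) for x in b.split(','))  # (Q1 only, Q2 only, overlap)
    plt.figure(figsize=(4, 4))
    plt.title(a)
    venn2(subsets=subsets, set_labels=('Q1', 'Q2'), set_colors=('purple', 'skyblue'), alpha=0.7)
    venn2_circles(subsets=subsets)
    plt.savefig(f"{a}output.png")
    plt.close()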
Pandas - merging start/end time ranges with short gaps
Say I have a series of start and end times for a given event:

np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,5,30).cumsum().reshape(-1, 2), columns = ["start", "end"])

    start  end
0       2    6
1       7    8
2      12   14
3      18   20
4      24   25
5      26   28
6      29   33
7      35   36
8      39   41
9      44   45
10     48   50
11     53   54
12     58   59
13     62   63
14     65   68

I'd like to merge time ranges with a gap less than or equal to n, so for n = 1 the result would be:

fn(df, n = 1)

    start  end
0       2    8
2      12   14
3      18   20
4      24   33
7      35   36
8      39   41
9      44   45
10     48   50
11     53   54
12     58   59
13     62   63
14     65   68

I can't seem to find a way to do this with pandas without iterating and building up the result line-by-line. Is there some simpler way to do this?
You can subtract the shifted values, compare against N to make a mask, create groups by cumulative sum, and pass that to groupby to aggregate max and min:

N = 1
g = df['start'].sub(df['end'].shift())
df = df.groupby(g.gt(N).cumsum()).agg({'start':'min', 'end':'max'})
print (df)

    start  end
1       2    8
2      12   14
3      18   20
4      24   33
5      35   36
6      39   41
7      44   45
8      48   50
9      53   54
10     58   59
11     62   63
12     65   68
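Wrapped up as the fn(df, n) the question asks for, a sketch would be (it assumes the intervals are sorted by start, as in the example):

def fn(df, n=1):
    gap = df['start'].sub(df['end'].shift())  # gap between a row and the previous interval
    groups = gap.gt(n).cumsum()               # start a new group whenever the gap exceeds n
    return df.groupby(groups).agg({'start': 'min', 'end': 'max'})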
How to randomly drop rows in Pandas dataframe until there are equal number of values in a column?
I have a dataframe df with two columns, X and y. In df['y'] I have integers from 1 to 10 inclusive. However, they have different frequencies:

df['y'].value_counts()

10    6645
9     6213
8     5789
7     4643
6     2532
5     1839
4     1596
3      878
2      815
1      642

I want to cut down my dataframe so that there are equal numbers of occurrences for each label. As I want an equal number of each label, the minimum frequency is 642. So I only want to keep 642 randomly sampled rows of each class label, so that my new dataframe has 642 rows per class label. I thought stratified sampling might have helped; however, stratifying only keeps the same percentage of each label, and I want all my labels to have the same frequency. As an example of a dataframe:

df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]*6213, [8]*5789, [7]*4643, [6]*2532, [5]*1839, [4]*1596, [3]*878, [2]*815, [1]*642], [])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Use DataFrame.sample with groupby:

df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample) # Outputs 7 as an example
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))

Output:

        y
y
1  68   1
   8    1
   82   1
   17   1
   99   1
   31   1
   6    1
2  55   2
   15   2
   81   2
   22   2
   46   2
   13   2
   58   2
3  2    3
   30   3
   84   3
   61   3
   78   3
   24   3
   98   3
4  51   4
   86   4
   52   4
   10   4
   42   4
   80   4
   53   4
5  16   5
   87   5
...    ..
6  26   6
   18   6
7  56   7
   4    7
   60   7
   65   7
   85   7
   37   7
   70   7
8  93   8
   41   8
   28   8
   20   8
   33   8
   64   8
   62   8
9  73   9
   79   9
   9    9
   40   9
   29   9
   57   9
   7    9
10 96  10
   67  10
   47  10
   54  10
   97  10
   71  10
   94  10

[70 rows x 1 columns]
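Applied to the question's actual dataframe, a sketch would be (assuming the df with X and y built in the question):

min_sample = df['y'].value_counts().min()  # 642 for the example data
balanced = df.groupby('y', group_keys=False).apply(lambda s: s.sample(min_sample))

Here group_keys=False keeps the original flat index instead of adding the group label as an extra index level.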
Find maximum value in python dataframe combining several rows
I have a dataframe that looks like the following (I have sorted it according to the item column already). Items 1-10, 11-20, ... (every 10 items) are in the same category. I want to find the item in each category that has the highest score and return it. What is the most efficient way to do that?

    item  score
1      1     10
3      4      1
4      6      6
39    11      2
8     12      1
9     13      1
10    15     24
11    17      9
12    18     12
13    20      7
14    22      1
59    25      3
18    28      3
19    29      2
22    34      2
23    37      1
24    38      3
25    39      2
26    40      2
27    42      3
29    45      1
31    48      1
32    53      4
33    58      4
Assuming your dataframe is stored in df:

g = df.groupby(pd.cut(df.item, np.arange(1, df.item.max(), 10), right=False))

Get the max values from each category:

max_score_ids = g.score.agg('idxmax')

This gives you the ids of the rows that contain the max score in each category:

item
[1, 11)      1
[11, 21)    10
[21, 31)    59
[31, 41)    24
[41, 51)    27

Then get the items associated with these ids:

df.loc[max_score_ids].item

1      1
10    15
59    25
24    38
27    42
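A hypothetical alternative is to group on integer division instead of explicit bins, which also covers the items (53, 58 here) that fall past the last edge produced by np.arange. A sketch, assuming the same df:

cat = (df['item'] - 1) // 10  # items 1-10 -> 0, 11-20 -> 1, ...
best = df.loc[df.groupby(cat)['score'].idxmax(), ['item', 'score']]
print(best)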