Python: Randomly select a subgroup in a group - python

I have a dataframe that looks like:
patient_id note_id lines
A 10 1
A 10 2
A 10 3
A 29 1
A 29 2
B 12 1
B 95 1
B 95 2
B 95 3
C......
D......
E 14 1
E 55 1
E 87 1
......
Each patient can have multiple notes and each note may contain more than one line. Say that I have 20 patients, 50 notes and 150 lines. How can I randomly select only one note for each of 3 randomly selected patients? If I wanted one random note per randomly selected patient_id, I might get:
patient_id note_id lines
A 29 1
A 29 2
B 12 1
E 55 1

I'd suggest creating a temporary dataset without the lines column, then calling .drop_duplicates() to get one row per note. From that, use .sample() to choose your random subset, and finally .merge() to rejoin the sample to the original dataset on patient_id and note_id. There may well be a quicker way, as I'm not a pandas expert.
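A minimal sketch of that approach (assuming the data is in a DataFrame named df with the columns shown in the question):
# one row per (patient_id, note_id) pair, ignoring the lines column
notes = df[['patient_id', 'note_id']].drop_duplicates()
# pick 3 random patients, then one random note for each of them
chosen_patients = notes['patient_id'].drop_duplicates().sample(n=3)
chosen_notes = (notes[notes['patient_id'].isin(chosen_patients)]
                .groupby('patient_id', group_keys=False)
                .apply(lambda g: g.sample(n=1)))
# rejoin to the original data to recover every line of the chosen notes
result = df.merge(chosen_notes, on=['patient_id', 'note_id'])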

Efficient lookup between pandas column values and a list of values

I have a list of n elements, let's say:
[5,30,60,180,240]
And a dataframe with the following characteristics
id1 id2 feat1
1 1 40
1 2 40
1 3 40
1 4 40
2 6 87
2 7 87
2 8 87
The combination of id1 + id2 is unique, but all of the records with a common id1 share the value of feat1. I would like to write a function, to be run via groupby + apply (or whatever is faster), that creates a column called 'closest_number'. For a given id1 + id2 (or just id1, as the records share feat1), 'closest_number' will be the element of the list that is closest to the feat1 value.
Desired output:
id1 id2 feat1 closest_number
1 1 40 30
1 2 40 30
1 3 40 30
1 4 40 30
2 6 87 60
2 7 87 60
2 8 87 60
If this were a standard two-array lookup problem I could do:
import numpy as np

def get_closest(array, values):
    # make sure array is a numpy array
    array = np.array(array)
    # get insert positions
    idxs = np.searchsorted(array, values, side="left")
    # find indexes where the previous index is closer
    prev_idx_is_less = ((idxs == len(array)) |
                        (np.fabs(values - array[np.maximum(idxs - 1, 0)]) <
                         np.fabs(values - array[np.minimum(idxs, len(array) - 1)])))
    idxs[prev_idx_is_less] -= 1
    return array[idxs]
And if I apply this to the columns there, I will get as output:
array([30, 60])
However, I will not get any information about which rows correspond to 30 and which to 60.
What would be the optimal way of doing this? As my list of elements is very small, I have created distance columns in my dataset and then selected the one that gives me the minimum distance.
But I assume there should be a more elegant way of doing this.
Use get_closest as follows:
# the list of candidate values
lst = [5, 30, 60, 180, 240]
# obtain the series with index id1 and values feat1
vals = df.groupby("id1")["feat1"].first().rename("closest_number")
# find the closest values and assign them back
vals[:] = get_closest(lst, vals)
# merge the series into the original DataFrame
res = df.merge(vals, right_index=True, left_on="id1", how="left")
print(res)
Output
id1 id2 feat1 closest_number
0 1 1 40 30
1 1 2 40 30
2 1 3 40 30
3 1 4 40 30
4 2 6 87 60
5 2 7 87 60
6 2 8 87 60
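Since feat1 repeats within each id1, an arguably simpler alternative (a sketch using the same df and get_closest as above) is to skip the groupby and apply the function to the whole column at once:
lst = [5, 30, 60, 180, 240]
# closest candidate for every row in one vectorised call; no merge needed
df["closest_number"] = get_closest(lst, df["feat1"].values)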

Iteratively Capture Value Counts in Single DataFrame

I have a pandas dataframe that looks something like this:
id group gender age_grp status
1 1 m over21 active
2 4 m under21 active
3 2 f over21 inactive
I have over 100 columns and thousands of rows. I am trying to create a single pandas dataframe of the value_counts of each of the columns. So I want something that looks like this:
group1
gender m 100
f 89
age over21 98
under21 11
status active 87
inactive 42
Does anyone know a simple way to iteratively concat the value_counts from each of the 100+ columns in the original dataset while capturing the names of the columns as a hierarchical index?
Eventually I want to be able to merge with another dataframe of a different group to look like this:
group1 group2
gender m 100 75
f 89 92
age over21 98 71
under21 11 22
status active 87 44
inactive 42 13
Thanks!
This should do it:
df.stack().groupby(level=1).value_counts()
id 1 1
2 1
3 1
group 1 1
2 1
4 1
gender m 2
f 1
age_grp over21 2
under21 1
status active 2
inactive 1
dtype: int64
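If you only want the categorical columns (and not id or group), one sketch is to select those columns first, name each result after its group, and concat the two groups side by side. The column list and the second DataFrame df2 are hypothetical here:
import pandas as pd

cols = ['gender', 'age_grp', 'status']  # hypothetical: the columns of interest

group1 = df[cols].stack().groupby(level=1).value_counts().rename('group1')
group2 = df2[cols].stack().groupby(level=1).value_counts().rename('group2')

# align both groups on the (column, value) hierarchical index
combined = pd.concat([group1, group2], axis=1)
print(combined)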

Pandas: sum all rows

I have a DataFrame that looks like this:
score num_participants
0 20
1 15
2 5
3 10
4 12
5 15
I need to find the number of participants whose score is greater than or equal to the score in the current row:
score num_participants num_participants_with_score_greater_or_equal
0 20 77
1 15 57
2 5 42
3 10 37
4 12 27
5 15 15
So, I am trying to sum the current row and all rows below it. The data has around 5000 rows, so I can't set it manually by indexing. cumsum doesn't do the trick and I am not sure if there is a simple way to do this. I have spent quite some time trying to solve this, so any help would be appreciated.
This is a reverse cumsum: reverse the rows, take the cumsum, then reverse back.
df.iloc[::-1].cumsum().iloc[::-1]
score num_participants
0 15 77
1 15 57
2 14 42
3 12 37
4 9 27
5 5 15
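Note that this sums the score column as well; if you only need the new column, a minimal variant is to apply the same trick to num_participants alone:
df['num_participants_with_score_greater_or_equal'] = (
    df['num_participants'].iloc[::-1].cumsum().iloc[::-1]
)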
In case score isn't already sorted, how about
df['num_participants_with_score_greater_or_equal'] = df.sort_values('score', ascending=False).num_participants.cumsum()
to make sure score is in the right order. You can restore the original order with .sort_index() afterwards.

Pandas individual item using index and column

I have a csv file, test.csv. I am trying to use pandas to select items depending on whether the second value is above a certain value. E.g.
index A B
0 44 1
1 45 2
2 46 57
3 47 598
4 48 5
So what I would like is: if B is larger than 50, then give me the values in A as integers that I could assign to a variable.
edit 1:
Sorry for the poor explanation. The final purpose of this is that I want to look in table 1:
index A B
0 44 1
1 45 2
2 46 57
3 47 598
4 48 5
for any values above 50 in column B and get the column A value and then look in table 2:
index A B
5 44 12
6 45 13
7 46 14
8 47 15
9 48 16
So in the end I want to end up with the values in column B of table 2, which I can print out as integers and not as a Series. If this is not possible using pandas then OK, but is there a way to do it in any case?
You can use dataframe slicing (boolean indexing) to get the values you want:
import pandas as pd
f = pd.read_csv('yourfile.csv')
f[f['B'] > 50].A
In this code,
f['B'] > 50
is the condition, returning a boolean array of True/False depending on whether each value meets the condition; the corresponding A values are then selected.
This would be the output:
2 46
3 47
Name: A, dtype: int64
Is this what you wanted?
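For the follow-up in the edit, a minimal sketch (assuming the two tables are loaded as df1 and df2, both with columns A and B; file names are hypothetical) could filter table 1, look the matching A values up in table 2, and pull the B values out as plain integers:
import pandas as pd

df1 = pd.read_csv('table1.csv')  # hypothetical file names
df2 = pd.read_csv('table2.csv')

# A values from table 1 where B is above 50
a_vals = df1.loc[df1['B'] > 50, 'A']

# matching B values from table 2, as plain Python ints rather than a Series
b_vals = [int(b) for b in df2.loc[df2['A'].isin(a_vals), 'B']]
print(b_vals)  # [14, 15] for the sample data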

Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present" "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd
df = pd.DataFrame({'Name1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]})
df.set_index(['Name1', 'Name2'], inplace=True)

df2 = pd.DataFrame({'Data1': [80, 61, 45, 30],
                    'Data2': [6, 8, 7, 3]},
                   index=pd.Series([1, 2, 4, 8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1, 0], axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3
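As a possible shortcut (a sketch, relying on recent pandas versions accepting an index level name in join's on argument): with df2 indexed by Name2, you may be able to skip the reset_index/set_index round trip entirely:
# join df2 (indexed by Name2) onto the Name2 level of df's MultiIndex
result = df.join(df2, on='Name2')
print(result)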
