Combine column counts from different dataframes (pandas / Python)

I have two dataframes: one has demographic information about patients and the other has some feature information. Below is some dummy data representing my dataset:
Demographics:
import pandas as pd

demographics = {
    'PatientID': [10, 11, 12, 13],
    'DOB': ['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31'],
    'Sex': ['M', 'M', 'F', 'M'],
    'Flag': [0, 1, 0, 0]
}
demographics = pd.DataFrame(demographics)
demographics['DOB'] = pd.to_datetime(demographics['DOB'])
Here is the printed dataframe:
print(demographics)
   PatientID        DOB Sex  Flag
0         10 1971-10-23   M     0
1         11 1969-06-18   M     1
2         12 1973-04-20   F     0
3         13 1971-05-31   M     0
Features:
features = {
    'PatientID': [10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
    'Feature': ['A', 'B', 'A', 'A', 'C', 'B', 'C', 'A', 'B', 'B', 'A', 'C', 'D', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'B', 'C', 'C', 'C', 'B', 'B', 'C'],
}
features = pd.DataFrame(features)
Here is a count of each feature for each patient:
print(features.groupby(['PatientID', 'Feature']).size())
PatientID  Feature
10         A          3
           B          2
           C          2
11         A          3
           B          3
           C          3
           D          1
12         B          3
           C          4
           D          3
dtype: int64
I want to integrate each patient's feature counts into the demographics table. Note that patient 13 is absent from the features table. The final dataframe should look as shown below:
result = {
    'PatientID': [10, 11, 12, 13],
    'DOB': ['1971-10-23', '1969-06-18', '1973-04-20', '1971-05-31'],
    'Feature_A': [3, 3, 0, 0],
    'Feature_B': [2, 3, 3, 0],
    'Feature_C': [2, 3, 4, 0],
    'Feature_D': [0, 1, 3, 0],
    'Sex': ['M', 'M', 'F', 'M'],
    'Flag': [0, 1, 0, 0],
}
result = pd.DataFrame(result)
result['DOB'] = pd.to_datetime(result['DOB'])
print(result)
   PatientID        DOB  Feature_A  Feature_B  Feature_C  Feature_D Sex  Flag
0         10 1971-10-23          3          2          2          0   M     0
1         11 1969-06-18          3          3          3          1   M     1
2         12 1973-04-20          0          3          4          3   F     0
3         13 1971-05-31          0          0          0          0   M     0
How can I get this result from these two dataframes?

Cross-tabulate features and merge with demographics.
# cross-tabulate the feature df
# and reindex it by PatientID to carry PatientIDs without features
feature_counts = (
    pd.crosstab(features['PatientID'], features['Feature'])
      .add_prefix('Feature_')
      .reindex(demographics['PatientID'], fill_value=0)
)
# merge the two
demographics.merge(feature_counts, on='PatientID')
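With the sample data this should print the target values, with the feature counts appended after the demographic columns:
   PatientID        DOB Sex  Flag  Feature_A  Feature_B  Feature_C  Feature_D
0         10 1971-10-23   M     0          3          2          2          0
1         11 1969-06-18   M     1          3          3          3          1
2         12 1973-04-20   F     0          0          3          4          3
3         13 1971-05-31   M     0          0          0          0          0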

Fix your code by adding unstack:
out = (features.groupby(['PatientID', 'Feature']).size()
               .unstack(fill_value=0)
               .add_prefix('Feature_')
               .reindex(demographics['PatientID'], fill_value=0)
               .reset_index()
               .merge(demographics))
Out[30]:
   PatientID  Feature_A  Feature_B  Feature_C  Feature_D        DOB Sex  Flag
0         10          3          2          2          0 1971-10-23   M     0
1         11          3          3          3          1 1969-06-18   M     1
2         12          0          3          4          3 1973-04-20   F     0
3         13          0          0          0          0 1971-05-31   M     0
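If you prefer to keep demographics on the left, an equivalent left-merge sketch (note the counts come back as floats for patients missing from features, hence the fill-and-cast at the end):
counts = (features.groupby(['PatientID', 'Feature']).size()
                  .unstack(fill_value=0)
                  .add_prefix('Feature_'))
out = demographics.merge(counts, on='PatientID', how='left')
# patient 13 has no features, so fill the resulting NaNs and restore int dtype
out[counts.columns] = out[counts.columns].fillna(0).astype(int)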

Related

How can I group a Pandas DataFrame by Local Minima?

Originally, I had a Pandas DataFrame consisting of two columns, A (x-axis values) and B (y-axis values), which are plotted to form a simple x-y graph. The data consisted of a few peaks that all occurred at the same y-axis value with the same increments, so I was able to do the following:
import pandas as pd

df = pd.read_csv(r'/Users/_______/Desktop/Data Packets/Cycle Data.csv')
nrows = int(df['B'].max() * 2) - 1
alphabet: list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
groups = df.groupby(df.index // nrows)
for frameno, frame in groups:
    frame.to_csv("/Users/_______/Desktop/Cycle Test/" + alphabet[frameno] + "%s.csv" % frameno, index=False)
The above code parses the large cycle data file into many files of the same size, since the local minima and maxima of every cycle are the same.
However, I now want to parse a data file with arbitrary peaks and minima. I can't split the large file into equal-sized chunks, because each cycle is a different size.
Edit: sample data (A is x-axis, B is y-axis):
data = {'A': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], 'B': [0, 1, 2, 3, 4, 5, 6, 7, 5, 3, 1, -1, 1, 3, 5, 7, 9, 8, 7, 6, 5, 4, 6, 8, 6, 4, 2]}
df = pd.DataFrame(data)
Edit 2: different sample data (Displacement goes from 1 to 50 and back to 1, then 1 to 60 and back to 1, and so on):
         Load  Displacement
0    0.100000           1.0
1    0.101000           2.0
2    0.102000           3.0
3    0.103000           4.0
4    0.104000           5.0
..        ...           ...
391  0.000006           5.0
392  0.000005           4.0
393  0.000004           3.0
394  0.000003           2.0
395  0.000002           1.0
Build a boolean mask of the local minima and use its cumulative sum as the group key:
col = df['B']  # replace with the appropriate column name
# find local minima; use the rightmost value if a minimum repeats
minima = (col <= col.shift()) & (col < col.shift(-1))
# every minimum starts a new group
groups = minima.cumsum()
# group
df.groupby(groups).whatever()  # replace with the appropriate aggregation
For example, counting values:
df.groupby(groups).count()
Out[10]:
    A   B
B
0  11  11
1  10  10
2   6   6
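To mirror the original per-file export rather than aggregate, the same group key can drive the to_csv loop (a sketch; the output filename is a placeholder):
for frameno, frame in df.groupby(groups):
    frame.to_csv("cycle_%s.csv" % frameno, index=False)  # placeholder path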
We can try scipy's argrelextrema:
import numpy as np
from scipy.signal import argrelextrema

# indices of strict local minima; 'B' is the y-axis column here
idx = argrelextrema(df['B'].values, np.less)
g = df.groupby(df.index.isin(df.index[idx[0]]).cumsum())
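As a quick check against the sample data above, the groups come out with the same sizes as in the previous answer:
g.size()
# 0    11
# 1    10
# 2     6
# dtype: int64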

Filter Model Prediction DataFrame

I have two dataframes:
lang_df = pd.DataFrame(data={'Content ID': [1, 1, 2, 2, 3, 3],
                             'User ID': [10, 11, 10, 11, 10, 11],
                             'Language': ['A', 'A', 'B', 'B', 'C', 'C']})
pred_df = pd.DataFrame(data={'Content ID': [4, 7, 14, 6, 6, 6],
                             'User ID': [10, 11, 10, 11, 10, 11],
                             'Language': ['A', 'D', 'Z', 'B', 'B', 'A']})
I want to filter the rows in the second dataframe so that users only get content IDs in languages they have previously watched. The result for this example would look like:
result_df = pd.DataFrame(data={'Content ID': [4, 6, 6, 6],
                               'User ID': [10, 11, 10, 11],
                               'Language': ['A', 'B', 'B', 'A']})
I know how to do this with a for loop, but that seems highly inefficient.
You need an inner join between lang_df and pred_df on the columns User ID and Language.
lang_df[['User ID', 'Language']].merge(pred_df, on=['User ID', 'Language'])
Output:
   User ID Language  Content ID
0       10        A           4
1       11        A           6
2       10        B           6
3       11        B           6
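One caveat (an assumption beyond this sample data): if lang_df can contain duplicate (User ID, Language) pairs, the join would duplicate prediction rows, so deduplicate the keys first:
(lang_df[['User ID', 'Language']]
    .drop_duplicates()
    .merge(pred_df, on=['User ID', 'Language']))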
You can use an inner join with merge to accomplish this, then use column filtering to return only the columns from the pred_df dataframe:
pred_df.merge(lang_df, on=['User ID','Language'], suffixes=('','_2'))[pred_df.columns]
Output:
   Content ID  User ID Language
0           4       10        A
1           6       11        B
2           6       10        B
3           6       11        A

Merging two pandas dataframes by interval

I have two pandas dataframes in the following format:
df_ts = pd.DataFrame([
    [10, 20, 1, 'id1'],
    [11, 22, 5, 'id1'],
    [20, 54, 5, 'id2'],
    [22, 53, 7, 'id2'],
    [15, 24, 8, 'id1'],
    [16, 25, 10, 'id1']
], columns=['x', 'y', 'ts', 'id'])
df_statechange = pd.DataFrame([
    ['id1', 2, 'ok'],
    ['id2', 4, 'not ok'],
    ['id1', 9, 'not ok']
], columns=['id', 'ts', 'state'])
I am trying to get it into a format such as:
df_out = pd.DataFrame([
    [10, 20, 1, 'id1', None    ],
    [11, 22, 5, 'id1', 'ok'    ],
    [20, 54, 5, 'id2', 'not ok'],
    [22, 53, 7, 'id2', 'not ok'],
    [15, 24, 8, 'id1', 'ok'    ],
    [16, 25, 10, 'id1', 'not ok']
], columns=['x', 'y', 'ts', 'id', 'state'])
I understand how to accomplish this iteratively by grouping by id and then walking through each row, changing the status when it appears. Is there a built-in, more scalable pandas way of doing this?
Unfortunately pandas merge supports only equality joins. See more details at the following thread:
merge pandas dataframes where one value is between two others
If you want to merge by interval you'll need to work around that, for example by adding another filter after the merge:
joined = a.merge(b, on='id')
joined = joined[joined.ts.between(joined.ts1, joined.ts2)]
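That said, newer pandas versions include an ordered variant, pd.merge_asof, which handles exactly this "most recent state change per id" case; a sketch for this question's frames (both must be sorted by ts):
df_out = pd.merge_asof(
    df_ts.sort_values('ts'),
    df_statechange.sort_values('ts'),
    on='ts', by='id',        # match rows within each id
    direction='backward'     # take the last state change at or before each ts
)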
You can merge pandas data frames on two columns:
pd.merge(df_ts, df_statechange, how='left', on=['id', 'ts'])
In the df_statechange you shared there are no ts values common to both dataframes, so apparently you posted an incomplete frame. With the data as given I get this output:
    x   y  ts   id state
0  10  20   1  id1   NaN
1  11  22   5  id1   NaN
2  20  54   5  id2   NaN
3  22  53   7  id2   NaN
4  15  24   8  id1   NaN
5  16  25  10  id1   NaN
But if the dataframes do share ts values, it produces your desired output. For example, with:
df_statechange = pd.DataFrame([
    ['id1', 5, 'ok'],
    ['id1', 8, 'ok'],
    ['id2', 5, 'not ok'],
    ['id2', 7, 'not ok'],
    ['id1', 9, 'not ok']
], columns=['id', 'ts', 'state'])
the output is:
    x   y  ts   id   state
0  10  20   1  id1     NaN
1  11  22   5  id1      ok
2  20  54   5  id2  not ok
3  22  53   7  id2  not ok
4  15  24   8  id1      ok
5  16  25  10  id1     NaN

How to create structured arrays with list and array in Python?

I have a list A of the form:
A = ['P', 'Q', 'R', 'S', 'T', 'U']
and an array B of the form:
B = [[ 1  2  3  4  5  6]
     [ 7  8  9 10 11 12]
     [13 14 15 16 17 18]
     [19 20 21 22 23 24]]
now I would like to create a structured array C of the form:
C = [[ P  Q  R  S  T  U]
     [ 1  2  3  4  5  6]
     [ 7  8  9 10 11 12]
     [13 14 15 16 17 18]
     [19 20 21 22 23 24]]
so that I can extract columns with column names P, Q, R, etc. I tried the following code but it does not create a structured array and gives the following error.
Code
import numpy as np
A = (['P', 'Q', 'R', 'S', 'T', 'U'])
B = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]])
C = np.vstack((A, B))
print (C)
D = C['P']
Error
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
How to create structured array in Python in this case?
Update
Both are variables; their shapes change at runtime, but the list and the array will always have the same number of columns.
If you want to do it in pure numpy you can do:
A = np.array(['P', 'Q', 'R', 'S', 'T', 'U'])
B = np.array([[ 1,  2,  3,  4,  5,  6],
              [ 7,  8,  9, 10, 11, 12],
              [13, 14, 15, 16, 17, 18],
              [19, 20, 21, 22, 23, 24]])
# define the structured array with the names from A
C = np.zeros(B.shape[0], dtype={'names': A, 'formats': ['f8', 'f8', 'f8', 'f8', 'f8', 'f8']})
# copy the data from B into C, column by column
for i, n in enumerate(A):
    C[n] = B[:, i]

C['Q']
array([ 2.,  8., 14., 20.])
Edit: you can automate the format list by using instead
C = np.zeros(B.shape[0], dtype={'names': A, 'formats': ['f8' for x in range(A.shape[0])]})
Furthermore, the names do not appear in C as data but in dtype. In order to get the names from C you can use
C.dtype.names
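which for this example returns the field names as a tuple:
C.dtype.names
# ('P', 'Q', 'R', 'S', 'T', 'U')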
This is what the pandas library is for:
>>> A = ['P', 'Q', 'R', 'S', 'T', 'U']
>>> B = np.arange(1, 25).reshape(4, 6)
>>> B
array([[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12],
[13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24]])
>>> import pandas as pd
>>> pd.DataFrame(B, columns=A)
P Q R S T U
0 1 2 3 4 5 6
1 7 8 9 10 11 12
2 13 14 15 16 17 18
3 19 20 21 22 23 24
>>> df = pd.DataFrame(B, columns=A)
>>> df['P']
0 1
1 7
2 13
3 19
Name: P, dtype: int64
>>> df['T']
0 5
1 11
2 17
3 23
Name: T, dtype: int64
http://pandas.pydata.org/pandas-docs/dev/tutorials.html
Your error occurs on:
D = C['P']
np.vstack produces a plain (unstructured) array and casts everything to strings, so it cannot be indexed by field name. Here is a simple approach that keeps the title row as a regular Python list and looks up the column position instead:
import numpy as np

A = ['P', 'Q', 'R', 'S', 'T', 'U']
B = np.array([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12],
              [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24]])
C = np.vstack((A, B))
print(C)
# find the column whose header is 'P', then take all rows of it
D = C[0:len(C), list(C[0]).index('P')]
print(D)

Iteratively build multi-index dataframe in pandas

I have n small dataframes which I combine into one multiindex dataframe.
import numpy as np
from pandas import DataFrame

d1 = DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr22'])
d2 = DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
d3 = DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['d', 10, 14],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
How to combine these into one dataframe?
Result:
   name  attr11  attr21  attr22
d1 a          5    NULL       9
   b          4    NULL      61
   c         24    NULL       9
d2 a       NULL       5      19
   b       NULL      14      16
   c       NULL       4       9
d3 a       NULL       5      19
   b       NULL      14      16
   c       NULL       4       9
   d       NULL      10      14
You can build the MultiIndex after the concatenation; you just need to add a column to each frame with the dataframe id:
frames = [d1, d2, d3]
Add a column to each frame with the frame id:
for i, frame in enumerate(frames):
    frame['frame_id'] = 'd' + str(i + 1)
Then concatenate the list of frames and set the index on the desired columns:
pd.concat(frames).set_index(['frame_id', 'name'])
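Alternatively, pd.concat can build the outer index level directly via its keys argument, which avoids mutating the input frames (a sketch, assuming the frame names d1, d2, d3):
out = (pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'])
         .set_index('name', append=True)
         .droplevel(1))  # drop the leftover integer row index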
