I have three Pandas data frames, each indexed by patient id and holding time series values for different patients, ordered by time of measurement. All of these patients have the same number of measurements (there are two values per patient). I want to create a new data frame which just concatenates these data frames. Catch: not all patients are represented in all data frames. The final data frame should contain only the patients represented in ALL three data frames. An example of the data frames (please note there are three in total):
A
id  value1
1   80
1   78
2   76
2   79
B
id  value2
2   65
2   67
3   74
3   65
# to reproduce the data frames
import pandas as pd

df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]})  # frame B
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]})  # frame A
What I'm trying to create:
id  value1  value2
2   76      65
2   79      67
I tried:
data = pd.merge(A, B, on="stay_id")
But the result is:
id  value1  value2
2   76      65
2   76      67
2   79      65
2   79      67
So each value gets paired with every value of the same id (a Cartesian product). I also tried:
complete = A.copy()
complete["B"] = B["value2"]
Does this ensure the values are matched by id?
If I understand correctly: first make the dataframes share the same column names using pandas.DataFrame.set_axis, then concatenate them with pandas.concat. Finally, use a boolean mask to keep only the rows whose id appears in all the dataframes.
Considering there is a third dataframe (called dfC), you can try the code below:
id  value3
2   72
2   83
4   78
4   76
list_df = [dfA, dfB, dfC]
# give all frames the same column names, then stack them vertically
out = pd.concat([df.set_axis(['id', 'value'], axis=1) for df in list_df], ignore_index=True)
# keep only the ids present in every frame
out = out[out.id.isin(list(set.intersection(*(set(df["id"]) for df in list_df))))]
>>> print(out)
id value
2 2 76
3 2 79
4 2 65
5 2 67
8 2 72
9 2 83
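If you then want one column per source frame (value1, value2, value3), as in the expected output, here is a minimal sketch under the assumption that each shared id has the same number of measurements, in the same order, in every frame:

keep = set.intersection(*(set(df["id"]) for df in list_df))
wide = pd.concat(
    [
        df.set_axis(["id", f"value{i}"], axis=1)  # one value column per frame
          .loc[lambda d: d["id"].isin(keep)]      # shared ids only
          .reset_index(drop=True)
        for i, df in enumerate(list_df, start=1)
    ],
    axis=1,
)
wide = wide.loc[:, ~wide.columns.duplicated()]    # drop the repeated id columns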
After hours of trying I finally found a way using some of @Lucas M. Uriarte's logic, thanks a lot for that!
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]}).set_index("stay_id")
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]}).set_index("stay_id")
# sets of the patient ids present in each frame (the id is the index)
df1_patients = set(df1.index.values)
df2_patients = set(df2.index.values)
# keep only the ids present in both frames
patients = set.intersection(df1_patients, df2_patients)
patients = list(patients)
reduced_df1 = df1.loc[patients]
reduced_df2 = df2.loc[patients]
reduced_df1.sort_index(inplace=True)
reduced_df2.sort_index(inplace=True)
data = reduced_df1.copy()
# assign positionally; label-based alignment would fail on the duplicated index
data["value2"] = reduced_df2["value"].to_numpy()
As far as I can see, this ensures keeping only the entries that are in both data frames and matches the values row by row in this scenario.
I modified the answer according to the comments exchanged in the question: you need a unique identifier to merge on. Since you have the number of each measurement, that identifier is "number_measurement" together with "stay_id". Consider, for example, the following modification of the dataframes:
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
output = pd.merge(df1, df2, on=["stay_id", "number_measurement"])
print(output)
Output:
stay_id value_x number_measurement value_y
0 2 65 measurement_1 76
1 2 67 measurement_2 79
Now just drop the column number_measurement (note that drop returns a new DataFrame, leaving output unchanged unless you reassign it):
output.drop("number_measurement", axis=1)
stay_id value_x value_y
0 2 65 76
1 2 67 79
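If the frames don't already carry such a measurement label, a minimal sketch of deriving one automatically with groupby().cumcount(), assuming the original df1 and df2 (without the number_measurement column) and rows ordered by measurement time within each stay_id:

df1["n"] = df1.groupby("stay_id").cumcount()  # 0, 1, ... per stay_id
df2["n"] = df2.groupby("stay_id").cumcount()
output = pd.merge(df1, df2, on=["stay_id", "n"]).drop(columns="n")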
Related
I have a dataframe df_in, which contains column names that start with pi and pm.
df_in = pd.DataFrame([[1,2,3,4,"",6,7,8,9],["",1,32,43,59,65,"",83,97],["",51,62,47,58,64,74,86,99],[73,51,42,67,54,65,"",85,92]], columns=["piabc","pmed","pmrde","pmret","pirtc","pmere","piuyt","pmfgf","pmthg"])
If a row's value in a column whose name starts with pi is blank, make the same row blank in the following columns whose names start with pm, until the next column that starts with pi. Repeat the same process for the other pi columns.
Expected Output:
df_out = pd.DataFrame([[1,2,3,4,"","",7,8,9],["","","","",59,65,"","",""],["","","","",58,64,74,86,99],[73,51,42,67,54,65,"","",""]], columns=["piabc","pmed","pmrde","pmret","pirtc","pmere","piuyt","pmfgf","pmthg"])
How to do it?
You can create groups by comparing the column names with str.startswith and taking the cumulative sum. Then, within each group, test the first (pi) column's values for empty strings in a groupby to build a mask, and use that mask to set empty strings with DataFrame.mask:
g = df_in.columns.str.startswith('pi').cumsum()
df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform(lambda x: x.iat[0]), '')
# transform('first') failed for me in pandas 1.2.3
# df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform('first'), '')
print(df)
  piabc pmed pmrde pmret pirtc pmere piuyt pmfgf pmthg
0     1    2     3     4                 7     8     9
1                           59    65
2                           58    64    74    86    99
3    73   51    42    67    54    65
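Note that groupby(..., axis=1) is deprecated and was removed in pandas 2.x; a sketch of an equivalent under the same assumptions, transposing so the grouping runs over rows instead:

g = df_in.columns.str.startswith('pi').cumsum()
# transpose, broadcast each group's first row (the pi column), transpose back
mask = df_in.eq('').T.groupby(g).transform(lambda x: x.iloc[0]).T
df = df_in.mask(mask, '')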
I have a dataframe which contains three columns, like this:
id no. name
1A 32 ABC
4D 34 CFD
3B 32 DGA
and I want to shift the third column into the consecutive next row, like this:
1A 32
ABC
4D 34
CFD
3B 32
DGA
How is this possible in Python?
I tried creating two dataframes, one containing "id" and "no." and the other containing "name", and then merged them. But I did not like the output; it's not clean.
You can split out the name series into a new dataframe and manipulate the indices of the two resulting dataframes. Finally, use concat to combine them.
# split out dataframe of names
df_name = df.pop('name').to_frame('id')
df_name['no.'] = ''
# manipulate indices so they are non-overlapping
df_name.index = df_name.index * 2 + 1
df.index = df.index * 2
# concatenate two dataframes
res = pd.concat([df, df_name]).sort_index()
Result:
print(res)
id no.
0 1A 32
1 ABC
2 4D 34
3 CFD
4 3B 32
5 DGA
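A more literal, loop-based sketch of the same idea, starting from the original df (before the pop above):

rows = []
for _, r in df.iterrows():
    rows.append({"id": r["id"], "no.": r["no."]})  # keep the original row
    rows.append({"id": r["name"], "no.": ""})      # move name to its own row
res = pd.DataFrame(rows)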
I have a huge data set in a pandas data frame. It looks something like this:
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column so that the values sit one below another. The dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: My original dataframe has many columns, so I cannot use a simple concat function to stack the columns. I also tried using the stack function, apart from the concat function. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex and stack:
df = pd.DataFrame(
[[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
columns=['c1','c1','c2','c2'])
df1 = df.copy()
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
Or with a bit of creativity, in one line
df.T.set_index(
df.T.groupby([df.columns]).cumcount(),
append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the dataframe, then count entries within each column to use as index for the new dataframe and then unstack it back like this:
import pandas as pd
df = pd.DataFrame(
[[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
columns=['c1','c1','c2','c2'])
df1 = (pd.melt(df,var_name='column')
.assign(n = lambda x: x.groupby('column').cumcount())
.set_index(['n','column'])
.unstack())
df1.columns=df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1
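For comparison, a shorter sketch that reads the duplicate-named columns positionally, assuming every column name repeats the same number of times:

out = pd.DataFrame({
    # column-major ravel stacks the repeated columns one below another
    name: df.loc[:, df.columns == name].to_numpy().ravel(order="F")
    for name in df.columns.unique()
})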
I have a "1000 rows * 4 columns" DataFrame:
a b c d
1 aa 93 4
2 bb 32 3
...
1000 nn 78 2
[1283 rows x 4 columns]
and I use groupby to group them based on 3 of the columns:
df = df.groupby(['a','b','c']).sum()
print(df)
a b c d
1 aa 93 12
2 bb 32 53
...
1000 nn 78 38
[1283 rows x 1 columns]
However, the result gives me a "1000 rows * 1 column" DataFrame. So my question is: does groupby concatenate the columns into one column? If yes, how can I prevent that? I want to plot my data after grouping it, but I can't, since it only sees one column instead of all 4.
Edit: when I call the columns I only get the last column; it means it can't read 'a', 'b', 'c' as columns. Why is that, and how can I mark them as columns again?
df.columns
Index([u'd'], dtype='object')
You can do it this way:
df.groupby(['a','b','c'], as_index=False).sum()
or:
df.groupby(['a','b','c']).sum().reset_index()
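Either way, 'a', 'b' and 'c' stay regular columns instead of index levels, so all four columns remain available for plotting. A minimal usage sketch (column names as in the question):

grouped = df.groupby(['a', 'b', 'c'], as_index=False).sum()
print(grouped.columns)       # Index(['a', 'b', 'c', 'd'], dtype='object')
grouped.plot(x='a', y='d')   # e.g. plot the summed d against a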
I have two Data Frames with identical column names and identical IDs in the first column. With the exception of the ID column, every cell that contains a value in one DataFrame contains NaN in the other.
Here's an example of what they look like:
ID Cat1 Cat2 Cat3
1 NaN 75 NaN
2 61 NaN 84
3 NaN NaN NaN
ID Cat1 Cat2 Cat3
1 54 NaN 44
2 NaN 38 NaN
3 49 50 53
I want to merge them into one DataFrame while keeping the same Column Names. So the result would look like this:
ID Cat1 Cat2 Cat3
1 54 75 44
2 61 38 84
3 49 50 53
I tried:
df3 = pd.merge(df1, df2, on='ID', how='outer')
Which gave me a DataFrame containing twice as many columns. How can I merge the values from each DataFrame into one?
You probably want df.update. See the documentation. Note that update modifies df1 in place:
df1.update(df2, errors="raise")  # older pandas used raise_conflict=True instead of errors
In this case, the combine_first function is appropriate. (http://pandas.pydata.org/pandas-docs/version/0.13.1/merging.html)
As the name implies, combine_first takes the first DataFrame and adds to it with values from the second wherever it finds a NaN value in the first.
So:
df3 = df1.combine_first(df2)
produces a new DataFrame, df3, that is essentially just df1 with values from df2 filled in whenever possible.
You could also just replace the NaN values in df1 with the corresponding non-NaN values from df2:
df1[pd.isnull(df1)] = df2[~pd.isnull(df2)]
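For reference, a self-contained sketch reproducing the example frames and checking the combine_first result (frame and column names as in the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "Cat1": [np.nan, 61, np.nan],
                    "Cat2": [75, np.nan, np.nan],
                    "Cat3": [np.nan, 84, np.nan]})
df2 = pd.DataFrame({"ID": [1, 2, 3],
                    "Cat1": [54, np.nan, 49],
                    "Cat2": [np.nan, 38, 50],
                    "Cat3": [44, np.nan, 53]})

print(df1.combine_first(df2))
#    ID  Cat1  Cat2  Cat3
# 0   1  54.0  75.0  44.0
# 1   2  61.0  38.0  84.0
# 2   3  49.0  50.0  53.0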