I have two DataFrames with identical column names and identical IDs in the first column. With the exception of the ID column, every cell that contains a value in one DataFrame contains NaN in the other.
Here's an example of what they look like:
ID  Cat1  Cat2  Cat3
1    NaN    75   NaN
2     61   NaN    84
3    NaN   NaN   NaN

ID  Cat1  Cat2  Cat3
1     54   NaN    44
2    NaN    38   NaN
3     49    50    53
I want to merge them into one DataFrame while keeping the same column names. So the result would look like this:
ID  Cat1  Cat2  Cat3
1     54    75    44
2     61    38    84
3     49    50    53
I tried:
df3 = pd.merge(df1, df2, on='ID', how='outer')
Which gave me a DataFrame containing twice as many columns. How can I merge the values from each DataFrame into one?
You probably want DataFrame.update; see the documentation. Note that update modifies df1 in place:
df1.update(df2)
(Older pandas versions accepted raise_conflict=True here; in modern pandas the equivalent is errors='raise', which raises a ValueError if both frames have non-NA values in the same cell.)
In this case, the combine_first function is appropriate. (http://pandas.pydata.org/pandas-docs/version/0.13.1/merging.html)
As the name implies, combine_first takes the first DataFrame and fills it in with values from the second wherever it finds a NaN in the first.
So:
df3 = df1.combine_first(df2)
produces a new DataFrame, df3, that is essentially just df1 with values from df2 filled in whenever possible.
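For example, a minimal sketch reproducing the frames from the question (setting ID as the index so the two frames align on it is my assumption about your data):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "Cat1": [np.nan, 61, np.nan],
                    "Cat2": [75, np.nan, np.nan],
                    "Cat3": [np.nan, 84, np.nan]}).set_index("ID")
df2 = pd.DataFrame({"ID": [1, 2, 3],
                    "Cat1": [54, np.nan, 49],
                    "Cat2": [np.nan, 38, 50],
                    "Cat3": [44, np.nan, 53]}).set_index("ID")

df3 = df1.combine_first(df2)
print(df3)
#     Cat1  Cat2  Cat3
# ID
# 1   54.0  75.0  44.0
# 2   61.0  38.0  84.0
# 3   49.0  50.0  53.0

Note that the values come back as floats, since columns containing NaN cannot hold plain integers.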
You could also just replace the NaN values in df1 with the corresponding non-NaN values in df2:
df1[pd.isnull(df1)] = df2[~pd.isnull(df2)]
How do I select the row in a dataframe based on the last position for every user id?
import numpy as np
import pandas as pd

data = pd.DataFrame({'User_ID': ['122', '122', '122', '233', '233', '233', '233', '366', '366', '366'],
                     'Age': [23, 23, np.nan, 24, 24, 24, 24, 21, 21, np.nan]})
The outcome should look like this:
data_new = pd.DataFrame({'User_ID': ['122', '233', '366'], 'Age': [np.nan, 24, np.nan]})
So I just want to take the last row for every User_ID. I'm a total beginner; any ideas?
As you want to keep the NaNs, you can use groupby.tail (groupby.last would drop them):
out = data.groupby('User_ID').tail(1)
Another option is to drop_duplicates:
out = data.drop_duplicates(subset='User_ID', keep='last')
output:
User_ID Age
2 122 NaN
6 233 24.0
9 366 NaN
If you want to reset the index in the process use ignore_index=True:
out = data.drop_duplicates(subset='User_ID', keep='last', ignore_index=True)
output:
User_ID Age
0 122 NaN
1 233 24.0
2 366 NaN
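For comparison, a quick sketch (my addition) of why groupby.last is not suitable here: it skips NaNs and returns the last non-missing value per group:

data.groupby('User_ID', as_index=False).last()
#   User_ID   Age
# 0     122  23.0
# 1     233  24.0
# 2     366  21.0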
data_new = data.drop_duplicates(subset='User_ID', keep='last')
I have three Pandas data frames that use the patient id as their index and hold time series values for different patients, ordered by time of measurement. All patients have the same number of measurements (two values per patient). I want to create a data frame that concatenates these data frames. The catch: not all patients are represented in all data frames, and the final data frame should contain only the patients represented in ALL three data frames. An example of the data frames (note there are three in total):
A
id  value1
1   80
1   78
2   76
2   79
B
id  value2
2   65
2   67
3   74
3   65
# to reproduce the data frames (df1 corresponds to B, df2 to A)
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]})
What I'm trying to create:
id  value1  value2
2   76      65
2   79      67
I tried:
data = pd.merge(A, B, on="stay_id")
But the result is:
id  value1  value2
2   76      65
2   76      67
2   79      65
2   79      67
So each value for an id gets paired with every other value for that id (a cross join per id). I also tried:
complete = A.copy()
complete["B"] = B["value2"]
Does this ensure the values are matched by id?
If I understand correctly, first make the dataframes share the same column names using pandas.DataFrame.set_axis, then concatenate them with pandas.concat. Finally, use a boolean mask to keep only the rows whose id appears in all the dataframes.
Assuming there is a third dataframe (called dfC), you can try the code below:
id  value3
2   72
2   83
4   78
4   76
list_df = [dfA, dfB, dfC]

# give every dataframe the same column names, then stack them vertically
out = pd.concat([df.set_axis(['id', 'value'], axis=1) for df in list_df], ignore_index=True)

# keep only the rows whose id appears in every dataframe
out = out[out.id.isin(list(set.intersection(*(set(df["id"]) for df in list_df))))]
>>> print(out)
id value
2 2 76
3 2 79
4 2 65
5 2 67
8 2 72
9 2 83
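If you instead need the side-by-side layout from the question (value1, value2, value3 as separate columns), here is a hedged sketch of one way to get there, assuming rows are ordered by measurement within each id: number the measurements with groupby.cumcount, index on (id, measurement), and concatenate horizontally with an inner join:

frames = []
for i, df in enumerate(list_df, start=1):
    d = df.set_axis(['id', f'value{i}'], axis=1)
    d['n'] = d.groupby('id').cumcount()   # 0, 1, ... within each id
    frames.append(d.set_index(['id', 'n']))

wide = pd.concat(frames, axis=1, join='inner').reset_index(level='n', drop=True)
print(wide)
#     value1  value2  value3
# id
# 2       76      65      72
# 2       79      67      83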
After hours of trying I finally found a way using some of @Lucas M. Uriarte's logic, thanks a lot for that!
import pandas as pd

df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]}).set_index("stay_id")
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]}).set_index("stay_id")

# collect the patient ids present in each frame
df1_patients = set(df1.index.values)
df2_patients = set(df2.index.values)

# keep only the ids present in both frames
patients = list(set.intersection(df1_patients, df2_patients))

reduced_df1 = df1.loc[patients]
reduced_df2 = df2.loc[patients]
reduced_df1.sort_index(inplace=True)
reduced_df2.sort_index(inplace=True)

data = reduced_df1.copy()
# assign positionally: the stay_id labels repeat, so label alignment would fail
data["value2"] = reduced_df2["value"].to_numpy()
As far as I can see, this ensures that only the entries present in both data frames are kept, and it matches the values row by row in this scenario.
I've modified the answer according to the comments exchanged under the question: you need a unique identifier to merge on. Since you know the number of measurements per patient, you can combine "number_measurement" with "stay_id". Consider, for example, the following modification of the dataframes:
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
output = pd.merge(df1, df2, on=["stay_id", "number_measurement"])
print(output)
Output:
stay_id value_x number_measurement value_y
0 2 65 measurement_1 76
1 2 67 measurement_2 79
now just drop the column number_measurement:
output.drop("number_measurement", axis=1)
stay_id value_x value_y
0 2 65 76
1 2 67 79
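As a side note (my own sketch, not part of the original answer): if rows are ordered by measurement time within each stay_id, you can derive the identifier automatically with groupby.cumcount instead of typing the labels by hand:

df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]})

# number the measurements 0, 1, ... within each stay_id
df1["number_measurement"] = df1.groupby("stay_id").cumcount()
df2["number_measurement"] = df2.groupby("stay_id").cumcount()

output = (pd.merge(df1, df2, on=["stay_id", "number_measurement"])
            .drop("number_measurement", axis=1))
print(output)
#    stay_id  value_x  value_y
# 0        2       65       76
# 1        2       67       79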
I have a loop which generates dataframes with 2 columns each. When I try to stack the dataframes vertically, pd.concat within the loop adds them horizontally instead. The result does not merge the columns (which have the same length); instead it adds 2 new columns for every loop iteration, creating a bunch of NaNs. How do I solve this?
df_master = pd.DataFrame()
columns = list(df_master)
data = []
for i in range(1, 3):
    # ... do something and return a df2 with 2 columns ...
    data.append(df2)
df_master = pd.concat(data, axis=1)
df_master.head()
How do I combine the 2 new columns from every iteration into one dataframe?
If you don't need to keep the column labels of the original dataframes, you can try renaming the column labels of each dataframe to the same labels (e.g. 0 and 1) before concat, for example:
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
Demo
df1
57 59
0 1 2
1 3 4
df2
138 140
0 11 12
1 13 14
data = [df1, df2]
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
df_master
0 1
0 1 2
1 3 4
2 11 12
3 13 14
I suppose the problem is that your columns have different names in each iteration, so you could easily solve it by calling df2.rename() and renaming them to the same names.
It works for me if I change axis to 0 inside the concat command.
df_master = pd.concat(data, axis=0)
Pandas fills the empty cells with NaNs in each scenario, as in the examples you see below.
df1 = pd.DataFrame({'col1':[11,12,13], 'col2': [21,22,23], 'col3':[31,32,33]})
df2 = pd.DataFrame({'col1':[111,112,113, 114], 'col2': [121,122,123,124]})
merge / join / concatenate data frames [df1, df2] vertically - add rows
pd.concat([df1,df2], ignore_index=True)
# output
col1 col2 col3
0 11 21 31.0
1 12 22 32.0
2 13 23 33.0
3 111 121 NaN
4 112 122 NaN
5 113 123 NaN
6 114 124 NaN
merge / join / concatenate data frames horizontally (aligning by index)
pd.concat([df1,df2], axis=1)
# output
col1 col2 col3 col1 col2
0 11.0 21.0 31.0 111 121
1 12.0 22.0 32.0 112 122
2 13.0 23.0 33.0 113 123
3 NaN NaN NaN 114 124
Sorry for the seemingly confusing title. I was reading Excel data using Pandas. However, the original Excel data has multiple header rows and some merged cells, so in my Jupyter Notebook the dataframe shows up with a two-level column header where many second-level entries read "Unnamed: ...".
My plan is to use just the 2nd level as my column names and drop level 0. But the original data has about 15 columns that show as "Unnamed...", and I wonder if I can rename those before dropping the level-0 column names. The desired output would have a single level of column names (e.g. Purchase/sell_time, Quantity, Price, Side).
I may do this repeatedly, so I didn't save the data as CSV first and then read it into Pandas. I have now spent longer than I care to admit on fixing the column names. Is there a way to do this with a function instead of renaming every individual column of interest?
Thanks.
I think the simplest solution here is a list comprehension: get the second-level value of the MultiIndex only if it contains no 'Unnamed' text, otherwise fall back to the first level:
df.columns = [first if 'Unnamed' in second else second for first, second in df.columns]
print (df)
Purchase/sell_time Quantity Price Side
0 2020-04-09 15:22:00 20 43 B
1 2020-04-09 16:22:00 30 56 S
But with more levels in real data it is possible that some columns end up duplicated, and then you cannot select a single one of them (selecting by a duplicated column name, e.g. df['dup_column_name'], returns all matching columns, not only one).
You can test it:
print (df.columns[df.columns.duplicated(keep=False)])
In that case I suggest joining all the non-Unnamed levels to prevent it:
df.columns = ['_'.join(y for y in x if 'Unnamed' not in y) for x in df.columns]
print (df)
Purchase/sell_time Purchase/sell_time_Quantity Purchase/sell_time_Price \
0 2020-04-09 15:22:00 20 43
1 2020-04-09 16:22:00 30 56
Side
0 B
1 S
Your columns are a MultiIndex, and indexes are immutable, meaning you can't change only part of them. This is why I suggest retrieving both levels of the MultiIndex, then creating an array with your desired columns and replacing the DataFrame columns with it, as follows:
# First I reproduce your dataframe
import numpy as np
import pandas as pd

df1 = pd.DataFrame({("Purchase/sell_time", "Unnamed:"): pd.date_range("2020-04-09 15:22:00",
                                                                      freq="H", periods=2),
                    ("Purchase/sell_time", "Quantity"): [20, 30],
                    ("Purchase/sell_time", "Price"): [43, 56],
                    ("Side", "Unnamed:"): ["B", "S"]})
df1 = df1.sort_index()
It looks like this:
Purchase/sell_time Side
Unnamed: Quantity Price Unnamed:
0 2020-04-09 15:22:00 20 43 B
1 2020-04-09 16:22:00 30 56 S
The column is a multiindex as you can see:
MultiIndex([('Purchase/sell_time', 'Unnamed:'),
('Purchase/sell_time', 'Quantity'),
('Purchase/sell_time', 'Price'),
( 'Side', 'Unnamed:')],
)
# I retrieve the first and second level of the multiindex then create an array conditionally
# on the second level not starting with "Unnamed"
first_header = df1.columns.get_level_values(0)
second_header = df1.columns.get_level_values(1)
merge_header = np.where(second_header.str.startswith("Unnamed:"),
first_header, second_header)
df1.columns = merge_header
Here is the result:
Purchase/sell_time Quantity Price Side
0 2020-04-09 15:22:00 20 43 B
1 2020-04-09 16:22:00 30 56 S
Hope it helps
For example, the dataframe looks like:
DF = pd.DataFrame([[1, 1], [2, 120], [3, 25], [4, np.nan], [5, 45]], columns=["ID", "Age"])
In the Age column, the values below 5 and greater than 100 have to be converted to NaN.
Any help is appreciated!
Using where and between:
DF.Age = DF.Age.where(DF.Age.between(5, 100))
DF
   ID   Age
0   1   NaN
1   2   NaN
2   3  25.0
3   4   NaN
4   5  45.0
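An equivalent sketch using boolean indexing (note that between is inclusive on both ends, and this assumes numpy is imported as np):

DF.loc[~DF.Age.between(5, 100), "Age"] = np.nan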