If I have a DataFrame like this:
  type  value  group
     a     10    one
     b     45    one
     a    224    two
     b    119    two
     a     33  three
     b     44  three
how do I make it into this:
  type  one  two  three
     a   10  224     33
     b   45  119     44
I thought it'd be pivot_table, but that just gives me a re-grouped list.
I think you need pivot with rename_axis (new in pandas 0.18.0) and reset_index:
print(df.pivot(index='type', columns='group', values='value')
        .rename_axis(None, axis=1)
        .reset_index())
  type  one  three  two
0    a   10     33  224
1    b   45     44  119
If ordering of columns is important:
df = df.pivot(index='type', columns='group', values='value').rename_axis(None, axis=1)
print(df[['one','two','three']].reset_index())
  type  one  two  three
0    a   10  224     33
1    b   45  119     44
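For reference, here is the whole chain as a self-contained script (reconstructing the sample frame from the question):

```python
import pandas as pd

df = pd.DataFrame({'type': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [10, 45, 224, 119, 33, 44],
                   'group': ['one', 'one', 'two', 'two', 'three', 'three']})

# pivot, drop the columns-axis name, restore 'type' as a column,
# then put the columns in the desired order
out = (df.pivot(index='type', columns='group', values='value')
         .rename_axis(None, axis=1)
         .reset_index()[['type', 'one', 'two', 'three']])
print(out)
```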
EDIT:
With your real data you can get an error:
print(df.pivot(index='type', columns='group', values='value')
        .rename_axis(None, axis=1)
        .reset_index())
ValueError: Index contains duplicate entries, cannot reshape
print(df)
  type  value  group
0    a     10    one
1    a     20    one
2    b     45    one
3    a    224    two
4    b    119    two
5    a     33  three
6    b     44  three
The problem is in the second row: for index value a and column one there are two values, 10 and 20. pivot_table aggregates the data in this case. The default aggregation function is np.mean, but you can change it with the aggfunc parameter:
print(df.pivot_table(index='type', columns='group', values='value', aggfunc=np.mean)
        .rename_axis(None, axis=1)
        .reset_index())
  type  one  three  two
0    a   15     33  224
1    b   45     44  119
print(df.pivot_table(index='type', columns='group', values='value', aggfunc='first')
        .rename_axis(None, axis=1)
        .reset_index())
  type  one  three  two
0    a   10     33  224
1    b   45     44  119
print(df.pivot_table(index='type', columns='group', values='value', aggfunc=sum)
        .rename_axis(None, axis=1)
        .reset_index())
  type  one  three  two
0    a   30     33  224
1    b   45     44  119
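One more sketch, beyond what the answer shows: aggfunc=list keeps every duplicated value in its cell, which makes it easy to see which index/column pairs are duplicated before choosing an aggregation:

```python
import pandas as pd

# sample data with a duplicate ('a', 'one') pair, as in the edit above
df = pd.DataFrame({'type': ['a', 'a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [10, 20, 45, 224, 119, 33, 44],
                   'group': ['one', 'one', 'one', 'two', 'two', 'three', 'three']})

# each cell holds the list of all values for that (type, group) pair
out = df.pivot_table(index='type', columns='group', values='value', aggfunc=list)
print(out)
```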
Related
I have three Pandas data frames consisting of the id of patients being used as their index, and time series values of different patients in order of their time of measurement. All of these patients have the same number of measurements (there are two values for each patient). I want to create a third data frame which just concatenates these data frames. Catch: not all patients are represented in all data frames. The final data frame should contain only the patients represented in ALL three data frames. An example of the data frames (please note there are three in total):
A:
id  value1
 1      80
 1      78
 2      76
 2      79
B:
id  value2
 2      65
 2      67
 3      74
 3      65
# to reproduce the data frames
df1 =pd.DataFrame({"stay_id":[2,2,3,3], "value":[65,67,74,65]})
df2 =pd.DataFrame({"stay_id":[1,1,2,2],"value":[80,78,76,79]})
What I'm trying to create:
id  value1  value2
 2      76      65
 2      79      67
I tried:
data = pd.merge(A, B, on="stay_id")
But the result is:
id  value1  value2
 2      76      65
 2      76      67
 2      79      65
 2      79      67
So the first value gets repeated along the axis. I also tried:
complete = A.copy()
complete["B"] = B["value2"]
But does this ensure the values are matched by id?
If I understand correctly, first make the dataframes have the same column names using pandas.DataFrame.set_axis, then concatenate them with pandas.concat. Finally, use a boolean mask to keep only the rows whose id appears in all the dataframes.
Considering there is a third dataframe (called dfC), you can try the code below:
id  value3
 2      72
 2      83
 4      78
 4      76
list_df = [dfA, dfB, dfC]
out = pd.concat([df.set_axis(['id', 'value'], axis=1) for df in list_df], ignore_index=True)
out = out[out.id.isin(list(set.intersection(*(set(df["id"]) for df in list_df))))]
>>> print(out)
   id  value
2   2     76
3   2     79
4   2     65
5   2     67
8   2     72
9   2     83
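Since dfA, dfB and dfC are not defined as code anywhere, here is a self-contained version of the same approach, reconstructing the three frames from the tables above:

```python
import pandas as pd

dfA = pd.DataFrame({"id": [1, 1, 2, 2], "value1": [80, 78, 76, 79]})
dfB = pd.DataFrame({"id": [2, 2, 3, 3], "value2": [65, 67, 74, 65]})
dfC = pd.DataFrame({"id": [2, 2, 4, 4], "value3": [72, 83, 78, 76]})

list_df = [dfA, dfB, dfC]

# give every frame the same column labels and stack them vertically
out = pd.concat([df.set_axis(['id', 'value'], axis=1) for df in list_df],
                ignore_index=True)

# keep only ids present in all three frames
common = set.intersection(*(set(df["id"]) for df in list_df))
out = out[out["id"].isin(common)]
print(out)
```

Only id 2 appears in all three frames, so only its six measurements survive.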
After hours of trying I finally found a way using some of @Lucas M. Uriarte's logic, thanks a lot for that!
df1 =pd.DataFrame({"stay_id":[2,2,3,3], "value":[65,67,74,65]})
df2 =pd.DataFrame({"stay_id":[1,1,2,2],"value":[80,78,76,79]})
df1_patients = set(df1.index.values)
df2_patients = set(df2.index.values)
patients = list(set.intersection(df1_patients, df2_patients))
reduced_df1 = df1.loc[patients]
reduced_df2 = df2.loc[patients]
reduced_df1.sort_index(inplace=True)
reduced_df2.sort_index(inplace=True)
data = reduced_df1.copy()
data["value2"] = reduced_df2["value2"]
As far as I can see, this ensures that only the entries present in both data frames are kept, and it matches the values row by row in this scenario.
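As written, the snippet only behaves as described if stay_id is the index and the two frames carry distinct value columns. A minimal runnable variant under those assumptions (value1/value2 are illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3],
                    "value1": [65, 67, 74, 65]}).set_index("stay_id")
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2],
                    "value2": [80, 78, 76, 79]}).set_index("stay_id")

# ids present in both frames
patients = sorted(set(df1.index) & set(df2.index))

data = df1.loc[patients].copy()
# positional assignment: both slices list each id's rows in the same order
data["value2"] = df2.loc[patients, "value2"].to_numpy()
print(data)
```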
I've modified the answer according to the comments exchanged on the question: you need to find a unique identifier to merge on. Since you have the number of measurements, that identifier is "number_measurement" together with "stay_id". Consider, for example, the following modification of the dataframes:
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
output = pd.merge(df1, df2, on=["stay_id", "number_measurement"])
print(output)
Output:
   stay_id  value_x number_measurement  value_y
0        2       65      measurement_1       76
1        2       67      measurement_2       79
now just drop the column number_measurement:
output.drop("number_measurement", axis=1)
   stay_id  value_x  value_y
0        2       65       76
1        2       67       79
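If the measurement number is not already a column, it can be derived per patient with groupby().cumcount() (a sketch; it assumes the rows of each frame are already ordered by measurement time):

```python
import pandas as pd

df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]})

# number the measurements within each stay_id: 0, 1, ...
df1["n"] = df1.groupby("stay_id").cumcount()
df2["n"] = df2.groupby("stay_id").cumcount()

# merge on the compound key, then drop the helper column
out = pd.merge(df1, df2, on=["stay_id", "n"]).drop("n", axis=1)
print(out)
```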
I have a loop which generates dataframes with 2 columns each. When I try to stack the dataframes vertically, the code adds them horizontally when I use pd.concat within the loop. The results do not merge the columns (which have the same length); instead, it adds 2 new columns for every loop iteration, creating a bunch of NaNs. How can I solve this?
df_master = pd.DataFrame()
columns = list(df_master)
data = []

for i in range(1, 3):
    # do something and return a df2 with 2 columns
    data.append(df2)

df_master = pd.concat(data, axis=1)
df_master.head()
How do I stack the 2 new columns from every iteration within one dataframe?
If you don't need to keep the column labels of original dataframes, you can try renaming the column labels of each dataframe to the same (e.g. 0 and 1) before concat, for example:
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
Demo
df1
   57  59
0   1   2
1   3   4

df2
   138  140
0   11   12
1   13   14
data = [df1, df2]
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
df_master
    0   1
0   1   2
1   3   4
2  11  12
3  13  14
I suppose the problem is that your columns have different names in each iteration, so you could easily solve it by calling df2.rename() and renaming them to the same names.
It works for me if I change axis to 0 inside the concat command.
df_master = pd.concat(data, axis=0)
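A minimal sketch of the corrected loop (the per-iteration frames here are made up, since the question does not show how df2 is produced):

```python
import pandas as pd

data = []
for i in range(1, 3):
    # stand-in for the real per-iteration result: a 2-column frame
    df2 = pd.DataFrame({"x": [i, i + 1], "y": [10 * i, 10 * i + 1]})
    data.append(df2)

# axis=0 stacks the frames vertically; ignore_index renumbers the rows
df_master = pd.concat(data, axis=0, ignore_index=True)
print(df_master)
```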
Pandas fills empty cells with NaNs in each scenario, as in the examples below.
df1 = pd.DataFrame({'col1':[11,12,13], 'col2': [21,22,23], 'col3':[31,32,33]})
df2 = pd.DataFrame({'col1':[111,112,113, 114], 'col2': [121,122,123,124]})
merge / join / concatenate data frames [df1, df2] vertically - add rows
pd.concat([df1,df2], ignore_index=True)
# output
   col1  col2  col3
0    11    21  31.0
1    12    22  32.0
2    13    23  33.0
3   111   121   NaN
4   112   122   NaN
5   113   123   NaN
6   114   124   NaN
merge / join / concatenate data frames horizontally (aligning by index)
pd.concat([df1,df2], axis=1)
# output
   col1  col2  col3  col1  col2
0  11.0  21.0  31.0   111   121
1  12.0  22.0  32.0   112   122
2  13.0  23.0  33.0   113   123
3   NaN   NaN   NaN   114   124
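If the NaN padding is unwanted, concat also accepts join='inner', which keeps only the labels shared by all inputs:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [11, 12, 13], 'col2': [21, 22, 23], 'col3': [31, 32, 33]})
df2 = pd.DataFrame({'col1': [111, 112, 113, 114], 'col2': [121, 122, 123, 124]})

# vertical concat, keeping only the columns common to both frames
out = pd.concat([df1, df2], join='inner', ignore_index=True)
print(out)
```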
I have a pandas dataframe named 'trdf' with the shape [1 row x 420 columns].
0 1 2 \
0 B0742F7GT8 Stone & Beam Modern Tripod Floor Lamp, 61"H, W... 2018-04-22
3 4 5 6 7 8 9 ... \
0 24-Apr-2018 100.00% 17.06% 0.00% 5 66.67% 8 ...
410 411 412 413 414 415 416 417 418 419
0 56 161 -8 -166.67% 0 1 0.00% 100.00% 8 Planned Replenishment
I want to slice every 20 columns from the last and append the column values as new rows. Here is my code:
for i in range(420, 20, -20):
    trdf.append(trdf.loc[:, i:i-20])
print(trdf)
However, the dataframe still has the same shape and values. Where's the error?
I believe you should first create a MultiIndex in the columns and then stack:
df.columns = [df.columns % 20, df.columns // 20]
df = df.stack().reset_index(level=0, drop=True)
Or use a numpy solution with reshape, but then all data end up as strings:
df = pd.DataFrame(df.values.reshape(21, 20))
If you want to use your solution, create a list of one-row DataFrames and concat them together:
L = []
for i in range(420, 0, -20):
    # select 20 columns at a time, from the end
    df2 = df.iloc[:, i-20:i]
    # same column labels so the pieces align
    df2.columns = range(20)
    L.append(df2)

df1 = pd.concat(L)
Also, if you need the expected output joined from the last columns to the first:
df.columns = [df.columns % 20, 20-df.columns // 20]
df = df.stack().reset_index(level=0, drop=True)
And:
df1 = pd.DataFrame(df.values.reshape(21, 20)[::-1])
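On a small scale, the MultiIndex trick looks like this (a 1 row x 6 columns frame folded into 3 rows of 2, with integer column labels as in the question):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 4, 5]])  # columns are a RangeIndex 0..5

# first level: position within a chunk; second level: chunk number
df.columns = [df.columns % 2, df.columns // 2]

# stacking the chunk level turns each chunk into its own row
out = df.stack().reset_index(level=0, drop=True)
print(out)
```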
I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column so that the values are one below another. The dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: my original dataframe has many columns, so I cannot use a simple concat to stack them. Apart from concat, I also tried the stack function. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex and stack:
df = pd.DataFrame(
[[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
columns=['c1','c1','c2','c2'])
df1 = df.copy()
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
Or with a bit of creativity, in one line
df.T.set_index(
df.T.groupby([df.columns]).cumcount(),
append=True
).unstack().T.reset_index(drop=True)
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
You could melt the dataframe, then count entries within each column to use as index for the new dataframe and then unstack it back like this:
import pandas as pd
df = pd.DataFrame(
[[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
columns=['c1','c1','c2','c2'])
df1 = (pd.melt(df,var_name='column')
.assign(n = lambda x: x.groupby('column').cumcount())
.set_index(['n','column'])
.unstack())
df1.columns=df1.columns.get_level_values(1)
print(df1)
Which produces
column   c1    c2
n
0         1     3
1        31    13
2       115  1313
3         2     4
4        14    11
5       613     1
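A third route (a sketch, not from either answer above): split the frame by the occurrence number of each duplicated label and concatenate the pieces vertically:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
                  columns=['c1', 'c1', 'c2', 'c2'])

# occurrence number of each column label: first c1 -> 0, second c1 -> 1, ...
n = df.columns.to_series().groupby(level=0).cumcount()

# one piece per occurrence, each with columns ['c1', 'c2']
parts = [df.loc[:, (n == k).to_numpy()] for k in sorted(n.unique())]
out = pd.concat(parts, ignore_index=True)
print(out)
```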
I have a 1000 rows * 4 columns DataFrame:
   a   b   c  d
   1  aa  93  4
   2  bb  32  3
   ...
1000  nn  78  2

[1283 rows x 4 columns]
and I use groupby to group them based on 3 of the columns:
df.groupby(['a','b','c']).sum()
print(df)
   a   b   c   d
   1  aa  93  12
   2  bb  32  53
   ...
1000  nn  78  38

[1283 rows x 1 columns]
However, the result gives me a 1000 rows * 1 column DataFrame. So my question is: does groupby concatenate columns into one column? If yes, how can I prevent that? I want to plot my data after grouping it, but I can't since it only sees one column instead of all 4.
Edit: when I call the columns I only get the last column; it seems 'a', 'b' and 'c' are no longer read as columns. Why is that, and how can I mark them as columns again?
df.columns
Index([u'd'], dtype='object')
you can do it this way:
df.groupby(['a','b','c'], as_index=False).sum()
or:
df.groupby(['a','b','c']).sum().reset_index()
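For reference, a small runnable demo of the as_index=False variant (the sample data is made up to mirror the frame in the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['aa', 'aa', 'bb'],
                   'c': [93, 93, 32], 'd': [4, 8, 3]})

# grouping keys stay as regular columns instead of becoming the index
out = df.groupby(['a', 'b', 'c'], as_index=False).sum()
print(out)
```

With the default as_index=True, 'a', 'b' and 'c' move into a MultiIndex, which is why df.columns afterwards shows only 'd'.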