I'm very new to using pandas with Excel data in Python.
Here's what I'm trying to achieve:
data before
what I'm trying to get
I haven't found a specific way to do this with groupby.
Please help.
You can try groupby followed by agg:
df['c'] = df['c'].astype(str)
out = df.groupby(['a', 'b'])['c'].agg(','.join).reset_index()
print(out)
a b c
0 1 2 3,4
1 1 3 5
2 2 2 3,4
3 2 3 5
4 5 6 7
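For anyone who wants to run this end to end, here is a minimal sketch. The input frame is an assumption: the original data was not posted as text, so I reconstructed one that reproduces the output printed above.

```python
import pandas as pd

# Hypothetical input, reconstructed so that it reproduces the printed output.
df = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 2, 5],
    "b": [2, 2, 3, 2, 2, 3, 6],
    "c": [3, 4, 5, 3, 4, 5, 7],
})

# str.join only accepts strings, so convert the integer column first.
df["c"] = df["c"].astype(str)

# One row per (a, b) pair; the c values of each group are joined with commas.
out = df.groupby(["a", "b"])["c"].agg(",".join).reset_index()
print(out)
```

The astype(str) step matters: without it, ','.join would fail on integer values.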
I have a pandas dataframe where there are multiple integer value columns denoting a count. I want to transform this dataframe such that the value columns are merged into one column but another column is created denoting the column the value was taken from.
Input
a b c
0 2 5 8
1 3 6 9
2 4 7 10
Output
count type
0 2 a
1 3 a
2 4 a
3 5 b
4 6 b
5 7 b
6 8 c
7 9 c
8 10 c
I'm sure this is possible by looping over the entries and creating as many rows as needed for each original row, but I'm also sure there is a pandas way to achieve this, and I would like to know what it is called.
You could do that with the following:
pd.melt(df, value_vars=['a','b','c'], value_name='count', var_name='type')
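A self-contained sketch using the input from the question. The name long_df and the final column reordering are my own additions; melt puts the variable column first, so the reordering is only there to match the layout asked for.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 3, 4], "b": [5, 6, 7], "c": [8, 9, 10]})

# Unpivot the three count columns into (type, count) pairs.
long_df = pd.melt(
    df,
    value_vars=["a", "b", "c"],
    var_name="type",
    value_name="count",
)

# melt returns the variable column first; reorder to match the desired output.
long_df = long_df[["count", "type"]]
print(long_df)
```

The operation is usually described as reshaping from wide to long format.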
I have two data frames, df1 and df2; df1 contains 6 records and df2 contains 4. I want to get the unmatched records, that is, the rows of df1 that do not appear in df2. My attempt raises ValueError: Can only compare identically-labelled DataFrame objects. I guess this is because the frames have different lengths (6 rows versus 4), but how do I compare them and get the unmatched rows?
code
df1=
a b c
0 1 2 3
1 4 5 6
2 3 5 5
3 5 6 7
4 6 7 8
5 6 6 6
df2 =
a b c
0 3 5 5
1 5 6 7
2 6 7 8
3 6 6 6
index = (df1 != df2).any(axis=1)
df3 = df1.loc[index]
which gives:
ValueError: Can only compare identically-labelled DataFrame objects
Expected output:
a b c
0 1 2 3
1 4 5 6
I know that the error is due to the different lengths, but is there a way to compare two data frames and get the unmatched records?
Use df.merge with indicator=True and keep every row whose _merge value is not "both":
In [173]: df = df1.merge(df2, indicator=True, how='outer').query('_merge != "both"').drop(columns='_merge')
In [174]: df
Out[174]:
a b c
0 1 2 3
1 4 5 6
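Here is a runnable sketch of the merge approach with the frames from the question. The names merged, unmatched and only_in_df1 are mine. The second filter is a variation, not part of the answer above: it keeps only rows that come from df1, which may be what you want if df2 could contain rows that are missing from df1.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 4, 3, 5, 6, 6],
                    "b": [2, 5, 5, 6, 7, 6],
                    "c": [3, 6, 5, 7, 8, 6]})
df2 = pd.DataFrame({"a": [3, 5, 6, 6],
                    "b": [5, 6, 7, 6],
                    "c": [5, 7, 8, 6]})

# With no "on" argument, merge joins on all shared columns (a, b, c).
# indicator=True adds a _merge column: left_only, right_only or both.
merged = df1.merge(df2, how="outer", indicator=True)

# Rows present in exactly one of the two frames.
unmatched = merged[merged["_merge"] != "both"].drop(columns="_merge")

# Variation: rows present in df1 only.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(unmatched)
```

Note that the row labels of the result come from the merged frame, not from df1, and their exact values may depend on the pandas version.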
MultiIndex.from_frame + isin
We can use MultiIndex.from_frame on both df1 and df2 to build a MultiIndex from the rows of each frame. Calling isin then tests whether each row of df1 also occurs in df2, which gives a boolean mask; negating that mask filters df1 down to the non-matching rows.
i1 = pd.MultiIndex.from_frame(df1)
i2 = pd.MultiIndex.from_frame(df2)
df1[~i1.isin(i2)]
Result
a b c
0 1 2 3
1 4 5 6
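The same idea as a self-contained sketch, using the data from the question (the names mask and result are mine):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 4, 3, 5, 6, 6],
                    "b": [2, 5, 5, 6, 7, 6],
                    "c": [3, 6, 5, 7, 8, 6]})
df2 = pd.DataFrame({"a": [3, 5, 6, 6],
                    "b": [5, 6, 7, 6],
                    "c": [5, 7, 8, 6]})

# Each row becomes one tuple-like entry of a MultiIndex.
i1 = pd.MultiIndex.from_frame(df1)
i2 = pd.MultiIndex.from_frame(df2)

# True where a row of df1 has no identical row in df2.
mask = ~i1.isin(i2)
result = df1[mask]
print(result)
```

Because the mask is applied to df1 directly, the result keeps df1's original row labels and only ever contains rows from df1.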
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation, but with a column name that is not a valid attribute name (e.g. "B val"). In that case bracket notation is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
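One thing worth knowing about idxmin is how it treats ties: when several rows of a group share the minimum, it returns the label of the first one, so you always get exactly one row per group. A small sketch with made-up data that contains a tie in group 1:

```python
import pandas as pd

# Made-up data: group 1 has two rows with the minimal B value (labels 0 and 2).
df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": [2, 5, 2, 7, 4],
                   "C": [3, 4, 10, 2, 4]})

# Row label of the first minimal B within each group of A.
min_labels = df.groupby("A")["B"].idxmin()

# Select those rows, then renumber them from 0.
result = df.loc[min_labels].reset_index(drop=True)
print(result)
```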
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be extended to select the n rows with the smallest values in a specific column
it can break ties by passing additional columns (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question, .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
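A variation on the snippet above, as a sketch: GroupBy objects have their own head method, so the apply call can be dropped. The helper name smallest_per_group is hypothetical, and kind="stable" is my own addition to keep the order of equal values predictable.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2],
                   "B": [4, 5, 2, 7, 4, 6],
                   "C": [3, 4, 10, 2, 4, 6]})


def smallest_per_group(frame, group_col, sort_cols, n=1):
    """Hypothetical helper: keep the n rows with the smallest sort_cols values per group."""
    return (
        frame.sort_values(sort_cols, kind="stable")
             .groupby(group_col)
             .head(n)                  # first n rows of each group, in sorted order
             .reset_index(drop=True)
    )


# Works inside a pipe chain.
result = df.pipe(smallest_per_group, "A", "B")

# Two smallest rows per group, with C as a tie-breaker.
top_two = df.pipe(smallest_per_group, "A", ["B", "C"], n=2)
print(result)
```

Unlike the apply version, this returns the rows in sorted order rather than grouped by A, which makes no difference for n=1 here.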
I found an answer a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
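The three steps above can be wrapped into one function. This is only a sketch: the function name and the temporary "_min" column suffix are my own choices, and I rename the Series before merging instead of relying on suffixes.

```python
import pandas as pd


def rows_with_group_min(frame, group_col, value_col):
    """Hypothetical helper: keep the rows whose value_col equals the group minimum."""
    min_col = value_col + "_min"
    # Step 1: minimum per group, as a two-column frame (group_col, min_col).
    mins = frame.groupby(group_col)[value_col].min().rename(min_col).reset_index()
    # Step 2: attach the group minimum to every row.
    merged = frame.merge(mins, on=group_col, how="left")
    # Step 3: keep the matching rows and drop the helper column.
    return merged[merged[value_col] == merged[min_col]].drop(columns=min_col)


data = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2],
                     "B": [4, 5, 2, 7, 4, 6],
                     "C": [3, 4, 10, 2, 4, 6]})
result = rows_with_group_min(data, "A", "B")
print(result)
```

If several rows of a group share the minimum, all of them are kept.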
You can use sort_values followed by drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
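If ties in B matter, a second sort column can decide which row survives, because drop_duplicates keeps the first occurrence by default. A sketch with made-up data, where using C as the tie-breaker is my assumption:

```python
import pandas as pd

# Made-up data: both rows of group 1 have the same minimal B.
df = pd.DataFrame({"A": [1, 1, 2],
                   "B": [2, 2, 4],
                   "C": [9, 3, 1]})

# Sort by B, then by C to break ties; the first row seen for each A is kept.
result = df.sort_values(["B", "C"]).drop_duplicates("A")
print(result)
```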
The solution, as already written above, is:
df.loc[df.groupby('A')['B'].idxmin()]
However, you may then get an error like this:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
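A related sketch that sidesteps the problem from the other end: drop the rows with a missing B before grouping, so idxmin never sees a group that has only NaN values. This is a variation on the answer above, not the same code, and the data is made up. Groups whose B values are all missing simply disappear from the result.

```python
import numpy as np
import pandas as pd

# Made-up data: group 2 has no valid B value at all.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3],
                   "B": [4.0, 2.0, np.nan, np.nan, 5.0],
                   "C": [3, 4, 10, 2, 6]})

# Remove rows without a B value, then pick the minimal row per remaining group.
valid = df.dropna(subset=["B"])
result = valid.loc[valid.groupby("A")["B"].idxmin()]
print(result)
```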
You can also use boolean indexing to select the rows where column B equals its group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
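Note that this approach keeps every row that ties for the minimum, so a group can contribute more than one row. A sketch with made-up data that has a tie in group 1 (the names group_min and out are mine):

```python
import pandas as pd

# Made-up data: labels 0 and 2 both hold the minimal B of group 1.
df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": [2, 5, 2, 7, 4],
                   "C": [3, 4, 10, 2, 4]})

# transform('min') broadcasts each group's minimum back to every row of the group.
group_min = df.groupby("A")["B"].transform("min")

# Keep rows whose B equals the minimum of their group.
out = df[df["B"] == group_min]
print(out)
```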
I have a dataframe extracted from an Excel file, which I have manipulated into the following form (there are multiple rows, but I have reduced it to one to make my question as clear as possible):
|A|B|C|A|B|C|
index 0: 1 2 3 4 5 6
As you can see there are repetitions of the column names. I would like to merge this dataframe to look like the following:
|A|B|C|
index 0: 1 2 3
index 1: 4 5 6
I have tried to use the melt function but have not had any success thus far.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5,6]], columns = ['A', 'B','C','A', 'B','C'])
df
A B C A B C
0 1 2 3 4 5 6
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
A B C
0 1 2 3
0 4 5 6
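An alternative sketch that does not use groupby(..., axis=1), which, as far as I know, is deprecated in recent pandas releases. It numbers each repetition of a column name and stacks one block per repetition, so it should also cope with more than two repetitions. The names cols, occurrence and blocks are mine.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]],
                  columns=["A", "B", "C", "A", "B", "C"])

# Number each repetition of a column name: first A/B/C -> 0, second A/B/C -> 1.
cols = pd.Series(df.columns)
occurrence = cols.groupby(cols).cumcount().to_numpy()

# One sub-frame per repetition, selected with a boolean column mask.
blocks = [df.loc[:, occurrence == k] for k in sorted(set(occurrence))]

# Stack the blocks vertically and renumber the rows.
out = pd.concat(blocks, ignore_index=True)
print(out)
```

With ignore_index=True the rows are numbered 0, 1, ... as in the desired output, instead of repeating the original label 0.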