I have this dataframe as input, read from Excel:
Name Data
0 A green
1 A blue
2 A yellow
3 A green
4 B green
5 B green
6 B red
7 C violet
8 C red
9 C white
Desired output:
Name Data
0 A blue;green;yellow
1 B green;red
2 C red;violet;white
I tried the following; both gave errors:
pivot_df = df.pivot(index = df.columns[0], columns = df.columns[1]) ## Gives --> ValueError: Index contains duplicate entries, cannot reshape
pivot_table_df = df.pivot_table(index = df.columns[0], columns = df.columns[1]) ## gives --> pandas.core.base.DataError: No numeric types to aggregate
A simple way to do it is:
df.groupby(['Name'])['Data'].apply(set).apply(';'.join).reset_index()
Name Data
0 A yellow;green;blue
1 B red;green
2 C red;violet;white
# convert the type to string; you can also use .astype(str), which is vectorized
df["Data"] = df["Data"].map(str)
# group the data by name; set as_index=False, otherwise you will have "Name" as the index. Theoretically you could simply do ["Data"].apply(set), but aggregate is more scalable in case other columns are added later on.
df = df.groupby(["Name"], as_index=False).aggregate({"Data": set})
# df["Data"] now contains a set, we want to get a ordered, concatenated string with the delimiter ";" out of it, therefore we use ";".join() to join a list to a string. I use .map which is not vectorized, and this part is therefore probably up for improvement.
df["Data"] = df["Data"].map(lambda x: ";".join(sorted(x)))
Since the 'Data' column can contain numbers (as stated in the comments), it's better to set the dtype to str, because .join is a str method, and it's more efficient than using map inside the lambda function (e.g. map(str, set(x))).
Use .groupby on 'Name' and .apply a function to the 'Data' column:
lambda x: ';'.join(sorted(set(x)))
Each group can contain repeated values, so use set, because a set only keeps unique values.
Use sorted if you want the result in order, otherwise replace sorted(set(x)) with set(x).
import pandas as pd
# test data
data = {'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'B', 'C'],
'Data': ['green', 'blue', 'yellow', 'green', 'green', 'green', 'red', 'violet', 'red', 'white', 3, 3, 3]}
# create dataframe
df = pd.DataFrame(data)
# convert the Data column to str type
df.Data = df.Data.astype('str')
# groupby name and apply the function
dfg = df.groupby('Name', as_index=False)['Data'].apply(lambda x: ';'.join(sorted(set(x))))
# display(dfg)
Name Data
0 A 3;blue;green;yellow
1 B 3;green;red
2 C 3;red;violet;white
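For completeness: .pivot fails here because 'Name' contains duplicate entries, and .pivot_table fails only because its default aggfunc is mean, which cannot aggregate strings. Passing a custom, string-joining aggfunc should also work (a sketch on the question's original data, not tested against the asker's exact file):
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Data': ['green', 'blue', 'yellow', 'green', 'green', 'green', 'red', 'violet', 'red', 'white']})
# pivot_table accepts any callable as aggfunc, so the strings can be joined per group
out = df.pivot_table(index='Name', values='Data',
                     aggfunc=lambda x: ';'.join(sorted(set(x)))).reset_index()
# expected:
#   Name               Data
# 0    A  blue;green;yellow
# 1    B          green;red
# 2    C   red;violet;white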
Related
I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array, how can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
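A minimal sketch of that idea (the 'A'/'B' column names are just the question's example):
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['a', 'a', 'b']})
# group sizes as a named column, with the group keys back as regular columns
grpd = df.groupby(['A', 'B']).size().to_frame('size').reset_index()
print(grpd)
#    A  B  size
# 0  x  a     2
# 1  y  b     1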
You need transform with 'size' - the length of df stays the same as before:
Notice:
Here it is necessary to select one column after the groupby, otherwise you get an error. Because GroupBy.size counts NaNs too, it does not matter which column you pick; all columns work the same.
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
A B size
0 x a 1
1 x c 2
2 x c 2
3 y b 2
4 y b 2
If you need to set the column name while aggregating - the length of df is obviously NOT the same as before:
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
A B Size
0 x a 1
1 x c 2
2 y b 2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
import numpy as np
# note: assumes df has at least one column besides the grouping keys
df['size'] = df.groupby(['A','B']).transform(np.size)
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
A B size
0 a 1 1
1 a 2 1
2 b 2 2
Let's say n is the name of the DataFrame and cst is the column whose repeated items you want to count.
The code below gives the count in a new column:
from collections import Counter
import pandas as pd

cstn = Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns=['name','cnt']
n['cnt']=n['cst'].map(cstlist.loc[:, ['name','cnt']].set_index('name').iloc[:,0].to_dict())
Hope this will work
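A shorter equivalent of the same idea (per-value counts via value_counts; the hypothetical n / cst names are kept from above):
# map each cst value to how often it occurs in the column
n['cnt'] = n['cst'].map(n['cst'].value_counts())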
from pandas import DataFrame
df = DataFrame({
'A' : [1, 2, 3]
, 'B' : [3, 2, 1]
})
print(df.rename(columns={'A': 'a', 'B': 'b'}))
I know that I can rename columns like the above.
But sometimes, I just want to rename columns by position so that I don't have to specify the old names.
df.rename(columns=['a', 'b']))
df.rename(columns={0: 'a', 1: 'b'})
I tried the above, but neither of them works. Could anybody show me a concise way to rename columns without specifying the original names?
I am looking for a way with minimal code. Ideally, this should and could have been supported by the rename() interface itself. Using a for-loop or something to create a dictionary of the old column names could be a solution, but it is not preferred.
Here you go:
df.rename(columns={ df.columns[0]: 'a', df.columns[1]: 'b',}, inplace = True)
df
Prints:
a b
0 1 3
1 2 2
2 3 1
You can assign directly to pandas.DataFrame.columns, i.e.
import pandas as pd
df = pd.DataFrame({'A':[1, 2, 3],'B':[3, 2, 1]})
df.columns = ["X","Y"]
print(df)
output
X Y
0 1 3
1 2 2
2 3 1
You can create a function to rename them:
names = iter(['a', 'b'])
def renamer(col):
return next(names)
df.rename(renamer, axis='columns', inplace=True)
The advantage of this approach is that it is enough to know the order of the columns to rename, and the renamer function does not even have to use its argument.
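For the sample df above, this gives:
print(df)
#    a  b
# 0  1  3
# 1  2  2
# 2  3  1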
With this method, you can rename any columns by position. It converts the index positions (0, 1) into the column names and creates the usual mapping ({'A': 'a', 'B': 'b'}):
c = {0: 'a', 1: 'b'}
m = dict(zip(df.columns[list(c.keys())], c.values()))
>>> m
{'A': 'a', 'B': 'b'}
>>> df.rename(columns=m)
a b
0 1 3
1 2 2
2 3 1
It's unclear whether the other solutions work when two or more columns share the same name. Here's a function which does:
# rename df columns by position (as opposed to index)
# mapper is a dict where keys = ordinal column position and vals = new titles
# unclear that using the native df rename() function produces the correct results when renaming by position
def rename_df_cols_by_position(df, mapper):
new_cols = [df.columns[i] if i not in mapper.keys() else mapper[i] for i in range(0, len(df.columns))]
df.columns = new_cols
return
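A quick usage sketch (the mapper keys are ordinal positions; the duplicate column names are on purpose, since renaming by label could not target just one of them):
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 2], [3, 1]], columns=['A', 'A'])
# rename only the second column, leaving the first 'A' untouched
rename_df_cols_by_position(df, {1: 'b'})
print(df.columns.tolist())
# ['A', 'b']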
I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
A B
0 a 9
1 b 9
2 c 8
3 d 4
Code:
for i in list:
    ind = df[df['A'] == i].index.tolist()
    df = df.drop(ind)
df
Output:
A B
2 c 8
3 d 4
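As a side note, if the goal is to drop the rows whose 'A' value appears in the list (rather than dropping columns), a common vectorized alternative is boolean masking with isin (renaming the list to values here to avoid shadowing the built-in):
values = ['a', 'b']
# keep only the rows whose 'A' value is NOT in values
df1 = df[~df['A'].isin(values)]
#    A  B
# 2  c  8
# 3  d  4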
You need to specify the correct axis, namely axis=1.
According to the docs:
axis{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). You should also avoid shadowing Python built-in names, so don't use list as a variable name.
This should work:
myList = ['A', 'B']
df1 = df.drop([x for x in myList], axis=1)