Make list after groupby in pandas using apply() function - python

I have this dataframe:
  c1  c2
0  B   1
1  A   2
2  B   5
3  A   3
4  A   7
My goal is to collect the values of column c2, grouped by the letters in column c1 and separated by a colon (:). The output should look like this:
  c1   list
0  A  2:3:7
1  B    1:5
What's the most pythonic way to do this?
At the moment I'm able to group by column c1 and I'm trying to use the apply() function, but I don't know how to build this joined list in the new column.

Try this:
df = df.groupby("c1")["c2"].apply(lambda x: ":".join(map(str, x))).reset_index()

You can use groupby:
>>> import pandas as pd
>>> df = pd.DataFrame({'c1': ['B', 'A', 'B', 'A', 'A'], 'c2': [1, 2, 5, 3, 7]})
>>>
>>> df.c2 = df.c2.astype(str)
>>> new_df = df.groupby("c1")['c2'].apply(":".join).reset_index()
>>> new_df
  c1     c2
0  A  2:3:7
1  B    1:5

I think you can just do a string join:
import pandas
df = pandas.DataFrame({"c1": list("BABAA"), "c2": [1, 2, 5, 3, 7]})
df['c2'] = df['c2'].astype(str)
df.groupby('c1').agg({'c2': ':'.join})
You might get more mileage from:
df.groupby('c1').agg({'c2': list})
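A minimal runnable sketch of both aggregations, using the question's sample frame:

```python
import pandas as pd

df = pd.DataFrame({"c1": list("BABAA"), "c2": [1, 2, 5, 3, 7]})

# keep the grouped values as Python lists
as_lists = df.groupby("c1").agg({"c2": list})

# or join them into one colon-separated string per group
joined = df.groupby("c1")["c2"].apply(lambda s: ":".join(map(str, s)))

print(as_lists)
print(joined)
```

The list form is usually handier for further processing; the string form matches the output the question asks for.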

Related

How to drop rows based on column value if column is not set as index in pandas?

I have a list and a dataframe which look like this:
list = ['a', 'b']
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]})
I would like to do something like this:
df1 = df.drop([x for x in list])
I am getting the following error message:
"KeyError: "['a' 'b'] not found in axis""
I know I can do the following:
df = pd.DataFrame({'A':['a', 'b', 'c', 'd'], 'B':[9, 9, 8, 4]}).set_index('A')
df1 = df.drop([x for x in list])
How can I drop the list values without having to set column 'A' as index? My dataframe has multiple columns.
Input:
   A  B
0  a  9
1  b  9
2  c  8
3  d  4
Code:
for i in list:
    ind = df[df['A'] == i].index.tolist()
    df = df.drop(ind)
df
Output:
   A  B
2  c  8
3  d  4
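For what it's worth, the loop can also be replaced by a single boolean mask with isin, which drops the matching rows without touching the index (a sketch, with the list renamed to lst to avoid shadowing the built-in):

```python
import pandas as pd

lst = ['a', 'b']  # renamed from `list` to avoid shadowing the built-in
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [9, 9, 8, 4]})

# keep only the rows whose 'A' value is NOT in lst
df1 = df[~df['A'].isin(lst)]
```

This works on any column, so column 'A' never needs to become the index.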
You need to specify the correct axis, namely axis=1.
According to the docs:
axis{0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
You also need to make sure you are using the correct column names (i.e. uppercase rather than lowercase). You should also avoid shadowing Python built-in names, so don't use list as a variable name.
This should work:
myList = ['A', 'B']
df1 = df.drop(myList, axis=1)

Adding a DataFrame to a level in Pandas

I have a MultiIndex DataFrame and would like to take a level and put a new DataFrame in its place. So if I had a DataFrame with levels like this:
a  1
   2
b  3
   4
Would I be able to swap out ['b', 3] with a DataFrame like this:
10
11
Resulting in this:
a  1
   2
b  10
   11
   4
I took a crack at this and couldn't quite get what you wanted. Instead of 'inserting' a df, you can delete the row(s) in question, then concat/join/merge as needed. However, this does not sort the data in the order you requested.
import pandas as pd

arrays = [['a', 'a', 'b', 'b'], [1, 2, 3, 4], ['str1', 'str2', 'str3', 'str4']]
df1 = pd.DataFrame(list(zip(*arrays)),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])
df2 = pd.DataFrame(list(zip(*[['b', 'b'], [10, 11], ['str5', 'str6']])),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])
df3 = pd.concat([df1.drop(3, level='num'), df2])
Output:
         string
let num
a   1     str1
    2     str2
b   4     str4
    10    str5
    11    str6
I tried a number of other methods to insert data into the middle of the df, and was met with error after error. I'll keep messing with it; it's a good problem-oriented multiindex tutorial.
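One more sketch that does preserve the requested order, assuming the ('b', 3) key is unique: find its integer position with Index.get_loc and splice the frames positionally with iloc, so df2 lands exactly where the old row was:

```python
import pandas as pd

arrays = [['a', 'a', 'b', 'b'], [1, 2, 3, 4], ['str1', 'str2', 'str3', 'str4']]
df1 = pd.DataFrame(list(zip(*arrays)),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])
df2 = pd.DataFrame(list(zip(*[['b', 'b'], [10, 11], ['str5', 'str6']])),
                   columns=['let', 'num', 'string']).set_index(['let', 'num'])

# integer position of the row to replace
pos = df1.index.get_loc(('b', 3))

# splice: rows before the key, the replacement frame, rows after the key
df3 = pd.concat([df1.iloc[:pos], df2, df1.iloc[pos + 1:]])
```

Because the splice is positional, row ('b', 4) stays after the inserted rows, matching the layout in the question.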

pandas: Extracting the index of the maximum value in an expanding window

In a pandas DataFrame, I can create a Series B with the maximum value of another Series A, from the first row to the current one, by using an expanding window:
df['B'] = df['A'].expanding().max()
I can also extract the index label of the overall maximum of Series A:
idx_max_A = df['A'].idxmax()
What I want is an efficient way to combine both; that is, to create a Series B that holds the index of the maximum value of Series A from the first row up to the current one. Ideally, something like this...
df['B'] = df['A'].expanding().idxmax()
...but, of course, the above fails because the Expanding object does not have idxmax. Is there a straightforward way to do this?
EDIT: For illustration purposes, for the following DataFrame...
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
...I'd like to create an additional column B so that the DataFrame contains the following:
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
I believe you can use expanding + max + groupby:
v = df.expanding().max().A
df['B'] = v.groupby(v).transform('idxmax')
df
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
It seems idxmax as used above requires a more recent version of pandas than I have. Here's a solution not involving groupby or idxmax:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
temp = df.A.expanding().max()
df['B'] = temp.apply(lambda x: temp[temp == x].index[0])
df
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
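If performance matters on long frames, here is a vectorized sketch that avoids both groupby and the quadratic apply: flag each row that sets a strictly new running maximum, record the index label there, and forward-fill. The strict comparison keeps the first label on ties, matching idxmax behavior:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])

# True wherever A sets a strictly new running maximum
new_max = df['A'].gt(df['A'].cummax().shift(fill_value=-np.inf))

# take the index label at each new maximum, forward-fill everywhere else
df['B'] = pd.Series(df.index, index=df.index).where(new_max).ffill()
```

Every step is a single vectorized pass, so this stays O(n) rather than O(n^2).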

How to re-arrange multiple columns into one column with the same index

I'm using Python pandas and I want to reshape multiple columns that share the same index into a single column. Where possible, I also want to drop the zero values.
I have this data frame
index  A  B  C
a      8  0  1
b      2  3  0
c      0  4  0
d      3  2  7
I'd like my output to look like this
index  data  value
a      A     8
b      A     2
d      A     3
b      B     3
c      B     4
d      B     2
a      C     1
d      C     7
===
I solved this task as below. My original data has 2 indexes, and the 0s in the dataframe were NaN values.
At first I tried to apply the melt function while removing NaN values, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I re-organized the dataframe in 2 steps:
First, I melted the multiple columns into one column with the melt function;
then I removed the NaN values in each row with the dropna function.
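Those two steps might look like this on the sample frame (here the literal zeros stand in for the NaNs of the original data, so a != 0 filter plays the role of dropna):

```python
import pandas as pd

df = pd.DataFrame({'index': list('abcd'),
                   'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]})

# step 1: melt the value columns into a single 'value' column
melted = df.melt(id_vars='index', var_name='data', value_name='value')

# step 2: drop the unwanted (zero) rows
melted = melted[melted['value'] != 0].reset_index(drop=True)
```

With real NaNs, step 2 would simply be melted.dropna(subset=['value']).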
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
With a little manipulation, you could solve for the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on = 'A')
b = molten.merge(df, left_on='value', right_on = 'B')
c = molten.merge(df, left_on='value', right_on = 'C')
merge = pd.concat([a,b,c])
Try this:
array = [['a', 8, 0, 1], ['b', 2, 3, 0], ['c', 0, 4, 0], ['d', 3, 2, 7]]
cols = ['A', 'B', 'C']
result = [[[array[i][0], cols[j], array[i][j + 1]]
           for i in range(len(array))] for j in range(len(cols))]
output:
[[['a', 'A', 8], ['b', 'A', 2], ['c', 'A', 0], ['d', 'A', 3]], [['a', 'B', 0], ['b', 'B', 3], ...]]

How to assign a name to the size() column?

I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array, how can it be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
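A minimal sketch of that approach on a small frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})

# group sizes as a DataFrame with a custom column name,
# then reset_index to turn the group keys back into columns
sizes = df.groupby(['A', 'B']).size().to_frame('size').reset_index()
```

The result has one row per (A, B) group and a 'size' column holding the counts.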
You need transform with 'size' - the length of df is the same as before:
Notice:
Here it is necessary to select one column after the groupby, else you get an error. Because GroupBy.size counts NaNs too, it does not matter which column is selected; all columns work the same.
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
print (df)
   A  B
0  x  a
1  x  c
2  x  c
3  y  b
4  y  b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
   A  B  size
0  x  a     1
1  x  c     2
2  x  c     2
3  y  b     2
4  y  b     2
If you need to set the column name while aggregating df - the length of df is obviously NOT the same as before:
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y'],
                   'B': ['a', 'c', 'c', 'b', 'b']})
print (df)
print (df)
   A  B
0  x  a
1  x  c
2  x  c
3  y  b
4  y  b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
   A  B  Size
0  x  a     1
1  x  c     2
2  y  b     2
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
import numpy as np
df['size'] = df.groupby(['A', 'B'])['A'].transform(np.size)
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
   A  B  size
0  a  1     1
1  a  2     1
2  b  2     2
Let's say n is the name of the dataframe and cst is the column whose repeated items are counted.
The code below puts the count in the next column:
from collections import Counter
import pandas as pd

cstn = Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns = ['name', 'cnt']
n['cnt'] = n['cst'].map(cstlist.set_index('name')['cnt'].to_dict())
Hope this will work.
