Python: Transform or Melt and Groupby a DataFrame

I have something like this:
df =
col1 col2 col3
0 B C A
1 E D G
2 NaN F B
EDIT: I need to convert it into this:
result =
Name location
0 B col1,col3
1 C col2
2 A col3
3 E col1
4 D col2
5 G col3
6 F col2
Essentially getting a "location" telling me which column an "Name" is in. Thank you in advance.
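A minimal sketch to reconstruct the example frame (layout assumed from the printout above):
import numpy as np
import pandas as pd

# rebuild the question's frame: strings plus one missing cell in col1
df = pd.DataFrame({'col1': ['B', 'E', np.nan],
                   'col2': ['C', 'D', 'F'],
                   'col3': ['A', 'G', 'B']})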

Try melt and dropna, then groupby and agg:
>>> df.melt(var_name='location').dropna().groupby('value', sort=False, as_index=False).agg(', '.join)
value location
0 B col1, col3
1 E col1
2 C col2
3 D col2
4 F col2
5 A col3
6 G col3
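Step by step, the same pipeline reads:
melted = df.melt(var_name='location')  # wide -> long: one row per (column, value) pair
melted = melted.dropna()                # drop the row that held the NaN cell
# collect the column names for each distinct value, keeping first-seen order
result = (melted.groupby('value', sort=False, as_index=False)
                .agg(', '.join))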

Or an alternative with stack():
new = df.stack().reset_index().drop('level_0', axis=1).dropna()
new.columns = ['location', 'name']
prints:
location name
0 col1 B
1 col2 C
2 col3 A
3 col1 E
4 col2 D
5 col3 G
6 col2 F
7 col3 B
EDIT:
To get your updated output you could use a groupby along with join():
new.groupby('name').agg({'location': lambda x: ', '.join(x)}).reset_index()
Which gives you:
name location
0 A col3
1 B col1, col3
2 C col2
3 D col2
4 E col1
5 F col2
6 G col3

Use melt to convert the columns to rows (naming the new columns as you go), then dropna to remove the NaN values:
df = df.melt(var_name="location", value_name="Name").dropna()
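At this point the frame has one row per non-null cell, roughly:
print(df)
#   location Name
# 0     col1    B
# 1     col1    E
# 3     col2    C
# 4     col2    D
# 5     col2    F
# 6     col3    A
# 7     col3    G
# 8     col3    B
The next answer adds the groupby step that collapses duplicate names into a comma-separated location.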

You can use DataFrame.melt together with groupby and agg:
df = df.melt(var_name="location", value_name="Name").dropna()
new_df = df.groupby("Name", as_index=False).agg(",".join)
print(new_df)
Output:
Name location
0 A col3
1 B col1,col3
2 C col2
3 D col2
4 E col1
5 F col2
6 G col3

Related

Pandas get column value based on row value [duplicate]

This question already has answers here:
Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer
(4 answers)
Closed 1 year ago.
I have the following dataframe:
import pandas as pd

df = pd.DataFrame(data={'flag': ['col3', 'col2', 'col2'],
                        'col1': [1, 3, 2],
                        'col2': [5, 2, 4],
                        'col3': [6, 3, 6],
                        'col4': [0, 4, 4]},
                  index=pd.Series(['A', 'B', 'C'], name='index'))
       flag  col1  col2  col3  col4
index
A      col3     1     5     6     0
B      col2     3     2     3     4
C      col2     2     4     6     4
For each row, I want to get the value when column name is equal to the flag.
       flag  col1  col2  col3  col4  col_val
index
A      col3     1     5     6     0        6
B      col2     3     2     3     4        2
C      col2     2     4     6     4        4
– Index A has a flag of col3, so col_val should be 6 because df['col3'] for that row is 6.
– Index B has a flag of col2, so col_val should be 2 because df['col2'] for that row is 2.
– Index C has a flag of col2, so col_val should be 4 because df['col2'] for that row is 4.
Per the linked duplicate, use factorize to encode the flags and reindex to align the columns:
import numpy as np

idx, cols = pd.factorize(df['flag'])
df['col_val'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Output:
>>> df
      flag  col1  col2  col3  col4  col_val
index
A     col3     1     5     6     0        6
B     col2     3     2     3     4        2
C     col2     2     4     6     4        4
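An equivalent sketch that skips factorize and indexes the underlying array directly (it assumes every flag label actually exists as a column; np as imported above):
pos = df.columns.get_indexer(df['flag'])                # column position of each row's flag
df['col_val'] = df.to_numpy()[np.arange(len(df)), pos]  # pick one cell per row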
The docs have an example you can adapt; the solution below is just another option.
It flips the dataframe into a MultiIndex dataframe with pivot, selects the matching (value, flag) column pairs, and collapses the non-null entries with a row-wise sum:
cols = [(ent, ent) for ent in df.flag.unique()]
df.assign(col_val=df.pivot(index=None, columns='flag')
                    .loc(axis=1)[cols]
                    .sum(1))
flag col1 col2 col3 col4 col_val
index
A col3 1 5 6 0 6.0
B col2 3 2 3 4 2.0
C col2 2 4 6 4 4.0
Or try this:
# boolean mask: True where the column label matches the row's flag
cond = ([df.columns.values[1:]] * df.shape[0]) == df.flag.values.reshape(-1, 1)
df1 = df.set_index('flag', append=True)
# keep only the flagged cell in each row, ffill it across to col4, and join it back
df1.join(df1.where(cond).ffill(axis=1).col4.rename('res')).reset_index('flag')

Selecting rows with column value in other column list in Python

I have a pandas data frame with the following format:
col1 col2 ... col4
A 2 [2-3-4]
B 3 [2-6]
A 3 [2-3-4]
C 2 [2-3-4]
D 2 [2-3-4]
I would like to select only the rows where the value in col2 is in the list of col4.
I tried to use:
df[df["col2"].isin(df["col4"].str.split("-"))]
but I get an empty data frame...
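For context, isin here tests col2 against the whole column of split lists rather than row by row, so nothing ever matches. A sketch of the frame for the answers below (assuming col4 holds plain strings such as '[2-3-4]'):
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'A', 'C', 'D'],
                   'col2': [2, 3, 3, 2, 2],
                   'col4': ['[2-3-4]', '[2-6]', '[2-3-4]', '[2-3-4]', '[2-3-4]']})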
I would use a list comprehension here for this usecase:
df[[str(a) in b for a,b in zip(df['col2'],df['col4'])]]
col1 col2 col4
0 A 2 [2-3-4]
2 A 3 [2-3-4]
3 C 2 [2-3-4]
4 D 2 [2-3-4]
Or using a regex search, which will not match 2 with 22 (thanks @Nk03):
import re
df[[bool(re.search(fr'\b{a}\b',b)) for a,b in zip(df['col2'],df['col4'])]]
Code
df['col4'] = df.col4.astype(str).str.replace('-', ',')
df['col2'] = df.col2.astype(str)
df = df[df.apply(lambda x: x.col2 in x.col4, axis=1)]
Output
col1 col2 col4
0 A 2 [2,3,4]
2 A 3 [2,3,4]
3 C 2 [2,3,4]
4 D 2 [2,3,4]
You can try this:
import ast

df.col4 = df.col4.str.replace('-', ',').apply(ast.literal_eval)
new_df = df[df.apply(lambda x: x['col2'] in x['col4'], axis=1)]
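Another option, a sketch that starts again from the original string col4 (before any of the conversions above), is to explode the split lists and reduce back per row:
# one row per list element, with the original row index repeated
s = df['col4'].str.strip('[]').str.split('-').explode()
# compare each element against that row's col2, then ask whether any matched
mask = s.eq(df['col2'].astype(str)).groupby(level=0).any()
new_df = df[mask]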

Pandas multiple dataframes into one

I'm looping through a list of dictionaries and want them appended into a single dataframe.
# getting values of a specific key from the AWS boto3 response
events_list = response_event.get('Events')
for e in events_list:
    df = pd.DataFrame.from_dict(e)  # overwrites df on every pass, so only the last frame survives
    print(df)
Current result (one frame per loop iteration):
col1 col2
0 1 3
col1 col2
0 2 4
col1 col2
0 3 5
Expected result:
col1 col2
0 1 3
1 2 4
2 3 5
Try concat; ignore_index=True rebuilds the 0..n-1 index shown in the expected output:
out = pd.concat((pd.DataFrame.from_dict(e) for e in events_list), ignore_index=True)
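If you still need the loop (say, to inspect or filter each event), a common pattern, sketched here, is to collect the per-event frames in a list and concatenate once at the end:
frames = []
for e in events_list:
    # per-event filtering or processing would go here
    frames.append(pd.DataFrame.from_dict(e))

out = pd.concat(frames, ignore_index=True)  # a single concat avoids repeated copying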

Fetch row data from known index in pandas

df1:
col1 col2
0 a 5
1 b 2
2 c 1
df2:
col1
0 qa0
1 qa1
2 qa2
3 qa3
4 qa4
5 qa5
final output:
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
Basically, col2 in df1 stores the index of the corresponding row in df2. I need to fetch those values from df2 and append them to df1, but I don't know how to fetch data by index number.
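A sketch of the two frames for the answers below:
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [5, 2, 1]})
df2 = pd.DataFrame({'col1': ['qa0', 'qa1', 'qa2', 'qa3', 'qa4', 'qa5']})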
Use Series.map by another Series:
df1['col3'] = df1['col2'].map(df2['col1'])
Or use DataFrame.join after renaming the column:
df1 = df1.join(df2.rename(columns={'col1':'col3'})['col3'], on='col2')
print (df1)
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
You can use iloc to pick the rows by position and to_numpy to drop the index before assigning (map above aligns on df2's index labels, while iloc is purely positional; the two coincide here because df2 has a default RangeIndex):
df1["col3"] = df2.iloc[df1.col2].to_numpy()
df1
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1

How to do group by and take count of unique and count of some value as aggregate on same column in python pandas?

My question is related to my previous question, but it's different, so I'm asking it separately. See @jezrael's answer to that question:
df = pd.DataFrame({'col1': [1, 1, 1],
                   'col2': [4, 4, 6],
                   'col3': [7, 7, 9],
                   'col4': [3, 3, 5]})
print(df)
col1 col2 col3 col4
0 1 4 7 3
1 1 4 7 3
2 1 6 9 5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
col4 col3 result_col
col1 col2
1 4 1 2 2.0
6 1 1 1.0
Now I also want to count a specific value of col4 in the same query, say the number of rows where col4 == 3:
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'}) ... + count(col4==3)
I tried the following, but it doesn't work (the duplicate 'col4' key overwrites the 'nunique' entry):
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique','col4': lambda x: x[x == 3].count()})
Do some preprocessing by including col4 == 3 as a column ahead of time, then aggregate:
df.assign(result_col=df.col4.eq(3).astype(int)).groupby(
    ['col1', 'col2']
).agg(dict(col3='size', col4='nunique', result_col='sum'))
col3 result_col col4
col1 col2
1 4 2 2 1
6 1 0 1
Old answers:
g = df.groupby(['col1', 'col2'])
g.agg({'col3': 'size', 'col4': 'nunique'}).assign(
    result_col=g.col4.apply(lambda x: x.eq(3).sum()))
col3 col4 result_col
col1 col2
1 4 2 1 2
6 1 1 0
Slightly rearranged:
g = df.groupby(['col1', 'col2'])
final_df = g.agg({'col3': 'size', 'col4': 'nunique'})
final_df.insert(1, 'result_col', g.col4.apply(lambda x: x.eq(3).sum()))
final_df
col3 result_col col4
col1 col2
1 4 2 2 1
6 1 0 1
I think you need to aggregate with a list of functions in the dict for column col4.
If you need to count the value 3, the simplest way is to sum the True values of x == 3:
df1 = df.groupby(['col1', 'col2']).agg({'col3': 'size',
                                        'col4': ['nunique', lambda x: (x == 3).sum()]})
df1 = df1.rename(columns={'<lambda>': 'count_3'})
df1.columns = ['{}_{}'.format(x[0], x[1]) for x in df1.columns]
print(df1)
col4_nunique col4_count_3 col3_size
col1 col2
1 4 1 2 2
6 1 0 1
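As a side note, on pandas 0.25+ the same thing can be written with named aggregation, which avoids the MultiIndex columns and the '<lambda>' rename (a sketch):
df1 = df.groupby(['col1', 'col2']).agg(
    col3_size=('col3', 'size'),
    col4_nunique=('col4', 'nunique'),
    col4_count_3=('col4', lambda x: (x == 3).sum()),
)
print(df1)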
