I have something like this:
df =
col1 col2 col3
0 B C A
1 E D G
2 NaN F B
EDIT: I need to convert it into this:
result =
Name location
0 B col1,col2
1 C col1
2 A col1
3 E col2
4 D col2
5 G col2
6 F col3
Essentially, I want a "location" column telling me which column(s) each "Name" is in. Thank you in advance.
Try melt and dropna:
>>> df.melt(var_name='location').dropna().groupby('value', sort=False, as_index=False).agg(', '.join)
value location
0 B col1, col3
1 E col1
2 C col2
3 D col2
4 F col2
5 A col3
6 G col3
The groupby and agg calls join the column names for each value.
Or an alternative with stack():
new = df.stack().reset_index().drop('level_0',axis=1).dropna()
new.columns = ['name','location']
prints:
name location
0 col1 B
1 col2 C
2 col3 A
3 col1 E
4 col2 D
5 col3 G
6 col2 F
7 col3 B
EDIT:
To get your updated output you could use a groupby along with join():
new.groupby('location').agg({'name':lambda x: ', '.join(list(x))}).reset_index()
Which gives you:
location name
0 A col3
1 B col1, col3
2 C col2
3 D col2
4 E col1
5 F col2
6 G col3
Try using melt to convert the columns to rows, naming the two resulting columns with var_name and value_name.
Then use dropna to remove the rows that contain NaN.
df = df.melt(var_name="location", value_name="Name").dropna()
You can use DataFrame.melt followed by groupby and agg:
df = df.melt(var_name="location", value_name="Name").dropna()
new_df = df.groupby("Name", as_index=False).agg(",".join)
print(new_df)
Output:
Name location
0 A col3
1 B col1,col3
2 C col2
3 D col2
4 E col1
5 F col2
6 G col3
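For reference, here is a minimal self-contained sketch of the melt approach, assuming the sample frame from the question is rebuilt by hand. The names long_df and result are illustrative only. It passes sort=False to keep the values in order of first appearance.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question
df = pd.DataFrame({
    "col1": ["B", "E", np.nan],
    "col2": ["C", "D", "F"],
    "col3": ["A", "G", "B"],
})

# Wide -> long: one row per (location, Name) pair, NaN cells dropped
long_df = df.melt(var_name="location", value_name="Name").dropna()

# Join every column name in which each value appears
result = (
    long_df.groupby("Name", sort=False)["location"]
    .agg(",".join)
    .reset_index()
)
print(result)
```

Selecting the "location" column before agg keeps the join restricted to that column, which may matter if the long frame carries extra columns.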
I have the following dataframe:
df = pd.DataFrame(data={'flag': ['col3', 'col2', 'col2'],
'col1': [1, 3, 2],
'col2': [5, 2, 4],
'col3': [6, 3, 6],
'col4': [0, 4, 4]},
index=pd.Series(['A', 'B', 'C'], name='index'))
index flag col1 col2 col3 col4
A col3 1 5 6 0
B col2 3 2 3 4
C col2 2 4 6 4
For each row, I want to get the value when column name is equal to the flag.
index flag col1 col2 col3 col4 col_val
A col3 1 5 6 0 6
B col2 3 2 3 4 2
C col2 2 4 6 4 4
– Index A has a flag of col3. So col_val should be 6 because df['col3'] for that row is 6.
– Index B has a flag of col2. So col_val should be 2 because df['col2'] for that row is 2.
– Index C has a flag of col2. So col_val should be 4 because df['col2'] for that row is 4.
Per this page:
import numpy as np
idx, cols = pd.factorize(df['flag'])
df['COl_VAL'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Output:
>>> df
flag col1 col2 col3 col4 COl_VAL
index
A col3 1 5 6 0 6
B col2 3 2 3 4 2
C col2 2 4 6 4 4
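The factorize lookup can be checked end to end with a small sketch like the one below. It rebuilds the question's frame and compares the vectorized result against a plain per-row lookup; the names picked and expected are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={"flag": ["col3", "col2", "col2"],
          "col1": [1, 3, 2], "col2": [5, 2, 4],
          "col3": [6, 3, 6], "col4": [0, 4, 4]},
    index=pd.Series(["A", "B", "C"], name="index"),
)

# idx: integer code per row; cols: unique flag values in order of appearance
idx, cols = pd.factorize(df["flag"])

# Reorder the columns to match cols, then pick one cell per row
picked = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

# Slow but obvious per-row lookup, used only as a cross-check
expected = [df.loc[label, flag] for label, flag in zip(df.index, df["flag"])]

df["col_val"] = picked
print(df)
```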
The docs have an example that you can adapt; the solution below is just another option.
It pivots the dataframe into one with MultiIndex columns, selects the relevant columns, and sums across each row so that only the non-null value remains:
cols = [(ent, ent) for ent in df.flag.unique()]
(df.assign(col_val = df.pivot(index = None, columns = 'flag')
                       .loc(axis = 1)[cols].sum(1))
)
flag col1 col2 col3 col4 col_val
index
A col3 1 5 6 0 6.0
B col2 3 2 3 4 2.0
C col2 2 4 6 4 4.0
Try this:
cond = ([df.columns.values[1:]] * df.shape[0]) == df.flag.values.reshape(-1,1)
df1 = df.set_index('flag', append=True)
df1.join(df1.where(cond).ffill(axis=1).col4.rename('res')).reset_index('flag')
I have a pandas data frame with the following format:
col1 col2 ... col4
A 2 [2-3-4]
B 3 [2-6]
A 3 [2-3-4]
C 2 [2-3-4]
D 2 [2-3-4]
I would like to select only the rows where the value in col2 is in the list of col4.
I tried to use:
df[df["col2"].isin(df["col4"].str.split("-"))]
but I get an empty data frame...
I would use a list comprehension for this use case:
df[[str(a) in b for a,b in zip(df['col2'],df['col4'])]]
col1 col2 col4
0 A 2 [2-3-4]
2 A 3 [2-3-4]
3 C 2 [2-3-4]
4 D 2 [2-3-4]
Or use a regex search with word boundaries, which will not match 2 against 22 (thanks @Nk03):
import re
df[[bool(re.search(fr'\b{a}\b',b)) for a,b in zip(df['col2'],df['col4'])]]
Code
df['col4'] = df.col4.astype(str).str.replace('-',',')
df['col2'] = df.col2.astype(str)
df= df[df.apply(lambda x: x.col2 in x.col4, axis=1)]
Output
col1 col2 col4
0 A 2 [2,3,4]
2 A 3 [2,3,4]
3 C 2 [2,3,4]
4 D 2 [2,3,4]
You can try this:
import ast
df.col4 = df.col4.str.replace('-',',').apply(ast.literal_eval)
new_df = df[df.apply(lambda x: x['col2'] in x['col4'], axis =1)]
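As a further sketch, assuming col4 holds strings such as "[2-3-4]" (the question does not say whether they are strings or lists), the values can be parsed into sets of integers so that membership is tested on numbers rather than substrings. The names parsed, mask and selected are illustrative.

```python
import pandas as pd

# Assumed reconstruction of the question's frame, with col4 stored as strings
df = pd.DataFrame({
    "col1": ["A", "B", "A", "C", "D"],
    "col2": [2, 3, 3, 2, 2],
    "col4": ["[2-3-4]", "[2-6]", "[2-3-4]", "[2-3-4]", "[2-3-4]"],
})

# "[2-3-4]" -> {2, 3, 4}
parsed = (
    df["col4"].str.strip("[]").str.split("-")
    .apply(lambda parts: {int(p) for p in parts})
)

# Keep the rows whose col2 value is a member of the parsed set
mask = [value in members for value, members in zip(df["col2"], parsed)]
selected = df[mask]
print(selected)
```

Comparing integers avoids the case where 2 would match inside 22.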
I'm looping through a list of multiple dictionaries and want them appended into a single data frame.
#getting values of specific key from AWS' boto3 response
events_list = response_event.get('Events')
for e in events_list:
    df = pd.DataFrame.from_dict(e)
    print(df)
Current result:
col1 col2
0 1 3
col1 col2
0 2 4
col1 col2
0 3 5
Expected result:
col1 col2
0 1 3
1 2 4
2 3 5
Try with concat
out = pd.concat(pd.DataFrame.from_dict(e) for e in events_list)
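Note that the expected result has a fresh 0, 1, 2 index, which concat only produces when ignore_index=True is passed. A minimal sketch follows; events_list here is a hypothetical stand-in for the boto3 response, shaped so that it reproduces the frames printed in the question.

```python
import pandas as pd

# Hypothetical stand-in for response_event.get('Events')
events_list = [
    {"col1": [1], "col2": [3]},
    {"col1": [2], "col2": [4]},
    {"col1": [3], "col2": [5]},
]

# Build one small frame per dictionary, then stack them with a new index
frames = [pd.DataFrame.from_dict(e) for e in events_list]
out = pd.concat(frames, ignore_index=True)
print(out)
```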
df1:
col1 col2
0 a 5
1 b 2
2 c 1
df2:
col1
0 qa0
1 qa1
2 qa2
3 qa3
4 qa4
5 qa5
final output:
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
Basically, col2 of df1 stores index values that point into df2. I have to fetch the matching data from df2 and add it to df1 as a new column.
I don't know how to fetch data by index number.
Use Series.map by another Series:
df1['col3'] = df1['col2'].map(df2['col1'])
Or use DataFrame.join with rename column:
df1 = df1.join(df2.rename(columns={'col1':'col3'})['col3'], on='col2')
print (df1)
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
You can use iloc to select the rows by position and then to_numpy to get the values:
df1["col3"] = df2.iloc[df1.col2].to_numpy()
df1
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
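One difference between these options is how they treat a value that is not present in df2. The sketch below rebuilds the sample frames and shows that Series.map returns NaN for a missing label, whereas iloc would raise an IndexError for an out-of-range position. The value 99 and the name missing are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [5, 2, 1]})
df2 = pd.DataFrame({"col1": [f"qa{i}" for i in range(6)]})

# Look up each col2 value in df2's index and take the matching col1 entry
df1["col3"] = df1["col2"].map(df2["col1"])
print(df1)

# A label that is absent from df2's index maps to NaN instead of raising
missing = pd.Series([2, 99]).map(df2["col1"])
print(missing)
```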
My question is related to my previous question, but it is different, so I am asking a new one.
In that question, see the answer by @jezrael.
df = pd.DataFrame({'col1':[1,1,1],
'col2':[4,4,6],
'col3':[7,7,9],
'col4':[3,3,5]})
print (df)
col1 col2 col3 col4
0 1 4 7 3
1 1 4 7 3
2 1 6 9 5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
col4 col3 result_col
col1 col2
1 4 1 2 2.0
6 1 1 1.0
Now I also want to count the occurrences of a specific value in col4. Say I want the count of col4 == 3 in the same query.
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'}) ... + count(col4=='3')
How can I do this in the same query? I have tried the code below, but it does not give me a solution.
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique','col4':'x: lambda x[x == 7].count()'})
Do some preprocessing by adding col4 == 3 as a column ahead of time. Then aggregate:
df.assign(result_col=df.col4.eq(3).astype(int)).groupby(
['col1', 'col2']
).agg(dict(col3='size', col4='nunique', result_col='sum'))
col3 result_col col4
col1 col2
1 4 2 2 1
6 1 0 1
old answers
g = df.groupby(['col1', 'col2'])
g.agg({'col3':'size','col4': 'nunique'}).assign(
result_col=g.col4.apply(lambda x: x.eq(3).sum()))
col3 col4 result_col
col1 col2
1 4 2 1 2
6 1 1 0
slightly rearranged
g = df.groupby(['col1', 'col2'])
final_df = g.agg({'col3':'size','col4': 'nunique'})
final_df.insert(1, 'result_col', g.col4.apply(lambda x: x.eq(3).sum()))
final_df
col3 result_col col4
col1 col2
1 4 2 2 1
6 1 0 1
I think you need to aggregate with a list of functions in the dict for column col4.
If you need to count the values equal to 3, the simplest way is to sum the True values in x == 3:
df1 = (df.groupby(['col1','col2'])
         .agg({'col3':'size','col4': ['nunique', lambda x: (x == 3).sum()]}))
df1 = df1.rename(columns={'<lambda>':'count_3'})
df1.columns = ['{}_{}'.format(x[0], x[1]) for x in df1.columns]
print (df1)
col4_nunique col4_count_3 col3_size
col1 col2
1 4 1 2 2
6 1 0 1
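As an alternative that the answers above do not use, pandas 0.25 and newer support named aggregation, which sets the output column names directly and avoids renaming '<lambda>' and flattening the MultiIndex afterwards. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 1],
                   "col2": [4, 4, 6],
                   "col3": [7, 7, 9],
                   "col4": [3, 3, 5]})

# Each keyword is an output column: (input column, aggregation)
df1 = df.groupby(["col1", "col2"]).agg(
    col3_size=("col3", "size"),
    col4_nunique=("col4", "nunique"),
    col4_count_3=("col4", lambda x: (x == 3).sum()),
)
print(df1)
```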