Reshape dataframe from long to wide - python

My df:

import pandas as pd

d = {'project_id': [19, 20, 19, 20, 19, 20],
     'task_id': [11, 22, 11, 22, 11, 22],
     'task': ['task_1', 'task_1', 'task_1', 'task_1', 'task_1', 'task_1'],
     'username': ['tom', 'jery', 'tom', 'jery', 'tom', 'jery'],
     'image_id': [101, 202, 303, 404, 505, 606],
     'frame': [0, 0, 9, 8, 11, 11],
     'label': ['foo', 'foo', 'bar', 'xyz', 'bar', 'bar']}
df = pd.DataFrame(data=d)
My df is in long format; it contains some duplicates, and only image_id is unique.
I am trying to pivot it to wide format by username, using pd.pivot and pd.merge.
My code:
My code:
pd.pivot(df, index=['task', 'frame', 'image_id'], columns='username', values='label')
My output (screenshot not included) still has one row per image_id.
What I expected (or want to reach) is a summary without image_id: for each frame, which label each user applied.

You can add a groupby.first after the pivot:
(pd.pivot(df, index=['task', 'frame', 'image_id'],
          columns='username', values='label')
   .groupby(level=['task', 'frame']).first()
)
Or use pivot_table with aggfunc='first':
pd.pivot_table(df, index=['task', 'frame'],
               columns='username', values='label',
               aggfunc='first')
Output:
username        jery   tom
task   frame
task_1 0         foo   foo
       8         xyz  None
       9        None   bar
       11        bar   bar
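Note that pd.pivot requires each index/column pair to be unique, which is why image_id had to stay in the pivot's index, while pivot_table absorbs the duplicates through aggfunc. If the same user could ever tag a frame with several distinct labels, 'first' would silently drop the extras; a join-based aggfunc (an assumption about what you might want, not part of the original answer) keeps them all:

# Keep every distinct label per (task, frame, username) instead of only the
# first one; the lambda is a hypothetical alternative to aggfunc='first'.
pd.pivot_table(df, index=['task', 'frame'], columns='username',
               values='label', aggfunc=lambda s: ','.join(s.unique()))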


Copy matching value from one df to another given multiple conditions

I have two dataframes. The first, df1, has a non-unique ID and a timestamp value in ms. The other, df2, has the non-unique ID, a separate unique ID, a start time and an end time (both in ms).
I need to get the correct unique ID for each row in df1 from df2. I would do this by:

1. matching each non-unique ID in df1 to the relevant set of rows in df2
2. of those rows, finding the one whose start/end range contains the timestamp from df1
3. copying the unique ID from the resulting row into a new column in df1
I don't think I can use pd.merge since I need to compare the df1 timestamp to two different columns in df2. I would think df.apply is my answer, but I can't figure it out.
Here is some dummy code:
import pandas as pd

df1_dict = {
    'nonunique_id': ['abc', 'def', 'ghi', 'jkl'],
    'timestamp': [164.3, 2071.2, 1001.7, 846.4]
}
df2_dict = {
    'nonunique_id': ['abc', 'abc', 'def', 'def', 'ghi', 'ghi', 'jkl', 'jkl'],
    'unique_id': ['a162c1', 'md85k', 'dk102', 'l394j', 'dj4n5', 's092k', 'dh567', '57ghed0'],
    'time_start': [160, 167, 2065, 2089, 1000, 1010, 840, 876],
    'time_end': [166, 170, 2088, 3000, 1009, 1023, 875, 880]
}
df1 = pd.DataFrame(data=df1_dict)
df2 = pd.DataFrame(data=df2_dict)
And here is a manual test...
df2['unique_id'][(df2['nonunique_id'].eq('abc'))
                 & (df2['time_start'] <= 164.3)
                 & (df2['time_end'] >= 164.3)]
...which returns the expected output (the relevant unique ID from df2):
0 a162c1
Name: unique_id, dtype: object
I'd like a function that can apply the above manual test automatically, and copy the results to a new column in df1.
I tried this...
def unique_id_fetcher(nonunique_id, timestamp):
    cond_1 = df2['nonunique_id'].eq(nonunique_id)
    cond_2 = df2['time_start'] <= timestamp
    cond_3 = df2['time_end'] >= timestamp
    unique_id = df2['unique_id'][cond_1 & cond_2 & cond_3]
    return unique_id

df1['unique_id'] = df1.apply(unique_id_fetcher(df1['nonunique_id'], df1['timestamp']))
...but that results in:
ValueError: Can only compare identically-labeled Series objects
IIUC, you can do a cartesian product of both dataframes with a merge and apply your range logic, then build a dict and map the values back onto df1 using nonunique_id as the key:
df1['key'] = 'var'
df2['key'] = 'var'
df3 = pd.merge(df1, df2, on=['key', 'nonunique_id'], how='outer')
df4 = df3.loc[
    (df3['timestamp'] >= df3['time_start']) & (df3['timestamp'] <= df3['time_end'])
]
d = dict(zip(df4['nonunique_id'], df4['unique_id']))
df1['unique_id'] = df1['nonunique_id'].map(d)
print(df1.drop('key', axis=1))
  nonunique_id  timestamp unique_id
0          abc      164.3    a162c1
1          def     2071.2     dk102
2          ghi     1001.7     dj4n5
3          jkl      846.4     dh567
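For reference, the apply in the question failed because unique_id_fetcher was called once with whole Series as arguments, so df2's columns were compared against differently-labeled df1 Series. A row-wise sketch with apply(axis=1) also works, and unlike the dict above it stays correct even if df1 ever holds several timestamps for the same nonunique_id:

# Row-wise variant of the question's unique_id_fetcher: apply(axis=1) hands
# the function one row at a time, so every comparison is scalar vs. Series.
def fetch_unique_id(row):
    match = df2.loc[
        df2['nonunique_id'].eq(row['nonunique_id'])
        & (df2['time_start'] <= row['timestamp'])
        & (df2['time_end'] >= row['timestamp']),
        'unique_id'
    ]
    return match.iloc[0] if not match.empty else None

df1['unique_id'] = df1.apply(fetch_unique_id, axis=1)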

Add two pandas DataFrame columns that differ only by suffix (e.g. "A_x", "A_y") and rename their sum to "A"

How can I add two pandas DataFrame columns that differ only by suffix, e.g. "A_x" and "A_y", and rename the resulting sum to "A"?
For example, I have data like this (screenshot not included).
The summed columns must be renamed without any suffix, i.e. to CT_1 or CT_2, etc.
Use:
import numpy as np

df = pd.DataFrame([np.arange(6)],
                  columns=['a', 's', 'CT_1_x', 'CT_1_y', 'CT_2_x', 'CT_2_y'])
print (df)
   a  s  CT_1_x  CT_1_y  CT_2_x  CT_2_y
0  0  1       2       3       4       5

df = (df.set_index(['a', 's'])
        .groupby(lambda x: x.rsplit('_', 1)[0], axis=1)
        .sum()
        .reset_index())
print (df)
   a  s  CT_1  CT_2
0  0  1     5     9
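As an aside, DataFrame.groupby(..., axis=1) is deprecated in recent pandas. A transpose-based sketch of the same idea (run against the sample frame as first constructed above) avoids it:

# Transpose, group the former column labels by their name minus the last
# '_suffix', sum within each group, then transpose back.
tmp = df.set_index(['a', 's'])
df = tmp.T.groupby(lambda c: c.rsplit('_', 1)[0]).sum().T.reset_index()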
To add the two columns:

df['A'] = df['A_x'] + df['A_y']

and if you want to remove the original columns:

df = df.drop(columns=['A_x', 'A_y'])

If you have too many such columns (col2sum = ['A_1', 'A_2', ...]) to type by hand, the best way would be to melt the df into long form:

dfm = pd.melt(df, id_vars=???, value_vars=col2sum)

and go from there (e.g. groupby); see the sketch below.
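A fuller, self-contained sketch of that melt route, reusing the sample frame from the first answer (the id_vars choice is an assumption based on that frame):

import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(6)],
                  columns=['a', 's', 'CT_1_x', 'CT_1_y', 'CT_2_x', 'CT_2_y'])
value_cols = [c for c in df.columns if c.startswith('CT_')]

# Melt to long form, derive the target name by dropping the last '_suffix',
# then sum per group and pivot back to wide.
dfm = df.melt(id_vars=['a', 's'], value_vars=value_cols)
dfm['group'] = dfm['variable'].str.rsplit('_', n=1).str[0]
out = (dfm.groupby(['a', 's', 'group'])['value'].sum()
          .unstack('group').reset_index())
print(out)   # columns: a, s, CT_1, CT_2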

Retrieve multiple lookup values in large dataset?

I have two dataframes:
import pandas as pd
data = [['138249', 'Cat'],
        ['103669', 'Cat'],
        ['191826', 'Cat'],
        ['196655', 'Cat'],
        ['103669', 'Cat'],
        ['116780', 'Dog'],
        ['184831', 'Dog'],
        ['196655', 'Dog'],
        ['114333', 'Dog'],
        ['123757', 'Dog']]
df1 = pd.DataFrame(data, columns=['Hash', 'Name'])
print(df1)
data2 = ['138249', '103669', '191826', '196655',
         '116780', '184831', '114333', '123757']
df2 = pd.DataFrame(data2, columns=['Hash'])
I want to write code that takes each item in the second dataframe, scans the Hash values in the first dataframe, and returns all matching Name values joined into a single cell in the second dataframe.
Here's the result I am aiming for (screenshot not included):
Here's what I have tried:
# Attempt one: use groupby to squish up the dataset. No results.
past = df1.groupby('Hash')
print(past)

# Attempt two: use merge. Result: empty dataframe.
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)

# Attempt three: use pivot. Result: not the right format.
past2 = df1.pivot(index=None, columns='Hash', values='Name')
print(past2)
I can do this in Excel with the VBA code here, but that code crashes when I apply it to my real dataset (likely because it is too big: approximately 30,000 rows).
IIUC, first aggregate Name per Hash in df1 with ','.join, then reindex using df2.Hash:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()
     Hash     Name
0  138249      Cat
1  103669  Cat,Cat
2  191826      Cat
3  196655  Cat,Dog
4  116780      Dog
5  184831      Dog
6  114333      Dog
7  123757      Dog
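As a side note, attempt two came back empty because right_index=True compares df1's string hashes against df2's default integer index. A merge-based sketch of the same result, joining on the Hash column after aggregating:

# Aggregate the names per hash first, then left-merge onto df2 so every hash
# keeps its row even when it has no match.
agg = df1.groupby('Hash', as_index=False)['Name'].agg(','.join)
out = df2.merge(agg, on='Hash', how='left')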

How can I convert a Group By series to a Dataframe?

I have this DataFrame:
import pandas as pd
df = pd.DataFrame({
    "Name": ["Bob", "Bryan", "Bob", "Bryan", "Bryan"],
    "Value": [10, 20, 15, 50, 45]})
Then I got the minimum value per person:
df1 = df.groupby(["Name"])["Value"].min()
This is quite simple. However, I want to keep working with DataFrames, but df1 is a Series:
type(df1)
How can I convert it to a dataframe again?
Use parameter as_index=False in DataFrame.groupby:
df1 = df.groupby(["Name"],as_index=False)["Value"].min()
Or add Series.reset_index:
df1 = df.groupby(["Name"])["Value"].min().reset_index()
print (df1)
    Name  Value
0    Bob     10
1  Bryan     20
You may also use agg(), which returns a DataFrame:
df1 = df.groupby("Name").agg({'Value': 'min'}).reset_index()

I want to extract the QSTS_ID column, delimit it by full stop, and append it to the existing dataframe as separate columns

When applying the code below, I am getting NaN values in the entire QSTS_ID column (screenshot of the result not included):

df['QSTS_ID'] = df['QSTS_ID'].str.split('.', expand=True)
df

I want to copy the entire QSTS_ID column, delimit it by full stop, and append the resulting parts at the end with new headers.
The problem is that with expand=True, str.split returns a DataFrame with one or more columns, so assigning it to a single column produces NaNs.
The solution is to add the new columns to the original DataFrame with join or concat; add_prefix changes the names of the new columns:
df = df.join(df['QSTS_ID'].str.split('.', expand=True).add_prefix('QSTS_ID_'))

df = pd.concat([df, df['QSTS_ID'].str.split('.', expand=True).add_prefix('QSTS_ID_')],
               axis=1)

If you also want to remove the original column:

df = df.join(df.pop('QSTS_ID').str.split('.', expand=True).add_prefix('QSTS_ID_'))

df = pd.concat([df,
                df.pop('QSTS_ID').str.split('.', expand=True).add_prefix('QSTS_ID_')],
               axis=1)
Sample:
df = pd.DataFrame({
    'QSTS_ID': ['val_k.lo', 'val2.s', 'val3.t'],
    'F': list('abc')
})

df1 = df['QSTS_ID'].str.split('.', expand=True).add_prefix('QSTS_ID_')
df = df.join(df1)
print (df)
    QSTS_ID  F QSTS_ID_0 QSTS_ID_1
0  val_k.lo  a     val_k        lo
1    val2.s  b      val2         s
2    val3.t  c      val3         t
# check the names of the new columns
print (df1.columns)
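If you also want to apply the new headers the question asks for, you can rename the split columns directly. A self-contained sketch (the header names are placeholders, and it assumes every QSTS_ID contains exactly one full stop):

# Split, give the parts descriptive (made-up) names, then join them back.
df = pd.DataFrame({'QSTS_ID': ['val_k.lo', 'val2.s', 'val3.t'], 'F': list('abc')})
parts = df.pop('QSTS_ID').str.split('.', expand=True)
parts.columns = ['QSTS_base', 'QSTS_ext']   # placeholder header names
df = df.join(parts)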
