I have a df; you can reproduce it by copying and running the following code:
import pandas as pd
from io import StringIO
df = """
b_id duration1 duration2 user
366 NaN 38 days 22:05:06.807430 Test
367 0 days 00:00:05.285239 NaN Test
368 NaN NaN Test
371 NaN NaN Test
378 NaN 451 days 14:59:28.830482 Test
384 28 days 21:05:16.141263 0 days 00:00:44.999706 Test
466 NaN 38 days 22:05:06.807430 Tom
467 0 days 00:00:05.285239 NaN Tom
468 NaN NaN Tom
471 NaN NaN Tom
478 NaN 451 days 14:59:28.830482 Tom
484 28 days 21:05:16.141263 0 days 00:00:44.999706 Tom
"""
df = pd.read_csv(StringIO(df.strip()), sep=r'\s\s+', engine='python')
df
My question is: how can I get the mean value of each duration column for each user?
The output should be something like this (the mean values below are made up for illustration, not the exact means):
mean_duration1 mean_duration2 user
8 days 22:05:06.807430 3 days 22:05:06.807430 Test
2 days 00:00:05.285239 4 days 22:05:06.807430 Tom
You can use:
out = (df
.set_index('user')
.filter(like='duration')
.apply(pd.to_timedelta)
.groupby(level=0).mean()
.reset_index()
)
Output:
user duration1 duration2
0 Test 14 days 10:32:40.713251 163 days 12:21:46.879206
1 Tom 14 days 10:32:40.713251 163 days 12:21:46.879206
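If you also want the mean_ prefix from your expected output, the same pipeline can add it before resetting the index — a sketch:
out = (df
 .set_index('user')
 .filter(like='duration')          # keep only the duration columns
 .apply(pd.to_timedelta)           # parse the strings into Timedelta values
 .groupby(level=0).mean()
 .add_prefix('mean_')              # duration1 -> mean_duration1, etc.
 .reset_index()
)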
I would really appreciate assistance in developing Python/pandas code to left-join 2 separate CSV files. I'm very new to Python, so I'm unclear how to begin. I have used Excel plugins to achieve the same, but they would normally hang or take hours to complete due to the huge number of records being processed.
Scenario: joining on the first column.
CSV1
AN DNM OCD TRI
1 343 5656 90
2 344 5657 91
3 345 5658 92
4 346 5659 93
CSV2
AN2 STATE PLAN
4 3 19
3 2 35
7 3 19
8 3 19
Desired result, including a match status if possible:
AN DNM OCD TRI STATE PLAN Join Status
1 343 5656 90 No_match
2 344 5657 91 No_match
3 345 5658 92 2 35 Match
4 346 5659 93 3 19 Match
All help appreciated.
You can use .merge with the indicator= parameter:
out = df1.merge(
df2, left_on="AN", right_on="AN2", indicator="Join Status", how="left"
)
out = out.drop(columns="AN2")
out["Join Status"] = out["Join Status"].map(
{"left_only": "No_match", "both": "Match"}
)
print(out)
Prints:
AN DNM OCD TRI STATE PLAN Join Status
0 1 343 5656 90 NaN NaN No_match
1 2 344 5657 91 NaN NaN No_match
2 3 345 5658 92 2.0 35.0 Match
3 4 346 5659 93 3.0 19.0 Match
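Note that STATE and PLAN come back as float (2.0, 35.0) because the unmatched rows hold NaN. If you'd rather keep whole numbers, one option is pandas' nullable integer dtype — a sketch:
# Int64 (capital I) allows missing values without falling back to float
out[['STATE', 'PLAN']] = out[['STATE', 'PLAN']].astype('Int64')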
Let's assume you have df1 and df2 and you want to merge the two dataframes:
df = df1.merge(df2, how='left', left_on='AN', right_on='AN2')
I hope this helps.
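This plain merge doesn't produce the Join Status column. If you need it without indicator=, a sketch that derives it from the unmatched AN2 values before dropping them:
import numpy as np

df = df1.merge(df2, how='left', left_on='AN', right_on='AN2')
# unmatched rows have NaN in the right-hand key column
df['Join Status'] = np.where(df['AN2'].isna(), 'No_match', 'Match')
df = df.drop(columns='AN2')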
I want to make a new df with simple metrics like mean, sum, min, max calculated on the Value column in the df visible below, grouped by ID, Date and Key.
index   ID  Key        Date  Value    x    y       z
0      655  321  2021-01-01     50  546  235  252345
1      675  321  2021-01-01     50  345  345   34545
2      654  356  2021-02-02     70  345  346     543
I am doing it like this:
final = df.groupby(['ID','Date','Key'])['Value'].first().mean(level=[0,1]).reset_index().rename(columns={'Value':'Value_Mean'})
I use .first() because one Key can occur multiple times in the df but they all have the same Value. I want to aggregate on ID and Date so I am using level=[0,1].
and then I add the next metric with a pandas merge:
final = final.merge(df.groupby(['ID','Date','Key'])['Value'].first().max(level=[0,1]).reset_index().rename(columns={'Value':'Value_Max'}), on=['ID','Date'])
And I go on like that for the other metrics. I wonder if there is a more sophisticated way to do it than repeating this over multiple lines. I know you can use .agg() and pass a dict of functions, but it seems that way it isn't possible to specify the level, which is important here.
Use DataFrame.drop_duplicates with named aggregation:
df = pd.DataFrame({'ID':[655,655,655,675,654], 'Key':[321,321,333,321,356],
'Date':['2021-01-01','2021-01-01','2021-01-01','2021-01-01','2021-02-02'],
'Value':[50,30,10,50,70]})
print (df)
ID Key Date Value
0 655 321 2021-01-01 50
1 655 321 2021-01-01 30
2 655 333 2021-01-01 10
3 675 321 2021-01-01 50
4 654 356 2021-02-02 70
final = (df.drop_duplicates(['ID','Date','Key'])
.groupby(['ID','Date'], as_index=False).agg(Value_Mean=('Value','mean'),
Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
An alternative with a double groupby instead of drop_duplicates:
final = (df.groupby(['ID','Date','Key'], as_index=False)
.first()
.groupby(['ID','Date'], as_index=False).agg(Value_Mean=('Value','mean'),
Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
Or aggregate the Value column directly and prefix the resulting statistic columns:
df = (df.groupby(['ID','Date','Key'], as_index=False)
        .first()
        # group with the default index so add_prefix only touches the statistic columns
        .groupby(['ID','Date'])['Value']
        .agg(['mean', 'max'])
        .add_prefix('Value_')
        .reset_index())
print (df)
ID Date Value_mean Value_max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
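Any of the variants above extends directly to the other metrics you mentioned (sum, min, ...) without the repeated merges — for example, with the first approach:
final = (df.drop_duplicates(['ID','Date','Key'])
           .groupby(['ID','Date'], as_index=False)
           .agg(Value_Mean=('Value','mean'),
                Value_Sum=('Value','sum'),
                Value_Min=('Value','min'),
                Value_Max=('Value','max')))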
Here are 2 dataframes
df1:
Index Number Name Amount
0 123 John 31
1 124 Alle 33
2 312 Amy 33
3 314 Holly 35
df2:
Index Number Name Amount
0 312 Amy 13
1 124 Alle 35
2 317 Jack 53
The resulting dataframe should look like this
result_df:
Index Number Name Amount Curr_amount
0 123 John 31 31
1 124 Alle 33 68
2 312 Amy 33 46
3 314 Holly 35 35
4 317 Jack 53
I have tried using pandas isin, but it only tells me, as a boolean, whether the Number was present. Is there any way to do this efficiently?
Use merge with an outer join and then Series.add (or Series.sub if necessary):
df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print (df)
Number Name Amount Amount_curr
0 123 John 31.0 31.0
1 124 Alle 33.0 68.0
2 312 Amy 33.0 46.0
3 314 Holly 35.0 35.0
4 317 Jack NaN 53.0
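If you need the exact column name and integer display from your result_df, a small follow-up (a sketch using the nullable Int64 dtype so Jack's missing Amount stays blank):
df = df.rename(columns={'Amount_curr': 'Curr_amount'})
df[['Amount', 'Curr_amount']] = df[['Amount', 'Curr_amount']].astype('Int64')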
I want to break down multi level columns and have them as a column value.
Original data input (Excel screenshot omitted). As read into a dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame (screenshot omitted): the dates should end up as row values in a single column, with the products as regular columns.
I have tried stack() and unstack() and also swaplevel(), but I couldn't get the dates column to 'drop' into rows. It looks like merged cells in Excel produce NaN in the dataframe, and when it's the columns that are merged, I get unnamed columns. How do I work around this? Am I missing something really simple here?
Use stack on the outer column level:
df.stack(level=0).reset_index(level=1)
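This one-liner assumes the columns are already a two-level MultiIndex of (date, product). Starting from an Excel sheet like the one shown, you can usually get there by reading both header rows at once — a sketch, assuming the layout above and a hypothetical file name data.xlsx:
import pandas as pd

# header=[0, 1] reads the date row and the product row together as a
# column MultiIndex (pandas forward-fills the merged date cells);
# index_col=[0, 1] keeps the two company columns as the row index
df = pd.read_excel('data.xlsx', header=[0, 1], index_col=[0, 1])

# move the date level (level=0) of the columns down into the rows;
# the products stay as columns
out = df.stack(level=0).reset_index()
out = out.rename(columns={'level_2': 'Date'})  # the stacked level comes back unnamed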
For example, for this data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age':['12',np.nan,'32','21','55'],
                   'Height':["5'7","5'8","5'5",np.nan,"5'10"],
                   'Weight':[np.nan,'160','165','155','170'],
                   'Gender':['M','M',np.nan,'F',np.nan],
                   'Salary':[2900,6550000,7840000,6550000,8950000]})
I want output as:
Age Height Weight Gender Salary
0 12 5'7 NaN M 2.9K
1 NaN 5'8 160 M 6.55M
2 32 5'5 165 NaN 7.84M
3 21 NaN 155 F 6.55M
4 55 5'10 170 NaN 8.95M
One option:
df = pd.DataFrame({'Age':['12',np.nan,'32','21','55'],
'Height':["5'7","5'8","5'5",np.nan,"5'10"],
'Weight':[np.nan,'160','165','155','170'],
'Gender':['M','M',np.nan,'F',np.nan],
'Salary':[29000,650,7840000,6550000,8950000]})
df['s'] = df['Salary'].apply(lambda x:
    f"{round(x / 1e6, 2)}M" if x >= 1e6      # millions
    else f"{round(x / 1e3, 2)}K" if x > 1e3  # thousands
    else f"{x:,}")                           # leave small values as-is
which gives:
Age Height Weight Gender Salary s
0 12 5'7 NaN M 29000 29.0K
1 NaN 5'8 160 M 650 650
2 32 5'5 165 NaN 7840000 7.84M
3 21 NaN 155 F 6550000 6.55M
4 55 5'10 170 NaN 8950000 8.95M
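As a side note, the same labels can be produced without apply by operating on the whole column at once — a sketch with numpy.select and the same thresholds:
import numpy as np

s = df['Salary']
df['s'] = np.select(
    [s >= 1e6, s > 1e3, s <= 1e3],          # conditions checked in order
    [(s / 1e6).round(2).astype(str) + 'M',
     (s / 1e3).round(2).astype(str) + 'K',
     s.astype(str)])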