How to see distribution of values in a dataframe - python

I have a dataframe that loads a CSV. The CSV is like this:
PROFIT STRING
16 A_B_C_D
3 A_D_C
-4 A_D_C
20 A_X_C
10 A_F_S
PROFIT is a float, STRING is a list of characters. The underscore "_" separates them, so that A_B_C_D would be A, B, C and D individually.
I'm trying to see the profit distribution by character.
e.g.:
A:
Total profit = 16+3-4+20+10 = 45
Mean = xxx
Median = yyy
B:
Total profit = 16
Mean = zzzz
etc...
Can this be done using pandas, and if so how?

Split and explode by column STRING, then do groupby + agg on column PROFIT:
df.assign(STRING=df['STRING'].str.split('_'))\
  .explode('STRING').groupby('STRING')['PROFIT'].agg(['sum', 'mean', 'median'])
sum mean median
STRING
A 45 9.00 10.0
B 16 16.00 16.0
C 35 8.75 9.5
D 15 5.00 3.0
F 10 10.00 10.0
S 10 10.00 10.0
X 20 20.00 20.0
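
For completeness, a minimal self-contained sketch of the same approach, with the sample data from the question rebuilt by hand:
import pandas as pd

df = pd.DataFrame({
    'PROFIT': [16, 3, -4, 20, 10],
    'STRING': ['A_B_C_D', 'A_D_C', 'A_D_C', 'A_X_C', 'A_F_S'],
})

stats = (
    df.assign(STRING=df['STRING'].str.split('_'))  # 'A_B_C_D' -> ['A', 'B', 'C', 'D']
      .explode('STRING')                           # one row per character
      .groupby('STRING')['PROFIT']
      .agg(['sum', 'mean', 'median'])
)
print(stats)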

Related

Find cumcount and agg func based on past records of each group

I have a dataframe like as shown below
df = pd.DataFrame(
    {'stud_name': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],
     'qty': [123, 31, 490, 518, 70, 900],
     'trans_date': ['13/11/2020', '10/1/2018', '11/11/2017',
                    '27/03/2016', '13/05/2010', '14/07/2008']})
I would like to do the below:
a) for each stud_name, look at their past data (full past data) and compute the min, max and mean of the qty column.
Please note that the 1st record/row for every unique stud_name will be NaN, because there is no past data (history) to compute the aggregate statistics from.
I tried something like below, but the output is incorrect:
df['trans_date'] = pd.to_datetime(df['trans_date'])
df.sort_values(by=['stud_name','trans_date'],inplace=True)
df['past_transactions'] = df.groupby('stud_name').cumcount()
df['past_max_qty'] = df.groupby('stud_name')['qty'].expanding().max().values
df['past_min_qty'] = df.groupby('stud_name')['qty'].expanding().min().values
df['past_avg_qty'] = df.groupby('stud_name')['qty'].expanding().mean().values
I expect my output to be as shown below.
We can use a custom function to calculate the past statistics per student:
def past_stats(q):
    return (
        q.expanding()
         .agg(['max', 'min', 'mean'])   # running stats, current row included
         .shift()                       # shift down one row: exclude the current row
         .add_prefix('past_')
    )

df.join(df.groupby('stud_name')['qty'].apply(past_stats))
stud_name qty trans_date past_max past_min past_mean
2 ABC 490 2017-11-11 NaN NaN NaN
1 ABC 31 2018-10-01 490.0 490.0 490.0
0 ABC 123 2020-11-13 490.0 31.0 260.5
5 DEF 900 2008-07-14 NaN NaN NaN
4 DEF 70 2010-05-13 900.0 900.0 900.0
3 DEF 518 2016-03-27 900.0 70.0 485.0
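
The attempt in the question fails because expanding() includes the current row in its window, so the stats are never purely "past"; the shift() inside past_stats is what excludes it. A hedged sketch of the same fix without a custom function, assuming df is already sorted by stud_name and trans_date:
exp = df.groupby('stud_name')['qty'].expanding().agg(['max', 'min', 'mean'])
# shift within each group so the current row is excluded, then drop the
# group level so the result aligns with df's original index
exp = exp.groupby(level=0).shift().droplevel(0)
exp.columns = ['past_max_qty', 'past_min_qty', 'past_avg_qty']
df = df.join(exp)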

Compare two dataframes and add new values in second dataframe to the first data-frame

I have two dataframes with the same headers.
df1
       Date  prix moyen  mini  maxi  H-Value  C-Value
0  17/09/20           8     6     9      122  2110122
1  15/09/20           8     6     9      122  2110122
2  10/09/20           8     6     9      122  2110122
and
df2
       Date  prix moyen  mini   maxi  H-Value   C-Value
1  07/09/17        1.80  1.50   2.00      170   3360170
1  17/09/20        8.00  6.00   9.00      122   2110122
2  17/09/20        9.00  8.00  12.00      122   2150122
3  17/09/20       10.00  8.00  12.00      122  14210122
I want to compare the two dataframes along 3 parameters (Date, H-Value and C-Value), identify the new values present in df2 (rows which do not occur in df1), and then append them to df1.
I am using
df_unique = df2[~(df2['Date'].isin(df1['Date']) & df2['H-Value'].isin(df1['H-Value']) & df2['C-Value'].isin(df1['C-Value']) )].dropna().reset_index(drop=True)
and it is not correctly identifying the new values in df2. The resulting table only identifies some rows and not others.
Where am I going wrong?
What is your question?
In [4]: df2[~(df2['Date'].isin(df1['Date']) & df2['H-Value'].isin(df1['H-Value']) & df2['C-Value'].isin(df1['C-Value']))].dropna().reset_index(drop=True)
Out[4]:
       Date  prix moyen  mini   maxi  H-Value   C-Value
0  07/09/17         1.8   1.5    2.0      170   3360170
1  17/09/20         9.0   8.0   12.0      122   2150122
2  17/09/20        10.0   8.0   12.0      122  14210122
These are all rows in df2 that are not present in df1. Looks good to me...
I was actually able to solve the problem. The issue was not the command used to compare the two datasets, but rather the fact that one of the columns in df2 had a data format different from the same column in df1, making a direct comparison impossible.
Here's what I'd try:
cols = ['Date', 'H-Value', 'C-Value']
df1 = pd.concat([df1, df2[~df2.set_index(cols).index.isin(df1.set_index(cols).index)]])
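
An alternative, hedged sketch of the same "append only the new rows" idea using merge with indicator=True; note it assumes the three key columns share the same dtype in both frames, which was the actual root cause above:
import pandas as pd

keys = ['Date', 'H-Value', 'C-Value']
# mark which df2 rows have a key match in df1; deduplicate df1's keys so the
# merge cannot multiply rows
marked = df2.merge(df1[keys].drop_duplicates(), on=keys, how='left', indicator=True)
new_rows = df2[(marked['_merge'] == 'left_only').to_numpy()]
df1 = pd.concat([df1, new_rows], ignore_index=True)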

Creating bins and extracting values for that bin

I have a pandas dataframe which looks like
Temperature_lim Factor
0 32 0.95
1 34 1.00
2 36 1.06
3 38 1.10
4 40 1.15
I need to extract the factor value for any given temperature: if my current temperature is 31, the factor is 0.95; if it is 33, the factor is 1.00; if it is 38.5, the factor is 1.15. So, given my current temperature, I would like to know the factor for it.
I can do this using multiple if/else statements, but is there a more effective way to do it by creating bins/intervals in pandas or python?
Thank you
Use cut: prepend -np.inf to the values of column Temperature_lim to form the bin edges, and fill temperatures beyond the last edge with the last Factor value:
import numpy as np

df1 = pd.DataFrame({'Temp': [31, 33, 38.5, 40, 41]})

# bin edges: (-inf, 32), [32, 34), [34, 36), [36, 38), [38, 40)
b = [-np.inf] + df['Temperature_lim'].tolist()
lab = df['Factor']

# right=False closes the intervals on the left; temperatures at or above the
# last edge (40) fall outside every bin, become NaN, and are filled with the
# last Factor value
df1['new'] = pd.cut(df1['Temp'], bins=b, labels=lab, right=False).fillna(lab.iat[-1])
print(df1)
Temp new
0 31.0 0.95
1 33.0 1.00
2 38.5 1.15
3 40.0 1.15
4 41.0 1.15
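
A hedged alternative sketch with numpy.searchsorted, which avoids the NaN/fillna edge case entirely and produces the same values as the cut version above (it assumes Temperature_lim is sorted ascending):
import numpy as np

# index of the first limit strictly greater than each temperature, clipped so
# anything at or above the last limit maps to the last factor
idx = np.searchsorted(df['Temperature_lim'].to_numpy(), df1['Temp'].to_numpy(), side='right')
idx = np.clip(idx, 0, len(df) - 1)
df1['new2'] = df['Factor'].to_numpy()[idx]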

Filtering a dataframe by a list

I have the following dataframe
Dataframe:
Date Name Value Rank Mean
01/02/2019 A 10 100 8.2
02/03/2019 A 9 120 7.9
01/03/2019 B 3 40 6.4
03/02/2019 B 1 39 5.9
...
And following list:
date = ['01/02/2019', '03/02/2019', ...]
I would like to filter the df by the list, but as a date range: for each value in the list, I would like to bring back the rows whose Date falls between that date and 30 days before it.
I am using numpy broadcasting here. Note this method is O(n*m), which means that if both the df and the date list are huge, it can exceed the memory limit:
import numpy as np

s = pd.to_datetime(date).values
df.Date = pd.to_datetime(df.Date)
s1 = df.Date.values

# pairwise day differences: t[i, j] = list date j minus row date i
t = (s - s1[:, None]).astype('timedelta64[D]').astype(int)

# keep a row if it falls within 30 days before at least one list date
df[np.any((t >= 0) & (t <= 30), axis=1)]
Out[120]:
Date Name Value Rank Mean
0 2019-01-02 A 10 100 8.2
1 2019-02-03 A 9 120 7.9
3 2019-03-02 B 1 39 5.9
If you only need exact matches (not ranges) and your date is a string, just do:
df[df.date.isin(list_of_dates)]
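
If memory is the bottleneck, a hedged sketch that trades the n*m matrix for a loop over the (typically short) date list, building one boolean mask per date and OR-ing them together, so memory stays O(n) per iteration:
import numpy as np
import pandas as pd

dates = pd.to_datetime(date)
df['Date'] = pd.to_datetime(df['Date'])

# a row survives if it lies in [d - 30 days, d] for any date d in the list
mask = np.zeros(len(df), dtype=bool)
for d in dates:
    mask |= ((df['Date'] >= d - pd.Timedelta(days=30)) & (df['Date'] <= d)).to_numpy()
result = df[mask]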

Pandas: how to change the data type of values of a row?

I have the following DataFrame:
actor Daily Total actor1 actor2
Day
2019-01-01 25 10 15
2019-01-02 30 15 15
Avg 27.5 12.5 15.0
How do I change the data type of the 'Avg' row to integer? How do I round the values in that row?
In pandas, after adding a new row filled with floats, all columns are upcast to float.
A possible solution is to round and convert all columns:
df = df.round().astype(int)
Or append a new Series already rounded and converted to integer:
df = df.append(df.mean().rename('Avg').round().astype(int))
print (df)
Daily Total actor1 actor2
actor
2019-01-01 25 10 15
2019-01-02 30 15 15
Avg 28 12 15
If you want to convert only the columns whose 'Avg' values are whole numbers:
d = dict.fromkeys(df.columns[df.loc['Avg'] == df.loc['Avg'].astype(int)], 'int')
df = df.astype(d)
print (df)
Daily Total actor1 actor2
actor
2019-01-01 25.0 10.0 15
2019-01-02 30.0 15.0 15
Avg 27.5 12.5 15
Use loc to access the 'Avg' row, then apply numpy.round:
import numpy as np

# rounds the values in place; note the column dtypes remain float
df.loc['Avg'] = df.loc['Avg'].apply(np.round)
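
Note that DataFrame.append was removed in pandas 2.0, so the append-based variant above needs pd.concat on recent versions. A hedged sketch of the same idea:
import pandas as pd

avg = df.mean().rename('Avg').round().astype(int)
# concat works on DataFrames, so turn the one-row Series into a frame first
df = pd.concat([df, avg.to_frame().T])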
