I have a Pandas dataframe with points and the corresponding distances to other points. I am able to get the minimal value across the calculated columns; however, I need the column name itself. I can't figure out how to get the column names corresponding to those values into a new column. My dataframe looks like this:
df.head():
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 218.039561
71 100.0 381.0 925.324708 ... 647.707783 169.856557 169.856557
61 225.0 69.0 751.353014 ... 515.152768 122.377490 122.377490
Columns 0 and 1 are the datapoints; the rest are distances to datapoints #1 to 7 (in some cases the number of points can differ, but that does not really matter for the question). The code I use to compute the min is the following:
new = users.iloc[:,2:].min(axis=1)
users["min"] = new
#could also do the following way
#users.assign(Min=lambda users: users.iloc[:,2:].min(1))
This is quite simple, and there is not much to finding the minimum of multiple columns. However, I need to get the column name instead of the value. So my desired output would look like this (in the example all are 7, which is not a rule):
0 1 2 ... 6 7 min
9 58.0 94.0 984.003636 ... 696.667367 218.039561 7
71 100.0 381.0 925.324708 ... 647.707783 169.856557 7
61 225.0 69.0 751.353014 ... 515.152768 122.377490 7
Is there a simple way to achieve this?
Use df.idxmin:
In [549]: df['min'] = df.iloc[:,2:].idxmin(axis=1)
In [550]: df
Out[550]:
0 1 2 6 7 min
9 58.0 94.0 984.003636 696.667367 218.039561 7
71 100.0 381.0 925.324708 647.707783 169.856557 7
61 225.0 69.0 751.353014 515.152768 122.377490 7
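If you want to keep both the minimum value and the column it comes from, here is a minimal sketch (assuming the frame is named users as in the question; the column name min_col is just illustrative):
dist = users.iloc[:, 2:]                 # distance columns only, selected before adding derived columns
users['min'] = dist.min(axis=1)          # smallest distance per row
users['min_col'] = dist.idxmin(axis=1)   # label of the column holding that distance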
I have a dataframe as shown below
df = pd.DataFrame(
    {'stud_name': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF'],
     'qty': [123, 31, 490, 518, 70, 900],
     'trans_date': ['13/11/2020', '10/1/2018', '11/11/2017',
                    '27/03/2016', '13/05/2010', '14/07/2008']})
I would like to do the following:
a) For each stud_name, look at their past data (the full history so far) and compute the min, max and mean of the qty column.
Please note that the first record/row for every unique stud_name will be NA because there is no past data (history) to compute the aggregate statistics from.
I tried something like the below, but the output is incorrect:
df['trans_date'] = pd.to_datetime(df['trans_date'])
df.sort_values(by=['stud_name','trans_date'],inplace=True)
df['past_transactions'] = df.groupby('stud_name').cumcount()
df['past_max_qty'] = df.groupby('stud_name')['qty'].expanding().max().values
df['past_min_qty'] = df.groupby('stud_name')['qty'].expanding().min().values
df['past_avg_qty'] = df.groupby('stud_name')['qty'].expanding().mean().values
I expect my output to contain past min, max and mean columns computed only from each student's earlier rows, with NaN in the first row per student.
We can use a custom function to calculate the past statistics per student. The shift() moves the expanding aggregates down one row, so each row only sees data strictly before it (which is also what makes the first row per student NaN):
def past_stats(q):
    return (
        q.expanding()
         .agg(['max', 'min', 'mean'])
         .shift()
         .add_prefix('past_')
    )
df.join(df.groupby('stud_name')['qty'].apply(past_stats))
stud_name qty trans_date past_max past_min past_mean
2 ABC 490 2017-11-11 NaN NaN NaN
1 ABC 31 2018-10-01 490.0 490.0 490.0
0 ABC 123 2020-11-13 490.0 31.0 260.5
5 DEF 900 2008-07-14 NaN NaN NaN
4 DEF 70 2010-05-13 900.0 900.0 900.0
3 DEF 518 2016-03-27 900.0 70.0 485.0
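If the join above does not line up on your pandas version (an assumption: newer versions may prepend the group key to the index of the applied result), passing group_keys=False keeps the original row index so the join aligns as shown:
df.join(df.groupby('stud_name', group_keys=False)['qty'].apply(past_stats))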
Here's what my data look like:
user_id  prior_elapse_time  timestamp
115      NaN                0
115      10                 1000
115      5                  2000
222212   NaN                0
222212   8                  500
222212   12                 3000
222212   NaN                5000
222212   15                 8000
I found similar posts that teach me how to get the first occurrence of a user:
train_df.groupby('user_id')['prior_elapse_time'].first()
This nicely gets me the first appearance of each user. However, now I'm at a loss as to how to correctly assign 0 to the NaN only at the first occurrence of each user. Due to a logging error, you can see that NaN appears elsewhere (e.g. at timestamp 5000), but I only want to assign 0 to the NaN in each user's first row (where timestamp is 0).
I also tried
train_df['prior_elapse_time'][(train_df['prior_elapse_time'].isna()) & (train_df['timestamp'] == 0)] = 0
But then I get the "copy" vs. "view" assignment problem (which I don't fully understand).
Any help?
If your df is sorted by user_id:
>>> df.loc[df.user_id.diff().ne(0), 'prior_elapse_time'] = 0
>>> df
user_id prior_elapse_time timestamp
0 115 0.0 0
1 115 10.0 1000
2 115 5.0 2000
3 222212 0.0 0
4 222212 8.0 500
5 222212 12.0 3000
6 222212 NaN 5000
7 222212 15.0 8000
Alternatively, use pandas.Series.mask
>>> df['prior_elapse_time'] = df.prior_elapse_time.mask(df.user_id.diff().ne(0), 0)
If not sorted, then get the indices via groupby:
>>> idx = df.reset_index().groupby('user_id')['index'].first()
>>> df.loc[idx, 'prior_elapse_time'] = 0
If you want to set 0 only in places where the value was previously NaN, add a pandas.Series.isnull condition to the mask:
>>> df.loc[
...     df.user_id.diff().ne(0) & df.prior_elapse_time.isnull(),
...     'prior_elapse_time'
... ] = 0
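A hedged alternative that works whether or not the frame is sorted, using DataFrame.duplicated to flag each user's first row (first_row is just an illustrative name); the single .loc assignment also sidesteps the copy-vs-view issue from the question:
first_row = ~df.duplicated('user_id')     # True on each user's first occurrence
df.loc[first_row & df['prior_elapse_time'].isna(), 'prior_elapse_time'] = 0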
Comparing 2 series objects of different sizes:
IN[248]:df['Series value 1']
Out[249]:
0 70
1 66.5
2 68
3 60
4 100
5 12
Name: Stu_perc, dtype: int64
IN[250]:benchmark_value
#benchamrk is a subset of data from df2 only based on certain filters
Out[251]:
0 70
Name: Stu_perc, dtype: int64
Basically I wish to compare df['Series value 1'] with benchmark_value and, in a column 'Matching list', return the values which are greater than or equal to 95% of the benchmark value. Both of these are Pandas Series; however, their sizes differ, so the comparison fails.
Input given:
IN[252]:df['Matching list']=(df2['Series value 1']>=0.95*benchmark_value)
OUT[253]: ValueError: Can only compare identically-labeled Series objects
Output wanted:
[IN]:
df['Matching list']=(df2['Stu_perc']>=0.95*benchmark_value)
#0.95*Benchmark value is 66.5 in this case.
df['Matching list']
[OUT]:
0 70
1 66.5
2 68
3 NULL
4 100
5 NULL
Because benchmark_value is a Series, you need to select its first value as a scalar with Series.iat, and then set non-matching values to NaN with Series.where:
benchmark_value = pd.Series([70], index=[0])
val = benchmark_value.iat[0]
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 NaN
4 100.0 100.0
5 12.0 NaN
A more general solution, which also works if benchmark_value is empty, uses next with iter to return the first value of the Series, falling back to a default value (here 0) if none exists:
benchmark_value = pd.Series([], dtype='float64')
val = next(iter(benchmark_value), 0)
df2['Matching list']= df2['Stu_perc'].where(df2['Stu_perc']>=0.95*val)
print (df2)
Stu_perc Matching list
0 70.0 70.0
1 66.5 66.5
2 68.0 68.0
3 60.0 60.0
4 100.0 100.0
5 12.0 12.0
Is your benchmark value a single value?
If yes, you might need to convert benchmark_value, which is a Series, to a plain number (without an index), e.g. by using df['Matching list'] = (df['Stu_perc'] >= 0.95*benchmark_value.values[0])
It seems the benchmark value is a Series with a single row, not an actual number, so I believe you need to access the value first.
But that comparison returns a Series of Booleans. To get just the values that you want, you can use the where function.
Try this:
df['Matching list'] = df2['Stu_perc'].where(df2['Stu_perc'] >= 0.95*benchmark_value.iloc[0])
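To make the contrast above concrete, a small hedged sketch (assuming df2 and the single-row benchmark_value from the question; the column name matches is illustrative): the raw comparison yields Boolean flags, while where keeps the values and puts NaN elsewhere.
val = benchmark_value.iloc[0]                   # extract the scalar benchmark
df2['matches'] = df2['Stu_perc'] >= 0.95 * val  # Boolean flags
df2['Matching list'] = df2['Stu_perc'].where(df2['Stu_perc'] >= 0.95 * val)  # values or NaN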
I am trying to get the weighted mean for each column (A-F) of a Pandas DataFrame, with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable solution for ordinary (unweighted) means would be
df.mean()
Note that the df has NaN both in the columns and in "Value".
       A      B      C      D      E      F       Value
0  17656  61496     83     80    117     99     2902804
1  75078  61179     14      3      6     14     3761964
2  21316  60648     86    NaN    107     93      127963
3   6422  48468  28855  26838  27319  27011      131354
4  12378  42973  47153  46062  46634  42689  3303909572
5  54292  35896     59      6      3     18    27666367
6  21272    NaN    126     12      3      5     9618047
7  26434  35787    113     17      4      8      309943
8  10508  34314  34197   7100     10     10         NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts I don't see it.
I would appreciate any help.
You could use masked arrays, dropping the rows where the Value column itself is NaN first:
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
     ...:     np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
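If you prefer to stay within pandas, here is a hedged vectorized sketch under the same assumptions (rows with a missing weight are dropped; each column's own NaNs then fall out of both the numerator and the denominator):
dff = df.dropna(subset=['Value'])
vals = dff.drop(columns='Value')
w = dff['Value']
weighted_mean = vals.mul(w, axis=0).sum() / vals.notna().mul(w, axis=0).sum()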
I have 4188006 rows of data. I want to group my data by its Code column value, setting the Code value as the key and the corresponding rows as the value in a dict.
_a_stock_basic_data is my data:
Code date_time open high low close \
0 000001.SZ 2007-03-01 19.000000 19.000000 18.100000 18.100000
1 000002.SZ 2007-03-01 14.770000 14.800000 13.860000 14.010000
2 000004.SZ 2007-03-01 6.000000 6.040000 5.810000 6.040000
3 000005.SZ 2007-03-01 4.200000 4.280000 4.000000 4.040000
4 000006.SZ 2007-03-01 13.050000 13.470000 12.910000 13.110000
... ... ... ... ... ... ...
88002 603989.SH 2015-06-30 44.950001 50.250000 41.520000 49.160000
88003 603993.SH 2015-06-30 10.930000 12.500000 10.540000 12.360000
88004 603997.SH 2015-06-30 21.400000 24.959999 20.549999 24.790001
88005 603998.SH 2015-06-30 65.110001 65.110001 65.110001 65.110001
amt volume
0 418404992 22927500
1 659624000 46246800
2 23085800 3853070
3 131162000 31942000
4 251946000 19093500
.... ....
88002 314528000 6933840
88003 532364992 46215300
88004 169784992 7503370
88005 0 0
[4188006 rows x 8 columns]
And my code is:
_a_stock_basic_data = pandas.concat(dfs)
_all_universe = set(all_universe.values.tolist())
for _code in _all_universe:
    _temp_data = _a_stock_basic_data[_a_stock_basic_data['Code'] == _code]
    data[_code] = _temp_data[_temp_data.notnull()]
_all_universe contains the unique values of _a_stock_basic_data['Code']. Its length is about 2816, so the for loop runs 2816 times, and it takes a lot of time to complete.
So I wonder how to group these data with a high-performance method. I think multiprocessing is an option, but shared memory seems to be a problem with it. And as the data gets larger and larger, the performance of the code needs to be taken into consideration, otherwise it will cost a lot. Thank you for your help.
I'll show an example which I think will solve your problem. Below I make a dataframe with random elements, where the column Code will have duplicate values:
a = pd.DataFrame({'a': np.arange(20), 'b': np.random.random(20), 'Code': np.random.randint(0, 11, 20)})
To group by the column Code, set it as index:
a.index = a['Code']
You can now use the index to access the data by the value of Code:
In : a.loc[8]
Out:
a b Code
Code
8 1 0.589938 8
8 3 0.030435 8
8 13 0.228775 8
8 14 0.329637 8
8 17 0.915402 8
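A hedged alternative to filtering inside a loop (assuming the frame is named _a_stock_basic_data, as in the question): a single groupby pass builds the whole dict, instead of scanning the full frame once per code.
data = {code: group for code, group in _a_stock_basic_data.groupby('Code')}
data['000001.SZ']  # e.g. all rows for that code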
Did you try the pd.concat function? It lets you append frames along an axis of your choice.
pd.concat([data, _temp_data], axis=1)
- dict(_a_stock_basic_data.groupby(['Code']).size())  ## number of occurrences per Code
- dict(_a_stock_basic_data.groupby(['Code'])['Column_you_want_to_Aggregate'].sum())  ## if you want to aggregate a certain column
Would one of these help?