Pandas vlookup like operation to a list - python

I am unable to properly explain my requirement, but I can show the expected result.
I have a dataframe that looks like so:
Series1  Series2
1370307  1370306
 927092   927091
 925392   925391
 925390   925389
2344089  2344088
1827855  1827854
1715793  1715792
2356467  2356466
1463264  1463263
1712684  1712683
actual dataframe size: 902811 rows × 2 columns
Then I have another dataframe of the unique values of Series2, which I've built using value_counts:
df2 = df['Series2'].value_counts().rename_axis('Series2').to_frame('counts').reset_index()
Then I need a list of matching Series1 values for each Series2 value:
The expected result is:
Series2  counts  Series1_List
2543113       6  [2543114, 2547568, 2559207, 2563778, 2564330, 2675803]
2557212       6  [2557213, 2557301, 2559192, 2576080, 2675693, 2712790]
2432032       5  [2432033, 2444169, 2490928, 2491392, 2528056]
2559269       5  [2559270, 2576222, 2588034, 2677710, 2713207]
2439554       5  [2439555, 2441882, 2442272, 2443590, 2443983]
2335180       5  [2335181, 2398282, 2527060, 2527321, 2565487]
2494111       4  [2494112, 2495321, 2526026, 2528492]
2559195       4  [2559196, 2570172, 2634537, 2675718]
2408775       4  [2408776, 2409117, 2563765, 2564320]
2408773       4  [2408774, 2409116, 2563764, 2564319]
I achieve this (although only for a subset of 50 rows) using the following code:
df2.loc[:50,'Series1_List'] = df2.loc[:50,'Series2'].apply(lambda x: df[df['Series2']==x]['Series1'].tolist())
If I do this for the whole dataframe, it doesn't complete even after 20 minutes.
So the question is: is there a faster, more efficient way of achieving this result?

IIUC, use:
df2 = (df.groupby('Series2', as_index=False)
         .agg(counts=('Series1', 'count'), Series1_List=('Series1', list))
      )
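As a minimal sketch of this on toy data (values invented here), with a sort added to mimic the value_counts ordering:
import pandas as pd

df = pd.DataFrame({'Series1': [11, 12, 21], 'Series2': [10, 10, 20]})
df2 = (df.groupby('Series2', as_index=False)
         .agg(counts=('Series1', 'count'), Series1_List=('Series1', list))
         .sort_values('counts', ascending=False))
print(df2)
#    Series2  counts Series1_List
# 0       10       2     [11, 12]
# 1       20       1         [21]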

Related

Run values of one dataframe through another and find the index of similar value from dataframe

I have two dataframes, each consisting of a single column with 62 values:
Distance_obs = [0.0, 0.9084, 2.1931, 2.85815, 3.3903, 3.84815, 4.2565,
                4.6287, 4.97295, 5.29475, 5.598, 5.8856, 6.15975, 6.4222,
                6.67435, 6.9173, 7.152, 7.37925, 7.5997, 7.8139, 8.02235,
                8.22555, 8.42385, 8.61755, 8.807, 8.99245, 9.17415, 9.35235,
                9.5272, 9.6989, 9.86765, 10.0335, 10.1967, 10.3574, 10.5156,
                10.6714, 10.825, 10.9765, 11.1259, 11.2732, 11.4187, 11.5622,
                11.7041, 11.8442, 11.9827, 12.1197, 12.2552, 12.3891, 12.5216,
                12.6527, 12.7825, 12.9109, 13.0381, 13.1641, 13.2889, 13.4126,
                13.5351, 13.6565, 13.7768, 13.8961, 14.0144, 14.0733]
and
Cell_mid = [0.814993, 1.96757, 2.56418, 3.04159, 3.45236, 3.8187, 4.15258,
            4.46142, 4.75013, 5.02221, 5.28026, 5.52624, 5.76172, 5.98792,
            6.20588, 6.41642, 6.62027, 6.81802, 7.01019, 7.19723, 7.37952,
            7.55742, 7.73122, 7.9012, 8.0676, 8.23063, 8.39049, 8.54736,
            8.70141, 8.85277, 9.00159, 9.14798, 9.29207, 9.43396, 9.57374,
            9.71152, 9.84736, 9.98136, 10.1136, 10.2441, 10.373, 10.5003,
            10.626, 10.7503, 10.8732, 10.9947, 11.1149, 11.2337, 11.3514,
            11.4678, 11.5831, 11.6972, 11.8102, 11.9222, 12.0331, 12.143,
            12.2519, 12.3599, 12.4669, 12.573, 12.6782, 12.7826]
I want to run each element of Distance_obs through the values in Cell_mid and find the index of the nearest matching value.
I have been trying the following:
for i in Distance_obs:
    Required_value = (np.abs(Cell_mid - i)).idxmin()
But I get this error:
ufunc 'subtract' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32')
One way to do this could be as follows (as an aside, the dtype('<U32') in that error suggests the values were read in as strings, so they would need converting to float first):
Use pd.merge_asof, passing "nearest" to the direction parameter.
Then, from the merged result, select column Cell_mid and use Series.map with a pd.Series whose values and index are those of the original df2, swapped.
df['Cell_mid_index'] = pd.merge_asof(df, df2,
                                     left_on='Distance_obs',
                                     right_on='Cell_mid',
                                     direction='nearest')\
                       ['Cell_mid'].map(pd.Series(df2['Cell_mid'].index.values,
                                                  index=df2['Cell_mid']))
print(df.head())
Distance_obs Cell_mid_index
0 0.00000 0
1 0.90840 0
2 2.19310 1
3 2.85815 3
4 3.39030 4
So, at the intermediate step, we had a merged df like this:
print(pd.merge_asof(df, df2, left_on='Distance_obs',
                    right_on='Cell_mid', direction='nearest').head())
Distance_obs Cell_mid
0 0.00000 0.814993
1 0.90840 0.814993
2 2.19310 1.967570
3 2.85815 3.041590
4 3.39030 3.452360
And then with .map we are retrieving the appropriate index values from df2.
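A NumPy broadcasting alternative (a sketch, not part of the answer above) gives the same indices; it is fine at this size (a 62 x 62 difference matrix), though merge_asof scales better on long sorted inputs:
import numpy as np

obs = df['Distance_obs'].to_numpy()
mid = df2['Cell_mid'].to_numpy()
# For each observation, take the position of the closest Cell_mid value.
df['Cell_mid_index'] = np.abs(obs[:, None] - mid[None, :]).argmin(axis=1)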
Data used
import pandas as pd
Distance_obs = [0.0, 0.9084, 2.1931, 2.85815, 3.3903, 3.84815, 4.2565,
4.6287, 4.97295, 5.29475, 5.598, 5.8856, 6.15975, 6.4222,
6.67435, 6.9173, 7.152, 7.37925, 7.5997, 7.8139, 8.02235,
8.22555, 8.42385, 8.61755, 8.807, 8.99245, 9.17415, 9.35235,
9.5272, 9.6989, 9.86765, 10.0335, 10.1967, 10.3574, 10.5156,
10.6714, 10.825, 10.9765, 11.1259, 11.2732, 11.4187, 11.5622,
11.7041, 11.8442, 11.9827, 12.1197, 12.2552, 12.3891, 12.5216,
12.6527, 12.7825, 12.9109, 13.0381, 13.1641, 13.2889, 13.4126,
13.5351, 13.6565, 13.7768, 13.8961, 14.0144, 14.0733]
df = pd.DataFrame(Distance_obs, columns=['Distance_obs'])
Cell_mid = [0.814993, 1.96757, 2.56418, 3.04159, 3.45236, 3.8187, 4.15258,
4.46142, 4.75013, 5.02221, 5.28026, 5.52624, 5.76172, 5.98792,
6.20588, 6.41642, 6.62027, 6.81802, 7.01019, 7.19723, 7.37952,
7.55742, 7.73122, 7.9012, 8.0676, 8.23063, 8.39049, 8.54736,
8.70141, 8.85277, 9.00159, 9.14798, 9.29207, 9.43396, 9.57374,
9.71152, 9.84736, 9.98136, 10.1136, 10.2441, 10.373, 10.5003,
10.626, 10.7503, 10.8732, 10.9947, 11.1149, 11.2337, 11.3514,
11.4678, 11.5831, 11.6972, 11.8102, 11.9222, 12.0331, 12.143,
12.2519, 12.3599, 12.4669, 12.573, 12.6782, 12.7826]
df2 = pd.DataFrame(Cell_mid, columns=['Cell_mid'])

Apply function to multiple rows - pandas

Suppose I have a dataframe like this
0 5 10 15 20 25 ...
action_0_Q0 0.299098 0.093973 0.761735 0.058112 0.013463 0.164322 ...
action_0_Q1 0.463095 0.468425 0.202679 0.742424 0.865005 0.479546 ...
action_0_Q2 0.237807 0.437602 0.035587 0.199465 0.121532 0.356132 ...
action_1_Q0 0.263191 0.176407 0.471295 0.082457 0.029566 0.426428 ...
action_1_Q1 0.508573 0.490355 0.431732 0.249432 0.189732 0.396947 ...
action_1_Q2 0.228236 0.333238 0.096973 0.668111 0.780702 0.176625 ...
action_2_Q0 0.256632 0.122589 0.495720 0.059918 0.824424 0.384998 ...
action_2_Q1 0.485362 0.462969 0.420790 0.211578 0.155771 0.186493 ...
action_2_Q2 0.258006 0.414442 0.083490 0.728504 0.019805 0.428509 ...
This dataframe may be very large (a lot of rows, about 3000 columns).
What I have to do is apply a function to each column, which in turn returns a distance matrix. However, the function must be applied to 3 rows at a time. For example, taking the first column:
a = distance_function([[0.299098, 0.463095, 0.237807], [0.263191, 0.508573, 0.228236], [0.256632, 0.485362, 0.258006]])
# Returns
print(a.shape) -> (3,3)
Now, this is not overly complicated via a for loop, but the time required would be huge. Is there some alternative way?
IIUC use:
df = df.apply(lambda x: distance_function(x.to_numpy().reshape(-1,3)))
If you need to flatten the values:
from itertools import chain
df = df.apply(lambda x: list(chain.from_iterable(distance_function(x.to_numpy().reshape(-1,3)))))
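As a self-contained check of the flattened variant, here is a sketch with a stand-in distance_function (pairwise Euclidean distances between the three action rows; the asker's real function isn't shown):
import numpy as np
import pandas as pd
from itertools import chain

def distance_function(rows):
    m = np.asarray(rows)                      # (3, k): one row per action
    diff = m[:, None, :] - m[None, :, :]      # (3, 3, k) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))  # (3, 3) distance matrix

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((9, 4)))         # 9 rows = 3 actions x 3 Q-rows, as in the question
out = df.apply(lambda x: list(chain.from_iterable(
    distance_function(x.to_numpy().reshape(-1, 3)))))
print(out[0])                                 # flattened 3x3 distances for column 0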

Find columns within a certain percentile of a DataFrame

Given a multi-column dataframe, how can I keep the part of the dataframe that falls between the 25th and 75th percentiles of each column?
I need to remove the rows (which are just time steps) that have values outside the 25-75 percentile range.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    '400.0': [13.909261, 13.758734, 13.513627, 13.095409, 13.628918, 12.782643, 13.278548, 13.160153, 12.155895, 12.152373, 12.147820, 13.023997, 15.010729, 13.006050, 13.002356],
    '401.0': [14.581624, 14.173803, 13.757856, 14.223524, 14.695623, 13.818065, 13.300235, 13.173674, 14.145402, 14.144456, 13.142969, 13.022471, 14.010802, 14.006181, 14.002641],
    '402.0': [15.253988, 15.588872, 15.002085, 15.351638, 14.762327, 14.853486, 15.321922, 14.187195, 15.134910, 15.136539, 15.138118, 15.020945, 15.010875, 15.006313, 15.002927],
    '403.0': [15.633908, 14.833914, 15.146499, 15.431543, 15.798185, 14.874350, 14.333470, 14.192128, 15.130119, 15.134795, 15.136049, 15.019307, 15.012037, 15.006674, 15.003002],
})
I expect to see a lower number of rows, so I have to eliminate a range of measurements that act as outliers in the timeseries.
This is from the original data set, where the x-axis shows the rows. So I need to remove this blob somehow by setting a percentile criterion.
In the end I'd take the strictest criterion and apply it to the entire dataframe.
I'm not 100% sure this is what you want, but IIUC, you can create a mask, then apply it to your dataframe.
df1[df1.apply(lambda x: x.between(x.quantile(.25), x.quantile(.75))).all(1)]
400.0 401.0 402.0 403.0
8 12.155895 14.145402 15.134910 15.130119
9 12.152373 14.144456 15.136539 15.134795
That will drop any row which contains any value in any column that falls outside your range.
If instead you want to only drop rows which contain all values that fall outside your range, you can use:
df1[df1.apply(lambda x: x.between(x.quantile(.25), x.quantile(.75))).any(1)]
400.0 401.0 402.0 403.0
2 13.513627 13.757856 15.002085 15.146499
3 13.095409 14.223524 15.351638 15.431543
5 12.782643 13.818065 14.853486 14.874350
6 13.278548 13.300235 15.321922 14.333470
7 13.160153 13.173674 14.187195 14.192128
8 12.155895 14.145402 15.134910 15.130119
9 12.152373 14.144456 15.136539 15.134795
10 12.147820 13.142969 15.138118 15.136049
11 13.023997 13.022471 15.020945 15.019307
12 15.010729 14.010802 15.010875 15.012037
13 13.006050 14.006181 15.006313 15.006674
14 13.002356 14.002641 15.002927 15.003002
Rows are retained here if any of the values in any column falls within the percentile range in its respective column.
It is going to be much faster to operate on the underlying numpy arrays here:
a = df1.values
q1 = np.quantile(a, q=0.25, axis=0)
q2 = np.quantile(a, q=0.75, axis=0)
mask = ((q1 < a) & (a < q2)).all(1)
df1[mask]
400.0 401.0 402.0 403.0
8 12.155895 14.145402 15.134910 15.130119
9 12.152373 14.144456 15.136539 15.134795
Invert the mask (df1[~mask]) if you want to exclude those rows instead.
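For a pandas-only variant of the same mask (a sketch; comparing a DataFrame with a Series aligns on column labels, giving per-column bounds):
q1, q3 = df1.quantile(0.25), df1.quantile(0.75)
between = df1.gt(q1) & df1.lt(q3)   # elementwise, per-column quantile bounds
df1[between.all(axis=1)]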

Python pandas split column with NaN values

Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a CSV file that I have imported via pandas in Python. The column is named CATEGORY and contains 1, 2, or 3 values separated by commas (e.g. 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
I think this could be accomplished by splitting the strings first, then creating three new columns, each assigned from a lambda function applied to the split values. Like so:
split_cats = products_combined['CATEGORY'].str.split(',')
products_combined['SUBCATEGORY'] = split_cats.apply(lambda parts: parts[1] if len(parts) > 1 else None)
products_combined['SUBSUBCATEGORY'] = split_cats.apply(lambda parts: parts[2] if len(parts) > 2 else None)
products_combined['CATEGORY'] = split_cats.apply(lambda parts: parts[0])
The apply() method called on a series returns a new series that contains the result of running the passed function (in this case, the lambda function) on each row of the original series.
IIUC, use split and then Series:
(
    df[0].apply(lambda x: pd.Series(x.split(",")))
        .rename(columns={0: "CATEGORY", 1: "SUBCATEGORY", 2: "SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)
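On the original error: str.split(expand=True) without a separator splits on whitespace, and the assignment needs exactly three resulting columns. Splitting on ',' with expand=True pads short rows with None, so a sketch like the following avoids the ValueError (assuming no row has more than three parts):
parts = products_combined['CATEGORY'].str.split(',', expand=True)
# Trim the name list in case the data never has all three parts.
parts.columns = ['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY'][:parts.shape[1]]
products_combined[parts.columns] = parts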

Operating on pandas dataframes that may or may not be multiIndex

I have a few functions that make new columns in a pandas dataframe, as a function of existing columns in the dataframe. I have two different scenarios that occur here: (1) the dataframe is NOT MultiIndex and has a set of columns, say [a,b], and (2) the dataframe is MultiIndex and now has the same set of column headers repeated N times, say [(a,1),(b,1),(a,2),(b,2),...,(a,N),(b,N)].
I've been making the aforementioned functions in the style shown below:
def f(df):
    if multiindex(df):
        for s in df[a].columns:
            df[c, s] = someFunction(df[a, s], df[b, s])
    else:
        df[c] = someFunction(df[a], df[b])
Is there another way to do this, without having these if-multi-index/else statements everywhere and duplicating the someFunction code? I'd prefer NOT to split the multi-indexed frame into N smaller dataframes (I often need to filter data or do things and keep the rows consistent across all the 1,2,...,N frames, and keeping them together in one frame seems to be the best way to do that).
You may still have to test whether the columns are a MultiIndex, but this should be cleaner and more efficient. Caveat: this will not work if your function uses summary statistics on the column, for example if someFunction divides by the average of column 'a' (after stacking, column 'a' contains all N sub-columns at once, so the average would be computed across all of them rather than per sub-column).
Solution
def someFunction(a, b):
    return a + b

def f(df):
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        df = df.stack()
    df['c'] = someFunction(df['a'], df['b'])
    if ismi:
        df = df.unstack()
    return df
Setup
import pandas as pd
import numpy as np
setup_tuples = []
for c in ['a', 'b']:
    for i in ['one', 'two', 'three']:
        setup_tuples.append((c, i))
columns = pd.MultiIndex.from_tuples(setup_tuples)
rand_array = np.random.rand(10, len(setup_tuples))
df = pd.DataFrame(rand_array, columns=columns)
df looks like this
a b
one two three one two three
0 0.282834 0.490313 0.201300 0.140157 0.467710 0.352555
1 0.838527 0.707131 0.763369 0.265170 0.452397 0.968125
2 0.822786 0.785226 0.434637 0.146397 0.056220 0.003197
3 0.314795 0.414096 0.230474 0.595133 0.060608 0.900934
4 0.334733 0.118689 0.054299 0.237786 0.658538 0.057256
5 0.993753 0.552942 0.665615 0.336948 0.788817 0.320329
6 0.310809 0.199921 0.158675 0.059406 0.801491 0.134779
7 0.971043 0.183953 0.723950 0.909778 0.103679 0.695661
8 0.755384 0.728327 0.029720 0.408389 0.808295 0.677195
9 0.276158 0.978232 0.623972 0.897015 0.253178 0.093772
I constructed df to have MultiIndex columns. What I'd do is use the .stack() method to push the second level of the column index to be the second level of the row index.
df.stack() looks like this
a b
0 one 0.282834 0.140157
three 0.201300 0.352555
two 0.490313 0.467710
1 one 0.838527 0.265170
three 0.763369 0.968125
two 0.707131 0.452397
2 one 0.822786 0.146397
three 0.434637 0.003197
two 0.785226 0.056220
3 one 0.314795 0.595133
three 0.230474 0.900934
two 0.414096 0.060608
4 one 0.334733 0.237786
three 0.054299 0.057256
two 0.118689 0.658538
5 one 0.993753 0.336948
three 0.665615 0.320329
two 0.552942 0.788817
6 one 0.310809 0.059406
three 0.158675 0.134779
two 0.199921 0.801491
7 one 0.971043 0.909778
three 0.723950 0.695661
two 0.183953 0.103679
8 one 0.755384 0.408389
three 0.029720 0.677195
two 0.728327 0.808295
9 one 0.276158 0.897015
three 0.623972 0.093772
two 0.978232 0.253178
Now you can operate on df.stack() as if the columns were not a MultiIndex
Demonstration
print(f(df))
will give you what you want
          a                             b                             c  \
        one     three       two       one     three       two       one
0  0.282834  0.201300  0.490313  0.140157  0.352555  0.467710  0.422991
1  0.838527  0.763369  0.707131  0.265170  0.968125  0.452397  1.103697
2  0.822786  0.434637  0.785226  0.146397  0.003197  0.056220  0.969183
3  0.314795  0.230474  0.414096  0.595133  0.900934  0.060608  0.909928
4  0.334733  0.054299  0.118689  0.237786  0.057256  0.658538  0.572519
5  0.993753  0.665615  0.552942  0.336948  0.320329  0.788817  1.330701
6  0.310809  0.158675  0.199921  0.059406  0.134779  0.801491  0.370215
7  0.971043  0.723950  0.183953  0.909778  0.695661  0.103679  1.880821
8  0.755384  0.029720  0.728327  0.408389  0.677195  0.808295  1.163773
9  0.276158  0.623972  0.978232  0.897015  0.093772  0.253178  1.173173

      three       two
0  0.553855  0.958023
1  1.731494  1.159528
2  0.437834  0.841446
3  1.131408  0.474704
4  0.111555  0.777227
5  0.985944  1.341759
6  0.293454  1.001412
7  1.419611  0.287632
8  0.706915  1.536622
9  0.717744  1.231410
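A quick check of the non-MultiIndex branch, using f and someFunction from the Solution above:
flat = pd.DataFrame({'a': [1.0, 2.0], 'b': [10.0, 20.0]})
print(f(flat))
#      a     b     c
# 0  1.0  10.0  11.0
# 1  2.0  20.0  22.0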
