I have an example dataframe as follows
p1 p2 p3 score
0 1 a t1 0.408718
1 1 a t2 0.694732
2 1 a t3 0.001077
3 1 b t1 0.250646
4 1 b t2 0.877506
5 1 b t3 0.033305
6 2 a t1 0.735524
7 2 a t2 0.055166
8 2 a t3 0.579875
9 2 b t1 0.579199
10 2 b t2 0.785301
11 2 b t3 0.339372
p1, p2 and p3 are parameters. What I would like to do is to select the rows whose p1 and p2 values give the maximum average score, where the average is taken over p3.
For example, in the given dataframe this function should return any one of rows 9, 10 and 11, since the mean of their p3 scores (0.579199, 0.785301, 0.339372) = 0.567958 is the maximum I can get for any (p1, p2) pair.
My try so far (using pandas groupby) is as follows:
temp = []
for eachgroup in df.groupby(['p1', 'p2']).groups.keys():
    temp.append(df.groupby(['p1', 'p2']).get_group(eachgroup)['score'])

temp1 = []
for each in temp:
    temp1.append(each.mean())

maxidx = temp1.index(max(temp1))
temp[maxidx].index
This returns the following output:
Int64Index([9, 10, 11], dtype='int64')
However, this is very inefficient and works only for smaller dataframes. How can I do the same for bigger dataframes?
In your case
s=df.groupby(['p1','p2']).score.transform('mean')
s.index[s==s.max()]
Out[239]: Int64Index([9, 10, 11], dtype='int64')
Using groupby and transform:
>>> df.groupby(['p1', 'p2']).score.transform('mean').idxmax()
9
If instead you want the combination of p1 and p2 that corresponds with this maximum:
>>> df.groupby(['p1', 'p2']).score.mean().idxmax()
(2, 'b')
The latter would be helpful if you wanted to view the rows that produced the maximum average:
df.set_index(['p1', 'p2']).loc[(2, 'b')]
       p3     score
p1 p2
2  b   t1  0.579199
   b   t2  0.785301
   b   t3  0.339372
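Putting the pieces together, a minimal sketch (assuming df is the example frame from the question) that returns every row of the best (p1, p2) group in one go:

# mean score per (p1, p2) group, broadcast back onto each row
group_means = df.groupby(['p1', 'p2'])['score'].transform('mean')
# keep only the rows belonging to the group with the highest mean
best_rows = df[group_means == group_means.max()]
# best_rows.index -> Int64Index([9, 10, 11], dtype='int64')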
One-liner: group by p1 and p2, take the mean of the score column for each group, then get the index of the maximum value in the aggregated series.
df.groupby(['p1', 'p2'])['score'].agg(lambda x: x.mean()).idxmax()
(2, 'b')
Supposing I have a data frame that looks like:
col1 col2
0 10
1 23
2 21
3 15
I want to work down col2 sequentially, subtracting from each value the result of the previous row, so that it is the previously subtracted value (not the original one) that gets subtracted, giving:
col1 col2
0 10 # left unchanged as index == 0
1 13 # 23 - 10
2 8 # 21 - 13
3 7 # 15 - 8
Other solutions that I have found all subtract the previous values as is, and not the new subtracted value. I would like to avoid using for loops as I have a very large dataset.
Work through the recurrence below to understand what 'previously subtracted' means:
b2 = a2 - a1
b3 = a3 - b2 = a3 - a2 + a1
b4 = a4 - b3 = a4 - a3 + a2 - a1
b5 = a5 - b4 = a5 - a4 + a3 - a2 + a1
So we just do
s = np.arange(len(df))%2
s = s + s - 1
df['new'] = np.tril(np.multiply.outer(s,s)).dot(df.col2)
Out[47]: array([10, 13, 8, 7])
Below is a simple pure-pandas approach (no numpy import needed) that is conceptually more straightforward and easy to understand from the code alone, without additional explanation.
Let's first define a function which will do the required work:
def ssf(val):
    global last_val
    last_val = val - last_val
    return last_val
Using the function above, the code for creating the new column is:
last_val = 0
df['new'] = df.col2.apply(ssf)
Let's compare the number of functions/methods used by the pure-pandas approach with the numpy approach in the other answer.
The Pandas approach uses 2 functions/methods, ssf() and .apply(), and 1 operation: a simple subtraction.
The numpy approach uses 5 functions/methods, .arange(), len(), .tril(), .multiply.outer() and .dot(), and 3 operations: array addition, array subtraction and a modulo.
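If you want to avoid both the global variable and the O(n²) outer product, the same recurrence b_i = a_i - b_(i-1) can also be expressed with itertools.accumulate; a minimal sketch, assuming the sample frame from the question:

from itertools import accumulate
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3], 'col2': [10, 23, 21, 15]})

# accumulate feeds each result back into the next step, so every element
# becomes the current value minus the previously subtracted value
df['new'] = list(accumulate(df['col2'], lambda prev, cur: cur - prev))
# df['new'] -> [10, 13, 8, 7]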
Consider I have this dataframe:
data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df
The data looks like:
A B C
0 11 10 13
1 16 15 45
2 35 14 9
The real data consists of about a hundred columns and a thousand rows.
I have a function whose aim is to count how many values in one column are higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows
An example call of the function looks like this:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3 because the minimum value of df['B'] is 10 and three values in df['A'] are higher than 10.
The problem is that I want to compute this pairwise over all columns using that function, and I don't know an effective and efficient way to do it. I want the output in a form similar to a confusion matrix or a correlation matrix.
Example output:
A B C
A X 3 X
B X X X
C X X X
This is O(n²m), where n is the number of columns and m the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})
Result:
>>> m
A B C
A 2 3 3
B 2 2 3
C 2 2 2
In theory O(n log(n) m) is possible.
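One way to read that, sketched below (my own rough take, not benchmarked): sort each column once, then count the values above each minimum with np.searchsorted instead of broadcasting full comparisons.

import numpy as np
import pandas as pd

minima = df.min()
# sort every column once
sorted_cols = {r: np.sort(df[r].values) for r in df.columns}
# count, via binary search, how many values in column r exceed the minimum of column c
m = pd.DataFrame(
    {c: [len(df) - np.searchsorted(sorted_cols[r], minima[c], side='right')
         for r in df.columns]
     for c in df.columns},
    index=df.columns)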
from itertools import product

pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for each_pair in pairs:
    # making sure that we are calculating min only once
    min_ = min_value.get(each_pair[1], df[each_pair[1]].min())
    min_value[each_pair[1]] = min_
    count = df[df[each_pair[0]] > min_][each_pair[0]].count()
    output.append(count)

df_desired = pd.DataFrame(
    [output[i: i+len(df.columns)] for i in range(0, len(output), len(df.columns))],
    columns=df.columns, index=df.columns)
print(df_desired)
A B C
A 2 3 3
B 2 2 3
C 2 2 2
I have a dataframe with 2 identifier columns (ID1, ID2), 3 numeric columns (X1, X2, X3) and a column titled INPUT (6 columns in total), with n rows. For each row, I want to get the index n of the last column for which INPUT - (X1 + X2 + ... + Xn) >= 0 is still true.
How can I do this in Python?
In R I did this by using:
tmp = data
for (i in 4:5)
{
  data[,i] <- tmp$input - rowSums(tmp[,3:i])
}
output <- apply((data[,3:5]), 1, function(x) max(which(x>0)))
data$output <- output
I am trying to translate this into Python. What might be the best way to do this? There can be N such rows, and M such columns.
Sample Data:
ID1  ID2  X1  X2  X3  INPUT  OUTPUT  (explanation)
a    b    1   2   3   3      2       (X1 = 1, X1+X2 = 3, X1+X2+X3 = 6 ... and after 2 sums, INPUT < sums)
a1   a2   5   2   1   4      0       (X1 = 5, X1+X2 = 7, X1+X2+X3 = 8 ... and even for 1 sum, INPUT < sums)
a2   b2   0   4   5   100    3       (X1 = 0, X1+X2 = 4, X1+X2+X3 = 9 ... even after 3 sums, INPUT > sums)
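To make the rule concrete, here is a rough pandas sketch of what the OUTPUT column seems to encode (running sums of the X columns compared against INPUT); I am not sure it is the idiomatic translation of the R code:

import pandas as pd

df = pd.DataFrame({'ID1': ['a', 'a1', 'a2'], 'ID2': ['b', 'a2', 'b2'],
                   'X1': [1, 5, 0], 'X2': [2, 2, 4], 'X3': [3, 1, 5],
                   'INPUT': [3, 4, 100]})

# running sums X1, X1+X2, X1+X2+X3 for every row
cums = df[['X1', 'X2', 'X3']].cumsum(axis=1)
# OUTPUT = how many running sums are still <= INPUT
df['OUTPUT'] = cums.le(df['INPUT'], axis=0).sum(axis=1)
# df['OUTPUT'] -> [2, 0, 3], matching the sample data above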
You can use the Pandas module, which handles this very effectively in Python.
import pandas as pd

# Taking a sample data here
df = pd.DataFrame([
    ['A', 'B', 1, 3, 4, 0.1],
    ['K', 'L', 10, 3, 14, 0.5],
    ['P', 'H', 1, 73, 40, 0.6]], columns=['ID1', 'ID2', 'X2', 'X3', 'X4', 'INPUT'])

# Below code does the functionality you would want.
df['new_column'] = df[['X2', 'X3', 'X4']].max(axis=1)
I have two dataframes:
1) Contains a list of suppliers and their Lat,Long coordinates
sup_essential = pd.DataFrame({'supplier': ['A', 'B', 'C'],
                              'coords': [(51.1235, -0.3453), (52.1245, -0.3423), (53.1235, -1.4553)]})
2) A list of stores and their lat, long coordinates
stores_essential = pd.DataFrame({'storekey': [1, 2, 3],
                                 'coords': [(54.1235, -0.6553), (49.1245, -1.3423), (50.1235, -1.8553)]})
I want to create an output table that has: store, store_coordinates, supplier, supplier_coordinates, distance for every combination of store and supplier.
I currently have:
test = []
for row in sup_essential.iterrows():
    for row in stores_essential.iterrows():
        r = sup_essential['supplier'], stores_essential['storeKey']
        test.append(r)
But this just gives me repeats of all the values
Source DFs
In [105]: sup
Out[105]:
coords supplier
0 (51.1235, -0.3453) A
1 (52.1245, -0.3423) B
2 (53.1235, -1.4553) C
In [106]: stores
Out[106]:
coords storekey
0 (54.1235, -0.6553) 1
1 (49.1245, -1.3423) 2
2 (50.1235, -1.8553) 3
Solutions:
import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric

# the haversine metric expects (lat, lon) pairs expressed in radians
dist = DistanceMetric.get_metric('haversine')
# cross join via a dummy key so every supplier is paired with every store
m = pd.merge(sup.assign(x=0), stores.assign(x=0), on='x', suffixes=['1','2']).drop('x',1)
# split the coordinate tuples into separate lat/lon columns
d1 = sup[['coords']].assign(lat=sup.coords.str[0], lon=sup.coords.str[1]).drop('coords',1)
d2 = stores[['coords']].assign(lat=stores.coords.str[0], lon=stores.coords.str[1]).drop('coords',1)
# pairwise great-circle distances, scaled by the Earth's radius in km
m['dist_km'] = np.ravel(dist.pairwise(np.radians(d1), np.radians(d2)) * 6367)
Result:
In [135]: m
Out[135]:
coords1 supplier coords2 storekey dist_km
0 (51.1235, -0.3453) A (54.1235, -0.6553) 1 334.029670
1 (51.1235, -0.3453) A (49.1245, -1.3423) 2 233.213416
2 (51.1235, -0.3453) A (50.1235, -1.8553) 3 153.880680
3 (52.1245, -0.3423) B (54.1235, -0.6553) 1 223.116901
4 (52.1245, -0.3423) B (49.1245, -1.3423) 2 340.738587
5 (52.1245, -0.3423) B (50.1235, -1.8553) 3 246.116984
6 (53.1235, -1.4553) C (54.1235, -0.6553) 1 122.997130
7 (53.1235, -1.4553) C (49.1245, -1.3423) 2 444.459052
8 (53.1235, -1.4553) C (50.1235, -1.8553) 3 334.514028
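As a side note, on pandas 1.2 or newer the dummy-column trick is not needed; the cross product itself can be written directly (a small sketch, with sup and stores as above):

# cross join every supplier with every store without the helper column
m = sup.merge(stores, how='cross', suffixes=['1', '2'])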
I'm looking for the most efficient way of finding the intersection of two different-sized matrices. Each matrix has three variables (columns) and a varying number of observations (rows). For example, matrices a and b:
a = np.matrix('1 5 1003; 2 4 1002; 4 3 1008; 8 1 2005')
b = np.matrix('7 9 1006; 4 4 1007; 7 7 1050; 8 2 2003; 9 9 3000; 7 7 1000')
If I set the tolerance for each column as col1 = 1, col2 = 2, and col3 = 10, I want a function that outputs the indices of the rows in a and b that are within their respective tolerances, for example:
[x1, x2] = func(a, b, col1, col2, col3)
print x1
>> [2 3]
print x2
>> [1 3]
You can see from the indices that element 2 of a is within the tolerances of element 1 of b, and element 3 of a is within the tolerances of element 3 of b.
I'm thinking I could loop through each element of matrix a, check if it's within the tolerances of each element in b, and do it that way. But it seems inefficient for very large data sets.
Any suggestions for alternatives to a looping method for accomplishing this?
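To make the looping idea above concrete, this is roughly the baseline I have in mind (a plain double loop over the rows of a and b, with the per-column tolerances from the question):

import numpy as np

a = np.matrix('1 5 1003; 2 4 1002; 4 3 1008; 8 1 2005')
b = np.matrix('7 9 1006; 4 4 1007; 7 7 1050; 8 2 2003; 9 9 3000; 7 7 1000')
tol = [1, 2, 10]

x1, x2 = [], []
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        # keep the pair only if every column difference is within its tolerance
        if all(abs(a[i, k] - b[j, k]) < tol[k] for k in range(a.shape[1])):
            x1.append(i)
            x2.append(j)
# x1, x2 -> [2, 3], [1, 3]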
If you don't mind working with NumPy arrays, you could exploit broadcasting for a vectorized solution. Here's the implementation -
# Set tolerance values for each column
tol = [1, 2, 10]
# Get absolute differences between a and b keeping their columns aligned
diffs = np.abs(np.asarray(a[:,None]) - np.asarray(b))
# Compare each row with the triplet from `tol`.
# Get mask of all matching rows and finally get the matching indices
x1,x2 = np.nonzero((diffs < tol).all(2))
Sample run -
In [46]: # Inputs
...: a=np.matrix('1 5 1003; 2 4 1002; 4 3 1008; 8 1 2005')
...: b=np.matrix('7 9 1006; 4 4 1007; 7 7 1050; 8 2 2003; 9 9 3000; 7 7 1000')
...:
In [47]: # Set tolerance values for each column
...: tol = [1, 2, 10]
...:
...: # Get absolute differences between a and b keeping their columns aligned
...: diffs = np.abs(np.asarray(a[:,None]) - np.asarray(b))
...:
...: # Compare each row with the triplet from `tol`.
...: # Get mask of all matching rows and finally get the matching indices
...: x1,x2 = np.nonzero((diffs < tol).all(2))
...:
In [48]: x1,x2
Out[48]: (array([2, 3]), array([1, 3]))
Large-data case: if you are working with huge data sizes that cause memory issues, then since you already know the number of columns is small (3 here), you might want to use a minimal loop of 3 iterations and save a huge memory footprint, like so -
na = a.shape[0]
nb = b.shape[0]
accum = np.ones((na, nb), dtype=bool)
for i in range(a.shape[1]):
    accum &= np.abs((a[:,i] - b[:,i].ravel())) < tol[i]
x1, x2 = np.nonzero(accum)