I have a dataframe with 2 identifiers (ID1, ID2), 3 numeric columns (X1, X2, X3), and a column titled 'input' (6 columns in total) and n rows. For each row, I want the index n of the last column for which INPUT - (X1 + X2 + ... + Xn) >= 0 still holds.
How can I do this in Python?
In R I did this by using:
tmp = data
for (i in 4:5)
{
data[,i]<- tmp$input - rowSums(tmp[,3:i])
}
output<- apply((data[,3:5]), 1, function(x) max(which(x>0)))
data$output <- output
I am trying to translate this into Python. What might be the best way to do this? There can be N such rows, and M such columns.
Sample Data:
ID1 ID2 X1 X2 X3 INPUT OUTPUT (explanation)
a b 1 2 3 3 2 (X1 = 1, X1+X2 = 3, X1+X2+X3 = 6 ... and after 2 sums, input < sums)
a1 a2 5 2 1 4 0 (X1 = 5, X1+X2 = 7, X1+X2+X3 = 8 ... and even for 1 sum, input < sums)
a2 b2 0 4 5 100 3 (X1=0, X1+X2=4, X1+X2+X3=9, ... even after 3 sums, input>sums)
You can use the pandas module, which handles this kind of row-wise operation efficiently in Python.
import pandas as pd

# Sample data
df = pd.DataFrame([
    ['A', 'B', 1, 3, 4, 0.1],
    ['K', 'L', 10, 3, 14, 0.5],
    ['P', 'H', 1, 73, 40, 0.6]], columns=['ID1', 'ID2', 'X2', 'X3', 'X4', 'INPUT'])
# Row-wise maximum of the numeric columns, stored in a new column
df['new_column'] = df[['X2', 'X3', 'X4']].max(axis=1)
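Note that max(axis=1) returns the largest value in each row, not a column position. A minimal sketch that instead follows the cumulative-sum rule from the question's sample data might look like the following. It assumes the X columns are listed in order; the names x_cols, positions and OUTPUT are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame(
    [["a", "b", 1, 2, 3, 3],
     ["a1", "a2", 5, 2, 1, 4],
     ["a2", "b2", 0, 4, 5, 100]],
    columns=["ID1", "ID2", "X1", "X2", "X3", "INPUT"],
)

x_cols = ["X1", "X2", "X3"]  # assumed column order

# Running totals X1, X1+X2, X1+X2+X3 for every row
running = df[x_cols].cumsum(axis=1)

# True wherever the running total still fits within INPUT
fits = running.le(df["INPUT"], axis=0)

# 1-based position of the last True in each row, or 0 if there is none
positions = np.arange(1, len(x_cols) + 1)
df["OUTPUT"] = (fits.to_numpy() * positions).max(axis=1)

print(df)
```

Multiplying the boolean mask by the 1-based positions and taking the row maximum picks the last column that satisfies the condition, which also works when some X values are negative and the running total is not monotone.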
Suppose I have this dataframe:
data = [[11, 10, 13], [16, 15, 45], [35, 14,9]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C'])
df
The data looks like:
A B C
0 11 10 13
1 16 15 45
2 35 14 9
The real data consists of a hundred columns and a thousand rows.
I have a function that counts how many values in one column are higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows
An example call looks like this:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3. That is because the minimum value of df['B'] is 10 and three values from df['A'] are higher than 10, so the output is 3.
The problem is that I want to compute this for every pair of columns using that function.
I don't know an effective and efficient way to do this. I want the output in a form similar to a confusion matrix or a correlation matrix.
Example output:
A B C
A X 3 X
B X X X
C X X X
This is O(n^2 m), where n is the number of columns and m the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})
Result:
>>> m
A B C
A 2 3 3
B 2 2 3
C 2 2 2
In theory O(n log(n) m) is possible.
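One possible way to avoid comparing every value against every minimum is to sort each column once and then look the minima up by binary search. The sketch below is only an illustration of that idea, not necessarily the approach the answer had in mind; it assumes the data contains no NaN values, and the names sorted_cols, counts and result are made up for the example. Sorting costs O(n m log m) and the lookups O(n^2 log m).

```python
import numpy as np
import pandas as pd

data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=["A", "B", "C"])

minima = df.min().to_numpy()                   # one minimum per column
sorted_cols = np.sort(df.to_numpy(), axis=0)   # each column sorted ascending
n_rows = len(df)

# searchsorted(side="right") gives the number of values <= each minimum;
# subtracting from the row count leaves the values strictly greater.
counts = np.array([
    n_rows - np.searchsorted(sorted_cols[:, j], minima, side="right")
    for j in range(df.shape[1])
])

# Rows: column being counted. Columns: column supplying the minimum.
result = pd.DataFrame(counts, index=df.columns, columns=df.columns)
print(result)
```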
from itertools import product

pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for row_col, min_col in pairs:
    # cache each column's minimum so it is computed only once
    if min_col not in min_value:
        min_value[min_col] = df[min_col].min()
    min_ = min_value[min_col]
    # number of values in row_col strictly greater than min_col's minimum
    count = (df[row_col] > min_).sum()
    output.append(count)
# reshape the flat list of counts into an n x n table
n = len(df.columns)
df_desired = pd.DataFrame(
    [output[i: i + n] for i in range(0, len(output), n)],
    columns=df.columns, index=df.columns)
print(df_desired)
A B C
A 2 3 3
B 2 2 3
C 2 2 2
I have a data frame that looks like this:
import numpy as np
import pandas as pd

data_dict = {'factor_1' : np.random.randint(1, 5, 10), 'factor_2' : np.random.randint(1, 5, 10), 'multi' : np.random.rand(10), 'output' : np.nan}
df = pd.DataFrame(data_dict)
I'm getting stuck implementing this comparison:
If factor_1 and factor_2 values match, then output = 2 * multi (Here 2 is kind of a base value). Continue scanning the next rows.
If factor_1 and factor_2 values don't match then:
output = -2. Scan the next row(s).
If the factor values still don't match through row R, assign output the values $-2^2, -2^3, \ldots, -2^R$ respectively.
When the factor values match at row R+1, assign output the value $2^{R+1} \cdot \text{multi}$.
Repeat the process
The end result will look like this:
This solution does not use recursion:
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
data_dict = {'factor_1' : np.random.randint(1, 5, 10), 'factor_2' : np.random.randint(1, 5, 10), 'multi' : np.random.rand(10), 'output' : np.nan}
df = pd.DataFrame(data_dict)
# mask is True where the two factors differ
mask = (df['factor_1'] != df['factor_2'])
# R = number of consecutive mismatches up to and including this row (0 on a match)
df['R'] = mask.cumsum() - mask.cumsum().where(~mask).ffill().fillna(0)
# matches get 2 * multi; mismatches get -(2 ** R)
df['output'] = np.where(df['R'] == 0, df['multi']*2, -2**df['R'])
factor_1 factor_2 multi output R
0 2 1 0.419195 -2.000000 1.0
1 4 2 0.685220 -4.000000 2.0
2 1 1 0.204452 0.408904 0.0
3 1 4 0.878117 -2.000000 1.0
4 4 2 0.027388 -4.000000 2.0
5 2 1 0.670468 -8.000000 3.0
6 4 3 0.417305 -16.000000 4.0
7 2 2 0.558690 1.117380 0.0
8 4 3 0.140387 -2.000000 1.0
9 1 1 0.198101 0.396203 0.0
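To see why the cumsum expression yields the length of the current run of mismatches, it may help to trace it on a small hand-written mask. This is just an illustrative sketch; the names running, at_last_match and run_length are not part of the answer above.

```python
import pandas as pd

# True = factors differ, False = factors match
mask = pd.Series([True, True, False, True, True, True, False])

# Total number of mismatches seen so far: 1, 2, 2, 3, 4, 5, 5
running = mask.cumsum()

# Value of that total at the most recent match (0 before the first match)
at_last_match = running.where(~mask).ffill().fillna(0)

# Mismatches since the last match: resets to 0 on every match
run_length = running - at_last_match
print(run_length.tolist())
```

Subtracting the total recorded at the last match from the running total leaves only the mismatches in the current streak, which is the exponent R used in the answer.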
The solution I present is perhaps a little harder to read, but I think it does what you want. It combines
numpy.where() to build a column based on a condition,
shift() and cumsum() on the output column to label groups of consecutive equal values, and
the groupby rank() to construct the vector of exponents applied to the previously built df['output'] column.
The code is as follows.
df['output'] = np.where(df.factor_1 == df.factor_2, -2 * df.multi, 2)
group = ['output', (df.output != df.output.shift()).cumsum()]
df['output'] = (-1) * df.output.pow(df.groupby(group).output.rank('first'))
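flag = False  # True while we are inside a run of non-matching rows
cols = ('factor_1', 'factor_2', 'multi')
outputs = []
for f1, f2, multi in zip(*[data_dict[col] for col in cols]):
    if f1 == f2:
        output = 2 * multi
        flag = False
    elif flag:
        # previous row was also a mismatch: double the (negative) value
        output *= 2
    else:
        # first mismatch after a match
        output = -2
        flag = True
    outputs.append(output)
# 'output' started as a scalar NaN, so collect the values and assign them in one go
data_dict['output'] = outputs
The tricky part is the flag variable, which tells you whether the previous row was a mismatch or not.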
I'd like to add values calculated in a for loop to a series so that it can be its own column in a dataframe. So far I've got this: the y values are from a dataframe named block.
N = 12250
for i in range(0,N-1):
    y1 = block.iloc[i]['y']
    y2 = block.iloc[i+1]['y']
    diffy[i] = y2-y1
I'd like to make diffy its own series instead of just replacing the diffy val on each loop
Some sample data (assume N = 5):
N = 5
np.random.seed(42)
block = pd.DataFrame({
'y': np.random.randint(0, 10, N)
})
y
0 6
1 3
2 7
3 4
4 6
You can calculate diffy as follows:
diffy = block['y'].diff().shift(-1)[:-1]
0 -3.0
1 4.0
2 -3.0
3 2.0
Name: y, dtype: float64
diffy is a pandas.Series. If you want a list, add .to_list(). If you want a numpy array, add .values (or .to_numpy()).
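Since the question asks for diffy as its own column, one option is to keep the full-length result and assign it directly. A small sketch follows, using the sample values shown above written out by hand; the column name diffy and the helper list loop_diffs are only illustrative. The last row has no successor, so it ends up as NaN.

```python
import pandas as pd

# Sample y values from the example above
block = pd.DataFrame({"y": [6, 3, 7, 4, 6]})

# Next value minus current value, aligned to the current row
block["diffy"] = block["y"].shift(-1) - block["y"]

# The same differences computed with the explicit loop from the question
loop_diffs = [
    block["y"].iloc[i + 1] - block["y"].iloc[i]
    for i in range(len(block) - 1)
]

print(block)
```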
So I understand we can use a pandas DataFrame to do vectorized operations on cells, like
df = pd.DataFrame([a, b, c])
df*3
would equal something like :
0 a*3
1 b*3
2 c*3
but could we use a pandas DataFrame to, say, calculate the Fibonacci sequence?
I am asking because in the Fibonacci sequence the next number depends on the previous two numbers (F_n = F_(n-1) + F_(n-2)). I am not so much interested in the Fibonacci sequence itself as in knowing whether we can do something like:
df = pd.DataFrame([a,b,c])
df.apply( some_func )
0 x1 a
1 x2 b
2 x3 c
where x1 would be calculated from a,b,c (I know this is possible), x2 would be calculated from x1 and x3 would be calculated from x2
the Fibonacci example would just be something like :
df = pd.DataFrame()
df.apply(fib(n, df))
0 0
1 1
2 1
3 2
4 3
5 5
.
.
.
n-1 F(n-1) + F(n-2)
You need to iterate over the rows and read the previous rows' values with DataFrame.loc. For example, with n = 6:
df = pd.DataFrame()
for i in range(0, 6):
    # the first two terms are 0 and 1; every later term is the sum of the previous two
    df.loc[i, 'f'] = i if i in [0, 1] else df.loc[i - 1, 'f'] + df.loc[i - 2, 'f']
df
f
0 0.0
1 1.0
2 1.0
3 2.0
4 3.0
5 5.0
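Growing a DataFrame one row at a time with .loc can get slow for large n, because each assignment enlarges the frame. An alternative sketch is to run the recurrence on a plain Python list and build the DataFrame once at the end. The function name fib_frame is hypothetical and not part of pandas.

```python
import pandas as pd


def fib_frame(n):
    """Return a DataFrame with one column 'f' holding the first n Fibonacci numbers."""
    values = []
    for i in range(n):
        # the first two terms are 0 and 1; later terms sum the previous two
        values.append(i if i < 2 else values[-1] + values[-2])
    return pd.DataFrame({"f": values})


df = fib_frame(6)
print(df)
```

Because the values are built as Python integers, the column keeps an integer dtype instead of the floats produced by the row-by-row .loc version.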
I have two dataframes:
1) Contains a list of suppliers and their Lat,Long coordinates
sup_essential = pd.DataFrame({'supplier': ['A','B','C'],
'coords': [(51.1235,-0.3453),(52.1245,-0.3423),(53.1235,-1.4553)]})
2) A list of stores and their lat, long coordinates
stores_essential = pd.DataFrame({'storekey': [1,2,3],
'coords': [(54.1235,-0.6553),(49.1245,-1.3423),(50.1235,-1.8553)]})
I want to create an output table that has: store, store_coordinates, supplier, supplier_coordinates, distance for every combination of store and supplier.
I currently have:
test=[]
for row in sup_essential.iterrows():
    for row in stores_essential.iterrows():
        r = sup_essential['supplier'],stores_essential['storeKey']
        test.append(r)
But this just gives me repeats of all the values
Source DFs
In [105]: sup
Out[105]:
coords supplier
0 (51.1235, -0.3453) A
1 (52.1245, -0.3423) B
2 (53.1235, -1.4553) C
In [106]: stores
Out[106]:
coords storekey
0 (54.1235, -0.6553) 1
1 (49.1245, -1.3423) 2
2 (50.1235, -1.8553) 3
Solutions:
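from sklearn.metrics import DistanceMetric  # in scikit-learn < 1.0 this lived in sklearn.neighbors
dist = DistanceMetric.get_metric('haversine')
# cross join: every supplier paired with every store
m = pd.merge(sup.assign(x=0), stores.assign(x=0), on='x', suffixes=('1','2')).drop(columns='x')
# split the (lat, lon) tuples into separate numeric columns
d1 = sup[['coords']].assign(lat=sup.coords.str[0], lon=sup.coords.str[1]).drop(columns='coords')
d2 = stores[['coords']].assign(lat=stores.coords.str[0], lon=stores.coords.str[1]).drop(columns='coords')
# haversine works in radians; multiply by the Earth's radius (about 6367 km) to get kilometres
m['dist_km'] = np.ravel(dist.pairwise(np.radians(d1), np.radians(d2)) * 6367)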
Result:
In [135]: m
Out[135]:
coords1 supplier coords2 storekey dist_km
0 (51.1235, -0.3453) A (54.1235, -0.6553) 1 334.029670
1 (51.1235, -0.3453) A (49.1245, -1.3423) 2 233.213416
2 (51.1235, -0.3453) A (50.1235, -1.8553) 3 153.880680
3 (52.1245, -0.3423) B (54.1235, -0.6553) 1 223.116901
4 (52.1245, -0.3423) B (49.1245, -1.3423) 2 340.738587
5 (52.1245, -0.3423) B (50.1235, -1.8553) 3 246.116984
6 (53.1235, -1.4553) C (54.1235, -0.6553) 1 122.997130
7 (53.1235, -1.4553) C (49.1245, -1.3423) 2 444.459052
8 (53.1235, -1.4553) C (50.1235, -1.8553) 3 334.514028
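If scikit-learn is not available, a similar table can be sketched with pandas and NumPy alone. The example below assumes pandas 1.2 or later (for how="cross") and uses the same 6367 km Earth radius as above; the helper haversine_km and the suffixes _sup and _store are names made up for this illustration.

```python
import numpy as np
import pandas as pd

sup = pd.DataFrame({
    "supplier": ["A", "B", "C"],
    "coords": [(51.1235, -0.3453), (52.1245, -0.3423), (53.1235, -1.4553)],
})
stores = pd.DataFrame({
    "storekey": [1, 2, 3],
    "coords": [(54.1235, -0.6553), (49.1245, -1.3423), (50.1235, -1.8553)],
})

# Every supplier paired with every store
pairs = sup.merge(stores, how="cross", suffixes=("_sup", "_store"))


def haversine_km(a, b, radius_km=6367.0):
    """Great-circle distance between paired (lat, lon) points given in degrees."""
    a = np.radians(np.asarray(a, dtype=float))
    b = np.radians(np.asarray(b, dtype=float))
    dlat = b[:, 0] - a[:, 0]
    dlon = b[:, 1] - a[:, 1]
    h = np.sin(dlat / 2) ** 2 + np.cos(a[:, 0]) * np.cos(b[:, 0]) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(h))


pairs["dist_km"] = haversine_km(
    pairs["coords_sup"].tolist(), pairs["coords_store"].tolist()
)
print(pairs)
```

Computing the distance row by row on the joined table keeps the pairing explicit, so there is no need to rely on the ordering of a flattened distance matrix.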