I'd like to add values calculated in a for loop to a series so that it can be its own column in a dataframe. So far I've got this (the y values come from a dataframe named block):
N = 12250
for i in range(0, N - 1):
    y1 = block.iloc[i]['y']
    y2 = block.iloc[i + 1]['y']
    diffy[i] = y2 - y1
I'd like diffy to be its own series rather than a value that gets overwritten on each loop iteration.
Some sample data (assume N = 5):
N = 5
np.random.seed(42)
block = pd.DataFrame({
    'y': np.random.randint(0, 10, N)
})
y
0 6
1 3
2 7
3 4
4 6
You can calculate diffy as follows:
diffy = block['y'].diff().shift(-1)[:-1]
0 -3.0
1 4.0
2 -3.0
3 2.0
Name: y, dtype: float64
diffy is a pandas.Series. If you want a list, append .to_list(); if you want a NumPy array, append .values.
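For reference, the one-liner matches the original loop; here is a quick self-contained check (same sample data as above, equivalence holds regardless of the seed):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
block = pd.DataFrame({'y': np.random.randint(0, 10, 5)})

# vectorized forward difference: y[i+1] - y[i]
diffy = block['y'].diff().shift(-1)[:-1]

# the original loop, for comparison
loop_diffy = [block.iloc[i + 1]['y'] - block.iloc[i]['y']
              for i in range(len(block) - 1)]
assert diffy.to_list() == loop_diffy
```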
The dataframe I am working with looks like this:
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
Here, the features have the following meanings:
FWeight - weight of each fragment (or row)
fsim - similarity score between the two columns cap1 and cap2
The weighted formula is:
vid_score = sum(fsim[i] * FWeight[i]) / sum(FWeight[i]), summed over the rows of a given vid2
For example, for vid2 "-_aaMGK6GGw_57_61", COS = 2, so the two rows with this vid2 fall under it:
fsim FWeight
0 0.253792 0.750000
1 0.192565 0.250000
The calculated value vid_score needs to be
vid_score(1st video) = (fsim[0] * FWeight[0] + fsim[1] * FWeight[1])/(FWeight[0] + FWeight[1])
The expected output value vid_score for vid2 = -_aaMGK6GGw_57_61 is
(0.750000 * 0.253792 + 0.250000 * 0.192565) / (0.750000 + 0.250000)
= 0.238485 (Final value)
For different videos, COS can be 1, 2, 3, 4, 5, and so on, so the computation needs to be dynamic.
I am trying to calculate the weighted similarity score for each video ID that is vid2 here. However, there are a number of captions and weights respectively for each video. It varies, some have 2, some 1, some 3, etc. This number of segments and captions has been stored in the feature COS (that is, count of segments).
I want to iterate through the dataframe where score for each video is stored as a weighted average score of the fsim (fragment similarity score). But the number of iteration is not regular.
I have tried the code below, but I am not able to make the iteration step dynamic, with COS as the step size instead of a constant value:
vems_score = 0.0
video_scores = []
for i, row in merged.iterrows():
    vid_score = 0.0
    total_weight = 0.0
    for j in range(row['COS']):
        total_weight = total_weight + row['FWeight']
        vid_score = vid_score + (row['FWeight'] * row['fsim'])
    i = i + row['COS']
    vid_score = vid_score / total_weight
    video_scores.append(vid_score)
print(video_scores)
Here is my solution, which you can modify or optimize to your needs.
import pandas as pd, numpy as np
def computeSim():
vid=[1,1,2,2,3]
cos=[2,2,2,2,1]
fsim=[0.25,.19,.56,.17,.27]
weight = [.75,.25,.33,.66,.71]
df= pd.DataFrame({'vid':vid,'cos':cos,'fsim':fsim,'fw':weight})
print(df)
df2 = df.groupby('vid')
similarity=[]
for group in df2:
similarity.append( np.sum(group[1]['fsim']*group[1]['fw'])/ np.sum(group[1]['fw']))
return similarity
output:
0.235
0.30000000000000004
0.27
Solution
Try this with your data. I assume that you stored the dataframe as df.
df['Prod'] = df['fsim']*df['FWeight']
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum()/grp['FWeight'].sum()
print(result)
Output with your data (Dummy Data B):
vid2 COS
-_aaMGK6GGw_57_61 2 0.238485
-_hbPLsZvvo_18_25 1 0.275962
-_hbPLsZvvo_5_8 2 0.307548
dtype: float64
Dummy Data: A
I made the following dummy data to test a few aspects of the logic.
import numpy as np
import pandas as pd

df = pd.DataFrame({'vid2': [1, 1, 2, 5, 2, 6, 7, 4, 8, 7, 6, 2],
                   'COS': [2, 2, 3, 1, 3, 2, 2, 1, 1, 2, 2, 3],
                   'fsim': np.random.rand(12),
                   'FWeight': np.random.rand(12)})
df['Prod'] = df['fsim'] * df['FWeight']
print(df)

# Group by and apply the formula
grp = df.groupby(['vid2', 'COS'])
result = grp['Prod'].sum() / grp['FWeight'].sum()
print(result)
Output (values will vary from run to run since no random seed is set):
vid2 COS
1 2 0.405734
2 3 0.535873
4 1 0.534456
5 1 0.346937
6 2 0.369810
7 2 0.479250
8 1 0.065854
dtype: float64
Dummy Data: B (OP Provided)
This is your dummy data. I made this script so anyone could easily run it and load the data as a dataframe.
import pandas as pd
from io import StringIO
s = """
vid2 COS fsim FWeight
0 -_aaMGK6GGw_57_61 2 0.253792 0.750000
1 -_aaMGK6GGw_57_61 2 0.192565 0.250000
2 -_hbPLsZvvo_5_8 2 0.562707 0.333333
3 -_hbPLsZvvo_5_8 2 0.179969 0.666667
4 -_hbPLsZvvo_18_25 1 0.275962 0.714286
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
#print(df)
I have a data frame that looks like this:
data_dict = {'factor_1': np.random.randint(1, 5, 10),
             'factor_2': np.random.randint(1, 5, 10),
             'multi': np.random.rand(10),
             'output': np.NaN}
df = pd.DataFrame(data_dict)
I'm getting stuck implementing this comparison:
If factor_1 and factor_2 values match, then output = 2 * multi (here 2 is a kind of base value). Continue scanning the next rows.
If factor_1 and factor_2 values don't match, then output = -2 for that row. Scan the next row(s).
If the factor values still don't match up to row R, assign output = -2^2, -2^3, ..., -2^R for those rows respectively.
When the factor values match again at row R+1, assign output = 2^(R+1) * multi.
Repeat the process.
This solution does not use recursion:
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
data_dict = {'factor_1': np.random.randint(1, 5, 10),
             'factor_2': np.random.randint(1, 5, 10),
             'multi': np.random.rand(10),
             'output': np.NaN}
df = pd.DataFrame(data_dict)
# create a mask marking rows where the factors do not match
mask = (df['factor_1'] != df['factor_2'])
# the cumsum difference counts consecutive mismatches, resetting on a match
df['R'] = mask.cumsum() - mask.cumsum().where(~mask).ffill().fillna(0)
# use np.where to create the output
df['output'] = np.where(df['R'] == 0, df['multi'] * 2, -2**df['R'])
factor_1 factor_2 multi output R
0 2 1 0.419195 -2.000000 1.0
1 4 2 0.685220 -4.000000 2.0
2 1 1 0.204452 0.408904 0.0
3 1 4 0.878117 -2.000000 1.0
4 4 2 0.027388 -4.000000 2.0
5 2 1 0.670468 -8.000000 3.0
6 4 3 0.417305 -16.000000 4.0
7 2 2 0.558690 1.117380 0.0
8 4 3 0.140387 -2.000000 1.0
9 1 1 0.198101 0.396203 0.0
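The mask/cumsum line is the non-obvious part of that answer; it can be checked in isolation on a toy mask (hypothetical values):

```python
import pandas as pd

mask = pd.Series([True, True, False, True, False])
# cumsum counts all Trues so far; subtracting the cumsum frozen at the
# most recent False resets the count after every False
r = mask.cumsum() - mask.cumsum().where(~mask).ffill().fillna(0)
print(r.to_list())  # [1.0, 2.0, 0.0, 1.0, 0.0]
```

So r counts consecutive mismatches and drops back to 0 on every match, which is exactly the R column above.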
The solution I present is, maybe, a little bit harder to read, but I think it works as you wanted. It combines
numpy.where() in order to make a column based on a condition,
pandas.DataFrame.shift() and pandas.DataFrame.cumsum() to label different groups with consecutive similar values, and
pandas.DataFrame.rank() in order to construct a vector of powers used on previously made df['output'] column.
The code follows.
df['output'] = np.where(df.factor_1 == df.factor_2, -2 * df.multi, 2)
group = ['output', (df.output != df.output.shift()).cumsum()]
df['output'] = (-1) * df.output.pow(df.groupby(group).output.rank('first'))
flag = False
cols = ('factor_1', 'factor_2', 'multi')
# output must be an indexable array rather than the scalar NaN
# that data_dict starts with
data_dict['output'] = np.full(len(data_dict['multi']), np.nan)
z = zip(*[data_dict[col] for col in cols])
for i, (f1, f2, multi) in enumerate(z):
    if f1 == f2:
        output = 2 * multi
        flag = False
    else:
        if flag:
            output *= 2
        else:
            output = -2
        flag = True
    data_dict['output'][i] = output
The tricky part is the flag variable, which tells you whether the previous row had a match or not.
So I understand we can use a pandas DataFrame to do vector operations on cells, like
df = pd.DataFrame([a, b, c])
df * 3
which would produce something like:
0 a*3
1 b*3
2 c*3
but could we use a pandas dataframe to say calculate the Fibonacci sequence ?
I am asking this because for the Fibonacci sequence the next number depends on the previous two numbers (F_n = F_(n-1) + F_(n-2)). I am not exactly interested in the Fibonacci sequence and more interested in knowing if we can do something like:
df = pd.DataFrame([a,b,c])
df.apply( some_func )
0 x1 a
1 x2 b
2 x3 c
where x1 would be calculated from a, b, c (I know this is possible), x2 would be calculated from x1, and x3 would be calculated from x2.
The Fibonacci example would just be something like:
df = pd.DataFrame()
df.apply(fib(n, df))
0 0
1 1
2 1
3 2
4 3
5 5
.
.
.
n-1 F(n-1) + F(n-2)
You need to iterate through the rows and access previous rows' data with DataFrame.loc. For example, with n = 6:
df = pd.DataFrame()
for i in range(0, 6):
    df.loc[i, 'f'] = i if i in [0, 1] else df.loc[i - 1, 'f'] + df.loc[i - 2, 'f']
df
f
0 0.0
1 1.0
2 1.0
3 2.0
4 3.0
5 5.0
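Growing a DataFrame cell by cell with .loc works, but each assignment may re-allocate as the frame grows; for larger n it is usually faster to build a plain list first and wrap it once. A minimal sketch (fib_series is a made-up helper name):

```python
import pandas as pd

def fib_series(n):
    # build the sequence in a plain Python list first
    vals = [0, 1][:n]
    while len(vals) < n:
        vals.append(vals[-1] + vals[-2])
    # wrap the finished list in a DataFrame once, at the end
    return pd.DataFrame({'f': vals})

print(fib_series(6)['f'].to_list())  # [0, 1, 1, 2, 3, 5]
```

This also keeps the column as integers instead of the floats that incremental .loc assignment produces.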
There are three pandas Series:
x = pd.Series([220,340,500,600,700,900,540,60])
y = pd.Series([2,1,2,2,1])
z = pd.Series([], dtype=int)
Each element of y denotes how many elements of x to sum and put into z.
Example: since y starts with 2, I add the first two elements of x (220 and 340) to get 560 and put it in z as its first element. Next, y has 1, so I take 500 (the third element of x) and put it in z as its second element, and so on.
Here is what I have tried:
j = 0
for i in y:
    par = y[i]
    z[i] = x[j:par + j].sum()
    j = j + par
Group by y's index, repeated:
x.groupby(y.index.repeat(y)).sum()
0 560
1 500
2 1300
3 1440
4 60
dtype: int64
If the length mismatches, this will lead to a ValueError. In that case, a safer alternative is to groupby the cumsum, repeated, and reset the index:
x.groupby(y.cumsum().repeat(y).reset_index(drop=True)).sum()
Here's my take:
df = x.to_frame(name='x').reset_index(drop=True)
df['cat'] = pd.cut(df.index+1, y.cumsum(), labels=False)
df['cat'] = df['cat'].fillna(-1).add(1)
z = df.groupby('cat').x.sum()
Out:
cat
0.0 560
1.0 500
2.0 1300
3.0 1440
4.0 60
Name: x, dtype: int64
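The pd.cut step in that answer can be checked on its own; this sketch reproduces just the binning of 1-based positions into y's cumulative edges:

```python
import pandas as pd

y = pd.Series([2, 1, 2, 2, 1])
bins = y.cumsum()             # [2, 3, 5, 7, 8]
pos = pd.Series(range(1, 9))  # 1-based positions into x
# positions at or below the first edge fall outside every bin -> NaN,
# which fillna(-1).add(1) folds into group 0
cat = pd.cut(pos, bins, labels=False).fillna(-1).add(1)
print(cat.to_list())  # [0.0, 0.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
```

Positions 1 and 2 both land in group 0 because pd.cut's intervals are right-inclusive, so position 2 falls below the first bin (2, 3].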
It is an index conflict issue; just update your loop to use a range instead:
j = 0
for i in range(len(y)):
    par = y[i]
    print('first', i)
    z[i] = x[j:par + j].sum()
    print('second', j, 'par', par)
    j = j + par
>>> z
0 560
1 500
2 1300
3 1440
4 60
I have a dataframe with 2 identifiers (ID1, ID2), 3 numeric columns (X1, X2, X3), and a column titled 'input' (6 columns total) and n rows. For each row, I want the index n of the last column such that input - (X1 + X2 + ... + Xn) >= 0 still holds.
How can I do this in Python?
In R I did this by using:
tmp = data
for (i in 4:5)
{
    data[, i] <- tmp$input - rowSums(tmp[, 3:i])
}
output <- apply(data[, 3:5], 1, function(x) max(which(x > 0)))
data$output <- output
I am trying to translate this into Python. What might be the best way to do this? There can be N such rows, and M such columns.
Sample Data:
ID1  ID2  X1  X2  X3  INPUT  OUTPUT (explanation)
a    b    1   2   3   3      2  (X1 = 1, X1+X2 = 3, X1+X2+X3 = 6; after 2 sums, input < sum)
a1   a2   5   2   1   4      0  (X1 = 5, X1+X2 = 7, X1+X2+X3 = 8; even for 1 sum, input < sum)
a2   b2   0   4   5   100    3  (X1 = 0, X1+X2 = 4, X1+X2+X3 = 9; even after 3 sums, input > sum)
You can use the Pandas module, which handles this very effectively in Python.
import pandas as pd

# Taking sample data here
df = pd.DataFrame([
    ['A', 'B', 1, 3, 4, 0.1],
    ['K', 'L', 10, 3, 14, 0.5],
    ['P', 'H', 1, 73, 40, 0.6]], columns=['ID1', 'ID2', 'X2', 'X3', 'X4', 'INPUT'])

# Row-wise maximum over the X columns
df['new_column'] = df[['X2', 'X3', 'X4']].max(axis=1)
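Note that .max(axis=1) computes a row-wise maximum rather than the cumulative condition from the question. A closer translation of the R logic might look like the following sketch (column names taken from the question's sample data, and assuming the X columns are non-negative so the running sums are monotone):

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({'ID1': ['a', 'a1', 'a2'],
                   'ID2': ['b', 'a2', 'b2'],
                   'X1': [1, 5, 0],
                   'X2': [2, 2, 4],
                   'X3': [3, 4, 5],
                   'INPUT': [3, 4, 100]})

xcols = ['X1', 'X2', 'X3']
# Running sums X1, X1+X2, X1+X2+X3 per row
csum = df[xcols].cumsum(axis=1)
# Count how many running sums still fit within INPUT;
# that count is the last n with input - (X1+...+Xn) >= 0
df['OUTPUT'] = csum.le(df['INPUT'], axis=0).sum(axis=1)
print(df['OUTPUT'].to_list())  # [2, 0, 3]
```

Because the count-of-True trick only equals the last qualifying index when the running sums are monotone, negative X values would need the max-of-matching-index approach from the R code instead.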