Get operation result in dataframe each specific row - python

I have data with 44 rows x 4 columns. I want to sum and divide each group of 11 rows, but in my attempt below the sum and the division are computed over the whole column instead.
Please suggest the simplest solution, maybe using iteration over the DataFrame?
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1,2,3,2,2,4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2],
'B':[4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,4,5,6,1,1,1,3,5,1,3,6,3,9,7,8,9,4,2,7,8,9,2],
'C':[7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,1,3,5,4],
'D':[1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3]}
)
a = data[['A','B','C','D']].sum()
b = data[['A','B','C','D']] / a
data_div = b.round(4)
Here is an example of what I expect: in the figure (not included here) I sum and divide every 4 rows in column A.

This looks like what you expect:
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1,2,3,2,2,4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2],
'B':[4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,4,5,6,1,1,1,3,5,1,3,6,3,9,7,8,9,4,2,7,8,9,2],
'C':[7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,1,3,5,4],
'D':[1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3]}
)
chunk_len = 11
chunks = []
for i in range(4):
    chunk = data[i*chunk_len:(i+1)*chunk_len]
    chunks.append(chunk / chunk.sum())
result = pd.concat(chunks)  # DataFrame.append was removed in pandas 2.0
print(result)

Assuming I understand your question correctly, you want to sum your dataframe in groups of 11 rows. One way to do so would be:
result = data.iloc[0:11].sum().sum()
The first .sum() returns the per-column sums of the first 11 rows, and the second .sum() adds those column sums together to get the total. For different slices of the dataframe you would change the row selection accordingly (data.iloc[11:22], data.iloc[22:33], etc.).
The exact same logic would apply for division as well.
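As a sketch of that approach (using an illustrative 44-row frame rather than the question's exact data), you could loop over the 11-row slices and normalise each one:

```python
import pandas as pd

# Illustrative data: any 44-row numeric frame works the same way.
data = pd.DataFrame({'A': range(1, 45), 'B': range(44, 0, -1)})

pieces = []
for start in range(0, len(data), 11):
    chunk = data.iloc[start:start + 11]            # one 11-row block
    pieces.append((chunk / chunk.sum()).round(4))  # divide by the block's column sums
result = pd.concat(pieces)
```

Each column of every 11-row block in `result` now sums to (approximately) 1.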

You can try to groupby every N rows and then apply the sum:
df.index = [i // 11 for i in range(len(df))]
df['sum_A'] = df["A"].groupby(df.index).sum()
df['div_A'] = df["A"] / df['sum_A']
Full code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,1,2,3,1,2,3,2,2,4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2],
'B':[4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,4,5,6,1,1,1,3,5,1,3,6,3,9,7,8,9,4,2,7,8,9,2],
'C':[7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,1,3,5,4],
'D':[1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3]}
)
df.index = [i // 11 for i in range(len(df))] # Define new index for groupby
df['sum_A'] = df["A"].groupby(df.index).sum() # Apply sum per group
df['div_A'] = df["A"] / df['sum_A'] # Divide each row by the given sum
print(df)
# A B C D sum_A div_A
# 0 1 4 7 1 22 0.045455
# 0 2 5 8 3 22 0.090909
# 0 3 6 9 5 22 0.136364
# 0 1 4 7 1 22 0.045455
# 0 2 5 8 3 22 0.090909
# 0 3 6 9 5 22 0.136364
# 0 1 4 7 1 22 0.045455
# 0 2 5 8 3 22 0.090909
# 0 3 6 9 5 22 0.136364
# 0 2 1 4 4 22 0.090909
# 0 2 1 2 1 22 0.090909
# 1 4 1 2 7 47 0.085106
# 1 5 3 3 8 47 0.106383
# 1 6 5 2 9 47 0.127660
# 1 4 1 2 7 47 0.085106
# 1 5 3 4 8 47 0.106383
# 1 6 5 5 9 47 0.127660
# 1 4 1 6 7 47 0.085106
# 1 5 3 4 8 47 0.106383
# 1 6 5 3 9 47 0.127660
# 1 1 4 6 4 47 0.021277
# 1 1 1 3 2 47 0.021277
# 2 1 4 9 7 32 0.031250
# 2 3 5 7 8 32 0.093750
# 2 5 6 8 9 32 0.156250
# 2 1 1 9 7 32 0.031250
# 2 3 1 4 8 32 0.093750
# 2 5 1 2 9 32 0.156250
# 2 1 3 7 7 32 0.031250
# 2 3 5 8 8 32 0.093750
# 2 5 1 9 9 32 0.156250
# 2 4 3 7 4 32 0.125000
# 2 1 6 8 2 32 0.031250
# 3 7 3 9 2 78 0.089744
# 3 8 9 7 3 78 0.102564
# 3 9 7 8 2 78 0.115385
# 3 7 8 9 2 78 0.089744
# 3 8 9 4 4 78 0.102564
# 3 9 4 2 5 78 0.115385
# 3 7 2 2 6 78 0.089744
# 3 8 7 1 4 78 0.102564
# 3 9 8 3 3 78 0.115385
# 3 4 9 5 6 78 0.051282
# 3 2 2 4 3 78 0.025641
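If you want all four columns normalised at once rather than just A, a variation of the same idea (my suggestion, not part of the answer above) is groupby(...).transform('sum'), which broadcasts each group's sums back to the original shape:

```python
import pandas as pd

# Small stand-in frame; the question's 44-row frame works identically.
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 5, 10, 10]})
group = [i // 2 for i in range(len(df))]   # chunks of 2 rows here, 11 in the question

sums = df.groupby(group).transform('sum')  # same shape as df, each row holds its group's sums
normalised = df / sums
```

This avoids adding a sum_X / div_X pair of helper columns for each column separately.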
Hope that helps!

Related

Adding element of a range of values to every N rows in a pandas DataFrame

I have the following dataframe that is ordered and consecutive:
Hour value
0 1 41
1 2 5
2 3 7
3 4 107
4 5 56
5 6 64
6 7 46
7 8 50
8 9 95
9 10 81
10 11 8
11 12 94
I want to add an element from a range of values to every N rows (4 in this case), e.g.:
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
Using numpy.arange:
import numpy as np
df['val'] = np.arange(len(df))//4+1
Output:
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
IIUC, you can create the val column based on the index as follows:
df['val'] = 1 + df.index//4
print(df)
Output
Hour value val
0 1 41 1
1 2 5 1
2 3 7 1
3 4 107 1
4 5 56 2
5 6 64 2
6 7 46 2
7 8 50 2
8 9 95 3
9 10 81 3
10 11 8 3
11 12 94 3
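Another option, offered here as a stylistic alternative rather than either answer's method, is GroupBy.ngroup(), which numbers the groups directly:

```python
import pandas as pd

df = pd.DataFrame({'Hour': range(1, 13),
                   'value': [41, 5, 7, 107, 56, 64, 46, 50, 95, 81, 8, 94]})

# ngroup() labels each group 0, 1, 2, ...; add 1 to start counting at 1
df['val'] = df.groupby(df.index // 4).ngroup() + 1
```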

Create a counter that iterates over a column in a dataframe, and counts when a condition in the column is met

I currently have a column in my dataframe called step that I want to set a counter on. It contains a bunch of repeating numbers. I want to create a new column with a counter that increments when a certain condition is met: when the number changes for a fourth time in step, the counter increments by 1, and then the process repeats. Here is an example of my code, and what I'd like to achieve:
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
7,8,8,8,9,9,7,7,8,8,8,9,9,9,7]})
df['counter'] = df['step'].cumsum() # my attempt; I want it to increment when it sees a fourth different number, and repeat
So ideally, my output would look like this:
print(df['step'])
[1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,
7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]
print(df['counter'])
[0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,
3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5]
The numbers in step will vary, but the counter will always increment when the fourth different value in the sequence is identified and reset the counter. I know I could probably do this with if statements, but my dataframe is large and I would rather do it in a faster way of comparison, if possible. Any help would be greatly appreciated!
You can convert your step column into categories and then count on the category codes:
import pandas as pd
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,10]})
df["counter"] = df.step.astype("category").values.codes // 3
Result:
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 10 3
Update for changed data (see comment):
df = pd.DataFrame({"step": [1,1,1,2,2,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,7,7,7,7,8,8,8,8,8,9,9,9,9,7,7,7,8,8,8,9,9,7,7,8,8,8,9,9,9,7,7]})
df['counter'] = (df.step.diff().fillna(0).ne(0).cumsum() // 3).astype(int)
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
50 7 5
Compare the current and previous rows of the step column to identify the boundaries (locations of transitions), then use cumsum to number the groups of rows and floor-divide by 3 to create the counter:
m = df.step != df.step.shift()
df['counter'] = (m.cumsum() - 1) // 3
step counter
0 1 0
1 1 0
2 1 0
3 2 0
4 2 0
5 2 0
6 2 0
7 3 0
8 3 0
9 3 0
10 4 1
11 4 1
12 4 1
13 5 1
14 5 1
15 5 1
16 5 1
17 6 1
18 6 1
19 6 1
20 7 2
21 7 2
22 7 2
23 7 2
24 8 2
25 8 2
26 8 2
27 8 2
28 8 2
29 9 2
30 9 2
31 9 2
32 9 2
33 7 3
34 7 3
35 7 3
36 8 3
37 8 3
38 8 3
39 9 3
40 9 3
41 7 4
42 7 4
43 8 4
44 8 4
45 8 4
46 9 4
47 9 4
48 9 4
49 7 5
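Put together as a self-contained sketch (with a shorter step list than the question's, for brevity), the shift/cumsum counter looks like:

```python
import pandas as pd

df = pd.DataFrame({'step': [1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 7]})

m = df.step != df.step.shift()         # True at every value change (and the first row)
df['counter'] = (m.cumsum() - 1) // 3  # every three runs of equal values share a counter value
```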

Fill values from one dataframe into another dataframe based on index of the two

I have two dataframes that contain time series data.
Dataframe A contains data of timestep 1, with index values getting incremented by 1 each time.
Dataframe B contains data of timestep n, with index values getting incremented by n each time.
I wish to do the following:
Add a column in Dataframe A and fill values from Dataframe B such that if the index value of that row in A lies between that of consecutive indexes in B, I fill the same value for all such rows in A.
I will illustrate this as below:
A:
id val1
0 2
1 3
2 4
3 1
4 6
5 23
6 2
7 12
8 56
9 34
10 90
...
B:
id tval
0 1
3 5
6 9
9 34
12 3434
...
Now, my result should like the following:
A:
id val1 tval
0 2 1
1 3 1
2 4 1
3 1 5
4 6 5
5 23 5
6 2 9
7 12 9
8 56 9
9 34 34
10 90 34
...
I would like to automate this for any n.
Use merge_asof:
df = pd.merge_asof(A, B, left_index=True, right_index=True)
print (df)
val1 tval
id
0 2 1
1 3 1
2 4 1
3 1 5
4 6 5
5 23 5
6 2 9
7 12 9
8 56 9
9 34 34
10 90 34
If id is a column:
df = pd.merge_asof(A, B, on='id')
print (df)
id val1 tval
0 0 2 1
1 1 3 1
2 2 4 1
3 3 1 5
4 4 6 5
5 5 23 5
6 6 2 9
7 7 12 9
8 8 56 9
9 9 34 34
10 10 90 34
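The backward matching of merge_asof is what makes this work for any step n: each row of A picks up the most recent tval at or before its id. A small sketch with hypothetical frames (n = 3 here):

```python
import pandas as pd

# Both frames must be sorted on the merge key for merge_asof.
A = pd.DataFrame({'id': range(7), 'val1': [2, 3, 4, 1, 6, 23, 2]})
B = pd.DataFrame({'id': [0, 3, 6], 'tval': [1, 5, 9]})

out = pd.merge_asof(A, B, on='id')  # direction='backward' is the default
```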

How can I create a new column in a DataFrame that shows patterns in a different column?

My original CSV file looks like this
1, 9
2, 8
3, 9
14, 7
15, 6
19, 8
20, 9
21, 3
I grouped the table by runs of consecutive integers in column A with
for grp, val in df.groupby((df.A.diff() - 1).fillna(0).cumsum()):
    print(val)
Resulting table:
A B
1 1 9
2 2 8
3 3 9
A B
14 14 7
15 15 6
A B
19 19 8
20 20 9
21 21 3
In practice the B values are very long ID numbers, but insignificant as numbers. How can I create a new column C that will show patterns in each of the three groups by assigning a simple value to each ID, and the same simple value for each duplicate in a group?
Desired output:
A B C
1 1 9 1
2 2 8 2
3 3 9 1
A B C
14 14 7 1
15 15 6 2
A B C
19 19 8 1
20 20 9 2
21 21 3 3
Thanks
You are close:
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : pd.Series(pd.factorize(x)[0]+1)).values
df
Out[105]:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
3 14 7 1
4 15 6 2
5 19 8 1
6 20 9 2
7 21 3 3
Or using category
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : x.astype('category').cat.codes+1).values
df
Out[110]:
A B C
0 1 9 2
1 2 8 1
2 3 9 2
3 14 7 2
4 15 6 1
5 19 8 2
6 20 9 3
7 21 3 1
If you need a for loop:
for x, df1 in df.groupby((df.A.diff()-1).fillna(0).cumsum()):
    print(df1.assign(C=pd.factorize(df1.B)[0]+1))
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
Let's try:
df.columns = ['A','B']
g = df.groupby(df.A.diff().ne(1).cumsum())
df['C'] = g['B'].transform(lambda x: pd.factorize(x)[0] + 1)
for n, grp in g:
    print(grp)
Output:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
You could also try the withColumn function, which adds a new column to a dataframe, but note that withColumn is a PySpark method, not pandas.

Dropping different possible combination of values from a column in pandas dataframe iteratively

I have a data frame as shown below:
import pandas as pd
Data = pd.DataFrame({'L1': [1,2,3,4,5], 'L2': [6,7,3,5,6], 'ouptput':[10,11,12,13,14]})
Data
Yields,
L1 L2 ouptput
0 1 6 10
1 2 7 11
2 3 3 12
3 4 5 13
4 5 6 14
I want to loop through the data to remove n values at a time from the 'ouptput' column of Data above, where n = [1,2,3,4], and assign the result to a new data frame 'Test_Data'. For example, if I assign n = 2 the function should produce
Test_Data - iteration 1 as
L1 L2 ouptput
0 1 6
1 2 7
2 3 3 12
3 4 5 13
4 5 6 14
Test_Data - iteration 2 as
L1 L2 ouptput
0 1 6 10
1 2 7 11
2 3 3
3 4 5
4 5 6 14
Likewise, it should produce a data frame with 2 values removed from the 'ouptput' column. It should produce a new output (a new combination) every time; no output should be repeated. Also, I should have control over the number of iterations. For example, 5C3 has 10 possible combinations, but I should be able to stop it at 8 iterations.
This is not a great solution but will probably help you achieve what you are looking for:
import pandas as pd
Data = pd.DataFrame({'L1': [1,2,3,4,5], 'L2': [6,7,3,5,6], 'output':[10,11,12,13,14]})
num_iterations = 1
num_values = 3
for i in range(num_iterations):
    tmp_data = Data.copy()
    tmp_data.loc[i*num_values:num_values*(i+1)-1, 'output'] = ''
    print(tmp_data)
This gives you a concatenated dataframe with every combination, using pd.concat and itertools.combinations:
from itertools import combinations
import pandas as pd
def mask(df, col, idx):
    d = df.copy()
    d.loc[list(idx), col] = ''
    return d
n = 2
pd.concat({c: mask(Data, 'ouptput', c) for c in combinations(Data.index, n)})
L1 L2 ouptput
0 1 0 1 6
1 2 7
2 3 3 12
3 4 5 13
4 5 6 14
2 0 1 6
1 2 7 11
2 3 3
3 4 5 13
4 5 6 14
3 0 1 6
1 2 7 11
2 3 3 12
3 4 5
4 5 6 14
4 0 1 6
1 2 7 11
2 3 3 12
3 4 5 13
4 5 6
1 2 0 1 6 10
1 2 7
2 3 3
3 4 5 13
4 5 6 14
3 0 1 6 10
1 2 7
2 3 3 12
3 4 5
4 5 6 14
4 0 1 6 10
1 2 7
2 3 3 12
3 4 5 13
4 5 6
2 3 0 1 6 10
1 2 7 11
2 3 3
3 4 5
4 5 6 14
4 0 1 6 10
1 2 7 11
2 3 3
3 4 5 13
4 5 6
3 4 0 1 6 10
1 2 7 11
2 3 3 12
3 4 5
4 5 6
