Pandas sum of variable number of columns - python

I have a pandas dataframe like this -
Time   1 A   2 A   3 A   4 A   5 A   6 A   100 A
   5    10     4     6     6     4     6       4
   3     7    19     2     7     7     9      18
   6     3     6     3     3     8    10      56
   2     5     9     1     1     9    12      13
The Time column gives me the number of A columns that I need to sum up, so that the output looks like this -
Time   1 A   2 A   3 A   4 A   5 A   6 A   100 A   Total
   5    10     4     6     6     4     6       4      30
   3     7    19     2     7     7     9      18      28
   6     3     6     3     3     8    10      56      33
   2     5     9     1     1     9    12      13      14
In other words, when the value in the Time column is 3, it should sum up 1 A, 2 A and 3 A;
when the value in the Time column is 5, it should sum up 1 A, 2 A, 3 A, 4 A and 5 A.
Note: There are other columns in between the A columns, so I can't sum using simple indexing.
I would highly appreciate any help in finding a solution.

Use numpy - the idea is to compare an array created by np.arange (with the length of the columns) against the Time column converted to an index, broadcasting to a 2d mask; then get the matched values with numpy.where and finally sum:
df1 = df.set_index('Time')
m = np.arange(len(df1.columns)) < df1.index.values[:, None]
df['new'] = np.where(m, df1.values, 0).sum(axis=1)
print (df)
   Time  1 A  2 A  3 A  4 A  5 A  6 A  100 A  new
0     5   10    4    6    6    4    6      4   30
1     3    7   19    2    7    7    9     18   28
2     6    3    6    3    3    8   10     56   33
3     2    5    9    1    1    9   12     13   14
Details:
print (df1)
      1 A  2 A  3 A  4 A  5 A  6 A  100 A
Time
5      10    4    6    6    4    6      4
3       7   19    2    7    7    9     18
6       3    6    3    3    8   10     56
2       5    9    1    1    9   12     13
print (m)
[[ True True True True True False False]
[ True True True False False False False]
[ True True True True True True False]
[ True True False False False False False]]
print (np.where(m, df1.values, 0))
[[10 4 6 6 4 0 0]
[ 7 19 2 0 0 0 0]
[ 3 6 3 3 8 10 0]
[ 5 9 0 0 0 0 0]]
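If, as noted in the question, there are unrelated columns mixed in between the A columns, the same idea works after selecting only the A columns first (a sketch, assuming the A columns can be recognised by their names ending in 'A'):
a_cols = [c for c in df.columns if c.endswith('A')]  # assumption: only the A columns end in 'A'
df1 = df.set_index('Time')[a_cols]
m = np.arange(len(a_cols)) < df1.index.values[:, None]
df['new'] = np.where(m, df1.values, 0).sum(axis=1)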

Try:
df['total'] = df.apply(lambda x: sum(x.iloc[i + 1] for i in range(x['Time'])), axis=1)
Here x.iloc is used for positional access (integer keys on a labelled Series no longer fall back to positions in recent pandas). Being row-wise, this is simpler but slower than the vectorised numpy approach above.

Related

Find index of first row whose value matches a condition set by another row

I have a dataframe which consists of two columns:
x y
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
I would like to add a column whose value is the index of the first value to meet the following condition: y >= x. For example, for row 2 (x = 3), the first y value greater than or equal to 3 is 4, so the output of z for row 2 is (index) 1. I expect the final table to look like:
x y z
0 1 2 0
1 2 4 0
2 3 6 1
3 4 8 1
4 5 10 2
5 6 12 2
6 7 14 3
7 8 16 3
8 9 18 4
9 10 20 4
It should be noted that both x and y are sorted if that should make the solution easier.
I have seen a similar answer but I could not translate it to my situation.
You want np.searchsorted, which assumes df['y'] is sorted:
df['z'] = np.searchsorted(df['y'], df['x'])
Output:
x y z
0 1 2 0
1 2 4 0
2 3 6 1
3 4 8 1
4 5 10 2
5 6 12 2
6 7 14 3
7 8 16 3
8 9 18 4
9 10 20 4
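For intuition, np.searchsorted returns, for each value of df['x'], the position where it would be inserted into df['y'] to keep it sorted, which with the default side='left' is exactly the index of the first y >= x. A minimal sketch:
import numpy as np
print(np.searchsorted([2, 4, 6, 8], [3, 4]))
# [1 1] - 3 would be inserted before 4 (index 1), and 4 ties with the existing 4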

df.apply() but skip the first row

I am trying to apply the following df.apply command to a dataframe but want it to skip the first row. Any advice on how to do that without setting the first row as the column headers?
res = sheet1[sheet1.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
You can select from index one onwards as follows (note this yields the boolean mask, as shown in Version 2 below):
res = sheet1.iloc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
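If you want the filtered rows themselves (as in the question) rather than the mask, apply the mask back to the sliced frame, for example:
tail = sheet1.iloc[1:]
res = tail[tail.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]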
EDIT Version 3:
import pandas as pd
import random
df = pd.DataFrame({'a': range(1, 11), 'b': range(2, 21, 2), 'c': range(1, 20, 2),
                   'd': ['TRUE' if random.randint(0, 1) else 'FALSE' for _ in range(10)]})
print (df)
res = df[df.apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)]
print (res.loc[1:])
If all you want to do is to get only the rows from 1 onwards, you can just do it as shown above:
The input Dataframe is:
a b c d
0 1 2 1 TRUE
1 2 4 3 FALSE
2 3 6 5 FALSE
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
6 7 14 13 FALSE
7 8 16 15 FALSE
8 9 18 17 TRUE
9 10 20 19 TRUE
The output of res will be:
a b c d
0 1 2 1 TRUE
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
8 9 18 17 TRUE
9 10 20 19 TRUE
The output of res.loc[1:] - excluding the first row - will be:
a b c d
3 4 8 7 TRUE
4 5 10 9 TRUE
5 6 12 11 TRUE
8 9 18 17 TRUE
9 10 20 19 TRUE
EDIT Version 2:
Here's an example with 'TRUE' and 'FALSE' in the column.
import pandas as pd
import random
df = pd.DataFrame({'a':['TRUE' if random.randint(0,1) else 'FALSE' for _ in range(10)]})
print (df)
res = df.iloc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
print (res)
The output will be:
Original DataFrame:
a
0 TRUE
1 TRUE
2 FALSE
3 FALSE
4 FALSE
5 TRUE
6 TRUE
7 FALSE
8 FALSE
9 FALSE
Result from the DataFrame:
1 True
2 False
3 False
4 False
5 True
6 True
7 False
8 False
9 False
dtype: bool
You can also use loc instead of iloc:
res = df.loc[1:].apply(lambda row: row.astype(str).str.contains('TRUE', case=False).any(), axis=1)
As you can see, it skipped the first row.
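Keep in mind that loc[1:] and iloc[1:] only coincide here because the index is the default RangeIndex; with any other index they can differ. A minimal sketch:
import pandas as pd
s = pd.Series([10, 20, 30], index=[2, 0, 1])
print(s.iloc[1:])  # positional: skips the first row -> labels 0 and 1
print(s.loc[1:])   # label-based: starts at label 1 -> only the last row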
Old answer
Here's an example:
import pandas as pd
df = pd.DataFrame({'a':range(1,11), 'b':range(2,21,2), 'c':range(1,20,2)})
print (df)
res = df.iloc[1:,:].apply(lambda x: x+10,axis=1)
print (res)
Original DataFrame:
a b c
0 1 2 1
1 2 4 3
2 3 6 5
3 4 8 7
4 5 10 9
5 6 12 11
6 7 14 13
7 8 16 15
8 9 18 17
9 10 20 19
Only rows from index 1 onwards are processed; row 0 is left out of the result:
a b c
1 12 14 13
2 13 16 15
3 14 18 17
4 15 20 19
5 16 22 21
6 17 24 23
7 18 26 25
8 19 28 27
9 20 30 29

How to drop duplicates in python if consecutive values are the same in two columns?

I have a dataframe like below:
A B C
1 8 23
2 8 22
3 9 45
4 9 45
5 6 12
6 4 10
7 11 12
I want to drop consecutive duplicates in column B, keeping the first occurrence, but only if the value in column C is also the same.
E.g. here the value 9 in column B repeats consecutively and its corresponding values in column C are also the same (45). In this case I want to retain only the first occurrence.
Expected Output:
A B C
1 8 23
2 8 22
3 9 45
5 6 12
6 4 10
7 11 12
I tried a groupby, but did not know how to drop.
code:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
test = df.groupby('consecutive', as_index=False).apply(lambda x: (x['B'].head(1), x.shape[0],
                                                                  x['C'].iloc[-1] - x['C'].iloc[0]))
This groupby returns a Series, but I want to drop rows instead.
Create a helper Series of consecutive groups, then use DataFrame.drop_duplicates by the 2 columns:
df['consecutive'] = (df['B'] != df['B'].shift(1)).cumsum()
df = df.drop_duplicates(['consecutive','C'])
print (df)
A B C consecutive
0 1 8 23 1
1 2 8 22 1
2 3 9 45 2
4 5 6 12 3
5 6 4 10 4
6 7 11 12 5
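For reference, this is what the helper Series looks like for the data above - each run of equal consecutive B values gets its own group number:
print ((df['B'] != df['B'].shift(1)).cumsum())
0    1
1    1
2    2
3    2
4    3
5    4
6    5
Name: B, dtype: int64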
Or chain both conditions with | for bitwise OR:
df = df[(df['B'] != df['B'].shift()) | (df['C'] != df['C'].shift())]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
An easy way is to check the row-to-row difference of B and C and drop a row when both differences are 0 (duplicate values); the code is
df[ ~((df.B.diff()==0) & (df.C.diff()==0)) ]
A one-liner to filter out such records is:
df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
Here we check whether the columns ['B', 'C'] are the same as in the shifted rows; if they are not, we retain the values:
>>> df[(df[['B', 'C']].shift() != df[['B', 'C']]).any(axis=1)]
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
This is quite scalable, since we can define a function that will easily operate on an arbitrary number of columns:
def drop_consecutive_duplicates(df, *colnames):
    dff = df[list(colnames)]
    return df[(dff.shift() != dff).any(axis=1)]
So you can then filter with:
drop_consecutive_duplicates(df, 'B', 'C')
Using diff, ne and any over axis=1:
Note: this method only works for numeric columns
m = df[['B', 'C']].diff().ne(0).any(axis=1)
print(df[m])
Output
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Details
df[['B', 'C']].diff()
B C
0 NaN NaN
1 0.0 -1.0
2 1.0 23.0
3 0.0 0.0
4 -3.0 -33.0
5 -2.0 -2.0
6 7.0 2.0
Then we check if any of the values in a row are not equal (ne) to 0:
df[['B', 'C']].diff().ne(0).any(axis=1)
0 True
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
You can compute a series of the rows to drop, and then drop them:
to_drop = (df['B'] == df['B'].shift()) & (df['C'] == df['C'].shift())
df = df[~to_drop]
It gives as expected:
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
Code
df1 = df.drop_duplicates(subset=['B', 'C'])
(Note that, unlike the shift-based answers above, this drops all duplicate (B, C) pairs, not only consecutive ones; it happens to give the same result for this data.)
Result
A B C
0 1 8 23
1 2 8 22
2 3 9 45
4 5 6 12
5 6 4 10
6 7 11 12
If I understand your question correctly, given the following dataframe:
df = pd.DataFrame({'B': [8, 8, 9, 9, 6, 4, 11], 'C': [22, 23, 45, 45, 12, 10, 12],})
This one-liner uses the drop_duplicates method (with the same caveat as above: it also removes non-consecutive duplicates):
df.drop_duplicates(['B', 'C'])
It gives the expected results:
B C
0 8 22
1 8 23
2 9 45
4 6 12
5 4 10
6 11 12

Get operation result in dataframe each specific row

I have data with 44 rows x 4 columns. I want to sum and divide within each block of 11 rows, but my mistake is that I calculate the sum and the division over the whole column.
Please suggest the simplest solution, maybe using iteration over the dataframe?
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1,2,3,2,2,4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2],
'B':[4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,4,5,6,1,1,1,3,5,1,3,6,3,9,7,8,9,4,2,7,8,9,2],
'C':[7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,1,3,5,4],
'D':[1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3]}
)
a = data[['A','B','C','D']].sum()
b = data[['A','B','C','D']] / a
data_div = b.round(4)
Here is an example of what I expect: in the figure (not reproduced here) I sum and divide each 4 rows in column A.
This looks like what you expect:
import pandas as pd
data = pd.DataFrame({'A':[1,2,3,1,2,3,1,2,3,2,2,4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2],
'B':[4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,4,5,6,1,1,1,3,5,1,3,6,3,9,7,8,9,4,2,7,8,9,2],
'C':[7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,1,3,5,4],
'D':[1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3]}
)
chunk_len = 11
result = pd.DataFrame()
for i in range(4):
    res = data[i*chunk_len:(i+1)*chunk_len] / data[i*chunk_len:(i+1)*chunk_len].sum()
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    result = pd.concat([result, res])
print(result)
Assuming I understand your question correctly, you want to sum your dataframe in groups of 11 rows. One way to do so would be:
result = data.iloc[0:11].sum().sum()
The first .sum() returns the per-column sums of the first 11 rows, and the second .sum() adds those up to get the total. For other blocks you would change the slice accordingly (like data.iloc[11:22] etc.).
The exact same logic applies to the division as well.
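Written as a loop over all blocks, that becomes (a sketch, assuming 11-row blocks):
chunk = 11
for start in range(0, len(data), chunk):
    block = data.iloc[start:start + chunk]
    print((block / block.sum()).round(4))  # per-column division within each block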
You can group every N rows by assigning a block index and then apply the sum:
df.index = [i // N for i in range(len(df))]  # N = rows per block (11 below)
df['sum_A'] = df["A"].groupby(df.index).sum()
df['div_A'] = df["A"] / df['sum_A']
Full code:
df = pd.DataFrame({'A':[1,2,3,1,2,3,1,2,3,2,2,4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2],
'B':[4,5,6,4,5,6,4,5,6,1,1,1,3,5,1,3,5,1,3,5,4,1,4,5,6,1,1,1,3,5,1,3,6,3,9,7,8,9,4,2,7,8,9,2],
'C':[7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,1,3,5,4],
'D':[1,3,5,1,3,5,1,3,5,4,1,7,8,9,7,8,9,7,8,9,4,2,7,8,9,7,8,9,7,8,9,4,2,2,3,2,2,4,5,6,4,3,6,3]}
)
df.index = [i // 11 for i in range(len(df))] # Define new index for groupby
df['sum_A'] = df["A"].groupby(df.index).sum() # Apply sum per group
df['div_A'] = df["A"] / df['sum_A'] # Divide each row by the given sum
print(df)
# A B C D sum_A div_A
# 0 1 4 7 1 22 0.045455
# 0 2 5 8 3 22 0.090909
# 0 3 6 9 5 22 0.136364
# 0 1 4 7 1 22 0.045455
# 0 2 5 8 3 22 0.090909
# 0 3 6 9 5 22 0.136364
# 0 1 4 7 1 22 0.045455
# 0 2 5 8 3 22 0.090909
# 0 3 6 9 5 22 0.136364
# 0 2 1 4 4 22 0.090909
# 0 2 1 2 1 22 0.090909
# 1 4 1 2 7 47 0.085106
# 1 5 3 3 8 47 0.106383
# 1 6 5 2 9 47 0.127660
# 1 4 1 2 7 47 0.085106
# 1 5 3 4 8 47 0.106383
# 1 6 5 5 9 47 0.127660
# 1 4 1 6 7 47 0.085106
# 1 5 3 4 8 47 0.106383
# 1 6 5 3 9 47 0.127660
# 1 1 4 6 4 47 0.021277
# 1 1 1 3 2 47 0.021277
# 2 1 4 9 7 32 0.031250
# 2 3 5 7 8 32 0.093750
# 2 5 6 8 9 32 0.156250
# 2 1 1 9 7 32 0.031250
# 2 3 1 4 8 32 0.093750
# 2 5 1 2 9 32 0.156250
# 2 1 3 7 7 32 0.031250
# 2 3 5 8 8 32 0.093750
# 2 5 1 9 9 32 0.156250
# 2 4 3 7 4 32 0.125000
# 2 1 6 8 2 32 0.031250
# 3 7 3 9 2 78 0.089744
# 3 8 9 7 3 78 0.102564
# 3 9 7 8 2 78 0.115385
# 3 7 8 9 2 78 0.089744
# 3 8 9 4 4 78 0.102564
# 3 9 4 2 5 78 0.115385
# 3 7 2 2 6 78 0.089744
# 3 8 7 1 4 78 0.102564
# 3 9 8 3 3 78 0.115385
# 3 4 9 5 6 78 0.051282
# 3 2 2 4 3 78 0.025641
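As a side note, the per-block normalisation can also be done without replacing the index, by grouping on a computed block label and using transform (a sketch, assuming the same 11-row blocks):
import numpy as np
groups = np.arange(len(df)) // 11                 # block label for each row
cols = df[['A', 'B', 'C', 'D']]
print((cols / cols.groupby(groups).transform('sum')).round(4))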
Hope that helps!

How can I create a new column in a DataFrame that shows patterns in a different column?

My original CSV file looks like this
1, 9
2, 8
3, 9
14, 7
15, 6
19, 8
20, 9
21, 3
I grouped the table into runs of consecutive integers in column A with
for grp, val in df.groupby((df.A.diff() - 1).fillna(0).cumsum()):
    print(val)
Resulting table:
A B
1 1 9
2 2 8
3 3 9
A B
14 14 7
15 15 6
A B
19 19 8
20 20 9
21 21 3
In practice the B values are very long ID numbers, but insignificant as numbers. How can I create a new column C that will show patterns in each of the three groups by assigning a simple value to each ID, and the same simple value for each duplicate in a group?
Desired output:
A B C
1 1 9 1
2 2 8 2
3 3 9 1
A B C
14 14 7 1
15 15 6 2
A B C
19 19 8 1
20 20 9 2
21 21 3 3
Thanks
You are close
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : pd.Series(pd.factorize(x)[0]+1)).values
df
Out[105]:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
3 14 7 1
4 15 6 2
5 19 8 1
6 20 9 2
7 21 3 3
Or using category (note that category codes follow sorted order rather than order of first appearance, hence the different numbering):
df['C']=df.groupby((df.A.diff()-1).fillna(0).cumsum()).B.apply(lambda x : x.astype('category').cat.codes+1).values
df
Out[110]:
A B C
0 1 9 2
1 2 8 1
2 3 9 2
3 14 7 2
4 15 6 1
5 19 8 2
6 20 9 3
7 21 3 1
If you need a for loop:
for x, df1 in df.groupby((df.A.diff() - 1).fillna(0).cumsum()):
    print(df1.assign(C=pd.factorize(df1.B)[0] + 1))
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
Let's try:
df.columns = ['A','B']
g = df.groupby(df.A.diff().ne(1).cumsum())
df['C'] = g['B'].transform(lambda x: pd.factorize(x)[0] + 1)
for n, g in g:
    print(g)
Output:
A B C
0 1 9 1
1 2 8 2
2 3 9 1
A B C
3 14 7 1
4 15 6 2
A B C
5 19 8 1
6 20 9 2
7 21 3 3
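The key in both answers is pd.factorize, which enumerates unique values in order of first appearance within each group (hence the + 1 to start counting at 1). A minimal sketch:
import pandas as pd
codes, uniques = pd.factorize([9, 8, 9])
print(codes + 1)  # [1 2 1]
print(uniques)    # [9 8]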
Try the withColumn function, which adds a new column to the dataframe and lets you assign an index value. (Note: withColumn is a PySpark DataFrame method, not pandas.)
