Counting repeated blocks in pandas - python

I have the following dataframe and I am trying to label each block with a number based on how many blocks of the same class have been seen so far, according to the Class column. Consecutive rows with the same Class value get the same number. If a block of the same class appears again later, the number is incremented. When a block of a new class appears, its count starts at 1.
import numpy as np
import pandas as pd

df = pd.DataFrame(list(zip(range(10, 30), range(20))), columns=['a', 'b'])
df['Class'] = [np.nan, np.nan, np.nan, np.nan, 'a', 'a', 'a', 'a', np.nan, np.nan,
               'a', 'a', 'a', 'a', 'a', np.nan, np.nan, 'b', 'b', 'b']
a b Class
0 10 0 NaN
1 11 1 NaN
2 12 2 NaN
3 13 3 NaN
4 14 4 a
5 15 5 a
6 16 6 a
7 17 7 a
8 18 8 NaN
9 19 9 NaN
10 20 10 a
11 21 11 a
12 22 12 a
13 23 13 a
14 24 14 a
15 25 15 NaN
16 26 16 NaN
17 27 17 b
18 28 18 b
19 29 19 b
Sample output looks like this:
a b Class block_encounter_no
0 10 0 NaN NaN
1 11 1 NaN NaN
2 12 2 NaN NaN
3 13 3 NaN NaN
4 14 4 a 1
5 15 5 a 1
6 16 6 a 1
7 17 7 a 1
8 18 8 NaN NaN
9 19 9 NaN NaN
10 20 10 a 2
11 21 11 a 2
12 22 12 a 2
13 23 13 a 2
14 24 14 a 2
15 25 15 NaN NaN
16 26 16 NaN NaN
17 27 17 b 1
18 28 18 b 1
19 29 19 b 1

Solution with mask:
df['block_encounter_no'] = ((df.Class != df.Class.shift())
                            .mask(df.Class.isnull())
                            .groupby(df.Class)
                            .cumsum())
print(df)
a b Class block_encounter_no
0 10 0 NaN NaN
1 11 1 NaN NaN
2 12 2 NaN NaN
3 13 3 NaN NaN
4 14 4 a 1.0
5 15 5 a 1.0
6 16 6 a 1.0
7 17 7 a 1.0
8 18 8 NaN NaN
9 19 9 NaN NaN
10 20 10 a 2.0
11 21 11 a 2.0
12 22 12 a 2.0
13 23 13 a 2.0
14 24 14 a 2.0
15 25 15 NaN NaN
16 26 16 NaN NaN
17 27 17 b 1.0
18 28 18 b 1.0
19 29 19 b 1.0
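
The chained expression is easier to follow unpacked into its intermediate steps (a sketch, same df as above):

is_block_start = df.Class != df.Class.shift()               # True where Class changes from the previous row (NaN rows compare unequal too)
is_block_start = is_block_start.mask(df.Class.isnull())     # blank out the rows that have no Class
df['block_encounter_no'] = is_block_start.groupby(df.Class).cumsum()   # running count of blocks, restarted per class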

Do this:
df['block_encounter_no'] = np.where(df.Class.notnull(),
                                    (df.Class.notnull() & (df.Class != df.Class.shift())).cumsum(),
                                    np.nan)
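
Note that this np.where variant numbers the blocks consecutively across all classes, so the b block comes out as 3 rather than 1 here. If the count should restart for every class, as in the expected output, the cumulative sum can be grouped by Class instead; a sketch:

df['block_encounter_no'] = np.where(df.Class.notnull(),
                                    (df.Class != df.Class.shift()).groupby(df.Class).cumsum(),
                                    np.nan)

The np.where is then mostly optional, since grouping by Class already leaves NaN for the unlabelled rows.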

Related

Repeat and concatenate a DataFrame with constant step value increase

I have a dataframe like the following example:
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
I want to repeat the whole dataframe as if it were one block:
repeat it 3 times, with every element in each new block increased by 3 relative to the previous block.
The desired dataframe:
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
3 4 7 10 13 16 19
4 5 8 11 14 17 20
5 6 9 12 15 18 21
6 7 10 13 16 19 22
7 8 11 14 17 20 23
8 9 12 15 18 21 24
My real df is like:
0 1 2 3 4 5 6 7 8 9 10 11 12
11 CONECT 12 9 13
12 CONECT 13 12 14 15 16
13 CONECT 14 13
14 CONECT 15 13
15 CONECT 16 13 17 18 19
16 CONECT 17 16
code:
import pandas as pd
df = pd.read_csv('connect_part.txt', 'sample_file.csv', names =['A'])
df = df.A.str.split(expand=True)
df.fillna('', inplace=True)
repeats = 3
step = 3
df1 = df.set_index([0]) # add all non-numeric columns here
df2 = pd.concat([df1+i for i in range(0, len(df1)*repeats, step)]).reset_index()
print(df2)
error:
TypeError: can only concatenate str (not "int") to str
res = pd.concat([df + 3*i for i in range(3)], ignore_index=True)
Output:
>>> res
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
3 4 7 10 13 16 19
4 5 8 11 14 17 20
5 6 9 12 15 18 21
6 7 10 13 16 19 22
7 8 11 14 17 20 23
8 9 12 15 18 21 24
Setup:
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9],
'D': [10, 11, 12],
'E': [13, 14, 15],
'F': [16, 17, 18]
})
Assuming df as input, use pandas.concat:
repeats = 3
step = 3
df2 = pd.concat([df + i for i in range(0, len(df) * repeats, step)],
                ignore_index=True)
output:
A B C D E F
0 1 4 7 10 13 16
1 2 5 8 11 14 17
2 3 6 9 12 15 18
3 4 7 10 13 16 19
4 5 8 11 14 17 20
5 6 9 12 15 18 21
6 7 10 13 16 19 22
7 8 11 14 17 20 23
8 9 12 15 18 21 24
update: non-numeric columns:
repeats = 3
step = 3
df1 = df.set_index([0]) # add all non-numeric columns here
df2 = pd.concat([df1+i for i in range(0, len(df1)*repeats, step)]).reset_index()
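
With the real data above, the columns produced by str.split are still strings, which is what raises the TypeError once an integer is added. A sketch of one way around that, assuming the CONECT label is the only non-numeric column (empty strings become NaN):

repeats = 3
step = 3
df1 = df.set_index([0])                             # keep the non-numeric CONECT column as the index
df1 = df1.apply(pd.to_numeric, errors='coerce')     # the split produced strings; convert them to numbers
df2 = pd.concat([df1 + i for i in range(0, len(df1) * repeats, step)]).reset_index()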

How to change the default number of top and bottom row

By default, pandas shows the top and bottom 5 rows of a dataframe in Jupyter when there are too many rows to display:
>>> df.shape
(100, 4)
    col0  col1  col2  col3
0      7    17    15     2
1      6     5     5    12
2     10    15     5    15
3      6    19    19    14
4     12     7     4    12
..   ...   ...   ...   ...
95     2    14     8    16
96     8     8     5    16
97     6     8     9     1
98     1     5    10    15
99    15     9     1    18
I know that this setting exists:
pd.set_option("display.max_rows", 20)
however, that yields the same result. Using df.head(10) and df.tail(10) in two consecutive cells is an option, but less clean; the same goes for concatenation. Is there another pandas setting like display.max_rows for this default view? How can I expand it to, say, the top and bottom 10 rows?
IIUC, use display.min_rows:
pd.set_option("display.min_rows", 20)
print(df)
# Output:
0 1 2 3
0 18 8 12 2
1 2 13 13 14
2 8 7 9 2
3 17 19 9 3
4 14 18 12 3
5 11 5 9 18
6 4 5 12 3
7 12 8 2 7
8 11 2 14 13
9 6 6 3 6
.. .. .. .. ..
90 8 2 1 9
91 7 19 4 6
92 4 3 17 12
93 19 6 5 18
94 3 5 15 5
95 16 3 13 13
96 11 3 18 8
97 1 9 18 4
98 13 10 18 15
99 16 3 5 9
[100 rows x 4 columns]
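
display.min_rows only takes effect once the frame is actually truncated, i.e. once it exceeds display.max_rows. A small sketch combining both options temporarily (the random data and column names are only there to make the snippet self-contained):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 20, size=(100, 4)),
                  columns=[f"col{i}" for i in range(4)])

# show 10 rows from the top and 10 from the bottom while the block runs
with pd.option_context("display.max_rows", 20, "display.min_rows", 20):
    print(df)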

For a column in pandas dataframe, calculate mean of column values in previous 4th, 8th and 12th row from the present row?

In a pandas dataframe, I want to create a new column that holds the average of the values 4, 8 and 12 rows before the present row.
As shown in the table below, for row number 13 :
Value in Existing column that is 4 rows before row 13 (row 9) = 4
Value in Existing column that is 8 rows before row 13 (row 5) = 6
Value in Existing column that is 12 rows before row 13 (row 1) = 2
The average of 4, 6 and 2 is 4, hence New column = 4 at row number 13. For the remaining rows 1-12, New column = NaN.
I have more rows in my df, but I added only the first 13 rows here for illustration.
Row number  Existing column  New column
1           2                NaN
2           4                NaN
3           3                NaN
4           1                NaN
5           6                NaN
6           4                NaN
7           8                NaN
8           2                NaN
9           4                NaN
10          9                NaN
11          2                NaN
12          4                NaN
13          3                3
.shift() is the missing piece. We can use it to access values from earlier rows relative to the current row in a pandas dataframe.
Let's use .groupby(), .apply() and .shift() as follows:
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
Here, rows are partitioned into groups of 13 by assigning each row a group number computed as (df['Row number'] - 1) // 13.
Then within each group, we use .apply() on the column Existing column and use .shift() to get the previous 4th, 8th and 12th entries within the group.
Test Run
import numpy as np
import pandas as pd

data = {'Row number': np.arange(1, 40), 'Existing column': np.arange(11, 50)}
df = pd.DataFrame(data)
print(df)
Row number Existing column
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
10 11 21
11 12 22
12 13 23
13 14 24
14 15 25
15 16 26
16 17 27
17 18 28
18 19 29
19 20 30
20 21 31
21 22 32
22 23 33
23 24 34
24 25 35
25 26 36
26 27 37
27 28 38
28 29 39
29 30 40
30 31 41
31 32 42
32 33 43
33 34 44
34 35 45
35 36 46
36 37 47
37 38 48
38 39 49
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
print(df)
Row number Existing column New column
0 1 11 NaN
1 2 12 NaN
2 3 13 NaN
3 4 14 NaN
4 5 15 NaN
5 6 16 NaN
6 7 17 NaN
7 8 18 NaN
8 9 19 NaN
9 10 20 NaN
10 11 21 NaN
11 12 22 NaN
12 13 23 15.0
13 14 24 NaN
14 15 25 NaN
15 16 26 NaN
16 17 27 NaN
17 18 28 NaN
18 19 29 NaN
19 20 30 NaN
20 21 31 NaN
21 22 32 NaN
22 23 33 NaN
23 24 34 NaN
24 25 35 NaN
25 26 36 28.0
26 27 37 NaN
27 28 38 NaN
28 29 39 NaN
29 30 40 NaN
30 31 41 NaN
31 32 42 NaN
32 33 43 NaN
33 34 44 NaN
34 35 45 NaN
35 36 46 NaN
36 37 47 NaN
37 38 48 NaN
38 39 49 41.0
You can use rolling with .apply to apply a custom aggregation function.
The average of (4,6,2) is 4, not 3
>>> (2 + 6 + 4) / 3
4.0
>>> df["New column"] = df["Existing column"].rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down:
df["Existing column"]: select "Existing column" from the dataframe
.rolling(13): starting with the first 13 rows, we move a sliding window across all of the data. So first we see rows 0-12, then rows 1-13, then 2-14, and so on.
.apply(...): for each of those rolling windows, we apply a function to that section (in this case the lambda below).
lambda x: x.iloc[[0, 4, 8]].mean(): from each rolling window, extract the 0th, 4th, and 8th values (corresponding to rows 1, 5 and 9 in the first window) and return their mean.
In order to work on your dataframe in chunks (or groups) instead of a sliding window, you can apply the same logic with the .groupby method (instead of .rolling).
>>> groups = np.arange(len(df)) // 13 # defines groups as chunks of 13 rows
>>> averages = (
...     df.groupby(groups)["Existing column"]
...     .apply(lambda x: x.iloc[[0, 4, 8]].mean())
... )
>>> averages.index = (averages.index + 1) * 13 - 1
>>> df["New column"] = averages
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down now:
groups = np.arange(len(df)) // 13: creates an array that will be used to chunk our dataframe into groups. This array will essentially be 13 0s, followed by 13 1s, followed by 13 2s... until the array is the same length as the dataframe. In this single-chunk example it is simply an array of 13 0s.
df.groupby(groups)["Existing column"]: group the dataframe according to the groups defined above and select "Existing column".
.apply(lambda x: x.iloc[[0, 4, 8]].mean()): Conceptually the same as before, except we're applying to each grouping instead of a sliding window.
averages.index = (averages.index + 1) * 13 - 1: this part may seem a little odd, but it ensures that the selected averages line up with the original dataset correctly. We want the average from group 0 (index value 0 in the averages Series) to align to row 12; if we had another group (group 1), we would want it to align to row 25 in the original dataset. A little index arithmetic does this transformation.
df["New column"] = averages: since we already matched up our indices, pandas takes care of the actual alignment of these new values under the hood for us.

Multi column filtering in python data frame

I have created a pandas dataframe. I want to keep only the rows that contain all of the values 9, 12, 18 and 24.
df:
index no1 no2 no3 no4 no5 no6 no7
1 9 11 12 14 18 24 30
2 9 12 13 18 19 24 31
3 9 12 13 42 20 19 24
4 10 9 13 42 18 24 12
5 13 12 13 44 18 24 30
6 2 9 12 18 24 31 44
7 10 12 14 42 18 24 30
8 10 12 14 42 18 24 31
Code:
a = df['no1'].isin([9, 12, 18, 24])
b = df['no2'].isin([9, 12, 18, 24])
c = df['no3'].isin([9, 12, 18, 24])
d = df['no4'].isin([9, 12, 18, 24])
e = df['no5'].isin([9, 12, 18, 24])
f = df['no6'].isin([9, 12, 18, 24])
g = df['no7'].isin([9, 12, 18, 24])
df[a & b & c & d & e & f & g]
Desired output:
index no1 no2 no3 no4 no5 no6 no7
1 9 11 12 14 18 24 30
2 9 12 13 18 19 24 31
4 10 9 13 42 18 24 12
6 2 9 12 18 24 31 44
Try:
df[df.isin([9,12,18,24])]
This should give you the exact answer
df = pd.DataFrame({'no1': [9, 9, 9, 10, 13, 2, 10, 10],
                   'no2': [11, 12, 12, 9, 12, 9, 12, 12],
                   'no3': [12, 13, 13, 13, 13, 12, 14, 14],
                   'no4': [14, 18, 42, 42, 44, 18, 42, 42],
                   'no5': [18, 19, 20, 18, 18, 24, 18, 18],
                   'no6': [24, 24, 19, 24, 24, 31, 24, 24],
                   'no7': [30, 31, 24, 12, 30, 44, 30, 31]})  # creating the data frame
df_new = df[df.isin([9, 12, 18, 24])]
df_new = df_new.dropna(thresh=4)
df_new = df_new.fillna(df)
The result would be:
no1 no2 no3 no4 no5 no6 no7
0 9.0 11.0 12.0 14.0 18.0 24.0 30.0
1 9.0 12.0 13.0 18.0 19.0 24.0 31.0
3 10.0 9.0 13.0 42.0 18.0 24.0 12.0
5 2.0 9.0 12.0 18.0 24.0 31.0 44.0
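
If the goal is simply to keep the rows that contain every one of the four values at least once (which is what the desired output above shows), the mask can also be built directly; a sketch:

wanted = [9, 12, 18, 24]
mask = pd.concat([(df == v).any(axis=1) for v in wanted], axis=1).all(axis=1)
df_new = df[mask]

Unlike the isin/dropna/fillna route, this keeps the original integer dtypes.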

Want to replace value in one column from another column based on specific condition in python

I want to replace ABC with the value from the next column in that row. If ABC lies in the last column, the value should come from the previous column instead. If NaN is present, it should not be used as a replacement; in fact, the operation should only run up to the NaN, if one is present.
Dataframe is as given below:
C1 C2 C3 ……. C47 C48 C49 C50
1 0 ABC 15 ……. 29 ABC 90 50
2 ABC ABC 7 ……. 26 10 ABC 30
3 ABC ABC ABC ……. ABC ABC ABC ABC
4 6 20 32 ……. 18 44 ABC ABC
5 2 ABC 24 ……. 16 27 29 ABC
6 23 4 49 ……. 11 52 33 9
7 17 12 2 ……. ABC 31 nan nan
8 ABC nan nan ……. nan nan nan nan
9 34 36 2 ……. 19 ABC nan nan
output should be:
C1 C2 C3 ……. C47 C48 C49 C50
1 0 15 15 ……. 29 90 90 50
2 7 7 7 ……. 26 10 30 30
3 0 0 0 ……. 0 0 0 0
4 6 20 32 ……. 18 44 44 44
5 2 24 24 ……. 16 27 29 29
6 23 4 49 ……. 11 52 33 9
7 17 12 2 ……. 31 31 nan nan
8 0 nan nan ……. nan nan nan nan
9 34 36 2 ……. 19 19 nan nan
Please note that ABC becomes 0 only when there is no value present in the rest of the columns for that particular row.
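
One way to get that output, sketched under the assumption that df holds the frame shown above with the literal string 'ABC' in it: treat ABC as a gap, pull the value from the next column (falling back to the previous one), and restore the original NaNs at the end:

import numpy as np
import pandas as pd

vals = df.replace('ABC', np.nan)                   # treat ABC as a gap
vals = vals.apply(pd.to_numeric, errors='coerce')  # in case the numbers were read in as strings
filled = vals.bfill(axis=1).ffill(axis=1)          # next column first, previous column as fallback
filled = filled.fillna(0)                          # rows with nothing left at all become 0
out = filled.mask(df.isna())                       # but keep the cells that were originally NaN as NaN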
