How to create a counter based on another column? - python

I've created this data frame -
import numpy as np
import pandas as pd
Range = np.arange(0,9,1)
A={
0:2,
1:2,
2:2,
3:2,
4:3,
5:3,
6:3,
7:2,
8:2
}
Table = pd.DataFrame({"Row": Range})
Table["Intervals"]=(Table["Row"]%9).map(A)
Table
Row Intervals
0 0 2
1 1 2
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 2
8 8 2
I'd like to create another column, based on the Intervals column, that acts as a kind of counter, so its values would be 1,2,1,2,1,2,3,1,2.
The logic is that I want to count by the value of the intervals column.
I've tried to use group by but the issue is that the values are displayed multiple times.
Logic:
We have 2 different values, 2 and 3. Each value occurs in the Intervals column in runs whose length equals the value itself: 2 occurs twice in a row (2,2), and 3 occurs three times (3,3,3).
For the first 4 rows, the value 2 is displayed twice - that is why the new column should be 1,2 (counter of the first 2) and then again 1,2 (counter of the second 2).
Afterward, there is 3, so the values are 1,2,3.
And then once again 2, so the values are 1,2.
Hope I managed to explain myself.
Thanks in advance!

You can use groupby.cumcount combined with mod:
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
Table['Counter'] = Table.groupby(group).cumcount().mod(Table['Intervals']).add(1)
Or:
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
Table['Counter'] = (Table.groupby(group)['Intervals']
                         .transform(lambda s: np.arange(len(s))%s.iloc[0]+1)
                   )
Output:
Row Intervals Counter
0 0 2 1
1 1 2 2
2 2 2 1
3 3 2 2
4 4 3 1
5 5 3 2
6 6 3 3
7 7 2 1
8 8 2 2
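The first approach above can be broken into its intermediate steps. This is a minimal sketch that rebuilds the sample data directly; the names table, run_id and pos are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"Row": np.arange(9),
                      "Intervals": [2, 2, 2, 2, 3, 3, 3, 2, 2]})

# Label each run of consecutive equal Intervals values: 1,1,1,1,2,2,2,3,3
run_id = table["Intervals"].ne(table["Intervals"].shift()).cumsum()

# Zero-based position of each row within its run: 0,1,2,3,0,1,2,0,1
pos = table.groupby(run_id).cumcount()

# Wrap the position at the interval length, then shift to a 1-based counter
table["Counter"] = pos.mod(table["Intervals"]).add(1)
print(table)
```

The mod step is what restarts the counter inside the first run of four 2s, which groupby on the run alone would count as 1,2,3,4.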

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the leftover rows are incorporated into the last group. So in this example the last group at the end of the dataframe has 4 rows.
Thanking you in advance!
I do not know of any straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not care whether the number of rows is divisible by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n)['col1'].transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows form their own group, instead of being grouped with the 3 previous values.
A way around this is to look at the .count() of elements in each group generated by groupby, and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# if the last size is not equal to n
if last_batch.values[0][0] !=n:
    # Number of rows covered by the last full group plus the leftover rows
    last_group = last_batch+n
    A[-last_group.values[0][0]:]=min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
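An alternative that avoids patching the result afterwards is to fold the leftover rows into the last full group before grouping. This is a sketch under that assumption, not the answer's method; chunk, last_full and group_id are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]})
n = 3

# Plain chunk ids by row position: 0,0,0,1,1,1,2,2,2,3,3
chunk = np.arange(len(df)) // n

# Highest id that belongs to a full chunk (0 if there is no full chunk)
last_full = max(len(df) // n - 1, 0)

# Cap the ids so the short tail joins the last full chunk
group_id = np.minimum(chunk, last_full)

df["new_col"] = df.groupby(group_id)["col1"].transform("min")
print(df)
```

Because the ids are built from row positions rather than df.index, this does not depend on the index being a default RangeIndex.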

Selecting pandas dataframe rows with same pair of two column values and different on third for certain number of counts

I have a pandas dataframe of two variables (Begin and End) for three replicates (R1, R2, R3) each of Control (C) and Treatment (T):
Begin End Expt
2 5 C_R1
2 5 C_R2
2 5 C_R3
2 5 T_R1
2 5 T_R2
2 5 T_R3
4 7 C_R2
4 7 C_R3
4 7 T_R1
4 7 T_R2
4 7 T_R3
I want to keep only those rows for which all three replicates of both control and treatment, six in total, were observed, i.e. (Begin, End: 2, 5) and not (Begin, End: 4, 7), which has only five observations because C_R1 is missing.
I've gone through some posts here and tried the following, which works for a small sample, but I still have to test it with the real data, which has around 50K rows:
my_df[my_df.groupby(["Begin", "End"])['Expt'].transform('nunique') == 6]
Please let me know if this is OK or if any better technique exists.
Thanks
import numpy as np
# Keep groups in which every condition prefix (C, T) appears exactly 3 times
df[df.groupby(['Begin', 'End'])['Expt']
     .transform(lambda x: (np.unique(x.str.split('_').str[0], return_counts = True)[1] == 3).all())]
Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3
df2 = df1[df1.groupby(['Begin','End'])['Expt'].transform('nunique') == 6]
df2
Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3
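For reference, here is a self-contained version of the nunique filter run on the sample data from the question. The intermediate name n_unique is introduced here for illustration; the check assumes that six distinct Expt labels per (Begin, End) pair means all replicates are present.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Begin": [2] * 6 + [4] * 5,
    "End": [5] * 6 + [7] * 5,
    "Expt": ["C_R1", "C_R2", "C_R3", "T_R1", "T_R2", "T_R3",
             "C_R2", "C_R3", "T_R1", "T_R2", "T_R3"],
})

# Number of distinct Expt labels in each (Begin, End) group,
# broadcast back to every row of that group
n_unique = df1.groupby(["Begin", "End"])["Expt"].transform("nunique")

# Keep only rows whose group has all six labels
df2 = df1[n_unique == 6]
print(df2)
```

Since transform('nunique') is a built-in aggregation rather than a Python lambda, it is likely to scale comfortably to the 50K rows mentioned in the question.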

How to explode aggregated pandas column

I have a df that looks like this:
df
time score
83623 4
83624 3
83625 3
83629 2
83633 1
I want to explode df.time so that the gaps are filled with time values incrementing by 1, and the df.score value of the preceding row is duplicated for each added row. See the example below:
time score
83623 4
83624 3
83625 3
83626 3
83627 3
83628 3
83629 2
83630 2
83631 2
83632 2
83633 1
From your sample, I assume df.time is an integer column. You may try it this way:
df_final = df.set_index('time').reindex(range(df.time.min(), df.time.max()+1),
                                        method='pad').reset_index()
Out[89]:
time score
0 83623 4
1 83624 3
2 83625 3
3 83626 3
4 83627 3
5 83628 3
6 83629 2
7 83630 2
8 83631 2
9 83632 2
10 83633 1
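The reindex approach can be run end to end on the sample data as follows. This is a minimal sketch; full_range is a name introduced here, and method='pad' relies on the time values being sorted in increasing order, as they are in the sample.

```python
import pandas as pd

df = pd.DataFrame({"time": [83623, 83624, 83625, 83629, 83633],
                   "score": [4, 3, 3, 2, 1]})

# Every integer time value between the first and last observation
full_range = range(df["time"].min(), df["time"].max() + 1)

# Reindex on time; method="pad" forward-fills score into the new rows
df_final = (df.set_index("time")
              .reindex(full_range, method="pad")
              .reset_index())
print(df_final)
```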

Pandas enumerate groups in descending order

I have the following column:
column
0 10
1 10
2 8
3 8
4 6
5 6
My goal is to find the total number of unique values (3 in this case) and create a new column like the following:
new_column
0 3
1 3
2 2
3 2
4 1
5 1
The numbering starts from the number of unique values (3), and the same number is repeated while the current row has the same value as the previous row in the original column. The number decreases by one each time the row value changes. All unique values in the original column have the same number of rows (2 rows for each unique value in this case).
My solution was to group by the original column and build a new list like below:
i=1
new_time=[]
for j, v in df.groupby('column'):
    new_time.append([i]*2)
    i=i+1
Then I'd flatten the list and sort it in decreasing order. Is there a simpler solution?
Thanks.
pd.factorize
i, u = pd.factorize(df.column)
df.assign(new=len(u) - i)
column new
0 10 3
1 10 3
2 8 2
3 8 2
4 6 1
5 6 1
dict.setdefault
d = {}
for k in df.column:
    d.setdefault(k, len(d))
df.assign(new=len(d) - df.column.map(d))
Use GroupBy.ngroup with ascending=False:
df.groupby('column', sort=False).ngroup(ascending=False)+1
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
For a DataFrame that looks like this,
df = pd.DataFrame({'column': [10, 10, 8, 8, 10, 10]})
where only consecutive values are to be grouped, you'll need to modify your grouper:
(df.groupby(df['column'].ne(df['column'].shift()).cumsum(), sort=False)
   .ngroup(ascending=False)
   .add(1))
0 3
1 3
2 2
3 2
4 1
5 1
dtype: int64
Actually, we can use rank with method='dense', i.e.
dense: like ‘min’, but rank always increases by 1 between groups
df['column'].rank(method='dense')
0 3.0
1 3.0
2 2.0
3 2.0
4 1.0
5 1.0
The rank version of @cs95's solution would be
df['column'].ne(df['column'].shift()).cumsum().rank(method='dense',ascending=False)
Try unique with map:
df.column.map(dict(zip(df.column.unique(),reversed(range(df.column.nunique())))))+1
Out[350]:
0 3
1 3
2 2
3 2
4 1
5 1
Name: column, dtype: int64
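IIUC, you want the group ID of consecutive same-value groups in reversed order. If so, I think this should work too, provided each value forms a single run as in the sample (otherwise nunique undercounts the number of runs):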
df.column.nunique() - df.column.ne(df.column.shift()).cumsum().sub(1)
Out[691]:
0 3
1 3
2 2
3 2
4 1
5 1
Name: column, dtype: int32
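The answers above fall into two families: numbering by distinct value and numbering by consecutive run. The sketch below contrasts them on the [10, 10, 8, 8, 10, 10] example, where the two give different results. The names by_value, run_id and by_run are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"column": [10, 10, 8, 8, 10, 10]})

# Value-based numbering: every occurrence of 10 gets the same label
codes, uniques = pd.factorize(df["column"])
by_value = len(uniques) - codes

# Run-based numbering: each consecutive run gets its own label,
# counting down from the number of runs
run_id = df["column"].ne(df["column"].shift()).cumsum()
by_run = run_id.max() - run_id + 1

print(by_value.tolist())  # [2, 2, 1, 1, 2, 2]
print(by_run.tolist())    # [3, 3, 2, 2, 1, 1]
```

On the question's own data (10, 10, 8, 8, 6, 6) both variants produce 3, 3, 2, 2, 1, 1, so the choice only matters when a value can reappear after a different one.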

Set value to slice of a Pandas dataframe

I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
There is no problem with the first line, but nothing happens when I run the second one (not even an error). What should I do? (I also tried the update function, but it didn't work either.)
I believe you need to assign to the filtered DataFrame after converting the sorted rows to a NumPy array with .values, to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
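The same idea as a compact, self-contained sketch. It uses to_numpy(), which serves the same purpose as .values here; sorted_block is a name introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 3, 2, 1, 4, 1, 2]})
i, j = 2, 7

# Sorting keeps the original index labels attached to each row
sorted_block = df.iloc[i:j].sort_values(by="A")

# to_numpy() drops those labels, so the values are written by position
df.iloc[i:j] = sorted_block.to_numpy()
print(df)
```

Assigning sorted_block directly would let pandas match rows by index label, which puts every value back where it started; that is consistent with the "nothing happens" behavior described in the question.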
