Python: create new column conditionally on values from two other columns - python

I would like to combine two columns into a new column.
Let's suppose I have:
Index A B
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 2
6 1 2
7 1 2
8 1 2
9 1 2
10 1 2
Now I would like to create a column C with the entries from A from Index 0 to 4 and from column B from Index 5 to 10. It should look like this:
Index A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
Is there Python code that does this? Thanks in advance!

If Index is an actual column, you can use numpy.where and specify your condition:
import numpy as np
df['C'] = np.where(df['Index'] <= 4, df['A'], df['B'])
Index A B C
0 0 1 0 1
1 1 1 0 1
2 2 1 0 1
3 3 1 0 1
4 4 1 0 1
5 5 1 2 2
6 6 1 2 2
7 7 1 2 2
8 8 1 2 2
9 9 1 2 2
10 10 1 2 2
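For a self-contained check of the numpy.where approach, here is a minimal sketch that rebuilds the sample data from the question. The frame construction is an assumption for illustration; only the np.where line comes from the answer above.

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample data with Index as a regular column.
df = pd.DataFrame({
    "Index": range(11),
    "A": [1] * 11,
    "B": [0] * 5 + [2] * 6,
})

# Take A where Index <= 4, otherwise take B.
df["C"] = np.where(df["Index"] <= 4, df["A"], df["B"])
print(df)
```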

If Index is the DataFrame's actual index, you can slice by position with iloc and build the column with concat:
import pandas as pd
df['C'] = pd.concat([df['A'].iloc[:5], df['B'].iloc[5:]])
print(df)
A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
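When Index is the real index, a boolean mask on df.index is another option. This is a sketch rather than part of the original answer; it assumes the default integer index 0..10 and relies on Series.where, which keeps values where the condition holds and takes the replacement elsewhere.

```python
import pandas as pd

# Sample data from the question, using the default RangeIndex 0..10.
df = pd.DataFrame({"A": [1] * 11, "B": [0] * 5 + [2] * 6})

# Keep A for index labels <= 4, fall back to B for the rest.
df["C"] = df["A"].where(df.index <= 4, df["B"])
print(df)
```

Unlike the iloc slices, this selects by index label, so it does not depend on the rows being in any particular position.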

Related

Pandas group consecutive and label the length

I want to label runs of consecutive values with their length. My data:
a
---
1
0
1
0
1
1
1
0
1
1
I want:
a | c
--------
1 1
0 0
1 2
1 2
0 0
1 3
1 3
1 3
0 0
1 2
1 2
Then I can calculate the mean of column "b" grouped by "c". I tried shift, cumsum, and cumcount, but none of them worked.
Use GroupBy.transform on the groups of consecutive values, then set the result to 0 wherever a is not 1:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
              .transform('size')
              .where(df.a.eq(1), 0))
print(df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
If the column contains only the values 0 and 1, you can multiply by a instead:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
              .transform('size')
              .mul(df.a))
print(df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
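The asker's stated goal was the mean of b per label. A minimal sketch of that last step, assuming the a and b values shown in the printed output above (the names runs and mean_b are illustrative):

```python
import pandas as pd

# a and b as shown in the printed frame above.
df = pd.DataFrame({
    "a": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    "b": [1, 2, 3, 2, 1, 3, 1, 3, 2, 2, 1],
})

# A new run starts wherever a differs from the previous row.
runs = df["a"].ne(df["a"].shift()).cumsum()

# Label each row with its run length, forcing 0 where a is not 1.
df["c1"] = df.groupby(runs)["a"].transform("size").where(df["a"].eq(1), 0)

# Mean of b for each run-length label.
mean_b = df.groupby("c1")["b"].mean()
print(mean_b)
```

Note that separate runs of the same length share a label, so their rows are averaged together.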

Duplicate a selected row and put the duplicate just below in a Pandas DataFrame

I have a Pandas dataframe like this :
id A B
0 2 2
1 1 1
2 3 3
3 7 7
And I want to duplicate the first row 3 times just below the selected row :
id A B
0 2 2
1 2 2
2 2 2
3 2 2
4 1 1
5 3 3
6 7 7
Is there an existing method for this in the pandas library?
There is no built-in method for doing just this. However, you can create a list of indexes, and use df.loc + df.index.repeat:
new_df = df.loc[df.index.repeat([4] + [1] * (len(df) - 1))].reset_index(drop=True)
Output:
>>> new_df
id A B
0 0 2 2
1 0 2 2
2 0 2 2
3 0 2 2
4 1 1 1
5 2 3 3
6 3 7 7
Use reindex and Index.repeat to create your dataframe:
>>> df.reindex(df.index.repeat([3] + [1] * (len(df) - 1)))
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Another way:
>>> df.loc[[df.index[0]]*3 + df.index[1:].tolist()]
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
A more generalized way proposed by @MuhammadHassan:
row_index = 0
repeat_time = 3
>>> df.reindex(df.index.tolist() + [row_index]*repeat_time).sort_index()
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Let us try
n=3
row = 0
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, df.loc[[row]*(n-1)]]).sort_index()
df
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
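The variants above differ in whether the repeat count includes the original row. A small helper makes that explicit. This is a sketch under stated assumptions: the function name duplicate_row and its parameters are hypothetical, and it addresses the row by integer position rather than by label.

```python
import numpy as np
import pandas as pd

def duplicate_row(df, position, times):
    """Return a copy of df with the row at integer `position`
    followed by `times` extra copies of itself."""
    counts = np.ones(len(df), dtype=int)  # every row appears once...
    counts[position] += times             # ...except the selected one
    positions = np.repeat(np.arange(len(df)), counts)
    return df.iloc[positions].reset_index(drop=True)

df = pd.DataFrame({"A": [2, 1, 3, 7], "B": [2, 1, 3, 7]})
new_df = duplicate_row(df, position=0, times=3)
print(new_df)
```

Using iloc with positions avoids surprises when the index labels are not unique.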

How to keep track of how many times a unique condition occurs

I have a df that looks like this:
time val
0 1
1 1
2 2
3 3
4 1
5 2
6 3
7 3
8 3
9 3
10 1
11 1
How do I create new columns that hold the number of times a condition occurs and does not change? In this case, I want to create a column for each unique value in val that holds the cumulative count of occurrences at the given row, but does not increment while the value stays the same.
Expected outcome below:
time val sum_1 sum_2 sum_3
0 1 1 0 0
1 1 1 0 0
2 2 1 1 0
3 3 1 1 1
4 1 2 1 1
5 2 2 2 1
6 3 2 2 2
7 3 2 2 2
8 3 2 2 2
9 3 2 2 2
10 1 3 2 2
11 1 3 2 2
EDIT
To be more specific with the condition:
I want to count the number of times a unique value appears in val. For example, using the code below, I could get this result:
df['sum_1'] = (df['val'] == 1).cumsum()
df['sum_2'] = (df['val'] == 2).cumsum()
df['sum_3'] = (df['val'] == 3).cumsum()
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 2 0 0
2 2 2 2 1 0
3 3 3 2 1 1
4 4 1 3 1 1
5 5 2 3 2 1
However, this code counts EVERY occurrence of a condition. For example, val shows 1 occurring 3 times total. However, I want to treat consecutive occurrences of 1 as a single group, counting only the number of consecutive groupings that occur. In the example above, 1 occurs 3 times in total, but only 2 times as a consecutive grouping.
You can build a mask that marks the first row of each run of consecutive values by comparing the column with its shifted self (Series.ne with Series.shift), combine it with the per-value test using & (element-wise AND), and loop over all unique values of val:
uniq = df['val'].unique()
m = df['val'].ne(df['val'].shift())
for c in uniq:
    df[f'sum_{c}'] = (df['val'].eq(c) & m).cumsum()
print(df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
For better performance (I hope), here is a NumPy alternative:
a = df['val'].to_numpy()
uniq = np.unique(a)
m = np.concatenate(([False], a[:-1])) != a
arr = np.cumsum((a[:, None] == uniq) & m[:, None], axis=0)
df = df.join(pd.DataFrame(arr, index=df.index, columns=uniq).add_prefix('sum_'))
print (df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
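One caveat with the NumPy mask above: prepending False means the first element is effectively compared against 0, so a series that starts with 0 would not have its first run counted. Below is a sketch of a variant that always treats the first row as the start of a run. The helper name run_counts is hypothetical, and the input is assumed to be non-empty.

```python
import numpy as np
import pandas as pd

def run_counts(values):
    """Cumulative number of runs seen so far, one column per unique value.
    Assumes `values` is non-empty."""
    a = np.asarray(values)
    uniq = np.unique(a)
    # The first row always starts a run; later rows do when the value changes.
    starts = np.concatenate(([True], a[1:] != a[:-1]))
    # Broadcast to (n_rows, n_unique): a run of value u starts at row i.
    counts = np.cumsum((a[:, None] == uniq) & starts[:, None], axis=0)
    return pd.DataFrame(counts, columns=uniq).add_prefix("sum_")

out = run_counts([1, 1, 2, 3, 1, 2, 3, 3, 3, 3, 1, 1])
zero_first = run_counts([0, 0, 1, 0])
print(out)
```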

Pandas: find duplicates in another dataframe based on a subset

Assume DF 1:
A B C
0 1 1 1
1 1 1 2
2 2 1 1
3 1 9 0
4 9 9 9
And DF 2
A B C
0 6 1 1
1 1 1 2
2 2 1 1
3 1 9 0
4 1 9 6
I would like to add a column to DF 1 with a count of duplicates in DF 2 based on a subset of columns:
For example
Duplicate on
1
2
Result:
A B C Dupe
0 1 1 1 1
1 1 1 2 1
2 2 1 1 1
3 1 9 0 2
4 9 9 9 0
Sounds like you should group df2 by the key columns, count the rows in each group, and then merge the counts into df1:
df=df1.merge(df2.groupby(['A','B']).size().to_frame('DUP').reset_index(),how='left').fillna(0)
A B C DUP
0 1 1 1 1.0
1 1 1 2 1.0
2 2 1 1 1.0
3 1 9 0 2.0
4 9 9 9 0.0
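The one-liner can be unpacked into steps. The following sketch rebuilds both frames from the question and assumes, as the answer does, that the subset is columns A and B; it also casts the count back to integers, since the left merge introduces NaN and therefore floats.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 1, 2, 1, 9],
                    "B": [1, 1, 1, 9, 9],
                    "C": [1, 2, 1, 0, 9]})
df2 = pd.DataFrame({"A": [6, 1, 2, 1, 1],
                    "B": [1, 1, 1, 9, 9],
                    "C": [1, 2, 1, 0, 6]})

keys = ["A", "B"]  # subset of columns that defines a duplicate

# Number of rows in df2 for each (A, B) combination.
counts = df2.groupby(keys).size().rename("Dupe").reset_index()

# A left merge keeps every row of df1; unmatched rows get NaN, filled with 0.
out = df1.merge(counts, on=keys, how="left")
out["Dupe"] = out["Dupe"].fillna(0).astype(int)
print(out)
```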

Python: Replace a cell value in Dataframe with if statement

I have a matrix that looks like this:
com 0 1 2 3 4 5
AAA 0 5 0 4 2 1 4
ABC 0 9 8 9 1 0 3
ADE 1 4 3 5 1 0 1
BCD 1 6 7 8 3 4 1
BCF 2 3 4 2 1 3 0 ...
Here AAA, ABC, ... are the dataframe index, and the dataframe columns are com 0 1 2 3 4 5.
I want to set a cell to 0 when the row's value of com equals the column's "number". For instance, the above matrix would become:
com 0 1 2 3 4 5
AAA 0 0 0 4 2 1 4
ABC 0 0 8 9 1 0 3
ADE 1 4 0 5 1 0 1
BCD 1 6 0 8 3 4 1
BCF 2 3 4 0 1 3 0 ...
I tried iterating over the rows and using both .loc and .ix, but without success.
This just requires a NumPy trick:
In [22]:
print(df)
0 1 2 3 4 5
0 5 0 4 2 1 4
0 9 8 9 1 0 3
1 4 3 5 1 0 1
1 6 7 8 3 4 1
2 3 4 2 1 3 0
[5 rows x 6 columns]
In [23]:
# Build a masking matrix: 0 where the column label equals the index value, 1 elsewhere (a vectorized form of "if TRUE then 0, else 1")
print(df * np.where(df.columns.values == df.index.values[..., np.newaxis], 0, 1))
0 1 2 3 4 5
0 0 0 4 2 1 4
0 0 8 9 1 0 3
1 4 0 5 1 0 1
1 6 0 8 3 4 1
2 3 4 0 1 3 0
[5 rows x 6 columns]
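The same broadcasting idea can be applied while keeping com as a regular column, as in the question's layout. This is a sketch under that assumption; the names value_cols, mask, and out are illustrative.

```python
import numpy as np
import pandas as pd

rows = [[0, 5, 0, 4, 2, 1, 4],
        [0, 9, 8, 9, 1, 0, 3],
        [1, 4, 3, 5, 1, 0, 1],
        [1, 6, 7, 8, 3, 4, 1],
        [2, 3, 4, 2, 1, 3, 0]]
df = pd.DataFrame(rows,
                  index=["AAA", "ABC", "ADE", "BCD", "BCF"],
                  columns=["com", 0, 1, 2, 3, 4, 5])

value_cols = [0, 1, 2, 3, 4, 5]

# Compare each row's com (as a column vector) against the column labels;
# broadcasting yields a (rows x columns) boolean mask.
mask = df["com"].to_numpy()[:, None] == np.array(value_cols)

out = df.copy()
# Zero the cells where the mask is True and keep the others.
out[value_cols] = np.where(mask, 0, df[value_cols].to_numpy())
print(out)
```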
I think this should work.
for line in range(len(matrix)):
    # the first value of the row (plus 1) selects which cell of that row to zero
    matrix[line][matrix[line][0] + 1] = 0
NOTE
Depending on your matrix setup you may not need the +1
Basically it takes the first digit of each line in the matrix and uses that as the index of the value to change to 0
i.e. if the row was
c 0 1 2 3 4 5
AAA 4 3 2 3 9 5 9,
it would change the 5 below the number 4 to 0
c 0 1 2 3 4 5
AAA 4 3 2 3 9 0 9
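As a runnable illustration of the example row above, here is a minimal sketch that assumes the matrix is a plain list of lists whose first element in each row is the com value. The function name and the offset parameter are hypothetical.

```python
def zero_matching_cell(matrix, offset=1):
    """For each row, zero the cell selected by the row's first value.
    `offset` skips the leading com entry; set it to 0 if com is stored elsewhere."""
    for row in matrix:
        row[row[0] + offset] = 0
    return matrix

matrix = [[4, 3, 2, 3, 9, 5, 9]]
zero_matching_cell(matrix)
print(matrix)
```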
