How to fill in rows for missing combinations in pandas - python

I have the following pandas dataframe:
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,2,2,2,3,3,3,3,3], 'time': [2,3,5,1,3,4,1,2,6,7,8],
'val':['a','a','a','a','a','a','a','a','a','a','a']})
id time val
0 1 2 a
1 1 3 a
2 1 5 a
3 2 1 a
4 2 3 a
5 2 4 a
6 3 1 a
7 3 2 a
8 3 6 a
9 3 7 a
10 3 8 a
For each id, I would like to add a row for each missing time, with val set to 'b'. time should start from 1.
The resulting dataframe would look like this
foo = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3,3], 'time': [1,2,3,4,5,1,2,3,4,1,2,3,4,5,6,7,8],
'val':['b','a','a','b','a','a','b','a','a','a','a','b','b','b','a','a','a']})
id time val
0 1 1 b
1 1 2 a
2 1 3 a
3 1 4 b
4 1 5 a
5 2 1 a
6 2 2 b
7 2 3 a
8 2 4 a
9 3 1 a
10 3 2 a
11 3 3 b
12 3 4 b
13 3 5 b
14 3 6 a
15 3 7 a
16 3 8 a
Any ideas how I could do that in Python?
This answer does not work, because it does not take the grouping by id into account, and also for id == 1 the time == 1 row is missing.

Set the index of the dataframe to time, reindex each id group to the full time range, and fill the resulting NaN values in the val column with 'b':
(
    foo
    .set_index('time')
    .groupby('id')
    # reindex each group to the full range 1 .. max observed time (new rows become NaN)
    .apply(lambda g: g.reindex(range(1, g.index.max() + 1)))
    # the original id column is now partly NaN; drop it, the id is kept in the group index
    .drop('id', axis=1)
    .fillna({'val': 'b'})
    .reset_index()
)
If you want to try something fancy, here is another solution:
(
    foo.groupby('id')['time'].max()
    .map(range).explode().add(1).reset_index(name='time')
    .merge(foo, how='left').fillna({'val': 'b'})
)
id time val
0 1 1 b
1 1 2 a
2 1 3 a
3 1 4 b
4 1 5 a
5 2 1 a
6 2 2 b
7 2 3 a
8 2 4 a
9 3 1 a
10 3 2 a
11 3 3 b
12 3 4 b
13 3 5 b
14 3 6 a
15 3 7 a
16 3 8 a
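If the chained one-liner is hard to follow, the same idea can be spelled out step by step (a sketch using hypothetical intermediate names; the explicit astype keeps the merge key an integer):
import pandas as pd

# longest observed time per id
max_time = foo.groupby('id')['time'].max()

# one row per (id, time) for time = 1 .. max_time of that id
full_grid = (max_time.map(lambda m: range(1, m + 1))
                     .explode()
                     .astype(int)
                     .reset_index(name='time'))

# bring back the observed rows; missing (id, time) pairs get val filled with 'b'
result = full_grid.merge(foo, how='left').fillna({'val': 'b'})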

One option is complete from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
# build a range of numbers for each group, starting from 1
new_time = {'time': lambda df: range(1, df.max() + 1)}
foo.complete(new_time, by = 'id', fill_value = 'b')
id time val
0 1 1 b
1 1 2 a
2 1 3 a
3 1 4 b
4 1 5 a
5 2 1 a
6 2 2 b
7 2 3 a
8 2 4 a
9 3 1 a
10 3 2 a
11 3 3 b
12 3 4 b
13 3 5 b
14 3 6 a
15 3 7 a
16 3 8 a

Related

How to keep counting although it starts at 1 again

My df looks as follows:
import pandas as pd
d = {'col1': [1,2,3,3,1,2,2,3,4,1,1,2]}
df = pd.DataFrame(data=d)
Now I want to add a new column following this scheme:
col1  new_col
1     1
2     2
3     3
3     3
1     4
2     5
2     5
3     6
4     7
1     8
1     8
2     9
Once it starts again at 1 it should just keep counting.
At the moment I am at the point where I just add a column with difference:
df['diff'] = df['col1'].diff()
How to extend this approach?
Try with
df.col1.diff().ne(0).cumsum()
Out[94]:
0 1
1 2
2 3
3 3
4 4
5 5
6 5
7 6
8 7
9 8
10 8
11 9
Name: col1, dtype: int32
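Note that the first element of diff() is NaN, and NaN compared via ne(0) evaluates to True, so the first row still opens group 1; a small standalone check:
import pandas as pd

s = pd.Series([1, 2, 3, 3, 1])
print(s.diff())                  # NaN, 1.0, 1.0, 0.0, -2.0
print(s.diff().ne(0))            # True, True, True, False, True
print(s.diff().ne(0).cumsum())   # 1, 2, 3, 3, 4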
Try:
df["new_col"] = df["col1"].ne(df["col1"].shift()).cumsum()
>>> df
col1 new_col
0 1 1
1 2 2
2 3 3
3 3 3
4 1 4
5 2 5
6 2 5
7 3 6
8 4 7
9 1 8
10 1 8
11 2 9

Column that counts up within subgroups pandas

I've got a df
df1
a b
4 0 1
5 0 1
6 0 2
2 0 3
3 1 2
15 1 3
12 1 3
13 1 1
15 3 1
14 3 1
8 3 3
9 3 2
10 3 1
The df should be grouped by a and b, and I need a column c that counts up from 1 over the consecutive b groups within each a group.
df1
a b c
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 4
How can I do that?
We can do groupby + transform with factorize:
df['C'] = df.groupby('a').b.transform(lambda x: x.factorize()[0] + 1)
4 1
5 1
6 2
2 3
3 1
15 2
12 2
13 3
15 1
14 1
8 2
9 3
10 1
Name: b, dtype: int64
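factorize assigns codes in order of first appearance within each group, which is why the numbering follows row order rather than the sorted b values; a small standalone check:
import pandas as pd

codes, uniques = pd.factorize([2, 3, 3, 1])
print(codes + 1)   # [1 2 2 3] -> 2 seen first, then 3, then 1
print(uniques)     # [2 3 1]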
Just so we can see the loop version
from itertools import count
from collections import defaultdict
x = defaultdict(count)
y = {}
c = []
for a, b in zip(df.a, df.b):
    if (a, b) not in y:
        y[(a, b)] = next(x[a]) + 1
    c.append(y[(a, b)])
df.assign(C=c)
a b C
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 1
One option is to groupby a, then iterate through each group and groupby b within it. Then you can use ngroup:
import numpy as np
df['c'] = np.hstack([g.groupby('b').ngroup().to_numpy() for _, g in df.groupby('a')])
a b c
4 0 1 0
5 0 1 0
6 0 2 1
2 0 3 2
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 0
15 3 1 0
14 3 1 0
8 3 3 2
9 3 2 1
10 3 1 0
You can use groupby.rank if you don't care about the order in the data:
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)
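For reference, a sketch of what the dense rank gives on the sample data (ranks follow the sorted b values within each a group, so the result differs from the appearance-order numbering asked for):
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3],
                   'b': [1, 1, 2, 3, 2, 3, 3, 1, 1, 1, 3, 2, 1]},
                  index=[4, 5, 6, 2, 3, 15, 12, 13, 15, 14, 8, 9, 10])
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)
# within a == 3 this yields c = 1, 1, 3, 2, 1 (ties share a rank, order of appearance is ignored)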

Python dataframe add columns in groups of 3

I have a data-frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column with the same value in groups of 3.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove the last rows with DataFrame.iloc if the length is not divisible by 3, and then create unique group ids by integer division by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
import numpy as np

N = 3
num = len(df) // N * N          # largest multiple of N that fits
df = df.iloc[:num]              # drop the leftover rows at the end
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
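If letter labels are wanted rather than numbers (as in the question's A/B example), the integer group ids can be mapped to letters afterwards; a small sketch assuming at most 26 groups:
import string
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 4, 3, 6, 9],
                   'b': [2, 5, 2, 1, 7, 2],
                   'd': [3, 6, 3, 9, 0, 5]})
N = 3
groups = np.arange(len(df)) // N                           # 0, 0, 0, 1, 1, 1
df['label'] = [string.ascii_uppercase[g] for g in groups]  # A, A, A, B, B, B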
IIUC, groupby:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42

need to filter rows present in one dataframe on another

I have two data frames in pandas, and I need the rows of the second whose values across all columns do not appear in the first.
For example:
df A
A B C D
6 4 1 6
7 6 6 3
1 6 2 9
8 0 4 9
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
3 2 8 8
5 2 8 8
df B
A B C D
1 0 2 3
8 4 7 5
4 7 1 1
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
1 1 1 1
2 2 2 2
1 1 1 1
Required output:
A B C D
1 1 1 1
2 2 2 2
1 1 1 1
I tried using pd.merge with inner/left joins on all columns, but it takes a lot of time and memory when there are many rows and columns. Is there another way to work around this, for example iterating through each row of dfA against dfB on all columns and then picking the ones that are only in dfB?
You can use merge with the indicator parameter:
df_b.merge(df_a, on=['A', 'B', 'C', 'D'],
           how='left', indicator='ind')\
    .query('ind == "left_only"')\
    .drop('ind', axis=1)
Output:
A B C D
9 1 1 1 1
10 2 2 2 2
11 1 1 1 1
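Since the question mentions that the merge becomes slow on larger frames, another option (a sketch, treating all four columns together as the row key) is an anti-join via a MultiIndex membership test:
import pandas as pd

cols = ['A', 'B', 'C', 'D']
keys_a = pd.MultiIndex.from_frame(df_a[cols])
keys_b = pd.MultiIndex.from_frame(df_b[cols])
# keep the rows of df_b whose full key never appears in df_a
result = df_b[~keys_b.isin(keys_a)]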

Dropping different possible combination of values from a column in pandas dataframe iteratively

I have a data frame as shown below:
import pandas as pd
Data = pd.DataFrame({'L1': [1,2,3,4,5], 'L2': [6,7,3,5,6], 'ouptput':[10,11,12,13,14]})
Data
Yields,
L1 L2 ouptput
0 1 6 10
1 2 7 11
2 3 3 12
3 4 5 13
4 5 6 14
I want to loop through the data to remove n values from the output column of Data above, where n = [1,2,3,4], and assign the result to a new data frame Test_Data. For example, if I set n = 2 the function should produce
Test_Data - iteration 1 as
L1 L2 ouptput
0 1 6
1 2 7
2 3 3 12
3 4 5 13
4 5 6 14
Test_Data - iteration 2 as
L1 L2 ouptput
0 1 6 10
1 2 7 11
2 3 3
3 4 5
4 5 6 14
Likewise, it should produce a data frame with 2 values removed from the output column. It should produce a new combination every time; no combination should be repeated. I should also have control over the number of iterations. For example, 5C3 has 10 possible combinations, but I should be able to stop it at 8 iterations.
This is not a great solution but will probably help you achieve what you are looking for:
import pandas as pd
Data = pd.DataFrame({'L1': [1,2,3,4,5], 'L2': [6,7,3,5,6], 'output':[10,11,12,13,14]})
num_iterations = 1
num_values = 3
for i in range(0, num_iterations):
    tmp_data = Data.copy()
    tmp_data.loc[i*num_values:num_values*(i+1)-1, 'output'] = ''
    print(tmp_data)
This gives you a concatenated dataframe with every combination using pd.concat and itertools.combinations
from itertools import combinations
import pandas as pd
def mask(df, col, idx):
    d = df.copy()
    d.loc[list(idx), col] = ''
    return d
n = 2
pd.concat({c: mask(Data, 'ouptput', c) for c in combinations(Data.index, n)})
L1 L2 ouptput
0 1 0 1 6
1 2 7
2 3 3 12
3 4 5 13
4 5 6 14
2 0 1 6
1 2 7 11
2 3 3
3 4 5 13
4 5 6 14
3 0 1 6
1 2 7 11
2 3 3 12
3 4 5
4 5 6 14
4 0 1 6
1 2 7 11
2 3 3 12
3 4 5 13
4 5 6
1 2 0 1 6 10
1 2 7
2 3 3
3 4 5 13
4 5 6 14
3 0 1 6 10
1 2 7
2 3 3 12
3 4 5
4 5 6 14
4 0 1 6 10
1 2 7
2 3 3 12
3 4 5 13
4 5 6
2 3 0 1 6 10
1 2 7 11
2 3 3
3 4 5
4 5 6 14
4 0 1 6 10
1 2 7 11
2 3 3
3 4 5 13
4 5 6
3 4 0 1 6 10
1 2 7 11
2 3 3 12
3 4 5
4 5 6
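Neither snippet above caps the number of combinations; since the question asks to stop after, say, 8 iterations, one sketch (assuming itertools.islice is acceptable for limiting the generator) is:
from itertools import combinations, islice
import pandas as pd

Data = pd.DataFrame({'L1': [1, 2, 3, 4, 5],
                     'L2': [6, 7, 3, 5, 6],
                     'ouptput': [10, 11, 12, 13, 14]})
n = 2          # how many values to blank out per iteration
max_iter = 8   # stop after this many distinct combinations
for idx in islice(combinations(Data.index, n), max_iter):
    Test_Data = Data.copy()
    Test_Data.loc[list(idx), 'ouptput'] = ''   # blank the chosen rows, as in the answers above
    print(Test_Data)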
