How to merge rows with combination of values in a DataFrame - python

I have a DataFrame (df1) as given below
     Hair  Feathers  Legs  Type  Count
R1      1       NaN     0     1      1
R2      1         0   NaN     1     32
R3      1         0     2     1      4
R4      1       NaN     4     1     27
I want to merge rows based on different combinations of the values in each column, and also want to add the count values for each merged row. The resultant dataframe (df2) will look like this:
     Hair  Feathers  Legs  Type  Count
R1      1         0     0     1     33
R2      1         0     2     1     36
R3      1         0     4     1     59
The merging is performed in such a way that any NaN value is merged with 0 or 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the 0 value of Legs (df1, R1) is merged with the NaN value of Legs (df1, R2), and the counts of R1 (1) and R2 (32) are added. In the same manner, R2 and R3 are merged because the Feathers value in R2 (df1) matches R3 (df1) and the NaN value of Legs is merged with 2 in R3 (df1); the counts of R2 (32) and R3 (4) are added.
I hope the explanation makes sense. Any help will be highly appreciated

A possible way to do it is by replicating each of the rows containing NaN and filling them with the possible values for each column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for each column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)
# Keep the rows that do not contain NaN,
# then append the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    for c in row[row.isnull()].index:
        # For each missing column of the row, replace
        # NaN by each possible value for the column
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final Count, grouped by the possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'], we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it serves
UPDATE
If more than one element in a row is missing, the procedure should look for all the possible combinations of the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:
import itertools

unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
mask = df.iloc[:, :-1].isnull().any(axis=1)
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    cols = row[row.isnull()].index.tolist()
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna({c: v for c, v in zip(cols, p)}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0
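For convenience, the whole procedure can be wrapped in a single function. The sketch below just restates the steps above (the name expand_and_sum and the count_col parameter are my own choices, not part of the original question):

import itertools
import pandas as pd

def expand_and_sum(df, count_col='Count'):
    # Expand each NaN cell to every observed value of its column,
    # then sum the Count of identical rows
    feature_cols = [c for c in df.columns if c != count_col]
    unique_values = {c: df[c].dropna().unique().tolist() for c in feature_cols}

    mask = df[feature_cols].isnull().any(axis=1)
    rows = [r for _, r in df[~mask].iterrows()]
    for _, row in df[mask].iterrows():
        missing = row[row.isnull()].index.tolist()
        for combo in itertools.product(*[unique_values[c] for c in missing]):
            rows.append(row.copy().fillna(dict(zip(missing, combo))))

    expanded = pd.concat(rows, axis=1, ignore_index=True).T
    return expanded.groupby(feature_cols, as_index=False)[count_col].sum()

df2 = expand_and_sum(df)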

Related

How to count number of rows dropped in a pandas dataframe

How do I print the number of rows dropped while executing the following code in python:
df.dropna(inplace = True)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
len(new_data)
Use:
np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0,np.nan, 1], size=(10, 3)))
print (df)
0 1 2
0 NaN 0.0 NaN
1 0.0 NaN NaN
2 0.0 0.0 1.0
3 0.0 0.0 NaN
4 NaN NaN 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
7 NaN 0.0 1.0
8 1.0 1.0 NaN
9 1.0 0.0 NaN
You can count the rows with missing values beforehand using DataFrame.isna with DataFrame.any and sum:
count = df.isna().any(axis=1).sum()
df.dropna(inplace = True)
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
Or take the difference between the DataFrame's size before and after dropna:
orig = df.shape[0]
df.dropna(inplace = True)
count = orig - df.shape[0]
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
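If this comes up in several places, both ideas fit in a small helper. A minimal sketch (the name drop_and_report is hypothetical):

def drop_and_report(df):
    # Drop rows containing any NaN and report how many were removed
    clean = df.dropna()
    return clean, len(df) - len(clean)

new_data, n_dropped = drop_and_report(df)
print(n_dropped)  # 7 for the example DataFrame above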

Combine list with dataframe and extend the multi-values cells into columns

Assuming I have the following input:
table = pd.DataFrame({'a': [0, 0, 0, 0], 'b': [1, 1, 1, 3], 'c': [2, 2, 5, 4], 'd': [3, 6, 6, 6]}, dtype='float64')
list = [[55, 66],
        [77]]
#output of the table
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 1.0 2.0 6.0
2 0.0 1.0 5.0 6.0
3 0.0 3.0 4.0 6.0
I want to combine list with table so the final shape would be like:
a b c d ID_0 ID_1
0 0.0 1.0 2.0 3.0 55.0 66.0
1 0.0 1.0 2.0 6.0 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN
I found a way but it looks a bit long and might be a shorter way to do it.
Step 1:
x = pd.Series(list, name="ID")
new = pd.concat([table, x], axis=1)
# output
     a    b    c    d        ID
0  0.0  1.0  2.0  3.0  [55, 66]
1  0.0  1.0  2.0  6.0      [77]
2  0.0  1.0  5.0  6.0       NaN
3  0.0  3.0  4.0  6.0       NaN
Step 2:
ID = new['ID'].apply(pd.Series)
ID = ID.rename(columns=lambda x: 'ID_' + str(x))
new_x = pd.concat([new[:], ID[:]], axis=1)
# output
     a    b    c    d        ID  ID_0  ID_1
0  0.0  1.0  2.0  3.0  [55, 66]  55.0  66.0
1  0.0  1.0  2.0  6.0      [77]  77.0   NaN
2  0.0  1.0  5.0  6.0       NaN   NaN   NaN
3  0.0  3.0  4.0  6.0       NaN   NaN   NaN
Step 3:
new_x = new_x.drop(columns=['ID'], axis = 1)
Any shorter way to achieve the same result?
Assuming a default index on table (as shown in the question), we can simply create a DataFrame (either from_records or with the constructor) and join back to table and allow the indexes to align. (add_prefix is an easy way to add the 'ID_' prefix to the default numeric columns)
new_df = table.join(
    pd.DataFrame.from_records(lst).add_prefix('ID_')
)
new_df:
a b c d ID_0 ID_1
0 0.0 1.0 2.0 3.0 55.0 66.0
1 0.0 1.0 2.0 6.0 77.0 NaN
2 0.0 1.0 5.0 6.0 NaN NaN
3 0.0 3.0 4.0 6.0 NaN NaN
Working with 2 DataFrames is generally easier than a DataFrame and a list. Here is what from_records does to lst:
pd.DataFrame.from_records(lst)
0 1
0 55 66.0
1 77 NaN
Index (rows) 0 and 1 will now align with the corresponding index values in table (0 and 1 respectively).
add_prefix fixes the column names before joining:
pd.DataFrame.from_records(lst).add_prefix('ID_')
ID_0 ID_1
0 55 66.0
1 77 NaN
Setup and imports:
import pandas as pd # v1.4.4
table = pd.DataFrame({
    'a': [0, 0, 0, 0],
    'b': [1, 1, 1, 3],
    'c': [2, 2, 5, 4],
    'd': [3, 6, 6, 6]
}, dtype='float64')
lst = [[55, 66],
       [77]]
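For what it's worth, the plain DataFrame constructor handles the ragged list the same way as from_records here, so this shorter form (a sketch) should give the same result:

new_df = table.join(pd.DataFrame(lst).add_prefix('ID_'))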

Multiply column values by a scalar based on conditions in a DataFrame

I want to multiply column values by a specific scalar based on the name of the column:
if column name = "Math", then all the values in 'Math" column should be multiply by 5;
if column name = "Physique", values in that column should be multiply by 4;
if column name = "Bio", values in that column should be multiplied by 3;
all the remaining columns should be multiplied by 2
(The input DataFrame and the expected result were shown as screenshots in the original question and are not reproduced here.)
listm = ['Math', 'Physique', 'Bio']
def note_coef(row):
    for m in listm:
        if 'Math' in listm:
            result = df['Math']*5
    return result

df2 = df.apply(note_coef)
df2
Note that I stopped with only one if to test my code, but the outcome is not what I expected. I am quite new to programming and to this site as well.
I think the most elegant solution is to define a dictionary (or a pandas.Series) with the multiplying factor for each column of your DataFrame (factors). Then you can multiply all the columns with the corresponding factor simply using df *= factors.
The multiplication is done via column axis alignment, i.e. by aligning the df.columns with the dictionary keys.
For instance, given the following DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.ones(shape=(4, 5)), columns=['Math', 'Physique', 'Bio', 'Algo', 'Archi'])
>>> df
Math Physique Bio Algo Archi
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0
You can do:
factors = {'Math': 5, 'Physique': 4, 'Bio': 3}
default_factor = 2
factors.update({col: default_factor for col in df.columns if col not in factors})
df *= factors
print(df)
Output:
Math Physique Bio Algo Archi
0 5.0 4.0 3.0 2.0 2.0
1 5.0 4.0 3.0 2.0 2.0
2 5.0 4.0 3.0 2.0 2.0
3 5.0 4.0 3.0 2.0 2.0
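The pandas.Series alternative mentioned above works the same way; reindex supplies the default factor for every column that is not listed explicitly (a sketch):

factors = pd.Series({'Math': 5, 'Physique': 4, 'Bio': 3})
df *= factors.reindex(df.columns, fill_value=2)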
Fake data
import numpy as np
import pandas as pd

n = 5
d = {'a': np.ones(n),
     'b': np.ones(n),
     'c': np.ones(n),
     'd': np.ones(n)}
df = pd.DataFrame(d)
print(df)
Select the columns and multiply by a tuple.
df[['a','c']] = df[['a','c']] * (2,4)
print(df)
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
a b c d
0 2.0 1.0 4.0 1.0
1 2.0 1.0 4.0 1.0
2 2.0 1.0 4.0 1.0
3 2.0 1.0 4.0 1.0
4 2.0 1.0 4.0 1.0
You can use df['col_name'].multiply(value) to apply a factor to a whole column. The remaining columns can be modified in a loop over all columns not in listm.
listm = ['Math', 'Physique', 'Bio']
for i, head in enumerate(listm):
    df[head] = df[head].multiply(5 - i)

# multiply every remaining column by 2
for head in df.columns:
    if head not in listm:
        df[head] = df[head].multiply(2)
Here is another way to do it using array multiplication.
The data was not provided as text, so the test data below was created following the pattern of the screenshot.
mul = [5, 4, 3, 2, 2, 2, 2, 1]  # multipliers, one per numeric column (total is multiplied by 1)
df1 = df.iloc[:, 1:].mul(mul)
df1['total'] = df1.iloc[:, :7].sum(axis=1)   # recompute the total from the scaled columns
df.update(df1, join='left', overwrite=True)  # write the scaled values back into df
df
source Math Physics Bio Algo Archi Sport eng total
0 A 50.0 60.0 60.0 50.0 60.0 70.0 80.0 430.0
1 B 55.0 64.0 63.0 52.0 62.0 72.0 82.0 450.0
2 C 5.5 8.4 9.3 NaN NaN NaN NaN 23.2
3 D NaN NaN NaN 22.0 42.0 62.0 82.0 208.0
4 E 6.0 8.8 9.6 NaN NaN NaN NaN 24.4
5 F NaN NaN NaN 24.0 44.0 64.0 84.0 216.0
TEST DATA
data_out = [
['A', 10,15,20,25,30,35,40],
['B', 11,16,21,26,31,36,41],
['C', 1.1,2.1,3.1],
['D', np.NaN,np.NaN,np.NaN,11,21,31,41],
['E', 1.2,2.2,3.2],
['F', np.NaN,np.NaN,np.NaN,12,22,32,42],
]
df=pd.DataFrame(data_out, columns=[ 'source', 'Math', 'Physics', 'Bio', 'Algo', 'Archi', 'Sport', 'eng'])
df['total'] = df.iloc[:,1:].sum(axis=1)
source Math Physics Bio Algo Archi Sport eng total
0 A 10.0 15.0 20.0 25.0 30.0 35.0 40.0 175.0
1 B 11.0 16.0 21.0 26.0 31.0 36.0 41.0 182.0
2 C 1.1 2.1 3.1 NaN NaN NaN NaN 6.3
3 D NaN NaN NaN 11.0 21.0 31.0 41.0 104.0
4 E 1.2 2.2 3.2 NaN NaN NaN NaN 6.6
5 F NaN NaN NaN 12.0 22.0 32.0 42.0 108.0

How to fill missing values in different ways in Python?

I have a dataset that has a number of numerical variables and a number of ordinal numeric variables. To fill missing values I want to use the mean for the numerical variables and the median for the ordinal numeric variables. With the following code, each of them is created separately and is not collected in a single DataFrame.
data = [['age', 'score'],
        [10, 1],
        [20, ""],
        ["", 0],
        [40, 1],
        [50, 0],
        ["", 3],
        [70, 1],
        [80, ""],
        [90, 0],
        [100, 1]]
df = pd.DataFrame(data[1:])
df.columns = data[0]
df = df[['age']].fillna(df.mean())
df = df[['score']].fillna(df.median())
pandas.DataFrame.fillna accepts dict with keys being column names, so you might do:
import pandas as pd
data = [['age', 'score'],
        [10, 1],
        [20, None],
        [None, 0],
        [40, 1],
        [50, 0],
        [None, 3],
        [70, 1],
        [80, None],
        [90, 0],
        [100, 1]]
df = pd.DataFrame(data[1:], columns=data[0])
df = df.fillna({'age':df['age'].mean(),'score':df['score'].median()})
print(df)
output
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
Keep in mind that an empty string is different from NaN; the latter can be created using Python's None.
First replace empty strings with missing values and then fill the missing values per column:
import numpy as np

df = df.replace('', np.nan)
df['age'] = df['age'].fillna(df['age'].mean())
df['score'] = df['score'].fillna(df['score'].median())
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
You can also use DataFrame.agg to build a Series of aggregate values and pass it to DataFrame.fillna:
df = df.replace('', np.nan)
print (df.agg({'age':'mean', 'score':'median'}))
age 57.5
score 1.0
dtype: float64
df = df.fillna(df.agg({'age':'mean', 'score':'median'}))
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
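With many columns, the fill values can also be built programmatically instead of being listed one by one. A sketch, assuming the ordinal columns are known by name (ordinal_cols is a hypothetical list):

ordinal_cols = ['score']  # fill these with the median, everything else with the mean
fill_values = {c: df[c].median() if c in ordinal_cols else df[c].mean()
               for c in df.columns}
df = df.fillna(fill_values)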

How do I select rows with at least a value in some selected columns? [duplicate]

I have a big dataframe with many columns (like 1000). I have a list of columns (~10, generated by a script). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
    df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ...
    df[list_of_cols[6]].notnull()
]
I can also iterate over the list of cols and create a mask which I would then apply to df, but this looks too tedious. Knowing how powerful pandas is when dealing with NaN, I would expect that there is a much easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that a row is kept if it contains at least one non-null value.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
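Note that df[list_of_cols].dropna(thresh=1) returns only the selected columns. If the full rows are needed, the surviving index can be used to slice the original frame (a sketch):

df.loc[df[list_of_cols].dropna(thresh=1).index].head()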
Starting with this:
data = {'a': [np.nan, 0, 0, 0, 0, 0, np.nan, 0, 0, 0, 0, 0, 9, 9],
        'b': [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 7],
        'c': [np.nan, np.nan, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1],
        'd': [np.nan, np.nan, 7, 9, 6, 9, 7, np.nan, 6, 6, 7, 6, 9, 6]}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
df
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are nulls. (Removing row index 0)
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask with True for the null elements in df[list_of_cols] and False otherwise
all() returns True if all of the elements in a row are True (axis=1 means row-wise)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> list_of_cols = ['B', 'C']
>>> df = pd.DataFrame({'A': [11, 22, 33, np.NaN],
                       'B': ['x', np.NaN, np.NaN, 'w'],
                       'C': ['2016-03-13', np.NaN, '2016-03-14', '2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
Negated isnull mask (True where the value is not null):
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
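The same selection can also be written with notna/any, which avoids the double negation (a sketch, equivalent to the expression above):

df[df[list_of_cols].notna().any(axis=1)]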
