How can a custom function be applied to two data frames? The .apply method seems to iterate over rows or columns of a given dataframe, but I am not sure how to use this over two data frames at once. For example,
df1
      m1          m2
       x     y     x     y    z
0 0  10.0  12.0  16.0  17.0  9.0
  0  10.0  13.0  15.0  12.0  4.0
1 0  11.0  14.0  14.0  11.0  5.0
  1   3.0  14.0  12.0  10.0  9.0
df2
    m1       m2
     x    y   x  y
0  0.5  0.1   1  0
In general, how can a function that maps df1 and df2 make a new df3? For example, multiplication (but I am looking for a generalized solution where I can just pass the frames to a function).
def custFunc(d1,d2):
    return (d1 * d2) - d2

df1.apply(lambda x: custFunc(x,df2[0]),axis=1)
#df2[0] meaning it is explicitly first row
and a df3 would be
      m1         m2
       x    y     x    y    z
0 0  5.5  1.3  16.0  0.0  9.0
  0  5.5  1.4  15.0  0.0  4.0
1 0  6.0  1.5  14.0  0.0  5.0
  1  2.0  1.5  12.0  0.0  9.0
If you need your function, just pass the DataFrame and a Series, selecting the row with DataFrame.loc; last, replace the missing values with the original ones using DataFrame.fillna:
def custFunc(d1,d2):
    return (d1 * d2) - d2

df = custFunc(df1, df2.loc[0]).fillna(df1)
print (df)
      m1         m2
       x    y     x    y    z
0 0  4.5  1.1  15.0  0.0  9.0
  0  4.5  1.2  14.0  0.0  4.0
1 0  5.0  1.3  13.0  0.0  5.0
  1  1.0  1.3  11.0  0.0  9.0
Detail:
print (df2.loc[0])
m1  x    0.5
    y    0.1
m2  x    1.0
    y    0.0
Name: 0, dtype: float64
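For context, here is a minimal self-contained sketch of why this works; the df1/df2 construction below is only a hypothetical reconstruction of the frames above so the snippet can run on its own. DataFrame-with-Series arithmetic aligns on the column labels, so every column of df1 that also appears in df2.loc[0] is transformed, and the leftover ('m2', 'z') column comes back as NaN until fillna(df1) restores it:

import pandas as pd

# Hypothetical reconstruction of the MultiIndex-column frames above
cols = pd.MultiIndex.from_tuples(
    [('m1', 'x'), ('m1', 'y'), ('m2', 'x'), ('m2', 'y'), ('m2', 'z')])
df1 = pd.DataFrame([[10.0, 12.0, 16.0, 17.0, 9.0],
                    [10.0, 13.0, 15.0, 12.0, 4.0]], columns=cols)
df2 = pd.DataFrame([[0.5, 0.1, 1.0, 0.0]], columns=cols[:4])

def custFunc(d1, d2):
    return (d1 * d2) - d2

# Columns align by label; ('m2', 'z') has no match in df2.loc[0],
# so it becomes NaN and fillna(df1) puts the original values back.
df3 = custFunc(df1, df2.loc[0]).fillna(df1)
print (df3)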
Most efficient for a big dataset in pandas:
I would like to add a new column Z taking the value from X if there is a value, if not, I want to take the value from Y.
Another thing: is there a possibility to use ternary operations to add a new column Z based on: if column Y exists, then column Y - column X; if not, then only X.
I'm looking for the most efficient way in both cases.
Thank you
Use numpy.where:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'X':np.random.choice([np.nan, 1], size=N),
                   'Y':np.random.choice([3,4,6], size=N)})

df['Z1'] = np.where(df['X'].isna(), df['Y'], df['X'])

if 'Y' in df.columns:
    df['Z2'] = np.where(df['X'] - df['Y'], df['Y'], df['X'])
else:
    df['Z2'] = df['X']

print (df)
        X   Y   Z1   Z2
0     NaN   6  6.0  6.0
1     1.0   4  1.0  4.0
2     NaN   6  6.0  6.0
3     NaN   3  3.0  3.0
4     NaN   3  3.0  3.0
...   ...  ..  ...  ...
9995  1.0   6  1.0  6.0
9996  1.0   6  1.0  6.0
9997  NaN   6  6.0  6.0
9998  1.0   4  1.0  4.0
9999  1.0   6  1.0  6.0
[10000 rows x 4 columns]
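As a side note, the Z1 line can also be written with Series.fillna, which gives the same result here (shown only as an alternative spelling, not a change to the answer):

# same as np.where(df['X'].isna(), df['Y'], df['X'])
df['Z1'] = df['X'].fillna(df['Y'])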
I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
    df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ...
    df[list_of_cols[6]].notnull()
]
I can also iterate over the list of cols and create a mask which I would then apply to df, but this looks too tedious. Knowing how powerful pandas is with respect to dealing with NaN, I would expect that there is a much easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that if there is at least 1 non null item, don't drop it.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
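Note that df[list_of_cols].dropna(thresh=1) returns only the listed columns. If the goal is to keep all columns of the original frame for the matching rows, one option (a small sketch reusing the surviving index) is:

df.loc[df[list_of_cols].dropna(thresh=1).index]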
Starting with this:
import numpy as np
import pandas as pd

data = {'a' : [np.nan,0,0,0,0,0,np.nan,0,0, 0,0,0, 9,9,],
        'b' : [np.nan,np.nan,1,1,1,1,1,1,1, 2,2,2, 1,7],
        'c' : [np.nan,np.nan,1,1,2,2,3,3,3, 1,1,1, 1,1],
        'd' : [np.nan,np.nan,7,9,6,9,7,np.nan,6, 6,7,6, 9,6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are null (removing row index 0):
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression ~pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise
all() returns True if all of the elements are True (row-wise axis=1)
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df=pd.DataFrame({'A':[11,22,33,np.NaN],
                     'B':['x',np.NaN,np.NaN,'w'],
                     'C':['2016-03-13',np.NaN,'2016-03-14','2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
Negated isnull mask (with list_of_cols = ['B', 'C'] in this example):
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1) row-wise to the isnull mask and negate:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
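An equivalent way to write the same mask, using notna/any instead of negating isnull/all, selects the same rows:

df[df[list_of_cols].notna().any(axis=1)]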
I have a DataFrame (df1) as given below
    Hair  Feathers  Legs  Type  Count
R1     1       NaN     0     1      1
R2     1         0   NaN     1     32
R3     1         0     2     1      4
R4     1       NaN     4     1     27
I want to merge rows based on different combinations of the values in each column, and I also want to add the count values for each merged row. The resultant dataframe (df2) will look like this:
    Hair  Feathers  Legs  Type  Count
R1     1         0     0     1     33
R2     1         0     2     1     36
R3     1         0     4     1     59
The merging is performed in such a way that any NaN value will be merged with 0 or 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the 0 value of Legs (df1, R1) is merged with the NaN value of Legs (df1, R2). Then the counts of R1 (1) and R2 (32) are added. In the same manner, R2 and R3 are merged because the Feathers value in R2 (df1) is the same as in R3 (df1) and the NaN value of Legs is merged with the 2 in R3 (df1), and the counts of R2 (32) and R3 (4) are added.
I hope the explanation makes sense. Any help will be highly appreciated
A possible way to do it is by replicating each of the rows containing NaN and filling them with the possible values for the column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for each column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)
# Keep the rows that do not contain NaN
# and then add the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    for c in row[row.isnull()].index:
        # For each column of the row, replace
        # NaN by the possible values for the column
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final result of Count grouping by possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'] we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it serves
UPDATE
If more than one element of a row is missing, the procedure has to look for all possible combinations of the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:
import itertools

unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()

mask = df.iloc[:, :-1].isnull().any(axis=1)
list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    cols = row[row.isnull()].index.tolist()
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna({c: v for c, v in zip(cols, p)}))
df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0
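As a sanity check on the totals: every group now also absorbs the Count of 32 from the fully expanded new row, so 65 = 1 + 32 + 32, 68 = 4 + 32 + 32 and 91 = 27 + 32 + 32.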
I couldn't find an efficient way of doing that.
I have the DataFrame below in Python, with columns from A to Z:
A B C ... Z
0 2.0 8.0 1.0 ... 5.0
1 3.0 9.0 0.0 ... 4.0
2 4.0 9.0 0.0 ... 3.0
3 5.0 8.0 1.0 ... 2.0
4 6.0 8.0 0.0 ... 1.0
5 7.0 9.0 1.0 ... 0.0
I need to multiply each of the columns from B to Z by A (B x A, C x A, ..., Z x A), and save the results in new columns (R1, R2, ..., R25).
I would have something like this:
A B C ... Z R1 R2 ... R25
0 2.0 8.0 1.0 ... 5.0 16.0 2.0 ... 10.0
1 3.0 9.0 0.0 ... 4.0 27.0 0.0 ... 12.0
2 4.0 9.0 0.0 ... 3.0 36.0 0.0 ... 12.0
3 5.0 8.0 1.0 ... 2.0 40.0 5.0 ... 10.0
4 6.0 8.0 0.0 ... 1.0 48.0 0.0 ... 6.0
5 7.0 9.0 1.0 ... 0.0 63.0 7.0 ... 0.0
I was able to calculate the results using the code below, but from there I would need to merge with the original df. That doesn't sound efficient. There must be a simple/clean way of doing it.
df.loc[:,'B':'D'].multiply(df['A'], axis="index")
That's just an example; my real DataFrame has 160 columns x 16k rows.
Create the new column names with a list comprehension and then join to the original:
df1 = df.loc[:,'B':'D'].multiply(df['A'], axis="index")
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]
df = df.join(df1)
print (df)
A B C Z R1 R2
0 2.0 8.0 1.0 5.0 16.0 2.0
1 3.0 9.0 0.0 4.0 27.0 0.0
2 4.0 9.0 0.0 3.0 36.0 0.0
3 5.0 8.0 1.0 2.0 40.0 5.0
4 6.0 8.0 0.0 1.0 48.0 0.0
5 7.0 9.0 1.0 0.0 63.0 7.0
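For the real 160-column frame the same pattern applies; assuming the columns B through Z (or whatever the full set is) sit next to each other, only the slice changes (a sketch, not tested on the real data):

df1 = df.loc[:, 'B':'Z'].multiply(df['A'], axis='index')
df1.columns = ['R{}'.format(x) for x in range(1, len(df1.columns) + 1)]
df = df.join(df1)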
From a Pandas DataFrame, I want to select columns where the value of the first row is within a certain range (e.g., 0.5 - 1.1).
I can select columns where row 0 is greater than or less than a certain number by doing this:
df = pd.DataFrame(example).T
Result = df[df.iloc[:, 0] > 0.5].T
How do I do this for a range (i.e., greater than 0.5 and less than 1)?
Thanks.
You can use between:
print (df[df.iloc[:, 0].between(0.5, 1.1)])
Another solution with conditions with & (array and):
print (df[(df.iloc[:, 0] > 0.5) & (df.iloc[:, 0] < 1.1)])
Sample:
df = pd.DataFrame({'a':[1.1,1.4,0.7,0,0.5]})
print (df)
a
0 1.1
1 1.4
2 0.7
3 0.0
4 0.5
# inclusive=True is the default
print (df[df.iloc[:, 0].between(0.5, 1.1)])
a
0 1.1
2 0.7
4 0.5
# with inclusive=False
print (df[df.iloc[:, 0].between(0.5, 1.1, inclusive=False)])
a
2 0.7
print (df[(df.iloc[:, 0] > 0.5) & (df.iloc[:, 0] < 1.1)])
a
2 0.7
But if you need to select columns by the first row, add loc:
df = pd.DataFrame({'A':[1.1,2,3],
                   'B':[.4,5,6],
                   'C':[.7,8,9],
                   'D':[1.0,3,5],
                   'E':[.5,3,6],
                   'F':[.7,4,3]})
print (df)
A B C D E F
0 1.1 0.4 0.7 1.0 0.5 0.7
1 2.0 5.0 8.0 3.0 3.0 4.0
2 3.0 6.0 9.0 5.0 6.0 3.0
print (df.loc[:, df.iloc[0, :].between(0.5, 1.1)])
A C D E F
0 1.1 0.7 1.0 0.5 0.7
1 2.0 8.0 3.0 3.0 4.0
2 3.0 9.0 5.0 6.0 3.0
print (df.loc[:, df.iloc[0, :].between(0.5, 1.1, inclusive=False)])
C D F
0 0.7 1.0 0.7
1 8.0 3.0 4.0
2 9.0 5.0 3.0
print (df.loc[:, (df.iloc[0, :] > 0.5) & (df.iloc[0, :] < 1.1)])
C D F
0 0.7 1.0 0.7
1 8.0 3.0 4.0
2 9.0 5.0 3.0
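Note: in newer pandas versions the inclusive argument of between takes a string ('both', 'neither', 'left', 'right') instead of a boolean, so the inclusive=False calls above would be written, for example, as:

print (df.loc[:, df.iloc[0, :].between(0.5, 1.1, inclusive='neither')])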