Pandas apply function to multindexed columns that takes columns (Series) as arguments

Pandas apply function to multindexed columns that takes columns (Series) as arguments - python

I need to apply a function that takes subcolumns (aka Series) of multiindexed columns as arguments. I have come up with a solution that works, but I was curious if there was a more pythonic/proper pandas way to do this.
Let's say we have a function that takes two series as arguments and performs some user-defined operation on those series and returns a single series.
import pandas as pd
def user_defined_function(series1, series2):
return 12 * (series1 * series2 / 3)
Lets create a dataframe with multindexed columns.
data = [[1, 2, 3, 4],
[5, 6, 7, 8],
[10, 11, 12, 13],
[14, 15, 16, 17]]
columns = (('A', 'sub_col_1'),
('A', 'sub_col_2'),
('B', 'sub_col_1'),
('B', 'sub_col_2'))
df = pd.DataFrame(data, columns=columns)
print(df)
A B
sub_col_1 sub_col_2 sub_col_1 sub_col_2
0 1 2 3 4
1 5 6 7 8
2 10 11 12 13
3 14 15 16 17
I want to appy my user_defined_function() to the sub columns of A and B.
Now if you try and use apply traditionally pandas will traverse each column individually returning a single series to the function. So you can't just do this.
df.apply(lambda x: user_defined_function(x['sub_col_1'], x['sub_col_2']))
You'll end up getting a key error because pandas is passing a series not a normally indexed "sub dataframe."
So this is the solution I came up with.
level1_labels = set(df.columns.get_level_values(0))
processed_df = pd.DataFrame()
for label in level1_labels:
data_to_apply_function_to = df[label]
processed_series = user_defined_function(data_to_apply_function_to['sub_col_1'],
data_to_apply_function_to['sub_col_2'])
processed_df[label] = processed_series
print(processed_df)
A B
0 8.0 48.0
1 120.0 224.0
2 440.0 624.0
3 840.0 1088.0
This returns what I want it to. However, I am curious if there is a cleaner, more pythonic, proper way to do this.

You can groupby over the columns axis. Your function requires a Series so we'll need to squeeze if we want to select by label.
(df.groupby(level=0, axis=1)
.apply(lambda gp: user_defined_function(gp.xs('sub_col_1', level=1, axis=1).squeeze(),
gp.xs('sub_col_2', level=1, axis=1).squeeze()))
)
# A B
#0 8.0 48.0
#1 120.0 224.0
#2 440.0 624.0
#3 840.0 1088.0
A bit more error prone, though fine if you know all groups have two Series in the same positions
(df.groupby(level=0, axis=1)
.apply(lambda gp: user_defined_function(gp.iloc[:, 0], gp.iloc[:, 1]))
)

It looks to me that this is a very custom case. It is actually possible to use apply within the 0 level columns as following
import pandas as pd
# I just renamed it cos was very long
def udf(series1, series2):
return 12 * (series1 * series2 / 3)
col = "A"
df[col].apply(lambda x: udf(x['sub_col_1'], x['sub_col_2']),axis=1)\
.to_frame()\
.rename(columns={0:col})
returns
A
0 8.0
1 120.0
2 440.0
3 840.0
But again for the output you are looking for you should still need to loop.
out = []
for col in set(df.columns.get_level_values(0)):
out.append(
df[col].apply(lambda x: udf(x['sub_col_1'],
x['sub_col_2']),
axis=1)\
.to_frame()\
.rename(columns={0:col}))
out = pd.concat(out, axis=1)

Related

Fill NA values in dataframe by mapping a double indexed groupby object [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.

One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3

fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.

#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
.transform(lambda x: x.fillna(x.mean()))

Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s

I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')

The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))

To summarize all above concerning the efficiency of the possible solution
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).

def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)

I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, unanimity for apply/lambda based solution made me doubt my instinct). And that is indeed 2.5 faster than the most upvoted solutions.

To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))

df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)

You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

How to use groupby max in own groupby function?

I have the following df
d = {'CAT':['C1','C2','C1','C2'],'A': [10, 20,30,40], 'B': [3, 4,10,3]}
df1 = pd.DataFrame(data=d)
I am trying to include a new column obtained by dividing 'A' by the highest 'B' it is category ('CAT'). That is, I want to divide 10 by 10, 20 by 4, 10 by 10 and 40 by 4 to obtain the following df
d = {'CAT':['C1','C2','C1','C2'],'A': [10, 20,30,40], 'B': [3, 4,10,3], 'C':[1,5,3,10]}
Any suggestions?
I find it easy to do without having to condition/groupby on CAT
d = {'A': [10, 20,30,40], 'B': [3, 4,10,3]}
df1 = pd.DataFrame(data=d)
df1 = df1.apply(lambda x:x.A/max(df1['B']),axis=1)
but with 'CAT' I am having a hard time.

You could do this in one line; I only broke it into separate lines for more clarity. transform allows replication of the groupby accross the entire dataframe; with that we can get the results for column C :
grouping = df1.groupby("CAT").B.transform("max")
df1['C'] = df1.A.div(grouping)
df1
CAT A B C
0 C1 10 3 1.0
1 C2 20 4 5.0
2 C1 30 10 3.0
3 C2 40 3 10.0

you're pretty much most of the way there with using apply. Depending on how big your actual dataset it, using apply could work out as inefficient, but ignoring that, you can solve your problem by the 'max' function on a filter of the dataframe rather than the df itself.
Or, just to get to the code:
df1['calculation'] = df1.apply(lambda row: row['A'] / max(df1[df1['CAT'] == row['CAT']]['B']), axis=1)

Fillna in pandas with respect to similar line [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.

One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3

fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.

#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
.transform(lambda x: x.fillna(x.mean()))

Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s

I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')

The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))

To summarize all above concerning the efficiency of the possible solution
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).

def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)

I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, unanimity for apply/lambda based solution made me doubt my instinct). And that is indeed 2.5 faster than the most upvoted solutions.

To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))

df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)

You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

Efficient way to merge multiple large DataFrames

Suppose I have 4 small DataFrames
df1, df2, df3 and df4
import pandas as pd
from functools import reduce
import numpy as np
df1 = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
df2 = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
df3 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
df4 = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df1.columns = ['name', 'id', 'price']
df2.columns = ['name', 'id', 'price']
df3.columns = ['name', 'id', 'price']
df4.columns = ['name', 'id', 'price']
df1 = df1.rename(columns={'price':'pricepart1'})
df2 = df2.rename(columns={'price':'pricepart2'})
df3 = df3.rename(columns={'price':'pricepart3'})
df4 = df4.rename(columns={'price':'pricepart4'})
Create above are the 4 DataFrames, what I would like is in the code below.
# Merge dataframes
df = pd.merge(df1, df2, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df3, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
df = pd.merge(df , df4, left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
# Fill na values with 'missing'
df = df.fillna('missing')
So I have achieved this for 4 DataFrames that don't have many rows and columns.
Basically, I want to extend the above outer merge solution to MULTIPLE (48) DataFrames of size 62245 X 3:
So I came up with this solution by building from another StackOverflow answer that used a lambda reduce:
from functools import reduce
import pandas as pd
import numpy as np
dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(0, 49):
dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))
#The solution I came up with to extend the solution to more than 3 DataFrames
df_merged = reduce(lambda left, right: pd.merge(left, right, left_on=['name', 'id'], right_on=['name', 'id'], how='outer'), dfList).fillna('missing')
This is causing a MemoryError.
I do not know what to do to stop the kernel from dying.. I've been stuck on this for two days.. Some code for the EXACT merge operation that I have performed that does not cause the MemoryError or something that gives you the same result, would be really appreciated.
Also, the 3 columns in the main DataFrame (NOT the reproducible 48 DataFrames in the example) are of type int64, int64 and float64 and I'd prefer them to stay that way because of the integer and float that it represents.
EDIT:
Instead of iteratively trying to run the merge operations or using the reduce lambda functions, I have done it in groups of 2! Also, I've changed the datatype of some columns, some did not need to be float64. So I brought it down to float16. It gets very far but still ends up throwing a MemoryError.
intermediatedfList = dfList
tempdfList = []
#Until I merge all the 48 frames two at a time, till it becomes size 2
while(len(intermediatedfList) != 2):
#If there are even number of DataFrames
if len(intermediatedfList)%2 == 0:
#Go in steps of two
for i in range(0, len(intermediatedfList), 2):
#Merge DataFrame in index i, i + 1
df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
print(df1.info(memory_usage='deep'))
#Append it to this list
tempdfList.append(df1)
#After DataFrames in intermediatedfList merging it two at a time using an auxillary list tempdfList,
#Set intermediatedfList to be equal to tempdfList, so it can continue the while loop.
intermediatedfList = tempdfList
else:
#If there are odd number of DataFrames, keep the first DataFrame out
tempdfList = [intermediatedfList[0]]
#Go in steps of two starting from 1 instead of 0
for i in range(1, len(intermediatedfList), 2):
#Merge DataFrame in index i, i + 1
df1 = pd.merge(intermediatedfList[i], intermediatedfList[i + 1], left_on=['name', 'id'], right_on=['name', 'id'], how='outer')
print(df1.info(memory_usage='deep'))
tempdfList.append(df1)
#After DataFrames in intermediatedfList merging it two at a time using an auxillary list tempdfList,
#Set intermediatedfList to be equal to tempdfList, so it can continue the while loop.
intermediatedfList = tempdfList
Is there any way I can optimize my code to avoid MemoryError, I've even used AWS 192GB RAM (I now owe them 7$ which I could've given one of yall), that gets farther than what I've gotten, and it still throws MemoryError after reducing a list of 28 DataFrames to 4..

You may get some benefit from performing index-aligned concatenation using pd.concat. This should hopefully be faster and more memory efficient than an outer merge as well.
df_list = [df1, df2, ...]
for df in df_list:
df.set_index(['name', 'id'], inplace=True)
df = pd.concat(df_list, axis=1) # join='inner'
df.reset_index(inplace=True)
Alternatively, you can replace the concat (second step) by an iterative join:
from functools import reduce
df = reduce(lambda x, y: x.join(y), df_list)
This may or may not be better than the merge.

Seems like part of what dask dataframes were designed to do (out of memory ops with dataframes). See
Best way to join two large datasets in Pandas for example code. Sorry not copying and pasting but don't want to seem like I am trying to take credit from answerer in linked entry.

You can try a simple for loop. The only memory optimization I have applied is downcasting to most optimal int type via pd.to_numeric.
I am also using a dictionary to store dataframes. This is good practice for holding a variable number of variables.
import pandas as pd
dfs = {}
dfs[1] = pd.DataFrame([['a', 1, 10], ['a', 2, 20], ['b', 1, 4], ['c', 1, 2], ['e', 2, 10]])
dfs[2] = pd.DataFrame([['a', 1, 15], ['a', 2, 20], ['c', 1, 2]])
dfs[3] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 1]])
dfs[4] = pd.DataFrame([['d', 1, 10], ['e', 2, 20], ['f', 1, 15]])
df = dfs[1].copy()
for i in range(2, max(dfs)+1):
df = pd.merge(df, dfs[i].rename(columns={2: i+1}),
left_on=[0, 1], right_on=[0, 1], how='outer').fillna(-1)
df.iloc[:, 2:] = df.iloc[:, 2:].apply(pd.to_numeric, downcast='integer')
print(df)
0 1 2 3 4 5
0 a 1 10 15 -1 -1
1 a 2 20 20 -1 -1
2 b 1 4 -1 -1 -1
3 c 1 2 2 -1 -1
4 e 2 10 -1 20 20
5 d 1 -1 -1 10 10
6 f 1 -1 -1 1 15
You should not, as a rule, combine strings such as "missing" with numeric types, as this will turn your entire series into object type series. Here we use -1, but you may wish to use NaN with float dtype instead.

So, you have 48 dfs with 3 columns each - name, id, and different column for every df.
You don`t must to use merge....
Instead, if you concat all the dfs
df = pd.concat([df1,df2,df3,df4])
You will recieve:
Out[3]:
id name pricepart1 pricepart2 pricepart3 pricepart4
0 1 a 10.0 NaN NaN NaN
1 2 a 20.0 NaN NaN NaN
2 1 b 4.0 NaN NaN NaN
3 1 c 2.0 NaN NaN NaN
4 2 e 10.0 NaN NaN NaN
0 1 a NaN 15.0 NaN NaN
1 2 a NaN 20.0 NaN NaN
2 1 c NaN 2.0 NaN NaN
0 1 d NaN NaN 10.0 NaN
1 2 e NaN NaN 20.0 NaN
2 1 f NaN NaN 1.0 NaN
0 1 d NaN NaN NaN 10.0
1 2 e NaN NaN NaN 20.0
2 1 f NaN NaN NaN 15.0
Now you can group by name and id and take the sum:
df.groupby(['name','id']).sum().fillna('missing').reset_index()
If you will try it with the 48 dfs you will see it solves the MemoryError:
dfList = []
#To create the 48 DataFrames of size 62245 X 3
for i in range(0, 49):
dfList.append(pd.DataFrame(np.random.randint(0,100,size=(62245, 3)), columns=['name', 'id', 'pricepart' + str(i + 1)]))
df = pd.concat(dfList)
df.groupby(['name','id']).sum().fillna('missing').reset_index()

Pass a pd.Series to a dataframe?

I tried the following code but the new column consists of only NAN values.
df['new'] = pd.Series(np.repeat(1, len(df)))
Can someone explain to me what the problem is here?

It is possible that the index of the DataFrame df does not match with the newly created Series'. For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [11, 22, 33, 44, 55]}, index=['r1','r2','r3','r4','r5'])
df['new'] = pd.Series(np.repeat(1, len(df)))
print df
and the output will be:
a new
r1 11 NaN
r2 22 NaN
r3 33 NaN
r4 44 NaN
r5 55 NaN
since the index of pd.Series(np.repeat(1, len(df))) is Int64Index([0, 1, 2, 3, 4], dtype='int64').
To prevent that, specify the index argument when creating the Series:
df['new'] = pd.Series(np.repeat(1, len(df)), index=df.index)
Alternatively, you can just pass a numpy array if the index is to be ignored:
df['new'] = np.repeat(1, len(df))
without needing to create a Series (in fact, df['new'] = 1 will do for this case). Using a Series is helpful when you need to align the new column with the existing DataFrame using the index.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas apply function to multindexed columns that takes columns (Series) as arguments - python

Related

Fill NA values in dataframe by mapping a double indexed groupby object [duplicate]

How to use groupby max in own groupby function?

Fillna in pandas with respect to similar line [duplicate]

Efficient way to merge multiple large DataFrames

Pass a pd.Series to a dataframe?

Categories

Resources