I'm trying to do an apparently simple operation in Python:
I have some datasets, say 6, and I want to sum the values of one column whenever the values of the other two columns coincide. After that, I want to divide the summed column by the number of datasets, in this case 6 (i.e. calculate the arithmetic mean). If a combination of the other two columns does not appear in a given dataset, that dataset should contribute 0 to the sum.
Here are two dataframes as an example:
Code1 Code2 Distance
0 15.0 15.0 2
1 15.0 60.0 3
2 15.0 69.0 2
3 15.0 434.0 1
4 15.0 842.0 0
Code1 Code2 Distance
0 14.0 15.0 4
1 14.0 60.0 7
2 15.0 15.0 0
3 15.0 60.0 1
4 15.0 69.0 9
The first column is the df.index column. Then I want to sum the 'Distance' column only if the 'Code1' and 'Code2' columns coincide. In this case, the desired output would be something like:
Code1 Code2 Distance
0 14.0 15.0 2
1 14.0 60.0 3.5
2 15.0 15.0 1
3 15.0 60.0 2
4 15.0 69.0 5.5
5 15.0 434.0 0.5
6 15.0 842.0 0
I've tried to do this using conditionals, but for more than two dataframes it's really hard to do. Is there any method in Pandas to do it faster?
Any help would be appreciated :-)
You could put all your data frames in a list and then use reduce to either append or merge them all.
Take a look at reduce here.
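For reference, functools.reduce folds a two-argument function over a list from left to right; a minimal sketch:

from functools import reduce

# reduce applies the function cumulatively: ((1 + 2) + 3) + 4 = 10
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(total)  # 10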
First, some helper functions for sample data generation are defined below.
import pandas
import numpy as np
from functools import reduce  # needed for reduce on Python 3

# GENERATE DATA

# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n, 1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n, 1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n, 1) * 10)

# Generate a data frame as hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
        generate_code_1(n),
        generate_code_2(n),
        generate_distance(n)
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove possible duplications of Code 1 and Code 2; take the smallest distance in case of duplication.
    # Duplications would break the merge method, although not the append method.
    df = df.groupby(['Code 1', 'Code 2'], as_index=False)
    df = df.aggregate(np.min)
    return df

# Generate n data frames, each with m rows, in a list
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add a Count column, needed by the merge method to keep track of how many cases we have seen
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list
Append method (faster, shorter, nicer)
df_list = generate_data_frames(94, 5)
# Append all data frames together using reduce
df_append = reduce(lambda df_1, df_2 : df_1.append(df_2), df_list)
# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.aggregate(np.mean)
df_append_result
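Note that DataFrame.append has since been deprecated (and removed in pandas 2.0), so on recent pandas versions the same fold can be written with pandas.concat; a sketch under that assumption:

# Equivalent on newer pandas: concatenate all frames in one call instead of reducing with append
df_append = pandas.concat(df_list, ignore_index=True)
df_append_result = df_append.groupby(['Code 1', 'Code 2'], as_index=False).aggregate(np.mean)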
Merge method
df_list = generate_data_frames(94, 5, with_count=True)

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df

# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)

# Replace Distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result
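One caveat: both methods average over the data frames in which a (Code 1, Code 2) pair actually appears. If, as in the question, a pair that is missing from a data frame should count as 0 (i.e. always divide by the total number of data frames), one can sum instead and divide by len(df_list). A minimal sketch, using the df_list from the append method above:

# Sketch: average over ALL data frames, counting missing (Code 1, Code 2) pairs as 0
df_all = pandas.concat(df_list, ignore_index=True)
df_sum = df_all.groupby(['Code 1', 'Code 2'], as_index=False).aggregate(np.sum)
df_sum['Distance'] = df_sum['Distance'] / len(df_list)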
Related
I have two dataframes:
df1 is a reference table with a list of individual codes and their corresponding values.
df2 is an excerpt from a larger dataset, where one of the columns contains multiple codes per row. It also contains other values I want to ignore, e.g. blanks and 'Not Applicable'.
I need to split out each individual code from df2 and find the corresponding value from the reference table df1. I then want to return a column in df2 with the maximum value from the entire string of codes.
import pandas as pd
df1 = [['H302',18],
['H312',17],
['H315',16],
['H316',15],
['H319',14],
['H320',13],
['H332',12],
['H304',11]]
df1 = pd.DataFrame(df1, columns=['Code', 'Value'])
df2 = [['H302,H304'],
['H332,H319,H312,H320,H316,H315,H302,H304'],
['H315,H312,H316'],
['H320,H332,H316,H315,H304,H302,H312'],
['H315,H319,H312,H316,H332'],
['H312'],
['Not Applicable'],
['']]
df2 = pd.DataFrame(df2, columns=['Code'])
I had previously used the following:
df3 = []
for i in range(len(df2)):
    df3.append(df2['Code'][i].split(","))

max_values = []
for i in range(len(df3)):
    for j in range(len(df3[i])):
        for index in range(len(df1)):
            if df1['Code'][index] == df3[i][j]:
                df3[i][j] = df1['Value'][index]
    max_values.append(max(df3[i]))
df2["Max Value"] = max_values
However, the .append function is being removed, and when I run this code I get the following error: "'>' not supported between instances of 'numpy.ndarray' and 'str'".
Code
df2['max'] = (
    df2['Code']
    .str.split(',')
    .explode()
    .map(df1.set_index('Code')['Value'])
    .groupby(level=0).max()
)
How it works (a step-by-step sketch follows this list):
Split by delimiter ,
Explode to convert lists to rows
Use map to substitute values from df1
Groupby on level=0 to find max value per row group
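A step-by-step version of the same pipeline (a sketch using the df1 and df2 defined in the question; the intermediate variable names are just for illustration):

split_codes = df2['Code'].str.split(',')               # each row becomes a list of codes
exploded = split_codes.explode()                       # one code per row; the original row index is preserved
values = exploded.map(df1.set_index('Code')['Value'])  # 'H302' -> 18, 'H304' -> 11, unknown strings -> NaN
df2['max'] = values.groupby(level=0).max()             # level=0 is the original row index, so max per row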
Result
Code max
0 H302,H304 18.0
1 H332,H319,H312,H320,H316,H315,H302,H304 18.0
2 H315,H312,H316 17.0
3 H320,H332,H316,H315,H304,H302,H312 18.0
4 H315,H319,H312,H316,H332 17.0
5 H312 17.0
6 Not Applicable NaN
7 NaN
One can use pandas.Series.apply with a custom lambda function as follows
df2['Max_Value'] = df2['Code'].apply(lambda x: max([df1.loc[df1['Code'] == i, 'Value'].values[0] for i in x.split(',') if i != 'Not Applicable' and i != ''], default=0))
[Out]:
Code Max_Value
0 H302,H304 18
1 H332,H319,H312,H320,H316,H315,H302,H304 18
2 H315,H312,H316 17
3 H320,H332,H316,H315,H304,H302,H312 18
4 H315,H319,H312,H316,H332 17
5 H312 17
6 Not Applicable 0
7 0
Given the first note below, if one doesn't want to use .apply(), one can use a list comprehension as follows
df2['Max_Value'] = [max([df1.loc[df1['Code'] == i, 'Value'].values[0] for i in x.split(',') if i != 'Not Applicable' and i != ''], default=0) for x in df2['Code']]
[Out]:
Code Max_Value
0 H302,H304 18
1 H332,H319,H312,H320,H316,H315,H302,H304 18
2 H315,H312,H316 17
3 H320,H332,H316,H315,H304,H302,H312 18
4 H315,H319,H312,H316,H332 17
5 H312 17
6 Not Applicable 0
7 0
Notes:
There are strong opinions on using .apply(), so one might want to read this.
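If lookup speed becomes a concern, the repeated boolean indexing into df1 can be replaced by a plain dictionary built once; a sketch under the same assumptions (the df1/df2 from the question):

value_map = dict(zip(df1['Code'], df1['Value']))  # code -> value lookup built once
df2['Max_Value'] = [
    max((value_map[c] for c in x.split(',') if c in value_map), default=0)
    for x in df2['Code']
]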
As an alternative:
import numpy as np

keys = df1['Code'].to_list()
df2["Code"] = df2["Code"].str.split(',')

def get_max(x):
    max_ = []
    for i in x:
        if i in keys:
            max_.append(df1.loc[df1.Code == i, 'Value'].values[0])
    if len(max_) > 0:
        return max(max_)
    else:
        return np.nan  # no recognised codes in this row

df2['max_value'] = df2['Code'].apply(get_max)
print(df2)
'''
Code max_value
0 ['H302', 'H304'] 18.0
1 ['H332', 'H319', 'H312', 'H320', 'H316', 'H315', 'H302', 'H304'] 18.0
2 ['H315', 'H312', 'H316'] 17.0
3 ['H320', 'H332', 'H316', 'H315', 'H304', 'H302', 'H312'] 18.0
4 ['H315', 'H319', 'H312', 'H316', 'H332'] 17.0
5 ['H312'] 17.0
6 ['Not Applicable'] nan
7 [''] nan
'''
I am applying an inner join in a for loop on another dataset, and now I need to remove the rows that are already part of the inner join. I went with DataFrame.isin(another_df), but it is not giving me the expected results. I checked the column names and their data types; they are all the same. Can someone help me with this, please?
In the following code, isin is where I check between the two data frames. Still, I'm not getting the expected result: I get the same set of rows back, even though the frames have the same number of rows and columns.
Note: I'm dropping an extra column inside the isin call, as it is only present in one of the dataframes.
My code looks like this:
df = pd.DataFrame(columns=override.columns)
for i in list1:
    join_value = tuple(i)
    i.append('creditor_tier_interim')
    subset_df = override.merge(criteria[i].dropna(), on=list(join_value), how='inner')
    subset_df['PRE_CHARGEOFF_FLAG'] = pd.to_numeric(subset_df.PRE_CHARGEOFF_FLAG)
    override = override[~override.isin(subset_df.drop(columns='creditor_tier_interim'))].dropna(how='all')
    print('The override shape would be:', override.shape)
    df = df.append(subset_df)
df = df.append(override)
It sounds as if you have a 'left' and a 'right' DataFrame and you're looking for the records that are exclusively in one or the other. The code below returns the rows that are only in the right or only in the left DataFrame.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
dataframe_left = DataFrame(np.random.randn(25).reshape(5,5),columns=['A','B','C','D','E'],index=np.arange(5))
dataframe_right = DataFrame(np.random.randn(25).reshape(5,5),columns=['A','B','C','D','E'],index=np.arange(5))
insert_left = DataFrame(np.arange(5).reshape(1,5),columns=['A','B','C','D','E'],index=[7])
insert_right = DataFrame(np.arange(5).reshape(1,5),columns=['A','B','C','D','E'], index=[6])
dataframe_right = dataframe_right.append(insert_right)
dataframe_left = dataframe_left.append(insert_left)
The code above produces this output:
Left Table

                      A                     B                     C                     D                     E
0   -0.3240086903973736    1.0441549453943946  -0.23640436950107843    0.5466767470739027   -0.2123693649877372
1  -0.04263388410830733   -0.4855492977594353   -1.5584284407735072    1.2438524586306603  -0.31087239909921277
2    0.6982581750529829  -0.42379154444215905    1.1625089013522614    -3.378898146269229    1.0550121763954057
3    0.3774337535208665    0.6402576096348337   -0.2787520258645991   0.31071767629270125   0.34499495360962007
4    -0.133649590435452    0.3679768579635411   -2.0196709364730014    1.2860033685128436  -0.49674737879741193
7                   0.0                   1.0                   2.0                   3.0                   4.0
Right Table

                       A                     B                     C                     D                     E
0   -0.09946693056759418  -0.03378933704588447   -0.4117873368048701   0.21976489856531914   -0.7020527418892488
1    -2.9936183481793233   0.42443360961021837   -0.1681576564885903   -0.5080538565354785  -0.29483296271514153
2    -0.6567306172004121    -1.221239625798079   -1.2604670988941196   0.44472543746187265   -0.4562966381137614
3  -0.0027697712245823482   0.1323767897141191  -0.11073953230359104   -0.3596157927825233    1.9894525572891626
4     0.5170901011452596   -1.1694605240821456   0.29238712582282705  -0.38912521589557797   -0.8793074660039492
6                    0.0                   1.0                   2.0                   3.0                   4.0
After setting up the test dataframes we can join the two and filter for the rows we're interested in:
tmp = pd.merge(
    left=dataframe_left,
    right=dataframe_right,
    right_index=True,
    left_index=True,
    how='outer',
    suffixes=['_left', '_right'],
    indicator=True
)
tmp[tmp._merge.isin(['right_only', 'left_only'])]
This produces the below result
   A_left  B_left  C_left  D_left  E_left  A_right  B_right  C_right  D_right  E_right      _merge
6     NaN     NaN     NaN     NaN     NaN      0.0      1.0      2.0      3.0      4.0  right_only
7     0.0     1.0     2.0     3.0     4.0      NaN      NaN      NaN      NaN      NaN   left_only
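To go back to the original column names afterwards, one can keep only the left-only rows and drop the right-hand columns; a sketch based on the frames above:

# Rows that exist only in the left DataFrame, restored to the original column names
left_only = tmp[tmp._merge == 'left_only']
left_only = left_only[[c for c in tmp.columns if c.endswith('_left')]]
left_only.columns = [c.replace('_left', '') for c in left_only.columns]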
I come from R and, honestly, this is the simplest thing to do in one line using R data.tables, and the operation is also quite fast for large data tables. But I'm really struggling to implement it in Python. None of the use cases previously mentioned were suitable for my application. The major issue at hand is the memory usage of the Python solution, as I will explain below.
The problem: I've got two large DataFrames df1 and df2 (each around 50M-100M rows) and I need to merge two (or n) columns of df2 to df1 based on two conditions:
1) df1.id = df2.id (usual case of merge)
2) df2.value_2A <= df1.value_1 <= df2.value_2B
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
df1
Out[13]:
id value_1
0 1 2
1 1 5
2 1 7
3 2 1
4 2 3
5 3 4
df2
Out[14]:
id value_2A value_2B
0 1 0 1
1 1 3 5
2 1 7 9
3 1 12 15
4 2 0 1
5 2 2 4
6 2 3 6
7 3 1 3
desired_output
Out[15]:
id value_1 value_2A value_2B
0 1 2 NaN NaN
1 1 5 3.0 5.0
2 1 7 7.0 9.0
3 2 1 0.0 1.0
4 2 3 2.0 4.0
5 2 3 3.0 6.0
6 3 4 NaN NaN
Now, I know this can be done by first merging df1 and df2 the 'left' way and then filtering the data. But this is a horrendous solution in terms of scaling. I've got 50M x 50M rows with multiple duplicates of id. This would create some enormous dataframe which I would have to filter.
## This is NOT a solution because memory usage is just too large and
## too many operations, making it extremely inefficient and slow at large scale
output = pd.merge(df1, df2, on='id', how='left') ## output becomes very large in my case
output.loc[~((output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])), ['value_2A', 'value_2B']] = np.nan
output = output.loc[~ output['value_2A'].isnull()]
output = pd.merge(df1, output, on=['id', 'value_1'], how='left')
This is so inefficient. I'm merging a large dataset twice to get the desired output and creating massive dataframes while doing so. Yuck!
Think of this as two dataframes of events, which I'm trying to match together. That is, tagging whether events of df1 have occurred within events of df2. There are multiple events for each id in both df1 and df2. Events of df2 are NOT mutually exclusive. The conditional join really needs to happen at the time of joining, not after.
This is done easily in R:
## in R realm ##
require(data.table)
desired_output <- df2[df1, on=.(id, value_2A <= value_1, value_2B >= value_1)] #fast and easy operation
Is there any way to do this in Python?
Interesting question!
Looks like pandasql might do what you want. Please see:
How to do a conditional join in python Pandas?
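A sketch of what that could look like with pandasql (this assumes the pandasql package is installed; the range condition goes straight into the SQL join):

from pandasql import sqldf  # assumes pandasql is installed

query = """
SELECT df1.id, df1.value_1, df2.value_2A, df2.value_2B
FROM df1
LEFT JOIN df2
  ON df1.id = df2.id
 AND df1.value_1 BETWEEN df2.value_2A AND df2.value_2B
"""
result = sqldf(query, locals())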
Yeah. It's an annoying problem. I handled this by splitting the left DataFrame into chunks.
def merge_by_chunks(left, right, condition=None, **kwargs):
    chunk_size = 1000
    merged_chunks = []
    for chunk_start in range(0, len(left), chunk_size):
        print(f"Merged {chunk_start} ", end="\r")
        merged_chunk = pd.merge(left=left[chunk_start: chunk_start + chunk_size], right=right, **kwargs)
        if condition is not None:
            merged_chunk = merged_chunk[condition(merged_chunk)]
        merged_chunks.append(merged_chunk)
    return pd.concat(merged_chunks)
Then you can provide the condition as a function.
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})

def condition_func(output):
    return (output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])

output = merge_by_chunks(df1, df2, condition=condition_func, on='id', how='left')
merge_by_chunks(df1, output, on=['id', 'value_1'], how='left')
It can be pretty slow depending on the size of the DataFrame, but it doesn't run out of memory.
I have a large dataset with 3 columns:
sku center units
0 103896 1 2.0
1 103896 1 0.0
2 103896 1 5.0
3 103896 1 0.0
4 103896 1 7.0
5 103896 1 0
And I need to use a groupby-apply.
def function_a(x):
    return np.sum((x > 0).iloc[::-1].cumsum() == 0)

def function_b(x):
    return x.eq(0).sum() / ((x.eq(0) & x.shift().ne(0)).sum())
Using dask (df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float))), I have many problems applying the first function, because dask does not support index operations (.iloc) and the results are totally wrong.
Is it possible to apply those functions using a PySpark UDF?
Assumptions
Your index (in the example above: 0, 1, 2, 3, 4, 5) corresponds to the ordering you want, e.g. because the data comes from CSVs of the form
0,103896,1,2.0
1,103896,1,0.0
2,103896,1,5.0
where the first column corresponds to the sample number. When you then read the data with:
import dask.dataframe as dd
df = dd.read_csv('path/to/data_*.csv', header=None)
df.columns = ['id', 'sku', 'center', 'units']
df = df.set_index('id')
this gives you a deterministic DataFrame, meaning the index of the data is the same no matter in what order the data is read from the drive.
Solution to .iloc() problem
You can then change function_a to:

def function_a(x):
    return np.sum((x.sort_index(ascending=False) > 0).cumsum() == 0)
which should now work with
df.groupby(['sku', 'center'])['units'].apply(function_a, meta=(float))
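function_b only uses element-wise comparisons and shift, so it should work with the same pattern unchanged; a sketch under that assumption (not tested against a specific dask version):

df.groupby(['sku', 'center'])['units'].apply(function_b, meta=(float))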
Using the same example from here but just changing the 'A' column to be something that can easily be grouped by:
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df["A"] = pd.Series([1]*3+ [2]*8)
df.head()
whose output now is:
Date A B C D E F G
0 2008-03-18 1 164.93 114.73 26.27 19.21 28.87 63.44
1 2008-03-19 1 164.89 114.75 26.22 19.07 27.76 59.98
2 2008-03-20 1 164.63 115.04 25.78 19.01 27.04 59.61
3 2008-03-25 2 163.92 114.85 27.41 19.61 27.84 59.41
4 2008-03-26 2 163.45 114.84 26.86 19.53 28.02 60.09
5 2008-03-27 2 163.46 115.40 27.09 19.72 28.25 59.62
6 2008-03-28 2 163.22 115.56 27.13 19.63 28.24 58.65
Doing the cumulative sums (code from the linked question) works well when we're assuming it's a single list:
# Put your inputs into a single list
input_cols = ["B", "C"]
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
# Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors1'] = df["single_input_vector"].cumsum()
but how do I cumsum the lists in this case grouped by 'A'? I expected this to work but it doesn't:
df['cumu'] = df.groupby("A")["single_input_vector"].apply(lambda x: list(x)).cumsum()
Instead of [[164.93, 114.73, 26.27], [164.89, 114.75, 26.... I get some rows filled in and others as NaNs. This is what I want (cols [B, C] accumulated into groups of col A):
A cumu
0 1 [[164.93,114.73], [164.89,114.75], [164.63,115.04]]
0 2 [[163.92,114.85], [163.45,114.84], [163.46,115.40], [163.22, 115.56]]
Also, how do I do this in an efficient manner? My dataset is quite big (about 2 million rows).
It doesn't look like you're doing an arithmetic sum; it's more like a concat along axis=1.
First, groupby and concat:
temp_series = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])])
0 [[164.93, 114.73], [164.89, 114.75], [164.63, ...
1 [[163.92, 114.85], [163.45, 114.84], [163.46, ...
Then convert back to a dataframe:
df = temp_series.reset_index().rename(columns={0: 'cumsum'})
In one line
df = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])]).reset_index().rename(columns={0: 'cumsum'})
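With the sample data above, the result should come out roughly as follows (values taken from the expected output in the question):

   A                                                                    cumsum
0  1                   [[164.93, 114.73], [164.89, 114.75], [164.63, 115.04]]
1  2  [[163.92, 114.85], [163.45, 114.84], [163.46, 115.4], [163.22, 115.56]]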