Conditional counting for group variables - python

I have the following dataframe:
import pandas as pd
df = pd.DataFrame({"Shop_type": [1,2,3,3,2,3,1,2,1],
"Self_managed" : [True,False,False,True,True,False,False,True,False],
"Support_required" : [True,True,True,False,False,False,False,False,True]})
My goal is to get an overview of the number of Self_managed shops and Support_required shops per shop type, looking something like this:
Shop_type Self_count Supprt_count
0 1 1 2
1 2 2 1
2 3 1 1
Currently I use the following code to achieve this, but it looks very long and unprofessional. Since I am still learning Python, I would like to improve and have more efficient code. Any ideas?
df1 = df[df["Self_managed"] == True]
df1 = df1.groupby(['Shop_type']).size().reset_index(name='Self_count')
df2 = df[df["Support_required"] == True]
df2 = df2.groupby(['Shop_type']).size().reset_index(name='Supprt_count')
df = df1.merge(df2, how = "outer", on="Shop_type")

Seems like you need
df.groupby('Shop_type',as_index=False).sum()
Out[298]:
Shop_type Self_managed Support_required
0 1 1.0 2.0
1 2 2.0 1.0
2 3 1.0 1.0
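If you also want the exact column names from your desired output, named aggregation is one way to get there in a single call (just a sketch reusing the names from the question; booleans sum to integer counts):
out = df.groupby('Shop_type', as_index=False).agg(
    Self_count=('Self_managed', 'sum'),
    Supprt_count=('Support_required', 'sum'),
)
print(out)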

Related

How to fill an empty df with a FOR loop

I need to create a dataframe with two columns: a variable and a function of that variable. The following code raises an error:
test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1,30):
    k = 0.5**i
    test.append(i, k)
print(test)
TypeError: cannot concatenate object of type '<class 'int'>'; only Series and DataFrame objs are valid
What do I need to fix here? The answer is probably simple, but I'm having trouble finding it...
Many thanks for your help
Is there a specific reason you are trying to use the loop? You can create the df with Column_1 and use pandas' vectorized operations to create Column_2:
import numpy as np
df = pd.DataFrame(np.arange(1,30), columns=['Column_1'])
df['Column_2'] = 0.5**df['Column_1']
Column_1 Column_2
0 1 0.50000
1 2 0.25000
2 3 0.12500
3 4 0.06250
4 5 0.03125
I like Vaishali's way of approaching it. If you really want to use the for loop, this is how I would have done it:
import pandas as pd
test = pd.DataFrame({'Column_1': pd.Series([], dtype='int'),
                     'Column_2': pd.Series([], dtype='float')})
for i in range(1,30):
    test = test.append({'Column_1': i, 'Column_2': 0.5**i}, ignore_index=True)
test = test.round(5)
print(test)
Column_1 Column_2
0 1.0 0.50000
1 2.0 0.25000
2 3.0 0.12500
3 4.0 0.06250
4 5.0 0.03125
5 6.0 0.01562
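Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop above won't run on a current pandas. A small sketch of the same loop that collects plain dicts first and builds the frame once at the end:
import pandas as pd

rows = []
for i in range(1, 30):
    rows.append({'Column_1': i, 'Column_2': 0.5**i})   # plain Python list, no per-step DataFrame growth
test = pd.DataFrame(rows).round(5)
print(test)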

Pandas sampling a dataframe but treating multiple rows as a single row based on column

Consider the following toy code that performs a simplified version of my actual question:
import pandas
df = pandas.DataFrame(
    {
        'n_event': [1,2,3,4,5],
        'some column': [0,1,2,3,4],
    }
)
df = df.set_index(['n_event'])
print(df)
resampled_df = df.sample(frac=1, replace=True)
print(resampled_df)
The resampled_df is, as its name suggests, a resampled version of the original one (with replacement). This is exactly what I want. An example output of the previous code is
         some column
n_event
1                  0
2                  1
3                  2
4                  3
5                  4

         some column
n_event
4                  3
1                  0
4                  3
4                  3
2                  1
Now for my actual question I have the following dataframe:
import pandas
df = pandas.DataFrame(
    {
        'n_event': [1,1,2,2,3,3,4,4,5,5],
        'n_channel': [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
    }
)
df = df.set_index(['n_event','n_channel'])
print(df)
which looks like
                   some column
n_event n_channel
1       1                    0
        2                    1
2       1                    2
        2                    3
3       1                    4
        2                    5
4       1                    6
        2                    7
5       1                    8
        2                    9
I want to do exactly the same as before, resample with replacement, but treating each group of rows with the same n_event as a single entity. A hand-built example of what I want to do can look like this:
                   some column
n_event n_channel
2       1                    2
        2                    3
2       1                    2
        2                    3
3       1                    4
        2                    5
1       1                    0
        2                    1
5       1                    8
        2                    9
As seen, each n_event was treated as a whole and the rows within each event were not mixed up.
How can I do this without proceeding by brute force (i.e. without for loops, etc)?
I have tried df.sample(frac=1, replace=True, ignore_index=False) and a few things with groupby, without success.
Would a pivot()/melt() sequence work for you?
Use pivot() to go from long to wide (make each group a single row).
Do the sampling.
Then back from wide to long using melt().
Don't have time to work out a full answer but thought I would get this idea to you in case it might help you.
Following the suggestion of jch I was able to find a solution by combining pivot and stack:
import pandas
df = pandas.DataFrame(
    {
        'n_event': [1,1,2,2,3,3,4,4,5,5],
        'n_channel': [1,2,1,2,1,2,1,2,1,2],
        'some column': [0,1,2,3,4,5,6,7,8,9],
        'other col': [5,6,4,3,2,5,2,6,8,7],
    }
)
resampled_df = df.pivot(
    index='n_event',
    columns='n_channel',
    values=set(df.columns) - {'n_event','n_channel'},
)
resampled_df = resampled_df.sample(frac=1, replace=True)
resampled_df = resampled_df.stack()
print(resampled_df)
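An alternative sketch that skips the pivot entirely: sample the unique n_event values with replacement, then merge them back onto df so each sampled event drags along all of its n_channel rows (a left merge keeps duplicates and the sampled order). This assumes df is still in its long, un-indexed form, as in the block above:
events = df[['n_event']].drop_duplicates()
sampled_events = events.sample(frac=1, replace=True)           # resample whole events
resampled_df = sampled_events.merge(df, on='n_event', how='left')
resampled_df = resampled_df.set_index(['n_event', 'n_channel'])
print(resampled_df)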

Conditional merge / join of two large Pandas DataFrames with duplicated keys based on values of multiple columns - Python

I come from R and honestly, this is the simplest thing to do in one line with R data.tables, and the operation is also quite fast for large data.tables. But I'm really struggling to implement it in Python. None of the use cases previously mentioned are suitable for my application. The major issue at hand is the memory usage of the Python solutions, as I will explain below.
The problem: I've got two large DataFrames df1 and df2 (each around 50M-100M rows) and I need to merge two (or n) columns of df2 onto df1 based on two conditions:
1) df1.id = df2.id (usual case of merge)
2) df2.value_2A <= df1.value_1 <= df2.value_2B
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
df1
Out[13]:
id value_1
0 1 2
1 1 5
2 1 7
3 2 1
4 2 3
5 3 4
df2
Out[14]:
id value_2A value_2B
0 1 0 1
1 1 3 5
2 1 7 9
3 1 12 15
4 2 0 1
5 2 2 4
6 2 3 6
7 3 1 3
desired_output
Out[15]:
id value_1 value_2A value_2B
0 1 2 NaN NaN
1 1 5 3.0 5.0
2 1 7 7.0 9.0
3 2 1 0.0 1.0
4 2 3 2.0 4.0
5 2 3 3.0 6.0
6 3 4 NaN NaN
Now, I know this can be done by first merging df1 and df2 the 'left' way and then filtering the data. But this is a horrendous solution in terms of scaling. I've got 50M x 50M rows with multiple duplicates of id. This would create an enormous dataframe which I would then have to filter.
## This is NOT a solution because memory usage is just too large and
## too many operations make it extremely inefficient and slow at large scale
output = pd.merge(df1, df2, on='id', how='left') ## output becomes very large in my case
output.loc[~((output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])), ['value_2A', 'value_2B']] = np.nan
output = output.loc[~ output['value_2A'].isnull()]
output = pd.merge(df1, output, on=['id', 'value_1'], how='left')
This is so inefficient. I'm merging a large dataset twice to get the desired output and creating massive dataframes while doing so. Yuck!
Think of this as two dataframes of events which I'm trying to match together, that is, tagging whether events of df1 have occurred within events of df2. There are multiple events for each id in both df1 and df2. Events of df2 are NOT mutually exclusive. The conditional join really needs to happen at the time of joining, not afterwards.
This is done easily in R:
## in R realm ##
require(data.table)
desired_output <- df2[df1, on=.(id, value_2A <= value_1, value_2B >= value_1)] #fast and easy operation
Is there any way to do this in Python?
Interesting question!
Looks like pandasql might do what you want. Please see :
How to do a conditional join in python Pandas?
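For what it's worth, a minimal sketch of that pandasql route on the example data (assuming pandasql is installed; it runs the query through SQLite, so it may not be the answer at 50M rows, but it states the conditional join directly):
from pandasql import sqldf

query = """
SELECT df1.id, df1.value_1, df2.value_2A, df2.value_2B
FROM df1
LEFT JOIN df2
  ON df1.id = df2.id
 AND df1.value_1 BETWEEN df2.value_2A AND df2.value_2B
"""
desired_output = sqldf(query, locals())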
Yeah. It's an annoying problem. I handled this by splitting the left DataFrame into chunks.
def merge_by_chunks(left, right, condition=None, **kwargs):
    chunk_size = 1000
    merged_chunks = []
    for chunk_start in range(0, len(left), chunk_size):
        print(f"Merged {chunk_start} ", end="\r")
        merged_chunk = pd.merge(left=left[chunk_start: chunk_start+chunk_size], right=right, **kwargs)
        if condition is not None:
            merged_chunk = merged_chunk[condition(merged_chunk)]
        merged_chunks.append(merged_chunk)
    return pd.concat(merged_chunks)
Then you can provide the condition as a function.
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
def condition_func(output):
    return ((output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B']))
output = merge_by_chunks(df1, df2, condition=condition_func, on='id', how='left')
merge_by_chunks(df1, output, on=['id', 'value_1'], how='left')
It can be pretty slow depending on the size of the DataFrame, but it doesn't run out of memory.

Python: How to replace only 0 values in a column by multiplication of 2 columns in Dataframe with a loop?

Here is my DataFrame:
df = pd.DataFrame({'pack': [2,2,2,2], 'a_cost': [10.5,0,11,0], 'b_cost': [0,6,0,6.5]})
It should look like this:
   pack  a_cost  b_cost
0     2    10.5     0.0
1     2     0.0     6.0
2     2    11.0     0.0
3     2     0.0     6.5
At this point you will find that the a_cost and b_cost columns have 0s where the other column has a value. I would like my function to follow this logic...
for i in df.a_cost:
    if i == 0:
        # multiply the b_cost value by the pack value and
        # replace the 0 with the result (example: 6.0 * 2 = 12)
for i in df.b_cost:
    if i == 0:
        # divide the a_cost value by the pack value and
        # replace the 0 with the result (example: 10.5 / 2 = 5.25)
I can't figure out how to write this logic successfully... Here is the expected output:
Output in code:
df={'pack':[2,2,2,2], 'a_cost':[10.5,12.0,11,13.0], 'b_cost':[5.25,6,5.50,6.5]}
Help is really appreciated!
IIUC,
df.loc[df.a_cost.eq(0), 'a_cost'] = df.b_cost * df.pack
df.loc[df.b_cost.eq(0), 'b_cost'] = df.a_cost / df.pack
You can also play with mask and fillna:
df['a_cost'] = df.a_cost.mask(df.a_cost.eq(0)).fillna(df.b_cost * df.pack)
df['b_cost'] = df.b_cost.mask(df.b_cost.eq(0)).fillna(df.a_cost / df.pack)
Update: as commented, you can use other in mask:
df['a_cost'] = df.a_cost.mask(df.a_cost.eq(0), other=df.b_cost * df.pack)
Also note that the second filtering is not needed once you have already filled the 0s in column a_cost. That is, we can just do:
df['b_cost'] = df.a_cost / df.pack
after the first command in both methods.
Output:
pack a_cost b_cost
0 2 10.5 5.25
1 2 12.0 6.00
2 2 11.0 5.50
3 2 13.0 6.50
import numpy as np
import pandas as pd
df = pd.DataFrame({'pack':[2,2,2,2], 'a_cost':[10.5,0,11,0], 'b_cost':[0,6,0,6.5]})
df['a_cost'] = np.where(df['a_cost']==0, df['pack']*df['b_cost'], df['a_cost'])
df['b_cost'] = np.where(df['b_cost']==0, df['a_cost']/df['pack'], df['b_cost'])
print (df)
#pack a_cost b_cost
#0 2 10.5 5.25
#1 2 12.0 6.0
#2 2 11.0 5.50
#3 2 13.0 6.5
Try this:
df['a_pack'] = df.apply(lambda x: x['b_cost']*x['pack'] if x['a_cost'] == 0 and x['b_cost'] != 0 else x['a_cost'], axis = 1)
df['b_pack'] = df.apply(lambda x: x['a_cost']/x['pack'] if x['b_cost'] == 0 and x['a_cost'] != 0 else x['b_cost'], axis = 1)

Confusion about modifying DataFrame within Function

When working within a function, I am struggling to understand why sometimes I can modify dataframe values and other times I cannot. For example, variable assignment works without a return statement, but pd.get_dummies seems to need me to use a return statement at the end of the function.
import numpy as np
import pandas as pd
data = {
    'x': np.linspace(0,10,3),
    'y': np.linspace(10,20,3),
    'cat1': ['dog','cat','fish'],
    'cat2': ['website1','website1','website2'],
}
df = pd.DataFrame(data)
   cat1      cat2     x     y
0   dog  website1   0.0  10.0
1   cat  website1   5.0  15.0
2  fish  website2  10.0  20.0
Modifies the original dataframe
def change_variable(df):
    df['x'] = 999
Doesn't Modify
def ready_for_ml(df):
    pd.get_dummies(df, columns=['cat1','cat2'])
My work-around:
def ready_for_ml_return_df_version(df):
    df = pd.get_dummies(df, columns=['cat1','cat2'])
    return df
df = ready_for_ml_return_df_version(df)
      x     y  cat  dog  fish  website1  website2
0   0.0  10.0    0    1     0         1         0
1   5.0  15.0    1    0     0         1         0
2  10.0  20.0    0    0     1         0         1
I have read about using inplace transforms as a solution (How to modify a pandas DataFrame in a function so that changes are seen by the caller?), but pd.get_dummies doesn't seem to have an inplace argument. So I am curious about this function, but also more broadly about the underlying behavior.
Thank you for your time.
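For context, a minimal sketch of why the two cases behave differently (the id() prints are purely illustrative): df['x'] = 999 mutates the very object the caller passed in, while pd.get_dummies builds a brand-new DataFrame; rebinding the local name df to that new object inside the function never touches the caller's variable, which is why returning it (your work-around) is the idiomatic fix.
def change_variable(df):
    df['x'] = 999                    # mutates the object the caller also holds
    print(id(df))                    # same id as the caller's df

def ready_for_ml(df):
    df = pd.get_dummies(df, columns=['cat1', 'cat2'])   # brand-new object
    print(id(df))                    # different id; the caller's df is untouched

change_variable(df)                  # caller's df now has x == 999
ready_for_ml(df)                     # caller's df still has cat1/cat2, no dummy columns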
