Modify 2 columns in dataframe based on condition - python

I have seen a bunch of examples on Stack Overflow on how to modify a single column in a dataframe based on a condition, but I cannot figure out how to modify multiple columns based on a single condition.
If I have a dataframe generated based on the below code -
import random
import pandas as pd

random_events = ('SHOT', 'MISSED_SHOT', 'GOAL')

events = list()
for i in range(6):
    event = dict()
    event['event_type'] = random.choice(random_events)
    event['coords_x'] = round(random.uniform(-100, 100), 2)
    event['coords_y'] = round(random.uniform(-42.5, 42.5), 2)
    events.append(event)

df = pd.DataFrame(events)
print(df)
   coords_x  coords_y   event_type
0      4.07    -21.75         GOAL
1     -2.46    -20.99         SHOT
2     99.45    -15.09  MISSED_SHOT
3     78.17    -10.17         GOAL
4    -87.24     34.40         GOAL
5    -96.10     30.41         GOAL
What I want to accomplish is the following (in pseudo-code) on each row of the DataFrame -
if df['coords_x'] < 0:
    df['coords_x'] *= -1
    df['coords_y'] *= -1
Is there a way to do this via a df.apply() function that I am missing?
Thank you in advance for your help!

IIUC, you can do this with loc, avoiding the need for apply:
>>> df
   coords_x  coords_y   event_type
0      4.07    -21.75         GOAL
1     -2.46    -20.99         SHOT
2     99.45    -15.09  MISSED_SHOT
3     78.17    -10.17         GOAL
4    -87.24     34.40         GOAL
5    -96.10     30.41         GOAL
>>> df.loc[df.coords_x < 0, ['coords_x', 'coords_y']] *= -1
>>> df
   coords_x  coords_y   event_type
0      4.07    -21.75         GOAL
1      2.46     20.99         SHOT
2     99.45    -15.09  MISSED_SHOT
3     78.17    -10.17         GOAL
4     87.24    -34.40         GOAL
5     96.10    -30.41         GOAL
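If you prefer to avoid the augmented loc assignment, a roughly equivalent sketch with numpy.where flips both columns in one shot (assuming the same df as above; flip is just an illustrative name):

import numpy as np

# Rows where coords_x is negative; reshape to (n, 1) so it broadcasts across both columns.
flip = (df['coords_x'] < 0).to_numpy()[:, None]

# Keep the original values where flip is False, negate both coordinates where it is True.
df[['coords_x', 'coords_y']] = np.where(flip,
                                        -df[['coords_x', 'coords_y']],
                                        df[['coords_x', 'coords_y']])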


Checking Dataframe for Offsetting Values

I have a list of transactions that lists the matter, the date, and the amount. People entering the data often make mistakes and have to reverse out costs by entering a new cost with a negative amount to offset the error. I'm trying to identify both reversal entries and the entry being reversed by grouping my data according to matter number and work date and then comparing Amounts.
The data looks something like this:
|MatterNum|WorkDate|Amount|
|---------|--------|------|
|1|1/02/2022|10|
|1|1/02/2022|15|
|1|1/02/2022|-10|
|2|1/04/2022|15|
|2|1/05/2022|-5|
|2|1/05/2022|5|
So my output table would look like this:
|MatterNum|WorkDate|Amount|Reversal?|
|---------|--------|------|---------|
|1|1/02/2022|10|yes|
|1|1/02/2022|15|no|
|1|1/02/2022|-10|yes|
|2|1/04/2022|15|no|
|2|1/05/2022|-5|yes|
|2|1/05/2022|5|yes|
Right now, I'm using the following code to check each row:
import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [2, '1/4/2022', 12],
    [2, '1/5/2022', -5],
    [2, '1/5/2022', 5]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])

def rev_check(MatterNum, workDate, WorkAmt, df):
    funcDF = df.loc[(df['MatterNum'] == MatterNum) & (df['WorkDate'] == workDate)]
    listCheck = funcDF['Amount'].tolist()
    if WorkAmt * -1 in listCheck:
        return 'yes'

df['reversal?'] = df.apply(lambda row: rev_check(row.MatterNum, row.WorkDate, row.Amount, df), axis=1)
This seems to work, but it is pretty slow. I need to check millions of rows of data. Is there a better way I can approach this that would be more efficient?
If I assume that a "reversal" is when this row's amount is less than the previous row's amount, then pandas can do this with diff:
import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [1, '1/2/2022', 12]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])
print(df)

df['Reversal'] = df['Amount'].diff() < 0
print(df)
Output:
   MatterNum  WorkDate  Amount
0          1  1/2/2022      10
1          1  1/2/2022      15
2          1  1/2/2022     -10
3          1  1/2/2022      12
   MatterNum  WorkDate  Amount  Reversal
0          1  1/2/2022      10     False
1          1  1/2/2022      15     False
2          1  1/2/2022     -10      True
3          1  1/2/2022      12     False
The first row has to be special-cased, since there's nothing to compare against.
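If you want to keep the question's original per-group logic instead (a row is a reversal when the same MatterNum and WorkDate also contain the negated amount), a vectorised sketch along these lines avoids calling rev_check once per row. It assumes the same df as in the question; is_reversal is just an illustrative name:

import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [2, '1/4/2022', 12],
    [2, '1/5/2022', -5],
    [2, '1/5/2022', 5]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])

# Within each (MatterNum, WorkDate) group, flag rows whose negated amount
# also appears in that group; isin(-s) does the membership test per group.
is_reversal = df.groupby(['MatterNum', 'WorkDate'])['Amount'].transform(
    lambda s: s.isin(-s))
df['reversal?'] = is_reversal.map({True: 'yes', False: None})

This does one membership test per group rather than one full-frame scan per row, which is usually the difference that matters at millions of rows.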

Conditional merge / join of two large Pandas DataFrames with duplicated keys based on values of multiple columns - Python

I come from R and honestly, this is the simplest thing to do in one line using R data.tables, and the operation is also quite fast for large data.tables. But I'm really struggling to implement it in Python. None of the use cases previously mentioned is suitable for my application. The major issue at hand is the memory usage of the Python solution, as I will explain below.
The problem: I've got two large DataFrames df1 and df2 (each around 50M-100M rows) and I need to merge two (or n) columns of df2 onto df1 based on two conditions:
1) df1.id = df2.id (the usual merge key)
2) df2.value_2A <= df1.value_1 <= df2.value_2B
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
df1
Out[13]:
   id  value_1
0   1        2
1   1        5
2   1        7
3   2        1
4   2        3
5   3        4
df2
Out[14]:
   id  value_2A  value_2B
0   1         0         1
1   1         3         5
2   1         7         9
3   1        12        15
4   2         0         1
5   2         2         4
6   2         3         6
7   3         1         3
desired_output
Out[15]:
   id  value_1  value_2A  value_2B
0   1        2       NaN       NaN
1   1        5       3.0       5.0
2   1        7       7.0       9.0
3   2        1       0.0       1.0
4   2        3       2.0       4.0
5   2        3       3.0       6.0
6   3        4       NaN       NaN
Now, I know this can be done by first merging df1 and df2 the 'left' way and then filtering the data. But that is a horrendous solution in terms of scaling. I've got 50M x 50M rows with multiple duplicates of id. This would create an enormous dataframe which I would then have to filter.
## This is NOT a solution because memory usage is just too large and
## too many operations make it extremely inefficient and slow at large scale
output = pd.merge(df1, df2, on='id', how='left') ## output becomes very large in my case
output.loc[~((output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])), ['value_2A', 'value_2B']] = np.nan
output = output.loc[~ output['value_2A'].isnull()]
output = pd.merge(df1, output, on=['id', 'value_1'], how='left')
This is so inefficient. I'm merging a large dataset twice to get the desired output and creating massive dataframes while doing so. Yuck!
Think of this as two dataframes of events which I'm trying to match together, that is, tagging whether events of df1 occurred within events of df2. There are multiple events for each id in both df1 and df2, and the events of df2 are NOT mutually exclusive. The conditional join really needs to happen at the time of joining, not after.
This is done easily in R:
## in R realm ##
require(data.table)
desired_output <- df2[df1, on=.(id, value_2A <= value_1, value_2B >= value_1)] #fast and easy operation
Is there any way to do this in Python?
Interesting question!
Looks like pandasql might do what you want. Please see:
How to do a conditional join in python Pandas?
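For reference, a minimal sketch of what the pandasql route might look like, assuming pandasql is installed; the range condition goes straight into the SQL join:

from pandasql import sqldf  # assumes the pandasql package is available

query = """
SELECT df1.id, df1.value_1, df2.value_2A, df2.value_2B
FROM df1
LEFT JOIN df2
  ON df1.id = df2.id
 AND df1.value_1 BETWEEN df2.value_2A AND df2.value_2B
"""
# sqldf looks the DataFrames up by name in the given namespace.
desired_output = sqldf(query, locals())

Note that pandasql pushes the data through an in-memory SQLite database, so at 50M+ rows the chunked approach below (or a real database) may still be necessary.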
Yeah. It's an annoying problem. I handled this by splitting the left DataFrame into chunks.
def merge_by_chunks(left, right, condition=None, **kwargs):
    chunk_size = 1000
    merged_chunks = []
    for chunk_start in range(0, len(left), chunk_size):
        print(f"Merged {chunk_start} ", end="\r")
        merged_chunk = pd.merge(left=left[chunk_start: chunk_start + chunk_size], right=right, **kwargs)
        if condition is not None:
            merged_chunk = merged_chunk[condition(merged_chunk)]
        merged_chunks.append(merged_chunk)
    return pd.concat(merged_chunks)
Then you can provide the condition as a function.
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
def condition_func(output):
    return (output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])
output = merge_by_chunks(df1, df2, condition=condition_func, on='id', how='left')
merge_by_chunks(df1, output, on=['id', 'value_1'], how='left')
It can be pretty slow depending on the size of the DataFrame, but it doesn't run out of memory.
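Another option, sketched below under the assumption that each individual id group fits comfortably in memory, is to do the range check with numpy broadcasting per id, so the big many-to-many intermediate never materialises across ids (right_groups, pieces and output are illustrative names):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3],
                    'value_2A': [0,3,7,12,0,2,3,1],
                    'value_2B': [1,5,9,15,1,4,6,3]})

right_groups = dict(tuple(df2.groupby('id')))
pieces = []
for key, left in df1.groupby('id'):
    right = right_groups.get(key)
    if right is None:
        pieces.append(left)                      # id absent from df2: keep rows, NaNs appear on concat
        continue
    v1 = left['value_1'].to_numpy()[:, None]     # shape (n_left, 1)
    hit = (v1 >= right['value_2A'].to_numpy()) & (v1 <= right['value_2B'].to_numpy())
    li, ri = np.nonzero(hit)                     # positions of matching (left, right) pairs
    matched = pd.concat([left.iloc[li].reset_index(drop=True),
                         right.iloc[ri][['value_2A', 'value_2B']].reset_index(drop=True)],
                        axis=1)
    unmatched = left[~np.isin(np.arange(len(left)), li)]   # left rows with no match keep NaNs
    pieces.append(matched)
    pieces.append(unmatched)

output = (pd.concat(pieces, ignore_index=True)
            .sort_values(['id', 'value_1'])
            .reset_index(drop=True))
print(output)

The per-group boolean matrix is only (rows of that id in df1) x (rows of that id in df2), so peak memory is bounded by the largest single id rather than by the full cross join.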

Pandas group by cumsum of lists - Preparation for lstm

Using the same example from here but just changing the 'A' column to be something that can easily be grouped by:
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df["A"] = pd.Series([1]*3+ [2]*8)
df.head()
whose output now is:
         Date  A       B       C      D      E      F      G
0  2008-03-18  1  164.93  114.73  26.27  19.21  28.87  63.44
1  2008-03-19  1  164.89  114.75  26.22  19.07  27.76  59.98
2  2008-03-20  1  164.63  115.04  25.78  19.01  27.04  59.61
3  2008-03-25  2  163.92  114.85  27.41  19.61  27.84  59.41
4  2008-03-26  2  163.45  114.84  26.86  19.53  28.02  60.09
5  2008-03-27  2  163.46  115.40  27.09  19.72  28.25  59.62
6  2008-03-28  2  163.22  115.56  27.13  19.63  28.24  58.65
Doing the cumulative sums (code from the linked question) works well when we're assuming it's a single list:
# Put your inputs into a single list
input_cols = ["B", "C"]
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
# Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors1'] = df["single_input_vector"].cumsum()
But how do I cumsum the lists in this case, grouped by 'A'? I expected this to work, but it doesn't:
df['cumu'] = df.groupby("A")["single_input_vector"].apply(lambda x: list(x)).cumsum()
Instead of [[164.93, 114.73, 26.27], [164.89, 114.75, 26.... I get some rows filled in while others are NaNs. This is what I want (cols [B, C] accumulated into groups of col A):
A  cumu
1  [[164.93, 114.73], [164.89, 114.75], [164.63, 115.04]]
2  [[163.92, 114.85], [163.45, 114.84], [163.46, 115.40], [163.22, 115.56]]
Also, how do I do this in an efficient manner? My dataset is quite big (about 2 million rows).
It doesn't look like you're doing an arithmetic sum; it's more like a concat along axis=1.
First, groupby and concat:
temp_series = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])])
0 [[164.93, 114.73], [164.89, 114.75], [164.63, ...
1 [[163.92, 114.85], [163.45, 114.84], [163.46, ...
Then convert back to a dataframe:
df = temp_series.reset_index().rename(columns={0: 'cumsum'})
In one line:
df = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])]).reset_index().rename(columns={0: 'cumsum'})
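If what you actually need is a running (cumulative) list on every row within each 'A' group, rather than one list per group, a plain-Python sketch like the following sidesteps object-dtype cumsum entirely (pair and cumu are helper names introduced here; A, B and C are the question's columns):

import pandas as pd

# One [B, C] pair per row, built with the same tuple -> list idiom as the question.
df['pair'] = df[['B', 'C']].apply(tuple, axis=1).apply(list)

cumu_by_row = {}
for _, group in df.groupby('A'):
    running = []
    for idx, pair in zip(group.index, group['pair']):
        running = running + [pair]   # new list each step so earlier rows keep their own snapshot
        cumu_by_row[idx] = running

# Align back to the original rows via the index.
df['cumu'] = pd.Series(cumu_by_row)

Bear in mind that cumulative lists grow quadratically in memory per group, so at ~2 million rows it may be better to keep only the per-group list (as in the answer above) and slice windows from it when feeding the LSTM.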

Python: How to replace only 0 values in a column by multiplication of 2 columns in Dataframe with a loop?

Here is my Dataframe:
df={'pack':[2,2,2,2], 'a_cost':[10.5,0,11,0], 'b_cost':[0,6,0,6.5]}
It should look like this:
   pack  a_cost  b_cost
0     2    10.5     0.0
1     2     0.0     6.0
2     2    11.0     0.0
3     2     0.0     6.5
At this point you will find that the a_cost and b_cost columns have 0s where the other column has a value. I would like my function to follow this logic...
for i in df.a_cost:
    if i == 0:
        multiply the b_cost (column) value by the pack (column) value
        and replace the 0 with this new multiplied value (example: 6.0 * 2 = 12)

for i in df.b_cost:
    if i == 0:
        divide the a_cost (column) value by the pack (column) value
        and replace the 0 with this new divided value (example: 10.5 / 2 = 5.25)
I can't figure out how to write this logic successfully... Here is the expected output:
Output in code:
df={'pack':[2,2,2,2], 'a_cost':[10.5,12.0,11,13.0], 'b_cost':[5.25,6,5.50,6.5]}
Help is really appreciated!
IIUC,
df.loc[df.a_cost.eq(0), 'a_cost'] = df.b_cost * df.pack
df.loc[df.b_cost.eq(0), 'b_cost'] = df.a_cost / df.pack
You can also play with mask and fillna:
df['a_cost'] = df.a_cost.mask(df.a_cost.eq(0)).fillna(df.b_cost * df.pack)
df['b_cost'] = df.b_cost.mask(df.b_cost.eq(0)).fillna(df.a_cost / df.pack)
Update: as commented, you can use the other argument in mask:
df['a_cost'] = df.a_cost.mask(df.a_cost.eq(0), other=df.b_cost * df.pack)
Also note that the second filtering is not needed once you have already filled the 0s in column a_cost. That is, we can just do:
df['b_cost'] = df.a_cost / df.pack
after the first command in both methods.
Output:
   pack  a_cost  b_cost
0     2    10.5    5.25
1     2    12.0    6.00
2     2    11.0    5.50
3     2    13.0    6.50
import numpy as np
import pandas as pd

df = pd.DataFrame({'pack':[2,2,2,2], 'a_cost':[10.5,0,11,0], 'b_cost':[0,6,0,6.5]})
df['a_cost'] = np.where(df['a_cost']==0, df['pack']*df['b_cost'], df['a_cost'])
df['b_cost'] = np.where(df['b_cost']==0, df['a_cost']/df['pack'], df['b_cost'])
print(df)
#   pack  a_cost  b_cost
#0     2    10.5    5.25
#1     2    12.0    6.00
#2     2    11.0    5.50
#3     2    13.0    6.50
Try this:
df['a_pack'] = df.apply(lambda x: x['b_cost']*x['pack'] if x['a_cost'] == 0 and x['b_cost'] != 0 else x['a_cost'], axis = 1)
df['b_pack'] = df.apply(lambda x: x['a_cost']/x['pack'] if x['b_cost'] == 0 and x['a_cost'] != 0 else x['b_cost'], axis = 1)

Pandas Multi-Colum Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any indexes (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to do the selection using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully you can see that this would be a really useful 'template' for operating on large DataFrames, with the boolean indexing defined in boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
P.S. If you would like to test this out, you will need to populate some of df's columns with text. The definition of df using random numbers was simply given as a starter, if required, for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
   Col001  Col002  Col003  Col004  Col005
0       2       0       2       3       1
1       0       1       0       1       3
2       0       1       1       0       3
3       3       1       0       2       1
4       1       2       3       1       0
Compared to your example, all columns are filled with ints of 0, 1, 2 or 3.
Let's define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns from the rows where some other columns all contain the value 2.
You can then get the result with:
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
    Col002  Col012  Col025  Col032  Col033
43       2       2       1       3       3
98       2       1       1       1       2
Let's check the filter columns for those rows:
df[filt.keys()].iloc[[43,98]]
    Col005  Col001  Col002
43       2       2       2
98       2       2       2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
    Col005  Col001  Col002
44       3       0       3
99       1       0       0
I'm starting to like Pandas more and more.
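As a footnote (not part of the original answer, just a sketch of an equivalent idiom): the same dict-driven mask can be built without a row-wise apply by comparing the filter columns against a Series made from the dict, which pandas aligns on column names:

import numpy as np
import pandas as pd

# Toy frame in the spirit of the answer above: small ints so matches exist.
df = pd.DataFrame(np.random.randint(0, 4, (100, 50)),
                  columns=["Col%03d" % (i + 1) for i in range(50)])

req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}

# Each filter column is compared to its dict value; a row qualifies when all match.
mask = (df[list(filt)] == pd.Series(filt)).all(axis=1)
result = df.loc[mask, req]
print(result)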
