I am writing a Python function that performs a left anti join on two DataFrames, and the join condition may vary: sometimes the two DataFrames have a single column as the unique join key, and sometimes they have more than one column to join on.
So I have written the code below. Please suggest what changes I should make.
def integraty_check(testdata, refdata, cond = []):
    df = func.join_dataframe(testdata, refdata, cond, "leftanti", logger)
    df = df.select(cond)
    func.write_df_as_parquet_file(df, curate_path, logger)
    return df
Here the parameter cond may contain one or more column names, comma separated.
So how do I pass a dynamic list of column names when I call the function?
Please suggest the best way to achieve this.
You can use Python's unpacking operator (PEP 448):
df = df.select(*cond)
You can find more examples on how to use the asterisk operator:
Packing and Unpacking Arguments in Python
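To see what the asterisk does without needing a Spark session, here is a minimal sketch in plain Python. The `select` function below is only a hypothetical stand-in for `DataFrame.select`, and `integrity_columns` is an illustrative name, not part of the original code:

```python
def select(*cols):
    """Hypothetical stand-in for DataFrame.select: returns the names it received."""
    return list(cols)


def integrity_columns(cond):
    # Accept either a single column name or a list/tuple of names.
    if isinstance(cond, str):
        cond = [cond]
    # The asterisk turns each element of cond into its own positional argument.
    return select(*cond)


one_key = integrity_columns("id")
two_keys = integrity_columns(["id", "country"])
print(one_key)
print(two_keys)
```

The caller then passes either `"id"` or `["id", "country"]`, and the same function body handles both cases.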
I have a data frame that looks like the following:
Category            Class
==========================
['org1', 'org2']    A
['org2', 'org3']    B
org1                C
['org3', 'org4']    A
org2                A
['org2', 'org4']    B
...
When I read in this data using Pandas, the lists are read in as strings (e.g., dat['Category'][0][0] returns [ rather than org1). I have several columns like this. I want every categorical column that already contains at least one list to hold a list in every record. For example, the above data frame should look like the following:
Category            Class
==========================
['org1', 'org2']    A
['org2', 'org3']    B
['org1']            C
['org3', 'org4']    A
['org2']            A
['org2', 'org4']    B
...
Notice how the singular values in the Category column are now contained in lists. When I reference dat['Category'][0][0], I'd like org1 to be returned.
What is the best way to accomplish this? I was thinking of using ast.literal_eval with an apply and lambda function, but I'd like to try and use best-practices if possible. Thanks in advance!
You could create a boolean mask of the values that need to be changed. If there are no lists, no change is needed. If there are lists, you can apply literal_eval or a list-creating lambda to the matching subsets of the data.
import ast
import pandas as pd

def normalize_category(df):
    is_list = df['Category'].str.startswith('[')
    if is_list.any():
        df.loc[is_list,'Category'] = df.loc[is_list, 'Category'].apply(ast.literal_eval)
        df.loc[~is_list,'Category'] = df.loc[~is_list, 'Category'].apply(lambda val: [val])

df = pd.DataFrame({"Category":["['org1', 'org2']", "org1"], "Class":["A", "B"]})
normalize_category(df)
print(df)

df = pd.DataFrame({"Category":["org2", "org1"], "Class":["A", "B"]})
normalize_category(df)
print(df)
You can do it like this:
df['Category'] = df['Category'].apply(lambda x: ast.literal_eval(x) if x.startswith('[') else [x])
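If a column might also hold values that are already lists, a named helper can be easier to read and test than a lambda. This is only a sketch: `to_list` is a made-up name, and it assumes that any string starting with `[` is a valid Python list literal:

```python
import ast


def to_list(value):
    """Return value as a list: parse "['a', 'b']"-style strings, wrap anything else."""
    if isinstance(value, list):
        return value  # already a list, nothing to do
    if isinstance(value, str) and value.strip().startswith("["):
        # Assumption: the string is a well-formed list literal.
        return ast.literal_eval(value.strip())
    return [value]


raw = ["['org1', 'org2']", "org1", ["org3"]]
normalized = [to_list(v) for v in raw]
print(normalized)
```

With pandas, the same helper could be passed to apply, e.g. `df['Category'].apply(to_list)`.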
I have a function that prints the sum of a pandas DataFrame column after filtering on some rows (the filter is supplied by the caller), together with the share this quantity makes up of the same sum computed without any filter:
import numpy as np

def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum/np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 for which df[filter_f1] equals df, and which could be combined with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object, i.e. slice(None), selects every index of an indexable object, so df[slice(None)] selects all rows in the DataFrame. (slice(-1) would not do this: it drops the last row.) You can store it in a variable as an initial value, which your logic can later replace with something more restrictive:
filter_to_apply = slice(None) # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
This is a way to select all rows:
df.iloc[range(0, len(df))]
So is this:
df[:]
But I haven't figured out a way to pass : as an argument.
There's an indexer called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
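A boolean mask that is True for every row may be the closest thing to a "no-op" filter, because unlike a slice it can still be combined with other masks through &. A minimal sketch, assuming a small made-up DataFrame; `filtered_share` and `no_filter` are illustrative names, and the function returns the two numbers instead of printing them:

```python
import pandas as pd


def filtered_share(df, filter_to_apply, col):
    """Return the filtered sum of col and its share of the unfiltered sum."""
    my_sum = df.loc[filter_to_apply, col].sum()
    return my_sum, my_sum / df[col].sum()


# Made-up example data.
df = pd.DataFrame({"price": [100, 600, 900], "name": ["Ann", "Brian", "Cy"]})

# True for every row: keeps everything, and still supports & and |.
no_filter = pd.Series(True, index=df.index)

all_rows = filtered_share(df, no_filter, "price")
expensive = filtered_share(df, no_filter & (df["price"] > 500), "price")
print(all_rows)
print(expensive)
```

Building the mask from `df.index` keeps it aligned with the DataFrame even when the index is not the default 0..n-1 range.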
My dataframe contains almost 800k rows, so iterating over it with standard methods is impractically slow. I searched a little and found a method called iterrows(), but I couldn't understand how to use it. Basically this is my code; can you help me update it to use iterrows()?
for i in range(len(x["Value"])):
    if x.loc[i ,"PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'] :
        x.loc[i,"Santral_Type"] = "HES"
    elif x.loc[i ,"PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
        x.loc[i,"Santral_Type"] = "TERMIK"
    elif x.loc[i ,"PP_Name"] in ['BRS','ÇKL','DPZ']:
        x.loc[i,"Santral_Type"] = "RES"
    else:
        x.loc[i,"Santral_Type"] = "SOLAR"
How to iterate over very big dataframes: in general, you don't. You should apply some sort of vectorized operation to the column as a whole. For example, your case can be handled with map and fillna:
map_dict = {
    'HES' : ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
    'TERMIK' : ['BND','BND2','TFB','TFB3','TFB4','KNT'],
    'RES' : ['BRS','ÇKL','DPZ']
}
inv_map_dict = {x:k for k,v in map_dict.items() for x in v}
df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
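The dictionary inversion is the step that does the real work here, and it can be checked on its own in plain Python. In this sketch `santral_type` is a hypothetical helper that mimics what map followed by fillna('SOLAR') does for a single value:

```python
map_dict = {
    'HES': ['ARK', 'DGD', 'KND', 'SRG', 'HCO', 'MNG', 'KSK', 'KOP', 'KVB',
            'Yamanli', 'ÇBS', 'Dogancay'],
    'TERMIK': ['BND', 'BND2', 'TFB', 'TFB3', 'TFB4', 'KNT'],
    'RES': ['BRS', 'ÇKL', 'DPZ'],
}

# Invert {type: [names]} into {name: type} so each lookup is a single dict access.
inv_map_dict = {name: kind for kind, names in map_dict.items() for name in names}


def santral_type(pp_name):
    # Names missing from the mapping fall back to 'SOLAR', like fillna('SOLAR').
    return inv_map_dict.get(pp_name, 'SOLAR')


print(santral_type('KNT'), santral_type('ÇKL'), santral_type('XYZ'))
```

One dictionary lookup per row replaces the chain of list membership tests, which is why this approach scales well.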
It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:
# Default value
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"
Note that 800k can not be considered a large table when using standard pandas methods.
I would strongly advise against using iterrows and for loops when vectorised solutions that take advantage of the pandas API are available.
This is your code adapted to use NumPy, which should run much faster than your current method:
import numpy as np

col = 'PP_Name'
conditions = [
    x[col].isin(
        ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
    ),
    x[col].isin(["BND", "BND2", "TFB", "TFB3", "TFB4", "KNT"]),
    x[col].isin(["BRS", "ÇKL", "DPZ"]),
]
outcomes = ["HES", "TERMIK", "RES"]
x["Santral_Type"] = np.select(conditions, outcomes, default='SOLAR')
According to the documentation, df.iterrows() yields (index, Series) tuples.
You can use it like this:
for idx, row in df.iterrows():
    if row['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        # assign to this row only, not to the whole column
        df.loc[idx, 'Santral_Type'] = "HES"
    # and so on
By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
Better to do it as @mcsoini suggested.
The simplest method could be .values, for example:
def f(*row_values):
    return 'hello or some complicated operation'

df['newColumn'] = [f(*r) for r in df.values]
The drawbacks of this method, as far as I know, are that you cannot refer to the column values by name, only by position, and that you get no information about the index of the df.
The advantage is that it is typically faster than the iterrows, itertuples and apply methods.
Hope it helps.
I am creating a function. One of its inputs will be a pandas DataFrame, and one of its tasks is to perform some operation on two variables of this DataFrame. These two variables are not fixed, and I want the freedom to choose them through parameters of the function fun.
For example, suppose that at some moment the variables I want to use are 'var1' and 'var2' (at another time I may want to use two other variables). Suppose these variables take the values 1, 2, 3, 4 and I want to reduce df to the rows where var1 == 1 and var2 == 1. My function looks like this:
def fun(df , var = ['input_var1', 'input_var2'] , val):
    df = df.rename(columns={ var[1] : 'aux_var1 ', var[2]:'aux_var2'})
    # Other operations
    df = df.loc[(df.aux_var1 == val ) & (df.aux_var2 == val )]
    # end of operations
    # recover
    df = df.rename(columns={ 'aux_var1': var[1] ,'aux_var2': var[2]})
    return df
When I use the function fun, I have the error
fun(df, var = ['var1','var2'], val = 1)
IndexError: list index out of range
Actually, I want to do other more complex operations and I didn't describe these operations so as not to extend the question. Perhaps the simple example above has a solution that does not need to rename the variables. But maybe this solution doesn't work with the operations I really want to do. So first, I would necessarily like to correct the error when renaming the variables. If you want to give another more elegant solution that doesn't need renaming, I appreciate that too, but I will be very grateful if besides the elegant solution, you offer me the solution about renaming.
Python lists are zero-indexed, i.e. the index of the first element is 0.
Just change the lines:
df = df.rename(columns={ var[1] : 'aux_var1 ', var[2]:'aux_var2'})
df = df.rename(columns={ 'aux_var1': var[1] ,'aux_var2': var[2]})
to
df = df.rename(columns={ var[0] : 'aux_var1', var[1]:'aux_var2'})
df = df.rename(columns={ 'aux_var1': var[0] ,'aux_var2': var[1]})
respectively
In this case you are accessing var[2] but a 2-element list in Python has elements 0 and 1. Element 2 does not exist and therefore accessing it is out of range.
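The same behaviour can be reproduced in a few lines of plain Python. This is only an illustration; the list contents mirror the call in the question:

```python
var = ["var1", "var2"]

# Valid indexes for a 2-element list are 0 and 1.
first, second = var[0], var[1]

try:
    var[2]  # there is no third element
except IndexError as exc:
    message = str(exc)

print(first, second, message)
```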
As it has been mentioned in other answers, the error you are receiving is due to the 0-indexing of Python lists, i.e. if you wish to access the first element of the list var, you do that by taking the 0 index instead of 1 index: var[0].
However, on the topic of renaming: you can filter a pandas DataFrame without renaming any columns. I can see that you access each column as an attribute of the DataFrame, but you can achieve the same through the __getitem__ method, which is more commonly written with square brackets, e.g. df[var[0]].
If you wish to have more generality over your function without any renaming happening, I can suggest this:
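from functools import reduce
import pandas as pd

def fun(df, var, val):
    # Start from an all-True mask aligned with df's index,
    # then AND in one equality condition per column.
    _sub = reduce(
        lambda x, y: x & (df[y] == val),
        var,
        pd.Series(True, index=df.index)
    )
    return df[_sub]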
This will work with any number of input column variables. Hope this will serve as an inspiration to your more complicated operations you intend to do.
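For the simple case where every selected column is compared with the same value, a shorter variant is possible. This is a sketch under that assumption, with `filter_equal` as an illustrative name and made-up data that deliberately uses a non-default index:

```python
import pandas as pd


def filter_equal(df, columns, value):
    """Keep the rows where every column in columns equals value."""
    # eq() compares element-wise; all(axis=1) requires a match in every selected column.
    mask = df[list(columns)].eq(value).all(axis=1)
    return df[mask]


df = pd.DataFrame(
    {"var1": [1, 1, 2, 1], "var2": [1, 2, 1, 1], "other": [10, 20, 30, 40]},
    index=list("abcd"),
)

kept = filter_equal(df, ["var1", "var2"], 1)
print(kept)
```

Column names are passed as ordinary strings, so no renaming is needed and the original column names survive in the result.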