I am working with a large database of deposits (~600,000 rows), and my task is to slot each deposit into a tenor 'bucket', i.e. a range with lower and upper limits in days (e.g. 0-30 days, 31-60 days, etc.). The simplified raw data is as follows ('LCY_CURR_BALANCE' is the value of the deposit, 'RM' is the tenor):
As the NaN values in the 'RM' column signify non-maturity deposits, I fill them with 0, then change the column type from float to integer, using this code:
MIS035A['KHCL']=MIS035A['KHCL'].fillna(0)#Replace NA with 0
MIS035A['KHCL']=MIS035A['KHCL'].astype(int)
The result is as follows:
However, when I start to sum the 'LCY_CURR_BALANCE' column based on a condition on 'RM', the following problem occurs: if the condition is ==0, the process takes ridiculously long to complete (about 3 hours). Any other condition takes less than 30 seconds. The code I use for the conditional summation is as follows:
sumif_0=MIS035A[(MIS035A["KHCL"]==0)].sum()["ACY_CURR_BALANCE"]#condition ==0
sumif_1=MIS035A[(MIS035A["KHCL"]==1)].sum()["ACY_CURR_BALANCE"]#condition ==any other number
I would truly appreciate it if someone could explain why this happens, or help me solve it. I suspect it may be caused by my filling NaN with 0. However, I have not found any further explanation of the issue on the internet.
Thank you very much!
How I would tackle the problem
data_0 = MIS035A[MIS035A["KHCL"]==0]
data_1 = MIS035A[MIS035A["KHCL"]==1]
sum_0 = data_0["LCY_CURR_BALANCE"].sum()
sum_1 = data_1["LCY_CURR_BALANCE"].sum()
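Since you have a huge dataset, it is better to subset it first: selecting the balance column before calling .sum() means pandas sums only that one column instead of every column in the filtered frame.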
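A minimal sketch of the same idea, extended to the tenor buckets described in the question. The sample frame, the CUSTOMER column, and the bucket edges and labels are assumptions made up for illustration; only KHCL and LCY_CURR_BALANCE come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for MIS035A; the values are invented for illustration.
deposits = pd.DataFrame({
    "LCY_CURR_BALANCE": [100.0, 250.0, 75.0, 300.0, 50.0],
    "KHCL": [np.nan, 15.0, 45.0, np.nan, 31.0],
    "CUSTOMER": ["A", "B", "C", "D", "E"],  # text column that a frame-wide .sum() would also process
})
deposits["KHCL"] = deposits["KHCL"].fillna(0).astype(int)

# Select the balance column together with the row mask, so only one numeric column is summed.
sum_0 = deposits.loc[deposits["KHCL"] == 0, "LCY_CURR_BALANCE"].sum()

# Assumed bucket edges; pd.cut intervals are right-inclusive: (-1, 0], (0, 30], (30, 60].
bins = [-1, 0, 30, 60]
labels = ["non-maturity", "1-30 days", "31-60 days"]
buckets = pd.cut(deposits["KHCL"], bins=bins, labels=labels)

# One pass over the data gives the total balance for every bucket.
bucket_sums = deposits.groupby(buckets, observed=False)["LCY_CURR_BALANCE"].sum()
print(bucket_sums)
```

Grouping once by bucket avoids writing one filtered sum per condition, which may matter more as the number of buckets grows.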
Related
I am currently working on a course in Data Science on how to win data science competitions. The final project is a Kaggle competition that we have to participate in.
My training dataset has close to 3 million rows, and one of the columns is a "date of purchase" column.
I want to calculate the distance of each date to the nearest public holiday.
E.g. if the date is 31/12/2014, the nearest PH would be 01/01/2015. The number of days apart would be "1".
I cannot think of an efficient way to do this operation. I have a list with a number of Timestamps, each one is a public holiday in Russia (the dataset is from Russia).
def dateDifference(target_date_raw):
    abs_deltas_from_target_date = np.subtract(russian_public_holidays, target_date_raw)
    abs_deltas_from_target_date = [i.days for i in abs_deltas_from_target_date if i.days >= 0]
    index_of_min_delta_from_target_date = np.min(abs_deltas_from_target_date)
    return index_of_min_delta_from_target_date
where 'russian_public_holidays' is the list of public holiday dates and 'target_date_raw' is the date for which I want to calculate distance to the nearest public holiday.
This is the code I use to create a new column in my DataFrame for the difference of dates.
training_data['closest_public_holiday'] = [dateDifference(i) for i in training_data['date']]
This code ran for nearly 25 minutes and showed no signs of completing, which is why I turn to you guys for help.
I understand that this is probably the least Pandorable way of doing things, but I couldn't really find a clean way of operating on a single column during my research. I saw a lot of people say that using the "apply" function on a single column is a bad way of doing things. I am very new to working with such large datasets, which is why clean and efficient practices seem to elude me for now. Please do let me know what would be the best way to tackle this!
Try this and see if it helps with the timing. I worry that it will take up too much memory. I don't have the data to test, but you can try.
import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range('01/01/2021','12/31/2021',freq='M'),columns=['Date'])
# Assuming holidays: 1/1/2021, 8/9/2021, 12/25/2021
holidays = pd.to_datetime(np.array(['1/1/2021','12/25/2021','8/9/2021'])).to_numpy()
# Broadcast every date against every holiday, then keep the smallest absolute gap in days
df['Days Away'] = (
    np.min(np.absolute(df.Date.to_numpy()
                       .reshape(-1,1) - holidays),axis=1) /
    np.timedelta64(1, 'D')
)
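If the rows-by-holidays matrix built by the broadcast turns out to be too large, one possible alternative is to sort the holidays and look up only the neighbouring holiday on each side of every date with np.searchsorted. This is a sketch, not something tested on the original data; the function name days_to_nearest and the sample dates are made up for illustration.

```python
import numpy as np
import pandas as pd

def days_to_nearest(dates: pd.Series, holidays) -> np.ndarray:
    """Absolute number of days from each date to its closest holiday."""
    hol = np.sort(pd.to_datetime(holidays).to_numpy().astype("datetime64[D]"))
    d = dates.to_numpy().astype("datetime64[D]")
    # Index of the first holiday that falls on or after each date.
    idx = np.searchsorted(hol, d)
    # Clip so dates before the first or after the last holiday still get a valid neighbour.
    nxt = hol[np.clip(idx, 0, len(hol) - 1)]
    prv = hol[np.clip(idx - 1, 0, len(hol) - 1)]
    nearest = np.minimum(np.abs(nxt - d), np.abs(d - prv))
    return nearest.astype("int64")

# Illustrative data only.
sample = pd.DataFrame({"date": pd.to_datetime(["2014-12-31", "2015-01-01", "2015-03-01"])})
sample_holidays = ["2015-01-01", "2015-02-23", "2015-03-08"]
sample["closest_public_holiday"] = days_to_nearest(sample["date"], sample_holidays)
print(sample)
```

Memory use here grows with the number of dates rather than with dates times holidays, at the cost of slightly more code.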
I am new to ML and Data Science (recently graduated from Master's in Business Analytics) and learning as much as I can by myself now while looking for positions in Data Science / Business Analytics.
I am working on a practice dataset with the goal of predicting which customers are likely to miss their scheduled appointment. One of the columns in my dataset is "Neighbourhood", which contains the names of over 30 different neighbourhoods. My dataset has 10,000 observations, and some neighbourhood names appear fewer than 50 times. I think that neighbourhoods that appear fewer than 50 times in the dataset are too rare to be analyzed properly by machine learning models. Therefore, I want to remove the rows whose "Neighbourhood" value appears in that column fewer than 50 times.
I have been trying to write code for this for several hours, but I am struggling to get it right. So far, I have gotten to the version below:
my_df = my_df.drop(my_df["Neighbourhood"].value_counts() < 50, axis = 0)
I have also tried other versions of code to get rid of the rows in that categorical column, but I keep getting a similar error:
KeyError: '[False False ... True True] not found in axis'
I appreciate your help in advance, and thank you for sharing your knowledge and insights with me!
Try the code below. It uses the .loc operator to select rows on the basis of a condition (here, rows whose neighbourhood has a high enough count):
counts = my_df['Neighbourhood'].value_counts()
new_df = my_df.loc[my_df['Neighbourhood'].isin(counts.index[counts >= 50])]
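The KeyError in the question arises because drop expects index labels, not a boolean mask, so the True/False values are looked up as if they were row labels. As a sketch of another way to build a row-aligned mask, the per-row group size can be computed with groupby and transform. The sample data, the NoShow column, and the lowered threshold are made up for illustration.

```python
import pandas as pd

# Illustrative frame; the real data has about 10,000 rows.
my_df = pd.DataFrame({
    "Neighbourhood": ["Centro", "Centro", "Centro", "Praia", "Praia", "Ilha"],
    "NoShow": [0, 1, 0, 0, 1, 1],
})
min_count = 2  # the question uses 50; lowered here so the toy data shows the effect

# transform("size") returns, for every row, how many rows share its neighbourhood.
group_sizes = my_df.groupby("Neighbourhood")["Neighbourhood"].transform("size")

# Boolean indexing keeps only rows from neighbourhoods that are frequent enough.
filtered = my_df[group_sizes >= min_count]
print(filtered)
```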
I calculate the gap, in quarters, between two dates. Now I want to test whether that gap is bigger than 2.
Thank you for your comments!
I'm actually running code from WRDS (Wharton Research Data Services). Below, fst_vint is a DataFrame with two date variables, rdate and lag_rdate. The first statement seems to convert them to quarterly periods (e.g., 9/8/2019 to 2019Q3) and then take the difference between them, storing it in a new column qtr.
fst_vint.qtr >= 2 creates a problem, because the former holds QuarterEnd objects, while the latter is an integer. How do I deal with this problem?
fst_vint['qtr'] = (fst_vint['rdate'].dt.to_period('Q')-\
fst_vint['lag_rdate'].dt.to_period('Q'))
# label first_report flag
fst_vint['first_report'] = ((fst_vint.qtr.isnull()) | (fst_vint.qtr>=2))
Using .diff() after converting the column to integer with .astype(int) gives the desired answer. So the code in your case would be:
fst_vint['qtr'] = fst_vint['rdate'].astype(int).diff()
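An alternative sketch that sidesteps Period subtraction altogether: map each date to a running quarter number (year * 4 + quarter) and subtract those, which yields plain numbers that compare directly with 2. The sample rows and the helper quarter_number are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Illustrative rows; the first has no lagged report date.
fst_vint = pd.DataFrame({
    "rdate": pd.to_datetime(["2019-09-30", "2019-12-31", "2020-03-31"]),
    "lag_rdate": pd.to_datetime([None, "2019-09-30", "2019-06-30"]),
})

def quarter_number(dates: pd.Series) -> pd.Series:
    """Running count of calendar quarters, so differences are whole quarters."""
    return dates.dt.year * 4 + dates.dt.quarter

# Numeric gap in quarters; NaN where lag_rdate is missing.
fst_vint["qtr"] = quarter_number(fst_vint["rdate"]) - quarter_number(fst_vint["lag_rdate"])

# label first_report flag
fst_vint["first_report"] = fst_vint["qtr"].isnull() | (fst_vint["qtr"] >= 2)
print(fst_vint)
```

This assumes calendar quarters, which matches the default behaviour of to_period('Q').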
I've been trying without success to find a way to create an "average_gain_up" in Python and have gotten a bit stuck. Being new to groupby, there is something about how it treats functions that I've not managed to grasp, so any intuition on how to think through these types of problems would be helpful.
Problem:
Create a rolling 14 day sum, only summing if the value is >0 .
import pandas as pd

new = pd.DataFrame([[1,-2,3,-2,4,5],['a','a','a','b','b','b']])
new = new.T  # transposing into a friendly groupby format
# Group by a or b, filter to only have positive values and then sum rolling; we
# keep NAs to ensure the sum is run over 14 values.
groupby=new.groupby(1)[0].filter(lambda x: x>0,dropna=False).rolling(14).sum()
Intended Sum Frame:
x.all()/len(x) result:
This throws a type error: "the filter must return a boolean result".
From reading other answers, I understand that this is because I'm asking whether a whole series/frame is greater than 0.
The above code works with len(x), which again makes sense in that context.
I tried with all() as well, but it doesn't behave as intended: the .all() function returns a single boolean per group, and the sum is then just a simple rolling sum.
I've tried creating a list of booleans to say which values are positive and which are not, but that also yields an error; this time I'm not sure why.
groupby1=new.groupby(1)[0]
groupby2=[y>0 for x in groupby1 for y in x[1] ]
groupby_try=new.groupby(1)[0].filter(lambda x:groupby2,dropna=False).rolling(2).sum()
1) How do I make the above code work, and what is wrong in how I am thinking about it?
2) Is this the "best practice" way to do these types of operations?
Any help appreciated; let me know if I've missed anything or if any further clarification is needed.
According to the documentation on filter after a groupby, it is not meant to filter values within a group, but to keep or drop whole groups that fail some criterion. In the first example given there, a group is kept only if the sum of all its elements is above 2.
One way would be to first replace all the negative values in new[0] with 0, using np.clip for example, and then group, roll and sum, such as:
print (np.clip(new[0],0,np.inf).groupby(new[1]).rolling(2).sum())
1
a  0    NaN
   1    1.0
   2    3.0
b  3    NaN
   4    4.0
   5    9.0
Name: 0, dtype: float64
This approach avoids modifying the data in new. If you don't mind modifying it, you can overwrite column 0 with new[0] = np.clip(new[0],0,np.inf) and then run new.groupby(1)[0].rolling(2).sum(), which gives the same result.
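The same idea can be sketched with named, numeric columns and Series.clip, then aligned back onto the original rows. The column names value, group and rolling_gain are made up for illustration, and the window of 2 mirrors the toy example above rather than the 14 days in the question.

```python
import pandas as pd

# Building the frame from a dict keeps "value" numeric instead of object dtype.
new = pd.DataFrame({
    "value": [1, -2, 3, -2, 4, 5],
    "group": ["a", "a", "a", "b", "b", "b"],
})

# Negative values become 0, so they add nothing to the rolling sum.
positive_only = new["value"].clip(lower=0)

# Rolling sum within each group; the result is indexed by (group, original row).
rolling_gain = positive_only.groupby(new["group"]).rolling(2).sum()

# Drop the group level so the result aligns with the original row index.
new["rolling_gain"] = rolling_gain.reset_index(level=0, drop=True)
print(new)
```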
I have been reading documentation for a few hours now and I feel I am approaching the problem with the wrong mindset.
I have two tables in HIVE which I read with (spark.table(table_A)), with the same number and types of columns but with different origins, so their data is different. Both tables hold flags that show whether or not a condition is met. There are at least around 20 columns, and that number could increase in the future.
If the first row of table_A is 0 0 1 1 0 and the first row of table_B is 0 1 0 1 0, I would like the result to be the position-wise XNOR: 1 0 0 1 1, since the two rows have the same values in the first, fourth and fifth positions.
So I thought of the XNOR operation, which returns a 1 if both values match and a 0 otherwise.
I am facing a number of problems, one of them is the volume of my data (right now I am working with a sample of 1 week and it's already at the 300MB mark), so I am working with pyspark and avoiding pandas since it usually does not fit in memory and/or lags the operation a lot.
Summing up, I have two objects of type pyspark.sql.dataframe.DataFrame, each has one of the tables, and so far the best I've got is something like this:
df_bitwise = df_flags_A.flag_column_A.bitwiseXOR(df_flags_B.flag_columns_B)
But sadly this returns a pyspark.sql.column.Column, and I do not know how to read that result or how to build a dataframe from it (I would like the end result to be something like 20 times the above operation, one for each column, each forming a column of a dataframe).
What am I doing wrong? I feel like this is not the right approach.