I'm working on a neural network project in which one feature is the time elapsed between a user's last activity and certain specific dates. For example, imagine there is a list of dates (March 15th, April 1st, April 24th, etc.) and we want to find, for each user and each of those dates, the interval between the user's last activity before that date and the date itself. To be more clear, if user1 has actions on March 10th, March 13th and March 24th, the value for March 15th would be 2 days (from March 13th). Now what if the user has no actions before March 15th?
Due to some of my algorithms, I am joining temp tables, which results in lots of NaN values. How can I tell the network that these cells should not be considered?
Edit 1:
The code that fills these cells is:
for action_time in all_action_times:
    # each user's last action strictly before this action time
    interval_tmp = actions_df.loc[actions_df['when'] < action_time].drop_duplicates(subset='device_id', keep='last')
    # interval between that last action and the action time
    interval_tmp['action_' + str(action_time)] = interval_tmp['when'].apply(lambda x: action_time - x)
    del interval_tmp['when']
    # outer merge keeps users with no prior action, producing NaN
    interval = interval.merge(interval_tmp, on='device_id', how='outer')
    previous_action_time = action_time
and the result is something like this:
Thanks.
If you have a large dataset, you could drop any rows that have NaN values.
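A minimal sketch of that approach, assuming the merged table is called interval (the column names and values here are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged table; column names and values are invented
interval = pd.DataFrame({
    'device_id': [1, 2, 3],
    'action_2021-03-15': [2.0, np.nan, 5.0],  # days since last action; NaN = no prior action
})

# Keep only fully observed rows before feeding the network
clean = interval.dropna()
```

An alternative that keeps every row is to impute the NaNs (e.g. with a sentinel value) and add a binary "was missing" indicator column, so the network can learn to discount those cells.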
I'm planning to predict 2023 working hours and days for my employees. For each employee I have DaysCount (the number of days they worked) and HoursWorked (the hours they spent at work). The question is: is it possible? Can I predict next year's stats based on these two values?
Any ideas?
Yes, you can, but it may well be an imprecise solution. Still, I'll try to help you find one.
Number of attributes: first of all, a machine learning model needs good features about each person's history to make predictions about their future; using only two attributes may not be sufficient for accurate predictions.
Size of training data: another important aspect is the training size. For example, do you have a long history for all employees — say, the individual history of each employee over the last decade?
Modelling: an important aspect you should think about is the modelling itself. How do you want to train the model? For example, will you use the employees' January history to predict February? Or will you use the employees' 2021 history to predict 2022, and then use that model to predict 2023? Do you have any other features to feed the model? What explains the number of days and hours worked? For example, hours worked could be explained by the day of the week: if you work in a restaurant, you may work more on weekends, so it's important for the model to know the weekday. The days count could be affected by vacations/holidays, so including that information in the data could be very important. After that, you must consider how you will split the data into training and test sets. Will you use the history of the last seven days to predict the next day? Or the last month's history to predict the next month?
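To make the modelling point concrete, here is a toy sketch (all numbers invented) that fits a linear trend to one employee's monthly hours and extrapolates one month ahead; a real solution would add the calendar features discussed above:

```python
import numpy as np

# Invented monthly HoursWorked for one employee over 2022
months = np.arange(1, 13)
hours = np.array([160, 158, 165, 162, 170, 168,
                  150, 155, 166, 164, 160, 148], dtype=float)

# Fit a straight-line trend: hours ~ slope * month + intercept
slope, intercept = np.polyfit(months, hours, deg=1)

# Extrapolate to month 13 (January 2023)
forecast = slope * 13 + intercept
```

This is deliberately simplistic: with only DaysCount and HoursWorked per employee, a trend line like this is roughly the best you can do without richer features.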
I have a pandas DataFrame with a column that contains the number of new COVID cases per day.
After plotting it I get this graph:
Now I want to find out at what rate the cases are growing. How can I do this?
The rate at which cases grow is: (cases on current day - cases on previous day) / cases on previous day.
There are several ways to compute this. The easiest is to use df.pct_change().
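For example (case counts invented), pct_change computes exactly that day-over-day rate:

```python
import pandas as pd

# Invented daily new-case counts
df = pd.DataFrame({'new_cases': [100, 110, 121, 133]})

# (current day - previous day) / previous day
df['growth_rate'] = df['new_cases'].pct_change()
# The first row is NaN because there is no previous day to compare against
```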
I am looking to flag, in the 2018 order data, the customers that are going to churn in 2019, so that I can do some analyses such as where those customers come from and whether their order size has been decreasing compared to customers that will not churn.
The 2018 order data is a pandas DataFrame called 'order_data', and I have a list of customers that will churn in 2019 called 'churn_customers_2019'. order_data has a column called Customer_id, and the list contains the Customer_id values of the clients that will churn.
However, my logic is not working:
order_data['churn in 2019?'] = str('N')
for x in order_data['Customer_id']:
    if x in churn_customers_2019:
        order_data['churn in 2019?'][x] = 'Y'
If I run this code, everything ends up 'N' instead of some rows also being 'Y'. Only about 10% of the customers churn.
I would suggest using np.where and isin for your problem, like so:
order_data['churn in 2019?'] = np.where(order_data['Customer_id'].isin(churn_customers_2019), 'Y', 'N')
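A self-contained illustration with made-up IDs:

```python
import numpy as np
import pandas as pd

# Made-up order data and churn list
order_data = pd.DataFrame({'Customer_id': ['a1', 'b2', 'c3', 'b2']})
churn_customers_2019 = ['b2']

# Vectorised: 'Y' where the id is in the churn list, 'N' otherwise
order_data['churn in 2019?'] = np.where(
    order_data['Customer_id'].isin(churn_customers_2019), 'Y', 'N')
```

The original loop fails because order_data['churn in 2019?'][x] indexes by the label x (a customer id), not a row position, and chained indexing like that assigns to a temporary copy rather than the DataFrame itself.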
Pandas: I need to find repeated problems for the same customer. Note: a problem is considered repeated only if it occurred within 30 days with the same code.
Let's try grouping by customer ID and problem code and taking the consecutive differences of the dates within each group. Convert the time delta into days and check whether the resulting absolute value is less than or equal to 30.
However, pay serious attention to the comments posted above.
df['Date'] = pd.to_datetime(df['Date'])  # coerce Date to datetime
df[abs(df.groupby(['CT_ID', 'Problem_code'])['Date'].diff().dt.days).le(30)]
CT_ID Problem_code Date
3 XO1 code_1 2021-01-03 11:35:00
5 XO3 code_4 2020-09-20 09:35:00
8 XO3 code_4 2020-10-10 11:35:00
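Since the question did not include sample input, here is a self-contained sketch with invented data showing the groupby/diff idea (it assumes dates are already sorted within each group):

```python
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    'CT_ID': ['XO1', 'XO1', 'XO3', 'XO3', 'XO3'],
    'Problem_code': ['code_1', 'code_1', 'code_4', 'code_4', 'code_4'],
    'Date': ['2021-01-01', '2021-01-03', '2020-07-01', '2020-09-20', '2020-10-10'],
})
df['Date'] = pd.to_datetime(df['Date'])

# Gap to the previous occurrence of the same problem for the same customer
gap = df.groupby(['CT_ID', 'Problem_code'])['Date'].diff().dt.days

# Repeats = occurrences within 30 days of the previous one
repeats = df[gap.abs().le(30)]
```

The first occurrence in each group has a NaT diff, so it is never flagged as a repeat.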
In my dataset there is an independent column called "Cycle". It contains date ranges written as text, and I don't understand how to convert them into numbers. I am working with multiple linear regression in Python. The column looks like this. Any idea regarding this?
Cycle
10th June to 11th July
20th June to 21st July
17th June to 18th July
Disclaimer: since your question is broad and quite vague on details, this answer only aims at pointing you to what to research and some general terms around it.
This is an example of categorical data. In a nutshell, you can do several things with it; here are some ideas:
If the categories are fixed and you know all possible values, you can convert them to numerical values by assigning each of them an incremental (or random) number.
If your categories are not known in advance, you can convert them to hashed categories. As a variation of this approach, you can hash only the most frequent categories and reduce the outliers to a smaller number of hash values, reducing the total number of categories used.
You can bucketize them, depending on your expected impact; some ideas:
bucketize in month chunks
bucketize in quarter chunks
bucketize in week chunks
Finally, you can transform it into a more detailed representation, extracting additional value from it like so (this is just an example):
Cycle C_Start_Day C_Start_Month C_End_Day C_End_Month C_Num_Days
10th June to 11th July 10 6 11 7 1
20th June to 21st July 20 6 21 7 1
17th June to 18th July 17 6 18 7 1
Note: based on your previous comment, I'd suggest using the last approach (transformation). That way you can extract from the Cycle column all the data you need for further numerical processing.
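A hedged sketch of that transformation, assuming the strings always follow the "10th June to 11th July" pattern. The year is invented (the data carries none), and C_Num_Days here is the calendar-day span, which may differ from what the table above intended:

```python
import re
import pandas as pd

MONTHS = {m: i for i, m in enumerate(
    ['January', 'February', 'March', 'April', 'May', 'June',
     'July', 'August', 'September', 'October', 'November', 'December'], start=1)}

def parse_cycle(cycle, year=2020):
    # year is an assumption: the source strings carry no year information
    m = re.match(r'(\d+)\w*\s+(\w+)\s+to\s+(\d+)\w*\s+(\w+)', cycle)
    start_day, start_month = int(m.group(1)), MONTHS[m.group(2)]
    end_day, end_month = int(m.group(3)), MONTHS[m.group(4)]
    start = pd.Timestamp(year=year, month=start_month, day=start_day)
    end = pd.Timestamp(year=year, month=end_month, day=end_day)
    return {'C_Start_Day': start_day, 'C_Start_Month': start_month,
            'C_End_Day': end_day, 'C_End_Month': end_month,
            'C_Num_Days': (end - start).days}
```

Each resulting field is an ordinary number, so it can go straight into a multiple linear regression as a feature.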