Calculate Pandas dataframe column with replace function - python

I'm working on calculating a field in a Pandas dataframe. Learning Python, I'm trying to find the best method.
The dataframe is quite big, over 55 million rows. It has a few columns, among which date and failure are of interest to me. The dataframe looks like this:
date        failure
2018-09-09  0
2016-05-12  1
2013-12-12  1
2018-05-12  1
2018-05-12  1
I want to calculate failure_date (if failure = 1 then failure_date = date).
I tried something like this:
import pandas as pd
abc = pd.read_pickle('data_abc.pkl')
abc['failure_date'] = abc['failure'].replace(1, abc['date'])
The session has been busy for a very long time (1.5 hours) with no result so far. Is this the right approach?
Is there a more effective way of calculating a column based on a condition on others?

This code adds a column "failure_date" and sets it to the failure date for the failures. It does not address "non-failures".
abc.loc[abc['failure']==1, 'failure_date'] = abc['date']

If you don't mind discarding the rest of the dataframe, you could get all the rows where failure is 1 like this:
abc = abc[abc['failure'] == 1]
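Another vectorized option is Series.where, which keeps the date where the condition holds and fills NaN elsewhere. A minimal sketch on the small sample from the question:

```python
import pandas as pd

# Small sample mirroring the question's data
abc = pd.DataFrame({
    'date': ['2018-09-09', '2016-05-12', '2013-12-12', '2018-05-12', '2018-05-12'],
    'failure': [0, 1, 1, 1, 1],
})

# Keep the date where failure == 1, NaN elsewhere
abc['failure_date'] = abc['date'].where(abc['failure'] == 1)
print(abc)
```

Either this or the `.loc` assignment above runs in a single vectorized pass, which should be far faster on 55 million rows than element-wise replace.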

Related

Group pandas dataframe by an outside column

import pandas as pd
db = pd.read_csv('17base.csv')
order = pd.read_csv('order.csv')
db = db.groupby(db['code'])['r'].mean()
and the two dataframes are just like the tables I drew here. But this code doesn't work, since I want to group every 50 "codes" here.
I have a dataframe like this. I want to first group the codes by the order given in another dataframe, which has only one column. Second, I want to construct portfolios of every 50 codes, where "r" is basically the average over the same date (the date is in "year-week" format). The result should list the portfolio number, each date, and "r". It seems a really complicated task for me; can anyone help me?
Here is a sample original table.
code  Date
1     2017-1
1     2017-2
1     2017-3
2     2017-1
2     2017-2
2     2017-3
...   ...
3000  2017-3
And I have another table which has the order the codes should be in, like
code
6
7
564
...
That's the best way I could think of to explain myself...
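One possible sketch of the whole pipeline, scaled down so it is easy to follow: rank each code by its position in the external order, bucket consecutive codes into portfolios, then average r per portfolio and date. The column names come from the question; the bucket size is 2 here purely for the demo (it would be 50 on the real data).

```python
import pandas as pd

# Toy versions of the two tables from the question
db = pd.DataFrame({
    'code': [1, 1, 2, 2, 3, 3],
    'Date': ['2017-1', '2017-2', '2017-1', '2017-2', '2017-1', '2017-2'],
    'r':    [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
})
order = pd.DataFrame({'code': [3, 1, 2]})  # the external ordering of codes

codes_per_portfolio = 2  # would be 50 on the real data

# Map each code to its position in the external order,
# then bucket consecutive codes into portfolios
position = {c: i for i, c in enumerate(order['code'])}
db['portfolio'] = db['code'].map(position) // codes_per_portfolio

# Average r per portfolio and date
result = db.groupby(['portfolio', 'Date'])['r'].mean().reset_index()
print(result)
```

Here portfolio 0 holds the first two codes in the external order (3 and 1), and its r for each date is the mean of their r values.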

Python Pandas: Create a dictionary from a dataframe with values equal to a list of all row values

I am trying to create a dictionary from a dataframe in the following way.
My dataframe contains a column called station_name. The station_name values are unique; that is, each row corresponds to one station. There is another column called trip_id (see example below). Many stations can be associated with a single trip_id. For example:
l1=[1,1,2,2]
l2=[34,45,66,67]
df1=pd.DataFrame(list(zip(l1,l2)),columns=['trip_id','station_name'])
df1.head()
   trip_id  station_name
0        1            34
1        1            45
2        2            66
3        2            67
I am trying to get a dictionary d={1:[34,45],2:[66,67]}.
I solved it with a for loop in the following fashion.
from tqdm import tqdm

Trips_Stations = {}
Trips = set(df1['trip_id'])
T = list(Trips)
for i in tqdm(range(len(T))):
    c_id = T[i]
    Values = list(df1[df1.trip_id == c_id].station_name)
    Trips_Stations.update({c_id: Values})
Trips_Stations
My actual dataset has about 65000 rows. The above takes about 2 minutes to run. While this is acceptable for my application, I was wondering if there is a faster way to do it using base pandas.
Thanks
Somehow Stack Overflow suggested that I look at groupby.
This is much faster:
d = df1.groupby('trip_id')['station_name'].apply(list)
from collections import OrderedDict
o_d = d.to_dict(OrderedDict)
o_d = dict(o_d)
It took about 30 seconds for the dataframe with 65000 rows.
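For reference, the same groupby approach as a self-contained snippet on the question's four-row sample frame:

```python
import pandas as pd

l1 = [1, 1, 2, 2]
l2 = [34, 45, 66, 67]
df1 = pd.DataFrame(list(zip(l1, l2)), columns=['trip_id', 'station_name'])

# One vectorized groupby instead of a Python loop over trip ids
d = df1.groupby('trip_id')['station_name'].apply(list).to_dict()
print(d)  # {1: [34, 45], 2: [66, 67]}
```

Chaining `.to_dict()` directly on the grouped Series gives a plain dict in one line, without the OrderedDict detour.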

Assigning a value to the same dates fulfilling a condition in a more efficient way in a dataframe

I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID:
The index, called Date, has all the days of 2010. That is, for each code in NUTS_ID I have every day of 2010 (all days of the year for AT1, AT2, and so on). I created a list containing the dates corresponding to non-workdays, and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks day by day if it's in the workday list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # Assigning 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # Assigning 1 to workdays
This works well enough if the dataset is not big. But with some of the datasets I'm processing this takes a very long time. I would like to ask for ideas in order to do the process faster and more efficient. Thank you in advance for your input.
EDIT: One of the things I have thought is that maybe a groupby could be helpful, but I don't know if that is correct.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
I can't reproduce your dataset, but something along those lines should work.
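A tiny runnable illustration of that one-liner (the dates and the workday list here are made up for the demo):

```python
import numpy as np
import pandas as pd

# Hypothetical date index and a made-up workday list
df1 = pd.DataFrame(index=pd.to_datetime(['2010-01-01', '2010-01-04', '2010-01-05']))
workdays_list = pd.to_datetime(['2010-01-04', '2010-01-05'])

# Vectorized: 1 where the index date is in the workday list, else 0
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
print(df1['Workday'].tolist())  # [0, 1, 1]
```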

Is there a way to create a Pandas dataframe where the values map to an index/row pair?

I was struggling with how to word the question, so I will provide an example of what I am trying to do below. I have a dataframe that looks like this:
   ID     CODE   COST
0  60086  V2401  105.38
1  60142  V2500  221.58
2  60086  V2500  105.38
3  60134  V2750  35
4  60134  V2020  0
I am trying to create a dataframe that has the ID as rows, the CODE as columns, and the COST as values, since the cost for the same code differs per ID. How can I do this?
This seems like a classic "long to wide" problem, and there are several ways to do it. You can try pivot, for example:
df.pivot_table(index='ID', columns='CODE', values='COST')
(assuming that the dataframe is df.)
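Applied to the sample data from the question, the pivot looks like this (ID/CODE pairs that never occur come out as NaN):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':   [60086, 60142, 60086, 60134, 60134],
    'CODE': ['V2401', 'V2500', 'V2500', 'V2750', 'V2020'],
    'COST': [105.38, 221.58, 105.38, 35.0, 0.0],
})

# Long-to-wide: one row per ID, one column per CODE, COST as the cell value
wide = df.pivot_table(index='ID', columns='CODE', values='COST')
print(wide)
```

Note that `pivot_table` averages duplicates by default; with one COST per ID/CODE pair, as here, the values pass through unchanged.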

With pandas, how do I calculate a rolling number of events in the last second given timestamp data?

I have a dataset where I calculate service times based on request and response times. I would like to add a count of requests in the last second, to show the obvious relationship that as we get more requests per second the system slows. Here is the data that I have, for example:
serviceTimes.head()
Out[71]:
Id Req_Time Rsp_Time ServiceTime
0 3_1 2015-02-13 14:07:08.729000 2015-02-13 14:07:08.821000 00:00:00.092000
1 3_2 2015-02-13 14:07:08.929000 2015-02-13 14:07:08.929000 00:00:00
2 3_12 2015-02-13 14:11:53.908000 2015-02-13 14:11:53.981000 00:00:00.073000
3 3_14 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.250000 00:00:00.139000
4 3_15 2015-02-13 14:11:54.111000 2015-02-13 14:11:54.282000 00:00:00.171000
For this I would like a rolling data set of something like:
0 14:07:08 2
1 14:11:53 1
2 14:11:54 2
I've tried rolling_sum and rolling_count, but unless I am using them wrong or misunderstanding the period argument, they are not working for me.
For your problem, it looks like you want to summarize your data set using a split-apply-combine approach. See here for the documentation that will help you get your code working, but basically you'll want to do the following:
Create a new column (say, 'Req_Time_Sec') that holds Req_Time truncated to second resolution (e.g. 14:07:08.729000 becomes 14:07:08).
Use groups = serviceTimes.groupby('Req_Time_Sec') to separate your data set into sub-groups based on which second each request occurs in.
Finally, create a new data set by calculating the length of each sub-group (which represents the number of requests in that second) and aggregating the results into a single DataFrame (something like new_df = groups.aggregate(len)).
The above is all untested pseudo-code, but it, along with the link to the documentation, should help you get where you want to go.
You first need to transform the timestamp into a string, which you can then group by, showing the count and average service time:
serviceTimes['timestamp'] = [t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time]
serviceTimes.groupby('timestamp')['ServiceTime'].agg(['mean', 'count'])
Alternatively, create a data frame of the request times in the appropriate string format, e.g. 15-02-13 14:07, then count the occurrences of each time stamp using value_counts(). You can also plot the results quite easily.
df = pd.DataFrame([t.strftime('%y-%m-%d %H:%M') for t in serviceTimes.Req_Time],
                  columns=['timestamp'])
response = df.timestamp.value_counts()
response.plot(rot=90)
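A more idiomatic sketch of the per-second count, assuming Req_Time is already a datetime column: truncate with dt.floor('s') instead of string formatting, then take group sizes.

```python
import pandas as pd

# Sample request times from the question
serviceTimes = pd.DataFrame({
    'Req_Time': pd.to_datetime([
        '2015-02-13 14:07:08.729', '2015-02-13 14:07:08.929',
        '2015-02-13 14:11:53.908', '2015-02-13 14:11:54.111',
        '2015-02-13 14:11:54.111',
    ]),
})

# Truncate each timestamp to whole seconds, then count requests per second
per_second = serviceTimes.groupby(serviceTimes['Req_Time'].dt.floor('s')).size()
print(per_second)
```

This keeps the keys as real timestamps rather than strings, so the result sorts chronologically and can be joined back onto the original frame if needed.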
