I am having trouble adding a couple of new calculated columns to my DataFrame. Here is what I'm looking for:
Original DF:
Col_IN Col_OUT
5 2
1 2
2 2
3 0
3 1
What I want to add is two columns. The first is a running 'end of day' total: the net of the current day (incoming minus outgoing) plus the previous day's end total. The second is 'Available Units', which is the previous day's end total plus the current day's incoming units. Desired result:
Desired DF:
Col_IN Available_Units Col_OUT End_Total
5 5 2 3
1 4 2 2
2 4 2 2
3 5 0 5
3 8 1 7
It's a weird one - anybody have an idea? Thanks.
For the End_Total you can use np.cumsum, and for Available_Units you can use shift.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col_IN': [5, 1, 2, 3, 3],
    'Col_OUT': [2, 2, 2, 0, 1]
})
df['End_Total'] = np.cumsum(df['Col_IN'] - df['Col_OUT'])
df['Available_Units'] = df['End_Total'].shift().fillna(0) + df['Col_IN']
print(df)
will output
Col_IN Col_OUT End_Total Available_Units
0 5 2 3 5.0
1 1 2 2 4.0
2 2 2 2 4.0
3 3 0 5 5.0
4 3 1 7 8.0
Running totals are also known as cumulative sums, for which pandas has the cumsum() function.
The end totals can be calculated through the cumulative sum of incoming minus the cumulative sum of outgoing:
df["End_Total"] = df["Col_IN"].cumsum() - df["Col_OUT"].cumsum()
The available units can be calculated in the same way if you shift the outgoing column down by one:
df["Available_Units"] = df["Col_IN"].cumsum() - df["Col_OUT"].shift(1).fillna(0).cumsum()
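Putting the two lines together with the sample data from the question, a self-contained sketch would look like this:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Col_IN': [5, 1, 2, 3, 3],
    'Col_OUT': [2, 2, 2, 0, 1]
})

# Running end-of-day total: cumulative incoming minus cumulative outgoing
df["End_Total"] = df["Col_IN"].cumsum() - df["Col_OUT"].cumsum()

# Available units: shift the outgoing column down one before accumulating,
# so each day sees the previous day's end total plus today's incoming units
df["Available_Units"] = df["Col_IN"].cumsum() - df["Col_OUT"].shift(1).fillna(0).cumsum()

print(df)
```

This reproduces the End_Total column 3, 2, 2, 5, 7 and the Available_Units column 5, 4, 4, 5, 8 from the desired output.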
I am working with a pandas dataframe that has the following two columns: "personID" and "points". I would like to create a third column ("localMin") that stores, for each personID, the minimum value of "points" at each point in the dataframe compared with all previous values in the "points" column (see image below).
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
import pandas as pd

df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3, 4, 2, 6, 1, 2, 4, 3, 1, 2, 6, 1]})
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
personID points localMin
0 A 3 3
1 A 4 3
2 A 2 2
3 A 6 2
4 A 1 1
5 A 2 1
6 B 4 4
7 B 3 3
8 B 1 1
9 B 2 1
10 B 6 1
11 B 1 1
I need a month-by-month way of showing year-to-date unique values. For example:
month value
1 a
1 b
1 a
2 a
2 a
2 a
3 c
3 b
3 b
4 d
4 e
4 f
Should output:
Month Monthly unique Year to date unique
1 2 2
2 1 2
3 2 3
4 3 6
For the monthly uniques it is just a matter of groupby and nunique(), but that won't work for year-to-date. Year-to-date could be achieved with a for loop, filtering the dataframe month by month from the beginning of the year, but that is the slow, non-pythonic way I want to avoid.
How to do it in efficient way?
Let us do
s = df.groupby('month').value.agg(['nunique',list])
s['list'] = s['list'].cumsum().map(lambda x : len(set(x)))
s
nunique list
month
1 2 2
2 1 2
3 2 3
4 3 6
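For reference, here is a self-contained run of this approach with the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'value': list('abaaaacbbdef')
})

# Per-month unique count, plus the list of values seen in each month
s = df.groupby('month').value.agg(['nunique', list])

# cumsum on lists concatenates them running year-to-date;
# the size of the resulting set is the YTD unique count
s['list'] = s['list'].cumsum().map(lambda x: len(set(x)))
print(s)
```

The `nunique` column gives the monthly uniques (2, 1, 2, 3) and the transformed `list` column gives the year-to-date uniques (2, 2, 3, 6).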
BEN_YO's approach is pretty simple and effective for small datasets. However, it can be slow and costly on a big dataframe due to the cumsum on lists (of strings).
Let's try drop_duplicates first and only work on duplicates:
(df.drop_duplicates(['month','value'])
.assign(year=lambda x: ~x.duplicated(['value']))
.groupby('month')
.agg({'value':'nunique', 'year':'sum'})
.assign(year=lambda x: x.year.cumsum())
)
Output:
value year
month
1 2 2
2 1 2
3 2 3
4 3 6
I would like to summarize a column from a CSV file - essentially, extract the column data and match it up with the relevant ratings and counts.
Also, any idea how I should match the expected dataframe with the website image?
website rate
1 two 5
2 two 3
3 two 5
4 one 2
5 one 4
6 one 4
7 one 2
8 one 2
9 two 2
website rate(over 5) count appeal(rate over 5 / count >= 0.5)
one 0 5 0
two 2 4 1
You can use a groupby operation:
res = df.assign(rate_over_5=df['rate'].ge(5))\
        .groupby('website').agg({'rate_over_5': ['sum', 'size']})\
        .xs('rate_over_5', axis=1).reset_index()
res['appeal'] = ((res['sum'] / res['size']) >= 0.5).astype(int)
print(res)
website sum size appeal
0 one 0.0 5 0
1 two 2.0 4 1
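A variant sketch of the same groupby idea using named aggregation (available in pandas 0.25+), which avoids the xs step; the column names here are my own choices, not from the question:

```python
import pandas as pd

# Sample data mirroring the question's table
df = pd.DataFrame({
    'website': ['two', 'two', 'two', 'one', 'one', 'one', 'one', 'one', 'two'],
    'rate': [5, 3, 5, 2, 4, 4, 2, 2, 2]
})

# Named aggregation: count of ratings >= 5 and total ratings per website
res = (df.assign(rate_over_5=df['rate'].ge(5))
         .groupby('website')
         .agg(rate_over_5=('rate_over_5', 'sum'),
              count=('rate', 'size'))
         .reset_index())

# Appeal flag: at least half of the ratings are 5 or more
res['appeal'] = ((res['rate_over_5'] / res['count']) >= 0.5).astype(int)
print(res)
```

This gives 0 out of 5 high ratings (appeal 0) for "one" and 2 out of 4 (appeal 1) for "two", matching the expected output.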
I have a Pandas Dataframe with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple Calls). A third column gives a day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID Day PersonID ContactsIN7Days
6 2 3 2
3 14 2 2
1 8 1 1
5 1 3 2
2 12 2 2
7 100 3 1
This works; however, it is going to be applied to a big data set. Is there a way to make this more efficient, perhaps through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the order of your dataframe is modified as a result, and that your Day column is converted to a timedelta dtype.
Starting from your dataframe df:
CallID Day PersonID
0 6 2 3
1 3 14 2
2 1 8 1
3 5 1 3
4 2 12 2
5 7 100 3
Start by converting Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then use pd.merge_asof to merge your dataframe with the count of calls made by each individual in a 7-day period. To get that count, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = pd.merge_asof(df.sort_values(['Day']),
                       df.sort_values(['Day'])
                         .groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
                         .size()
                         .to_frame('ContactsIN7Days')
                         .reset_index(),
                       left_on='Day', right_on='Day',
                       left_by='PersonID', right_by='PersonID',
                       direction='nearest')
Your resulting new_df will look like this:
CallID Day PersonID ContactsIN7Days
0 5 1 days 3 2
1 6 2 days 3 2
2 1 8 days 1 1
3 2 12 days 2 2
4 3 14 days 2 2
5 7 100 days 3 1
I've got a "Perf" dataframe with people's performance data over time.
The index is a timestamp and the columns are the persons' names.
There are 100 persons (columns) and each person belongs to one of 10 groups; however, the group assignment is dynamic, and each person can be assigned to a different group every day.
So there is a second "Group" DataFrame of the same shape as "Perf" that contains the group number (0-9) for each timestamp and person.
The question is: how can I elegantly do a mean subtraction every day for each person with regard to its group assignment?
One method that is really slow is:
for g in range(10):
    Perf[Group == g] -= Perf[Group == g].mean(1)
But this is really slow, I'm sure there is a way to do it in one shot with Pandas.
Here is a concrete example.
perf represents the score of each person (0-4) over 10 days (0-9):
In [8]: perf = DataFrame(np.random.randn(10,5))
In [9]: perf
Out[9]:
0 1 2 3 4
0 0.945575 -0.805883 1.338865 0.420829 -1.074329
1 -1.086116 0.430230 1.296153 0.527612 1.269646
2 0.705276 -1.409828 2.859838 -0.769508 1.520295
3 0.331860 -0.217884 0.962576 -0.495888 -1.083996
4 0.402625 0.018885 -0.260516 -0.547802 -0.995959
5 2.168944 -0.361657 0.184537 0.391014 0.972161
6 1.959699 0.590739 -0.781736 1.059761 1.080997
7 2.090273 -2.446399 0.553785 0.806368 -0.786343
8 0.441160 -2.320302 -1.981387 2.190607 0.345626
9 -0.276013 -1.319214 1.339096 0.269680 -0.509884
Then I've got a grouping dataframe that shows, for each day, the group association of each of the 5 persons; the grouping changes every day.
In [20]: grouping
Out[20]:
0 1 2 3 4
0 3 1 2 1 2
1 3 1 2 2 1
2 2 2 3 1 1
3 1 2 2 3 1
4 3 2 1 2 1
5 2 1 1 2 3
6 1 2 1 2 3
7 2 2 1 1 3
8 2 1 2 1 3
9 1 3 2 1 2
I want to modify perf such that, for each day, I subtract from each person the mean score of its group.
For example, for day 0 the result will be: 0.0 -0.613356 1.206597 0.613356 -1.206597
I want to do it in one line, without loops. groupby seems to be the function to use, but I couldn't efficiently use its output to perform the mean subtraction on the original matrix.
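One possible loop-free approach (a sketch, not an authoritative answer; the sample data below is randomly generated rather than the exact values from the question): stack both frames so each (day, person) pair becomes a row, group by day and by that day's group label, and use transform('mean') to broadcast the group means back before subtracting:

```python
import numpy as np
import pandas as pd

# Random stand-ins for the Perf and Group frames from the question:
# 10 days x 5 persons, group labels 1-3
np.random.seed(0)
perf = pd.DataFrame(np.random.randn(10, 5))
grouping = pd.DataFrame(np.random.randint(1, 4, size=(10, 5)))

# Stack to a Series indexed by (day, person)
stacked = perf.stack()

# Group by day and by that day's group label; transform('mean')
# broadcasts each group's mean back to every member of the group
group_mean = stacked.groupby(
    [stacked.index.get_level_values(0), grouping.stack()]
).transform('mean')

# Subtract and restore the original day x person shape
result = (stacked - group_mean).unstack()
```

After this, within each day the scores of every group sum to zero, which matches the day-0 example in the question.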