I want to sort a pandas dataframe by the value counts of two columns, where one depends on the other. As seen in the image, I have achieved categorical sorting. However, I want the column 'category' to be sorted by its value counts, and then the dataframe to be sorted again based on 'beneficiary_name' within the same category.
This is the code I have written so far:
data_category = data_category.sort_values(by=['category','beneficiary_name'], ascending=False)
Please help me figure this out. Thanks.
Inspired by this related question:
Create column of value_counts in Pandas dataframe
import pandas as pd
df = pd.DataFrame({'id': range(9), 'cat': list('ababaacdc'), 'benef': list('uuuuiiiii')})
print(df)
# id cat benef
# 0 0 a u
# 1 1 b u
# 2 2 a u
# 3 3 b u
# 4 4 a i
# 5 5 a i
# 6 6 c i
# 7 7 d i
# 8 8 c i
df['cat_count'] = df.groupby(['cat'])['id'].transform('count')
print(df)
# id cat benef cat_count
# 0 0 a u 4
# 1 1 b u 2
# 2 2 a u 4
# 3 3 b u 2
# 4 4 a i 4
# 5 5 a i 4
# 6 6 c i 2
# 7 7 d i 1
# 8 8 c i 2
df = df.sort_values(by=['cat_count', 'cat', 'benef'], ascending=False)
print(df)
# id cat benef cat_count
# 0 0 a u 4
# 2 2 a u 4
# 4 4 a i 4
# 5 5 a i 4
# 6 6 c i 2
# 8 8 c i 2
# 1 1 b u 2
# 3 3 b u 2
# 7 7 d i 1
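Applying the same idea to the original column names might look like the sketch below (it assumes the dataframe is called data_category and that rows within a category should also be ordered by how often each 'beneficiary_name' occurs; adjust the sort keys if that is not what you need):
category_counts = data_category.groupby('category')['category'].transform('count')
beneficiary_counts = (data_category
                      .groupby(['category', 'beneficiary_name'])['beneficiary_name']
                      .transform('count'))
data_category = (data_category
                 .assign(cat_count=category_counts, benef_count=beneficiary_counts)
                 .sort_values(by=['cat_count', 'category', 'benef_count', 'beneficiary_name'],
                              ascending=False))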
I need to take the lowest value over n rows and add it to those n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the remaining values are incorporated into the last group. So in this example, n=4 for the end of the dataframe.
Thank you in advance!
I do not know a straightforward way to do this, but here is a working example (not elegant, but working).
If you do not need to worry about the number of rows being divisible by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together, instead of being grouped with the 3 previous values as required in this case.
A way around this is to look at the .count() of elements in each group generated by groupby and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# if the last size is not equal to n
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
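For reference, a more compact variant (a sketch, not the answer above) folds the incomplete trailing group into the previous full group before taking the minimum:
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3

# Group label per row; if the last group is incomplete, relabel it so it
# merges with the previous full group.
groups = pd.Series(df.index // n, index=df.index)
if len(df) % n != 0:
    groups = groups.replace(groups.max(), groups.max() - 1)

df['new_col'] = df.groupby(groups)['col1'].transform('min')
This reproduces the expected output above without the temporary dataframes.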
I have two dataframes, df_diff and df_three. Each column of df_three contains the index values of the three largest values in the corresponding column of df_diff. For example, let's say df_diff looks like this:
A B C
0 4 7 8
1 5 5 7
2 8 2 1
3 10 3 4
4 1 12 3
Using
df_three = df_diff.apply(lambda s: pd.Series(s.nlargest(3).index))
df_three would look like this:
A B C
0 3 4 0
1 2 0 1
2 1 1 3
How could I match the index values in df_three to the column values of df_diff? In other words, how could I get df_three to look like this:
A B C
0 10 12 8
1 8 7 7
2 5 5 4
Am I making this problem too complicated? Would there be an easier way?
Any help is appreciated!
def top_3(s, top_values):
    res = s.sort_values(ascending=False)[:top_values]
    res.index = range(top_values)
    return res

res = df_diff.apply(lambda x: top_3(x, 3))
print(res)
Use numpy.sort with dataframe values:
import numpy as np

n = 3
arr = df_diff.copy().to_numpy()
df_three = pd.DataFrame(np.sort(arr, 0)[::-1][:n], columns=df_diff.columns)
print(df_three)
A B C
0 10 12 8
1 8 7 7
2 5 5 4
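For what it's worth, a shorter sketch that builds directly on the nlargest call from the question (not taken from either answer) keeps the values instead of the index labels:
import pandas as pd

df_diff = pd.DataFrame({'A': [4, 5, 8, 10, 1],
                        'B': [7, 5, 2, 3, 12],
                        'C': [8, 7, 1, 4, 3]})

# Keep the three largest values of each column rather than their index labels.
df_three = df_diff.apply(lambda s: pd.Series(s.nlargest(3).to_numpy()))
print(df_three)
#     A   B  C
# 0  10  12  8
# 1   8   7  7
# 2   5   5  4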
I want to treat non-consecutive ids as different variables during groupby, so that I can return the first value of stamp and the sum of increment as a new dataframe. Here is sample input and output.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
                   np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']
df_result = pd.DataFrame([np.array(['a','b','c','b','a']),
                          np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
    sum = d.increment.sum()
    stamp = d.stamp.min()
    name = d.id.max()
    return name, stamp, sum

# idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = list(zip(*df.groupby([df.key]).apply(get_result)))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution
Not the most optimal code, but it solves the problem:
> df_group = df.groupby('id')
We can't use id alone for the groupby, so add a new column that lets us group within id based on whether the rows are consecutive or not:
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can use the new column group_diff for secondary grouping. A sort was added at the end, as suggested in the comments, to get the exact expected output:
> df.groupby(['id','group_diff']).agg({'increment': sum, 'stamp': 'first'}).reset_index()[['id', 'stamp','increment']].sort_values('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3
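A more concise sketch of the same idea (using the run-length key from the question together with pandas named aggregation instead of apply; this assumes a reasonably recent pandas version):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': list('aaabcbbaaa'),
                   'stamp': np.arange(1, 11),
                   'increment': np.ones(10, dtype=int)})

# The key increments every time id changes, so non-consecutive runs of the
# same id land in different groups.
key = (df['id'] != df['id'].shift()).cumsum()

df_result = (df.groupby(key)
               .agg(id=('id', 'first'),
                    stamp=('stamp', 'first'),
                    increment_sum=('increment', 'sum'))
               .reset_index(drop=True))
print(df_result)
#   id  stamp  increment_sum
# 0  a      1              3
# 1  b      4              1
# 2  c      5              1
# 3  b      6              2
# 4  a      8              3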
Short version: I have a slightly trickier than usual merge operation I'd like help optimizing with dplyr or merge. I have a number of solutions already, but they run quite slowly over large datasets, and I am curious whether a faster method exists in R (or alternatively in SQL or Python).
I have two data.frames:
an asynchronous log of events tied to Stores, and
a table that gives more details about the stores in that log.
The issue: Store IDs are unique identifiers for a specific location, but store locations may change ownership from one period to the next (and, just for completeness, no two owners may possess the same store at the same time). So when I merge over store-level info, I need some sort of conditional that merges the store-level info for the correct period.
Reproducible Example:
# asynchronous log.
# t for period.
# Store for store loc ID
# var1 just some variable.
set.seed(1)
df <- data.frame(
  t = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1 = runif(11,0,1)
)
# Store table
# You can see lots of store locations opening and closing.
# StartDate is when this business came into existence
# Store is the store id from df
# CloseDate is when this store went out of business
# storeVar1 is just some important var to merge over
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)
Now, I only want to merge information from the Stores data.frame onto the log if that store was open for business in period t. CloseDate and StartDate indicate the last and first periods of the business's operation, respectively. (For completeness, but not too important: a StartDate of 0 means the store existed since before the sample; a CloseDate of 9 means the store had not gone out of business at that location by the end of the sample.)
One solution relies on a period t level split() and dplyr::rbind_all(), e.g.
# The following seems to do the trick.
complxMerge_v1 <- function(df, Stores, by = "Store"){
  library("dplyr")
  temp <- split(df, df$t)
  for (Period in names(temp)) {
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores,
                    StartDate <= as.numeric(Period) &
                    CloseDate >= as.numeric(Period)),
      by = "Store"
    )
  }
  df <- dplyr::rbind_all(temp); rm(temp)
  df
}
complxMerge_v1(df, Stores, "Store")
Functionally, this appears to work (I haven't come across a significant error yet, anyway). However, we are dealing with (increasingly usual) billions of rows of log data.
I made a larger reproducible example on sense.io if you'd like to use it for bench-marking. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals
Two questions:
First and foremost, is there another way to approach this problem, using similar methods, that will run faster?
Is there by chance a quick and easy solution in SQL or Python (with which I am not quite as familiar, but could rely on if need be)?
Also, can you help me articulate this question in a more general, abstract way? Right now I only know how to talk about the problem in context-specific terms, but I'd love to be able to discuss these types of issues with more appropriate, more general programming or data-manipulation terminology.
In R, you could take a look at the data.table::foverlaps function:
library(data.table)
# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]
setkey(df, Store, StartDate, CloseDate)
# Run `foverlaps` function
foverlaps(setDT(Stores), df)
# Store t var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
# 1: 1 1 0.26550866 1 1 0 9 a
# 2: 1 2 0.90820779 2 2 0 9 a
# 3: 1 3 0.94467527 3 3 0 9 a
# 4: 1 4 0.62911404 4 4 0 9 a
# 5: 2 1 0.37212390 1 1 0 2 b
# 6: 2 2 0.20168193 2 2 0 2 b
# 7: 3 1 0.57285336 1 1 0 3 c
# 8: 3 2 0.89838968 2 2 0 3 c
# 9: 3 3 0.66079779 3 3 0 3 c
# 10: 2 4 0.06178627 4 4 4 9 d
# 11: 3 4 0.20597457 4 4 4 9 e
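Since the question also asks about a Python option, here is a minimal pandas sketch of the same conditional merge (the log/stores names and the placeholder var1 values are illustrative, not taken from either answer):
import pandas as pd

log = pd.DataFrame({'t':     [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
                    'Store': [1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3],
                    'var1':  range(11)})
stores = pd.DataFrame({'StartDate': [0, 0, 0, 4, 4],
                       'Store':     [1, 2, 3, 2, 3],
                       'CloseDate': [9, 2, 3, 9, 9],
                       'storeVar1': list('abcde')})

# Merge on Store, then keep only rows whose period t falls inside the
# store's [StartDate, CloseDate] ownership window.
merged = log.merge(stores, on='Store', how='left')
merged = merged[(merged['StartDate'] <= merged['t']) & (merged['t'] <= merged['CloseDate'])]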
You can transform your Stores data.frame by adding a t column that contains all values of t for a given Store, and then use the unnest function from Hadley's tidyr package to transform it to "long" form.
require("tidyr")
require("dplyr")
complxMerge_v2 <- function(df, Stores, by = NULL) {
  Stores %>%
    mutate(., t = lapply(1:nrow(.),
                         function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"])) %>%
    unnest(t) %>%
    left_join(df, ., by = by)
}
complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")
microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962 10
# complxMerge_v2(df, Stores) 532.744 539.743 567.7207 561.9635 588.0637 636.5775 10
Here are step-by-step results to make the process clear.
Stores_with_t <-
  Stores %>% mutate(., t = lapply(1:nrow(.),
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 0 2 2 b 0, 1, 2
# 3 0 3 3 c 0, 1, 2, 3
# 4 4 2 9 d 4, 5, 6, 7, 8, 9
# 5 4 3 9 e 4, 5, 6, 7, 8, 9
# After that `unnest(t)`
Stores_with_t_unnest <-
  Stores_with_t %>% unnest(t)
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0
# 2 0 1 9 a 1
# 3 0 1 9 a 2
# 4 0 1 9 a 3
# 5 0 1 9 a 4
# 6 0 1 9 a 5
# 7 0 1 9 a 6
# 8 0 1 9 a 7
# 9 0 1 9 a 8
# 10 0 1 9 a 9
# 11 0 2 2 b 0
# 12 0 2 2 b 1
# 13 0 2 2 b 2
# 14 0 3 3 c 0
# 15 0 3 3 c 1
# 16 0 3 3 c 2
# 17 0 3 3 c 3
# 18 4 2 9 d 4
# 19 4 2 9 d 5
# 20 4 2 9 d 6
# 21 4 2 9 d 7
# 22 4 2 9 d 8
# 23 4 2 9 d 9
# 24 4 3 9 e 4
# 25 4 3 9 e 5
# 26 4 3 9 e 6
# 27 4 3 9 e 7
# 28 4 3 9 e 8
# 29 4 3 9 e 9
# And then simple `left_join`
left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
I have got a pd.DataFrame
     Time  Value
a 1     1      1
  2     2      5
  3     5      7
b 1     1      5
  2     2      9
  3    10     11
I want to multiply the column Value by the difference Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top-level index.
For example, Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, I would need a "shifted" version of column Time "starting" at row b, so that I could do df["Product"] = (df["Time"].shifted - df["Time"]) * df["Value"]. The result should look like this:
     Time  Value  Product
a 1     1      1        0
  2     2      5        5
  3     5      7       21
b 1     1      5        0
  2     2      9        9
  3    10     11       88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
     Time  Value  Product
a 1     1      1        0
  2     2      5        5
  3     5      7       21
b 1     1      5        0
  2     2      9        9
  3    10     11       88
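A slightly shorter variant (a sketch using GroupBy.diff rather than an explicit shift; not the code above):
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 5, 1, 2, 10], 'Value': [1, 5, 7, 5, 9, 11]},
                  index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

# Per-group difference of Time, multiplied by Value; the first row of each
# group has no previous Time, so its NaN product is filled with 0.
df['Product'] = (df.groupby(level=0)['Time'].diff() * df['Value']).fillna(0)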
Hey this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time': [1, 2, 5, 1, 2, 10], 'Value': [1, 5, 7, 5, 9, 11]},
                  index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

def product(x):
    # Within each top-level group: difference with the previous row's Time, times Value
    x['Product'] = (x['Time'] - x.shift()['Time']) * x['Value']
    return x

df = df.groupby(level=0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)