Suppose you want to construct a pd.DataFrame and get different numbers every time you increase the replicate number. (Please scroll down for a reproducible example in R.)
I would like to get the same output with Python, but I don't know how to get there!
Consider this simple pd.DataFrame:
df = pd.DataFrame({
    'a': [np.random.normal(0.27, 0.01, 5), np.random.normal(1, 0.01, 5)]})
df
a
0 [0.268297564096, 0.252974100195, 0.27613413347...
1 [0.996267313891, 1.00497494738, 1.022271644, 1...
I don't know why the data look like this. When I use only one np.random.normal call, I get this:
a
0 0.092309
1 0.085985
2 0.083635
3 0.081582
4 0.104096
Sorry, I cannot explain this behaviour. I am new to pandas; maybe you can explain it.
OK, let's get back to the original question.
To generate the second group of numbers, I guess I should use np.repeat:
df = pd.DataFrame({'a': np.repeat(np.random.normal(0.10, 0.01, 10), 2)})
df
Out[59]:
a
0 0.090305
1 0.090305
2 0.109092
3 0.109092
4 0.101706
5 0.101706
6 0.087357
7 0.087357
8 0.099094
9 0.099094
10 0.101595
11 0.101595
12 0.100343
13 0.100343
14 0.085380
15 0.085380
16 0.102118
17 0.102118
18 0.107328
19 0.107328
But np.repeat just generates the same numbers twice, which is not the output I want.
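To illustrate, np.repeat duplicates each element in place instead of drawing new values:
np.repeat([1, 2, 3], 2)
# array([1, 1, 2, 2, 3, 3])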
Here is the approach in R:
df <- data.frame(y = do.call(c, replicate(n = 2,
                                          expr = c(rnorm(5,0.10,0.01), rnorm(5,1,0.01)),
                                          simplify = FALSE)),
                 gr = rep(seq(1,2), each = 10))
y gr
1 0.11300203 1
2 0.11840556 1
3 0.09420799 1
4 0.10480623 1
5 0.08561427 1
6 1.00076001 1
7 1.00035891 1
8 1.00936751 1
9 1.00050563 1
10 1.00564799 1
11 0.09415217 2
12 0.10794155 2
13 0.11534605 2
14 0.08806740 2
15 0.12394189 2
16 0.99330066 2
17 0.98254134 2
18 0.99828079 2
19 1.00786526 2
20 0.97864180 2
Basically, in R you can do this in a pretty straightforward way, but I guess in Python one has to write a function for it.
In R you can generate normally distributed numbers with rnorm, and in numpy we can do that with np.random.normal. But I could not find any built-in equivalent of do.call in particular.
Actually, in R you do not need do.call():
set.seed(95)
df <- data.frame(y = c(rnorm(10,0.10,0.01), rnorm(10,1,0.01)),
                 gr = c(rep(1,10), rep(2,10)))
df
# y gr
# 1 0.08970880 1
# 2 0.08384474 1
# 3 0.09972121 1
# 4 0.09678872 1
# 5 0.11880371 1
# 6 0.10696807 1
# 7 0.09135123 1
# 8 0.08925115 1
# 9 0.10994412 1
# 10 0.09769954 1
# 11 1.01486420 2
# 12 1.01533145 2
# 13 1.01454184 2
# 14 0.99125878 2
# 15 0.98222886 2
# 16 1.00128867 2
# 17 0.97588819 2
# 18 0.98216944 2
# 19 0.99982671 2
# 20 0.99090591 2
And with Python pandas/numpy, consider concatenating the arrays with np.concatenate:
import pandas as pd
import numpy as np
np.random.seed(89)
df = pd.DataFrame({'y': np.concatenate([np.random.normal(0.1, 0.01, 10),
                                        np.random.normal(1, 0.01, 10)]),
                   'gr': [1]*10 + [2]*10})
print(df)
# gr y
# 0 1 0.083063
# 1 1 0.099979
# 2 1 0.095741
# 3 1 0.097444
# 4 1 0.096942
# 5 1 0.100405
# 6 1 0.099316
# 7 1 0.087978
# 8 1 0.098175
# 9 1 0.091204
# 10 2 0.997568
# 11 2 1.006740
# 12 2 1.003449
# 13 2 0.993747
# 14 2 0.997935
# 15 2 0.991284
# 16 2 0.991299
# 17 2 1.003981
# 18 2 0.993347
# 19 2 1.001337
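As a side note on the first attempt above: the dict value there was a list of two 5-element arrays, so pandas stored each whole array as a single cell, which is why the column printed as two array-valued rows; np.concatenate flattens them into 20 scalar rows instead. And to mirror R's replicate(n, ...) more directly, a minimal sketch (the replicate count n and the seed are arbitrary here) is to draw both groups inside a loop and label each replicate:
import numpy as np
import pandas as pd
np.random.seed(89)
n = 2  # number of replicates, like replicate(n = 2, ...) in R
y = np.concatenate([np.concatenate([np.random.normal(0.1, 0.01, 5),
                                    np.random.normal(1, 0.01, 5)])
                    for _ in range(n)])
gr = np.repeat(np.arange(1, n + 1), 10)  # one group label per replicate
df = pd.DataFrame({'y': y, 'gr': gr})
Each pass through the loop draws fresh numbers, so increasing n adds new values rather than repeating old ones.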
Not sure if this is what you wanted, but you could generate each set of random numbers with a list comprehension, as shown below.
df = pd.DataFrame({'a': np.append([np.random.normal(0.10, 0.01, 5) for _ in range(2)],
                                  [np.random.normal(1, 0.01, 5) for _ in range(2)])})
df is then
a
0 0.105469
1 0.091046
2 0.091626
3 0.104579
4 0.110971
5 0.076754
6 0.104674
7 0.096062
8 0.103571
9 0.089955
10 0.978489
11 0.997081
12 1.009864
13 1.000333
14 0.998483
15 1.010685
16 1.004473
17 1.001833
18 1.007723
19 0.999845
Related
How can I retrieve column names from a call to DataFrame.apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the amount and names of which I don't know at build-time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
    return pd.Series(col * 2, name=col.name+'2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: the values here have been doubled again because df was already modified above, so I'm missing something, but the question about the columns stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
In [8]: x = df.apply(fxn) # Apply function on df and store result separately
In [10]: df = pd.concat([df, x], axis=1) # Concat with original df to get all columns
Rename duplicate column names by adding suffixes:
In [82]: from collections import Counter
In [38]: mylist = df.columns.tolist()
In [41]: d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(mylist).items()}
In [62]: df.columns = [i+str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
In [63]: df
Out[63]:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
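Note that this renames the original columns as well (A becomes A1 and B becomes B1); keep that in mind if you need the original names unchanged.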
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Output:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
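This works because df.columns + '2' builds the new labels ('A2', 'B2') from whatever columns the frame happens to have, so you don't need to know the names in advance; it does assume the result of apply comes back in the same column order as df.columns.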
Alternatively, you can leverage the #MayankPorwal answer by using .add_suffix('2') to the output from your apply function:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
In your function, name=col.name+'2' does nothing (it effectively returns just col * 2), because apply assigns the values back to the original columns.
Anyway, it's possible to take the MayankPorwal approach: pd.concat + managing duplicated columns (making them unique). Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount()+1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
which returns the same output and additionally manages further duplicated columns.
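Here groupby(df.columns).cumcount() numbers each repeated column name 0, 1, 2, ..., so after adding 1 the duplicates get suffixes '1', '2', ...; stripping the '1' leaves the first occurrence with its original label.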
Answer on behalf of the OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
    length = row.iloc[0]
    indices = [row.index[0] + str(i) for i in range(length)]
    series = pd.Series([i for i in range(length)], index=indices)
    return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)
I'm having problems with the rolling() method: it returns several outputs even though my function returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns in each df.
Sum all values
I can do this using iterrows(), but this method is inefficient on larger datasets.
This is the test data I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want using iterrows():
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values)-1)*100))
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs((((df2.values / vals.reset_index(drop=True).values)-1)*100))
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSums = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums outputs several columns of values, nothing close to the results from the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
It's just the way rolling behaves: it windows over each column separately, and I don't know that there is a way around it. One solution is to apply rolling to a single column and use the indexes from those windows to slice the dataframe inside your function. Still expensive, but probably not as bad as what you're doing.
Also the output of your first method looks wrong. You're actually starting your calculations a few rows too late.
import numpy as np
def SumOfAverageFunction(vals):
    return (abs(np.divide(df2.values, df1.loc[vals.index].values)-1)*100).sum()
vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
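A note on the design: raw=False makes rolling pass each window to the function as a Series rather than a bare ndarray, so vals.index carries the row labels of the window and df1.loc[vals.index] can pull the full three-column slice back out.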
I have a dataframe like this:
abc
9 32.242063
3 24.419279
8 25.464011
6 25.029761
10 18.851918
2 26.027582
1 27.885187
4 20.141231
5 31.179138
7 22.893074
11 31.640625
0 33.150434
I want to subtract the first row from 100, then subtract the 2nd row from the remainder (100 - first row), and so on.
I tried:
a = 100 - df["abc"]
but every time it subtracts from 100.
Can anybody suggest the correct way to do it?
It seems you need a cumulative sum: subtracting each row from the running remainder is the same as subtracting the running total from 100.
df['new'] = 100 - df['abc'].cumsum()
print (df)
abc new
9 32.242063 67.757937
3 24.419279 43.338658
8 25.464011 17.874647
6 25.029761 -7.155114
10 18.851918 -26.007032
2 26.027582 -52.034614
1 27.885187 -79.919801
4 20.141231 -100.061032
5 31.179138 -131.240170
7 22.893074 -154.133244
11 31.640625 -185.773869
0 33.150434 -218.924303
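(The new values go negative from the fourth row on simply because the abc column sums to well over 100.)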
Option 1
np.cumsum -
df["abc"] = 100 - np.cumsum(df.abc.values)
df
abc
9 67.757937
3 43.338658
8 17.874647
6 -7.155114
10 -26.007032
2 -52.034614
1 -79.919801
4 -100.061032
5 -131.240170
7 -154.133244
11 -185.773869
0 -218.924303
This is faster than pd.Series.cumsum in the other answer.
Option 2
Loopy equivalent, cythonized.
%load_ext Cython
%%cython
def foo(r):
    x = [100 - r[0]]
    for i in r[1:]:
        x.append(x[-1] - i)
    return x
df['abc'] = foo(df['abc'].values)  # pass the values so r[0] is positional, not label-based
df
abc
9 67.757937
3 43.338658
8 17.874647
6 -7.155114
10 -26.007032
2 -52.034614
1 -79.919801
4 -100.061032
5 -131.240170
7 -154.133244
11 -185.773869
0 -218.924303
Short version: I have a slightly trickier than usual merge operation I'd like help optimizing with dplyr or merge. I have a number of solutions already, but these run quite slowly over large datasets, and I am curious whether there exists a faster method in R (or alternatively in SQL or Python).
I have two data.frames:
an asynchronous log of events tied to Stores, and
a table that gives more details about the stores in that log.
The issue: Store IDs are unique identifiers for a specific location, but store locations may change ownership from one period to the next (and, just for completeness, no two owners may possess the same store at the same time). So when I merge in store-level info, I need some sort of conditional that merges the info for the correct period.
Reproducible Example:
# asynchronous log.
# t for period.
# Store for store loc ID
# var1 just some variable.
set.seed(1)
df <- data.frame(
  t = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1 = runif(11,0,1)
)
# Store table
# You can see, lots of store location opening and closing,
# StartDate is when this business came into existence
# Store is the store id from df
# CloseDate is when this store went out of business
# storeVar1 is just some important var to merge over
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)
Now, I only want to merge over information in Store d.f. to log, if that Store is open for business in that period (t). CloseDate and StartDate indicate the last and first periods of this business's operation, respectively. (For completeness but not too important, with StartDate 0 the store existed since before the sample. For CloseDate 9 the store hadn't gone out of business at that location by the end of the sample.)
One solution relies on a period t level split() and dplyr::rbind_all(), e.g.
# The following seems to do the trick.
complxMerge_v1 <- function(df, Stores, by = "Store"){
  library("dplyr")
  temp <- split(df, df$t)
  for (Period in names(temp)) {
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores,
                    StartDate <= as.numeric(Period) &
                    CloseDate >= as.numeric(Period)),
      by = "Store"
    )
  }
  df <- dplyr::rbind_all(temp); rm(temp)
  df
}
complxMerge_v1(df, Stores, "Store")
Functionally, this appears to work (I haven't come across a significant error yet, anyway). However, we are dealing with (increasingly common) billions of rows of log data.
I made a larger reproducible example on sense.io if you'd like to use it for bench-marking. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals
Two questions:
First and foremost, is there another way to approach this problem using similar methods that will run faster?
Is there by chance a quick and easy solution in SQL or Python (with which I am not quite as familiar, but could rely on if need be)?
Also, can you help me articulate this question in a more general, abstract way? Right now I only know how to talk about the problem in context specific terms, but I'd love to be able to talk about these types of issues with more appropriate, but more general programming or data manipulation terminologies.
In R, you could take a look at the data.table::foverlaps function:
library(data.table)
# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]
setkey(df, Store, StartDate, CloseDate)
# Run `foverlaps` function
foverlaps(setDT(Stores), df)
# Store t var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
# 1: 1 1 0.26550866 1 1 0 9 a
# 2: 1 2 0.90820779 2 2 0 9 a
# 3: 1 3 0.94467527 3 3 0 9 a
# 4: 1 4 0.62911404 4 4 0 9 a
# 5: 2 1 0.37212390 1 1 0 2 b
# 6: 2 2 0.20168193 2 2 0 2 b
# 7: 3 1 0.57285336 1 1 0 3 c
# 8: 3 2 0.89838968 2 2 0 3 c
# 9: 3 3 0.66079779 3 3 0 3 c
# 10: 2 4 0.06178627 4 4 4 9 d
# 11: 3 4 0.20597457 4 4 4 9 e
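What foverlaps does here is an overlap (interval) join: each log row is given a zero-width interval [t, t], and it is matched against the store rows whose [StartDate, CloseDate] interval contains it.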
You can transform your Stores data.frame by adding a t column that contains all values of t for a given Store, and then use the unnest function from Hadley's tidyr package to transform it to "long" form.
require("tidyr")
require("dplyr")
complxMerge_v2 <- function(df, Stores, by = NULL) {
  Stores %>% mutate(., t = lapply(1:nrow(.),
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"])) %>%
    unnest(t) %>% left_join(df, ., by = by)
}
complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")
microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962 10
# complxMerge_v2(df, Stores) 532.744 539.743 567.7207 561.9635 588.0637 636.5775 10
Here are step-by-step results to make the process clear.
Stores_with_t <-
  Stores %>% mutate(., t = lapply(1:nrow(.),
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 0 2 2 b 0, 1, 2
# 3 0 3 3 c 0, 1, 2, 3
# 4 4 2 9 d 4, 5, 6, 7, 8, 9
# 5 4 3 9 e 4, 5, 6, 7, 8, 9
# After that `unnest(t)`
Stores_with_t_unnest <-
  Stores_with_t %>% unnest(t)
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0
# 2 0 1 9 a 1
# 3 0 1 9 a 2
# 4 0 1 9 a 3
# 5 0 1 9 a 4
# 6 0 1 9 a 5
# 7 0 1 9 a 6
# 8 0 1 9 a 7
# 9 0 1 9 a 8
# 10 0 1 9 a 9
# 11 0 2 2 b 0
# 12 0 2 2 b 1
# 13 0 2 2 b 2
# 14 0 3 3 c 0
# 15 0 3 3 c 1
# 16 0 3 3 c 2
# 17 0 3 3 c 3
# 18 4 2 9 d 4
# 19 4 2 9 d 5
# 20 4 2 9 d 6
# 21 4 2 9 d 7
# 22 4 2 9 d 8
# 23 4 2 9 d 9
# 24 4 3 9 e 4
# 25 4 3 9 e 5
# 26 4 3 9 e 6
# 27 4 3 9 e 7
# 28 4 3 9 e 8
# 29 4 3 9 e 9
# And then simple `left_join`
left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
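As for the Python part of the question: in general terms this is a non-equi or interval join (the pattern foverlaps implements). A minimal pandas sketch, assuming DataFrame equivalents of df and Stores (the var1 values below are placeholders rather than the runif draws):
import pandas as pd
# Hypothetical pandas equivalents of the df/Stores frames above
log = pd.DataFrame({'t':     [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
                    'Store': [1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3],
                    'var1':  range(11)})  # placeholder values
stores = pd.DataFrame({'StartDate': [0, 0, 0, 4, 4],
                       'Store':     [1, 2, 3, 2, 3],
                       'CloseDate': [9, 2, 3, 9, 9],
                       'storeVar1': list('abcde')})
# Many-to-many merge on Store, then keep only rows whose period t
# falls inside that owner's [StartDate, CloseDate] window.
merged = log.merge(stores, on='Store')
merged = merged[merged['t'].between(merged['StartDate'], merged['CloseDate'])]
The merge-then-filter step materializes every candidate pair first, so on billions of rows you would want a dedicated interval-join tool instead, but it expresses the condition directly.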
I have a pd.DataFrame:
Time Value
a 1 1 1
2 2 5
3 5 7
b 1 1 5
2 2 9
3 10 11
I want to multiply the column Value by the difference Time - Time(t-1) and write the result to a column Product, starting with row b, but separately for each top-level index.
For example, Product('1','b') should be (Time('1','b') - Time('1','a')) * Value('1','b'). To do this, I would need a "shifted" version of column Time "starting" at row b so that I could do df["Product"] = (df["Time"] - df["Time"].shifted) * df["Value"]. The result should look like this:
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
This should do it:
>>> time_shifted = df['Time'].groupby(level=0).apply(lambda x: x.shift())
>>> df['Product'] = ((df.Time - time_shifted)*df.Value).fillna(0)
>>> df
Time Value Product
a 1 1 1 0
2 2 5 5
3 5 7 21
b 1 1 5 0
2 2 9 9
3 10 11 88
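The fillna(0) handles the first row of each group, where the shift produces NaN because there is no previous Time to subtract.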
Hey, this should do what you need it to. Comment if I missed anything.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time': [1,2,5,1,2,10], 'Value': [1,5,7,5,9,11]},
                  index=[['a','a','a','b','b','b'], [1,2,3,1,2,3]])
def product(x):
    x['Product'] = (x['Time'] - x.shift()['Time']) * x['Value']
    return x
df = df.groupby(level=0).apply(product)
df['Product'] = df['Product'].replace(np.nan, 0)
print(df)
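(The replace(np.nan, 0) step is equivalent to fillna(0) here.)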