I wrote a small class to compute some statistics through bootstrap without replacement. For those not familiar with this technique, you get n random subsamples of some data, compute the desired statistic (lets say the median) on each subsample, and then compare the values across subsamples. This allows you to get a measure of variance on the obtained median over the dataset.
I implemented this in a class but reduced it to a MWE given by the following function
import numpy as np
import pandas as pd
def bootstrap_median(df, n=5000, fraction=0.1):
if isinstance(df, pd.DataFrame):
columns = df.columns
else:
columns = None
# Get the values as a ndarray
arr = np.array(df.values)
# Get the bootstrap sample through random permutations
sample_len = int(len(arr)*fraction)
if sample_len<1:
sample_len = 1
sample = []
for n_sample in range(n):
sample.append(arr[np.random.permutation(len(arr))[:sample_len]])
sample = np.array(sample)
# Compute the median on each sample
temp = np.median(sample, axis=1)
# Get the mean and std of the estimate across samples
m = np.mean(temp, axis=0)
s = np.std(temp, axis=0)/np.sqrt(len(sample))
# Convert output to DataFrames if necesary and return
if columns:
m = pd.DataFrame(data=m[None, ...], columns=columns)
s = pd.DataFrame(data=s[None, ...], columns=columns)
return m, s
This function returns the mean and standard deviation across the medians computed on each bootstrap sample.
Now consider this example DataFrame
data = np.arange(20)
group = np.tile(np.array([1, 2]).reshape(-1,1), (1,10)).flatten()
df = pd.DataFrame.from_dict({'data': data, 'group': group})
print(df)
print(bootstrap_median(df['data']))
this prints
data group
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
(9.5161999999999995, 0.056585753613431718)
So far so good because bootstrap_median returns a tuple of two elements. However, if I do this after a groupby
In: df.groupby('group')['data'].apply(bootstrap_median)
Out:
group
1 (4.5356, 0.0409710449952)
2 (14.5006, 0.0403772204095)
The values inside each cell are tuples, as one would expect from apply. I can unpack the result into two DataFrame's by iterating over elements like this:
index = []
data1 = []
data2 = []
for g, (m, s) in out.iteritems():
index.append(g)
data1.append(m)
data2.append(s)
dfm = pd.DataFrame(data=data1, index=index, columns=['E[median]'])
dfm.index.name = 'group'
dfs = pd.DataFrame(data=data2, index=index, columns=['std[median]'])
dfs.index.name = 'group'
thus
In: dfm
Out:
E[median]
group
1 4.5356
2 14.5006
In: dfs
Out:
std[median]
group
1 0.0409710449952
2 0.0403772204095
This is a bit cumbersome and my question is if there is a more pandas native way to "unpack" a dataframe whose values are tuples into separate DataFrame's
This question seemed related but it concerned string regex replacements and not unpacking true tuples.
I think you need change:
return m, s
to:
return pd.Series([m, s], index=['m','s'])
And then get:
df1 = df.groupby('group')['data'].apply(bootstrap_median)
print (df1)
group
1 m 4.480400
s 0.040542
2 m 14.565200
s 0.040373
Name: data, dtype: float64
So is possible select by xs:
print (df1.xs('s', level=1))
group
1 0.040542
2 0.040373
Name: data, dtype: float64
print (df1.xs('m', level=1))
group
1 4.4804
2 14.5652
Name: data, dtype: float64
Also if need one column DataFrame add to_frame:
df1 = df.groupby('group')['data'].apply(bootstrap_median).to_frame()
print (df1)
data
group
1 m 4.476800
s 0.041100
2 m 14.468400
s 0.040719
print (df1.xs('s', level=1))
data
group
1 0.041100
2 0.040719
print (df1.xs('m', level=1))
data
group
1 4.4768
2 14.4684
Related
I am having a data frame of four columns. I want to find the minimum among the first two columns and the last two columns for each row.
Code:
np.random.seed(0)
xdf = pd.DataFrame({'a':np.random.rand(1,10)[0]*10,'b':np.random.rand(1,10)[0]*10,'c':np.random.rand(1,10)[0]*10,'d':np.random.rand(1,10)[0]*10,},index=np.arange(0,10,1))
xdf['ab_min'] = xdf[['a','b']].min(axis=1)
xdf['cd_min'] = xdf[['c','d']].min(axis=1)
xdf['minimum'] = xdf['ab_min'].list()+xdf['cd_min'].list()
Expected answer:
xdf['minimum']
0 [ab_min,cd_min]
1 [ab_min,cd_min]
2 [ab_min,cd_min]
3 [ab_min,cd_min]
Present answer:
AttributeError: 'Series' object has no attribute 'list'
Select the columns ab_min and cd_min then use to_numpy to convert it to numpy array and assign the result to minimum column
xdf['minimum'] = xdf[['ab_min', 'cd_min']].to_numpy().tolist()
>>> xdf['minimum']
0 [3.23307959607905, 1.9836323494587338]
1 [6.189440334168731, 1.0578078219990983]
2 [3.1194570407645217, 1.2816570607783184]
3 [1.9170068676155894, 7.158027504597937]
4 [0.6244579166416464, 8.568849995324166]
5 [4.108986697339397, 0.6201685780268684]
6 [4.170639127277155, 2.3385281968695693]
7 [2.0831140755567814, 5.94063873401418]
8 [0.4887113296319978, 6.380570614449363]
9 [2.844815261473105, 0.9146457613970793]
Name: minimum, dtype: object
try this:
import pandas as pd
import numpy as np
xdf = pd.DataFrame({'a':np.random.rand(1,10)[0]*10,'b':np.random.rand(1,10)[0]*10,'c':np.random.rand(1,10)[0]*10,'d':np.random.rand(1,10)[0]*10,},index=np.arange(0,10,1))
print(xdf)
ab = xdf['ab_min'] = xdf[['a','b']].min(axis=1)
cd = xdf['cd_min'] = xdf[['c','d']].min(axis=1)
blah = pd.concat([ab, cd], axis=1)
print(blah)
results:
You can use .apply with a lambda function along axis=1:
xdf['minimum'] = xdf.apply(lambda x: [x[['a','b']].min(),x[['c','d']].min()], axis=1)
Result:
>>> xdf
a b c d minimum
0 0.662634 4.166338 8.864823 9.004818 [0.6626341544146663, 8.864822751494284]
1 6.854054 6.163417 6.510728 0.049498 [6.163416966676091, 0.04949754019059838]
2 6.389760 4.462319 2.435369 3.732534 [4.462318678134215, 2.4353686460846893]
3 4.628735 7.571098 1.900726 9.046384 [4.628735362058981, 1.9007255361271058]
4 3.203285 4.364302 2.473973 2.911911 [3.203285015796596, 2.4739732602476727]
5 5.357440 3.166420 9.908758 0.910704 [3.166420385020304, 0.91070444348338]
6 8.120486 6.395869 0.970977 5.278279 [6.395868901095546, 0.9709769503958143]
7 1.574765 7.184971 3.835641 4.495135 [1.574765093192545, 3.835640598199231]
8 8.688497 0.069061 0.771772 8.971878 [0.06906065557899743, 0.7717717844423222]
9 5.455920 2.630342 1.966357 7.374366 [2.6303421168291843, 1.966357159086991]
I have the following data structure:
The columns s and d are indicating the transition of object in column x. What I want to do is get a transition string per object present in the column x. For e.g. with a new column as follows:
Is there a lean way to do it using pandas, without using too many loops?
This was the code I tried:
obj = df['x'].tolist()
rows = []
for o in obj:
locs = df[df['x'] == o]['s'].tolist()
str_locs = '->'.join(str(l) for l in locs)
print(str_locs)
d = dict()
d['x'] = o
d['new'] = str_locs
rows.append(d)
tmp = pd.DataFrame(rows)
This give the output temp as:
x new
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
b 1->2
b 1->2
Example df:
df = pd.DataFrame({"x":["a","a","a","a","b","b"], "s":[1,2,4,8,5,11],"d":[2,4,8,9,11,12]})
print(df)
x s d
0 a 1 2
1 a 2 4
2 a 4 8
3 a 8 9
4 b 5 11
5 b 11 12
Following code will generate a transition string of all objects present in the column x.
groupby with respect to column x and get list of lists of s and d for every object available in x
Merge the list of lists sequentially
Remove consecutive duplicates from the merged list using itertools.groupby
Join the items of merged list with -> to make it a single string.
Finally map the series to column x of input df
from itertools import groupby
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
df["new"] = df["x"].map(sr)
print(df)
x s d new
0 a 1 2 1->2->4->8->9
1 a 2 4 1->2->4->8->9
2 a 4 8 1->2->4->8->9
3 a 8 9 1->2->4->8->9
4 b 5 11 5->11->12
5 b 11 12 5->11->12
Using the same example from here but just changing the 'A' column to be something that can easily be grouped by:
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df["A"] = pd.Series([1]*3+ [2]*8)
df.head()
whose output now is:
Date A B C D E F G
0 2008-03-18 1 164.93 114.73 26.27 19.21 28.87 63.44
1 2008-03-19 1 164.89 114.75 26.22 19.07 27.76 59.98
2 2008-03-20 1 164.63 115.04 25.78 19.01 27.04 59.61
3 2008-03-25 2 163.92 114.85 27.41 19.61 27.84 59.41
4 2008-03-26 2 163.45 114.84 26.86 19.53 28.02 60.09
5 2008-03-27 2 163.46 115.40 27.09 19.72 28.25 59.62
6 2008-03-28 2 163.22 115.56 27.13 19.63 28.24 58.65
Doing the cumulative sums (code from the linked question) works well when we're assuming it's a single list:
# Put your inputs into a single list
input_cols = ["B", "C"]
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
# Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors1'] = df["single_input_vector"].cumsum()
but how do I cumsum the lists in this case grouped by 'A'? I expected this to work but it doesnt:
df['cumu'] = df.groupby("A")["single_input_vector"].apply(lambda x: list(x)).cumsum()
Instead of [[164.93, 114.73, 26.27], [164.89, 114.75, 26.... I get some rows filled in, others are NaN's. This is what I want (cols [B,C] accumulated into groups of col A):
A cumu
0 1 [[164.93,114.73], [164.89,114.75], [164.63,115.04]]
0 2 [[163.92,114.85], [163.45,114.84], [163.46,115.40], [163.22, 115.56]]
Also, how do I do this in an efficient manner? My dataset is quite big (about 2 million rows).
It doesn't look like your doing arithmetic sum, more like a concat along axis=1
First groupby and concat
temp_series = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])])
0 [[164.93, 114.73], [164.89, 114.75], [164.63, ...
1 [[163.92, 114.85], [163.45, 114.84], [163.46, ...
then convert back to a dataframe
df = temp_series.reset_index().rename(columns={0: 'cumsum'})
In one line
df = df.groupby('A').apply(lambda x: [[a,b] for a, b in zip(x['B'], x['C'])]).reset_index().rename(columns={0: 'cumsum'})
I am using dask dataframe.groupby().apply()
and get a dask series as a return value.
I am each group to a list triplets such as (a,b,1) and wish then to turn all the triplets into a single dask data frame
I am using this code in the end of the mapping function to return the triplets as a dask df
#assume here that trips is a generator for tripletes such as you would produce from itertools.product([l1,l2,l3])
trip = list(itertools.chain.from_iterable(trip))
df = pd.DataFrame.from_records(trip)
return dd.from_pandas(df,npartitions=1)
then when I try to use something similar to pandas concat with dask concatenate
Assume the result of the apply function is the variable result.
I am trying to use
import dask.dataframe as dd
dd.concat(result, axis=0
and get the error
raise TypeError("dfs must be a list of DataFrames/Series objects")
TypeError: dfs must be a list of DataFrames/Series objects
But when I check for the type of result using
print type(result)
I get
output: class 'dask.dataframe.core.Series'
What is the proper way to apply a function over groups of dask groupby object and get all the results into one dataframe?
Thanks
edit:--------------------------------------------------------------
in order to produce the use case, assume this fake data generation
import random
import pandas as pd
import dask.dataframe as dd
people = [[random.randint(1,3), random.randint(1,3), random.randint(1,3)] for i in range(1000)]
ddf = dd.from_pandas(pd.DataFrame.from_records(people, columns=["first name", "last name", "cars"]), npartitions=1)
Now my mission is to group people by first and last name (e.g all the people with same first name & first last name) and than I need to get a new dask data frame which will contain how many cars each group had.
Assume that the apply function can return either a series of lists of tuples e.g [(name,name,cars count),(name,name,cars count)] or a data frame with the same columns - name, name, car count.
Yes, I know that particular use case can be solved in another way, but please trust me, my use case is more complex. But i can not share the data and can not generate any similar data. so let's use a dummy data :-)
The challenge is to connect all the results of the apply into a single dask data frame (pandas data frame will be a problem here, data will not fit in memory - so transitions via a pandas data frame will be a problem)
For me working if output of apply is pandas DataFrame, so last if necessary convert to dask DataFrame:
def f(x):
trip = ((1,2,x) for x in range(3))
df = pd.DataFrame.from_records(trip)
return df
df1 = ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
#only for remove MultiIndex
df1 = df1.reset_index()
print (df1)
cars level_1 x y z
0 1 0 1 2 0
1 1 1 1 2 1
2 1 2 1 2 2
3 2 0 1 2 0
4 2 1 1 2 1
5 2 2 1 2 2
6 3 0 1 2 0
7 3 1 1 2 1
8 3 2 1 2 2
ddf1 = dd.from_pandas(df1,npartitions=1)
print (ddf1)
cars level_1 x y z
npartitions=1
0 int64 int64 int64 int64 int64
8 ... ... ... ... ...
Dask Name: from_pandas, 1 tasks
EDIT:
L = []
def f(x):
trip = ((1,2,x) for x in range(3))
#append each
L.append(da.from_array(np.array(list(trip)), chunks=(1,3)))
ddf.groupby('cars').apply(f, meta={'x': 'i8', 'y': 'i8', 'z': 'i8'}).compute()
dar = da.concatenate(L, axis=0)
print (dar)
dask.array<concatenate, shape=(12, 3), dtype=int32, chunksize=(1, 3)>
For your edit:
In [8]: ddf.groupby(['first name', 'last name']).cars.count().compute()
Out[8]:
first name last name
1 1 107
2 107
3 110
2 1 117
2 120
3 99
3 1 119
2 103
3 118
Name: cars, dtype: int64
I have written code to iterate through a dataset that has a demarcation column. This column consist of a value shared by all equally demarked rows. The code iterate through each demarcated section with a nested loop to iterate through each line, finding the nearest neighbor for each row in its respective demarcated block
import pandas as pd
import numpy as np
Create a df with XYZ and Section demark
p=5
df = pd.DataFrame(np.random.randn(100, 3), columns=list('XYZ'))
df2 = df.sort('Z')
df2 = df2.reset_index(drop=True)
df2['Section_demark'] = (df2.index/p).astype('int')
df2.head(15)
X Y Z Section_demark
0 -1.125526 -0.249091 -2.505444 0
1 0.710114 1.357477 -2.195904 0
2 -0.580319 -0.997311 -2.031280 0
3 1.311526 -0.268590 -1.741079 0
4 0.481450 0.448904 -1.546278 0
5 -1.820224 -0.846628 -1.392700 1
6 0.528618 0.418862 -1.388170 1
7 0.360560 -0.309429 -1.319548 1
8 -0.369107 -1.290528 -1.233815 1
9 0.139063 0.045076 -1.209820 1
10 0.049387 1.087300 -1.188375 2
11 0.678247 -1.191882 -1.172214 2
12 -0.976294 -0.752081 -1.092286 2
13 0.875952 0.319304 -1.079185 2
14 0.469730 -0.329548 -1.044178 2
Function for euclidean distance
def eucl_d(item_id):
a = df3.sub(df3.iloc[item_id], axis=1)
b = np.sum( np.square(a), axis=1 )
return b
Iterate through the section demarks, iterate through the lines in each Section_demark and find nearest neighbor,
Isolate the row nearest to top row and create a series, take the ix for that series and compile a list from it.
read the list back to df2, creating a new column with the Nearest neighbor index number as value
s=0
elements = []
while s<(len(df2)/p):
df3 = df2[df2['Section_demark']==(s)]
r=0
while r<(p):
df4=df3.copy()
df4['dist'] = eucl_d(r)
df4 = df4.sort('dist')
ser = df4.iloc[1]
elements.append(ser.name)
r=r+1
s=s+1
df2["NNIX"] = elements
df2.head(10)
X1 Y1 Z1 NNIX
0 0.002299 1.284195 -1.604009 1
1 -0.444305 0.346856 -2.396538 0
2 -0.490741 -1.416682 -1.423573 3
3 0.203635 -0.676841 -1.596332 2
4 0.002299 1.284195 -1.604009 1
5 -0.314330 0.036554 -1.153127 6
6 -0.387839 0.129000 -1.235331 5
7 -0.314330 0.036554 -1.153127 6
8 -0.059477 -0.205260 -1.136376 7
9 0.717980 0.130665 -1.040372 8
I would like to exchange the last section of iteration with a groupby command and use aggregate or apply to run the eucl_d function, but it eludes me
I can get df2 grouped by running this:
grouped = df3.groupby('Section_demark')
Its the second step that is giving me trouble
I was thinking:
grouped.agg(eucl_d(item_id))
But I dont know how to specify the item_id for eucl_d(item_id)