Append results of DataFrame apply lambda to DataFrame or new Series - python

I am using the apply method with a lambda to compute on each row of a DataFrame and return a Series.
statsSeries = matchData.apply(lambda row: mytest(row), axis=1)
where mytest(row) is a function that returns timestamp, float, float.
def mytest(row):
    timestamp = row['timestamp']
    wicketsPerOver = row['wickets']/row['overs']
    runsPerWicket = row['runs']/row['wickets']
    return timestamp, wicketsPerOver, runsPerWicket
As I have written it, statsSeries contains two columns: an index, and a single column of (timestamp, wicketsPerOver, runsPerWicket) tuples.
How can I return a Series with three columns [timestamp, wicketsPerOver, runsPerWicket]?

It appears you need to call .apply(pd.Series) on the resulting Series of tuples.
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({0: [1, 2, 3, 4]})
def add_some(row):
    return row[0]+1, row[0]+2, row[0]+3
df[[1, 2, 3]] = df.apply(add_some, axis=1).apply(pd.Series)
print(df)
   0  1  2  3
0  1  2  3  4
1  2  3  4  5
2  3  4  5  6
3  4  5  6  7
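A sketch of an alternative, assuming matchData and the column names from the question's mytest: if the row function itself returns a pd.Series, DataFrame.apply builds named columns directly, so the second .apply(pd.Series) pass is unnecessary.
import pandas as pd
def mytest(row):
    # returning a Series makes apply expand the result into named columns
    return pd.Series({
        'timestamp': row['timestamp'],
        'wicketsPerOver': row['wickets'] / row['overs'],
        'runsPerWicket': row['runs'] / row['wickets'],
    })
stats = matchData.apply(mytest, axis=1)  # DataFrame with three named columns
On pandas 0.23+ the same expansion is also available as df.apply(func, axis=1, result_type='expand').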

Related

Count values of each row in pandas dataframe only for consecutive numbers

I got a pandas dataframe that looks like this:
I want to count how many rows there are for each id and print the result. The problem is that I want to count ONLY consecutive numbers in "frame num".
For example: if frame num is [1,2,3,45,47,122,123,124,125] and id is [1,1,1,1,1,1,1,1,1], it should print: 3 1 1 4 (and do that for EACH id).
Is there any way to do that? I went crazy trying to figure it out! To count rows for each id it would be enough to use a GROUP BY, but with this new condition it's difficult.
You can use pandas.DataFrame.shift() to find consecutive numbers, then itertools.groupby to build the list of run counts.
import pandas as pd
from itertools import chain
from itertools import groupby
# Example input dataframe
df = pd.DataFrame({
    'num': [1,2,3,45,47,122,123,124,125,1,2,3,45,47,122,123,124,125],
    'id':  [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2]
})
df['s'] = (df['num']-1 == df['num'].shift()) | (df['num']+1 == df['num'].shift(-1))
res = df.groupby('id')['s'].apply(lambda g: list(chain.from_iterable(
    [len(list(group))] if key else [1]*len(list(group))
    for key, group in groupby(g))))
print(res)
Output:
id
1 [3, 1, 1, 4]
2 [3, 1, 1, 4]
Name: s, dtype: object
Update: Get the output as a dataframe:
>>> res.to_frame().explode('s').reset_index()
   id  s
0   1  3
1   1  1
2   1  1
3   1  4
4   2  3
5   2  1
6   2  1
7   2  4
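An alternative sketch without itertools, using the common diff/cumsum idiom to label each consecutive run (same example df as above):
# a new run starts whenever the gap to the previous num within an id is not 1
breaks = df.groupby('id')['num'].diff().ne(1).cumsum()
# run length per (id, run) pair, in order of appearance
counts = df.groupby(['id', breaks]).size()
print(counts)
For the example data this gives 3, 1, 1, 4 within each id, as a Series indexed by (id, run label).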

Make index first row in group in pandas dataframe

I was wondering if it is possible to make the first row of each group (grouped by index) the name of that index. Suppose we have a df like this:
import pandas as pd
dic = {'index_col': ['a','a','a','b','b','b'], 'col1': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(dic).set_index('index_col')
Is it possible to transform the dataframe above into one that looks like the one below, where the index has been reset and, for every group, the first row is the index name?
The result is a pandas.Series:
df_list = []
for label, group in df.groupby('index_col'):
    df_list.append(pd.concat([pd.Series([label]), group['col1']]))
df_result = pd.concat(df_list).reset_index(drop=True)
Output;
0 a
1 1
2 2
3 3
4 b
5 4
6 5
7 6
dtype: object
Call df_result.to_frame() if you want a DataFrame.
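A usage sketch of that suggestion (the column name here is an assumption, not from the question):
print(df_result.to_frame(name='col1'))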

How to parse out array from column inside a dataframe?

I have a data frame that looks like this:
Index Values Digits
1 [1.0,0.13,0.52...] 3
2 [1.0,0.13,0.32...] 3
3 [1.0,0.31,0.12...] 1
4 [1.0,0.30,0.20...] 2
5 [1.0,0.30,0.20...] 3
My output should be:
Index Values Digits
1 [0.33,0.04,0.17...] 3
2 [0.33,0.04,0.11...] 3
3 [0.33,0.10,0.40...] 1
4 [0.33,0.10,0.07...] 2
5 [0.33,0.10,0.07...] 3
I believe that the Values column has an np.array within each cell? Is this technically an array?
I wish to parse out the Values column and divide all values within the array by 3.
My attempts have stopped at the parsing out of the values:
a = df(df['Values'].values.tolist())
IIUC, apply the list calculation
df.Values.apply(lambda x : [y/3 for y in x])
Out[1095]:
0 [0.3333333333333333, 0.043333333333333335, 0.1...
1 [0.3333333333333333, 0.043333333333333335, 0.1...
Name: Values, dtype: object
#df.Values=df.Values.apply(lambda x : [y/3 for y in x])
Created dataframe:
import pandas as pd
d = {'col1': [[1,10], [2,20]], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
created function:
def divide_by_3(lst):
    output = []
    for i in lst:
        output.append(i/3.0)
    return output
apply function:
df.col1.apply(divide_by_3)
result:
0 [0.333333333333, 3.33333333333]
1 [0.666666666667, 6.66666666667]
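A vectorized sketch, assuming every list in the column has the same length so the column can be stacked into a 2-D array:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [[1, 10], [2, 20]], 'col2': [3, 4]})
# stack the lists into one 2-D array, divide once, convert back to lists
df['col1'] = (np.vstack(df['col1']) / 3).tolist()
print(df)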

pandas filter large dataframe and order by a list

I have a large dataframe as follows:
master_df
          result     item
0              5     id13
1              6  id23432
2              3   id2832
3              4   id9823
...
84376253       7   id9632
And another smaller dataframe as follows:
df = pd.DataFrame({'item' : ['id9632', 'id13', 'id2832', 'id2342']})
How can I extract the relevant elements from master_df.result to match with df.item so I can achieve the following:
df = df.assign(result=list_of_results_in_order)
You can also do a merge:
df = df.merge(master_df, on='item', how='left')
I think you need isin with boolean indexing:
#for Series
s = master_df.loc[master_df['item'].isin(df['item']),'result']
print (s)
0 5
2 3
84376253 7
Name: result, dtype: int64
#for list
L = master_df.loc[master_df['item'].isin(df['item']),'result'].tolist()
print (L)
[5, 3, 7]
#for DataFrame
df1 = master_df[master_df['item'].isin(df['item'])]
print (df1)
result item
0 5 id13
2 3 id2832
84376253 7 id9632
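If the results must line up with the order of df.item (the isin answer returns them in master_df order), a sketch assuming item values are unique in master_df; items absent from master_df (if any) come back as NaN:
# item-indexed lookup, then realign to df.item's order
lookup = master_df.set_index('item')['result']
df = df.assign(result=lookup.reindex(df['item']).to_numpy())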

Unpack DataFrame with tuple entries into separate DataFrames

I wrote a small class to compute some statistics through bootstrap without replacement. For those not familiar with this technique: you take n random subsamples of some data, compute the desired statistic (let's say the median) on each subsample, and then compare the values across subsamples. This gives you a measure of the variance of the obtained median over the dataset.
I implemented this in a class but reduced it to an MWE given by the following function.
import numpy as np
import pandas as pd
def bootstrap_median(df, n=5000, fraction=0.1):
    if isinstance(df, pd.DataFrame):
        columns = df.columns
    else:
        columns = None
    # Get the values as a ndarray
    arr = np.array(df.values)
    # Get the bootstrap sample through random permutations
    sample_len = int(len(arr)*fraction)
    if sample_len < 1:
        sample_len = 1
    sample = []
    for n_sample in range(n):
        sample.append(arr[np.random.permutation(len(arr))[:sample_len]])
    sample = np.array(sample)
    # Compute the median on each sample
    temp = np.median(sample, axis=1)
    # Get the mean and std of the estimate across samples
    m = np.mean(temp, axis=0)
    s = np.std(temp, axis=0)/np.sqrt(len(sample))
    # Convert output to DataFrames if necessary and return
    if columns is not None:  # testing an Index for truth raises ValueError
        m = pd.DataFrame(data=m[None, ...], columns=columns)
        s = pd.DataFrame(data=s[None, ...], columns=columns)
    return m, s
This function returns the mean and standard deviation across the medians computed on each bootstrap sample.
Now consider this example DataFrame
data = np.arange(20)
group = np.tile(np.array([1, 2]).reshape(-1,1), (1,10)).flatten()
df = pd.DataFrame.from_dict({'data': data, 'group': group})
print(df)
print(bootstrap_median(df['data']))
this prints
data group
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
(9.5161999999999995, 0.056585753613431718)
So far so good because bootstrap_median returns a tuple of two elements. However, if I do this after a groupby
In: df.groupby('group')['data'].apply(bootstrap_median)
Out:
group
1 (4.5356, 0.0409710449952)
2 (14.5006, 0.0403772204095)
The values inside each cell are tuples, as one would expect from apply. Calling that result out, I can unpack it into two DataFrames by iterating over its elements like this:
index = []
data1 = []
data2 = []
for g, (m, s) in out.iteritems():
    index.append(g)
    data1.append(m)
    data2.append(s)
dfm = pd.DataFrame(data=data1, index=index, columns=['E[median]'])
dfm.index.name = 'group'
dfs = pd.DataFrame(data=data2, index=index, columns=['std[median]'])
dfs.index.name = 'group'
thus
In: dfm
Out:
E[median]
group
1 4.5356
2 14.5006
In: dfs
Out:
std[median]
group
1 0.0409710449952
2 0.0403772204095
This is a bit cumbersome, and my question is whether there is a more pandas-native way to "unpack" a DataFrame whose values are tuples into separate DataFrames.
This question seemed related but it concerned string regex replacements and not unpacking true tuples.
I think you need to change:
return m, s
to:
return pd.Series([m, s], index=['m','s'])
And then get:
df1 = df.groupby('group')['data'].apply(bootstrap_median)
print (df1)
group
1 m 4.480400
s 0.040542
2 m 14.565200
s 0.040373
Name: data, dtype: float64
So it is possible to select by xs:
print (df1.xs('s', level=1))
group
1 0.040542
2 0.040373
Name: data, dtype: float64
print (df1.xs('m', level=1))
group
1 4.4804
2 14.5652
Name: data, dtype: float64
Also, if you need a one-column DataFrame, add to_frame:
df1 = df.groupby('group')['data'].apply(bootstrap_median).to_frame()
print (df1)
data
group
1 m 4.476800
s 0.041100
2 m 14.468400
s 0.040719
print (df1.xs('s', level=1))
data
group
1 0.041100
2 0.040719
print (df1.xs('m', level=1))
data
group
1 4.4768
2 14.4684
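A sketch of a direct unpack that leaves bootstrap_median returning a tuple, relying on the fact that a Series of tuples converts straight into a two-column DataFrame:
out = df.groupby('group')['data'].apply(bootstrap_median)
# each tuple becomes one row; the group index is preserved
unpacked = pd.DataFrame(out.tolist(), index=out.index,
                        columns=['E[median]', 'std[median]'])
dfm = unpacked[['E[median]']]
dfs = unpacked[['std[median]']]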
