Extend dataframe with contents of series of arrays - python

I have a pandas DataFrame bb and a pandas Series of numpy arrays, aa with the same number of rows.
>>> bb
A B
0 0.049315 0.362793
1 0.853909 0.590942
2 0.854748 0.247608
3 0.084967 0.293541
4 0.053430 0.922705
5 0.571357 0.404485
6 0.363018 0.070912
7 0.784807 0.641253
>>> aa
0 [0.4648, 0.8575, 0.5008]
1 [0.3056, 0.2737, 0.0137]
2 [0.8038, 0.0858, 0.345]
3 [0.4135, 0.7571, 0.3686]
4 [0.7482, 0.8063, 0.7976]
5 [0.9359, 0.5873, 0.2319]
6 [0.8838, 0.7109, 0.712]
7 [0.6493, 0.1516, 0.5401]
dtype: object
I need to add three columns to the DataFrame bb containing the elements of aa. The desired result is this:
A B v0 v1 v2
0 0.049315 0.362793 0.4648 0.8575 0.5008
1 0.853909 0.590942 0.3056 0.2737 0.0137
2 0.854748 0.247608 0.8038 0.0858 0.3450
3 0.084967 0.293541 0.4135 0.7571 0.3686
4 0.053430 0.922705 0.7482 0.8063 0.7976
5 0.571357 0.404485 0.9359 0.5873 0.2319
6 0.363018 0.070912 0.8838 0.7109 0.7120
7 0.784807 0.641253 0.6493 0.1516 0.5401
I can realise this with the following code:
rows, cols = 8, 3
ixs = ["v" + str(i) for i in range(cols)]
bb[ixs] = pd.DataFrame(np.zeros((8, 3)))
for i in range(rows):
    for j in range(cols):
        bb[ixs[j]][i] = aa[i][j]
However this is extremely slow on the larger DataFrames that I have. Is there a more idiomatic way to do this in pandas/numpy that works more quickly?

Create a DataFrame with the constructor, rename the new columns with add_prefix, and attach the result to the original with join or concat:
df = bb.join(pd.DataFrame(aa.values.tolist()).add_prefix('v'))
Or:
df = pd.concat([bb, pd.DataFrame(aa.values.tolist()).add_prefix('v')], axis=1)
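For reference, here is a minimal end-to-end sketch of both variants; the values of bb and aa below are generated randomly rather than copied from the question, so only the shape of the result matters:
import numpy as np
import pandas as pd

# Small stand-ins for the bb and aa shown above (random values).
bb = pd.DataFrame(np.random.rand(8, 2), columns=['A', 'B'])
aa = pd.Series([np.random.rand(3) for _ in range(8)])

# Expand the Series of arrays into its own DataFrame, prefix the new columns,
# then attach it to bb; the shared RangeIndex keeps the rows aligned.
expanded = pd.DataFrame(aa.tolist(), index=aa.index).add_prefix('v')
out_join = bb.join(expanded)
out_concat = pd.concat([bb, expanded], axis=1)

print(out_join.columns.tolist())   # ['A', 'B', 'v0', 'v1', 'v2']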

Related

How to get a transition string per row object based on two different columns in python (without using loops)?

I have the following data structure (an example df is given below).
The columns s and d indicate the transitions of the object in column x. What I want to do is get a transition string per object present in column x, added as a new column.
Is there a lean way to do it using pandas, without using too many loops?
This was the code I tried:
obj = df['x'].tolist()
rows = []
for o in obj:
    locs = df[df['x'] == o]['s'].tolist()
    str_locs = '->'.join(str(l) for l in locs)
    print(str_locs)
    d = dict()
    d['x'] = o
    d['new'] = str_locs
    rows.append(d)
tmp = pd.DataFrame(rows)
This gives the output tmp as:
x new
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
b 1->2
b 1->2
Example df:
df = pd.DataFrame({"x":["a","a","a","a","b","b"], "s":[1,2,4,8,5,11],"d":[2,4,8,9,11,12]})
print(df)
x s d
0 a 1 2
1 a 2 4
2 a 4 8
3 a 8 9
4 b 5 11
5 b 11 12
The following code will generate a transition string for every object present in column x:
Group by column x and get a list of lists of s and d for every object in x.
Merge the lists of lists sequentially (flatten them).
Remove consecutive duplicates from the merged list using itertools.groupby.
Join the items of the merged list with -> to make a single string.
Finally, map the resulting Series back onto column x of the input df.
from itertools import groupby
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
df["new"] = df["x"].map(sr)
print(df)
x s d new
0 a 1 2 1->2->4->8->9
1 a 2 4 1->2->4->8->9
2 a 4 8 1->2->4->8->9
3 a 8 9 1->2->4->8->9
4 b 5 11 5->11->12
5 b 11 12 5->11->12
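To see what each step produces, here is the same pipeline on the example df with the intermediate objects shown as comments (a sketch; the commented values follow from the code above):
from itertools import groupby
import pandas as pd

df = pd.DataFrame({"x": ["a", "a", "a", "a", "b", "b"],
                   "s": [1, 2, 4, 8, 5, 11],
                   "d": [2, 4, 8, 9, 11, 12]})

# Step 1: list of [s, d] pairs per object in x.
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
# a -> [[1, 2], [2, 4], [4, 8], [8, 9]], b -> [[5, 11], [11, 12]]

# Step 2: flatten each list of pairs into one sequence of strings.
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
# a -> ['1', '2', '2', '4', '4', '8', '8', '9'], b -> ['5', '11', '11', '12']

# Steps 3-4: drop consecutive duplicates and join with '->'.
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
# a -> '1->2->4->8->9', b -> '5->11->12'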

pandas filter large dataframe and order by a list

I have a large dataframe as follows:
master_df
result item
0 5 id13
1 6 id23432
2 3 id2832
3 4 id9823
......
84376253 7 id9632
And another smaller dataframe as follows:
df = pd.DataFrame({'item' : ['id9632', 'id13', 'id2832', 'id2342']})
How can I extract the relevant elements from master_df.result to match with df.item so I can achieve the following:
df = df.assign(result=list_of_results_in_order)
You can also do a merge:
df = df.merge(master_df, on='item', how='left')
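For illustration, using only the rows of master_df visible in the question (a hypothetical reconstruction), the left merge keeps the row order of df and yields NaN where an item has no match:
import pandas as pd

# Hypothetical reconstruction of the visible rows of master_df.
master_df = pd.DataFrame({'result': [5, 6, 3, 4, 7],
                          'item': ['id13', 'id23432', 'id2832', 'id9823', 'id9632']})
df = pd.DataFrame({'item': ['id9632', 'id13', 'id2832', 'id2342']})

print(df.merge(master_df, on='item', how='left'))
#      item  result
# 0  id9632     7.0
# 1    id13     5.0
# 2  id2832     3.0
# 3  id2342     NaN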
I think you need isin with boolean indexing:
#for Series
s = master_df.loc[master_df['item'].isin(df['item']),'result']
print (s)
0 5
2 3
84376253 7
Name: result, dtype: int64
#for list
L = master_df.loc[master_df['item'].isin(df['item']),'result'].tolist()
print (L)
[5, 3, 7]
#for DataFrame
df1 = master_df[master_df['item'].isin(df['item'])]
print (df1)
result item
0 5 id13
2 3 id2832
84376253 7 id9632
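If the results have to line up with df.item in df's own order (as in df.assign(result=...)), note that the isin approach above returns them in master_df order. A hedged alternative, assuming item values are unique in master_df, is to index master_df by item and reindex:
# master_df and df as defined in the question (or the sketch above).
# Results aligned to df['item'] in df's order; missing items become NaN.
s_ordered = master_df.set_index('item')['result'].reindex(df['item'])
df = df.assign(result=s_ordered.to_numpy())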

Unpack DataFrame with tuple entries into separate DataFrames

I wrote a small class to compute some statistics through bootstrap without replacement. For those not familiar with this technique, you get n random subsamples of some data, compute the desired statistic (let's say the median) on each subsample, and then compare the values across subsamples. This allows you to get a measure of variance on the obtained median over the dataset.
I implemented this in a class but reduced it to an MWE given by the following function:
import numpy as np
import pandas as pd

def bootstrap_median(df, n=5000, fraction=0.1):
    if isinstance(df, pd.DataFrame):
        columns = df.columns
    else:
        columns = None
    # Get the values as an ndarray
    arr = np.array(df.values)
    # Get the bootstrap sample through random permutations
    sample_len = int(len(arr)*fraction)
    if sample_len < 1:
        sample_len = 1
    sample = []
    for n_sample in range(n):
        sample.append(arr[np.random.permutation(len(arr))[:sample_len]])
    sample = np.array(sample)
    # Compute the median on each sample
    temp = np.median(sample, axis=1)
    # Get the mean and std of the estimate across samples
    m = np.mean(temp, axis=0)
    s = np.std(temp, axis=0)/np.sqrt(len(sample))
    # Convert output to DataFrames if necessary and return
    if columns is not None:
        m = pd.DataFrame(data=m[None, ...], columns=columns)
        s = pd.DataFrame(data=s[None, ...], columns=columns)
    return m, s
This function returns the mean and standard deviation across the medians computed on each bootstrap sample.
Now consider this example DataFrame
data = np.arange(20)
group = np.tile(np.array([1, 2]).reshape(-1,1), (1,10)).flatten()
df = pd.DataFrame.from_dict({'data': data, 'group': group})
print(df)
print(bootstrap_median(df['data']))
this prints
data group
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
(9.5161999999999995, 0.056585753613431718)
So far so good because bootstrap_median returns a tuple of two elements. However, if I do this after a groupby
In: df.groupby('group')['data'].apply(bootstrap_median)
Out:
group
1 (4.5356, 0.0409710449952)
2 (14.5006, 0.0403772204095)
The values inside each cell are tuples, as one would expect from apply. I can unpack the result into two DataFrames by iterating over the elements like this:
# out = df.groupby('group')['data'].apply(bootstrap_median), as shown above
index = []
data1 = []
data2 = []
for g, (m, s) in out.items():
    index.append(g)
    data1.append(m)
    data2.append(s)
dfm = pd.DataFrame(data=data1, index=index, columns=['E[median]'])
dfm.index.name = 'group'
dfs = pd.DataFrame(data=data2, index=index, columns=['std[median]'])
dfs.index.name = 'group'
thus
In: dfm
Out:
E[median]
group
1 4.5356
2 14.5006
In: dfs
Out:
std[median]
group
1 0.0409710449952
2 0.0403772204095
This is a bit cumbersome, and my question is whether there is a more pandas-native way to "unpack" a dataframe whose values are tuples into separate DataFrames.
This question seemed related but it concerned string regex replacements and not unpacking true tuples.
I think you need to change:
return m, s
to:
return pd.Series([m, s], index=['m','s'])
And then get:
df1 = df.groupby('group')['data'].apply(bootstrap_median)
print (df1)
group
1 m 4.480400
s 0.040542
2 m 14.565200
s 0.040373
Name: data, dtype: float64
So it is possible to select by xs:
print (df1.xs('s', level=1))
group
1 0.040542
2 0.040373
Name: data, dtype: float64
print (df1.xs('m', level=1))
group
1 4.4804
2 14.5652
Name: data, dtype: float64
Also, if you need a one-column DataFrame, add to_frame:
df1 = df.groupby('group')['data'].apply(bootstrap_median).to_frame()
print (df1)
data
group
1 m 4.476800
s 0.041100
2 m 14.468400
s 0.040719
print (df1.xs('s', level=1))
data
group
1 0.041100
2 0.040719
print (df1.xs('m', level=1))
data
group
1 4.4768
2 14.4684
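A related sketch: once bootstrap_median returns the pd.Series shown above, the grouped result has a MultiIndex, so unstack spreads m and s into columns in one step (the numbers vary per run because of the random resampling):
res = df.groupby('group')['data'].apply(bootstrap_median).unstack()
print(res)
#                m         s
# group
# 1       4.480400  0.040542
# 2      14.565200  0.040373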

Join Pandas Dataframes according to array

I am attempting to join several dataframes together. The list of names of these dataframes is stored in another dataframe called companies, which is displayed below.
>>> companies
Symbols
0 TUES
1 DRAM
2 NTRS
3 PCBK
4 CRIS
5 PERY
6 IRDM
7 GNCMA
8 IBOC
My aim would be to do something like this: joined=TUES.join(DRAM) then joined=joined.join(NTRS) and so on, down the list. How might I be able to reference elements of the Symbols column of the dataframe companies in order to achieve this?
Many thanks in advance!
You can define an empty DataFrame and append all other dataframes to it. See the example below:
combined_df = pandas.DataFrame()
for df in other_dataframes:
    combined_df = combined_df.append(df)
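A brief note on the snippet above: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same row-wise combination is written with concat (a sketch, with other_dataframes standing in for your iterable of frames):
import pandas

# other_dataframes is assumed to be the iterable of frames from the loop above.
combined_df = pandas.concat(other_dataframes, axis=0)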
Use pd.concat, which is designed for concatenating lists of dfs.
So for your example, just turn the values into a list and then concat:
joined = pd.concat(list(companies['Symbols']), axis=1)
Example:
In [4]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df1 = pd.DataFrame({'c':np.random.randn(5), 'd':np.random.randn(5)})
df2 = pd.DataFrame({'e':np.random.randn(5), 'f':np.random.randn(5)})
df_list=[df,df2,df1]
df_list
Out[4]:
[ a b
0 0.143116 1.205407
1 -0.430869 1.429313
2 0.059810 0.430131
3 2.554849 -1.450640
4 -1.127638 0.715323
[5 rows x 2 columns], e f
0 0.658967 1.150672
1 0.813355 -0.252577
2 0.885928 0.970844
3 0.519375 -1.929081
4 -0.217152 0.907535
[5 rows x 2 columns], c d
0 -1.375885 1.422697
1 -0.870040 0.135527
2 -0.696600 1.954966
3 0.494035 -0.727816
4 -0.367156 -0.216115
[5 rows x 2 columns]]
In [8]:
# now concatenate the list of dfs, by column
pd.concat(df_list,axis=1)
Out[8]:
a b e f c d
0 0.143116 1.205407 0.658967 1.150672 -1.375885 1.422697
1 -0.430869 1.429313 0.813355 -0.252577 -0.870040 0.135527
2 0.059810 0.430131 0.885928 0.970844 -0.696600 1.954966
3 2.554849 -1.450640 0.519375 -1.929081 0.494035 -0.727816
4 -1.127638 0.715323 -0.217152 0.907535 -0.367156 -0.216115
[5 rows x 6 columns]
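Since the Symbols column in the question holds names rather than the frames themselves, one hedged way to bridge the gap is to keep the per-symbol frames in a dict and look them up before concatenating (frames_by_symbol and the dummy frames below are illustrative, not from the question):
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the per-symbol dataframes (TUES, DRAM, ...).
frames_by_symbol = {
    'TUES': pd.DataFrame({'close': np.random.randn(5)}).add_prefix('TUES_'),
    'DRAM': pd.DataFrame({'close': np.random.randn(5)}).add_prefix('DRAM_'),
    'NTRS': pd.DataFrame({'close': np.random.randn(5)}).add_prefix('NTRS_'),
}
companies = pd.DataFrame({'Symbols': ['TUES', 'DRAM', 'NTRS']})

# Look each symbol up by name and concatenate column-wise,
# the rough equivalent of TUES.join(DRAM).join(NTRS) ...
joined = pd.concat([frames_by_symbol[s] for s in companies['Symbols']], axis=1)
print(joined.columns.tolist())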

Pandas Multi-Column Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any indexes (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but would like to only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select out using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully, you can see that this would be a really useful 'template' for operating on large DataFrames, with the boolean indexing defined in the boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of the df columns with text. The definition of df using random numbers was simply given as a starter if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example, all columns are filled with ints of 0, 1, 2 or 3.
Lets define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns from the rows in which some other columns all contain the value 2.
You can then get the result with:
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Lets check the required columns for those rows:
df[filt.keys()].iloc[[43,98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
I'm starting to like Pandas more and more.
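Closer to the "dict generator" the question asks about, the boolean mask can also be built directly from the dict with a comprehension and numpy's logical_and.reduce (a sketch, reusing df, filt and req as defined above):
import numpy as np

# AND together one boolean Series per (column, value) pair in filt.
mask = np.logical_and.reduce([df[col] == val for col, val in filt.items()])
print(df.loc[mask, req])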
