Join Pandas Dataframes according to array - python

I am attempting to join several dataframes together. The list of names of these dataframes is stored in another dataframe called companies, which is displayed below.
>>> companies
  Symbols
0 TUES
1 DRAM
2 NTRS
3 PCBK
4 CRIS
5 PERY
6 IRDM
7 GNCMA
8 IBOC
My aim is to do something like joined = TUES.join(DRAM), then joined = joined.join(NTRS), and so on down the list. How can I reference the elements of the Symbols column of the companies dataframe in order to achieve this?
Many thanks in advance!

You can define an empty DataFrame and append all other dataframes to it. See the example below:
import pandas as pd

combined_df = pd.DataFrame()
for df in other_dataframes:
    combined_df = combined_df.append(df)
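Note that DataFrame.append was deprecated and then removed in pandas 2.0; on recent versions the same idea can be written with pd.concat, for example:
# equivalent on current pandas versions, where DataFrame.append no longer exists
combined_df = pd.concat(other_dataframes)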

Use pd.concat; it is designed for concatenating lists of DataFrames.
For your example, just turn the values into a list and then concat:
joined = pd.concat(list(companies['Symbols']), axis=1)
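This assumes the Symbols column actually holds the DataFrame objects themselves. If it instead holds the ticker strings and the per-symbol DataFrames live in, say, a dict keyed by ticker, you would look them up first. A minimal sketch, assuming a hypothetical dict called frames_by_symbol:
import pandas as pd

# hypothetical: frames_by_symbol maps 'TUES' -> TUES DataFrame, 'DRAM' -> DRAM DataFrame, ...
df_list = [frames_by_symbol[sym] for sym in companies['Symbols']]
joined = pd.concat(df_list, axis=1)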
Example:
In [4]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5)})
df1 = pd.DataFrame({'c':np.random.randn(5), 'd':np.random.randn(5)})
df2 = pd.DataFrame({'e':np.random.randn(5), 'f':np.random.randn(5)})
df_list=[df,df2,df1]
df_list
Out[4]:
[          a         b
 0  0.143116  1.205407
 1 -0.430869  1.429313
 2  0.059810  0.430131
 3  2.554849 -1.450640
 4 -1.127638  0.715323

 [5 rows x 2 columns],
           e         f
 0  0.658967  1.150672
 1  0.813355 -0.252577
 2  0.885928  0.970844
 3  0.519375 -1.929081
 4 -0.217152  0.907535

 [5 rows x 2 columns],
           c         d
 0 -1.375885  1.422697
 1 -0.870040  0.135527
 2 -0.696600  1.954966
 3  0.494035 -0.727816
 4 -0.367156 -0.216115

 [5 rows x 2 columns]]
In [8]:
# now concatenate the list of dfs, by column
pd.concat(df_list,axis=1)
Out[8]:
a b e f c d
0 0.143116 1.205407 0.658967 1.150672 -1.375885 1.422697
1 -0.430869 1.429313 0.813355 -0.252577 -0.870040 0.135527
2 0.059810 0.430131 0.885928 0.970844 -0.696600 1.954966
3 2.554849 -1.450640 0.519375 -1.929081 0.494035 -0.727816
4 -1.127638 0.715323 -0.217152 0.907535 -0.367156 -0.216115
[5 rows x 6 columns]
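One caveat worth knowing: pd.concat with axis=1 aligns the frames on their row index, so if the per-symbol frames cover different date ranges the result will contain NaN wherever the indexes don't overlap. If you only want the common rows, pass join='inner':
# keep only rows whose index appears in every frame
pd.concat(df_list, axis=1, join='inner')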

Related

How to feed random numbers as indices to pandas data frame?

I'm trying to get a random sample from two pandas DataFrames. If (random) rows 2, 5, 8 are selected in frame A, then the same rows 2, 5, 8 must be selected from frame B. I did this by first generating a random sample, and now I want to use this sample as the row indices for both frames. How can I do it? The code should look like:
idx = list(random.sample(range(X_train.shape[0]),5000))
lgstc_reg[i].fit(X_train[idx,:], y_train[idx,:])
However, running the code gives an error, because a pandas DataFrame does not support NumPy-style X_train[idx,:] indexing.
Use iloc:
indexes = [2,5,8] # in your case this is the randomly generated list
A.iloc[indexes]
B.iloc[indexes]
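Applied to the original fit call, that would look roughly like this (a sketch, assuming X_train and y_train are DataFrames and lgstc_reg[i] is a scikit-learn style estimator, as in the question):
import random

# pick 5000 random row positions, then select them positionally with .iloc
idx = random.sample(range(X_train.shape[0]), 5000)
lgstc_reg[i].fit(X_train.iloc[idx], y_train.iloc[idx])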
An alternative consistent sampling methodology would be to set a random seed, and then sample:
random_seed = 42
A.sample(3, random_state=random_seed)
B.sample(3, random_state=random_seed)
The sampled DataFrames will have the same index.
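A quick check of that claim (a sketch, assuming A and B have the same length and index):
import numpy as np
import pandas as pd

A = pd.DataFrame({'x': np.arange(10)})
B = pd.DataFrame({'y': np.arange(10) * 2})

# the same random_state selects the same row positions from both frames
assert A.sample(3, random_state=42).index.equals(B.sample(3, random_state=42).index)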
Hope this helps!
>>> df1
value ID
0 3 2
1 4 2
2 7 8
3 8 8
4 11 8
>>> df2
value distance
0 3 0
1 4 0
2 7 1
3 8 0
4 11 0
I have two DataFrames. I want to select random rows of df1 along with the corresponding rows of df2.
First I create a selection_index, which holds the index labels of a random sample of rows of df1, using the built-in pandas function sample. Then I use this index to locate those rows in both df1 and df2 with the help of another built-in function, loc.
>>> selection_index = df1.sample(2).index
>>> selection_index
Int64Index([3, 1], dtype='int64')
>>> df1.loc[selection_index]
value ID
3 8 8
1 4 2
>>> df2.loc[selection_index]
value distance
3 8 0
1 4 0
>>>
In your case, this would become something like the following (loc works here because sample returns the actual index labels of the selected rows):
idx = X_train.sample(5000).index
lgstc_reg[i].fit(X_train.loc[idx], y_train.loc[idx])

Extend dataframe with contents of series of arrays

I have a pandas DataFrame bb and a pandas Series of numpy arrays, aa with the same number of rows.
>>> bb
A B
0 0.049315 0.362793
1 0.853909 0.590942
2 0.854748 0.247608
3 0.084967 0.293541
4 0.053430 0.922705
5 0.571357 0.404485
6 0.363018 0.070912
7 0.784807 0.641253
>>> aa
0 [0.4648, 0.8575, 0.5008]
1 [0.3056, 0.2737, 0.0137]
2 [0.8038, 0.0858, 0.345]
3 [0.4135, 0.7571, 0.3686]
4 [0.7482, 0.8063, 0.7976]
5 [0.9359, 0.5873, 0.2319]
6 [0.8838, 0.7109, 0.712]
7 [0.6493, 0.1516, 0.5401]
dtype: object
I need to add three columns to the DataFrame bb containing the elements of aa. The desired result is this:
A B v0 v1 v2
0 0.049315 0.362793 0.4648 0.8575 0.5008
1 0.853909 0.590942 0.3056 0.2737 0.0137
2 0.854748 0.247608 0.8038 0.0858 0.3450
3 0.084967 0.293541 0.4135 0.7571 0.3686
4 0.053430 0.922705 0.7482 0.8063 0.7976
5 0.571357 0.404485 0.9359 0.5873 0.2319
6 0.363018 0.070912 0.8838 0.7109 0.7120
7 0.784807 0.641253 0.6493 0.1516 0.5401
I can realise this with the following code:
rows, cols = 8, 3
ixs = ["v" + str(i) for i in range(cols)]
bb[ixs] = pd.DataFrame(np.zeros((rows, cols)))
for i in range(rows):
    for j in range(cols):
        bb[ixs[j]][i] = aa[i][j]
However, this is extremely slow on the larger DataFrames that I have. Is there a more idiomatic way to do this in pandas/numpy that runs more quickly?
Create a DataFrame with the constructor, change the column names with add_prefix, and add it to the original with join or concat:
df = bb.join(pd.DataFrame(aa.values.tolist()).add_prefix('v'))
Or:
df = pd.concat([bb, pd.DataFrame(aa.values.tolist()).add_prefix('v')], axis=1)
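A small self-contained sketch of the same idea, using made-up data shaped like bb and aa above:
import numpy as np
import pandas as pd

bb = pd.DataFrame(np.random.rand(8, 2), columns=['A', 'B'])
aa = pd.Series([np.random.rand(3).round(4) for _ in range(8)])

# each array in aa becomes one row; passing aa.index keeps the new rows aligned with bb,
# since join matches on index labels rather than position
expanded = pd.DataFrame(aa.tolist(), index=aa.index).add_prefix('v')
result = bb.join(expanded)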

pandas filter large dataframe and order by a list

I have a large dataframe as follows:
master_df
result item
0 5 id13
1 6 id23432
2 3 id2832
3 4 id9823
......
84376253 7 id9632
And another smaller dataframe as follows:
df = pd.DataFrame({'item' : ['id9632', 'id13', 'id2832', 'id2342']})
How can I extract the relevant elements from master_df.result to match with df.item so I can achieve the following:
df = df.assign(result=list_of_results_in_order)
You can also do a merge:
df = df.merge(master_df, on='item', how='left')
A left merge keeps the rows of df in their original order; items not found in master_df get NaN in result.
I think you need isin with boolean indexing:
#for Series
s = master_df.loc[master_df['item'].isin(df['item']),'result']
print (s)
0 5
2 3
84376253 7
Name: result, dtype: int64
#for list
L = master_df.loc[master_df['item'].isin(df['item']),'result'].tolist()
print (L)
[5, 3, 7]
#for DataFrame
df1 = master_df[master_df['item'].isin(df['item'])]
print (df1)
result item
0 5 id13
2 3 id2832
84376253 7 id9632
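Note that the isin approach returns the matches in master_df's own row order, not in the order of df['item']. If you need the results aligned with df['item'] (as in the df.assign example), one sketch is to reindex, assuming the item values are unique in master_df:
# items missing from master_df (here 'id2342') come back as NaN
s_ordered = master_df.set_index('item')['result'].reindex(df['item'])
# use .values so assign doesn't try to align on the item labels
df = df.assign(result=s_ordered.values)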

Restore hierarchical column index when using groupby in pandas

I am using groupby in pandas to compute some aggregate statistics on data where the columns of the data frame are organized with a hierarchical index.
For the computed statistics I want to get back to a table form in the end, where the groups are converted back into columns holding the group values, e.g. like:
index = pd.MultiIndex.from_tuples([('A', 'a'), ('B', 'b')])
df = pd.DataFrame(np.random.randn(8,2), columns=index)
which results in e.g. this data frame
A B
a b
0 0.511157 0.334748
1 0.031113 -0.477456
2 0.288080 -0.258238
3 0.138467 -0.955547
4 -0.087873 0.017494
5 -0.667393 1.190039
6 -0.068245 -1.282864
7 -0.996982 0.589667
Now I compute the statistics using groupby and reset the index to recreate a flat data frame:
df.groupby([('A','a')]).mean().reset_index()
(A, a) B
b
0 -0.996982 0.589667
1 -0.667393 1.190039
2 -0.087873 0.017494
3 -0.068245 -1.282864
4 0.031113 -0.477456
5 0.138467 -0.955547
6 0.288080 -0.258238
7 0.511157 0.334748
How can I make ('A', 'a') part of the MultiIndex again, ideally in an automatic fashion? Or, stated otherwise: is there a way to preserve the hierarchical column structure during the groupby operation?
For me, adding the parameter as_index=False to groupby works:
print (df.groupby([('A','a')], as_index=False).mean())
A B
a b
0 -0.765088 -0.556601
1 -0.628040 2.074559
2 -0.516396 -2.028387
3 -0.152027 0.389853
4 0.450218 1.474989
5 0.718040 -0.882018
6 1.932556 -0.977316
7 2.028468 -0.875167
The simplest thing to do is reassign back the original columns:
In [182]:
df1 = df.groupby([('A','a')]).mean().reset_index()
df1.columns = df.columns
df1
Out[182]:
A B
a b
0 -0.857465 -0.761948
1 -0.263677 0.538251
2 0.067710 -1.038906
3 0.345584 -0.425514
4 0.478200 0.119345
5 0.639305 0.047526
6 1.528260 1.956677
7 3.114834 -0.532462

Pandas Multi-Column Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any index set (i.e. index_col=None). You would like to select a subset of the columns as defined by required_columns_list, but only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select out using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully you can see that this would be a really useful 'template' for operating on large DataFrames, where the boolean indexing can then be defined in boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
P.S. If you would like to test this out, you will need to populate some of df's columns with text. The definition of df using random numbers was simply given as a starting point for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example, all columns are filled with the integers 0, 1, 2, or 3.
Lets define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns from the rows where certain other columns all contain the value 2.
You can then get the result with (wrapping the dict views in list() so this also works on Python 3, where .keys() and .values() are views rather than lists):
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Let's check the filter columns for those rows:
df[filt.keys()].iloc[[43,98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
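Finally, to answer the literal question of building the selection from the dict with a generator, here is a sketch equivalent to the apply-based line above (and to the chained & expression in the question), using the same filt and req:
import functools
import operator

# each (col, val) pair contributes one boolean Series; reduce combines them with &
mask = functools.reduce(
    operator.and_,
    (df[col] == val for col, val in filt.items()),
)
df.loc[mask, req]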
I'm starting to like Pandas more and more.
