I'm trying to add two dataframes with MultiIndex columns and different index sizes together. What is the most elegant solution? An example is:
import numpy as np
import pandas as pd

names = ['Level 0', 'Level 1']
cols1 = pd.MultiIndex.from_arrays([['A', 'A', 'B'],['A1', 'A2', 'B1']], names = names)
cols2 = pd.MultiIndex.from_arrays([['A', 'A', 'B'],['A1', 'A3', 'B1']], names = names)
df1 = pd.DataFrame(np.random.randn(1, 3), index=range(1), columns=cols1)
df2 = pd.DataFrame(np.random.randn(5, 3), index=range(5), columns=cols2)
print(df1)
print(df2)
Level 0 A B
Level 1 A1 A2 B1
0 -0.116975 -0.391591 0.446029
Level 0 A B
Level 1 A1 A3 B1
0 1.179689 0.693096 -0.102621
1 -0.913441 0.187332 1.465217
2 -0.089724 -1.907706 -0.963699
3 0.203217 -1.233399 0.006726
4 0.218911 -0.027446 0.982764
Now I want to add df1 to df2 such that missing columns are simply carried over and the single row of df1 (index 0) is added to every row of df2.
So I would expect with the above numbers:
Level 0 A B
Level 1 A1 A2 A3 B1
0 1.062714 -0.391591 0.693096 0.343408
1 -1.030416 -0.391591 0.187332 1.911246
2 -0.206699 -0.391591 -1.907706 -0.51767
3 0.086242 -0.391591 -1.233399 0.452755
4 0.101936 -0.391591 -0.027446 1.428793
What is the most speed- and memory-efficient solution? Any help is appreciated.
Setup
In [76]: df1
Out[76]:
Level 0 A B
Level 1 A1 A2 B1
0 -0.28667 1.852091 -0.134793
In [77]: df2
Out[77]:
Level 0 A B
Level 1 A1 A3 B1
0 -0.023582 -0.713594 0.487355
1 0.628819 0.764721 -1.118777
2 -0.572421 1.326448 -0.788531
3 -0.160608 1.985142 0.344845
4 -0.184555 -1.075794 0.630975
This will align the frames and fill the NaNs with 0, but it does not broadcast:
In [63]: df1a,df2a = df1.align(df2,fill_value=0)
In [64]: df1a+df2a
Out[64]:
Level 0 A B
Level 1 A1 A2 A3 B1
0 -0.310253 1.852091 -0.713594 0.352561
1 0.628819 0.000000 0.764721 -1.118777
2 -0.572421 0.000000 1.326448 -0.788531
3 -0.160608 0.000000 1.985142 0.344845
4 -0.184555 0.000000 -1.075794 0.630975
This is the way to broadcast the first frame: aligning without a fill value leaves NaNs, ffill propagates df1's single row down the index, and fillna(0) handles the columns missing from each frame:
In [65]: df1a,df2a = df1.align(df2)
In [66]: df1a.ffill().fillna(0) + df2a.fillna(0)
Out[66]:
Level 0 A B
Level 1 A1 A2 A3 B1
0 -0.310253 1.852091 -0.713594 0.352561
1 0.342149 1.852091 0.764721 -1.253570
2 -0.859091 1.852091 1.326448 -0.923324
3 -0.447278 1.852091 1.985142 0.210052
4 -0.471226 1.852091 -1.075794 0.496181
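A more compact variant of the same idea (a sketch, assuming df1 always has a single row at index 0): broadcast df1's row over df2's index with reindex, then let add align the columns while treating entries missing from one frame as 0.
# repeat df1's single row (index 0) for every row of df2
df1b = df1.reindex(df2.index, method='ffill')
# fill_value=0 supplies the missing side for columns that exist in only one frame
result = df1b.add(df2, fill_value=0)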
I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into lists and then expand them with tolist, but since this appends sequentially, it does not work correctly: if the URL has three levels, the mapping gets messed up.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns to the right by each row's number of missing values; since the expansion leaves the NaNs on the right, this right-aligns the parts so the TLD always lands in the last column:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
You can also use a regex with a negative lookahead and pandas' built-in str.split with expand=True:
# split on the last '[.]' (the lookahead ensures no further '[.]' follows it)
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
# then split the remaining left part
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
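If you only ever need the TLD and SLD, an alternative sketch (the TLD/SLD column names are just illustrative) is to split once and index the parts from the right, so the last element is always the TLD and the second-to-last the SLD:
# str.split treats a multi-character pattern as a regex, so escape the brackets and dot
parts = df['C'].str.split(r'\[\.\]')
df['TLD'] = parts.str[-1]   # com, pro, fr
df['SLD'] = parts.str[-2]   # xyz, abc123, gouv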
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame(
    {
        'pos': ['A1', 'B03', 'A2', 'B01', 'A3', 'B02'],
        'ignore': range(6)
    }
)
pos ignore
0 A1 0
1 B03 1
2 A2 2
3 B01 3
4 A3 4
5 B02 5
Which I would like to sort according to pos whereby
it should be first sorted according to the number and then according to the letter and
leading 0s should be ignored,
so the desired outcome is
pos ignore
0 A1 0
1 B01 3
2 A2 2
3 B02 5
4 A3 4
5 B03 1
I currently do it like this:
df[['let', 'num']] = df['pos'].str.extract(
    r'([A-Za-z]+)([0-9]+)'
)
df['num'] = df['num'].astype(int)
df = (
    df.sort_values(['num', 'let'])
    .drop(['let', 'num'], axis=1)
    .reset_index(drop=True)
)
That works, but what I don't like is that I need two temporary columns I later have to drop again. Is there a more straightforward way of doing it?
You can use argsort after zero-padding the numbers with zfill so that they sort correctly as 01, 02, 03, etc. This way you don't have to assign and drop columns:
val = df['pos'].str.extract(r'(\D+)(\d+)')
df.loc[(val[1].str.zfill(2) + val[0]).argsort()]
pos ignore
0 A1 0
3 B01 3
2 A2 2
5 B02 5
4 A3 4
1 B03 1
Here's one way:
import re

def extract_parts(x):
    groups = re.match('([A-Za-z]+)([0-9]+)', x)
    return (int(groups[2]), groups[1])

df.reindex(df.pos.transform(extract_parts).sort_values().index).reset_index(drop=True)
Output
Out[1]:
pos ignore
0 A1 0
1 B01 3
2 A2 2
3 B02 5
4 A3 4
5 B03 1
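On pandas 1.1 or later (a sketch; that version added the key argument to sort_values), the same zfill idea works without ever attaching temporary columns:
def natural_key(s):
    # zero-pad the numeric part so '3' sorts after '02', then append the letter
    parts = s.str.extract(r'([A-Za-z]+)([0-9]+)')
    return parts[1].str.zfill(3) + parts[0]

df.sort_values('pos', key=natural_key, ignore_index=True)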
I have a dataframe (df_main) into which I want to copy data from another dataframe (df_data) by matching on the necessary columns.
df_data
name Index par_1 par_2 ... par_n
0 A1 1 a0 b0
1 A1 2 a1
2 A1 3 a2
3 A1 4 a3
4 A2 2 a4
...
df_main
name Index_0 Index_1
0 A1 1 2
1 A1 1 3
2 A1 1 4
3 A1 2 3
4 A1 2 4
5 A1 3 4
...
I want to copy the parameter columns from df_data into df_main so that every row of df_main receives the parameters of the df_data row with the same name and index.
I have made the following implementation using for loops, which is far too slow to be practical:
import tqdm

def data_copy(df, df_data, indice):
    '''indice: whether Index_0 or Index_1 is being checked'''
    # Get all the distinct names in the dataset to loop over
    names = df['name'].unique()
    for name in tqdm.tqdm(names):
        # Get the unique indices for this name
        indexes = df[df['name'] == name][indice].unique()
        # Loop over all indices
        for index in indexes:
            # From df_data, get the rows matching this name and index
            data = df_data[(df_data['Index'] == index) & (df_data['name'] == name)]
            # columns: the list of parameter columns (par_1, par_2, ...)
            req_data = data[columns]
            for col in columns:
                # For each col (e.g. par_1, par_2, etc.), get the value for this name and index
                val = df_data.loc[(df_data['Index'] == index) & (df_data['name'] == name), col]
                df.loc[(df[indice] == index) & (df['name'] == name), col] = val[val.index.item()]
    return df

df_main = data_copy(df_main, df_data, 'Index_0')
This gives me the result I need:
df_main
name Index_0 Index_1 par_1 par_2 ...
0 A1 1 2 a0
1 A1 1 3 a0
2 A1 1 4 a0
3 A1 2 3 a1
4 A1 2 4 a1
5 A1 3 4 a2
However, running it on really big data takes a lot of time. What's the best way to avoid the for loops for a faster implementation?
For each data frame, you can create a new column that concatenates name and index. See below:
import pandas as pd

df1 = {'name': ['A1', 'A1'], 'index': ['1', '2'], 'par_1': ['a0', 'a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index']

df2 = {'name': ['A1', 'A1'], 'index_0': ['1', '2'], 'index_1': ['2', '3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0']

for i, row in df1.iterrows():
    df2.loc[df2['new'] == row['new'], 'par_1'] = row['par_1']
df2
Result:
name index_0 index_1 new par_1
0 A1 1 2 A11 a0
1 A1 2 3 A12 a1
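That said, iterrows still loops in Python. A vectorized sketch (assuming the column names from the question): a single left merge copies every par_* column at once.
# match df_main's Index_0 against df_data's Index within each name
df_main = df_main.merge(
    df_data.rename(columns={'Index': 'Index_0'}),
    on=['name', 'Index_0'],
    how='left',
)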
If I have a Pandas dataframe like so:
colA colB
A A1
B C1
A B1
B A1
colA has 2 unique values (A, B) and colB has 3 unique values (A1, B1 and C1).
I would like to create a new dataframe where colA and colB are all combinations and another column colC which is 1 or 0 based on the combination present in earlier df.
expected result:
colA colB colC
A A1 1
A B1 1
A C1 0
B A1 1
B B1 0
B C1 1
First create a new column filled with 1 via DataFrame.assign, then build a MultiIndex.from_product from the Series.unique values of both columns; after DataFrame.set_index, use DataFrame.reindex, where the fill_value parameter sets the value of the colC column for the newly appended rows:
mux = pd.MultiIndex.from_product([df['colA'].unique(), df['colB'].unique()],
                                 names=['colA', 'colB'])
df1 = df.assign(colC=1).set_index(['colA', 'colB']).reindex(mux, fill_value=0).reset_index()
print(df1)
colA colB colC
0 A A1 1
1 A C1 0
2 A B1 1
3 B A1 1
4 B C1 1
5 B B1 0
An alternative is to reshape with DataFrame.set_index, Series.unstack and DataFrame.stack:
df1 = (df.assign(colC=1)
         .set_index(['colA', 'colB'])['colC']
         .unstack(fill_value=0)
         .stack()
         .reset_index(name='colC'))
print(df1)
colA colB colC
0 A A1 1
1 A B1 1
2 A C1 0
3 B A1 1
4 B B1 0
5 B C1 1
Another solution is to create a new DataFrame from itertools.product, use DataFrame.merge with indicator=True, rename the indicator column, and compare it against 'both', casting the resulting True/False to 1/0:
from itertools import product

df1 = pd.DataFrame(list(product(df['colA'].unique(), df['colB'].unique())),
                   columns=['colA', 'colB'])
df = df1.merge(df, how='left', indicator=True).rename(columns={'_merge': 'colC'})
df['colC'] = df['colC'].eq('both').astype(int)
print(df)
colA colB colC
0 A A1 1
1 A C1 0
2 A B1 1
3 B A1 1
4 B C1 1
5 B B1 0
Finally, if necessary, sort by both columns with DataFrame.sort_values:
df1 = df1.sort_values(['colA','colB'])
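One more sketch, using a different technique (pd.crosstab): the cross-tabulation counts each (colA, colB) pair over the full cross product, so clipping the counts at 1 and stacking yields the 0/1 indicator directly.
df1 = (pd.crosstab(df['colA'], df['colB'])
         .clip(upper=1)    # counts > 1 collapse to the indicator 1
         .stack()          # back to one row per (colA, colB) pair
         .reset_index(name='colC'))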
I have two dataframes:
df1=
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
df2=
A B C
0 A2 B2 C10
1 A1 B3 C11
2 A9 B4 C12
and I want to find rows in df1 that are not found in df2 based on one or two columns (or more columns). So, if I only compare column 'A' then the following rows from df1 are not found in df2 (note that column 'B' and column 'C' are not used for comparison between df1 and df2)
A B C
0 A0 B0 C0
And I would like to return a series with
0 False
1 True
2 True
Or, if I only compare column 'A' and column 'B' then the following rows from df1 are not found in df2 (note that column 'C' is not used for comparison between df1 and df2)
A B C
0 A0 B0 C0
1 A1 B1 C1
And I would want to return a series with
0 False
1 False
2 True
I know how to accomplish this using sets but I am looking for a straightforward Pandas way of accomplishing this.
If your pandas version is at least 0.17.0, you can use pd.merge, passing the columns of interest, how='left', and indicator=True to flag whether each row is present only in the left frame or in both. You can then test whether the appended _merge column equals 'both':
In [102]:
pd.merge(df1, df2, on='A',how='left', indicator=True)['_merge'] == 'both'
Out[102]:
0 False
1 True
2 True
Name: _merge, dtype: bool
In [103]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)['_merge'] == 'both'
Out[103]:
0 False
1 False
2 True
Name: _merge, dtype: bool
Output from the merge:
In [104]:
pd.merge(df1, df2, on='A',how='left', indicator=True)
Out[104]:
A B_x C_x B_y C_y _merge
0 A0 B0 C0 NaN NaN left_only
1 A1 B1 C1 B3 C11 both
2 A2 B2 C2 B2 C10 both
In [105]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)
Out[105]:
A B C_x C_y _merge
0 A0 B0 C0 NaN left_only
1 A1 B1 C1 NaN left_only
2 A2 B2 C2 C10 both
Ideally, one would like to be able to just use ~df1[COLS].isin(df2[COLS]) as a mask, but this requires index labels to match (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html)
Here is a succinct form that uses .isin but converts the second DataFrame to a dict so that index labels don't need to match (drop the leading ~ to get the boolean series itself):
COLS = ['A', 'B']  # or whichever columns to use for comparison
df1[~df1[COLS].isin(df2[COLS].to_dict(orient='list')).all(axis=1)]
df1['A'].isin(df2['A'])
should get you the series you want (True where the row is found in df2), and negating it selects the rows of df1 that are not in df2:
df1[~df1['A'].isin(df2['A'])]
The resulting dataframe:
A B C
0 A0 B0 C0
Method (1)
In [63]:
df1['A'].isin(df2['A']) & df1['B'].isin(df2['B'])
Out[63]:
0 False
1 False
2 True
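Note that Method (1) checks each column independently, so a row whose 'A' and 'B' values both appear somewhere in df2, but never together in the same row, would still come back True. A pair-wise sketch that avoids this (assuming pandas 0.24+ for MultiIndex.from_frame):
cols = ['A', 'B']
# compare whole (A, B) tuples rather than each column separately
mask = pd.MultiIndex.from_frame(df1[cols]).isin(pd.MultiIndex.from_frame(df2[cols]))
found = pd.Series(mask, index=df1.index)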
Method (2)
You can use a left merge to obtain the values that exist in both frames plus the values that exist only in the first frame:
In [10]:
left = pd.merge(df1 , df2 , on = ['A' , 'B'] ,how = 'left')
left
Out[10]:
A B C_x C_y
0 A0 B0 C0 NaN
1 A1 B1 C1 NaN
2 A2 B2 C2 C10
Values that exist only in the first frame will of course have NaN in the columns coming from the other frame, so you can filter on those NaN values as follows:
In [16]:
left.loc[pd.isnull(left['C_y']) , 'A':'C_x']
Out[16]:
A B C_x
0 A0 B0 C0
1 A1 B1 C1
If you want a boolean series indicating whether each row of df1 exists in df2, you can do the following:
In [20]:
pd.notnull(left['C_y'])
Out[20]:
0 False
1 False
2 True
Name: C_y, dtype: bool