Create a dataframe based on another dataframe using unique values - python

If I have a Pandas dataframe like so:
colA colB
A    A1
B    C1
A    B1
B    A1
colA has 2 unique values (A, B) and colB has 3 unique values (A1, B1 and C1).
I would like to create a new dataframe containing all combinations of colA and colB, plus another column colC which is 1 or 0 depending on whether the combination is present in the earlier df.
expected result:
colA colB colC
A    A1   1
A    B1   1
A    C1   0
B    A1   1
B    B1   0
B    C1   1
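
For reproducibility, the sample frame can be rebuilt like this (a minimal sketch; the values are taken from the table above):

import pandas as pd

# Rebuild the input frame shown above
df = pd.DataFrame({'colA': ['A', 'B', 'A', 'B'],
                   'colB': ['A1', 'C1', 'B1', 'A1']})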

First create a new column filled with 1 via DataFrame.assign, then build a MultiIndex.from_product from the Series.unique values of both columns, and after DataFrame.set_index use DataFrame.reindex - the fill_value parameter sets colC to 0 for the newly appended rows (note the row order follows the order of appearance returned by unique()):
mux = pd.MultiIndex.from_product([df['colA'].unique(),
                                  df['colB'].unique()], names=['colA','colB'])
df1 = (df.assign(colC=1)
         .set_index(['colA','colB'])
         .reindex(mux, fill_value=0)
         .reset_index())
print (df1)
  colA colB  colC
0    A   A1     1
1    A   C1     0
2    A   B1     1
3    B   A1     1
4    B   C1     1
5    B   B1     0
An alternative is to reshape with DataFrame.set_index, Series.unstack and DataFrame.stack - because stack sorts the index, the rows come out in lexicographic order:
df1 = (df.assign(colC=1)
         .set_index(['colA','colB'])['colC']
         .unstack(fill_value=0)
         .stack()
         .reset_index(name='colC'))
print (df1)
  colA colB  colC
0    A   A1     1
1    A   B1     1
2    A   C1     0
3    B   A1     1
4    B   B1     0
5    B   C1     1
Another solution is to create a new DataFrame from itertools.product, use DataFrame.merge with indicator=True, rename the indicator column, then compare it to 'both' and cast the True/False values to 1/0:
from itertools import product

df1 = pd.DataFrame(product(df['colA'].unique(), df['colB'].unique()),
                   columns=['colA','colB'])
df = df1.merge(df, how='left', indicator=True).rename(columns={'_merge':'colC'})
df['colC'] = df['colC'].eq('both').astype(int)
print (df)
  colA colB  colC
0    A   A1     1
1    A   C1     0
2    A   B1     1
3    B   A1     1
4    B   C1     1
5    B   B1     0
Finally, if necessary, sort by both columns with DataFrame.sort_values:
df1 = df1.sort_values(['colA','colB'])

Map and merge values from another dataframe

1. Input dataframe with random num values:
ID   num
a    2
a,b  3
b    1
c,e  4
I have another dataframe:
ID  name
a   a1
b   b5
c   c4
d   d1
e   e6
2. Expected result: I want to map df1 against df2 on ID and store the result as another column:
ID   num  ID_name
a    2    a1
a,b  3    a1,b5
b    1    b5
c,e  4    c4,e6
3. Code I tried:
df1["ID_name"] = df1["ID"].map(df2)
df1
But the values are not getting mapped and show NaN for most of the rows.
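
For reference, the two sample frames can be rebuilt like this (a minimal sketch based on the tables above):

import pandas as pd

df1 = pd.DataFrame({'ID': ['a', 'a,b', 'b', 'c,e'],
                    'num': [2, 3, 1, 4]})
df2 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e'],
                    'name': ['a1', 'b5', 'c4', 'd1', 'e6']})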
We can use Series.str.split, Series.explode to get one row per ID component, then Series.map against df2 and a groupby on the original index to join the names back together. (Series.map expects a Series, dict or function - not a DataFrame - and composite keys like 'a,b' have no direct match in df2, which is why the original attempt returns NaN.)
df["ID_name"] = (
df["ID"]
.str.split(",")
.explode()
.map(df2.set_index("ID")["name"])
.groupby(level=0)
.agg(",".join)
)
    ID  num ID_name
0    a    2      a1
1  a,b    3   a1,b5
2    b    1      b5
3  c,e    4   c4,e6
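
One caveat (an assumption about data not shown in the question): if an ID component has no match in df2, map produces NaN and ",".join then raises a TypeError. A hypothetical variant fills unmatched components first:

df1["ID_name"] = (
    df1["ID"]
    .str.split(",")
    .explode()
    .map(df2.set_index("ID")["name"])
    .fillna("")        # unmatched IDs become empty strings instead of breaking the join
    .groupby(level=0)
    .agg(",".join)
)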

Extract TLDs , SLDs from a dataframe column into new columns

I am trying to extract the top level domain (TLD), second level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into lists and expand them with tolist, but since this appends tokens sequentially from the left, the mapping breaks when a URL has three levels instead of two.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
   A  B                   C
0  1  2          [xyz, com]
1  2  3       [abc123, pro]
2  3  4  [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
   A0  B0      C0    C1    C2
0   1   2     xyz   com  None
1   2   3  abc123   pro  None
2   3   4  xyzabc  gouv    fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns to the right by the number of missing values in each row:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
   A0  B0      C0      C1   C2
0   1   2     NaN     xyz  com
1   2   3     NaN  abc123  pro
2   3   4  xyzabc    gouv   fr
You can also use a regex with a negative lookahead and pandas' built-in Series.str.split with expand=True (starting from the original string column, before it is split into lists):
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)  # split on the last [.]
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)             # split off the SLD
that gives
   A  B                   C      C0   C2    C1
0  1  2           xyz[.]com     xyz  com  None
1  2  3        abc123[.]pro  abc123  pro  None
2  3  4  xyzabc[.]gouv[.]fr  xyzabc   fr  gouv
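Another option (a sketch of my own, not from the original answers): reverse each token list before expanding, so the first column always holds the TLD regardless of how many levels the domain has:

# Hypothetical alternative: split, reverse, then expand
parts = df['C'].str.split(r'\[\.\]')            # assumes C still holds the raw strings
rev = pd.DataFrame(parts.map(lambda p: p[::-1]).tolist(),
                   index=df.index).add_prefix('C')
# Here C0 is always the TLD and C1 the SLD; rename the columns if other labels are needed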

Faster copying of pandas data with some conditions

I have a dataframe (df_main) into which I want to copy data based on matching rows from another dataframe (df_data).
df_data
  name  Index par_1 par_2 ... par_n
0   A1      1    a0    b0
1   A1      2    a1
2   A1      3    a2
3   A1      4    a3
4   A2      2    a4
...
df_main
  name  Index_0  Index_1
0   A1        1        2
1   A1        1        3
2   A1        1        4
3   A1        2        3
4   A1        2        4
5   A1        3        4
...
I want to copy the parameter columns from df_data into df_main, so that each df_main row receives the parameters of the df_data row with the same name and index.
I have made the following implementation using for loops which is practically too slow to be used:
import tqdm

def data_copy(df, df_data, indice):
    '''indice: whether Index_0 or Index_1 is being checked'''
    names = df['name'].unique()
    # We get all different names in the dataset to loop over
    for name in tqdm.tqdm(names):
        # Get the unique indexes for a specific name
        indexes = df[df['name'] == name][indice].unique()
        # Looping over all indexes
        for index in indexes:
            # From df_data, get the data of all cols for a specific name and index
            data = df_data[(df_data['Index'] == index) & (df_data['name'] == name)]
            # columns: only the parameter columns of df_data
            req_data = data[columns]
            for col in columns:
                # For each col (e.g. par_1, par_2, etc), get the val of a specific index
                val = df_data.loc[(df_data['Index'] == index) & (df_data['name'] == name), col]
                df.loc[(df[indice] == index) & (df['name'] == name), col] = val[val.index.item()]
    return df

df_main = data_copy(df_main, df_data, 'Index_0')
This gives me what I require:
df_main
  name  Index_0  Index_1 par_1 par_2 ...
0   A1        1        2    a0
1   A1        1        3    a0
2   A1        1        4    a0
3   A1        2        3    a1
4   A1        2        4    a1
5   A1        3        4    a2
However, running it on really big data takes a lot of time. What's the best way of avoiding the for loops for a faster implementation?
For each dataframe, you can create a new key column that concatenates name and index. See below:
import pandas as pd

df1 = {'name': ['A1', 'A1'], 'index': ['1', '2'], 'par_1': ['a0', 'a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index']
df1

df2 = {'name': ['A1', 'A1'], 'index_0': ['1', '2'], 'index_1': ['2', '3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0']
df2

for i, row in df1.iterrows():
    df2.loc[df2['new'] == row['new'], 'par_1'] = row['par_1']
df2
Result :
  name index_0 index_1  new par_1
0   A1       1       2  A11    a0
1   A1       2       3  A12    a1
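The row loop above still scales with the number of rows in df1; a vectorized alternative (a sketch, assuming the join keys are name plus the index column) is a plain left merge on both keys:

# Hypothetical vectorized version: join df_main to df_data on name + index,
# so every matching row picks up all parameter columns at once
df_main = df_main.merge(
    df_data.rename(columns={'Index': 'Index_0'}),  # align the key column names
    on=['name', 'Index_0'],
    how='left'
)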

How to pivot a dataframe into a square dataframe with number of intersections in other column as values

How can I pivot a dataframe into a square dataframe whose values are the numbers of intersections in the value column, where my input dataframe is:
my input dataframe is
field  value
a      1
a      2
b      3
b      1
c      2
c      5
Output should be
   a  b  c
a  2  1  1
b  1  2  0
c  1  0  2
The values in the output dataframe should be the number of values in the value column shared by the two fields; e.g. a and b share only value 1, so both (a, b) cells are 1, while the diagonal counts each field's own values.
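For reference, the input can be rebuilt like this (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'field': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 2, 3, 1, 2, 5]})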
Use a self-join on value together with crosstab:
df = df.merge(df, on='value')
df = pd.crosstab(df['field_x'], df['field_y'])
print (df)
field_y  a  b  c
field_x
a        2  1  1
b        1  2  0
c        1  0  2
Then remove the index and column names with rename_axis:
#pandas 0.24+
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(index=None, columns=None)
print (df)
   a  b  c
a  2  1  1
b  1  2  0
c  1  0  2
#pandas below
df = pd.crosstab(df['field_x'], df['field_y']).rename_axis(None).rename_axis(None, axis=1)

Pandas adding two Multiindex Dataframes

I'm trying to add two dataframes with MultiIndex columns and different index sizes. What is the most elegant solution? An example is:
import numpy as np
import pandas as pd

names = ['Level 0', 'Level 1']
cols1 = pd.MultiIndex.from_arrays([['A', 'A', 'B'], ['A1', 'A2', 'B1']], names=names)
cols2 = pd.MultiIndex.from_arrays([['A', 'A', 'B'], ['A1', 'A3', 'B1']], names=names)
df1 = pd.DataFrame(np.random.randn(1, 3), index=range(1), columns=cols1)
df2 = pd.DataFrame(np.random.randn(5, 3), index=range(5), columns=cols2)
print(df1)
print(df2)
Level 0         A                   B
Level 1        A1        A2        B1
0       -0.116975 -0.391591  0.446029

Level 0         A                   B
Level 1        A1        A3        B1
0        1.179689  0.693096 -0.102621
1       -0.913441  0.187332  1.465217
2       -0.089724 -1.907706 -0.963699
3        0.203217 -1.233399  0.006726
4        0.218911 -0.027446  0.982764
Now I want to add df1 to df2 so that missing columns are simply appended and row 0 of df1 is added to every row of df2.
So I would expect with the above numbers:
Level 0         A                             B
Level 1        A1        A2        A3        B1
0        1.062714 -0.391591  0.693096  0.343408
1       -1.030416 -0.391591  0.187332  1.911246
2       -0.206699 -0.391591 -1.907706 -0.517670
3        0.086242 -0.391591 -1.233399  0.452755
4        0.101936 -0.391591 -0.027446  1.428793
What is the most speed- and memory-efficient solution? Any help appreciated.
Setup
In [76]: df1
Out[76]:
Level 0        A                    B
Level 1       A1        A2         B1
0       -0.28667  1.852091  -0.134793

In [77]: df2
Out[77]:
Level 0         A                   B
Level 1        A1        A3        B1
0       -0.023582 -0.713594  0.487355
1        0.628819  0.764721 -1.118777
2       -0.572421  1.326448 -0.788531
3       -0.160608  1.985142  0.344845
4       -0.184555 -1.075794  0.630975
This will align the frames and fill the NaNs with 0, but it does not broadcast df1's single row:
In [63]: df1a,df2a = df1.align(df2,fill_value=0)
In [64]: df1a+df2a
Out[64]:
Level 0         A                             B
Level 1        A1        A2        A3        B1
0       -0.310253  1.852091 -0.713594  0.352561
1        0.628819  0.000000  0.764721 -1.118777
2       -0.572421  0.000000  1.326448 -0.788531
3       -0.160608  0.000000  1.985142  0.344845
4       -0.184555  0.000000 -1.075794  0.630975
This is the way to broadcast the first frame - forward-fill its single row down the aligned index, then add:
In [65]: df1a,df2a = df1.align(df2)
In [66]: df1a.ffill().fillna(0) + df2a.fillna(0)
Out[66]:
Level 0 A B
Level 1 A1 A2 A3 B1
0 -0.310253 1.852091 -0.713594 0.352561
1 0.342149 1.852091 0.764721 -1.253570
2 -0.859091 1.852091 1.326448 -0.923324
3 -0.447278 1.852091 1.985142 0.210052
4 -0.471226 1.852091 -1.075794 0.496181
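An equivalent sketch without align (my own variant, not from the original answer): reindex both frames to the union of the columns, then add df1's single row as a Series, which broadcasts across all rows of df2:

# Hypothetical alternative: broadcast df1's row 0 via Series addition
cols = df1.columns.union(df2.columns)
out = df2.reindex(columns=cols, fill_value=0).add(
    df1.reindex(columns=cols, fill_value=0).iloc[0], axis=1)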
