I have a dataframe (df_main) into which I want to copy data, based on finding the necessary columns in another dataframe (df_data).
df_data
name Index par_1 par_2 ... par_n
0 A1 1 a0 b0
1 A1 2 a1
2 A1 3 a2
3 A1 4 a3
4 A2 2 a4
...
df_main
name Index_0 Index_1
0 A1 1 2
1 A1 1 3
2 A1 1 4
3 A1 2 3
4 A1 2 4
5 A1 3 4
...
I want to copy the parameter columns from df_data into df_main, with the condition that all the parameters from the df_data row with the same name and index are copied into df_main.
I have made the following implementation using for loops, which is far too slow to be usable:
import tqdm

def data_copy(df, df_data, indice):
    '''indice: whether Index_0 or Index_1 is being checked'''
    # Get all the different names in the dataset to loop over
    names = df['name'].unique()
    for name in tqdm.tqdm(names):
        # Get the unique indexes for a specific name
        indexes = df[df['name'] == name][indice].unique()
        # Loop over all indexes
        for index in indexes:
            # From df_data, get the rows of a specific name and index
            data = df_data[(df_data['Index'] == index) & (df_data['name'] == name)]
            # columns: the list of parameter columns (par_1, par_2, ...) only
            req_data = data[columns]
            for col in columns:
                # For each col (e.g. par_1, par_2, etc.), get the value at this index
                val = df_data.loc[(df_data['Index'] == index) & (df_data['name'] == name), col]
                df.loc[(df[indice] == index) & (df['name'] == name), col] = val[val.index.item()]
    return df

df_main = data_copy(df_main, df_data, 'Index_0')
This gives me what I required as:
df_main
name Index_0 Index_1 par_1 par_2 ...
0 A1 1 2 a0
1 A1 1 3 a0
2 A1 1 4 a0
3 A1 2 3 a1
4 A1 2 4 a1
5 A1 3 4 a2
However, running it on a really big data requires a lot of time. What's the best way of avoiding the for loops for a faster implementation?
For each dataframe, you can create a new column that concatenates both name and index. See below:
import pandas as pd
df1 = {'name':['A1','A1'],'index':['1','2'],'par_1':['a0','a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index']
df1
df2 = {'name':['A1','A1'],'index_0':['1','2'],'index_1':['2','3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0']
df2
for i, row in df1.iterrows():
    df2.loc[df2['new'] == row['new'], 'par_1'] = row['par_1']
df2
Result :
name index_0 index_1 new par_1
0 A1 1 2 A11 a0
1 A1 2 3 A12 a1
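As a loop-free alternative, a left merge on the name/index pair does the same lookup in one vectorized step. A minimal sketch, with frames reconstructed from the question (a single par_1 column assumed for brevity):

```python
import pandas as pd

# Sample frames reconstructed from the question
df_data = pd.DataFrame({'name': ['A1', 'A1', 'A1', 'A1'],
                        'Index': [1, 2, 3, 4],
                        'par_1': ['a0', 'a1', 'a2', 'a3']})
df_main = pd.DataFrame({'name': ['A1', 'A1', 'A1'],
                        'Index_0': [1, 1, 2],
                        'Index_1': [2, 3, 3]})

# A left merge matching (name, Index_0) against (name, Index) pulls every
# parameter column across in one vectorized step, with no Python-level loops.
out = df_main.merge(df_data, left_on=['name', 'Index_0'],
                    right_on=['name', 'Index'], how='left').drop(columns='Index')
print(out)
```

The same call with left_on=['name', 'Index_1'] handles the other index column.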
Related
I have a dataframe1:
id val
a1 0
a1 5
a2 3
and dataframe2:
id type1 type2
a1 main k
a2 secondary b
a3 old k
a4 deleted n
I want to join the type columns to dataframe1 by id to get:
id val type1 type2
a1 0 main k
a1 5 main k
a2 3 secondary b
How can I do that? As you can see, the output table has the same shape as dataframe1, but when I use pd.merge the output is larger.
Try this:
out = pd.merge(dataframe1, dataframe2, how='inner', on=['id'])
Output:
id val type1 type2
a1 0 main k
a1 5 main k
a2 3 secondary b
df = pd.merge(df1, df2)
or
df = df1.merge(df2)
should work just fine.
Output
id val type1 type2
0 a1 0 main k
1 a1 5 main k
2 a2 3 secondary b
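To see why the result keeps dataframe1's shape, the merge can be run end-to-end on frames rebuilt from the question: an inner merge keeps only ids present in both frames, and since dataframe2 has one row per id, no rows get duplicated.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a1', 'a1', 'a2'], 'val': [0, 5, 3]})
df2 = pd.DataFrame({'id': ['a1', 'a2', 'a3', 'a4'],
                    'type1': ['main', 'secondary', 'old', 'deleted'],
                    'type2': ['k', 'b', 'k', 'n']})

# Inner merge: only ids present in both frames survive, and because df2
# has exactly one row per id, no rows from df1 are duplicated.
out = pd.merge(df1, df2, how='inner', on='id')
print(out)
```

A larger output usually means the key column is not unique in the right-hand frame, so each left row matches several right rows.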
1. Input dataframe with random num values:
ID num
a 2
a,b 3
b 1
c,e 4
I have another dataframe:
ID name
a a1
b b5
c c4
d d1
e e6
2. Expected result: I want to map df1 with df2 on ID and store the result as another column:
ID num ID_name
a 2 a1
a,b 3 a1,b5
b 1 b5
c,e 4 c4,e6
3. Code I tried:
df1["ID_name"] = df1["ID"].map(df2)
df1
But the values are not getting mapped; most of them show NaN.
We can use Series.str.split, then Series.map, and finally groupby on the exploded elements.
df1["ID_name"] = (
    df1["ID"]
    .str.split(",")
    .explode()
    .map(df2.set_index("ID")["name"])
    .groupby(level=0)
    .agg(",".join)
)
ID num ID_name
0 a 2 a1
1 a,b 3 a1,b5
2 b 1 b5
3 c,e 4 c4,e6
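For reference, the same split/explode/map/groupby chain runs end-to-end with the sample frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['a', 'a,b', 'b', 'c,e'], 'num': [2, 3, 1, 4]})
df2 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e'],
                    'name': ['a1', 'b5', 'c4', 'd1', 'e6']})

# split -> explode yields one row per ID token (the original row index
# repeats for each token), map looks up each token's name in df2, and
# groupby(level=0) joins the mapped names back into one string per row.
df1['ID_name'] = (
    df1['ID']
    .str.split(',')
    .explode()
    .map(df2.set_index('ID')['name'])
    .groupby(level=0)
    .agg(','.join)
)
print(df1)
```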
I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into lists and then use tolist, but since this appends sequentially, it does not work correctly: if the url has 3 levels, the mapping gets messed up.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com,pro,fr) and C1 to always contain SLD
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
You can also use a regex with a negative lookahead and pandas' built-in split with expand:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
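Indexing the split tokens from the right is another loop-free way to pin the TLD and SLD to fixed columns regardless of depth. A sketch assuming at most three levels, as in the sample data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4],
                   "C": ["xyz[.]com", "abc123[.]pro", "xyzabc[.]gouv[.]fr"]})

# Split on the literal "[.]" separator, then index the token lists from
# the right so the TLD always lands in C2 and the SLD in C1.
parts = df["C"].str.split(r"\[\.\]", regex=True)
df["C2"] = parts.str[-1]                             # top-level domain
df["C1"] = parts.str[-2]                             # second-level domain
df["C0"] = parts.str[0].where(parts.str.len() > 2)   # third level, if present
print(df)
```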
I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row which starts with a is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask and use it for selection:
mask = df.B.str.startswith("a")
mask[~mask] = np.nan
df[mask.ffill().fillna(0).astype(int) == 1]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
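A cumulative maximum over the boolean mask gives the same result without computing an explicit index: it stays False until the first match and True for every row after it. A sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 5, 6, 7, 8],
                   "B": ["b0", "a0", "c0", "c1", "a1", "b1", "b2"]})

# cummax() over the boolean mask is False until the first "a" row,
# then True for that row and every row after it.
out = df[df.B.str.startswith("a").cummax()]
print(out)
```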
If I understand your question correctly, here is how you can do it:
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# index of the item beginning with "a"
index = df[df.B.str.startswith("a")].values.tolist()[0][0]
desired_df = pd.concat([df.A[index-1:], df.B[index-1:]], axis=1)
print(desired_df)
and you get:
I'm trying to add two dataframes with MultiIndex columns and different index sizes together. What is the most elegant solution? An example is:
names = ['Level 0', 'Level 1']
cols1 = pd.MultiIndex.from_arrays([['A', 'A', 'B'],['A1', 'A2', 'B1']], names = names)
cols2 = pd.MultiIndex.from_arrays([['A', 'A', 'B'],['A1', 'A3', 'B1']], names = names)
df1 = pd.DataFrame(np.random.randn(1, 3), index=range(1), columns=cols1)
df2 = pd.DataFrame(np.random.randn(5, 3), index=range(5), columns=cols2)
print(df1)
print(df2)
Level 0 A B
Level 1 A1 A2 B1
0 -0.116975 -0.391591 0.446029
Level 0 A B
Level 1 A1 A3 B1
0 1.179689 0.693096 -0.102621
1 -0.913441 0.187332 1.465217
2 -0.089724 -1.907706 -0.963699
3 0.203217 -1.233399 0.006726
4 0.218911 -0.027446 0.982764
Now I try to add df1 to df2 with the logic that missing columns are just added, and that row 0 of df1 is added to every row of df2.
So I would expect with the above numbers:
Level 0 A B
Level 1 A1 A2 A3 B1
0 1.062714 -0.391591 0.693096 0.343408
1 -1.030416 -0.391591 0.187332 1.911246
2 -0.206699 -0.391591 -1.907706 -0.51767
3 0.086242 -0.391591 -1.233399 0.452755
4 0.101936 -0.391591 -0.027446 1.428793
What is the most speed and memory efficient solution? Any help appreciated.
Setup
In [76]: df1
Out[76]:
Level 0 A B
Level 1 A1 A2 B1
0 -0.28667 1.852091 -0.134793
In [77]: df2
Out[77]:
Level 0 A B
Level 1 A1 A3 B1
0 -0.023582 -0.713594 0.487355
1 0.628819 0.764721 -1.118777
2 -0.572421 1.326448 -0.788531
3 -0.160608 1.985142 0.344845
4 -0.184555 -1.075794 0.630975
This will align the frames and fill the NaNs with 0, but it does not broadcast:
In [63]: df1a,df2a = df1.align(df2,fill_value=0)
In [64]: df1a+df2a
Out[64]:
Level 0 A B
Level 1 A1 A2 A3 B1
0 -0.310253 1.852091 -0.713594 0.352561
1 0.628819 0.000000 0.764721 -1.118777
2 -0.572421 0.000000 1.326448 -0.788531
3 -0.160608 0.000000 1.985142 0.344845
4 -0.184555 0.000000 -1.075794 0.630975
This is the way to broadcast the first frame:
In [65]: df1a,df2a = df1.align(df2)
In [66]: df1a.ffill().fillna(0) + df2a.fillna(0)
Out[66]:
Level 0 A B
Level 1 A1 A2 A3 B1
0 -0.310253 1.852091 -0.713594 0.352561
1 0.342149 1.852091 0.764721 -1.253570
2 -0.859091 1.852091 1.326448 -0.923324
3 -0.447278 1.852091 1.985142 0.210052
4 -0.471226 1.852091 -1.075794 0.496181
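The align-and-ffill idea can be checked end-to-end with deterministic values (ones in df1, twos in df2) so every cell of the result is predictable:

```python
import numpy as np
import pandas as pd

names = ['Level 0', 'Level 1']
cols1 = pd.MultiIndex.from_arrays([['A', 'A', 'B'], ['A1', 'A2', 'B1']], names=names)
cols2 = pd.MultiIndex.from_arrays([['A', 'A', 'B'], ['A1', 'A3', 'B1']], names=names)
df1 = pd.DataFrame(np.ones((1, 3)), index=range(1), columns=cols1)
df2 = pd.DataFrame(np.full((5, 3), 2.0), index=range(5), columns=cols2)

# align() unions both the columns and the index; ffill() then broadcasts
# df1's single row down to every row of df2 before the addition.
df1a, df2a = df1.align(df2)
result = df1a.ffill().fillna(0) + df2a.fillna(0)
print(result)
```

Shared columns (A1, B1) come out as 1 + 2 = 3, df1-only A2 stays at 1 in every row, and df2-only A3 stays at 2.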