1. Input dataframe with random num values:
ID num
a 2
a,b 3
b 1
c,e 4
I have another dataframe:
ID name
a a1
b b5
c c4
d d1
e e6
2. Expected result: I want to map df1 to df2 on ID and store the result as another column:
ID num ID_name
a 2 a1
a,b 3 a1,b5
b 1 b5
c,e 4 c4,e6
3. Code I tried:
df1["ID_name"] = df1["ID"].map(df2)
df1
But the values are not getting mapped and show NaN for most of the values.
We can split the ID strings with Series.str.split, explode them, map each element with Series.map, and then groupby the original index to join the names back:
df1["ID_name"] = (
    df1["ID"]
    .str.split(",")                        # split the comma-separated IDs
    .explode()                             # one row per individual ID
    .map(df2.set_index("ID")["name"])      # look up each ID's name in df2
    .groupby(level=0)                      # group back by the original row index
    .agg(",".join)                         # re-join the names per row
)
ID num ID_name
0 a 2 a1
1 a,b 3 a1,b5
2 b 1 b5
3 c,e 4 c4,e6
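As for why the original attempt returns NaN: Series.map expects a dict, Series or callable keyed by the values being looked up, and even with df2 converted to a Series, combined keys such as "a,b" have no match. A minimal sketch for the single-ID case (assuming df2 as above):
df1["ID_name"] = df1["ID"].map(df2.set_index("ID")["name"])  # only matches single IDs such as "a"; "a,b" and "c,e" still come back as NaN
which is why the split/explode/groupby chain above is needed for the comma-separated entries.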
I have a dataframe df1:
id1 id2
a1 b1
c1 d1
e1 d1
g1 h1
and df2:
id value
a1 10
b1 9
c1 7
d1 11
e1 12
g1 5
h1 8
I want to keep rows from df1 only if their values from the value column in df2 differ by no more than 1. The desired output is:
id1 id2
a1 b1
e1 d1
Row c1 d1 was removed since the gap between 7 and 11 is higher than 1; same for g1 h1. How can I do that?
Here's one way using boolean indexing. The idea is to stack the IDs in df1, get their corresponding values from df2, then keep the rows where the absolute difference is at most 1:
out = df1.loc[df1.stack().map(df2.set_index('id')['value']).droplevel(-1).groupby(level=0).diff().abs().dropna().le(1).pipe(lambda x: x[x].index)]
Output:
id1 id2
0 a1 b1
2 e1 d1
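The same chain can be read step by step; a sketch of the intermediate objects:
s = df1.stack()                              # (row, column) -> id, e.g. (0, 'id1') -> 'a1'
vals = s.map(df2.set_index('id')['value'])   # look up each id's value
vals = vals.droplevel(-1)                    # keep only the original row label
gap = vals.groupby(level=0).diff().abs()     # |value(id2) - value(id1)| per row
keep = gap.dropna().le(1)                    # True where the gap is at most 1
out = df1.loc[keep[keep].index]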
IIUC:
df1[df1.applymap(df2.set_index('id').value.get).eval('abs(id1 - id2)').le(1)]
id1 id2
0 a1 b1
2 e1 d1
Longer Answer
# Callable I'll need in `applymap`
# it basically translates `df2` into
# a function that returns `'value'`
# when you pass `'id'`
c = df2.set_index('id').value.get
# `applymap` applies a callable to each dataframe cell
df1_applied = df1.applymap(c)
print(df1_applied)
id1 id2
0 10 9
1 7 11
2 12 11
3 5 8
# `eval` takes a string argument that describes what
# calculation to do. See docs for more
df1_applied_evaled = df1_applied.eval('abs(id1 - id2)')
print(df1_applied_evaled)
0 1
1 4
2 1
3 3
dtype: int64
# now just boolean slice your way to the end
df1[df1_applied_evaled.le(1)]
id1 id2
0 a1 b1
2 e1 d1
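A side note: in pandas 2.1+ DataFrame.applymap was renamed to DataFrame.map, so on recent versions the applied step can be written as:
df1_applied = df1.map(c)  # equivalent to applymap on pandas >= 2.1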
It's easy and intuitive to do this with datar, a re-imagining of pandas APIs:
>>> from datar.all import f, tibble, left_join, mutate, abs, filter, select
>>>
>>> df1 = tibble(
... id1=["a1", "c1", "e1", "g1"],
... id2=["b1", "d1", "d1", "h1"],
... )
>>>
>>> df2 = tibble(
... id=["a1", "b1", "c1", "d1", "e1", "g1", "h1"],
... value=[10, 9, 7, 11, 12, 5, 8],
... )
>>>
>>> (
... df1
... >> left_join(df2, by={"id1": f.id}) # get the values of id1
... >> left_join(df2, by={"id2": f.id}) # get the values of id2
... >> mutate(diff=abs(f.value_x - f.value_y)) # calculate the diff
... >> filter(f.diff <= 1) # filter with diff <= 1
... >> select(f.id1, f.id2) # keep only desired columns
... )
id1 id2
<object> <object>
0 a1 b1
2 e1 d1
I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into a list and then use tolist, but since this appends sequentially, it does not work correctly: if the URL has 3 levels, the mapping gets messed up.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
You can also use a regex with a negative lookahead and pandas' built-in str.split with expand:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
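Alternatively, right-aligning the split parts directly also keeps the TLD in the last column; a minimal sketch (padding each list of labels on the left, assuming the df from the question):
parts = df["C"].str.split(r"\[\.\]")                    # lists of labels per row
width = parts.map(len).max()                            # most levels seen in the column
padded = parts.apply(lambda p: [None] * (width - len(p)) + p)
df[[f"C{i}" for i in range(width)]] = pd.DataFrame(padded.tolist(), index=df.index)
As with the shift approach, rows with fewer levels end up with None in C0.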
I have a dataframe (df_main) into which I want to copy data from another dataframe (df_data) based on matching columns.
df_data
name Index par_1 par_2 ... par_n
0 A1 1 a0 b0
1 A1 2 a1
2 A1 3 a2
3 A1 4 a3
4 A2 2 a4
...
df_main
name Index_0 Index_1
0 A1 1 2
1 A1 1 3
2 A1 1 4
3 A1 2 3
4 A1 2 4
5 A1 3 4
...
I want to copy the parameter columns from df_data into df_main, such that all parameters from the df_data row with the same name and index are copied into df_main.
I have made the following implementation using for loops, which is far too slow to be usable:
import tqdm

def data_copy(df, df_data, indice):
    '''indice: whether Index_0 or Index_1 is being checked'''
    names = df['name'].unique()
    # We get all different names in the dataset to loop over
    for name in tqdm.tqdm(names):
        # Get the unique indexes for a specific name
        indexes = df[df['name'] == name][indice].unique()
        # Looping over all indexes
        for index in indexes:
            # From df_data, get the data of all cols for this specific name and index
            data = df_data[(df_data['Index'] == index) & (df_data['name'] == name)]
            # columns: only the parameter columns (par_1, par_2, ...)
            req_data = data[columns]
            for col in columns:
                # For each col (e.g. par_1, par_2, etc.), get the value for this index
                val = df_data.loc[(df_data['Index'] == index) & (df_data['name'] == name), col]
                df.loc[(df[indice] == index) & (df['name'] == name), col] = val[val.index.item()]
    return df
df_main = data_copy(df_main, df_data, 'Index_0')
This gives me what I require:
df_main
name Index_0 Index_1 par_1 par_2 ...
0 A1 1 2 a0
1 A1 1 3 a0
2 A1 1 4 a0
3 A1 2 3 a1
4 A1 2 4 a1
5 A1 3 4 a2
However, running it on really big data takes a lot of time. What's the best way to avoid the for loops for a faster implementation?
For each dataframe, you can create a new column that concatenates both name and index. See below:
import pandas as pd
df1 = {'name':['A1','A1'],'index':['1','2'],'par_1':['a0','a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index']
df1
df2 = {'name':['A1','A1'],'index_0':['1','2'],'index_1':['2','3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0']
df2
for i, row in df1.iterrows():
    df2.loc[df2['new'] == row['new'], 'par_1'] = row['par_1']

df2
Result:
name index_0 index_1 new par_1
0 A1 1 2 A11 a0
1 A1 2 3 A12 a1
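To avoid row-by-row loops entirely, a vectorized alternative is a plain merge on the two key columns; a sketch, assuming df_data has one row per (name, Index) pair:
# Bring all parameter columns over in one shot by joining on (name, Index_0)
merged = df_main.merge(
    df_data.rename(columns={"Index": "Index_0"}),
    on=["name", "Index_0"],
    how="left",
)
Repeating the merge with Index renamed to Index_1 covers the other indice.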
If I have a Pandas dataframe like so:
colA colB
A A1
B C1
A B1
B A1
colA has 2 unique values (A, B) and colB has 3 unique values (A1, B1 and C1).
I would like to create a new dataframe containing all combinations of colA and colB, plus another column colC which is 1 or 0 depending on whether the combination is present in the earlier df.
expected result:
colA colB colC
A A1 1
A B1 1
A C1 0
B A1 1
B B1 0
B C1 1
First create a new column filled with 1 via DataFrame.assign, then build a MultiIndex.from_product from the Series.unique values of both columns; after DataFrame.set_index, use DataFrame.reindex, where the fill_value parameter sets the value of colC for the newly appended rows:
mux = pd.MultiIndex.from_product([df['colA'].unique(),
                                  df['colB'].unique()], names=['colA', 'colB'])
df1 = df.assign(colC=1).set_index(['colA', 'colB']).reindex(mux, fill_value=0).reset_index()
print (df1)
colA colB colC
0 A A1 1
1 A C1 0
2 A B1 1
3 B A1 1
4 B C1 1
5 B B1 0
An alternative is to reshape with DataFrame.set_index, Series.unstack and DataFrame.stack:
df1 = (df.assign(colC=1)
         .set_index(['colA', 'colB'])['colC']
         .unstack(fill_value=0)
         .stack()
         .reset_index(name='ColC'))
print (df1)
colA colB ColC
0 A A1 1
1 A B1 1
2 A C1 0
3 B A1 1
4 B B1 0
5 B C1 1
Another solution is to create a new DataFrame from itertools.product, use DataFrame.merge with indicator=True, rename the indicator column, then compare it to 'both' and cast to integer to map True/False to 1/0:
from itertools import product
df1 = pd.DataFrame(product(df['colA'].unique(), df['colB'].unique()), columns=['colA','colB'])
df = df1.merge(df, how='left', indicator=True).rename(columns={'_merge':'colC'})
df['colC'] = df['colC'].eq('both').astype(int)
print (df)
colA colB colC
0 A A1 1
1 A C1 0
2 A B1 1
3 B A1 1
4 B C1 1
5 B B1 0
Finally, if necessary, sort by both columns with DataFrame.sort_values:
df1 = df1.sort_values(['colA','colB'])
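For reference, a compact sketch with pd.crosstab yields the same full grid with a 1/0 indicator (not one of the approaches above, just an assumption-free restatement of the idea):
out = (pd.crosstab(df['colA'], df['colB'])  # counts over the full colA x colB grid
         .gt(0).astype(int)                 # 1 if the pair occurs, else 0
         .stack()
         .reset_index(name='colC'))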
I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row whose B value starts with "a" is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask and use it for selection:
import numpy as np

mask = df.B.str.startswith("a")
mask[~mask] = np.nan
df[mask.fillna(method='ffill').fillna(0).astype(int) == 1]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
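Like the slicing approach above, this assumes at least one value starts with "a"; on an all-False mask, idxmax returns the first label and the slice would keep everything. A guarded sketch:
m = df.B.str.startswith('a')
out = df.loc[m.idxmax():] if m.any() else df.iloc[0:0]  # empty frame when nothing matches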
If I understand your question correctly, here is how you do it:
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# value in column A of the first row whose B starts with "a"
index = df[df.B.str.startswith("a")].values.tolist()[0][0]
desired_df = pd.concat([df.A[index-1:], df.B[index-1:]], axis=1)
print(desired_df)
and you get the rows from the first "a" entry in B onward, i.e. the desired output shown above.