Extract TLDs, SLDs from a dataframe column into new columns - python

I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I split the column into a list and then expand it with tolist, but since that appends sequentially, the alignment breaks: if a URL has three levels, the mapping gets messed up.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com, pro, fr) and C1 to always contain the SLD.
I am sure there is a better way to do this and would appreciate any pointers.

You can shift the Cx columns:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
    lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr

You can also use a regex with a negative lookahead and pandas' built-in str.split with expand:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
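Another option, a sketch using pandas' str.rsplit (the tld/sld column names here are mine): splitting from the right isolates the last label first, so no shifting or lookahead is needed.

```python
import pandas as pd

df = pd.DataFrame({"C": ["xyz[.]com", "abc123[.]pro", "xyzabc[.]gouv[.]fr"]})

# rsplit splits from the right; n=1 peels off exactly the last label (the TLD),
# regardless of how many labels the domain has. The pattern is a literal string,
# not a regex, so '[.]' needs no escaping here.
parts = df["C"].str.rsplit("[.]", n=1, expand=True)
df["tld"] = parts[1]
# The SLD is the last label of whatever remains to the left of the TLD.
df["sld"] = parts[0].str.rsplit("[.]", n=1).str[-1]
print(df)
```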


Append data frame issue

I have a data frame like this:
ID C1 C2 M1
1  A  B  X
2  A     Y
3  C     W
4  G  H  Z
Result wanted:
ID C
1  A
1  B
2  A
3  C
4  G
4  H
The main problem is that today's dataset has C1 and C2, but tomorrow it could have C1, C2, C3 ... Cn.
The filename will be provided, and my task is to read it and produce the result regardless of how many C-related columns the file has. Column M1 is not needed.
What I tried:
df = pd.read_csv(r"C:\Users\JIRAdata_TEST.csv")
df = df.filter(regex='ID|C')
print(df)
This returns all ID and C-related columns and removes the M1 column as part of data cleanup -- don't know if that helps.
Then... I am stuck!
Use df.melt with df.dropna:
In [1295]: x = df.filter(regex='ID|C').melt('ID', value_name='C').sort_values('ID').dropna().drop(columns='variable')
In [1296]: x
Out[1296]:
ID C
0 1 A
4 1 B
5 2 A
2 3 C
3 4 G
7 4 H
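For reference, a self-contained sketch of the same melt pipeline (the sample frame mirrors the table above; kind="stable" keeps the C1-before-C2 order within each ID):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "C1": ["A", "A", "C", "G"],
    "C2": ["B", np.nan, np.nan, "H"],
    "M1": ["X", "Y", "W", "Z"],
})

out = (df.filter(regex="ID|C")          # keep ID and all C* columns, drop M1
         .melt("ID", value_name="C")    # wide -> long
         .dropna()                      # drop the missing C2 entries
         .drop(columns="variable")
         .sort_values("ID", kind="stable")
         .reset_index(drop=True))
print(out)
```

Because filter selects by regex, this works no matter how many C1...Cn columns tomorrow's file has.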

How to sort dataframe based on column whose entries consist of letters and numbers?

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame(
{
'pos': ['A1', 'B03', 'A2', 'B01', 'A3', 'B02'],
'ignore': range(6)
}
)
pos ignore
0 A1 0
1 B03 1
2 A2 2
3 B01 3
4 A3 4
5 B02 5
Which I would like to sort according to pos whereby
it should be first sorted according to the number and then according to the letter and
leading 0s should be ignored,
so the desired outcome is
pos ignore
0 A1 0
1 B01 3
2 A2 2
3 B02 5
4 A3 4
5 B03 1
I currently do it like this:
df[['let', 'num']] = df['pos'].str.extract(
'([A-Za-z]+)([0-9]+)'
)
df['num'] = df['num'].astype(int)
df = (
df.sort_values(['num', 'let'])
.drop(['let', 'num'], axis=1)
.reset_index(drop=True)
)
That works, but what I don't like is that I need two temporary columns I later have to drop again. Is there a more straightforward way of doing it?
You can use argsort with zfill and first sort on the numbers as 01, 02, 03 etc. This way you don't have to assign / drop columns:
val = df['pos'].str.extract(r'(\D+)(\d+)')
df.loc[(val[1].str.zfill(2) + val[0]).argsort()]
pos ignore
0 A1 0
3 B01 3
2 A2 2
5 B02 5
4 A3 4
1 B03 1
Here's one way:
import re
def extract_parts(x):
    groups = re.match('([A-Za-z]+)([0-9]+)', x)
    return (int(groups[2]), groups[1])
df.reindex(df.pos.transform(extract_parts).sort_values().index).reset_index(drop=True)
Output
Out[1]:
pos ignore
0 A1 0
1 B01 3
2 A2 2
3 B02 5
4 A3 4
5 B03 1
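If you are on pandas >= 1.1, sort_values accepts a key callable, which avoids the temporary columns entirely. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"pos": ["A1", "B03", "A2", "B01", "A3", "B02"],
                   "ignore": range(6)})

# Build the sort key on the fly: zero-pad the numeric part, then append the
# letter, so rows sort by number first and letter second (leading 0s ignored).
out = df.sort_values(
    "pos",
    key=lambda s: s.str.extract(r"([A-Za-z]+)(\d+)")
                   .pipe(lambda p: p[1].str.zfill(4) + p[0]),
).reset_index(drop=True)
print(out)
```

The key callable receives the pos Series and must return a same-length Series; here that is the padded-number-plus-letter string.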

Faster copying of pandas data with some conditions

I have a dataframe(df_main) into which I want to copy the data based on finding the necessary columns from another dataframe(df_data).
df_data
name Index par_1 par_2 ... par_n
0 A1 1 a0 b0
1 A1 2 a1
2 A1 3 a2
3 A1 4 a3
4 A2 2 a4
...
df_main
name Index_0 Index_1
0 A1 1 2
1 A1 1 3
2 A1 1 4
3 A1 2 3
4 A1 2 4
5 A1 3 4
...
I want to copy the parameter columns from df_data into df_main with the condition that all the parameters with same name and index in df_data row are copied to the df_main.
I have made the following implementation using for loops which is practically too slow to be used:
def data_copy(df, df_data, indice):
    '''indice: whether Index_0 or Index_1 is being checked'''
    # Get all different names in the dataset to loop over
    names = df['name'].unique()
    for name in tqdm.tqdm(names):
        # Get the unique indexes for a specific name
        indexes = df[df['name'] == name][indice].unique()
        # Looping over all indexes
        for index in indexes:
            # From df_data, get the rows of a specific name and index
            data = df_data[(df_data['Index'] == index) & (df_data['name'] == name)]
            # columns: only the cols of the parameter data
            req_data = data[columns]
            for col in columns:
                # For each col (e.g. par_1, par_2, etc.), get the val of a specific index
                val = df_data.loc[(df_data['Index'] == index) & (df_data['name'] == name), col]
                df.loc[(df[indice] == index) & (df['name'] == name), col] = val[val.index.item()]
    return df

df_main = data_copy(df_main, df_data, 'Index_0')
This gives me what I required as:
df_main
name Index_0 Index_1 par_1 par_2 ...
0 A1 1 2 a0
1 A1 1 3 a0
2 A1 1 4 a0
3 A1 2 3 a1
4 A1 2 4 a1
5 A1 3 4 a2
However, running it on a really big data requires a lot of time. What's the best way of avoiding the for loops for a faster implementation?
For each data frame, you can create a new column that concatenates the name and index. See below:
import pandas as pd
df1 = {'name':['A1','A1'],'index':['1','2'],'par_1':['a0','a1']}
df1 = pd.DataFrame(data=df1)
df1['new'] = df1['name'] + df1['index']
df1
df2 = {'name':['A1','A1'],'index_0':['1','2'],'index_1':['2','3']}
df2 = pd.DataFrame(data=df2)
df2['new'] = df2['name'] + df2['index_0']
df2
for i, row in df1.iterrows():
    df2.loc[df2['new'] == row['new'], 'par_1'] = row['par_1']
df2
Result :
name index_0 index_1 new par_1
0 A1 1 2 A11 a0
1 A1 2 3 A12 a1
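The row-by-row copy can usually be replaced by a single vectorized join. A sketch with made-up sample frames, assuming the lookup is on (name, Index_0):

```python
import pandas as pd

df_data = pd.DataFrame({
    "name": ["A1", "A1", "A1", "A1", "A2"],
    "Index": [1, 2, 3, 4, 2],
    "par_1": ["a0", "a1", "a2", "a3", "a4"],
})
df_main = pd.DataFrame({
    "name": ["A1"] * 6,
    "Index_0": [1, 1, 1, 2, 2, 3],
    "Index_1": [2, 3, 4, 3, 4, 4],
})

# One left join on (name, Index_0) pulls in every parameter column at once,
# replacing the nested Python loops with a single vectorized operation.
out = (df_main.merge(df_data, how="left",
                     left_on=["name", "Index_0"], right_on=["name", "Index"])
              .drop(columns="Index"))
print(out)
```

merge scales far better than iterrows here because the matching happens inside pandas rather than in a Python-level loop.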

Adding a column in dataframes based on similar columns in them

I am trying to get an output where column d of d1 and d2 is added together wherever a, b, c are the same (like a groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
Merging the two data frames and adding the resultant column d where a b c are same.
d1.add(d2) or radd gives me an aggregate of all columns
The solution should be a DataFrame which can be added again to another similarly.
Any help is appreciated.
You can use set_index first:
print(d2.set_index(['a','b','c'])
        .add(d1.set_index(['a','b','c']), fill_value=0)
        .astype(int)
        .reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
df = pd.concat([d1, d2])
df.drop_duplicates()
a b c d
0 1 2 3 4
1 2 3 4 5
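An equivalent one-liner with concat plus groupby, a sketch:

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "b", "c", "d"])
d2 = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]], columns=["a", "b", "c", "d"])

# Stack both frames, then sum d within each (a, b, c) group.
out = pd.concat([d1, d2]).groupby(["a", "b", "c"], as_index=False).sum()
print(out)
```

The result is a plain DataFrame, so it can be added again to another frame the same way.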

Extract all the following rows in pandas

I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row which starts with a is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask, forward-fill it, and use it for selection (assumes import numpy as np):
mask = df.B.str.startswith("a")
mask[~mask] = np.nan
df[mask.ffill().fillna(0).astype(int) == 1]
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.loc[first:]
The latter approach assumes that an "a" is always present.
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
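A further option is cummax on the boolean mask: it flips to True at the first match and stays True, and (unlike idxmax, which falls back to the first position when nothing matches) it returns an empty frame when no row starts with "a".

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 5, 6, 7, 8],
                   "B": ["b0", "a0", "c0", "c1", "a1", "b1", "b2"]})

# cummax turns [False, True, False, ...] into [False, True, True, ...]:
# True from the first "a" row onward.
out = df[df.B.str.startswith("a").cummax()]
print(out)
```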
If I understand your question correctly, here is how you do it:
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# index of the item beginning with a
index = df[df.B.str.startswith("a")].values.tolist()[0][0]
desired_df = pd.concat([df.A[index-1:], df.B[index-1:]], axis=1)
print(desired_df)
and you get:
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2