assign column names to the output of Series.str.extract() - python

I'm using
df[colname].str.extract(regex)
to parse a column of strings into several columns. I'd like to be able to assign the column names at the same time, something like:
df[colname].str.extract(regex, columns=cnames)
where:
cnames = ['col1','col2','col3']
regex = r'(sometext\w)_(aa|bb)_(\d+-\d)'
It's possible with a clunky construction like:
df[colname].str.extract(regex).rename(columns = dict(zip(range(len(cnames)),cnames)))
Or else I could embed the column names in the regex as named groups, so the regex changes to:
regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'
Am I missing something here, or is there a simpler way?
Thanks

Embedding the names into the regex as named groups is a correct way of doing this; it is what the documentation recommends.
Your first solution using .rename() would not be robust if you already had some columns named 0, 1 and 2.
In my opinion the regex solution is the best, but you could use something like .pipe() to wrap this in a function. However, as you will see, it starts to get messy when you do not always want the same regex.
def extract_colnames(df, column, sep, cnames, drop_col=True):
    if drop_col:
        drop_col = [column]
    else:
        drop_col = []
    regex = '(?P<' + ('>.*)' + sep + '(?P<').join(cnames) + '>.*)'
    return df.join(df.loc[:, column].str.extract(regex, expand=True)).drop(drop_col, axis=1)
cnames = ['col1', 'col2', 'col3']

data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=cnames, drop_col=True)
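For reference, here is a minimal sketch of the named-group approach; the regex and the column names (col1, col2, col3) are taken from the question, while the sample data is made up:

import pandas as pd

df = pd.DataFrame({'colname': ['sometextA_aa_12-3', 'sometextB_bb_4-5']})
regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'

# str.extract names the output columns after the named groups
extracted = df['colname'].str.extract(regex)
print(extracted.columns.tolist())  # ['col1', 'col2', 'col3']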

Related

Is there a way in Python/Pandas to use a generic variable name with a wildcard to select all similar columns?

In Stata, if I typed Week_*, it would select all columns Week_1, Week_2, etc. Is there a similar way to do this in Python/Pandas?
Here is a code example; the last line shows what I want to do.
# One-hot encode Week: create variables Week_1, Week_2, ... etc.
dt_temp0 = dt_temp0.join(pd.get_dummies(dt_temp0['Week'], prefix='Week'))

# Features to use
feat_cols = ['lag2_tfk_total', 'lag3_tfk_total', 'lag2_Trips_pp', 'lag3_Trips_pp',
             'ClinicID_fac', 'Week_*']
x_train = dt_temp1.loc[dt_temp1['train'] == 1, feat_cols]
You could select your week columns with a list comprehension:
week_cols = [col for col in dt_temp1.columns if col.startswith('Week_')]
feat_cols = ['lag2_tfk_total', 'lag3_tfk_total', 'lag2_Trips_pp', 'lag3_Trips_pp',
             'ClinicID_fac', *week_cols]
You can combine these into one line if you want.
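For instance, a possible one-line version (dt_temp1 is the frame from the question):

feat_cols = ['lag2_tfk_total', 'lag3_tfk_total', 'lag2_Trips_pp', 'lag3_Trips_pp',
             'ClinicID_fac',
             *[col for col in dt_temp1.columns if col.startswith('Week_')]]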
I actually found another way to do this as well, using filter(). Then you just have to concatenate the lists of column names together. Thanks for all the help!
week_cols = dt_temp0.filter(regex='Week_').columns.tolist()
feat_cols = ['ClinicID_fac', 'lag2_tfk_total', 'lag3_tfk_total', 'lag2_Trips_pp', 'lag3_Trips_pp'] + week_cols

Pandas apply multiple functions with a list

I have a df with a 'File_name' column that contains file-name strings I would like to parse:
data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df = pd.DataFrame(data, columns=['File_name', 'Result'])
df.head()
Now I would like to create a new column by splitting the file name on the '_' and '-' delimiters and then searching the resulting list for the chunk that can be converted to a datetime object. The naming convention is not always the same (the order differs, so I cannot rely on character positions), and the code should include a "try" conversion to datetime, since the piece of the string that should hold the date is often in the wrong format or missing.
I came up with the following, but it does not look very Pythonic to me:
# Solution #1
import datetime as dt

for i, value in df['File_name'].iteritems():
    chunks = value.split('-') + value.split('_')
    for chunk in chunks:
        try:
            df.loc[i, 'Date_Sol#1'] = dt.datetime.strptime(chunk, '%d%b%Y')
        except:
            pass
df.head()
Alternatively, I was trying to use the apply method with two functions, but I cannot think of a way to chain the two functions together with the try/pass logic, and I did not manage to get it working:
# Solution #2
import re

splitme = lambda x: re.split('_|-', x)
calcdate = lambda x: dt.datetime.strptime(x, '%d%b%Y')
df['t1'] = df['File_name'].apply(splitme)
df['Date_Sol#2'] = df['t1'].apply(lambda x: calcdate(x) for x in df['t1'] if isinstance(calcdate(x), dt.datetime) else Pass)
df.head()
I thought a list comprehension might help. Any ideas on what Solution #2 could look like?
Thanks in advance.
Assuming you want to extract and convert the possible chunks as date, you could split the string on delimiters, explode to multiple rows and attempt to convert to date with pandas.to_datetime:
df.join(
    pd.to_datetime(df['File_name'].str.split(r'[_-]').explode(), errors='coerce')
      .dropna()
      .rename('Date')
)
output:
                 File_name  Result       Date
0   f1h3_13oct2021_gt1.csv     2.0 2021-10-13
1  p8-gfr-20dec2021-81.csv     0.5 2021-12-20
NB. if you have potentially many dates per string, you need to add a further step to select the one you want. Please give more details if this is the case.
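If that applies, one possible extra step (a sketch, assuming you simply want to keep the first date found in each string) is to group the exploded results back by the original index:

dates = (pd.to_datetime(df['File_name'].str.split(r'[_-]').explode(), errors='coerce')
         .dropna()
         .groupby(level=0).first()   # keep the first parsed date per original row
         .rename('Date'))
df.join(dates)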
A plain-Python version for older pandas:
import re

s = pd.Series([next(iter(pd.to_datetime(re.split(r'[._-]', s), errors='coerce')
                         .dropna()), float('nan'))
               for s in df['File_name']], index=df.index, name='date')
df.join(s)

Remove all characters except alphabet in column rows

Let's say I have a dataset, and in some columns of this dataset I have lists. The first key problem is that there are many columns with such lists, where the strings can be separated by ';' or ';;', and the string itself can start with whitespace or even ';'.
For some cases of this problem I implemented this function:
g = [';', '']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if x in g:
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
Actually it worked, but the problem is that it returned duplicated rows and was not able to handle other separators.
Another problem is using dummies after I changed the way column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything from the lists except alphabetic characters. I would like to understand how to implement such a function, because the same string value can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
I'm not sure if I understood you correctly, but to remove all non-alphanumeric characters you can use a simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for a pandas Series.
df = pd.DataFrame({'names': ['name-Jack', 'name(Jack)']})
df
#         names
# 0   name-Jack
# 1  name(Jack)
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
#       names
# 0  nameJack
# 1  nameJack
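If the goal is then to feed the cleaned values into get_dummies, as in the question, a possible sketch (assuming column_1 holds lists of strings; the sample data is made up):

import re
import pandas as pd

data_a = pd.DataFrame({'column_1': [['name-Jack', 'name(Jill)'], ['name(Jack)']]})

# strip everything except letters, digits and underscores from each list element
cleaned = data_a['column_1'].apply(lambda lst: [re.sub(r'\W+', '', x) for x in lst])

# one row per element, then collapse back to one row per original index
dummies = pd.get_dummies(cleaned.explode()).groupby(level=0).sum()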

Filtering websites based on specific string

I am currently working on an analysis of URLs and want to find URLs which match a specific word. Those URLs are in a pandas DataFrame column and I want to filter for specific words in the title of the URL.
What I did so far:
data['new'] = data['SOURCEURL'].str.extract("(" + "|".join(filter3) + ")", expand=False)
The problem with this is that the filter I apply is an abbreviation ('ecb'), which also often appears at the end of a link, for example:
http://www.ntnews.com.au/news/national/senate-president-stephen-parry-believes-he-is-a-british-citizen/news-story/b2d3a3442544937f85508135401a3f84?nk=f19e52d2acd9588ecb494c03f21fed8c-1509598074
Here 'ecb' appears in the last '/'-section. How can I filter only for 'ecb' occurrences that appear in a text-like surrounding, something like www.xyz.com/news/national/ecb-press-release/b2dse332313, and not pick up the occurrence of 'ecb' inside a hash or something similar as above? Is this possible in an easy way?
Thanks a lot!
Perhaps you could split the URL into words and filter out all words that are not in an English dictionary? For example using PyEnchant:
import enchant
d = enchant.Dict("en_US")
filtered_words = [x for x in words if d.check(x)]
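To get the words list in the first place, one possible approach (a sketch; splitting on non-word characters is an assumption) would be:

import re

url = 'http://www.xyz.com/news/national/ecb-press-release/b2dse332313'
words = [w for w in re.split(r'\W+', url) if w]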
One easy solution is to check only the part of the string before the last /:
df = pd.DataFrame({'SOURCEURL': ['http://au/news/nat/cit/news-story/b2ecb',
                                 'http://au/news/nat/cit/news-story/b2d88ecb494']})
print(df)
                                       SOURCEURL
0        http://au/news/nat/cit/news-story/b2ecb
1  http://au/news/nat/cit/news-story/b2d88ecb494
filter3 = ['ecb']
df['new'] = (df['SOURCEURL'].str.rsplit('/', 1).str[0]
                            .str.extract("(" + "|".join(filter3) + ")", expand=False))
Another similar solution:
filter3 = ['ecb']
df['new'] = (df['SOURCEURL'].str.extract('(.*)/', expand=False)
                            .str.extract("(" + "|".join(filter3) + ")", expand=False))
print(df)
                                       SOURCEURL  new
0        http://au/news/nat/cit/news-story/b2ecb  NaN
1  http://au/news/nat/cit/news-story/b2d88ecb494  NaN
Another possible approach: you're probably looking to exclude parameters passed at the end of the URL, which I believe is the only place you'd see a ? or an =.
In this case you can evaluate each split section of the URL as True/False and take the boolean of the sum.
validation = bool(sum([True if sub in x and '?' not in x and '=' not in x else False
                       for x in url.split('/')]))
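As a rough usage sketch of this idea applied to the question's DataFrame (the value of sub, the column name and the helper name are assumptions for illustration):

def contains_term_outside_params(url, sub='ecb'):
    # True if some path segment contains the term and is not a query/parameter segment
    return bool(sum([True if sub in x and '?' not in x and '=' not in x else False
                     for x in url.split('/')]))

data['new'] = data['SOURCEURL'].apply(contains_term_outside_params)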

Find String Pattern Match in Pandas Dataframe and Return Matched String

I have a DataFrame column with variable comma-separated text, and I am just trying to extract the values that match another list. My DataFrame looks like this:
col1 | col2
-----------
 x   | a,b
import re

listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)

def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern
    else:
        return x

# can also use col2.str.contains(pattern) for the same results
The above filtering works great, but instead of returning b when it finds a match it returns the whole pattern, such as c|d|f|b, whereas I want to create another column with the specific match it finds, such as b.
Here is my final function, but I am still getting a warning I wish I could solve: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
def matching_func(file1, file2):
    file1 = pd.read_csv(fin)
    file2 = pd.read_excel(fin1, 0, skiprows=1)
    pattern = '|'.join(file1[col1].tolist())
    file2['new_col'] = file2[col1].map(lambda x: re.search(pattern, x).group()
                                       if re.search(pattern, x) else None)
I think I understand how pandas extract works now, but I am probably still rusty on regex. How do I create a pattern variable to use in the example below:
df[col1].str.extract('(word1|word2)')
Instead of having the words hard-coded in the argument, I want to create a variable such as pattern = 'word1|word2', but that won't work because of the way the string is being created.
My final and preferred version, using the vectorized string method in pandas 0.13:
Using values from one column to extract from a second column:
df[col1].str.extract('({})'.format('|'.join(df[col2])))
You might like to use extract, or one of the other vectorised string methods:
In [11]: s = pd.Series(['a', 'a,b'])

In [12]: s.str.extract('([cdfb])')
Out[12]:
0    NaN
1      b
dtype: object
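To use a variable pattern, as asked above, the pattern string just needs to be wrapped in a capture group when it is passed to str.extract; a small sketch using the list from the question:

listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)

s = pd.Series(['a', 'a,b'])
s.str.extract('(' + pattern + ')', expand=False)
# 0    NaN
# 1      b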
