Filtering websites based on specific string - python

I am currently working on an analysis of URLs and want to find URLs that match a specific word. The URLs are in a pandas DataFrame column, and I want to filter for specific words in the title part of the URL.
What I did so far:
data['new'] = data['SOURCEURL'].str.extract("(" + "|".join(filter3) +")", expand=False)
The problem with this is that the filter I apply is an abbreviation ('ecb') that often also appears at the end of a link:
http://www.ntnews.com.au/news/national/senate-president-stephen-parry-believes-he-is-a-british-citizen/news-story/b2d3a3442544937f85508135401a3f84?nk=f19e52d2acd9588ecb494c03f21fed8c-1509598074
Here 'ecb' occurs in the last '/'-section. How can I filter only for 'ecb' occurrences that appear in a text-like surrounding, something like www.xyz.com/news/national/ecb-press-release/b2dse332313, without also extracting the occurrence of 'ecb' inside a hash like the one above? Is this possible in an easy way?
Thanks a lot!

Perhaps you could split the URL into words and filter out all words that are not in an English dictionary? For example, using PyEnchant:
import enchant
d = enchant.Dict("en_US")
# `words` is the list of tokens obtained by splitting the URL
filtered_words = [x for x in words if d.check(x)]
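A minimal sketch of that idea (assuming pyenchant is installed; the URL is the hypothetical one from the question):
import re
import enchant

d = enchant.Dict("en_US")
url = 'http://www.xyz.com/news/national/ecb-press-release/b2dse332313'
# split on runs of non-word characters to get candidate tokens
words = [w for w in re.split(r'\W+', url) if w]
filtered_words = [x for x in words if d.check(x)]
# dictionary words such as 'news', 'national', 'press' and 'release' pass
# the check, while a hash-like fragment such as 'b2dse332313' does not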

One easy solution is to check only the string before the last /:
df = pd.DataFrame({'SOURCEURL':['http://au/news/nat/cit/news-story/b2ecb',
                                'http://au/news/nat/cit/news-story/b2d88ecb494']})
print (df)
                                       SOURCEURL
0        http://au/news/nat/cit/news-story/b2ecb
1  http://au/news/nat/cit/news-story/b2d88ecb494
filter3 = ['ecb']
df['new'] = (df['SOURCEURL'].str.rsplit('/', 1).str[0]
                            .str.extract("(" + "|".join(filter3) + ")", expand=False))
Another similar solution:
filter3 = ['ecb']
df['new'] = (df['SOURCEURL'].str.extract('(.*)/', expand=False)
                            .str.extract("(" + "|".join(filter3) + ")", expand=False))
print (df)
                                       SOURCEURL  new
0        http://au/news/nat/cit/news-story/b2ecb  NaN
1  http://au/news/nat/cit/news-story/b2d88ecb494  NaN
Both rows yield NaN here because 'ecb' appears only after the last '/', which is exactly the part we want to ignore.

Another possible approach here. You're probably looking to exclude parameters passed at the end of the URL, which I believe is the only place you'd see either a ? or an =.
In this case you can test each '/'-separated section of the URL and combine the results into a single boolean:
# True only if some path segment contains `sub` and has no query-style characters
validation = any(sub in x and '?' not in x and '=' not in x for x in url.split('/'))
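For example, wrapped in a function with hypothetical inputs based on the question's URLs:
def contains_in_text(url, sub):
    # True if any path segment contains `sub` and looks like text rather
    # than a query string or hash
    return any(sub in seg and '?' not in seg and '=' not in seg
               for seg in url.split('/'))

print(contains_in_text('http://www.xyz.com/news/national/ecb-press-release/b2d', 'ecb'))  # True
print(contains_in_text('http://host/story/a3f84?nk=f19e588ecb494c0-1509', 'ecb'))         # False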

Related

Removing specific words within a column in pandas

I am trying to clean the lists within a column of my dataframe from all the terms that do not make sense.
For example:
Col                 New_Col
VM                  ['#']
JS                  ['/', '/UTENTI/', '//utilsit/promo', '/notifiche/']
www.facebook.com    ['https://www.facebook.com/', 'https://twitter.com/']
FA                  ['/nordest/venezia/', '/nordest/treviso/']
I would like to remove from each list (row) in the column all the words that:
- do not start with https, http or //
- contain Col as a substring of an entry in New_Col (for example: www.facebook.com is included in https://www.facebook.com/, so I should remove that entry, no matter that it starts with https)
I tried to write this code:
prefixes = ['http', 'https', '//']
for word in df['New_Col']:
    if word.startswith(prefixes):
        list.remove(word)
print(df['New_Col'])
however it raises
AttributeError: 'list' object has no attribute 'startswith'
I think my code above is treating each row as one list rather than iterating over the words inside each list. Can you please help me understand how to do it?
Use DataFrame.apply on axis=1 along with a custom filter function fx:
import re
# keep only the words that start with http(s) or // and do not contain Col
fx = lambda s: [w for w in s['New_Col'] if s['Col'] not in w and re.match(r'^https?|//', w)]
df['New_Col'] = df.apply(fx, axis=1)
# print(df)
                Col                 New_Col
0                VM                      []
1                JS       [//utilsit/promo]
2  www.facebook.com  [https://twitter.com/]
3                FA                      []
Make a function that removes the words you want using a regular expression, and then apply it to the dataframe column as below:
df['ColName'].apply(func)
Here func is the function that takes each value of the ColName column and returns your required result.
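A hypothetical sketch of such a function, handling only the prefix condition (the Col-substring condition would still need the row-wise apply shown above):
import re

def func(words):
    # keep only the words starting with http, https or //
    return [w for w in words if re.match(r'^(https?|//)', w)]

df['New_Col'] = df['New_Col'].apply(func)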

How to trim string from reverse in Pandas column

I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back, i.e. my resultant value should be AS7878788.
I am doing the below:
newdf = pd.DataFrame(df.COLUMNNAME.str.split('(', 1).tolist(), columns=['col1', 'col2'])
df['newcol'] = newdf['col2'].str[:10]
The above gives the output "12tytyttyt" for this dataframe column, however my intended output is "AS7878788".
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
print(df)
         col
0  AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
                           'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^()]+)\)').str[-1]
this gets us:
         col
0  AS7878788
1  abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\(        # an open bracket
([^()]+)  # one or more characters that are neither an open nor a close bracket
\)        # a close bracket
We can see how this is working if we drop the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^()]+)\)') gives us:
                    col
0  [12tytyt, AS7878788]
1  [abjhsgf, abjsdfvhg]

Remove all characters except alphabet in column rows

Let's say I have a dataset where some columns contain lists. The first key problem is that there are many columns with such lists, where the strings can be separated by ';' or ';;', and a string may even start with whitespace or ';'.
For some cases of this problem I implemented this function:
g = [';', '']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if x in g:
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
Actually it worked, but the problem is that it returned duplicated rows and wasn't able to handle other separators.
Another problem is using dummies after I changed the way column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything from the lists except alphabetic characters. I would like to understand how to implement such a function, because the same string meaning can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
I'm not sure if I understood you well, but to remove all non-alphanumeric characters you can use a simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for pandas Series.
df = pd.DataFrame({'names': ['name-Jack','name(Jack)']})
df
# names
# 0 name-Jack
# 1 name(Jack)
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
# names
# 0 nameJack
# 1 nameJack
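As a hedged aside: \W keeps digits and underscores, so if you truly want to keep only alphabetic characters (as the title says), a character class may be closer:
df['names'] = df['names'].str.replace(r'[^a-zA-Z]+', '', regex=True)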

assign column names to the output of Series.str.extract()

I'm using
df[colname].str.extract(regex)
to parse a column of strings into several columns. I'd like to be able to assign the column names at the same time, something like:
df[colname].str.extract(regex, columns=cnames)
where:
cnames = ['col1','col2','col3']
regex = r'(sometext\w)_(aa|bb)_(\d+-\d)'
It's possible with a clunky construction like:
df[colname].str.extract(regex).rename(columns=dict(zip(range(len(cnames)), cnames)))
Or else I could embed the column names in the regex as named groups, so the regex changes to:
regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'
Am I missing something here? Is there a simpler way?
Thanks!
What you have done with embedding the names into the regex as named groups is a correct way of doing this; it is what the documentation recommends.
Your first solution using .rename() would not be robust if you had some columns with the names 0, 1 and 2 already.
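As a hedged aside, with the original unnamed-group regex the rename can also be avoided by assigning the names directly after extraction:
extracted = df[colname].str.extract(regex)
extracted.columns = cnames  # plain assignment instead of the zip/rename dance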
IMO the regex solution is the best but you could start to use something like .pipe() to implement a function in this way. However, as you will see, it starts to get messy when you do not want the same regex.
def extract_colnames(df, column, sep, cnames, drop_col=True):
    if drop_col:
        drop_col = [column]
    else:
        drop_col = []
    regex = '(?P<' + ('>.*)' + sep + '(?P<').join(cnames) + '>.*)'
    return df.join(df.loc[:, column].str.extract(regex, expand=True)).drop(drop_col, axis=1)

cnames = ['col1', 'col2', 'col3']
data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=cnames, drop_col=True)
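For instance, with hypothetical data shaped like the question's regex expects:
import pandas as pd

data = pd.DataFrame({'colname': ['sometextA_aa_12-3', 'sometextB_bb_4-5']})
data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=['col1', 'col2', 'col3'])
print(data)
#         col1 col2  col3
# 0  sometextA   aa  12-3
# 1  sometextB   bb   4-5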

Find String Pattern Match in Pandas Dataframe and Return Matched String

I have a dataframe column with variable comma-separated text, and I am just trying to extract the values that are found based on another list. So my dataframe looks like this:
col1 | col2
-----------
x    | a,b
listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)

def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern
    else:
        return x

# also can use col2.str.contains(pattern) for same results
The above filtering works great, but when it finds a match it returns the whole pattern, such as a|b, instead of just b, whereas I want to create another column with the pattern it finds, such as b.
Here is my final function, but I am still getting a warning I wish I could solve:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
def matching_func(file1, file2):
    file1 = pd.read_csv(fin)
    file2 = pd.read_excel(fin1, 0, skiprows=1)
    pattern = '|'.join(file1[col1].tolist())
    file2['new_col'] = file2[col1].map(lambda x: re.search(pattern, x).group()
                                       if re.search(pattern, x) else None)
I think I understand how pandas extract works now, but I am probably still rusty on regex. How do I create a pattern variable to use for the example below?
df[col1].str.extract('(word1|word2)')
Instead of having the words in the argument, I want to create a variable such as pattern = 'word1|word2', but that won't work because of the way the string is being created.
My final and preferred version, with a vectorized string method in pandas 0.13, using values from one column to extract from a second column:
df[col1].str.extract('({})'.format('|'.join(df[col2])))
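A hypothetical demo of that final version on a self-contained frame:
import pandas as pd

df = pd.DataFrame({'col1': ['x has b', 'y has d', 'z has q'],
                   'col2': ['b', 'd', 'f']})
pattern = '({})'.format('|'.join(df['col2']))  # '(b|d|f)'
df['new_col'] = df['col1'].str.extract(pattern, expand=False)
print(df)
#       col1 col2 new_col
# 0  x has b    b       b
# 1  y has d    d       d
# 2  z has q    f     NaN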
You might like to use extract, or one of the other vectorised string methods:
In [11]: s = pd.Series(['a', 'a,b'])

In [12]: s.str.extract('([cdfb])')
Out[12]:
0    NaN
1      b
dtype: object
