I have a column in my dataframe for articles that looks like this:
id link
1 https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
2 https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg
3 other link
For example, the first two URLs appear to be the same but differ here:
d-un-deal
In my dataframe some links are almost identical: the content is the same but the link changes slightly. Sometimes the difference is a single letter being uppercase in one of the links, or just one other character.
Example:
url1 = https://site/presidency...
url2 = https://site/Presidency...
url3 = https://site/news-of-today
url4 = same as url3 but with ?autoplay appended at the end
How can I check all the links and delete the duplicates (same content, slightly different link)?
Here is one solution: compute a similarity metric between pairs of links and treat pairs above a chosen threshold as duplicates. Decide which similarity measure fits your data; Levenshtein (edit) distance and the ratio from the standard library's difflib.SequenceMatcher are common choices.
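A minimal sketch using difflib.SequenceMatcher from the standard library; the column names mirror the example above, and the 0.9 threshold is an assumption you would tune on your own data:

import difflib
import pandas as pd

# toy frame mirroring the example above; column names are assumptions
df = pd.DataFrame({
    'id': [1, 2, 3],
    'link': [
        'https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-dun-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg',
        'https://www.msn.com/rachat-de-soufflet-par-invivo-les-secrets-d-un-deal-%C3%A0-2-milliards-deuros/ar-BB1cKCRg',
        'https://site/news-of-today',
    ],
})

def is_near_duplicate(a, b, threshold=0.9):
    # ratio() is 1.0 for identical strings; lowercasing also catches
    # case-only differences like presidency vs Presidency
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# keep the first link of each near-duplicate group (O(n^2), fine for small frames)
kept = []
for link in df['link']:
    if not any(is_near_duplicate(link, k) for k in kept):
        kept.append(link)

deduped = df[df['link'].isin(kept)]
print(deduped)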
I'm new to web scraping and BeautifulSoup. I'm making a currency converter using a site. I use this code to pull the currency rate:
import requests
from bs4 import BeautifulSoup
from_ = input("WHICH CURRENCY DO YOU WANT TO CONVERT: ").upper()
to = input("WHICH CURRENCY DO YOU WANT TO CONVERT TO: ").upper()
url = requests.get(f'https://www.xe.com/currencyconverter/convert/?Amount=1&From={from_}&To={to}').text
soup = BeautifulSoup(url, 'lxml')
currency = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod').getText()
print(currency)
This is okay, but it returns the full text (e.g. 0.84311378 Euros). I want to pull only the numbers marked in red in the picture:
Since the number will always be the first element of this tag, an easy way could be:
currency_tag = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod')
print(next(iter(currency_tag)))
And the result:
0.84
You can also use .contents and get the first item from it.
currency = soup.find('p', class_ = 'result__BigRate-sc-1bsijpp-1 iGrAod').contents
print(currency[0].strip())
0.84
From what I can see, the string you highlighted in the picture is the first four characters of the resulting price.
This means that any time you convert one currency to another, the part marked in red will always be a string of length 4.
We can pull the information you need by getting a substring of the paragraph’s text. Just replace the last line you provided with:
print(currency[0:4])
As long as the rate keeps that shape, this will return exactly the characters you are looking for.
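That said, the fixed slice assumes the rate always starts with the same number of digits; a rate like 112.38 would come back as "112.". If you want the rounded value rather than a fixed-width substring, parsing the number is a hedged alternative (the sample text below is the one from the question):

text = "0.84311378 Euros"      # what getText() returns in the example
rate = float(text.split()[0])  # take the leading number, drop the unit
print(f"{rate:.2f}")           # 0.84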
I found this link [and a few others] which talk a bit about using BeautifulSoup to read HTML. It mostly does what I want: it grabs the title of a webpage.
import requests
from bs4 import BeautifulSoup

def get_title(url):
    html = requests.get(url).text
    if len(html) > 0:
        contents = BeautifulSoup(html, 'html.parser')
        title = contents.title.string
        return title
    return None
The issue that I run into is that sometimes articles will come back with metadata attached at the end with " - some_data". A good example is this link to a BBC Sport article which reports the title as
Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport
I could do something simple like cutting off everything after the last ' - ' separator:
title = title.rsplit(' - ', 1)[0]
But that assumes any metadata comes after a ' - ' value. I don't want to assume there will never be an article whose title ends in ' - part_of_title'.
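For illustration, splitting on the spaced delimiter ' - ' behaves like this on the BBC title above, and the second (hypothetical) title shows the false positive the question worries about:

title = "Jack Charlton: 1966 England World Cup winner dies aged 85 - BBC Sport"
print(title.rsplit(' - ', 1)[0])
# Jack Charlton: 1966 England World Cup winner dies aged 85

# but a title that legitimately ends in " - something" is over-trimmed:
print("Desert Island Discs - Part 2".rsplit(' - ', 1)[0])
# Desert Island Discs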
I found the Newspaper3k library, but it's definitely more than I need: all I need is to grab a title and ensure it's the same as what the user posted. My friend who pointed me to Newspaper3k also mentioned it could be buggy and didn't always find titles correctly, so I would be inclined to use something else if possible.
My current thought is to continue using BeautifulSoup and just add fuzzywuzzy on top, which would honestly also help with slight misspellings or punctuation differences. But I would certainly prefer to start from a place that includes comparing against accurate titles.
Here is how reddit handles getting title data.
https://github.com/reddit-archive/reddit/blob/40625dcc070155588d33754ef5b15712c254864b/r2/r2/lib/utils/utils.py#L255
def extract_title(data):
    """Try to extract the page title from a string of HTML.

    An og:title meta tag is preferred, but will fall back to using
    the <title> tag instead if one is not found. If using <title>,
    also attempts to trim off the site's name from the end.
    """
    bs = BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES)
    if not bs or not bs.html.head:
        return
    head_soup = bs.html.head

    title = None

    # try to find an og:title meta tag to use
    og_title = (head_soup.find("meta", attrs={"property": "og:title"}) or
                head_soup.find("meta", attrs={"name": "og:title"}))
    if og_title:
        title = og_title.get("content")

    # if that failed, look for a <title> tag to use instead
    if not title and head_soup.title and head_soup.title.string:
        title = head_soup.title.string

        # remove end part that's likely to be the site's name
        # looks for last delimiter char between spaces in strings
        # delimiters: |, -, emdash, endash,
        # left- and right-pointing double angle quotation marks
        reverse_title = title[::-1]
        to_trim = re.search(u'\s[\u00ab\u00bb\u2013\u2014|-]\s',
                            reverse_title,
                            flags=re.UNICODE)

        # only trim if it won't take off over half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-(to_trim.end())]

    if not title:
        return

    # get rid of extraneous whitespace in the title
    title = re.sub(r'\s+', ' ', title, flags=re.UNICODE)

    return title.encode('utf-8').strip()
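Note that reddit's snippet targets the old BeautifulSoup 3 API (the convertEntities argument no longer exists in bs4), so it won't run as-is on a modern install. A rough, untested adaptation of the same idea to bs4; treat it as a sketch rather than a drop-in replacement:

import re
from bs4 import BeautifulSoup

def extract_title(data):
    """og:title if present, else <title>, with a likely site name trimmed."""
    soup = BeautifulSoup(data, 'html.parser')
    og = (soup.find('meta', attrs={'property': 'og:title'}) or
          soup.find('meta', attrs={'name': 'og:title'}))
    title = og.get('content') if og else None
    if not title and soup.title and soup.title.string:
        title = soup.title.string
        # search the reversed string for the last spaced delimiter
        # (|, -, en/em dash, angle quotation marks)
        to_trim = re.search(r'\s[\u00ab\u00bb\u2013\u2014|-]\s', title[::-1])
        # only trim if it won't take off more than half the title
        if to_trim and to_trim.end() < len(title) / 2:
            title = title[:-to_trim.end()]
    if not title:
        return None
    # collapse extraneous whitespace
    return re.sub(r'\s+', ' ', title).strip()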
I know I can pass a tag list to soup.findAll(), like soup.findAll(['h2', 'h3', 'h4']).
But for two of the tags, I'm only interested in specific ones. In my example:
soup.findAll('h2')[0], soup.findAll('h3')[7:11] and soup.findAll('h4')[:7]
Is there a way to do that or at least to put the specific sliced tags in the same bs4.element.ResultSet?
Thanks!
I finally went with this solution:
# note [:1] rather than [0]: slicing keeps the single h2 in a list, while a
# bare Tag would have its children iterated by the flattening step below
tags = [soup.findAll('h2')[:1], soup.findAll('h3')[7:11], soup.findAll('h4')[:7]]
articles = [i for tag in tags for i in tag]
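Equivalently, if you prefer the standard library, itertools.chain flattens the slices in one pass (reusing soup from the question):

from itertools import chain

# [:1] keeps the single h2 as a list so chain() iterates tags, not children
articles = list(chain(soup.findAll('h2')[:1],
                      soup.findAll('h3')[7:11],
                      soup.findAll('h4')[:7]))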
I'm having a hard time trying to extract an ID number from a string.
I could get it using an index, but that would fail for the other rows of the dataframe.
How do I extract campaignid=351154190 in such a way that it works for all rows?
The only pattern is the word campaignid; I need to extract the value and store it in a new column of the dataframe. Performance is not crucial in this task.
Original string:
https:_utm_source=googlebrand&utm_medium=ppc&utm_campaign=brand&utm_campaignid=351154190&keyword=aihdisadjiajdutm_matchtype=e&device=m&utm_network=g&utm_adposition=1t1&geo=9027258&gclid=CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE&affiliate_id=asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE&utm_content=search&utm_contentid=1251489456158180&placement&extension
Splitting the string:
x = cw.captureurl.str.split('&').str[:-1]
Printing one row:
print(x[25])
['https:_utm_source=googlebrand', 'utm_medium=ppc', 'utm_campaign=brand',
 'utm_campaignid=35119190', 'keyword=co',
 'utm_matchtype=e', 'device=m', 'utm_network=g', 'utm_adposition=1t1',
 'geo=9027258', 'gclid=CjwKCAjwnMTqBRAzEiwAEF3ndo3-CNOsp1VT5OIxm0BuUcSWQEwtJSR5KLiJzrvjjc9FOk033DKW1xoCXlwQAvD_BwE',
 'affiliate_id=CjwKCAjwnMTqBRAzEiwAEF3ndo3-CNOsp1VT5OIxm0BuUcSWQEwtJSR5KLiJzrvjjc9FOk033DKW1xoCXlwQAvD_BwE',
 'utm_content=search', 'utm_contentid=1211732930', 'placement']
It would be great if I could use something that searches for the word "campaignid" (my target)
and then stores it in another column of the same dataframe.
I tried doing a split after the split; it didn't work.
I also tried a for loop with an if statement; that didn't work either.
Use regex:
campaign_id = cw['captureurl'].str.extract(r'campaignid=(\d+)')[0]
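For example, on a toy dataframe (the captureurl column name is taken from the question; the URLs are shortened):

import pandas as pd

cw = pd.DataFrame({'captureurl': [
    'https:_utm_source=googlebrand&utm_campaignid=351154190&device=m',
    'https:_utm_source=googlebrand&utm_campaignid=35119190&device=m',
]})

# str.extract returns one column per capture group; [0] is the first group
cw['campaignid'] = cw['captureurl'].str.extract(r'campaignid=(\d+)')[0]
print(cw['campaignid'])
# 0    351154190
# 1     35119190
# Name: campaignid, dtype: object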
I'd recommend using urllib. In particular, the parse_qs function returns a dictionary of the query-string arguments: https://docs.python.org/3/library/urllib.parse.html
Using your example URL we get:
from urllib.parse import parse_qs
test = 'https:_utm_source=googlebrand&utm_medium=ppc&utm_campaign=brand&utm_campaignid=351154190&keyword=aihdisadjiajdutm_matchtype=e&device=m&utm_network=g&utm_adposition=1t1&geo=9027258&gclid=CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE&affiliate_id=asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE&utm_content=search&utm_contentid=1251489456158180&placement&extension'
print(parse_qs(test))
{'https:_utm_source': ['googlebrand'],
'utm_medium': ['ppc'],
'utm_campaign': ['brand'],
'utm_campaignid': ['351154190'],
'keyword': ['aihdisadjiajdutm_matchtype=e'],
'device': ['m'],
'utm_network': ['g'],
'utm_adposition': ['1t1'],
'geo': ['9027258'],
'gclid': ['CjwKCsadjjsaopdl[psdklksfdosjfidj9FOk033DKW1xoCXlwQAvD_BwE'],
'affiliate_id': ['asdaskdosjadiasjdisaj-asdhasuigdyusagdyusagyk033DKW1xoCXlwQAvD_BwE'],
'utm_content': ['search'],
'utm_contentid': ['1251489456158180']}
To get the campaignids for the entire dataframe, we can use a .apply to get this done:
# After parsing each url's arguments, we extract the first campaignid from the dictionary's list.
df['utm_campaignid'] = df['url'].apply(lambda x: parse_qs(x)['utm_campaignid'][0])
df.head()
url utm_campaignid
0 https:_utm_source=googlebrand&utm_medium=ppc&u... 351154190
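One caveat: indexing with ['utm_campaignid'] raises a KeyError for any row whose URL lacks that parameter. A defensive variant, assuming the same df as above:

# .get falls back to [None] when a row has no utm_campaignid parameter
df['utm_campaignid'] = df['url'].apply(
    lambda x: parse_qs(x).get('utm_campaignid', [None])[0]
)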
I want to extract the "Twitter for iPhone" part from this string.
But I have different values in place of "Twitter for iPhone" across thousands of rows in a dataframe. I only need the values after ">" and before "<" in the following kinds of strings.
I tried df.col.str.extract('(Twitter for iPhone|Twitter for Samsung|Twitter for others)'), which extracts the 'Twitter for iPhone' values but not the others, and the rest are filled with NaNs.
Implementing @CMMCD's comment, this code:
import pandas as pd

a = ["""<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>""",
     """<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for Cats</a>"""]
df = pd.DataFrame(a, columns=['WebLinks'])
df['WebLinks'].str.extract(r"\>(.*?)\<")
returns this result:
0 Twitter for iPhone
1 Twitter for Cats
What's happening is that r"\>(.*?)\<" is a regex that matches anything between the ">" that closes the opening tag and the "<" that starts the closing tag. I wouldn't recommend stripping the tags out of the data for this approach.
If this doesn't work, can you post the code that gave you the NaNs?
Try df.col.str.extract(pat=r'(Twitter for (iPhone|Samsung|others))')
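Note that this pattern has two capture groups, so str.extract returns two columns: column 0 holds the outer group and column 1 the inner alternation. A quick sketch, assuming the WebLinks column from the earlier answer:

extracted = df['WebLinks'].str.extract(r'(Twitter for (iPhone|Samsung|others))')
# extracted[0] -> e.g. "Twitter for iPhone", extracted[1] -> "iPhone"
df['source'] = extracted[0]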
You can use col.str.split() with the regex pattern r'<|>' to get a list of the elements in the column and select the one you want. (Note that this assumes each data element is exactly the string provided.)
import numpy as np
import pandas as pd

twits = ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
         '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for Samsung</a>',
         '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for Others</a>']
ser = pd.Series([np.random.choice(twits, 1)[0] for i in range(10)])
ser.str.split(r'<|>').str[2]
0 Twitter for Samsung
1 Twitter for iPhone
2 Twitter for iPhone
3 Twitter for Others
4 Twitter for iPhone
5 Twitter for Others
6 Twitter for Others
7 Twitter for Samsung
8 Twitter for iPhone
9 Twitter for Others