Add a string into a URL in Python - python

I am trying to add a string in the middle of a URL. Somehow my output looks like this:
http://www.Holiday.com/('Woman',)/Beach
http://www.Holiday.com/('Men',)/Beach
Somehow it should look like this:
http://www.Holiday.com/Woman/Beach
http://www.Holiday.com/Men/Beach
The code which I am using looks like the following:
list = {'Woman','Men'}
url_test = 'http://www.Holiday.com/{}/Beach'
for i in zip(list):
    url = url_test.format(str(i))
    print(url)

Almost there. Just no need for zip:
items = {'Woman','Men'} # notice that this is a `set` and not a list
url_test = 'http://www.Holiday.com/{}/Beach'
for i in items:
    url = url_test.format(i)
    print(url)
The purpose of the zip function is to join several collections by the index of each item. When zip joins the values from each collection, it places them in a tuple, whose __str__ representation is exactly what you got.
Here you just want to iterate over the items in the collection.
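For illustration, a minimal sketch of what zip does with a single collection, which is where the ('Woman',) in your output comes from:
items = {'Woman', 'Men'}
for i in zip(items):       # zip over a single iterable still yields 1-tuples
    print(str(i))          # ('Woman',)  <- this tuple string is what ends up in the URL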

You can also try this. And please don't use list as a variable name.
lst = {'Woman','Men'}
url_test = 'http://www.Holiday.com/%s/Beach'
for i in lst:
    url = url_test % i
    print(url)

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

url = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
html = urlopen(url)
url_list = BS(html, 'lxml')
type(url_list)

all_links = url_list.find_all('a', href=re.compile("/title/tt"))
for link in all_links:
    print(link.get("href"))
    all_urls = link.get("href")

url_test = 'http://www.imdb.com/{}/'
for i in all_urls:
    urls = url_test.format(i)
    print(urls)
This is the code to scrape the URLs of all 250 movies from the main URL, but the code gives results like this:
http://www.imdb.com///
http://www.imdb.com/t/
http://www.imdb.com/i/
http://www.imdb.com/t/
http://www.imdb.com/l/
http://www.imdb.com/e/
http://www.imdb.com///
and so on ...
How can I split 'all_urls' using a comma, or how can I make a list of the URLs in 'all_urls'?
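A minimal sketch of one way to get a list of full URLs, assuming all_links from the code above (in the loop above, all_urls keeps only the last href, and iterating over a string yields single characters):
all_urls = [link.get("href") for link in all_links]   # collect every href into a list

url_test = 'http://www.imdb.com{}'
for href in all_urls:
    print(url_test.format(href))                      # one full link per movie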

Related

python - Retrieve and save links from webpage but only one per domain

I'm having a bit of trouble trying to save the links from a website into a list without repeating URLs with the same domain.
Example:
www.python.org/download and www.python.org/about
should only save the first one (www.python.org/download) and not repeat it later
This is what I've got so far:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urlparse
url = "https://docs.python.org/3/library/urllib.request.html#module-urllib.request"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
atag = doc.find_all('a', href=True)
links = []
#below should be some kind of for loop
As a one-liner:
links = {nl for a in doc.find_all('a', href=True) if (nl := urlparse(a["href"]).netloc) != ""}
Explained:
links = set()                            # define empty set
for a in doc.find_all('a', href=True):   # loop over every <a> element
    nl = urlparse(a["href"]).netloc      # get netloc from url
    if nl:
        links.add(nl)                    # add to set if it exists
output:
{'www.w3.org', 'datatracker.ietf.org', 'www.python.org', 'requests.readthedocs.io', 'github.com', 'www.sphinx-doc.org'}

How to get the href value of a link with bs4?

I need help extracting all the match URLs from 2020/21 from this [website][1] so that I can scrape them.
I am sending a request to this link.
The section of the HTML that I want to retrieve is this part:
Here's the code that I am using:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse
website = 'https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results'
response = requests.get(website)
soup = BeautifulSoup(response.content,'html.parser')
match_result = soup.find_all('a',{'class':'match-info-link-FIXTURES'});
soup.get('href')
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))

url_joined = []
for link_2 in url_part_2:
    url_joined.append(urllib.parse.urljoin(url_part_1, link_2))
first_link = url_joined[0]
match_url = soup.find_all('div',{'class':'link-container border-bottom'});
soup.get('href')
url_part_3 = 'https://www.espncricinfo.com/'
url_part_4 = []
for item in match_result:
    url_part_4.append(item.get('href'))
print(url_part_4)
[1]: https://www.espncricinfo.com/series/ipl-2020-21-1210595/match-results
You don't need the second set of find_all() calls and the second for item in match_result: loop below, since you already have the tags with the hrefs from the first find_all().
You can get the href with item.get('href').
You can do:
url_part_1 = 'https://www.espncricinfo.com/'
url_part_2 = []
for item in match_result:
    url_part_2.append(item.get('href'))
The result will look something like:
['/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-final-1237181/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-sunrisers-hyderabad-qualifier-2-1237180/full-scorecard',
'/series/ipl-2020-21-1210595/royal-challengers-bangalore-vs-sunrisers-hyderabad-eliminator-1237178/full-scorecard',
'/series/ipl-2020-21-1210595/delhi-capitals-vs-mumbai-indians-qualifier-1-1237177/full-scorecard',
'/series/ipl-2020-21-1210595/sunrisers-hyderabad-vs-mumbai-indians-56th-match-1216495/full-scorecard',
...
]
From the official docs:
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
Try
soup.find_all("a", class_="match-info-link-FIXTURES")

Using startswith function to filter a list of urls

I have the following piece of code which extracts all links from a page and puts them in a list (links=[]), which is then passed to the function filter_links().
I wish to filter out any links that are not from the same domain as the starting link, aka the first link in the list. This is what I have:
import requests
from bs4 import BeautifulSoup
import re
start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])
def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links
print(filter_links(links))
I have used the built-in startswith function, but it's filtering out everything except the starting URL.
Eventually I want to pass several different start URLs through this program, so I need a generic way of filtering URLs that are within the same domain as the starting URL. I think I could use regex, but shouldn't this function work too?
Try this:
import requests
from bs4 import BeautifulSoup
import re
import tldextract
start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links

print(filter_links(links))
Note:
You need to get that return statement out of the for loop. It returns after iterating over just one element, so only the first item of the list ever ends up in the result.
Use the tldextract module to better extract the domain name from the URLs (see the sketch after the output below). Whether you also want to explicitly check that the links start with links[0] is up to you.
Output :
['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
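For reference, a quick sketch of what tldextract.extract() returns for the start URL, which is why ext.domain is used above (expected values shown as comments):
import tldextract

ext = tldextract.extract("http://www.enzymebiosystems.org/")
print(ext.subdomain)   # www
print(ext.domain)      # enzymebiosystems
print(ext.suffix)      # org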
Okay, so you made an indentation error in filter_links(links). The function should be like this:
def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
    return filtered_links
Notice that in your code you kept the return statement inside the for loop, so the loop runs only once and then returns the list.
Hope this helps :)
Possible Solution
What if you kept all links which 'contain' the domain?
For example:
import pandas as pd
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])
all_links = pd.DataFrame(links, columns=["Links"])
enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]
# results in a dataframe with links containing "enzymebiosystems".
If you want to search multiple domains, see this answer
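For instance, a rough sketch of the multiple-domain case using a regex alternation with str.contains (the domain list is made up for illustration):
domains = ["enzymebiosystems", "python"]    # hypothetical domains to keep
pattern = "|".join(domains)                 # str.contains treats this as a regex
multi_df = all_links[all_links.Links.str.contains(pattern)]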

BeautifulSoup: Extracting string between tags does not seem to work

I am currently parsing this URL. The URL will be the argument of the parse function.
def parse(sitemap):
    req = urllib.request.urlopen(sitemap)
    soup = BeautifulSoup(req, 'lxml')
    soup.prettify()

    inventory_url = []
    inventory_url_set = set()

    for item in soup.find_all('url'):
        print(item.find('lastmod'))
        # print(item.find('lastmod').text)
        inventory_url_set.add(item.find('loc').text)
However, item.find('lastmod').text raises an AttributeError, whereas if I print the whole tag item.find('lastmod') it works fine.
I'd like to obtain only the text between the 'lastmod' tags within each 'url' item.
Thanks
Not all of the url entries contain a lastmod, so you need to test for that. If you use a dictionary, you could store the lastmod as values and still benefit from having unique URLs as follows:
from bs4 import BeautifulSoup
import urllib.request
def parse(sitemap):
    req = urllib.request.urlopen(sitemap)
    soup = BeautifulSoup(req, 'lxml')
    inventory_urls = {}

    for url in soup.find_all('url'):
        if url.lastmod:
            lastmod = url.lastmod.text
        else:
            lastmod = None
        inventory_urls[url.loc.text] = lastmod

    for url, lastmod in inventory_urls.items():
        print(lastmod, url)

parse("https://www.kith.com/sitemap_products_1.xml")
This would give you a list starting as follows:
2017-02-12T03:55:25Z https://kith.com/products/adidas-originals-stan-smith-wool-pk-grey-white
2017-03-13T18:55:24Z https://kith.com/products/norse-projects-niels-pocket-boucle-tee-black
2017-03-15T17:20:47Z https://kith.com/products/ronnie-fieg-x-fracap-rf120-rust
2017-03-17T01:30:25Z https://kith.com/products/new-balance-696-birch
2017-01-23T08:43:56Z https://kith.com/products/ronnie-fieg-x-diamond-supply-co-x-asics-gel-lyte-v-1
2017-03-17T00:41:03Z https://kith.com/products/off-white-diagonal-ferns-hoodie-black
2017-03-16T15:01:55Z https://kith.com/products/norse-projects-skagen-bubble-crewneck-charcoal
2017-02-21T15:57:56Z https://kith.com/products/vasque-eriksson-gtx-brown-black

How to extract data from all urls, not just the first

This script is generating a CSV with the data from only one of the URLs fed into it. There are meant to be 98 sets of results, however the for loop isn't getting past the first URL.
I've been working on this for 12+ hours today; what am I missing in order to get the correct results?
import requests
import re
from bs4 import BeautifulSoup
import csv
#Read csv
csvfile = open("gyms4.csv")
csvfilelist = csvfile.read()
def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup
        print(r.text)
with open("gyms4.csv") as url_file:
for page in get_page_data(url_file):
name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
th = pages.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
if match:
web_address = link.text
gyms = [name,address,phoneNum,email,web_address]
gyms.append(gyms)
#Saving specific listing data to csv
with open ("xgyms.csv", "w") as file:
writer = csv.writer(file)
for row in gyms:
writer.writerow([row])
You have three for-loops in your code and do not specify which one causes the problem. I assume it is the one in the get_page_data() function.
You leave the loop in the very first run because of the return statement. That is why you never get to the second URL.
There are at least two possible solutions (see the sketch after this list for the first one):
Append the parsed soup of every URL to a list and return that list.
Move your processing code into the loop and append the parsed data to gyms inside the loop.
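A minimal sketch of the first option, assuming the surrounding script stays the same (the list is returned only after every URL has been fetched):
def get_page_data(urls):
    soups = []
    for url in urls:
        r = requests.get(url.strip())
        soups.append(BeautifulSoup(r.text, 'html.parser'))
    return soups   # returned after the loop, so all URLs are processed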
As Alex.S said, get_page_data() returns on the first iteration, hence subsequent URLs are never accessed. Furthermore, the code that extracts data from the page needs to be executed for each page downloaded, so it needs to be in a loop too. You could turn get_page_data() into a generator and then iterate over the pages like this:
def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup  # N.B. use yield instead of return

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        # etc. etc.
You can write the data to the CSV file as each page is downloaded and processed, or you can accumulate the data into a list and write it in one go with csv.writer.writerows().
Also you should pass the URL list to get_page_data() rather than accessing it from a global variable.
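For illustration, a rough sketch of the accumulate-then-write variant, building on the generator above (only two fields shown; the all_gyms name is made up here):
all_gyms = []
with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span", {"class": "wlt_shortcode_TITLE"}).text
        address = page.find("span", {"class": "wlt_shortcode_map_location"}).text
        all_gyms.append([name, address])       # one row per page

with open("xgyms.csv", "w", newline="") as f:
    csv.writer(f).writerows(all_gyms)          # write every row at once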
