How do I extract the information from the website? [closed] - python

Closed. This question is opinion-based. It is not currently accepting answers.
I am trying to gather information on all the vessels from this website:
https://www.marinetraffic.com/en/data/?asset_type=vessels&columns=flag,shipname,photo,recognized_next_port,reported_eta,reported_destination,current_port,imo,ship_type,show_on_live_map,time_of_latest_position,lat_of_latest_position,lon_of_latest_position&ship_type_in|in|Cargo%20Vessels|ship_type_in=7
This is my code right now:
import selenium.webdriver as webdriver
url = "https://www.marinetraffic.com/en/data/?asset_type=vessels&columns=flag,shipname,photo,recognized_next_port,reported_eta,reported_destination,current_port,imo,ship_type,show_on_live_map,time_of_latest_position,lat_of_latest_position,lon_of_latest_position&ship_type_in|in|Cargo%20Vessels|ship_type_in=7"
browser = webdriver.Chrome(executable_path=r"C:\Users\CSA\OneDrive - College Sainte-Anne\Programming\PYTHON\Learning\WS\chromedriver_win32 (1)\chromedriver.exe")
browser.get(url)
browser.implicitly_wait(100)
Vessel_link = browser.find_element_by_class_name("ag-cell-content-link")
Vessel_link.click()
browser.implicitly_wait(30)
imo = browser.find_element_by_xpath('//*[@id="imo"]')
print(imo)
My output (screenshot not shown)
I am using Selenium, which isn't going to work because I have several thousand ships to extract data from and it just isn't efficient. Also, I only need to extract information for Cargo Vessels (you can find them using the filter or by looking at the green labels in the vessel type column), and I only need the country name (flag), the IMO and the vessel's name.
What should I use? Selenium, bs4 + requests, or other libraries? And how? I just started web scraping...
I can't get the IMO or anything else! The HTML structure is very weird.
I would appreciate any help. Thank You! :)

Instead of clicking each vessel to open up the details, you can get the information you're searching for from the results page. This will get each vessel, pull the info you wanted and click to the next page if there are more vessels:
import selenium.webdriver as webdriver
url = "https://www.marinetraffic.com/en/data/?asset_type=vessels&columns=flag,shipname,photo,recognized_next_port,reported_eta,reported_destination,current_port,imo,ship_type,show_on_live_map,time_of_latest_position,lat_of_latest_position,lon_of_latest_position&ship_type_in|in|Cargo%20Vessels|ship_type_in=7"
browser = webdriver.Chrome(executable_path=r"C:\Users\CSA\OneDrive - College Sainte-Anne\Programming\PYTHON\Learning\WS\chromedriver_win32 (1)\chromedriver.exe")
browser.get(url)
browser.implicitly_wait(5)

checking_for_vessels = True
vessel_count = 0
while checking_for_vessels:
    # The grid is split into a pinned left section (flag, name) and a scrollable right section (imo, etc.)
    vessel_left_container = browser.find_element_by_class_name('ag-pinned-left-cols-container')
    vessels_left = vessel_left_container.find_elements_by_css_selector('div[role="row"]')
    vessel_right_container = browser.find_element_by_class_name("ag-body-container")
    vessels_right = vessel_right_container.find_elements_by_css_selector('div[role="row"]')
    for i in range(len(vessels_left)):
        vessel_count += 1
        # Some rows have no flag icon, so fall back to 'Unknown'
        vessel_country_list = vessels_left[i].find_elements_by_class_name('flag-icon')
        if len(vessel_country_list) == 0:
            vessel_country = 'Unknown'
        else:
            vessel_country = vessel_country_list[0].get_attribute('title')
        vessel_name = vessels_left[i].find_element_by_class_name('ag-cell-content-link').text
        vessel_imo = vessels_right[i].find_element_by_css_selector('[col-id="imo"] .ag-cell-content div').text
        print('Vessel #' + str(vessel_count) + ': ' + vessel_name + ', ' + vessel_country + ', ' + vessel_imo)
    # Stop when the pagination control reports we are on the last page
    pagination_container = browser.find_element_by_class_name('MuiTablePagination-actions')
    page_number = pagination_container.find_element_by_css_selector('input').get_attribute('value')
    max_page_number = pagination_container.find_element_by_class_name('MuiFormControl-root').get_attribute('max')
    if page_number == max_page_number:
        checking_for_vessels = False
    else:
        next_page_button = pagination_container.find_element_by_css_selector('button[title="Next page"]')
        next_page_button.click()
There was one vessel that did not display a flag, so there's a check for that, and the country is replaced with 'Unknown' if no flag is found. The same kind of check can be done for the vessel name and IMO.
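For example, a minimal sketch of that same fallback pattern as a small helper you could call inside the loop above (the 'Unknown' default and the helper name cell_text are just placeholders):
def cell_text(row, css_selector, default='Unknown'):
    # find_elements returns an empty list instead of raising, so we can fall back to a default
    cells = row.find_elements_by_css_selector(css_selector)
    return cells[0].text if cells else default

# inside the for loop above:
# vessel_name = cell_text(vessels_left[i], '.ag-cell-content-link')
# vessel_imo = cell_text(vessels_right[i], '[col-id="imo"] .ag-cell-content div')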
The implicit wait was reduced to 5 because of the known case of a vessel missing its flag; waiting 100 seconds for that lookup to give up was excessive. This number can be adjusted higher if you find there are issues waiting long enough to find elements.
It appears you are using a Windows machine. You can add the folder containing chromedriver to your PATH environment variable, and then you don't have to pass the path when you instantiate your browser driver. Obviously, your path to chromedriver is different from mine, so hopefully the one you provided is correct or else this won't work.
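For example, assuming the folder containing chromedriver.exe has been added to PATH (an assumption about your setup), the driver can be created without any path at all:
import selenium.webdriver as webdriver

# chromedriver is found via the system PATH, so no executable_path is needed
browser = webdriver.Chrome()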

I like to work with bs4, but I think this info will help too.

Related

I'm learning how to crawl a site in Python, but I don't know how to do it with a tree structure [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
When I click each item on this site, https://dicom.innolitics.com/ciods (like CR Image, Patient, Referenced Patient Sequence ... those values), I want to save the description of the item shown in the layout on the right in a variable.
I'm trying to save the values by clicking on the items on the left.
But I found out that none of the values in the tree were crawled!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
url = "https://dicom.innolitics.com/ciods"
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tree-table')))

table_list = []
tree_table = driver.find_element(By.CLASS_NAME, 'tree-table')
tree_rows = tree_table.find_elements(By.TAG_NAME, 'tr')
for i, row in enumerate(tree_rows):
    row.click()
    td = row.find_element(By.TAG_NAME, 'td')
    a = td.find_element(By.CLASS_NAME, 'row-name')
    row_name = a.find_element(By.TAG_NAME, 'span').text
    print(f'Row {i+1} name: {row_name}')
driver.quit()
This is what I did.
I want to know how to crawl the values in the tree.
It would be even better if you could show me how to crawl the layout on the right :)
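A rough sketch of one way to approach it, assuming the description pane on the right can be located by a selector; the '.detail-pane' class below is a placeholder you would need to replace with the real class or id from the page's DOM:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://dicom.innolitics.com/ciods")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tree-table')))

descriptions = {}
for row in driver.find_elements(By.CSS_SELECTOR, '.tree-table tr'):
    row.click()  # clicking a row updates the panel on the right
    name = row.find_element(By.CSS_SELECTOR, '.row-name span').text
    # '.detail-pane' is a hypothetical selector for the right-hand panel; inspect the page to find the real one
    descriptions[name] = driver.find_element(By.CSS_SELECTOR, '.detail-pane').text
driver.quit()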

Syntax error when using find_elements_by_xpath [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Web scraping beginner here.
I am trying to get the amount of items on this webpage: https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand
However, when I use the len() function, it says there is a syntax error.
from bs4 import BeautifulSoup
import requests
import selenium
from selenium.webdriver import Firefox
driver = Firefox()
url = "https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand"
driver.get(url)
items = len(driver.find_elements_by_xpath(//*[@id="product-12257648"])
for item in range(items):
    price = item.find_element_by_xpath("/html/body/main/div/div/div/div[2]/div/div[1]/section/div/article[16]/a/p/span[1]")
    print(price)
It then outputs this error:
File "C:/Users/rasmu/PycharmProjects/du nu ffs/jsscrape.py", line 13
items = len(driver.find_elements_by_xpath(//*[@id="product-12257648"])
^
SyntaxError: invalid syntax
Process finished with exit code 1
Try this:
items = len(driver.find_elements_by_xpath("//*[#id='product-12257648']"))
You need quotes around the XPath string (and a matching closing parenthesis for the len() call).
If you want all prices, you can refactor your code as such --
from selenium import webdriver
# start driver, navigate to url
driver = webdriver.Firefox()
url = "https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand"
driver.get(url)
# iterate product price elements
for item in driver.find_elements_by_xpath("//p[./span[@data-auto-id='productTilePrice']]"):
    # print price text of element
    print(item.text)
driver.close()

How to get a specific text slice in string [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
I'm trying to work with Instagram
So, say I have a link https://www.instagram.com/p/Bn4Lmo_j0Jc/
And I want to get a Bn4Lmo_j0jc only. I could just remove everthing before this ID and the last /
But what if my link looks like this:
https://www.instagram.com/p/Bn4Lmo_j0Jc/?taken-by=instagram or this https://www.instagram.com/p/Bn1GpYyBFSl/?hl=en&taken-by=zaralarsson so there is no exact number of characters I need to remove. What will be the easiest way to solve this?
how about this?
import urllib.parse

url = 'https://www.instagram.com/p/Bn4Lmo_j0Jc/'
parts = urllib.parse.urlparse(url)
parts.path   # '/p/Bn4Lmo_j0Jc/'
from urllib import parse

def getId(url):
    # strip the leading '/p/' and the trailing '/' from the URL path
    return parse.urlparse(url).path[3:-1]

print(getId('https://www.instagram.com/p/Bn1GpYyBFSl/?hl=en&taken-by=zaralarsson'))
print(getId('https://www.instagram.com/p/Bn4Lmo_j0Jc/'))
print(getId('https://www.instagram.com/p/Bn4Lmo_j0Jc/?taken-by=instagram'))
Output :
Bn1GpYyBFSl
Bn4Lmo_j0Jc
Bn4Lmo_j0Jc
You can use a regex here. It will also handle the case where your URL has another /p/ after the ID field, which you may be concerned about.
import re

a = ['https://www.instagram.com/p/Bn1GpYyBFSl/?hl=en&taken-by=zaralarsson',
     'https://www.instagram.com/p/Bn4Lmo_j0Jc/',
     'https://www.instagram.com/p/Bn4Lmo_j0Jc/?taken-by=instagram/p/12321']

[re.findall(r'/p/(\w{1,})', i)[0] for i in a]
lst = link.split("/")
lst[-1] if not lst[-1].startswith("?") and lst[-1] else lst[-2]
where link is your link string.
(The result is the last element in lst, if it doesn't start with ? and is not empty - else the result is the last but one element.)
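As a runnable sketch wrapping that logic in a function (the name get_post_id is just for illustration; the URLs are the ones from the question):
def get_post_id(link):
    # take the last non-empty path piece that is not a query string
    lst = link.split("/")
    return lst[-1] if not lst[-1].startswith("?") and lst[-1] else lst[-2]

print(get_post_id('https://www.instagram.com/p/Bn4Lmo_j0Jc/'))                              # Bn4Lmo_j0Jc
print(get_post_id('https://www.instagram.com/p/Bn4Lmo_j0Jc/?taken-by=instagram'))           # Bn4Lmo_j0Jc
print(get_post_id('https://www.instagram.com/p/Bn1GpYyBFSl/?hl=en&taken-by=zaralarsson'))   # Bn1GpYyBFSl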
Consistent Format
Given that your URL will always start with https://www.instagram.com/p/, all you need is simple string manipulation.
base_url = 'https://www.instagram.com/p/'
main = 'https://www.instagram.com/p/Bn4Lmo_j0Jc/?taken-by=instagram'
# remove the base url
# split on the separator '/'
# select the ID at index [0]
main.replace(base_url, '').split('/')[0]
'Bn4Lmo_j0Jc'
For Looping
If you have a list of URLs from which you want to extract and capture the IDs:
url_base = 'https://www.instagram.com/p/'
url_list = [url1, url2, url3]
id_list = []
for url in url_list:
    id_list.append(url.replace(url_base, '').split('/')[0])

Create a script that will download a value from text on web pages [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I'm looking to capture the highest wind gust value per day from each weather forecast office. I'm not finding any tabular data so I think I need to just create a script that can extract that from the web pages.
E.g. web page: http://forecast.weather.gov/product.php?site=JAN&issuedby=ORD&product=CLI&format=CI&version=5&glossary=0
About halfway down, I just want to capture the "Highest Gust Speed" which for Oct. 30th at this station would be 23 MPH.
Would it be possible to do this with say, Python? I would need to run the script every day to capture the previous day's highest wind gust for all weather stations.
I'm wondering if I could just populate a table with the links to each station and go from there. Thank you.
Edited:
I pieced together this code that seems to work. However, I have since found this data in txt files, which will be easier to deal with. Thank you.
import urllib2, csv

url = "http://forecast.weather.gov/product.php?site=JAN&issuedby=ORD&product=CLI&format=CI&version=5&glossary=0"
downloaded_data = urllib2.urlopen(url)
#csv_data = csv.reader(downloaded_data)

row2 = ''
for row in downloaded_data:
    row2 = row2 + row

start = row2.find('HIGHEST GUST SPEED ') + 21
end = row2.find('HIGHEST GUST DIRECTION', start)
print int(row2[start:end])
It sounds like you want to scrape a web site. In that case I would use Python's urllib and the Beautiful Soup library.
EDIT:
I just checked out your link and I don't think Beautiful Soup really matters in this case. I would still use urllib, but once you get that object, you'll have to parse through the data looking for what you need. It is a bit hacky, but it should work. I'll have to check back and see how things came about.
BUT, you could use Beautiful Soup to extract JUST the plain text to make your plain-text parsing a bit easier. Anyway, just a thought!
Once you get that data, you can add whatever logic you want to check whether the previous value is greater than the one from your last pass. Once you figure out that portion, automate going out and getting the data: just create an init.d script and forget about it.
# example urllib
def requesturl(self, url):
    f = urllib.urlopen(url)
    html = f.read()
    return html

# beautiful soup
def beautifyhtml(self, html):
    currentprice_id = 'yfs_l84_' + self.s.lower()
    current_change_id = 'yfs_c63_' + self.s.lower()
    current_percent_change_id = 'yfs_p43_' + self.s.lower()
    find = []
    find.append(currentprice_id)
    find.append(current_change_id)
    find.append(current_percent_change_id)
    soup = BeautifulSoup(html)
    # title of the site - has stock quote
    #title = soup.title.string
    #print(title)
    # p is where the guts of the information I would want to get
    #soup.find_all('p')
    color = soup.find_all('span', id=current_change_id)[0].img['alt']
    # drilled down version to get current price:
    found = []
    for item in find:
        found.append(soup.find_all('span', id=item)[0].string)
    found.insert(0, self.s.upper())
    found.append(color)
    return found
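For this particular page, a Python 3 sketch of the same idea, using the question's URL and Beautiful Soup only to strip the HTML down to plain text before the string search (the exact wording of the report labels is taken from the question's code and assumed to be stable):
import urllib.request
from bs4 import BeautifulSoup

url = ("http://forecast.weather.gov/product.php?"
       "site=JAN&issuedby=ORD&product=CLI&format=CI&version=5&glossary=0")
html = urllib.request.urlopen(url).read()

# reduce the page to plain text, then search it the same way the question's code does
text = BeautifulSoup(html, 'html.parser').get_text()
start = text.find('HIGHEST GUST SPEED')
if start != -1:
    end = text.find('HIGHEST GUST DIRECTION', start)
    print(text[start + len('HIGHEST GUST SPEED'):end].strip())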

How to find backlinks in a website with python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
I am kind of stuck with this situation. I want to find the backlinks of websites, but I cannot figure out how to do it. Here is my regex:
readh = BeautifulSoup(urllib.urlopen("http://www.google.com/").read()).findAll("a",href=re.compile("^http"))
What I want to do, to find backlinks, is find links that start with http but not links that include google, and I cannot figure out how to manage this.
from BeautifulSoup import BeautifulSoup
import re
html = """
<div>hello</div>
<a>Not this one</a>
<a href="http://www.google.com/search">Link 1</a>
<a href="http://www.example.com/page">Link 2</a>
"""
def processor(tag):
    href = tag.get('href')
    if not href: return False
    return True if (href.find("google") == -1) else False

soup = BeautifulSoup(html)
back_links = soup.findAll(processor, href=re.compile(r"^http"))
print back_links
--output:--
[<a href="http://www.example.com/page">Link 2</a>]
However, it may be more efficient just to get all the links starting with http, then search those links for links that do not have 'google' in their hrefs:
http_links = soup.findAll('a', href=re.compile(r"^http"))
results = [a for a in http_links if a['href'].find('google') == -1]
print results
--output:--
[<a href="http://www.example.com/page">Link 2</a>]
Here is a regexp that matches http links but not ones that include google:
re.compile(r"(?!.*google)^http://(www\.)?.*")
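For example, a quick check of that pattern (the example URLs are illustrative):
import re

pattern = re.compile(r"(?!.*google)^http://(www\.)?.*")
print(bool(pattern.match("http://www.example.com/page")))   # True
print(bool(pattern.match("http://www.google.com/search")))  # False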
