I'm learning how to crawl a site in Python, but I don't know how to turn it into a tree structure [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 2 days ago.
When I click each item on https://dicom.innolitics.com/ciods (values like CR Image, Patient, Referenced Patient Sequence, ...), I want to save that item's description from the layout on the right into a variable.
I'm trying to save the values by clicking the items on the left.
But I found that none of the values in the tree were crawled!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
url = "https://dicom.innolitics.com/ciods"
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tree-table')))
table_list = []
tree_table = driver.find_element(By.CLASS_NAME, 'tree-table')
tree_rows = tree_table.find_elements(By.TAG_NAME, 'tr')
for i, row in enumerate(tree_rows):
    row.click()
    td = row.find_element(By.TAG_NAME, 'td')
    a = td.find_element(By.CLASS_NAME, 'row-name')
    row_name = a.find_element(By.TAG_NAME, 'span').text
    print(f'Row {i+1} name: {row_name}')
driver.quit()
This is what I tried.
I want to know how to crawl the values in the tree.
It would be even better if you could show me how to crawl the layout on the right :)
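One hedged way to get the "tree structure" the question asks for: read each row's name and nesting depth with Selenium (class names below mirror the question's code; how depth is exposed on the page is an assumption, e.g. an indentation value per row), then nest the flat rows with a pure-Python helper:

```python
# Sketch: nest flat (depth, name) rows into a tree.
# The (depth, name) pairs are assumed to come from the Selenium loop in the
# question (depth read from a row's indentation or a CSS class -- this is an
# assumption about the page); the nesting logic itself is plain Python.

def build_tree(rows):
    """rows: iterable of (depth, name) in document order -> nested dict."""
    root = {"name": "root", "children": []}
    stack = [(-1, root)]           # (depth, node) path from root to last row
    for depth, name in rows:
        node = {"name": name, "children": []}
        # pop back up until the top of the stack is this row's parent
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root

if __name__ == "__main__":
    # Hypothetical row names taken from the question, depths invented:
    flat = [(0, "CR Image"), (1, "Patient"),
            (2, "Referenced Patient Sequence"), (1, "General Study")]
    import json
    print(json.dumps(build_tree(flat), indent=2))
```

The stack keeps the path from the root to the most recent row, so each new row is appended to the nearest earlier row with a smaller depth.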

Related

How to always find the newest blog post with Selenium [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
How can I select the first post in the new posts section at publish0x.com/newposts with selenium?
The post will always be in the same spot on the page but the title and author will vary.
The page where the posts are located: https://www.publish0x.com/newposts
An example of a post: https://www.publish0x.com/dicecrypto-articles/welcome-to-altseason-as-major-altcoins-explodes-xmkmvkk
First get a list of all the posts, and then extract the first element:
driver.get('https://www.publish0x.com/newposts')
post = driver.find_element_by_css_selector('#main > div.infinite-scroll > div:nth-child(1) > div.content')
and now you can extract the author name and title from the post like:
title = post.find_element_by_css_selector('h2 > a').text
author = post.find_element_by_css_selector('p.text-secondary > small:nth-child(4) > a').text
driver.get("https://www.publish0x.com/newposts");
List<WebElement> LIST = driver.findElements(By.xpath("//div[@class='infinite-scroll']//div/descendant::div[@class='content']"));
System.out.println(LIST.size());
LIST.get(0).click();
Use the XPath above to click the first post (this is Java code, but the same XPath works from Python).
Try this and let me know.
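For the Python side, a hedged sketch of the same idea (the corrected XPath is taken from the Java above; the Selenium part needs a browser and is left as comments, while the stdlib check only approximates the selection, since `xml.etree` supports a subset of XPath):

```python
# The Java answer's XPath, corrected ('@class', single quotes inside):
POSTS_XPATH = "//div[@class='infinite-scroll']//div/descendant::div[@class='content']"

# With Selenium (not run here -- needs a browser):
#   from selenium import webdriver
#   from selenium.webdriver.common.by import By
#   driver = webdriver.Chrome()
#   driver.get("https://www.publish0x.com/newposts")
#   posts = driver.find_elements(By.XPATH, POSTS_XPATH)
#   posts[0].click()               # open the newest post

# Sanity check of the selection idea on a tiny fragment with the stdlib
# parser (simplified to a descendant search):
import xml.etree.ElementTree as ET

sample = """
<div class="infinite-scroll">
  <div><div class="content"><h2><a>First post</a></h2></div></div>
  <div><div class="content"><h2><a>Second post</a></h2></div></div>
</div>
"""
root = ET.fromstring(sample)
posts = root.findall(".//div[@class='content']")
print(len(posts), posts[0].find(".//a").text)  # 2 First post
```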

How do I extract the information from the website? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I am trying to gather information of all the vessels from this website:
https://www.marinetraffic.com/en/data/?asset_type=vessels&columns=flag,shipname,photo,recognized_next_port,reported_eta,reported_destination,current_port,imo,ship_type,show_on_live_map,time_of_latest_position,lat_of_latest_position,lon_of_latest_position&ship_type_in|in|Cargo%20Vessels|ship_type_in=7
This is my code right now:
import selenium.webdriver as webdriver
url = "https://www.marinetraffic.com/en/data/?asset_type=vessels&columns=flag,shipname,photo,recognized_next_port,reported_eta,reported_destination,current_port,imo,ship_type,show_on_live_map,time_of_latest_position,lat_of_latest_position,lon_of_latest_position&ship_type_in|in|Cargo%20Vessels|ship_type_in=7"
browser = webdriver.Chrome(executable_path=r"C:\Users\CSA\OneDrive - College Sainte-Anne\Programming\PYTHON\Learning\WS\chromedriver_win32 (1)\chromedriver.exe")
browser.get(url)
browser.implicitly_wait(100)
Vessel_link = browser.find_element_by_class_name("ag-cell-content-link")
Vessel_link.click()
browser.implicitly_wait(30)
imo = browser.find_element_by_xpath('//*[@id="imo"]')
print(imo)
My output
I am using Selenium, which isn't going to work here: I have several thousand ships to extract data from, so it just isn't efficient. Also, I only need information on Cargo Vessels (you can find them using the filter, or by the green labels in the vessel type column), and I need to extract the country name (flag), the IMO number, and the vessel's name.
What should I use: Selenium, bs4 + requests, or other libraries? And how? I just started web scraping...
I can't get the IMO or anything else; the HTML structure is very strange.
I would appreciate any help. Thank You! :)
Instead of clicking each vessel to open up the details, you can get the information you're searching for from the results page. This will get each vessel, pull the info you wanted and click to the next page if there are more vessels:
import selenium.webdriver as webdriver

url = "https://www.marinetraffic.com/en/data/?asset_type=vessels&columns=flag,shipname,photo,recognized_next_port,reported_eta,reported_destination,current_port,imo,ship_type,show_on_live_map,time_of_latest_position,lat_of_latest_position,lon_of_latest_position&ship_type_in|in|Cargo%20Vessels|ship_type_in=7"
browser = webdriver.Chrome(executable_path=r"C:\Users\CSA\OneDrive - College Sainte-Anne\Programming\PYTHON\Learning\WS\chromedriver_win32 (1)\chromedriver.exe")
browser.get(url)
browser.implicitly_wait(5)

checking_for_vessels = True
vessel_count = 0
while checking_for_vessels:
    vessel_left_container = browser.find_element_by_class_name('ag-pinned-left-cols-container')
    vessels_left = vessel_left_container.find_elements_by_css_selector('div[role="row"]')
    vessel_right_container = browser.find_element_by_class_name('ag-body-container')
    vessels_right = vessel_right_container.find_elements_by_css_selector('div[role="row"]')
    for i in range(len(vessels_left)):
        vessel_count += 1
        vessel_country_list = vessels_left[i].find_elements_by_class_name('flag-icon')
        if len(vessel_country_list) == 0:
            vessel_country = 'Unknown'
        else:
            vessel_country = vessel_country_list[0].get_attribute('title')
        vessel_name = vessels_left[i].find_element_by_class_name('ag-cell-content-link').text
        vessel_imo = vessels_right[i].find_element_by_css_selector('[col-id="imo"] .ag-cell-content div').text
        print('Vessel #' + str(vessel_count) + ': ' + vessel_name + ', ' + vessel_country + ', ' + vessel_imo)
    pagination_container = browser.find_element_by_class_name('MuiTablePagination-actions')
    page_number = pagination_container.find_element_by_css_selector('input').get_attribute('value')
    max_page_number = pagination_container.find_element_by_class_name('MuiFormControl-root').get_attribute('max')
    if page_number == max_page_number:
        checking_for_vessels = False
    else:
        next_page_button = pagination_container.find_element_by_css_selector('button[title="Next page"]')
        next_page_button.click()
There was one vessel that did not display a flag, so there's a check for that, and the country is replaced with 'Unknown' if no flag is found. The same kind of check can be done for the vessel name and IMO.
The implicit wait was reduced to 5 because of the known issue of a missing flag on one vessel; waiting 100 seconds for that to resolve was excessive. This number can be raised if you find there are issues waiting long enough to find elements.
It appears you are using a Windows machine. You can add the path of your chromedriver to the PATH variable on your machine, and then you don't have to pass the path when you instantiate your browser driver. Obviously, your path to your chromedriver is different from mine, so hopefully what you provided is correct, or else this won't work.
I like to work with bs4, but I think this info will help too.

Syntax error when using find_elements_by_xpath [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 3 years ago.
Web scraping beginner here.
I am trying to get the amount of items on this webpage: https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand
However, when I use the len() function, I get a syntax error.
from bs4 import BeautifulSoup
import requests
import selenium
from selenium.webdriver import Firefox
driver = Firefox()
url = "https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand"
driver.get(url)
items = len(driver.find_elements_by_xpath(//*[@id="product-12257648"])
for item in range(items):
price = item.find_element_by_xpath("/html/body/main/div/div/div/div[2]/div/div[1]/section/div/article[16]/a/p/span[1]")
print(price)
It then outputs this error:
File "C:/Users/rasmu/PycharmProjects/du nu ffs/jsscrape.py", line 13
items = len(driver.find_elements_by_xpath(//*[@id="product-12257648"])
^
SyntaxError: invalid syntax
Process finished with exit code 1
Try this:
items = len(driver.find_elements_by_xpath("//*[@id='product-12257648']"))
You need double quotes surrounding the XPath.
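The quoting fix is purely a Python string issue and can be seen with plain strings; any spelling works as long as the outer and inner quotes don't collide:

```python
# Three ways to quote an XPath with an attribute value in Python source:
a = "//*[@id='product-12257648']"      # double outside, single inside
b = '//*[@id="product-12257648"]'      # single outside, double inside
c = "//*[@id=\"product-12257648\"]"    # escaped double quotes

# b and c are the same characters; a differs only in which quote ends up
# inside the XPath string. XPath itself accepts both ' and " around the
# value, so all three select the same element.
print(b == c)  # True
```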
If you want all prices, you can refactor your code as such --
from selenium import webdriver
# start driver, navigate to url
driver = webdriver.Firefox()
url = "https://www.asos.com/dk/maend/a-to-z-of-brands/nike/cat/?cid=4766&refine=attribute_10992:61388&nlid=mw|sko|shop+efter+brand"
driver.get(url)
# iterate product price elements
for item in driver.find_elements_by_xpath("//p[./span[#data-auto-id='productTilePrice']]"):
# print price text of element
print(item.text)
driver.close()
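The `.text` of those price elements is a string, not a number. If you need numeric prices, a small helper can parse them; note the "349,00 kr." Danish format below is an assumption about the page, not something confirmed by the thread:

```python
import re

def parse_price(text):
    """Extract the first number from a price string that uses a comma as
    the decimal mark and a dot as the thousands separator (assumed Danish
    formatting, e.g. '1.299,95 kr.'). Returns None if no number is found."""
    m = re.search(r"(\d+(?:\.\d{3})*(?:,\d+)?)", text)
    if not m:
        return None
    return float(m.group(1).replace(".", "").replace(",", "."))

print(parse_price("349,00 kr."))    # 349.0
print(parse_price("1.299,95 kr."))  # 1299.95
```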

When I run this scraping code the result is empty [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 4 years ago.
I tried to scrape HTML data from this site with the code below, but the result is always empty.
https://www.seloger.com/annonces/achat/appartement/montpellier-34/alco/140769091.htm?ci=340172&idtt=2,5&idtypebien=2,1&naturebien=1,2,4&tri=initial&bd=ListToDetail
def parse(self, response):
    content = response.css(".agence-adresse::text").extract()
    yield {'adresse =': content}
Try this CSS:
css = '.agence-adresse ::text'
content = response.css(css).extract()
yield {
    'address': content,
}
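The space before `::text` is the whole fix: in Scrapy selectors, `.agence-adresse::text` returns only the element's own text nodes, while `.agence-adresse ::text` (a descendant selector) also returns text inside child elements. A rough stdlib illustration of that difference (parsel itself is not used here, and the sample markup is invented):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect text nodes together with their element nesting depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.texts = []            # (depth, text) pairs
    def handle_starttag(self, tag, attrs):
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1
    def handle_data(self, data):
        if data.strip():
            self.texts.append((self.depth, data.strip()))

doc = '<div class="agence-adresse">Agency<span>12 Rue Example</span></div>'
p = TextCollector()
p.feed(doc)

own_text = [t for d, t in p.texts if d == 1]   # like '::text' (own text only)
all_text = [t for d, t in p.texts]             # like ' ::text' (all descendants)
print(own_text)   # ['Agency']
print(all_text)   # ['Agency', '12 Rue Example']
```

If the address text lives inside child tags of `.agence-adresse`, only the descendant form finds it, which matches the empty result the question describes.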

how to pull out text from a div class using selenium headless [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I'm trying to pull out the "0%" from the following div tag:
<div class="sem-report-header-td-diff ">0%</div>
my current code is:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(executable_path='mypath/chrome.exe',
chrome_options=options)
url = 'https://www.semrush.com/info/burton.com'
driver.get(url)
driver.implicitly_wait(2)
change_elements = driver.find_elements_by_xpath(xpath='//div[@class="sem-report-header-td-diff "]')
not sure what I'm doing wrong. This works with href tags, but its not working for this.
As per the HTML you have shared, to extract the text 0% you can use the method get_attribute("innerHTML") with either of the following solutions:
css_selector:
myText = driver.find_element_by_css_selector("div.sem-report-header-td-diff").get_attribute("innerHTML")
xpath:
myText = driver.find_element_by_xpath("//div[@class='sem-report-header-td-diff ']").get_attribute("innerHTML")
First of all, it is not "elements", it is "element". And second, you didn't get the text; you only located the element.
So, here is the code:
element_text = driver.find_element_by_xpath("//div[@class='sem-report-header-td-diff ']").text
