When I run this scraping code the result is empty [closed] - python

I'm trying to scrape HTML data from this site with the code below, but the result is always empty.
https://www.seloger.com/annonces/achat/appartement/montpellier-34/alco/140769091.htm?ci=340172&idtt=2,5&idtypebien=2,1&naturebien=1,2,4&tri=initial&bd=ListToDetail
def parse(self, response):
    content = response.css(".agence-adresse::text").extract()
    yield {'adresse =': content}

Try this CSS selector instead; the space before ::text selects the text of all descendant nodes, not just the element itself:
css = '.agence-adresse ::text'
content = response.css(css).extract()
yield {
    'address': content,
}
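If that selector still returns nothing, it is worth checking response.text first: the block may not be present in the raw HTML (the page could be rendered by JavaScript, or the request could be blocked). Purely as a sketch, a minimal spider with a browser-like User-Agent might look like this (the spider name and header value are illustrative, not from the original answer):
import scrapy

class SelogerSpider(scrapy.Spider):
    # name and User-Agent below are illustrative placeholders
    name = "seloger"
    start_urls = [
        "https://www.seloger.com/annonces/achat/appartement/montpellier-34/alco/140769091.htm",
    ]
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }

    def parse(self, response):
        # a space before ::text grabs text from all descendants of the block
        address = response.css(".agence-adresse ::text").getall()
        yield {"address": address}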

Related

I'm learning how to crawl a site in Python, but I don't know how to do it for a tree structure [closed]

When I click each item on this site, https://dicom.innolitics.com/ciods (values like CR Image, Patient, Referenced Patient Sequence, ...), I want to save the item's description shown in the right-hand layout into a variable.
I'm trying to save the values by clicking the items on the left, but I found that none of the values in the tree were crawled!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
url = "https://dicom.innolitics.com/ciods"
driver.get(url)

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tree-table')))

table_list = []
tree_table = driver.find_element(By.CLASS_NAME, 'tree-table')
tree_rows = tree_table.find_elements(By.TAG_NAME, 'tr')
for i, row in enumerate(tree_rows):
    row.click()
    td = row.find_element(By.TAG_NAME, 'td')
    a = td.find_element(By.CLASS_NAME, 'row-name')
    row_name = a.find_element(By.TAG_NAME, 'span').text
    print(f'Row {i+1} name: {row_name}')
driver.quit()
This is what I did. I want to know how to crawl the values in the tree, and ideally how to crawl the layout on the right as well :)
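One possible direction, purely as a sketch: after clicking a row, read the text of the detail pane on the right. The '.detail-pane' selector below is a placeholder guess and needs to be verified in the browser's DevTools; only '.tree-table' and '.row-name' come from the question's own code.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://dicom.innolitics.com/ciods")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tree-table')))

descriptions = {}
rows = driver.find_elements(By.CSS_SELECTOR, '.tree-table tr')
for row in rows:
    name = row.find_element(By.CSS_SELECTOR, '.row-name').text
    row.click()
    # '.detail-pane' is a placeholder for the right-hand panel's real selector
    detail = driver.find_element(By.CSS_SELECTOR, '.detail-pane').text
    descriptions[name] = detail

driver.quit()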

Scrape this value from a request response [closed]

Unfortunately I can only provide the output of the request and not the full code, since it contains private information. When I print the response as text I get JSON, something like this:
{"redirectUrl":"https://www.paypal.com/cgi-bin/webscr?cmd=_express-checkout\u0026token=EC-....."}
How can I extract that PayPal URL? I tried the following, but it didn't work:
content = checkout.text()
checkout.url = content["redirectUrl"]
I only get this error when doing it:
content = checkout.text()
TypeError: 'str' object is not callable
Assuming you're using the requests library, you can simply do:
resp = requests.get(url=url)
content = resp.json()
redirect_url = content["redirectUrl"]
You can read more about response content, including JSON response content, in the requests documentation.
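Applied to the original snippet: the error comes from calling .text as if it were a method, while on a requests Response it is a string property. Assuming checkout is that Response object, a small sketch of the fix:
import json

# .text is a property (a string), not a method, so don't call it
content = json.loads(checkout.text)   # equivalently: content = checkout.json()
redirect_url = content["redirectUrl"]
print(redirect_url)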

Extracting number of links from a website [closed]

I need to extract the number of links on a website, e.g. https://stackoverflow.com/questions/ask (just as an example).
I have tried using urlparse to extract URL information, then Beautiful Soup.
domain_name = urlparse(url).netloc
soup = BeautifulSoup(requests.get(url).content, "html.parser")
For each website, I need to save all of its links in a list. I would like to have something like this:
URL Links
https://stackoverflow.com/questions/ask ['link1','link2','link3',...]
https://anotherwebsite.com/sport ['link1','link2','link3','link4']
https://last_example.es []
Could you please explain how to get similar results?
Let's try:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_all_links(url):
    # of course one needs to deal with the case when `requests` fails,
    # but that's outside the scope here
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return [a.attrs.get('href', '') for a in soup.find_all('a')]

# sample data
df = pd.DataFrame({'URL': ['https://stackoverflow.com/questions/ask']})
df['Links'] = df['URL'].apply(get_all_links)
Output:
URL Links
0 https://stackoverflow.com/questions/ask [#, https://stackoverflow.com, /company, #, /t...
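Since the question asks for the number of links, a small follow-up (not part of the original answer) adds a count column from the lists already collected:
# length of each list of links
df['NumLinks'] = df['Links'].str.len()
print(df[['URL', 'NumLinks']])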

How to always find the newest blog post with Selenium [closed]

How can I select the first post in the new posts section at publish0x.com/newposts with Selenium?
The post will always be in the same spot on the page but the title and author will vary.
The page where the posts are located: https://www.publish0x.com/newposts
An example of a post: https://www.publish0x.com/dicecrypto-articles/welcome-to-altseason-as-major-altcoins-explodes-xmkmvkk
First locate the first post in the new-posts list:
driver.get('https://www.publish0x.com/newposts')
post = driver.find_element_by_css_selector('#main > div.infinite-scroll > div:nth-child(1) > div.content')
and now you can extract the author name and title from the post like:
title = post.find_element_by_css_selector('h2 > a').text
author = post.find_element_by_css_selector('p.text-secondary > small:nth-child(4) > a').text
driver.get("https://www.publish0x.com/newposts");
List<WebElement> LIST = driver.findElements(By.xpath("//div[@class='infinite-scroll']//div/descendant::div[@class='content']"));
System.out.println(LIST.size());
LIST.get(0).click();
Use the XPath above to click the first post (this is Java code, but you can use the same XPath in Python).
Try this and let me know.
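For reference, a rough Python equivalent of the same XPath (a sketch, not part of the original answer):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.publish0x.com/newposts")

# same XPath as the Java snippet above
posts = driver.find_elements(By.XPATH, "//div[@class='infinite-scroll']//div/descendant::div[@class='content']")
print(len(posts))
posts[0].click()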

How to find backlinks in a website with python [closed]

I am stuck on this: I want to find the backlinks of websites, but I can't work out how to do it. Here is what I have:
readh = BeautifulSoup(urllib.urlopen("http://www.google.com/").read()).findAll("a",href=re.compile("^http"))
To find backlinks, I want to match links that start with http but exclude links that include google, and I cannot figure out how to manage this.
from BeautifulSoup import BeautifulSoup
import re

# sample markup: one google link, one relative link, one plain http link
# (the hrefs here are illustrative)
html = """
<div>hello</div>
<a href="http://www.google.com">Not this one</a>
<a href="/relative/path">Link 1</a>
<a href="http://example.com">Link 2</a>
"""

def processor(tag):
    href = tag.get('href')
    if not href:
        return False
    return True if (href.find("google") == -1) else False

soup = BeautifulSoup(html)
back_links = soup.findAll(processor, href=re.compile(r"^http"))
print back_links
--output:--
[<a href="http://example.com">Link 2</a>]
However, it may be more efficient just to get all the links starting with http, then search those links for links that do not have 'google' in their hrefs:
http_links = soup.findAll('a', href=re.compile(r"^http"))
results = [a for a in http_links if a['href'].find('google') == -1]
print results
--output:--
[<a href="http://example.com">Link 2</a>]
Here is a regexp that matches http URLs, but not ones that include google:
re.compile(r"(?!.*google)^http://(www\.)?.*")
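For illustration (assuming the same soup object and sample markup as above), that pattern can also be passed directly to findAll, which should behave much like the filter-function approach for plain http links:
pattern = re.compile(r"(?!.*google)^http://(www\.)?.*")
links = soup.findAll('a', href=pattern)
print links
# should print only the non-google http link(s), e.g. Link 2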
