Python webpage scraping can't find form from this page - python

I want to cycle through the dates at the bottom of the page using what looks like a form, but it returns a blank. Here is my code:
import mechanize

URL = 'http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
br = mechanize.Browser()
r = br.open(URL)
for form in br.forms():  # finding the name of the form
    print form.name
    print form
Why is this not returning any forms? Is it not a form? If not, how do I control the year and month at the bottom to cycle through the pages?
Can someone provide some sample code on how to do it?

When you try to access that page, you are actually being directed to an error page. Paste that URL into a browser and you get a page with:
Not comply with the conditions of the inquiry data
and no forms at all.
You need to access the page in a different way. I would suggest stepping through the URL directory until you find the right path.
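As a sketch of that approach: the date selector presumably submits year and month values back to the same JSP, so you could try requesting it directly with those parameters. The parameter names below ('year' and 'month') are guesses; inspect the real request in your browser's network tab to find the names the page actually submits.
import requests

URL = 'http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
# 'year' and 'month' are hypothetical parameter names -- check the
# network tab for the names the date selector really sends
for year in range(2010, 2014):
    for month in range(1, 13):
        r = requests.get(URL, params={'year': year, 'month': month})
        print r.status_code  # a 200 with real content means the guess was right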

Related

Selenium find element with form action

I am currently trying out Selenium to develop a program to automate testing of the login forms of websites.
I am trying to use Selenium to find a form on the websites I am testing, and I've noticed that different websites have different form names and form ids, and some websites have neither.
But from my observations, I've noticed that the form action is always there, and I've used the code below to retrieve it:
import requests
from bs4 import BeautifulSoup

request = requests.get("whicheverwebsite")
parseHTML = BeautifulSoup(request.text, 'html.parser')
htmlForm = parseHTML.form
formName = htmlForm['action']
I am trying to retrieve the form and then use form.submit() to submit it.
I know of the functions find_element_by_name and find_element_by_id, but I am trying to find the element by action, and I am not sure how this can be done.
I've found the answer to this. By using XPath with the form's action attribute, I am able to achieve this:
form = driver.find_element_by_xpath("//form[@action='" + formName + "']")
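Putting the two pieces together, a rough sketch (the URL is the placeholder from the question, and this assumes chromedriver is installed): extract the action with BeautifulSoup, locate the form with Selenium, and submit it.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://whicheverwebsite.com"  # placeholder from the question
formName = BeautifulSoup(requests.get(url).text, 'html.parser').form['action']

driver = webdriver.Chrome()
driver.get(url)
form = driver.find_element_by_xpath("//form[@action='" + formName + "']")
form.submit()  # submits the located form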
I would recommend including the URL of one or two of the sites you are trying to scrape, along with your full code. Based on the information above, it appears that you are using BeautifulSoup rather than Selenium.
I would use the following:
from selenium import webdriver
url = 'https://whicheverwebsiteyouareusing.com'
driver = webdriver.Chrome()
driver.get(url)
From there you have many options to select the form, but again, without the actual site we can't identify which would be most relevant. I would recommend reading https://selenium-python.readthedocs.io/locating-elements.html to find out which would be most applicable to your situation.
Hope this helps.
Keep in mind that a login page can have multiple form tags even if you see only one. For example, a login page may show only one visible form while there are three of them in the DOM.
So the most reliable way is to dig into each form (if there are multiple) and check two things:
If there's a [type=password] element (we definitely need a password to log in)
If there's a 2nd input there (though this can be considered optional)
Ruby example:
forms = page.all(:xpath, '//form') # retrieve all the forms and iterate
forms.each do |form|
  # if there's a password field and there are two input fields in general
  if form.has_css?('input[type=password]') && form.all(:xpath, './/input').count == 2
    return form
  end
end
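For the Python readers, a rough Selenium equivalent of the Ruby sketch above (same logic: return the form that has a password field and exactly two inputs):
def find_login_form(driver):
    # iterate over every form in the DOM, visible or not
    for form in driver.find_elements_by_xpath('//form'):
        has_password = form.find_elements_by_css_selector('input[type=password]')
        inputs = form.find_elements_by_xpath('.//input')  # inputs inside this form only
        if has_password and len(inputs) == 2:
            return form
    return None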

Scrape data from JavaScript-rendered website

I want to scrape the Lulu webstore. I have the following problems with it.
The website content is loaded dynamically.
When you try to access the website, it redirects to a choose-country page.
After choosing a country, it pops up a select-delivery-location dialog and then redirects to the home page.
When you try to hit an end page programmatically, you get an empty response because the content is loaded dynamically.
I have a list of end URLs from which I have to scrape data. For example, consider mobile accessories. Now I want to
get the HTML source of that page directly, with the dynamic content loaded and the choose-country and select-location popups bypassed, so that I can use my Scrapy XPath selectors to extract data.
If you suggest I use Selenium, PhantomJS, Ghost or something else to deal with the dynamic content, please understand that I want the end HTML source as in a web browser after processing all dynamic content, which will then be sent to Scrapy.
Also, I tried using proxies to skip the choose-country popup, but it still loads it and the select-delivery-location one.
I've tried using Splash, but it returns the source of the choose-country page.
At last I found the answer. I used the EditThisCookie plugin to view the cookies that are set by the web page. I found that it stores 3 cookies, CurrencyCode, ServerId and Site_Config, in my local storage. I used the above-mentioned plugin to copy the cookies in JSON format, and I referred to this manual for setting cookies in requests.
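A minimal sketch of building that cookie jar with requests (the values below are placeholders; use the ones you copied with EditThisCookie, since they are session-specific). This jar is what the snippet further down passes to requests.get():
import requests

jar = requests.cookies.RequestsCookieJar()
# placeholder values -- substitute the real ones from your own session
jar.set('CurrencyCode', 'AED', domain='www.luluwebstore.com')
jar.set('ServerId', '1', domain='www.luluwebstore.com')
jar.set('Site_Config', '...', domain='www.luluwebstore.com')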
Now I'm able to skip those location and delivery-address popups. After that I found that the dynamic pages are loaded via a <script type=text/javascript> block, and that part of the page URL is stored in a variable there. I extracted the value using split(). Here is the script to get the dynamic page URL:
import requests
from lxml import html

page_source = requests.get(url, cookies=jar)
tree = html.fromstring(page_source.content)
# the entire javascript block that loads the product pages
dynamic_pg_link = tree.xpath('//div[@class="col3_T02"]/div/script/text()')[0]
# obtains the dynamic page url
dynamic_pg_link = dynamic_pg_link.split("=")[1].split(";")[0].strip()
page_link = "http://www.luluwebstore.com/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput=" + dynamic_pg_link
Now I'm able to extract data from these links.
Thanks to @Cal Eliacheff for the previous guidance.

python urlopen : only the first attribute of the URL is taken via data

I'm trying to scrape a website.
It's a soccer website that lists all the matches of all the seasons.
So I'm trying to scrape the HTML pages of every game of every season.
Here is the url : http://www.lfp.fr/ligue1/calendrier_resultat#sai=77&jour=1
What I am doing is :
from urllib import urlencode, urlopen  # Python 2
from bs4 import BeautifulSoup

url = 'http://www.lfp.fr/ligue1/calendrier_resultat#'
data = {'sai': 77, 'jour': 10}
url_values = urlencode(data)
response = urlopen(url, url_values)
soup = BeautifulSoup(response)
sai is the season and jour is the week.
The problem is that the page returned only depends on the 'sai' value; no matter what 'jour' is equal to, it always returns the same page, and it is always the last week.
For example, I can enter a URL like this:
http://www.lfp.fr/ligue1/calendrier_resultat#sai=77OUHIGYGO8TY98
It will never care what comes after sai=77.
I don't know why it does this and I really need some help.
Thanks
Thanks to some other Stack Overflow posts' answers, I finally resolved the problem.
The problem here is that the URL part after # is a client-side fragment that is never sent to the server. There is a true URL that is sent to the server, and I found it using Firefox: go to the developer tools, open the network tab, scroll through the entries on the left side, and for each one you will see the corresponding "request URL".
If you pay attention you will find the right entry, whose request URL looks like the URL with the # symbol. Just copy-paste it and your problem is resolved.
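A hedged sketch of the resulting loop (the endpoint below is a placeholder, not the real one; substitute the request URL you copied from the network tab):
from urllib import urlencode, urlopen  # Python 2
from bs4 import BeautifulSoup

# placeholder -- replace with the real request URL found in the network tab
base = 'http://www.lfp.fr/ligue1/some_real_endpoint'
for jour in range(1, 39):  # a Ligue 1 season has 38 weeks
    query = urlencode({'sai': 77, 'jour': jour})
    soup = BeautifulSoup(urlopen(base + '?' + query))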

python mechanize filling out form

I am lost on what I can do to use mechanize to fill out the form on the following website and then click submit.
https://dxtra.markets.reuters.com/Dx/DxnHtm/Default.htm
On the left side, click Currency Information, then Value Dates.
This is for a finance class of mine, and we need the dates for many different currency pairs. I wanted to get in, put the date in the "Trade Date" field, select what "base" and "quote" I wanted, click submit, and get the dates off the next page using Beautiful Soup.
1) Is this possible using mechanize?
2) How do I go about it? I have read the docs on the website and looked all through Stack Overflow, but I can't seem to get this to work at all. I was trying to get the form and then set what I want, but I can't get the correct forms.
Any help would be greatly appreciated. I am not tied down to mechanize; I'm just not sure what the best module to use is.
This is what I have so far, and I get ZERO forms to attach a value to.
from mechanize import Browser

br = Browser()
baseURL = "https://dxtra.markets.reuters.com/Dx/DxnHtm/Default.htm"
br.open(baseURL)
for form in br.forms():
    print form
Mechanize can't find any form on that page because it only parses the HTML response received from the request to baseURL. When you click on Value Dates, the browser sends another request and receives another HTML page to parse. It seems you should use https://dxtra.markets.reuters.com/Dx/DxnOutbound/400201404162135222149001.htm as the baseURL value instead. Also, Python mechanize doesn't support AJAX calls; for more complicated tasks you can use python-selenium, which is a more powerful tool for web browsing.
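A minimal sketch of that suggestion (note the long numeric file name looks session-generated, so it may no longer exist; grab the current frame URL from the page source first):
from mechanize import Browser

br = Browser()
# the frame that actually contains the form, per the answer above
baseURL = "https://dxtra.markets.reuters.com/Dx/DxnOutbound/400201404162135222149001.htm"
br.open(baseURL)
for form in br.forms():
    print form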

Mechanize: submitting form but not loading new page to see results

Okay, so I'm starting to get a little frustrated. I've spent most of a day trying to figure out why my script is not working, both on GitHub and here. It should be fairly simple: mechanize loads a page, fills in a form, submits the form, opens a new page with company information, and prints the content. It's just not working. When I check the code, I can see that the right form is filled out, but after mechanize submits the form it doesn't go to the new page; it stays on the one where it filled out the form. The code is like this:
from mechanize import Browser
br = Browser()
url = "http://cvr.dk/Site/Forms/CMS/DisplayPage.aspx?pageid=0"
cvr = br.open(url).read()
# I select the form
br.select_form(name="aspnetForm")
# I fill in 19997049 as a company number
br.form['ctl00$QuickSearch1$CvrTextBox'] = "19997049"
response = br.submit()
content = response.read()
print content
I have a feeling it's extremely simple, but that I'm missing something with the redirect that should happen when the form is submitted.
EDIT: It seems like there's a lot of JavaScript on the site. Might that be the reason? And if so, what are the options?
EDIT 2: Okay, it seems that I can simply add the company number to the URL and get the page that I want that way, but I'm still puzzled as to why this script doesn't work.
Thanks a bunch for any feedback
You need to tell it which button to use:
response = br.submit(name='ctl00$QuickSearch1$CvrSearchButton')
This works, but it raises a problem with robots.txt, which is an ethical dilemma.
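For completeness: mechanize honors robots.txt by default and refuses pages it disallows. You can tell it not to, though whether you should is exactly the dilemma above:
br.set_handle_robots(False)  # stop checking robots.txt; use responsibly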
