I'm trying to scrape data from NYSE's website, from this URL:
nyse = "http://www1.nyse.com/about/listed/IPO_Index.html"
Using requests, I've set my request up like this:
import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get(nyse)
soup = BeautifulSoup(page.text)
tables = soup.findAll('table')
test = pandas.io.html.read_html(str(tables))
However, I keep getting this error
'ValueError: stat: path too long for Windows'
I don't understand how to interpret this error, and furthermore, solve the problem. I've seen one other posting on this area (Copy a file with a too long path to another directory in Python) but I don't fully understand the workaround, and am not sure which path is the problem in this case.
The error is getting thrown at the test = pandas.io.... line but there isn't a clear definition of path, where I'm storing the table locally. Do I need to use pywin32? Why does this error only show for some URLs and not others? How do I solve this problem?
For reference, I'm using python 3.4
Update:
The error only appears with the nyse website, and not for others that I'm also scraping. In all cases, I'm doing the str(tables) conversion.
The pandas read_html method accepts URLs, files, or raw HTML strings as its first argument. It definitely looks like it's trying to interpret the str(tables) argument as a path or URL, which would of course be quite long and overrun whatever limit Windows apparently has.
Are you certain that str(tables) produces raw, parseable HTML? tables looks like it would be a list of abstract node objects; it seems likely that calling str() on that list would not produce what you're looking for.
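If that is what's happening, two workarounds might sidestep it (a minimal, untested sketch): either hand read_html the URL directly, or wrap the HTML string in StringIO so pandas treats it as a file-like object and never checks it against the filesystem.

import requests
import pandas
from io import StringIO
from bs4 import BeautifulSoup

nyse = "http://www1.nyse.com/about/listed/IPO_Index.html"

# Option 1: let pandas fetch and parse the page itself.
tables_from_url = pandas.read_html(nyse)

# Option 2: keep the BeautifulSoup step, but pass a file-like object so the
# long HTML string is never treated as a candidate path or URL.
soup = BeautifulSoup(requests.get(nyse).text, "html.parser")
tables = soup.find_all("table")
tables_from_html = pandas.read_html(StringIO(str(tables)))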
So, I'm making a Python script that gets a webpage's content and compares it to a previously saved version to see if the webpage has changed. I'm getting the raw content using this method:
import requests

def getcontent(url):
    response = requests.get(url)
    return response.text
After that I do some cleanup of the content, quote escaping and such, but that's irrelevant. The issue I keep running into is that the webpage has some JavaScript code that generates a unique key, which my method downloads. Each time you grab the webpage content, the key is different. I have no idea what that key is for. The problem is that if the key is different, the new content and the saved content aren't identical.
How can I disable JavaScript from running when I request a webpage?
The token is generated server-side and can be used for various purposes (for example as a CSRF token).
The token will always be in the content of your response; no JavaScript needs to run for it to appear.
You should find a way to ignore or remove the token before comparing.
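For example, a rough sketch of that idea, normalizing the content before comparing; the attribute name form_key and the regex are only assumptions, so adapt them to whatever the changing key actually looks like in your page:

import re
import requests

def getcontent(url):
    return requests.get(url).text

def normalize(html):
    # Blank out the volatile key so two snapshots of an otherwise
    # unchanged page compare equal. "form_key" is a placeholder name.
    return re.sub(r'name="form_key" value="[^"]*"',
                  'name="form_key" value=""', html)

new = normalize(getcontent("http://example.com/page"))
with open("saved.html", encoding="utf-8") as f:
    old = normalize(f.read())
print("changed" if new != old else "unchanged")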
First of all, English is not my native language.
Problem
I'm trying to access and manipulate a form using MechanicalSoup, as described in the docs. I successfully logged in to the page using the given login form, which I found using the developer tools (F12) built into Chrome.
form action="https://www.thegoodwillout.de/customer/account/loginPost/"
The Form can be found here using the chrome "debugger"
This works fine and does not produce any error. I tried to up my game and move to a more complicated form, which is on this site. I managed to track down the form to this snippet:
form action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU,/product/115178/form_key/r19gQi8K03l21bYk/"
This will result in a
ValueError: No Closing quotation
which is weird, since it does not use any special characters, and I double-checked that every quotation mark is closed correctly.
What have I tried
I tried tracking down a more specific form that applies to the given shoe size, but this one form seems to handle all the content on the website. I searched the web and found several articles pointing to a bug inside Python, which I can hardly believe is true!
Source Code with attached error log
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.thegoodwillout.de/nike-air-vortex-schwarz-weiss-anthrazit-903896-010")
browser.select_form('form[action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU,/product/115178/form_key/r19gQi8K03l21bYk/"]')
NOTE: it all seems to trace back to a module called shlex, which is raising the error.
Finally the error log
It would be really helpful if you could point me in the right direction and link some websites I may not have fully investigated yet.
It's actually an issue with BeautifulSoup4, the library used by MechanicalSoup to navigate within HTML documents, related to the fact that you use a comma (,) in the CSS selector.
BeautifulSoup splits CSS selectors on commas, and therefore considers your query as: browser.select_form('form[action="https://www.thegoodwillout.de/checkout/cart/add/uenc/aHR0cHM6Ly93d3cudGhlZ29vZHdpbGxvdXQuZGUvbmlrZS1haXItdm9ydGV4LXNjaHdhcnotd2Vpc3MtYW50aHJheml0LTkwMzg5Ni0wMTA_X19fU0lEPVU and /product/115178/form_key/r19gQi8K03l21bYk/"], parsed separately. When parsing the first, it finds an opening " but no closing ", and errors out.
It's somewhat of a feature (you can pass multiple CSS selectors as arguments to select), but it's useless here (there's no point in providing several selectors when you expect a single object).
Solution: don't use commas in CSS selectors. You probably have other criteria to match your form.
You may try using %2C instead of the comma (untested).
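For example (untested), an attribute prefix match keeps the comma out of the selector entirely:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.thegoodwillout.de/nike-air-vortex-schwarz-weiss-anthrazit-903896-010")

# Match the form on the start of its action URL; the comma never appears
# in the selector, so nothing needs quoting.
browser.select_form('form[action^="https://www.thegoodwillout.de/checkout/cart/add"]')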
I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site HTML, but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hover text, but this defeats the purpose of looking at the HTML. Additionally, the mapping changes with each match history, so it is not feasible to do this for a large number of games.
Is there any way around this?
Thank you.
Explanation
Nearly everything on this webpage is loaded via JSON through JavaScript. We don't even have to request the original page. You will, however, have to piece the page back together via the IDs of items, masteries, and so on, which won't be too hard because you can request masteries similarly to how we fetch items below.
So, I went through the Network tab in the inspector and noticed that it loads the following JSON-formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
Notice that there is a gameHash and an id (similar to those in the link you posted). This page contains everything you need to rebuild the match page, provided you fetch all the JSON files it relies on.
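For example, a quick sketch of fetching that stats file with requests (assuming the endpoint is publicly reachable and needs no extra headers):

import requests

stats_url = ("https://acs.leagueoflegends.com/v1/stats/game/"
             "TRLH1/1002200043?gameHash=b98e62c1bcc887e4")
game = requests.get(stats_url).json()

# Print the top-level keys to get a feel for the structure before
# pasting the response into a JSON formatter.
print(sorted(game.keys()))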
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recommend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tooltips for each item in the game. You can access your desired item via theirJson['data']['1001']. Each item image's file name on the page is its id (1001 in this example).
For instance, for 'Boots of Speed':
import requests, json
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])
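To turn a raw item id into something readable, each entry should carry a name and usually a plaintext field (to the best of my knowledge; double-check against the formatter above):

# Follow-up to the snippet above; assumes itemJson is already loaded.
item = itemJson['data']['1001']
print(item['name'])               # e.g. "Boots of Speed"
print(item.get('plaintext', ''))  # short tooltip text, when present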
An alternative: Selenium
Selenium could be used for this. You should look it up. It has bindings for several programming languages, one being Python. It may work as you want it to here, but I sincerely think that the JSON method (described above), although a little more convoluted, will perform faster (since speed, based on your post, seems to be an important factor).
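If you do go down the Selenium road, a rough sketch might look like the following; the CSS selectors are placeholders, since the real class names of the graph points and tooltips would have to be inspected on the page first:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Firefox()
driver.get("http://matchhistory.na.leagueoflegends.com/en/#match-details/"
           "TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview")
driver.implicitly_wait(10)  # give the JavaScript time to render the graphs

# Placeholder selectors -- replace with the real ones from the page.
point = driver.find_element_by_css_selector(".graph-point")
ActionChains(driver).move_to_element(point).perform()
tooltip = driver.find_element_by_css_selector(".tooltip")
print(tooltip.text)

driver.quit()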
I already asked a previous question, but it was posted with VBA tags, etc. So I'll try again with proper tags and a proper title, since I've hopefully gained a bit of knowledge now.
The problem:
I need to find ~1000 dates from a database with plant variety data, which is probably behind a login, so here is a screenshot. Now I could of course fill out this form ~1000 times, but there must be a smarter way to do this. If it were a plain HTML site I would know what to do and have VBA just pull in the results. I have been reading all morning about these JavaScript pages and AJAX libraries, but it is above my level, so hopefully someone can help me out a bit. I also used Firebug to see what is going on when I press search:
These parameters are the same as in the picture above, just easier to read. They are left here for copying:
f.cc.facet.limit = -1
f.cc.facet.mincount = 1
f.end_date.facet.date.end = 2030-01-01T00:00:00Z
f.end_date.facet.date.gap = +5YEARS
f.end_date.facet.date.other = all
f.end_date.facet.date.start = 1945-01-01T00:00:00Z
f.end_type.facet.limit = 20
f.end_type.facet.mincount = 1
f.grant_start_date.facet.date.end = NOW/YEAR
f.grant_start_date.facet.date.gap = +5YEARS
f.grant_start_date.facet.date.other = all
f.grant_start_date.facet.date.start = 1900-01-01T00:00:00Z
f.status.facet.limit = 20
f.status.facet.mincount = 1
f.type.facet.limit = 20
f.type.facet.mincount = 1
facet = true
facet.date = grant_start_date
facet.date = end_date
facet.field = cc
facet.field = type
facet.field = status
facet.field = end_type
fl = uc,cc,type,latin_name,common_name,common_name_en,common_name_others,app_num,app_date,grant_start_date,den_info,den_final,id
hl = true
hl.fl = cc,latin_name,den_info,den_final
hl.fragsize = 5000
hl.requireFieldMatch = false
json.nl = map
q = cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles
qi = 3-9BgbCWwYBd7aIWPU1/onjQ==
rows = 25
sort = uc asc,score desc
start = 0
type = upov
wt = json
Source
fl=uc%2Ccc%2Ctype%2Clatin_name%2Ccommon_name%2Ccommon_name_en%2Ccommon_name_others%2Capp_num%2Capp_date%2Cgrant_start_date%2Cden_info%2Cden_final%2Cid
&hl=true
&hl.fragsize=5000
&hl.requireFieldMatch=false
&json.nl=map
&wt=json
&type=upov
&sort=uc%20asc%2Cscore%20desc
&rows=25
&start=0
&qi=3-9BgbCWwYBd7aIWPU1%2FonjQ%3D%3D
&hl.fl=cc%2Clatin_name%2Cden_info%2Cden_final
&q=cc%3AIT%20AND%20latin_name%3A(Zea%20Mays)%20AND%20den_info%3AAntilles
&facet=true
&f.cc.facet.limit=-1
&f.cc.facet.mincount=1
&f.type.facet.limit=20
&f.type.facet.mincount=1
&f.status.facet.limit=20
&f.status.facet.mincount=1
&f.end_type.facet.limit=20
&f.end_type.facet.mincount=1
&f.grant_start_date.facet.date.start=1900-01-01T00%3A00%3A00Z
&f.grant_start_date.facet.date.end=NOW%2FYEAR
&f.grant_start_date.facet.date.gap=%2B5YEARS
&f.grant_start_date.facet.date.other=all
&f.end_date.facet.date.start=1945-01-01T00%3A00%3A00Z
&f.end_date.facet.date.end=2030-01-01T00%3A00%3A00Z
&f.end_date.facet.date.gap=%2B5YEARS
&f.end_date.facet.date.other=all
&facet.field=cc
&facet.field=type
&facet.field=status
&facet.field=end_type
&facet.date=grant_start_date
&facet.date=end_date
And this is what the response looks like, at least according to Firebug:
{"response":{"start":0,"docs":[{"id":"6751513","grant_start_date":"1999-02-04T22:59:59Z","den_final":"Antilles","app_num":"005642_A 005642","latin_name":"Zea mays L.","common_name_others":["MAIS"],"uc":"ZEAAA_MAY","type":"NLI","app_date":"1997-01-10T22:59:59Z","cc":"IT"}],"numFound":1},"qi":"3-9BgbCWwYBd7aIWPU1/onjQ==","facet_counts":{"facet_queries":{},"facet_ranges":{},"facet_dates":{"end_date":{"after":0,"start":"1945-01-01T00:00:00Z","before":0,"2010-01-01T00:00:00Z":1,"between":1,"end":"2030-01-01T00:00:00Z","gap":"+5YEARS"},"grant_start_date":{"after":0,"1995-01-01T00:00:00Z":1,"start":"1900-01-01T00:00:00Z","before":0,"between":1,"end":"2015-01-01T00:00:00Z","gap":"+5YEARS"}},"facet_intervals":{},"facet_fields":{"status":{"approved":1},"end_type":{"ter":1},"type":{"nli":1},"cc":{"it":1}}},"sv":"bswa1.wipo.int","lastUpdated":1435987857572,"highlighting":{"6751513":{"den_final":["Antilles<\/em>"],"latin_name":["Zea<\/em> mays<\/em> L."],"cc":["IT<\/em>"]}}}
Edit:
It uses the GET method and XMLHttpRequest, as can be seen from this screenshot:
I already found how to run Python from Excel VBA here in this topic.
I also downloaded Beautiful Soup, but Python is not my kind of language, so any help would be greatly appreciated.
Image referred to in a comment on Will's answer
1) Use Excel to store your search parameters.
2) Run a few manual searches to find out which parameters you need to change on each request.
3) Invoke an HTTP GET request to the URL you found in Firebug/Fiddler (the URL it calls when you click "search" manually). See urllib3: https://urllib3.readthedocs.org/en/latest/ (there is a minimal sketch after this list).
4) Look at jsonpickle to help you deal with the JSON response, saving (serializing) it to a file.
5) Reading and writing data involves IO libraries. Google is your friend. (It's possibly easier to save your Excel file as a CSV and then just read the CSV file for your search parameters.)
6) Download PyCharm for your Python development - it's really good.
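A rough sketch of steps 3) and 4), using requests and the plain json module for brevity: SEARCH_URL is a placeholder for whatever endpoint Firebug/Fiddler shows when you press search, and the parameters are copied from your question (only q would change between plant varieties).

import json
import requests

SEARCH_URL = "https://example.invalid/select.jsp"  # placeholder endpoint

params = {
    "wt": "json",
    "type": "upov",
    "rows": 25,
    "start": 0,
    "sort": "uc asc,score desc",
    "q": "cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles",
}

result = requests.get(SEARCH_URL, params=params).json()

# Save the raw response for later reference...
with open("response.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2)

# ...and pull the dates straight out of the matching documents,
# following the structure of the response shown in the question.
for doc in result["response"]["docs"]:
    print(doc.get("app_date"), doc.get("grant_start_date"))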
Hope this helps.
I finally figured it out. I don't need to use Python; I can just use a URL and then import the content into Excel. I found out with Fiddler that the URL should become https://www3.wipo.int/pluto/user/jsp/select.jsp? and the query string from my question goes after that.
The rest of my solution can be found in another question I asked. It uses no Python, only VBA, which tells IE to open a website and copies its content.
I'm using mwlib in Python to iterate over a Wikipedia dump. I want to ignore redirects and just look at page contents with the actual full title. I've already run mw-buildcdb, and I'm loading that:
from mwlib import wiki  # assuming the standard mwlib entry point

wiki_env = wiki.makewiki(wiki_conf_file)
When I loop over wiki_env.wiki.articles(), the strings appear to contain redirect titles (I've checked this on a couple of samples against Wikipedia). I don't see an accessor that skips these, and wiki_env.wiki.redirects is an empty dictionary, so I can't check which article titles are actually just redirects that way.
I've tried looking through the mwlib code, but if I use
page = wiki_env.wiki.get_page(page_title)
wiki_env.wiki.nshandler.redirect_matcher(page.rawtext)
the page.rawtext appears to already be redirected (containing the full page content, and no indication that there is a title mismatch). Similarly the Article node returned by getParsedArticle() does not appear to contain the "true" title to check against.
Does anyone know how to do this? Do I need to run mw-buildcdb in a way that doesn't store redirects? As far as I can tell, that command just takes an input dump file and an output CDB, with no other options.
When in doubt, patch it yourself. :o)
mw-buildcdb now takes an --ignore-redirects command-line option: https://github.com/pediapress/mwlib/commit/f9198fa8288faf4893b25a6b1644e4997a8ff9b2