Python: Scraping a CSV file request

A frequent and long-time lurker here: I usually find my questions already answered. However, I have come across a perhaps simple, yet vague project that escapes me. I am fairly new to Python (currently using version 3.6).
I am looking at: https://www.ishares.com/us/products/239726/
From what I can tell, there is some jQuery involved near the "Holdings" portion of the page. If 'All' is selected instead of 'Top 10', there is an option to get holdings 'as of' a given date.
If a specific historical month is selected, a prompt to download a .csv file appears. What I would like to do is get each .csv file that is produced from that drop-down list, going back to Sept 29, 2006; in other words, automatically download the .csv file that is produced for each request made through this drop-down list.
To give some (not necessarily relevant) context, I am familiar with pandas and bs4, and perhaps some other less popular libraries. As background, I keep a couple of desk references: 'Beginning Python' by Magnus Lie Hetland and 'Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython' by Wes McKinney.
I would like some small direction on how to approach this issue, in case I am overlooking something simple. In other words, breadcrumbs are helpful, but I am not asking anyone to do all this work for me. I would like to explore and learn as much as humanly possible.
What libraries/methods should I perhaps use? I understand this is completely open-ended, so I would like to stick to bs4 and Pandas as much as possible. Other libraries are helpful as well, but those would be the focus.
Thanks!

"I would like some small direction on how to approach this issue"
Using your browser's developer tools, examine the network requests being made. You will see that, when you choose a historical month, a request is made. If you copy the URL from that request, you can paste it into your browser to see if you can "replay" the request to get the payload. I tested it, and you can. What's more, you can see the query parameters quite clearly; they are not obfuscated. This means you can programmatically generate URLs that you can then fetch with cURL or wget.
Do note that I tried to specify a file type of "csv" and got an empty response, but when I requested a file type of "json" I got the data. YMMV. Good luck!
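For illustration, here is a minimal sketch of that approach using the requests library. The endpoint and parameter names below (fileType, asOfDate) are assumptions standing in for whatever actually shows up in the network tab; copy the real ones from the captured request.

    import requests

    # Illustrative only: replace with the endpoint and query parameters you
    # captured in the browser's developer tools.
    BASE_URL = "https://www.ishares.com/us/products/239726/"

    def fetch_holdings(as_of_date):
        """Replay the holdings request for one historical date (e.g. '20061229')."""
        params = {
            "fileType": "json",   # "csv" returned an empty response in testing
            "asOfDate": as_of_date,
        }
        resp = requests.get(BASE_URL, params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        holdings = fetch_holdings("20061229")
        print(len(holdings))

From there, looping over the month-end dates back to September 2006 and feeding each payload into pandas (e.g. with pandas.json_normalize) is straightforward.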

Related

How to programmatically sync anki flashcard database with local file?

I would like to have a script run by cron or an Anki background job that will automatically read in a file (e.g. csv, tsv) containing all my flashcards and update the flashcard database in Anki automatically, so that I don't have to manually import my flashcards 1000 times a week.
Does anyone have any ideas how this can be achieved?
Some interesting links I've come across, including from answers, that might provide a lead toward solutions:
https://github.com/langfield/ki
https://github.com/patarapolw/ankisync
https://github.com/towercity/anki-cli
The most robust approach so far is to keep your collection under git and use ki to make Anki behave like a remote repository, so it's very easy to synchronise. The only constraint is the format of your collection: each card is kept as a single file, and there is no real way around this.
I'm the maintainer of ki, one of the tools you linked! I really appreciate the shout-out #BlackBeans.
It's hard to give you perfect advice without more details about your workflow, but it sounds to me like you've got the source-of-truth for your notes in tabular files, and you import these files into Anki when you've made edits or added new notes.
If this is the case, ki may be what you're looking for. As #BlackBeans mentioned, this tool allows you to convert Anki notes into markdown files, and more generally, handles moving your data from your collection to a git repository, and back.
Basically, if the reason why you've got stuff in tabular files is (1) because you want to version it, (2) because you want to use an external editor, or (3) because your content is generated programmatically, then you might gain some use from using ki.
Feel free to open an issue describing your use case in detail. I'd love to give you some support in figuring out a good workflow if you think it would be helpful. I am in need of more user feedback at the moment, so you'd be helping me out, too!
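For the import side specifically, here is a minimal sketch of a script that cron could run, assuming the source of truth is a two-column TSV of front/back pairs and that the AnkiConnect add-on is installed with Anki running (a different route than ki; the deck and model names are placeholders):

    import csv
    import json
    import urllib.request

    ANKI_CONNECT_URL = "http://localhost:8765"  # AnkiConnect's default port

    def anki_request(action, **params):
        """Send one request to the AnkiConnect add-on and return its result."""
        payload = json.dumps({"action": action, "version": 6, "params": params}).encode()
        with urllib.request.urlopen(urllib.request.Request(ANKI_CONNECT_URL, payload)) as resp:
            reply = json.load(resp)
        if reply.get("error"):
            raise RuntimeError(reply["error"])
        return reply["result"]

    def import_tsv(path, deck="Default", model="Basic"):
        """Read front/back pairs from a TSV file and add them as notes."""
        with open(path, newline="", encoding="utf-8") as fh:
            notes = [
                {
                    "deckName": deck,
                    "modelName": model,
                    "fields": {"Front": front, "Back": back},
                    "tags": ["auto-import"],
                }
                for front, back, *_ in csv.reader(fh, delimiter="\t")
            ]
        return anki_request("addNotes", notes=notes)

    if __name__ == "__main__":
        import_tsv("flashcards.tsv")

Duplicate notes are rejected by default, so re-running the script should only add cards that are genuinely new; updating existing notes in place is where a ki-style git workflow becomes the more comfortable option.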

Scraping data from a website with a search box

First of all I want to apologize if my question is too broad or generic, but it would really save me a lot of needlessly wasted time to get an answer to guide me in the right direction for the work I want to do. With that out of the way, here it goes.
I am trying to retrieve some publicly available data from a website, to create a dataset to work with for a Data Science project. My big issue is that the website does not have a friendly way to download it and, from what I have gathered, it also has no API. So getting the data requires skills that I do not possess. I would love to learn how to scrape the website (the languages I am most comfortable with are Python and R), and it would add some value to my project if I did it, but I am also somewhat pressured by time constraints, and I don't know if it is even possible to scrape the website, much less to learn how to do it in a few days.
The website is this one: https://www.rnec.pt/pt_PT/pesquisa-de-estudos-clinicos. It has a search box, and the only option I configure is to click the banner that says "Pesquisa Avançada" and then mark the box that says "Menor de 18 anos". I then click the "Pesquisar" button in the lower right, and the results that show up are the ones that I want to extract (either that or, if it's simpler, all the results, without checking the "Menor de 18 anos" box). On my computer, 2 results show up per page, and there are 38 pages in total. Each result has some of its details on the page where the results appear but, to get the full data from each entry, one has to click "Detalhes" in the lower right of each result, which opens a display with all the data from that result. If possible, I would love to download all the data from that "Detalhes" page of each result (the data there already contains the fields that show up in the search results page).
Honestly, I am ready to waste a whole day manually transcribing all the data, but it would be much better to do it computationally, even if it takes me two or three days to learn and do it.
I think that, for someone with experience in web scraping, it is probably super simple to check whether it is possible to download the data I described, and what the best way is to go about it (in general terms; I would research and learn it). But I really am lost when it comes to this, and just kindly want to ask for some help in showing me the way to go about it (even if the answer is "it is too complicated/impossible, just do it manually"). I know that there are some Python packages for web scraping, like BeautifulSoup and Selenium, but I don't really know if either of them would be appropriate.
I am sorry if my request is not exactly a short and simple coding question, but I have to try to gather any help or guidance I can get. Thank you in advance to everyone who reads my question and a special thank you if you are able to give me some pointers.
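To give a concrete sense of what that would involve, below is a rough Selenium sketch of the flow described in the question. Every selector is a placeholder guessed from the visible button labels ("Pesquisa Avançada", "Menor de 18 anos", "Pesquisar"); you would need to inspect the page with the browser's developer tools and substitute the real ones, and the "next page" control is likewise an assumption.

    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.rnec.pt/pt_PT/pesquisa-de-estudos-clinicos")
    time.sleep(3)  # crude wait for the page to render; WebDriverWait is more robust

    # Open the advanced search, tick "Menor de 18 anos", and run the search.
    driver.find_element(By.XPATH, "//*[contains(text(), 'Pesquisa Avançada')]").click()
    driver.find_element(By.XPATH, "//label[contains(., 'Menor de 18 anos')]").click()
    driver.find_element(By.XPATH, "//button[contains(., 'Pesquisar')]").click()
    time.sleep(3)

    pages = []
    for _ in range(38):  # the question mentions 38 result pages
        # Hand the rendered HTML to BeautifulSoup; the "Detalhes" links (or the
        # URLs behind them) would be extracted from this soup and fetched next.
        pages.append(BeautifulSoup(driver.page_source, "html.parser"))
        next_links = driver.find_elements(By.XPATH, "//a[contains(., 'Seguinte')]")
        if not next_links:
            break
        next_links[0].click()
        time.sleep(2)

    driver.quit()

Whether requests + BeautifulSoup alone would suffice (i.e. without Selenium) depends on whether the results are rendered server-side or fetched by JavaScript; checking the network tab for an underlying JSON or HTML endpoint is worth a few minutes before committing to either route.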

Web scraping a page: notification on stock price updates

Question
I am interested in web scraping and data analysis and I would like to develop my skills by writing a program using Python 2.7 that will monitor changes in stock prices. My goal is to be able to compare two stocks (for the time being) at certain points throughout the day, save that info into a document format easily handled by pandas (which I will learn how to use after I get this front end working). In the end I would like to map relationship trends between chosen stocks (when this one goes up by x what effect does that have on the other one). This is just a hobby project so it doesn't matter if the code is production quality.
My Experience
I am a brand new Python programmer (I have a very basic understanding of python and no real experience with any non-included modules) but I do have a technical background so if the answer to my question requires reading and understanding documentation intended for intermediate level programmers that should be OK.
For the basics I am working my way through Learning Python: Powerful Object-Oriented Programming by Mark Lutz if this helps any.
What I'm Looking For
I recognize this is a very broad subject and I am not asking for anyone to write any actual code examples or anything. I just want some direction as to where to go to get information more specific to my interests and goals.
This is actually my first post on this forum so please forgive me if this doesn't follow best practices for posting. I did search for other questions like mine and read the posting tips docs prior to writing this.
So, you want to web-scrape? If you're using Python 2.7, then you'll want to look into the urllib2, requests and BeautifulSoup libraries. If you're using Python 3.x, then you'll want to peek at urllib, urllib.request, and BeautifulSoup, again. Together, these libraries should accomplish everything you're looking to do in terms of web-scraping.
If you're interested in scraping stock data, might I suggest the yahoo_finance package? This is a Python wrapper for the Yahoo Finance API. Whenever I've done things with stock data in the past, this module was invaluable. There's also googlefinance, too. It's much easier to use these already-developed wrappers to extract stock info, rather than scraping hundreds (if not thousands) of web pages to get the data you want.
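As an illustration of how little code that approach needs, here is a small sketch that polls two example tickers and appends the quotes to a CSV that pandas can read later. It follows the yahoo_finance package's documented usage at the time (Share(...).get_price()); the underlying Yahoo endpoints have changed over the years, so treat it as a sketch rather than something guaranteed to run today.

    import csv
    import time

    from yahoo_finance import Share  # pip install yahoo_finance

    SYMBOLS = ("AAPL", "MSFT")  # example tickers; substitute your own

    # Append one timestamped row per symbol; pandas.read_csv("prices.csv")
    # can pick the file up later for the comparison/trend analysis.
    with open("prices.csv", "a") as fh:
        writer = csv.writer(fh)
        for symbol in SYMBOLS:
            share = Share(symbol)
            writer.writerow([time.strftime("%Y-%m-%d %H:%M:%S"),
                             symbol,
                             share.get_price()])

Scheduling that with cron (or a simple loop with time.sleep) at the "certain points throughout the day" mentioned in the question gives you the raw data for the later pandas work.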

Total Downloads of Module Missing on PyPi

Up until recently, it was possible to see how many times a python module indexed on https://pypi.python.org/pypi had been downloaded (each module listed downloads for the past 24hrs, week and month). Now that information seems to be missing.
Download numbers are very helpful information when evaluating whether to build code off of one module or another. They also seem to be referenced by sites such as https://img.shields.io/
Does anyone know what happened? And/or, where I can view/retrieve that information?
This email from Donald Stufft (PyPI maintainer) on the distutils mailing list says:
Just an FYI, I've disabled download counts on PyPI for the time being. The statistics stack is broken and needs engineering effort to fix it back up to deal with changes to PyPI. It was suggested that hiding the counts would help prevent user confusion when they see things like "downloaded 0 times" making people believe that a library has no users, even if it is a significantly downloaded library.
I'm unlikely to get around to fixing the current stack since, as part of Warehouse, I'm working on a new statistics stack which is much better. The data collection and storage parts of that stack are already done and I just need to get querying done (made more difficult by the fact that the new system queries can take 10+ seconds to complete, but can be queried on any dimension) and a tool to process the historical data and put it into the new storage engine.
Anyways, this is just to let folks know that this isn't a permanent loss of the feature and we won't lose any data.
So I guess we'll have to wait for a new stats stack in PyPI.
I just released http://pepy.tech/ to view the downloads of a package. I use the official data which is stored in BigQuery. I hope you will find it interesting :-)
Also, the site is open source: https://github.com/psincraian/pepy
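If you would rather query the underlying data yourself, the download logs live in a public BigQuery dataset. A minimal sketch with the google-cloud-bigquery client might look like the following; the dataset and table names refer to the current public PyPI dataset and are an assumption on my part (they have changed over time), and running the query requires a Google Cloud project with credentials configured.

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()  # needs Google Cloud credentials configured

    # Count downloads of one package over the last 30 days. Verify the table
    # name against the public PyPI dataset before relying on it.
    query = """
        SELECT COUNT(*) AS downloads
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE file.project = 'requests'
          AND DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    """
    row = list(client.query(query).result())[0]
    print(row.downloads)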
Don't know what happened (although it has happened before), but you might want to try the PyPI ranking, or any of the several available modules and recipes for doing this. For example:
Vanity
pyStats
random recipe
But consider that a lot of the downloads might be mirrors and not necessarily "real" user downloads. You should take that into account in your evaluation. The library's mailing list (or other preferred channel) might be a better way to know what version you should install.
The PyPI count is disabled temporarily, as posted by dmand, but there are some sites which may tell you Python package statistics, like pypi-stats.com (they say it shows real-time information) and pypi-ranking.info (this might not give you real-time information).
You can also find some PyPI packages which can give you download information.

Python library for accessing local wikipedia?

I am trying to do some research on Wikipedia data, and I am good at Python.
I came across this library, seems nice: https://pypi.python.org/pypi/wikipedia/
I don't want to hit Wikipedia directly, as this is slow, and also I am trying to access a lot of data and might run into their API limits.
Can I somehow hack this to make it access a local copy of the Wikipedia data? I know I can run a whole Wikipedia server and try to do that, but that seems a roundabout way.
Is there a way to just point to the folder and get this library to work as it does? Or are you aware of any other libraries that do this?
Thank you.
I figured out what I need. I think I shouldn't be searching for an API; what I am looking for is a parser. Here are a couple of options I have narrowed down so far. Both seem like solid starting points.
wikidump:
https://pypi.python.org/pypi/wikidump/0.1.2
mwlib:
https://pypi.python.org/pypi/mwlib/0.15.14
Update: While these are good parsers for Wikipedia data, I found them too limiting in one way or the other, not to mention the lack of documentation. So I eventually went with good old Python ElementTree and worked with the XML directly.
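For anyone taking the same route, here is a minimal sketch of streaming a MediaWiki XML dump with ElementTree so the whole file never has to fit in memory. The namespace URI matches one common export version and the dump filename is just an example; check the root element of your own dump for the exact value.

    import xml.etree.ElementTree as ET

    # Namespace used by MediaWiki XML exports; the version suffix varies
    # between dumps, so confirm it against your file's root element.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    def iter_pages(dump_path):
        """Yield (title, wikitext) pairs from the dump, one page at a time."""
        for _, elem in ET.iterparse(dump_path):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext("{0}revision/{0}text".format(NS))
                yield title, text
                elem.clear()  # release the parsed page to keep memory flat

    if __name__ == "__main__":
        for title, text in iter_pages("enwiki-latest-pages-articles.xml"):
            print(title)
            break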
