The website "Download the GLEIF Golden Copy and Delta Files" has buttons that download data that I want to retrieve automatically with a python script. Usually when I want to download a file I use mget or similar, but that will not work here (at least I don't think it will).
For some reason I cannot fathom, the producers of the data seem to want to force one to manually download the files. I really need to automate this to reduce the number of steps for my users (and frankly for me), since there are a great many files in addition to these and I want to automate as many as possible (all of them).
So my question is this - is there some kind of python package for doing this sort of thing? If not a python package, is there perhaps some other tool that is useful for it? I have to believe this is a common annoyance.
Yup, you can use BeautifulSoup to scrape the URLs then download them with requests.
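For example, here is a minimal sketch assuming the download buttons are ordinary <a> links whose href ends in .zip; if the page actually builds its links with JavaScript, you would need to call whatever JSON endpoint it uses (or fall back to a browser-automation tool such as Selenium). The page URL below is also an assumption:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    # Assumed URL of the GLEIF download page; adjust to the real one.
    PAGE = "https://www.gleif.org/en/lei-data/gleif-golden-copy/download-the-golden-copy-and-delta-files"

    html = requests.get(PAGE, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    for a in soup.find_all("a", href=True):
        url = urljoin(PAGE, a["href"])
        if not url.endswith(".zip"):
            continue
        name = url.rsplit("/", 1)[-1]
        # Stream the download to disk so large files don't sit in memory.
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(name, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)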
Related
I'm trying to download files automatically from a utility, like PG&E, so I can view my statements easily, since I have a couple of different accounts under one login. Currently, I'm using Python to figure it out, but I'm stuck on how to actually write the code to download the files. As of right now, I can only access the statement information within Python; I cannot figure out how to download the actual bills themselves. Does anyone have additional information or resources you can share so I can try downloading the bills using Python?
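For what it's worth, the missing download step usually looks something like the sketch below, assuming you already have a logged-in requests.Session and can pull a statement URL out of the data you can already see (the login flow and the URL here are placeholders, not PG&E's real endpoints):

    import requests

    session = requests.Session()
    # ... perform the site's login flow here so the session carries your auth cookies ...

    # Placeholder URL; in practice you would take this from the statement data
    # you can already read in Python.
    bill_url = "https://example.com/billing/statement-2024-01.pdf"

    resp = session.get(bill_url, timeout=30)
    resp.raise_for_status()

    with open("statement-2024-01.pdf", "wb") as f:
        f.write(resp.content)          # the PDF bytes, saved locally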
I recently started coding, but then took a brief break. I started a new job and I'm under some confidentiality restrictions, so I need to make sure Python and pandas are secure before I do this. I'll also be talking with IT on Monday.
I was wondering whether pandas in Python is a purely local library, or whether the data gets sent to or from somewhere else. If I write something with pandas, will the data be stored somewhere by pandas?
The best example of what I'm doing is a Medium article about scraping data out of tables that don't have CSV exports.
https://medium.com/#ageitgey/quick-tip-the-easiest-way-to-grab-data-out-of-a-web-page-in-python-7153cecfca58
Creating a DataFrame out of a dict, doing vectorized operations on its rows, printing out slices of it, etc. are all completely local. I'm not sure why this matters. Is your IT department going to say, "Well, this looks fishy—but some random guy on the internet says it's safe, so forget our policies, we'll allow it"? But, for what it's worth, you have this random guy on the internet saying it's safe.
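For example, everything in this little sketch runs entirely in local memory (the column names and numbers are made up):

    import pandas as pd

    # A DataFrame built from a plain dict: purely in-memory, no network access.
    df = pd.DataFrame({"account": ["A", "B", "C"], "kwh": [310, 275, 402]})

    df["cost"] = df["kwh"] * 0.12     # vectorized operation on a column
    print(df[df["kwh"] > 300])        # slicing/filtering, printed locally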
However, Pandas can be used to make network requests. Some of the IO functions can take a URL instead of a filename or file object. Some of them can also use another library that does so; e.g., if you have lxml installed, read_html will pass the filename to lxml to open, and if that filename is an HTTP URL, lxml will go fetch it.
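For example (the URL is just a placeholder):

    import pandas as pd

    local = pd.read_csv("data.csv")                       # reads a local file, no network
    remote = pd.read_csv("https://example.com/data.csv")  # same call, but this one makes an HTTP request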
This is rarely a concern, but if you want to get paranoid, you could imagine ways in which it might be.
For example, let's say your program is parsing user-supplied CSV files and doing some data processing on them. That's safe; there's no network access at all.
Now you add a way for the user to specify CSV files by URL, and you pass them into read_csv and go fetch them. Still safe; there is network access, but it's transparent to the end user and obviously needed for the user's task; if this weren't appropriate, your company wouldn't have asked you to add this feature.
Now you add a way for CSV files to reference other CSV files: if column 1 is #path/to/other/file, you recursively read and parse path/to/other/file and embed it in place of the current row. Now, what happens if I can give one of your users a CSV file where, buried at line 69105, there's #http://example.com/evilendpoint?track=me (an endpoint which does something evil, but then returns something that looks like a perfectly valid thing to insert at line 69105 of that CSV)? Now you may be facilitating my hacking of your employees, without even realizing it.
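To make that concrete, a hypothetical implementation of such an include feature might look like the sketch below; nothing here is a real pandas feature, it is only meant to show where the silent network request would sneak in:

    import pandas as pd

    def read_csv_with_includes(path_or_url):
        """Hypothetical: expand rows whose first column starts with '#'."""
        rows = []
        df = pd.read_csv(path_or_url, header=None)
        for _, row in df.iterrows():
            first = str(row.iloc[0])
            if first.startswith("#"):
                # Danger: if the reference is an HTTP URL, this recursion makes a
                # network request chosen by whoever authored the file, not by the user.
                rows.append(read_csv_with_includes(first[1:]))
            else:
                rows.append(row.to_frame().T)
        return pd.concat(rows, ignore_index=True)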
Of course this is a more limited version of exactly the same functionality that's in every web browser with HTML pages. But maybe your IT department has gotten paranoid, clamped down security on browsers, and written an application-level sniffer to detect suspicious follow-up requests from HTML, and hasn't thought to do the same thing for references in CSV files.
I don't think that's a problem a sane IT department should worry about. If your company doesn't trust you to think about these issues, they shouldn't hire you and assign you to write software that involves scraping the web. But then not every IT department is sane about what they do and don't get paranoid about. ("Sure, we can forward this under-1024 port to your laptop for you… but you'd better not install a newer version of Firefox than 16.0…")
I'm using urllib to open one site and get some information on it.
Is there a way to "open" only the part of the site I need and discard the rest (by discard I mean not download/load the rest)?
I'm not sure what you are trying to do. If you are simply trying to parse the site to find the useful "information", then I recommend using the library BeautifulSoup. That library makes it easy to keep certain parts of the site while discarding the rest.
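For example, bs4's SoupStrainer tells the parser to build a tree only for the tags you care about and throw away everything else; note that the whole page is still downloaded first, this only limits what gets parsed and kept (the URL is a placeholder):

    import urllib.request
    from bs4 import BeautifulSoup, SoupStrainer

    html = urllib.request.urlopen("https://example.com/page").read()

    # Only build tree nodes for <a> tags; everything else is discarded by the parser.
    only_links = SoupStrainer("a")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

    for a in soup.find_all("a", href=True):
        print(a["href"])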
If, however, you are trying to save download bandwidth by downloading only a piece of the site, then you will need to do a lot more work. If that is the case, please say so in your question and I'll update the answer.
You should be able to call read(n) instead of read(); this reads n bytes instead of the whole response. Then append each chunk to the bytes you have already downloaded and check whether they contain what you're looking for. Once they do, you should be able to stop the download with .close().
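A sketch of that approach using Python 3's urllib.request (the URL and the marker string are placeholders):

    import urllib.request

    MARKER = b'<div id="target">'     # whatever identifies the part you need
    resp = urllib.request.urlopen("https://example.com/page")

    buf = b""
    while MARKER not in buf:
        chunk = resp.read(4096)       # read 4 KB at a time instead of the whole page
        if not chunk:                 # reached the end without finding the marker
            break
        buf += chunk

    resp.close()                      # stop downloading once you have enough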
On an Ubuntu server, I want to create PDFs which include other static PDFs. I have tried using ReportLab with pyPdf. Ideally I would use ReportLab to do the whole thing, but importing existing PDFs into it requires their PageCatcher product, which has a large recurring fee.
So I use pyPdf to merge a page created with ReportLab and my other pdfs. The problem is that even though this looks fine in Acrobat and Foxit, part of one of the pages prints garbled on a Xerox 7400 color printer. I can't figure out the issue, but would be willing to buy a more integrated solution if it existed and was reasonably priced. I thought PDF Creator Pilot was it until I saw that it was Windows only.
So is there a reasonably priced ($1K or less) solution or a different suggestion?
I have had a lot of success with the Java library iText. They have a great library of samples for pretty much anything you could think of doing with PDF files. This example is for concatenating PDF files and sounds like it would do what you need: http://itextpdf.com/examples/index.php?page=example&id=123. There is also PDFBox which is another great Java based PDF manipulation library.
I realize that you are looking for a Python-based solution, but there may not be many other options. If you are using the Jython interpreter instead of CPython, integrating iText should be trivial. If not, then you could consider calling out to it as a separate process. I realize that may not be ideal for your situation, but I figured I would mention it as an option.
Another non-Python answer. If you are just merging pages, then pdftk does that well (along with a lot of other things).
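If you want to keep the rest of the pipeline in Python, you can also just drive pdftk from your script; a minimal sketch, assuming pdftk is installed and on the PATH and that the ReportLab-generated page has already been written to disk (the filenames are placeholders):

    import subprocess

    # pdftk <input files...> cat output <merged file>
    subprocess.run(
        ["pdftk", "cover_from_reportlab.pdf", "static_part.pdf",
         "cat", "output", "merged.pdf"],
        check=True,   # raise if pdftk exits with an error
    )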
I use CVS to maintain all my Python snippets, notes, and C and C++ code. As the hosting provider also provides a public web server, I was thinking that I should automatically convert the CVS repository into a programming-snippets website.
cvsweb is not what I mean.
Doxygen is for a complete project and for browsing cross-referenced code online. I think Doxygen is more like a web-based ctags.
I tried rest2web, but it requires that I write restweb headers and that the files be .txt files, which would interfere with the programming-language syntax.
An approach I have thought of is:
1) Run source-highlight and create .html pages for all the scripts.
2) Write a script to index those .html pages and create an index page (a rough sketch of this step appears after the list).
3) Create the website from those pages.
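A rough sketch of what step 2 could look like (all paths are placeholders, and the HTML is deliberately bare-bones):

    import os

    PAGES_DIR = "html_pages"   # where the source-highlight output was written

    links = []
    for root, _, files in os.walk(PAGES_DIR):
        for name in sorted(files):
            if name.endswith(".html") and name != "index.html":
                rel = os.path.relpath(os.path.join(root, name), PAGES_DIR)
                links.append('<li><a href="%s">%s</a></li>' % (rel, rel))

    # Write a single index page linking to every generated snippet page.
    with open(os.path.join(PAGES_DIR, "index.html"), "w") as f:
        f.write("<html><body><h1>Snippets</h1><ul>\n"
                + "\n".join(links)
                + "\n</ul></body></html>")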
Before proceeding, I thought I would discuss it here, in case the members have any suggestions.
What do you do when you want to maintain your snippets and notes in CVS and also auto-generate a good website from them? I like rest2web for converting notes to HTML.
Run Trac on the server linked to the (svn) repository. The Trac wiki can conveniently refer to files and changesets. You get TODO tickets, too.
enscript or pygmentize (part of pygments) can be used to convert code to HTML. You can use a custom header or footer to link to the actual code for download.
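A minimal sketch using the Pygments API from Python rather than the pygmentize command (the input filename is a placeholder, and splicing the download link into the footer this way is just one option):

    from pygments import highlight
    from pygments.lexers import get_lexer_for_filename
    from pygments.formatters import HtmlFormatter

    source_path = "snippet.py"                      # placeholder input file
    code = open(source_path).read()

    # full=True emits a complete standalone HTML document with embedded CSS.
    formatter = HtmlFormatter(full=True, linenos=True, title=source_path)
    html = highlight(code, get_lexer_for_filename(source_path), formatter)

    # Crude "footer": add a link to the raw file just before the closing tag.
    html = html.replace("</body>",
                        '<p><a href="%s">Download the source</a></p></body>' % source_path)

    with open(source_path + ".html", "w") as f:
        f.write(html)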
I finally settled on rest2web. I had to do the following.
Used a separate Python script to recursively copy the files in the CVS checkout to a separate directory (a sketch of that step appears after this list).
Added extra index.txt and template.txt files to all the directories that I wanted to appear in the website.
The best thing about rest2web is that it supports Python scripting within template.txt, so I just ran a loop over the contents and indexed them in the page.
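For reference, a sketch of what that copy step can look like (the directory names are placeholders; this is the idea, not the exact script):

    import os
    import shutil

    SRC = "cvs_checkout"    # working copy checked out from CVS
    DST = "website_src"     # staging directory that rest2web will process

    for root, dirs, files in os.walk(SRC):
        dirs[:] = [d for d in dirs if d != "CVS"]   # skip CVS bookkeeping directories
        target = os.path.join(DST, os.path.relpath(root, SRC))
        os.makedirs(target, exist_ok=True)
        for name in files:
            shutil.copy2(os.path.join(root, name), target)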
There is still a lot more to do to automate the entire process, e.g. inline viewing of programs and colorization, which I think can be done with a few more trials.
I have the completed website here; it is called uthcode.