Is pandas a local-only library - python

I recently started coding, but took a brief break. I started a new job and I'm under some confidentiality restrictions, so I need to make sure Python and pandas are secure before I use them (I'll also be talking with IT on Monday).
I was wondering whether pandas is a purely local library, or whether the data gets sent to or from somewhere else. If I write something in pandas, will the data be stored somewhere by pandas?
The best example of what I'm doing is a Medium article about scraping data from tables that don't have CSV exports.
https://medium.com/@ageitgey/quick-tip-the-easiest-way-to-grab-data-out-of-a-web-page-in-python-7153cecfca58

Creating a DataFrame out of a dict, doing vectorized operations on its rows, printing out slices of it, etc. are all completely local. I'm not sure why this matters. Is your IT department going to say, "Well, this looks fishy—but some random guy on the internet says it's safe, so forget our policies, we'll allow it"? But, for what it's worth, you have this random guy on the internet saying it's safe.
However, Pandas can be used to make network requests. Some of the IO functions can take a URL instead of a filename or file object. Some of them can also use another library that does so; e.g., if you have lxml installed, read_html will pass the filename to lxml to open, and if that filename is an HTTP URL, lxml will go fetch it.
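To make the distinction concrete, here is a minimal sketch (the URLs are just placeholders) showing how the same reader functions stay local or go out to the network depending on what you hand them:

import pandas as pd

# Entirely local: reads a file from disk; no network access of any kind.
df_local = pd.read_csv('data/report.csv')

# Same function, but pandas will fetch this over HTTP before parsing it.
df_remote = pd.read_csv('https://example.com/report.csv')

# read_html hands an HTTP URL off to the parser (e.g. lxml) to fetch and parse.
tables = pd.read_html('https://example.com/page-with-tables.html')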
This is rarely a concern, but if you want to get paranoid, you could imagine ways in which it might be.
For example, let's say your program is parsing user-supplied CSV files and doing some data processing on them. That's safe; there's no network access at all.
Now you add a way for the user to specify CSV files by URL, and you pass them into read_csv and go fetch them. Still safe; there is network access, but it's transparent to the end user and obviously needed for the user's task; if this weren't appropriate, your company wouldn't have asked you to add this feature.
Now you add a way for CSV files to reference other CSV files: if column 1 is #path/to/other/file, you recursively read and parse path/to/other/file and embed it in place of the current row. Now, what happens if I can give one of your users a CSV file where, buried at line 69105, there's #http://example.com/evilendpoint?track=me (an endpoint which does something evil, but then returns something that looks like a perfectly valid thing to insert at line 69105 of that CSV)? Now you may be facilitating my hacking of your employees, without even realizing it.
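To make that concrete, a naive implementation of such an include feature might look like the following (entirely hypothetical code; it exists only to show where the remote fetch sneaks in):

import pandas as pd

def read_csv_with_includes(path_or_url):
    # Hypothetical feature: a first cell of "#other.csv" means "splice that file in here".
    df = pd.read_csv(path_or_url, header=None)
    pieces = []
    for _, row in df.iterrows():
        ref = str(row.iloc[0])
        if ref.startswith('#'):
            # The reference could be a local path -- or an attacker-supplied URL,
            # in which case read_csv will happily make an HTTP request for you.
            pieces.append(read_csv_with_includes(ref[1:]))
        else:
            pieces.append(row.to_frame().T)
    return pd.concat(pieces, ignore_index=True)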
Of course this is a more limited version of exactly the same functionality that's in every web browser with HTML pages. But maybe your IT department has gotten paranoid, clamped down security on browsers, and written an application-level sniffer to detect suspicious follow-up requests from HTML, and hasn't thought to do the same thing for references in CSV files.
I don't think that's a problem a sane IT department should worry about. If your company doesn't trust you to think about these issues, they shouldn't hire you and assign you to write software that involves scraping the web. But then not every IT department is sane about what they do and don't get paranoid about. ("Sure, we can forward this under-1024 port to your laptop for you… but you'd better not install a newer version of Firefox than 16.0…")

Related

Extracting only few columns from a FITS file that is freely available online to download using python

I'm working on a model of the universe, for which I'm using data available on the Sloan Digital Sky Survey site. The problem is that some files are more than 4 GB large (more than 50 GB in total), and while I know those files contain a lot of data columns, I only want data from a few of them. I have heard about web scraping, so I searched for how to do it, but that didn't help, as all the tutorials explained how to download the whole file using Python. I want to know whether there is any way to extract only a few columns from such a file, so that I only have the data I need and don't have to download the whole large file just for a small fraction of its data.
Sorry, my question is just words and no code, because I'm not that experienced in Python. I just searched online and learned how to do basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest some other ways to reduce the amount of data I have to download.
Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract the columns that have coordinates (ra, dec), distance, velocity, and redshift from the files.
Also, is there a way to do the same thing with CSV files or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and server side.
First of all, FITS tables are stored in a row-oriented binary layout, meaning that if you wanted to stream a portion of a FITS table you could read it one row at a time. But to read individual columns, you would need to make a partial read of each row, for every single row in the table. Some web servers support what are called "range requests", meaning you can request only a few ranges of bytes from a file instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented, this could be feasible: you could download just the header of the file to determine the byte ranges of the columns, and then download just the ranges for those columns.
Unfortunately, since FITS tables are row-oriented, if you wanted to load, say, 3 columns from a table containing a million rows, that would involve 3 million range requests, which would likely involve enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow anything so extreme).
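For reference, a single range request with the requests library looks roughly like this (a sketch only: the URL is a placeholder, and the server has to support ranges for it to work):

import requests

url = 'https://example.com/big_table.fits'   # placeholder URL
# Ask the server for only the first 2880 bytes (one FITS header block).
resp = requests.get(url, headers={'Range': 'bytes=0-2879'})
if resp.status_code == 206:
    # 206 Partial Content: the server honoured the range request.
    header_block = resp.content
else:
    # A plain 200 means the server ignored the Range header and sent everything.
    print('This server does not support range requests')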
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided and all the files there are gzip-compressed, so this is especially difficult to deal with, since you can't know which byte ranges to request without decompressing the files first.
As an aside, since you asked, you will have the same problem with CSV, only worse since CSV fields are not typically in fixed-width format, so there is no way to know how to extract individual columns without downloading the whole file.
For FITS, maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. I don't know whether such a thing already exists, but I don't think it does in a very general sense. So this would a) have to be developed, and b) you would have to ask anyone hosting the files you want to access to host such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.
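If you do go the download-then-extract route, a sketch along these lines with astropy would pull out just the columns you care about (the filename and column names are guesses; check the actual table header for the real ones):

from astropy.io import fits

# After downloading (and gunzipping) one of the catalogue files:
# memmap=True avoids pulling the whole multi-GB table into memory at once.
with fits.open('galaxy_DR12v5_CMASS_North.fits', memmap=True) as hdul:
    table = hdul[1].data            # the binary table usually lives in HDU 1
    print(hdul[1].columns)          # check the real column names first
    ra, dec = table['RA'], table['DEC']     # hypothetical column names
    z = table['Z']
# Save just these arrays (e.g. with numpy.save), then delete the original file.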

Python and downloading Google Sheets feeds

I'm trying to download a spreadsheet from Google Drive inside a program I'm writing (so the data can be easily updated across all users), but I've run into a few problems:
First, and perhaps foolishly, I only want to use the basic Python distribution, so I'm not requiring people to download extra modules to run it. The urllib.request module seems to work well enough for basic downloading, specifically the urlopen() function, when I've tested it on normal webpages (more on why I say "normal" below).
Second, most questions and answers on here deal with retrieving a .csv from the spreadsheet. While this might work even better than trying to parse the feeds (and I have actually gotten it to work), using only the basic address means only the first sheet is downloaded, and I need to add a non-obvious gid to get the others. I want to have the program independent of the spreadsheet, so I only have to add new data online and the clients are automatically updated; trying to find a gid programmatically gives me trouble because:
Third, I can't actually get the feeds (interface described here) to be downloaded correctly. That does seem to be the best way to get what I want—download the overview of the entire spreadsheet, and from there obtain the addresses to each sheet—but if I try to send that through urlopen(feed).read() it just returns b''. While I'm not exactly sure what the problem is, I'd guess that the webpage is empty very briefly when it's first loaded, and that's what urlopen() thinks it should be returning. I've included what little code I'm using below, and was hoping someone had a way of working around this. Thanks!
import urllib.request as url
key = '1Eamsi8_3T_a0OfL926OdtJwLoWFrGjl1S2GiUAn75lU'
gid = '1193707515'
# Single sheet in CSV format
# feed = 'https://docs.google.com/spreadsheets/d/' + key + '/export?format=csv&gid=' + gid
# Document feed
feed = 'https://spreadsheets.google.com/feeds/worksheets/' + key + '/private/full'
csv = url.urlopen(feed).read()
(I don't actually mind publishing the key/gid, because I am planning on releasing this if I ever finish it.)
Requires OAuth2 or a password.
If you log out of Google and try again with your browser, it fails (it failed when I logged out). It looks like it requires a Google account.
I did have it working with an application password a while ago, but I now use OAuth2. Both are quite a bit of messing about compared to CSV.
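For what it's worth, the per-sheet CSV export the question already sketches does work with nothing but the standard library, provided the spreadsheet itself is shared publicly; a rough sketch reusing the question's key and gid:

import urllib.request as url

key = '1Eamsi8_3T_a0OfL926OdtJwLoWFrGjl1S2GiUAn75lU'
gid = '1193707515'

# Only works without OAuth2 if the spreadsheet is shared as
# "anyone with the link can view" (or published to the web).
export = ('https://docs.google.com/spreadsheets/d/' + key +
          '/export?format=csv&gid=' + gid)
data = url.urlopen(export).read()
print(data.decode('utf-8'))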
This sounds like a perfect use case for a wrapper library I once wrote. Let me know if you find it useful.

Something wrong with using CSV as a database for a webapp?

I am using Flask to make a small webapp to manage a group project; in this website I need to manage attendance and also meeting reports. I don't have time to get into SQLAlchemy, so I need to know what the bad things about using CSV as a database might be.
Just don't do it.
The problem with CSV is …
a, concurrency is not possible: when two people access your app at the same time, there is no way to make sure they don't interfere with each other and overwrite each other's changes. There is no way to solve this when using a CSV file as a backend.
b, speed: whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM (object-relational mapper), which translates Python code into SQL and manages the SQL database for you.
Two good options are Peewee and Pony ORM. Both are easy to use and translate between Python and SQL in both directions. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start with SQLite as a backend, and if your app grows larger you can plug in MySQL or PostgreSQL easily.
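As a rough idea of how little code Peewee needs for something like an attendance table (a sketch only; the fields are made up for illustration):

import datetime
from peewee import SqliteDatabase, Model, CharField, DateField, BooleanField

db = SqliteDatabase('project.db')        # single-file SQLite backend

class Attendance(Model):                 # made-up fields, just for illustration
    member = CharField()
    meeting_date = DateField()
    present = BooleanField(default=False)

    class Meta:
        database = db

db.connect()
db.create_tables([Attendance])
Attendance.create(member='Alice',
                  meeting_date=datetime.date(2016, 5, 1),
                  present=True)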
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with that as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will make sure you can swap out your CSV storage in a matter of days.
I.e. pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in your Flask app; use initPersistence(). Don't write "csvFile.appendRecord()"; use "persister.saveNewReport()". When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits, like not having to use a mock library in tests to make them faster; you just provide another persister.
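A sketch of what that separation might look like (the names and fields are made up; the point is that the Flask code only ever talks to the persister):

import csv

class CsvPersister:
    """Naive CSV-backed persister; swappable later for a database-backed one."""

    FIELDS = ['date', 'author', 'text']   # made-up report fields

    def __init__(self, path):
        self.path = path

    def save_new_report(self, report):
        # report is a dict with the keys listed in FIELDS
        with open(self.path, 'a', newline='') as f:
            csv.writer(f).writerow([report[k] for k in self.FIELDS])

    def load_reports(self):
        with open(self.path, newline='') as f:
            return [dict(zip(self.FIELDS, row)) for row in csv.reader(f)]

# The Flask views only ever call persister.save_new_report(...) and
# persister.load_reports(); nothing else in the app knows CSV is involved.
persister = CsvPersister('reports.csv')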
I am absolutely baffled by how many people discourage using CSV as a database storage back-end format.
Concurrency: There is NO reason why CSV cannot be used with concurrency. Just as a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same file, a database could do EXACTLY the same thing with CSV files. And just as a journal is used to maintain the atomicity of individual transactions, the exact same thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when it can do what it does for ALL other storage formats: look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, comment out anything left over, and record the free space for later use in a separate index file, just as a database could zero out the bytes of any unneeded areas of a binary "row" and record the free space in a separate index file? I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spreadsheet program, but that is no different from having to re-index a binary database after the index/table gets corrupted/deleted/out-of-sync/etc.

migrating data from tomcat .dbx files

I want to migrate data from an old Tomcat/Jetty website to a new one which runs on Python & Django. Ideally I would like to populate the new website by directly reading the data from the old database and storing them in the new one.
Problem is that the database I was given comes in the form of a bunch of WEB-INF/data/*.dbx and I didn't find any way to read them. So, I have a few questions.
Which format do the WEB-INF/data/*.dbx use?
Is there a python module for directly reading from the WEB-INF/data/*.dbx files?
Is there some external tool for dumping the WEB-INF/data/*.dbx files to an ASCII format that can be parsed by Python?
If someone has attempted a similar data migration, how does it compare against scraping the data from the old website? (assuming that all important data can be scraped)
Thanks!
The ".dbx" suffix has been used by various softwares over the years so it could be almost anything. The only way to know what you really have here is to browse the source code of the legacy java app (or the relevant doc or ask the author etc).
wrt/ scraping, it's probably going to be a lot of a pain for not much results, depending on the app.

want to add url links to .csv datafeed using python

I've looked through the current related questions but have not managed to find anything similar to my needs.
I'm in the process of creating an affiliate store using zencart. One of the issues is that zencart is not designed for redirects and affiliate stores, but it can be done. I will be changing the store so it acts like a showcase store showing prices.
There is a mod called Easy Populate which allows me to upload datafeeds. This is all well and good; however, my affiliate link will not be in each product. I can add it manually after uploading the data feed, by going to each product and adding it as an image with a redirect link. However, when there are over 500 items it's going to be a long, repetitive, and time-consuming job.
I have been told that I can add the links to the data feed before uploading it to zencart, and that this should be done using Python. I've been reading about Python for several days now and feel I'm looking for the wrong things. I was wondering if someone could please advise me on the simplest way to get this done.
I hope the question makes sense
thanks
abs
You could craft a Python script using the csv module like this:
>>> import csv
>>> cartWriter = csv.writer(open('yourcart.csv', 'w', newline=''))
>>> cartWriter.writerow(['Product', 'yourinfo', 'yourlink'])
You need to know how the link should be formatted, and hope that it can be composed from the other parameters present in the CSV file.
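In practice that usually means reading the existing feed, building the link from one of its existing columns, and writing a new file. A sketch (the column names and link pattern are placeholders for whatever your affiliate network actually uses):

import csv

with open('datafeed.csv', newline='') as src, \
        open('datafeed_with_links.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ['affiliate_link'])
    writer.writeheader()
    for row in reader:
        # Hypothetical link pattern: substitute your network's real URL format
        # and whichever column identifies the product in your feed.
        row['affiliate_link'] = ('https://affiliate.example.com/redirect?sku='
                                 + row['product_id'])
        writer.writerow(row)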
First, use the csv module as systempuntoout told you; secondly, you will want to set your headers to:
mimetype='text/csv'
Content-Disposition = 'attachment; filename=name_of_your_file.csv'
The way to do it depends very much on your website implementation. In pure Python you would probably do that with an HttpResponse object; in Django as well, but there are some shortcuts.
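In Django, for instance, that looks roughly like this (a sketch; the view name is a placeholder, and content_type is the newer spelling of the old mimetype argument):

import csv
from django.http import HttpResponse

def export_cart(request):
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename=name_of_your_file.csv'
    writer = csv.writer(response)   # HttpResponse is file-like enough for csv
    writer.writerow(['Product', 'yourinfo', 'yourlink'])
    return response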
You can find a video demonstrating how to create CSV files with Python on showmedo. It's not free however.
Now, providing a link to download the CSV depends on your website. What is the technology behind it: pure Python, Django, Pylons, TurboGears?
If you can't answer that question, you should ask your boss for training on your infrastructure before trying to make changes to it.
