Processing data from a large data grab - python

I've downloaded a large (>75GB) data grab from archive.org containing most or all of the tweets from June 2020. The archive itself consists of 31 .tar files, each containing nested folders with the lowest level containing several compressed .json files. I need a way to access the data stored in this archive from my Python application. I would like to use MongoDB since its document-based database structure seems well suited to the type of data in this archive. What would be the best way of doing so?
Any help would be appreciated.
Edit - to be clear, I am not set on using MongoDB. I am open to other database solutions as well.

MongoDB may not be a good fit here, because it performs well only when the working set (data plus indexes) fits in RAM. Unless you have a cluster or a very large machine, you are unlikely to have enough RAM to host this much content.
So you may want to filter the data down first if you still want to use MongoDB for this.
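If you do filter the data down and stay with MongoDB, a minimal sketch of streaming one of the .tar files straight into a collection with pymongo might look like this. The file name, the .json.bz2 extension of the inner files, and the tweet fields kept below are assumptions; adjust them to the actual layout of the archive.

```python
import bz2
import json
import tarfile

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod is running
collection = client["twitter_june_2020"]["tweets"]

# Each .tar holds nested folders whose leaves are compressed .json files;
# here they are assumed to be bzip2-compressed, line-delimited JSON.
with tarfile.open("twitter-stream-2020-06-01.tar") as tar:
    batch = []
    for member in tar:
        if not member.name.endswith(".json.bz2"):
            continue
        with bz2.open(tar.extractfile(member), mode="rt") as fh:
            for line in fh:  # one JSON object per line
                if not line.strip():
                    continue
                tweet = json.loads(line)
                # Keep only the fields you need so the database stays manageable.
                batch.append({"id": tweet.get("id"),
                              "text": tweet.get("text"),
                              "lang": tweet.get("lang"),
                              "created_at": tweet.get("created_at")})
                if len(batch) >= 1000:
                    collection.insert_many(batch)
                    batch = []
    if batch:
        collection.insert_many(batch)
```

Inserting in batches with insert_many keeps memory use flat, since no more than one decompressed .json file plus one batch of filtered documents is held at a time.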

Related

Is there a good way to store large amounts of similar scraped HTML files in Python?

I've written a web scraper in Python and I have a ton (thousands) of files that are extremely similar, but not quite identical. The disk space currently used by the files is 1.8 GB, but if I compress them into a tar.xz, they compress to 14.4 MB. I want to be closer to that 14.4 MB than to the 1.8 GB.
Here are some things I've considered:
I could just use tarfile from Python's standard library and store the files there. The problem with that is I wouldn't be able to modify the files within the tar without recompressing all of the files, which would take a while.
I could just use difflib from Python's standard library, but I've found that this library doesn't offer any way of applying "patches" to recreate the new file.
I could use Google's diff-match-patch Python library, but when I was reading the documentation, they said "Attempting to feed HTML, XML or some other structured content through a fuzzy match or patch may result in problems." Considering that I wanted to use this library precisely to store HTML files more efficiently, that doesn't sound like it will help me.
So is there a way of saving disk space when storing a large amount of similar HTML files?
You can use a dictionary.
Python's zlib interface supports dictionaries. The compressobj and decompressobj functions both take an optional zdict argument, which is a "dictionary". A dictionary in this case is nothing more than 32K of data with sequences of bytes that you expect will appear in the data you are compressing.
Since your files are about 30K each, this works out quite well for your application. If indeed your files are "extremely similar", then you can take one of those files and use it as the dictionary to compress all of the other files.
Try it, and measure the improvement in compression over not using a dictionary.
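A minimal sketch of the zdict approach, using one existing HTML file as the shared dictionary (the file names are placeholders):

```python
import zlib

# Only the last 32 KB of the dictionary data is used, so truncate accordingly.
with open("representative_page.html", "rb") as f:
    zdict = f.read()[-32768:]

def compress_with_dict(data: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob: bytes) -> bytes:
    # The same dictionary must be supplied again when decompressing.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

# Measure the improvement over plain zlib compression on another similar file.
with open("another_page.html", "rb") as f:
    raw = f.read()
print("no dict:", len(zlib.compress(raw, 9)), "with dict:", len(compress_with_dict(raw)))
```

Note that the dictionary file itself has to be kept around (uncompressed or compressed without a dictionary), since every other file can only be decompressed with it.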

Extracting only a few columns from a FITS file that is freely available to download online, using Python

I'm working on a model of the universe, for which I'm using data available on the Sloan Digital Sky Survey site. The problem is that some files are larger than 4 GB (more than 50 GB in total), and although those files contain a lot of data columns, I only want data from a few of them. I had heard about web scraping, so I looked into it, but it didn't help: all the tutorials explain how to download the whole file using Python. I want to know whether there is any way to extract only a few columns from such a file, so that I only have the data I need and don't have to download the whole large file for a small fraction of its data.
Sorry that my question is just words and no code; I'm not that experienced with Python. I just searched online and learned how to do basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest some other ways to reduce the amount of data I have to download.
Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract the columns that contain coordinates (ra, dec), distance, velocity, and redshift from the files.
Also, is there a way to do the same thing with CSV files or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and server side.
First of all, FITS tables are stored in a row-oriented binary format, meaning that if you want to stream a portion of a FITS table you can read it one row at a time, but to read individual columns you need to make a partial read of each and every row in the table. Some web servers support what are called "range requests", meaning you can request only a few ranges of bytes from a file instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented this could be feasible: you could download just the header of the file to determine the byte ranges of the columns, and then download just the ranges for those columns.
Unfortunately, since FITS tables are row-oriented, loading say 3 columns from a table with a million rows would involve 3 million range requests, which would likely carry enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow something so extreme).
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided and all the files there are gzip-compressed, which makes them especially difficult to deal with, since you can't know what byte ranges to request without decompressing them first.
As an aside, since you asked: you will have the same problem with CSV, only worse, since CSV fields are not typically fixed-width, so there is no way to know which bytes to fetch for individual columns without downloading the whole file.
For FITS, maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. Whether such a thing already exists I don't know, but I don't think it does in any very general sense. So it would (a) have to be developed, and (b) you would have to ask whoever hosts the files you want to access to run such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.
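Once a file has been downloaded, pulling out just the columns you care about is straightforward with astropy. A rough sketch, where the file name and column names are assumptions to be checked against the real table header:

```python
from astropy.io import fits
from astropy.table import Table

# gzip-compressed FITS files are decompressed transparently by astropy, so this
# still reads the whole table once; afterwards only the selected columns are kept.
with fits.open("galaxy_DR12v5_CMASSLOWZTOT_North.fits.gz") as hdul:
    data = hdul[1].data  # binary tables normally live in the first extension
    # Check hdul[1].columns for the actual column names in the file.
    subset = Table({name: data[name] for name in ("RA", "DEC", "Z")})

# Write out just the small subset; the original large download can then be deleted.
subset.write("coords_and_redshift.fits", overwrite=True)
```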

What's the best strategy for dumping very large python dictionaries to a database?

I'm writing something that essentially refines and reports various strings out of an enormous Python dictionary (the source file for the dictionary is an XML file over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have you considered using pandas? Pandas does not natively accept XML, but if you use ElementTree from the xml standard-library package you should be able to read it into a pandas DataFrame and do what you need with it, including refining strings and adding more data to the DataFrame as you get it.
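A rough sketch of that route, assuming the XML is essentially a long series of repeated record elements (the file and tag names are placeholders):

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Stream the million-line XML with iterparse so the whole tree never has to
# sit in memory at once.
rows = []
for _, elem in ET.iterparse("huge_export.xml", events=("end",)):
    if elem.tag == "record":
        rows.append({child.tag: child.text for child in elem})
        elem.clear()  # free the element once its data has been copied out

df = pd.DataFrame(rows)
```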
So I've decided that this is more of a data design problem than a Python problem. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined XML dictionary as a pickle on a shared filesystem for cold storage, and to use Mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22MB to 100K.
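A minimal sketch of that split, assuming pymongo and using placeholder names, paths, and stand-in data:

```python
import pickle
from pymongo import MongoClient

# Stand-ins for the real data: the full ~22 MB refined dictionary and the
# ~100K worth of values that are actually worth querying later.
refined_dict = {"record_001": {"status": "PASS", "duration_ms": 1234}}
refined_queries = {"pass_count": 1, "fail_count": 0}

# Cold storage: the full dictionary is pickled to the shared filesystem.
with open("/mnt/shared/refined_dump.pkl", "wb") as f:
    pickle.dump(refined_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

# Hot storage: only the small refined result goes into MongoDB,
# well under the 16 MB BSON document limit.
client = MongoClient()
client["reports"]["refined"].insert_one(
    {"source": "refined_dump.pkl", **refined_queries}
)
```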
Thanks for chatting with me about this :)

How can I reduce the access time on large Excel files?

I would like to process a large data set from a mechanical testing device with Python. The software of this device only allows exporting the data as an Excel file. Therefore, I use the xlrd package, which works fine for small *.xlsx files.
The problem I have is that when I open a typical data set (3-5 MB) with
xlrd.open_workbook(path_wb)
the access time is about 30 to 60 seconds. Is there a more effective and faster way to access Excel files?
You could access the file as a database via PyPyODBC instead, which may (or may not) be faster - you'd have to try it out and compare the results.
This method should work for both .xls and .xlsx files. Unfortunately, it comes with a couple of caveats:
As far as I am aware, this will only work on Windows machines, since you're relying on the Microsoft Jet database driver.
The Microsoft Jet database driver can be rather buggy, especially with dates.
It's not possible to create or modify Excel files (a note in the PyPyODBC exceltests.py file says: I have not been able to successfully create or modify Excel files.). Your question seems to indicate that you're only interested in reading files, though, so hopefully this will not be a problem.
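A minimal sketch of reading a worksheet that way (the driver name, workbook path, and sheet name are assumptions; the ACE/Jet Excel driver must be installed, which effectively limits this to Windows):

```python
import pypyodbc

# Connect to the workbook as if it were a database.
conn = pypyodbc.connect(
    "Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
    r"DBQ=C:\data\measurement.xlsx;"
)
cursor = conn.cursor()
# Worksheets are addressed as [SheetName$]; adjust to the workbook's sheet name.
cursor.execute("SELECT * FROM [Sheet1$]")
rows = cursor.fetchall()
cursor.close()
conn.close()
```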
I just figured out that the access time wasn't actually the problem: I was also creating an object in the same step. Now that I create the object separately, everything works fast and nicely.

Choice of technology for loading large CSV files to Oracle tables

I have come across a problem and am not sure which technology would be best suited to implement it. I would be obliged if you could suggest some options based on your experience.
I want to load data from 10-15 CSV files, each of them fairly large (5-10 GB). By loading data I mean converting each CSV file to XML and then populating around 6-7 staging tables in Oracle using this XML.
The data needs to be populated such that the elements of the XML, and eventually the rows of the tables, come from multiple CSV files. So, for example, an element A would have sub-elements whose data comes from CSV file 1, file 2, file 3, and so on.
I have a framework built on top of Apache Camel and JBoss on Linux. Oracle 10g is the database server.
Options I am considering:
1. Smooks. However, the problem is that Smooks serializes one CSV at a time, and I can't afford to hold on to the half-baked Java beans until the other CSV files are read, since I run the risk of running out of memory given the sheer number of beans I would need to create and hold on to before they are fully populated and written to disk as XML.
2. SQL*Loader. I could skip the XML creation altogether and load the CSVs directly into the staging tables using SQL*Loader. But I am not sure whether I can (a) load multiple CSV files into the same tables with SQL*Loader, updating the records after the first file, and (b) apply some translation rules while loading the staging tables.
3. A Python script to convert the CSV files to XML (a rough sketch follows this list).
4. SQL*Loader to load a different set of staging tables corresponding to the CSV data, and then a stored procedure to load the actual staging tables from this new set of staging tables (a path I want to avoid, given the amount of change it would require in my existing framework).
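A rough sketch of option 3, streaming one CSV into XML so the full 5-10 GB never sits in memory. The file and tag names are placeholders, it assumes the CSV headers are valid XML tag names, and combining data from several CSVs into one element would still have to happen elsewhere (for example in the staging tables, as in option 4):

```python
import csv
import xml.etree.ElementTree as ET

with open("file1.csv", newline="") as src, open("file1.xml", "wb") as dst:
    dst.write(b"<records>")
    for row in csv.DictReader(src):
        # One <record> element per CSV row, one sub-element per column.
        record = ET.Element("record")
        for column, value in row.items():
            ET.SubElement(record, column).text = value
        dst.write(ET.tostring(record))
    dst.write(b"</records>")
```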
Thanks in advance. If someone can point me in the right direction or share some insights from their personal experience, it will help me make an informed decision.
regards,
-v-
PS: The CSV files are fairly simple with around 40 columns each. The depth of objects or relationship between the files would be around 2 to 3.
Unless you can use some full-blown ETL tool (e.g. Informatica PowerCenter, Pentaho Data Integration), I suggest the 4th solution - it is straightforward and the performance should be good, since Oracle will handle the most complicated part of the task.
In Informatica PowerCenter you can import/export XML files larger than 5 GB. As Marek's response says, try it, because it works pretty fast. Here is a brief introduction if you are unfamiliar with this tool.
Create a process or script that calls a procedure to load the CSV files into an external Oracle table, and another script to load the data from there into the destination tables.
You can also add cron jobs to call these scripts; they can keep track of incoming CSV files in the directory, process them, and move each CSV file to an output/processed folder.
Exceptions can also be handled accordingly, by logging them or sending out an email. Good luck.
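Run from cron, the watcher described above might look roughly like this in Python (directory names, credentials, and the stored procedure name are all placeholders; it assumes an external table whose LOCATION clause points at a fixed file name):

```python
import glob
import shutil

import cx_Oracle  # assumes the Oracle client libraries are installed

INCOMING = "/data/incoming"    # placeholder directories
PROCESSED = "/data/processed"

conn = cx_Oracle.connect("loader/secret@dbhost/orcl")  # placeholder credentials
cur = conn.cursor()

# Each incoming CSV is copied over the fixed file name the external table reads,
# then a (hypothetical) stored procedure loads it into the destination tables.
for path in sorted(glob.glob(f"{INCOMING}/*.csv")):
    shutil.copy(path, f"{INCOMING}/current_load.csv")
    cur.callproc("load_staging_from_external")
    conn.commit()
    shutil.move(path, f"{PROCESSED}/")

cur.close()
conn.close()
```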
