This is the (pipe-delimited) CSV file to be parsed by our program; it is one of many such files:
ID|TITLE|COMPANY|DATE|REV|VIEW_TIME
id1|title1|company 1|2014-04-01|4.00|1:30
id1|title3|company 2|2014-04-03|6.00|2:05
id2|title4|company 1|2014-04-02|8.00|2:45
id3|title2|company 1|2014-04-02|4.00|1:05
The catch, as given in the assignment, is as follows:
Your first task is to parse and import the file into a simple
datastore. You may use any file format that you want to implement to
store the data. For this assignment, you must write your own datastore
instead of reusing an existing one such as a SQL database. Also, you
must assume that after importing many files the entire datastore will
be too large to fit in memory. Records in the datastore should be
unique by ID, TITLE and DATE. Subsequent imports with the same
logical record should overwrite the earlier records.
Since the data structure cannot hold all the data in memory, I have to look into a more permanent storage solution. Writing to a file seems more suitable than any other solution, but herein lies the catch: if I have to overwrite records on the basis of ID, TITLE, and DATE, it seems I would have to load the entire contents into memory before overwriting them, which the precondition rules out.
WHAT I AM LOOKING FOR
What approach should I take? I am not looking for a code sample, but I am hoping that someone has an idea of which data structure or file structure to use. Any suggestion, such as using a stack, a list, or a particular file layout, is appreciated.
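For illustration only, since the asker did not want a full solution: one common shape for such a datastore is to hash the logical key (ID, TITLE, DATE) into one of many small bucket files, so that an overwrite rewrites a single small bucket rather than the whole store, and nothing close to the full dataset is ever in memory at once. Everything below (names, bucket count, file layout) is an assumption made up for the sketch.

import hashlib
import os

DATA_DIR = "store"       # hypothetical directory for the datastore
NUM_BUCKETS = 1024       # arbitrary; more buckets means smaller rewrites

def bucket_path(record_id, title, date):
    # Hash the logical key into one of NUM_BUCKETS small pipe-separated files.
    key = f"{record_id}|{title}|{date}".encode()
    n = int(hashlib.sha1(key).hexdigest(), 16) % NUM_BUCKETS
    return os.path.join(DATA_DIR, f"bucket_{n:04d}.psv")

def same_key(line, row):
    f = line.rstrip("\n").split("|")
    return (f[0], f[1], f[3]) == (row[0], row[1], row[3])

def upsert(row):
    # row is a list of the six string fields: ID, TITLE, COMPANY, DATE, REV, VIEW_TIME
    os.makedirs(DATA_DIR, exist_ok=True)
    path = bucket_path(row[0], row[1], row[3])
    kept = []
    if os.path.exists(path):
        with open(path) as f:
            kept = [line for line in f if not same_key(line, row)]  # drop the old record
    kept.append("|".join(row) + "\n")
    with open(path, "w") as f:
        f.writelines(kept)  # rewrite only this one small bucket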
Related
I have a .parquet file, and would like to use Python to quickly and efficiently query that file by a column.
For example, the .parquet file might have a name column, and I want to get back the first (or all) of the rows with a chosen name.
How can I query a parquet file like this in the Polars API, or possibly FastParquet (whichever is faster)?
I thought pl.scan_parquet might be helpful, but it didn't seem to be, or I just didn't understand it. Preferably, though it is not essential, we would not have to read the entire file into memory first, to reduce memory and CPU usage.
I thank you for your help.
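For reference, a minimal sketch of the lazy Polars route; the file path, the column name "name", and the value "Alice" are placeholders:

import polars as pl

# Lazily scan the parquet file: nothing is read into memory yet.
lf = pl.scan_parquet("data.parquet")

result = (
    lf.filter(pl.col("name") == "Alice")  # push the filter down to the scan
      .head(1)                            # first match; drop this for all matches
      .collect()                          # execute; only matching data is materialized
)
print(result)

Because the query is lazy, Polars can push the filter down into the parquet scan and skip row groups whose statistics rule out a match, which is why this tends to use far less memory than reading the whole file first.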
I am working on a personal project (using Python 3) that will retrieve weather information for any city in the United States. My program prompts the user to enter as many city-state combinations as they wish, and then it retrieves the weather information and creates a weather summary for each city entered. Behind the scenes, I'm essentially taking the State entered by the user, opening a .txt file corresponding to that State, and then getting a weather code that is associated with the city entered, which I then use in a URL request to find weather information for the city. Since I have a .txt file for every state, I have 50 .txt files, each with a large number of city-weather code combinations.
Would it be faster to keep my algorithm the way that it currently is, or would it be faster to keep all of this data in a dictionary? This is how I was thinking about storing the data in a dictionary:
info = {'Virginia': {'City1': 'ID1', 'City2': 'ID2'}, 'North Carolina': {'City3': 'ID3'}}
I'd be happy to provide some of my code or elaborate if necessary.
Thanks!
If you have a large data file, you will spend days sifting through it and putting the values into the .py file. For a small file I would use a dictionary, but for a large file, a .txt file.
Other possible solutions are:
sqlite
pickle
shelve
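Of these, shelve is the smallest step away from the dictionary idea, since it behaves like a persistent dict backed by a file. A minimal sketch; the filename and keys are illustrative:

import shelve

# Build the store once; it persists on disk between runs.
with shelve.open("weather_codes") as db:
    db["Virginia"] = {"City1": "ID1", "City2": "ID2"}
    db["North Carolina"] = {"City3": "ID3"}

# Later lookups load only the entries you touch, not the whole dataset.
with shelve.open("weather_codes") as db:
    print(db["Virginia"]["City1"])  # -> 'ID1'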
Other Resources
Basic data storage with Python
https://docs.python.org/3/library/persistence.html
https://docs.python.org/3/library/pickle.html
https://docs.python.org/3/library/shelve.html
It almost certainly would be much faster to preload the data from the files if you're using the same Python process for many user requests. If the process handles just one request and exits, this approach would be slower and use more memory. For some number of requests between "one" and "many", the two would be about equal in speed.
For a situation like this I would probably use sqlite, for which Python has built-in support. It would be much faster than scanning text files, without the time and memory overhead of loading the full dictionary.
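A minimal sketch of that sqlite route, using the standard library's sqlite3 module; the table and column names are illustrative:

import sqlite3

conn = sqlite3.connect("weather.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS codes ("
    "state TEXT, city TEXT, code TEXT, PRIMARY KEY (state, city))"
)
conn.execute(
    "INSERT OR REPLACE INTO codes VALUES (?, ?, ?)",
    ("Virginia", "City1", "ID1"),
)
conn.commit()

# Lookup via the primary key: no full-file scan, nothing preloaded.
row = conn.execute(
    "SELECT code FROM codes WHERE state = ? AND city = ?",
    ("Virginia", "City1"),
).fetchone()
print(row[0] if row else "not found")
conn.close()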
It is probably not a very good idea to have a large number of text files, because access slows down when directories are large or numerous. If you have large data records, you might prefer an intermediate solution: index a single data file and load only the index into a dictionary.
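A minimal sketch of that intermediate idea, assuming the 50 state files have been merged into one "state|city|code"-per-line file (the filename and layout are assumptions): only byte offsets live in memory, and each lookup seeks straight to its record.

index = {}
with open("all_states.txt", "rb") as f:
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        state, city, _ = line.decode().rstrip("\n").split("|")
        index[(state, city)] = offset  # keep only offsets in memory

def lookup(state, city):
    with open("all_states.txt", "rb") as f:
        f.seek(index[(state, city)])  # jump straight to the record
        return f.readline().decode().rstrip("\n").split("|")[2]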
I have a large amount of data, around 50 GB worth, in a CSV that I want to analyse for ML purposes. It is, however, way too large to fit in memory in Python. I would ideally like to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? These could cover anything from:
How to store it in the first place. I realise I probably can't load it in all at once; would I do it iteratively? If so, what things can I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data while still being able to query and do feature engineering quickly? What I eventually feed into my algorithm should be doable in Python, but I need to query and do some feature engineering before my data set is ready to be analysed.
I'd really appreciate any advice; this all needs to be done on a personal computer! Thanks!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the CSV file's first line to see if there is a header. You'd need to create a table with the same fields (and matching data types).
One of the fields might be unique per line and can be used later to find that line; that's your candidate for PRIMARY KEY. Otherwise, add an AUTO_INCREMENT field as the PRIMARY KEY.
INDEXes are used to search for data later. Whatever fields you feel you will be searching or filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together.
In order to read in the data, you have two ways:
Use LOAD DATA INFILE (see the LOAD DATA INFILE documentation).
Write your own script: the best technique is to create a prepared statement for the INSERT command, then read your CSV line by line (in a loop), split the fields into variables, and execute the prepared statement with each line's values (see the sketch below).
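A minimal sketch of the second option, assuming the mysql-connector-python package and an illustrative table named records with three columns; the batch size is arbitrary:

import csv
import mysql.connector

conn = mysql.connector.connect(user="me", password="secret", database="mldata")
cur = conn.cursor(prepared=True)
insert = "INSERT INTO records (col1, col2, col3) VALUES (%s, %s, %s)"

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row, if there is one
    for i, row in enumerate(reader):
        cur.execute(insert, tuple(row))
        if i % 10_000 == 0:
            conn.commit()  # commit in batches so no transaction grows huge

conn.commit()
conn.close()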
Depending on who needs to use the data, you may also benefit from a web page designed to search it.
Hope this gives you some ideas.
That depends on what you have. You can use Apache Spark and its SQL feature: Spark SQL gives you the ability to write SQL queries against your dataset. For best performance, though, you need distributed mode (you can use it on a local machine, but the results are limited) and high machine performance. You can use Python, Scala, or Java to write your code.
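A minimal local-mode sketch of that Spark SQL route (the path, view name, and columns are illustrative); Spark processes the file in partitions, so it does not need to fit in memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-sql").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("records")

# Plain SQL over the CSV; Spark plans and executes it partition by partition.
result = spark.sql("SELECT col1, AVG(col2) FROM records GROUP BY col1")
result.show()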
I had a hard time last week getting data out of Spark; in the end I had to simply go with
df.toPandas().to_csv('mycsv.csv')
out of this answer.
I had tested the more native
df.write.csv('mycsv.csv')
for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which need to be concatenated, whatever that means in this context. It also dropped an empty file called something like 'success' into the directory. The directory name was /mycsv/, but the CSV itself had an unintelligible name made up of a long string of characters.
This was the first I had heard of such a thing. Well, Excel has multiple tabs, which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a CSV file was just a header plus rows of values separated into columns by commas.
Another answer suggested:
query.repartition(1).write.csv("cc_out.csv", sep='|')
So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.
Does anyone know why Spark is doing this, why it will not simply output a CSV, how it names the CSV, what that success file is supposed to contain, and whether concatenating CSV files here means joining them vertically, head to tail?
There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not be able to fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would presumably have to be on one machine and written by one machine, which one machine may not be able to do.
- Spark is designed for speed. If data lives on 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all data to a single executor and have one executor write the entire dataset.
If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
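If you would rather do that concatenation in Python than with cat, here is a minimal sketch that also strips the repeated headers; the part-file glob pattern is illustrative:

import glob

parts = sorted(glob.glob("mycsv/part-*.csv"))
with open("output.csv", "w") as out:
    for i, part in enumerate(parts):
        with open(part) as f:
            if i > 0:
                next(f, None)  # keep only the first file's header
            out.writelines(f)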
Does anyone know why Spark is doing this, why will it not simply output a csv,
Because it is designed for distributed computing where each chunk of data (a.k.a. partition) is written independently of others.
how does it name the csv
Name depends on the partition number.
what is that success file supposed to contain
Nothing. It just indicates success.
This basically happens because Spark dumps files based on the number of partitions the data is divided between, so each partition simply dumps its own file separately. You can use the coalesce option to save them to a single file. Check this link for more info.
However, this method has the disadvantage that it needs to collect all the data on the master node, so the master node must have enough memory. A workaround for this can be seen in this answer.
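A minimal sketch of the coalesce option mentioned above, assuming df is the DataFrame to save and the output path is illustrative:

# coalesce(1) funnels everything through a single task, so one worker
# must handle the whole dataset: fine for small outputs only.
df.coalesce(1).write.mode("overwrite").csv("/tmp/mycsv", header=True)

# The result is still a directory ("/tmp/mycsv/") containing one
# part-00000-*.csv file plus an empty _SUCCESS marker.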
This link also sheds some more information about this behavior of Spark:
Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.
I am trying to write a Python program which could take content and categorize it based on tags. I am using Nepomuk to tag files and PyQt for the GUI. The problem is, I am unable to decide how to save the content. Right now, I am saving each entry individually to a text file in a folder. When I need to read the contents, I tell the program to get all the files in that folder and then perform a read operation on each file. Since the number of files is small now (fewer than 20), this approach is decent enough. But I am worried that when the file count increases, this method will become inefficient. Is there any other method to save content efficiently?
Thanks in advance.
You could use the sqlite3 module from the stdlib. Data will be stored in a single file, and the code might be even simpler than the code for reading all the ad-hoc text files by hand.
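A minimal sketch of that sqlite3 route; the table and column names are illustrative:

import sqlite3

conn = sqlite3.connect("content.db")  # everything lives in this one file
conn.execute(
    "CREATE TABLE IF NOT EXISTS entries ("
    "id INTEGER PRIMARY KEY, tag TEXT, body TEXT)"
)
conn.execute("INSERT INTO entries (tag, body) VALUES (?, ?)",
             ("recipes", "some tagged content"))
conn.commit()

# Fetch everything with a given tag instead of re-reading every file in a folder.
for (body,) in conn.execute("SELECT body FROM entries WHERE tag = ?", ("recipes",)):
    print(body)
conn.close()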
You could always export the data in a format suitable for sharing in your case.