Suppose we want to sort a file with 40,000 rows on a column X. Assume also that the same values are spread throughout the table, so rows sharing a value in column X are not confined to the first 1000 rows. If we read the file in chunks and sort only 1000 rows at a time, rows with equal values in column X end up scattered across chunks, so sorting each chunk alone does not give a globally sorted table. How can we solve this? No code is needed since no data is available; I'm mainly looking for your opinion on the matter. Should we go with merge sort, giving each chunk to a merge sort algorithm in parallel and then recombining the results? I don't see a way to do that with pandas, but I'm not sure.
import pandas as pd

chunk_size = 1000
batch_no = 1
for chunk in pd.read_csv('data.csv', chunksize=chunk_size):
    chunk.sort_values(by='X', inplace=True)
    chunk.to_csv('data' + str(batch_no) + '.csv', index=False)
    batch_no += 1
You then need to merge the sorted csv files; luckily, Python's standard library provides a function for exactly this, heapq.merge. Use it as below:
import csv
import heapq

import numpy as np
import pandas as pd

# generate test data
test_data = pd.DataFrame(data=[[f"label{i}", val] for i, val in enumerate(np.random.uniform(size=40000))],
                         columns=["label", "X"])
test_data.to_csv("data.csv", index=False)

# read and sort each chunk
chunk_size = 1000
file_names = []
for batch_no, chunk in enumerate(pd.read_csv("data.csv", chunksize=chunk_size), 1):
    chunk.sort_values(by="X", inplace=True)
    file_name = f"data_{batch_no}.csv"
    chunk.to_csv(file_name, index=False)
    file_names.append(file_name)

# merge the sorted chunks
chunks = [csv.DictReader(open(file_name)) for file_name in file_names]
with open("data_sorted.csv", "w", newline="") as outfile:
    field_names = ["label", "X"]
    writer = csv.DictWriter(outfile, fieldnames=field_names)
    writer.writeheader()
    # cast to float so the merge key matches the numeric order the chunks were sorted in
    for row in heapq.merge(*chunks, key=lambda row: float(row["X"])):
        writer.writerow(row)
From the documentation on heapq.merge:
Merge multiple sorted inputs into a single sorted output (for example,
merge timestamped entries from multiple log files). Returns an
iterator over the sorted values.
Similar to sorted(itertools.chain(*iterables)) but returns an
iterable, does not pull the data into memory all at once, and assumes
that each of the input streams is already sorted (smallest to
largest).
So, as you can read in the quote above, heapq.merge won't load all the data into memory at once. It is also worth noting that the merge step runs in O(n log k), where n is the total number of rows and k is the number of chunks, while sorting the chunks costs O(n log n) in total, so the overall external sort is O(n log n).
Related
It's my first time writing code to process files with a lot of data, so I'm kind of stuck here.
What I'm trying to do is read a list of paths listing all of the csv files that need to be read, retrieve the HEAD and TAIL of each file, and put them inside a list.
I have 621 csv files in total, each consisting of 5800 rows and 251 columns.
This is a sample of the data:
[LOGGING],RD81DL96_1,3,4,5,2,,,,
LOG01,,,,,,,,,
DATETIME,INDEX,SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0]
TIME,INDEX,FF-1(1A) ,FF-1(1B) ,FF-1(1C) ,FF-1(2A),FF-2(1A) ,FF-2(1B) ,FF-2(1C),FF-2(2A)
47:29.6,1,172,0,139,1258,0,0,400,0
47:34.6,2,172,0,139,1258,0,0,400,0
47:39.6,3,172,0,139,1258,0,0,400,0
47:44.6,4,172,0,139,1263,0,0,400,0
47:49.6,5,172,0,139,1263,0,0,450,0
47:54.6,6,172,0,139,1263,0,0,450,0
The problem is that, while it takes about 13 seconds to read all the files (still kind of slow, honestly),
when I add a single append line, the process takes far longer to finish, about 4 minutes.
Below is a snippet of the code:
import dask.dataframe as dd

startEndList = []
# CsvList: [File Path, Change Date, File Size, File Name]
for x, file in enumerate(CsvList):
    timeColumn = ['TIME']
    df = dd.read_csv(file[0], sep=',', skiprows=3, encoding='CP932', engine='python', usecols=timeColumn)
    # The process became slow when this line was added
    startEndList.append(list(df.head(1)) + list(df.tail(1)))
Why did that happen? I'm using dask.dataframe.
Currently, your code isn't really leveraging Dask's parallelizing capabilities because:
df.head and df.tail calls will trigger a "compute" (i.e., convert your Dask DataFrame into a pandas DataFrame -- which is what we try to minimize in lazy evaluations with Dask), and
the for-loop is running sequentially because you're creating Dask DataFrames and converting them to pandas DataFrames, all inside the loop.
So, your current example is similar to just using pandas within the for-loop, but with the added Dask-to-pandas-conversion overhead.
Since you need to work on each of your files, I'd suggest checking out Dask Delayed, which might be more elegant and useful here. The following (pseudo-code) will parallelize the pandas operation on each of your files:
import dask
import pandas as pd

result = []
for file in list_of_files:
    df = dask.delayed(pd.read_csv)(file)
    result.append(df.head(1) + df.tail(1))

dask.compute(*result)
The output of dask.visualize(*result) when I used 4 CSV files confirms the parallelism.
If you really want to use Dask DataFrame here, you may try to:
read all files into a single Dask DataFrame,
make sure each Dask "partition" corresponds to one file,
use Dask DataFrame apply to get the head and tail values and append them to a new list,
call compute on the new list (a rough sketch of this approach follows below).
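For example, here is a rough, untested sketch of that route. It assumes list_of_files holds the 621 CSV paths and that each file is small enough to land in a single partition, and it uses map_partitions rather than apply to work on one file (one partition) at a time:
import dask.dataframe as dd

# reading a list of paths creates (at least) one partition per file
ddf = dd.read_csv(list_of_files, skiprows=3, usecols=['TIME'], encoding='CP932')

# lazily keep only the first and last row of every partition (i.e., every file)
first_last = ddf.map_partitions(lambda part: part.iloc[[0, -1]] if len(part) else part)

# one parallel compute for all files
head_tail_df = first_last.compute()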
A first approach using only plain Python as a starting point:
import io
import pathlib

import pandas as pd

def read_first_and_last_lines(filename):
    with open(filename, 'rb') as fp:
        # skip the first 4 rows (headers)
        [next(fp) for _ in range(4)]
        # first data line
        first_line = fp.readline()
        # start at 2x the length of the first line before the end of the file
        fp.seek(-2 * len(first_line), 2)
        # last line
        last_line = fp.readlines()[-1]
        return first_line + last_line

data = []
for filename in pathlib.Path('data').glob('*.csv'):
    data.append(read_first_and_last_lines(filename))

buf = io.BytesIO()
buf.writelines(data)
buf.seek(0)

df = pd.read_csv(buf, header=None, encoding='CP932')
I have a 2.5 GB JSON file with 25 columns and about 4 million rows. I try to filter the JSON with the following script, but it takes at least 10 minutes.
import json
product_list = ['Horse','Rabit','Cow']
year_list = ['2008','2009','2010']
country_list = ['USA','GERMANY','ITALY']
with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = r.read()
    result = json.loads(result)

for item in result[:]:
    if (not str(item["Year"]) in year_list) or (not item["Name"] in product_list) or (not item["Country"] in country_list):
        result.remove(item)

print(result)
I need to prepare the result in a maximum of 1 minute, so what is your suggestion, or what is the fastest way to filter the JSON?
Removing from a list inside a loop is slow: each remove is O(n), and doing that n times gives O(n^2). Appending to a new list is O(1), so building a new list in a loop is O(n) overall. So you can try this:
result = [item for item in result if str(item["Year"]) in year_list and item["Name"] in product_list and item["Country"] in country_list]
Filter based on the condition you need and add only those that match.
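Putting it together as a sketch, reusing the file path and the lists from the question and keeping the same "match all three" logic as the original removal loop:
import json

product_list = ['Horse', 'Rabit', 'Cow']
year_list = ['2008', '2009', '2010']
country_list = ['USA', 'GERMANY', 'ITALY']

# load once, then build a new filtered list instead of removing in place
with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = json.load(r)

filtered = [item for item in result
            if str(item["Year"]) in year_list
            and item["Name"] in product_list
            and item["Country"] in country_list]

print(len(filtered))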
You need to read the JSON file into a pandas DataFrame and then filter on the required columns.
Why? Because pandas is column-based and therefore super fast at working with columns; it's built upon Series, which is a one-dimensional labeled array (basically a column).
So you need something like this (assuming the column names in the JSON file are consistent):
import pandas as pd
product_list = ['Horse','Rabit','Cow']
year_list = ['2008','2009','2010']
country_list = ['USA','GERMANY','ITALY']
df = pd.read_json('./products/animal_production.json')
# Keep only the rows matching all three lists (the same logic as the
# original loop); change the condition if it's not the desired one
condition = df["Year"].astype(str).isin(year_list) & df["Name"].isin(product_list) & df["Country"].isin(country_list)
df = df[condition]
I can't reproduce it to estimate the time needed, but I expect it to be faster by orders of magnitude!
I do not know if it will be much faster, but you could use json.load rather than read-ing and then json.loads-ing, i.e. rather than
with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = r.read()
    result = json.loads(result)
you might do
with open('./products/animal_production.json', 'r', encoding='utf8') as r:
    result = json.load(r)
I have a csv file containing 3 columns of data: column 1 = time vector, column 2 is untuned circuit response and column 3 is the tuned circuit response.
I am reading in this csv data in python using pandas:
df = pd.read_csv(filename, delimiter = ",")
I am now trying to create 3 lists, one list for each column of data. I have tried the following, but it is not working, as the lists end up empty:
for col in df:
    time.append(col[0])
    untuned.append(col[1])
    tuned.append(col[2])
Can anyone give me some help with this? Thanks.
You can use the pandas Series tolist() method:
time = df['time vector'].tolist()
untuned = df['untuned circuit'].tolist()
tuned = df['tuned circuit'].tolist()
To be honest, if your use case is just to get the data into lists, use csv.reader. It avoids a lot of overhead.
import csv

time = list()
untuned = list()
tuned = list()

with open("filename.csv") as csv_data_file:
    csv_reader = csv.reader(csv_data_file, delimiter=",")
    for each_row in csv_reader:
        time.append(each_row[0])
        untuned.append(each_row[1])
        tuned.append(each_row[2])
If you have other use cases that require pandas, or if your file is large and you need the power of pandas, use .tolist() as suggested by @Bruno Mello.
You can also use an iterator.
for index, row in df.iterrows():
    time.append(row[0])
    untuned.append(row[1])
    tuned.append(row[2])
The problem:
I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that contain just the genes expressed, as it was very inefficient to search through this array every time I ran my script. I have code that finds the genes for each tissue as required, but I do not know how to store the output.
I was using a pandas DataFrame and then converting to CSV. But this does not accept lists of varying length unless I put each list in as a single item. However, when I then save the DataFrame to a CSV, it tries to squeeze this very long list (all genes expressed for one tissue) into a single cell, and I get an error because the string length exceeds the Excel character-per-cell limit.
Therefore I need a way of either dealing with this limit or storing my lists differently. I would rather have just one file for all the lists.
My code:
import csv
import pandas as pd
import math
import numpy as np

# Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows=[0, 1, 2, 3], sep='\t')
tissuedict = df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key, gene in tissuedict['Gene Name'].items()]

data = []
for tissue in tissuelist:
    # Create an array to keep track of the protein mRNAs in the tissue that are not present in the network
    # initiate with first tissue, protein
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    datatis = [tissue, tissueExpression.tolist()]
    print(datatis)
    data.append(datatis)

print(data)
df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd

df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows=[0, 1, 2, 3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')

res = pd.DataFrame()
for c in df.columns:
    res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)

res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import so many columns)
Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).
Create your data variable as a dictionary.
You can then save the dictionary to a JSON file using json.dump:
import json

data = {}
for tissue in tissuelist:
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    data[tissue] = tissueExpression.tolist()

with open('filename.json', 'w') as fp:
    json.dump(data, fp)
I'm reading in a large (~10 GB) hdf5 table with pandas.read_hdf. I'm using iterator=True so that I can access chunks at a time (e.g., chunksize=100000 rows at a time).
How do I get a list of all the column names or 'keys'?
Also, how come there is no get_chunk method analogous to the one for pandas.read_table? Is directly iterating over the chunks the only way ("for chunk in data: "), and you can't access different numbered chunks at will ("data[300]")?
Edit:
Looks like I can access the column names with a loop that breaks after accessing the first chunk:
for i, v in enumerate(data):
    if i != 0:
        break
    colnames = v.columns
But then my second question still remains: is there no way to access each individual chunk on the pandas TextFileReader iterator (e.g., mimicking the get_chunk method of read_table, or with a dict-like lookup, data[0]), instead of doing the above weird single-iteration for loop?
Have you tried loading the HDF5 file as an HDFStore? That would allow you to use the HDFStore.select method which may do what you want (with seeking, etc.). You can use select to only operate on a subset of columns too. To me it just looks like it provides more flexibility than the read_hdf function. The following might help as long as you know the structure of your HDF5 file:
store = pd.HDFStore('/path/to/file', 'r')

colnames = store.select('table_key', stop=1).columns

# iterate over table chunks
chunksize = 100000
chunks = store.select('table_key', chunksize=chunksize)
for chunk in chunks:
    ...code...

# select 1 specific chunk as an iterator
chunksize = 100000
start, stop = 300 * chunksize, 301 * chunksize
this_chunk = store.select('table_key', start=start, stop=stop, iterator=True)
do_work(this_chunk)

store.close()
Note that you can also open an HDFStore as a context manager, e.g.,
with pd.HDFStore('/path/to/file', 'r') as store:
    ...code...
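For instance, here is a sketch reusing the placeholder path and 'table_key' from the example above; it assumes the data was written in table format, which select requires:
import pandas as pd

# the store is closed automatically when the with-block exits
with pd.HDFStore('/path/to/file', 'r') as store:
    # grab the column names without reading the whole table
    colnames = store.select('table_key', stop=1).columns

    # stream the table in chunks of 100000 rows
    for chunk in store.select('table_key', chunksize=100000):
        do_work(chunk)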