I want to read a folder full of parquet files that contain pandas DataFrames. In addition to the data that I'm reading I want to store the filenames from which the data is read in the column "file_origin". In pandas I am able to do it like this:
import pandas as pd
from pathlib import Path
data_dir = Path("path_of_folder_with_files")
df = pd.concat(
    pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
    for parquet_file in data_dir.glob("*")
)
Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)?
import pyarrow.parquet as pq
table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()
You could implement it using arrow instead of pandas:
import pyarrow as pa

batches = []
for file_name in data_dir.glob("*"):
    table = pq.read_table(file_name)
    # add the file name as a string column, repeated for every row
    table = table.append_column(
        "file_name", pa.array([file_name.name] * len(table), pa.string())
    )
    batches.extend(table.to_batches())

combined = pa.Table.from_batches(batches)
I don't expect it to be significantly faster, unless you have a lot of strings and objects in your table (which are slow in pandas).
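If you want to avoid the explicit glob loop, here is a rough sketch using pyarrow.dataset instead; it is my own variation, not part of the answer above. Each fragment of a dataset exposes the path it was read from, so the column can be filled from fragment.path (assuming all files share a schema; the folder path and the file_origin column name come from the question).
import pyarrow as pa
import pyarrow.dataset as ds
# Iterate over the dataset's fragments (one per parquet file) and tag each
# piece with the path it was read from.
dataset = ds.dataset("path_of_folder_with_files", format="parquet")
tables = []
for fragment in dataset.get_fragments():
    table = fragment.to_table()
    table = table.append_column(
        "file_origin", pa.array([fragment.path] * len(table), pa.string())
    )
    tables.append(table)
df = pa.concat_tables(tables).to_pandas()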
I am trying to read multiple parquet files with selected columns into one Pandas dataframe. The parquet files don't all share the same columns. I tried adding a filters argument to pd.read_parquet(), but it doesn't seem to work when reading multiple files. How can I make this work?
from pathlib import Path
import pandas as pd
data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)

full_df = pd.concat(
    pd.read_parquet(parquet_file, filters=[('name', 'address', 'email')])
    for parquet_file in data_dir.glob('*.parquet')
)
Reading from multiple files is well supported. However, if your schemas are different then it is a bit trickier. Pyarrow currently defaults to using the schema of the first file it finds in a dataset. This is to avoid the up-front cost of inspecting the schema of every file in a large dataset.
Arrow-C++ has the capability to override this and scan every file but this is not yet exposed in pyarrow. However, if you know the unified schema ahead of time you can supply it and you will get the behavior you want. You will need to use the datasets module directly to do this as specifying a schema is not part of pyarrow.parquet.read_table (which is what is called by pandas.read_parquet).
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pandas as pd
import tempfile
tab1 = pa.Table.from_pydict({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
tab2 = pa.Table.from_pydict({'b': ['a', 'b', 'c'], 'c': [True, False, True]})
unified_schema = pa.unify_schemas([tab1.schema, tab2.schema])
with tempfile.TemporaryDirectory() as dataset_dir:
    pq.write_table(tab1, f'{dataset_dir}/one.parquet')
    pq.write_table(tab2, f'{dataset_dir}/two.parquet')

    print('Basic read of directory will use schema from first file')
    print(pd.read_parquet(dataset_dir))
    print()

    print('You can specify the unified schema if you know it')
    dataset = ds.dataset(dataset_dir, schema=unified_schema)
    print(dataset.to_table().to_pandas())
    print()

    print('The columns option will limit which columns are returned from read_parquet')
    print(pd.read_parquet(dataset_dir, columns=['b']))
    print()

    print('The columns option can be used when specifying a schema as well')
    dataset = ds.dataset(dataset_dir, schema=unified_schema)
    print(dataset.to_table(columns=['b', 'c']).to_pandas())
If you don't know the unified schema ahead of time you can create it by inspecting all the files yourself:
# You could also use glob here or whatever tool you want to
# get the list of files in your dataset
dataset = ds.dataset(dataset_dir)
schemas = [pq.read_schema(dataset_file) for dataset_file in dataset.files]
print(pa.unify_schemas(schemas))
Since this might be expensive (especially if working with a remote filesystem) you may want to save off the unified schema in its own file (saving a parquet file or Arrow IPC file with 0 batches is usually sufficient) instead of recalculating it every time.
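For instance, a minimal sketch of saving and reloading the schema that way (the file name dataset_schema.parquet is just an example):
import pyarrow.parquet as pq
# Write an empty table that carries the unified schema computed above.
pq.write_table(unified_schema.empty_table(), 'dataset_schema.parquet')
# Later, recover the schema without touching any data files.
saved_schema = pq.read_schema('dataset_schema.parquet')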
I would like to only import a subset of a csv as a dataframe as it is too large to import the whole thing. Is there a way to do this natively in pandas without having to set up a database like structure?
I have tried only importing a chunk and then concatenating and this is still too large and causes memory error. I have hundreds of columns so manually specifying dtypes could help, but would likely be a major time commitment.
df_chunk = pd.read_csv("filename.csv", chunksize=1e7)
df = pd.concat(df_chunk,ignore_index=True)
You may use the skiprows and nrows arguments in the read_csv function to load only a subset of rows from your original file.
For instance:
import pandas as pd
df = pd.read_csv("test.csv", skiprows = 4, nrows=10)
I have around 60 .csv files which I would like to combine in pandas. So far I've used this:
import pandas as pd
import glob
total_files = glob.glob("something*.csv")
data = []
for csv in total_files:
    list = pd.read_csv(csv, encoding="utf-8", sep='delimiter', engine='python')
    data.append(list)
biggerlist = pd.concat(data, ignore_index=True)
biggerlist.to_csv("output.csv")
This somewhat works, but the files I would like to combine all share the same structure of 15 columns with the same headers. When I use this code, only a single column is produced, holding the entire row as one string, and its header is a run-together of all the column names (e.g. SEARCH_ROW, DATE, TEXT, etc.).
How can I combine these csv files, while keeping the same structure of the original files?
Edit:
So perhaps I should be a bit more specific regarding my data. This is a snapshot of one of the .csv files I'm using:
As you can see, it is just newspaper data, where the last column is 'TEXT', which isn't shown completely when you open the file.
This is a part of how it looks when I have combined the data using my code.
Separately, I can read any of these .csv files without a problem using
data = pd.read_csv("something.csv",encoding="utf-8", sep='delimiter', engine='python')
I solved it!
The problem was the number of commas in the text part of my .csv files. So after removing all commas (just using search/replace), I used:
import pandas as pd
import glob

filenames = glob.glob("something*.csv")
# DataFrame.append has been removed from recent pandas; build the frames and concat instead
df = pd.concat(
    (pd.read_csv(filename, encoding="utf-8", sep=";") for filename in filenames),
    ignore_index=True,
)
Thanks for all the help.
I regularly use dask.dataframe to read multiple files, like so:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
However, the origin of each row, i.e. which file the data was read from, seems to be forever lost.
Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file and contains 100 rows. This would be applied to each "partition" / file that is read into the dataframe, when compute is triggered as part of a workflow.
The idea is that different logic can then be applied depending on the source.
Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:
include_path_column : bool or str, optional
Whether or not to include the path to each particular file.
If True a new column is added to the dataframe called path.
If str, sets new column name. Default is False.
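A minimal usage sketch of that parameter (the file pattern and the column name 'partition' are placeholders):
import dask.dataframe as dd
# Each row gets an extra column recording the file it was read from.
df = dd.read_csv('*.csv', include_path_column='partition')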
Assuming you have or can make a file_list list that has the file path of each csv file, and each individual file fits in RAM (you mentioned 100 rows), then this should work:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
def read_and_label_csv(filename):
    # reads each csv file to a pandas.DataFrame
    df_csv = pd.read_csv(filename)
    df_csv['partition'] = filename.split('\\')[-1]
    return df_csv

# create a list of functions ready to return a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)
With some customization, of course. If your csv files are bigger than RAM, then a concatenation of dask.DataFrames is probably the way to go.
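For completeness, a rough sketch of that bigger-than-RAM route (file_list is the same assumed list of csv paths as above):
import dask.dataframe as dd
# Read each file lazily as its own dask.DataFrame, tag it, then concatenate.
parts = [
    dd.read_csv(fname).assign(partition=fname.split('\\')[-1])
    for fname in file_list
]
ddf = dd.concat(parts)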
I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that python ran out of memory when trying to pandas.read_csv() a 128mb csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file, np.memmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still ran into memory problems (my CSV was ~7.5 GB).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,      # no header, define column names manually below
                     nrows=chunksize,  # number of rows to read at each iteration
                     skiprows=i)       # skip rows that were already read

    # assign the column names
    df.columns = columns

    sql.to_sql(df,
               name=table_name,
               con=cnx,
               index=False,                # don't write the CSV file index
               index_label='molecule_id',  # use a unique column from DataFrame as index
               if_exists='append')
cnx.close()
Below is my workflow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
# con = sa.create_engine('sqlite:///XXXXX.db')  # SQLite alternative

chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Depending on your file size, you may want to tune the chunksize.
for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)
Once all the data is in the database, you can query out whatever you need from it.
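For example, a sketch of pulling a filtered subset back out (the table name 'Table' matches the snippet above; the column name is a placeholder):
import pandas as pd
# Pull only the rows you need back into memory.
subset = pd.read_sql('SELECT * FROM "Table" WHERE some_column > 0', con)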
If you want to load huge csv files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.
link to dask on github
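A minimal sketch of what that looks like (the file and column names are placeholders); the dataframe stays lazy and out-of-core until compute() is called:
import dask.dataframe as dd
# Lazily point at a CSV that may be larger than RAM; nothing is loaded yet.
ddf = dd.read_csv('large_dataset.csv')
# Operations look like pandas; compute() triggers the actual chunked work.
result = ddf.groupby('some_column')['some_value'].mean().compute()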
You can use PyTables rather than a pandas DataFrame.
It is designed for large data sets and the file format is HDF5.
So the processing time is relatively fast.
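From pandas the usual entry point is HDFStore, which is backed by PyTables. A sketch (file names and chunk size are placeholders), streaming CSV chunks into an on-disk HDF5 table and querying it later:
import pandas as pd
# Stream the CSV into an HDF5 table on disk, chunk by chunk.
with pd.HDFStore('large_dataset.h5') as store:
    for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
        store.append('data', chunk, data_columns=True)
# Later, read back only the rows you need with an on-disk query.
subset = pd.read_hdf('large_dataset.h5', 'data', where='index < 1000')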