I refer to this question - dask dataframe read parquet schema difference
But the metadata returned by Dask does not indicate any differences between the dataframes. Here is my code, which parses the exception details to find mismatched dtypes; it finds none. There are up to 100 dataframes with 717 columns (each is ~100MB in size).
try:
    df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve), engine='pyarrow')
except Exception as ex:
    # Process the ex message to find the diff; this will break if dask changes its error message
    msgs = str(ex).split('\nvs\n')
    cols1 = msgs[0].split('metadata')[0]
    cols1 = cols1.split('was different. \n')[1]
    cols2 = msgs[1].split('metadata')[0]
    df1_err = pd.DataFrame([sub.split(":") for sub in cols1.splitlines()])
    df1_err = df1_err.dropna()
    df2_err = pd.DataFrame([sub.split(":") for sub in cols2.splitlines()])
    df2_err = df2_err.dropna()
    df_err = pd.concat([df1_err, df2_err]).drop_duplicates(keep=False)
    raise Exception('Mismatch dataframes - ' + str(df_err))
The exception I get back is:
'Mismatch dataframes - Empty DataFrame Columns: [0, 1] Index: []'
This error does not occur with fastparquet, but it is so slow that it is unusable.
I added the following to the creation of the dataframes (I use pandas to_parquet to save them) in an attempt to unify the dtypes column by column:
df_float = df.select_dtypes(include=['float16', 'float64'])
df = df.drop(df_float.columns, axis=1)
for col in df_float.columns:
    df_float[col] = df_float.loc[:, col].astype('float32')
df = pd.concat([df, df_float], axis=1)
df_int = df.select_dtypes(include=['int8', 'int16', 'int32'])
try:
    for col in df_int.columns:
        df_int[col] = df_int.loc[:, col].astype('int64')
    df = df.drop(df_int.columns, axis=1)
    df = pd.concat([df, df_int], axis=1)
except ValueError as ve:
    print('Error with upcasting - ' + str(ve))
This appears to work, according to my exception handler above. But I still cannot work out how the dataframes differ, because the exception thrown by Dask's read_parquet does not tell me. Any ideas on how to determine what it considers different?
You could use the fastparquet function merge to create a metadata file from the many data files (this will take some time, as it scans all the files). Thereafter, pyarrow will use this metadata file, and that might be enough to get rid of the problem for you.
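A rough sketch of that approach, reusing data_filenames and cols_to_retrieve from the question; the merge call and the directory-based read are assumptions that may need adjusting for your fastparquet/dask versions:
import os
import fastparquet
import dask.dataframe as dd

# scan every data file and write a combined _metadata file next to them
fastparquet.writer.merge(data_filenames)

# point read_parquet at the directory that now holds _metadata
df = dd.read_parquet(os.path.dirname(data_filenames[0]),
                     columns=list(cols_to_retrieve), engine='pyarrow')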
Related
I have 3 dataframes and I want to load them into Spark, so I loaded the 3 files, saved them into dataframes, and combined them into one dataframe. When I apply Spark to the dataframe
I get this error: "Failed to find data source: org.apache.spark.csv. Please find packages". Do I need to download org.apache.spark.csv to my local files and then load it?
dfA = pd.read_excel("/content/gdrive/MyDrive/Colab Notebooks/dfA.xlsx", names=['','Name', 'Prod_No','Category','URL','Description'])
dfB = pd.read_excel("/content/gdrive/MyDrive/Colab Notebooks/dfB.xlsx", names=['','Name', 'Prod_No','Category','URL','Description'])
dfC = pd.read_excel("/content/gdrive/MyDrive/Colab Notebooks/dfC.xlsx", names=['','Name', 'Prod_No','Category','URL','Description'])
# delete index column
dfA = dfA.drop([''], axis=1)
dfB = dfB.drop([''], axis=1)
dfC = dfC.drop([''], axis=1)
# reset index, merge the dataframes and delete rows with replay data
df = pd.concat([dfA,dfB,dfC],ignore_index=True)
df = df[df.Article.str.contains("Replay") == False]
data = spark.read.format("org.apache.spark.csv").option("header", "true").option("mode", "PERMISSIVE").option("inferSchema", "True").load(df)
You don't need the full format name; just use spark.read.format("csv")
You can even just use spark.read.csv(path)
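Note that load() (and csv()) expects a file path rather than a pandas dataframe, so the combined frame would first need to be written out. A minimal sketch, where "combined.csv" is just an assumed path:
# df is the combined pandas dataframe from the question
df.to_csv("combined.csv", index=False)

data = spark.read.format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .option("inferSchema", "true") \
    .load("combined.csv")

# or the shorthand
data = spark.read.csv("combined.csv", header=True, inferSchema=True)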
The HTTP log files I'm trying to analyze with pandas sometimes contain unexpected lines. Here's how I load my data:
df = pd.read_csv('mylog.log',
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'],
                 converters={'status': int, 'size': int, 'req_time': int})
It works fine for most of the logs I have (which come from the same server). However, upon loading some logs, an exception is raised:
either
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
or
ValueError: invalid literal for int() with base 10: '"GET /agent/10577/bdl HTTP/1.1"'
For the sake of the example, here's the line that triggers the second exception:
22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] "GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" apibackend.site.fr 429282
To find the number of the offending line, I used the following (terribly slow) function:
def search_error_dichotomy(path):
    borne_inf = 0
    log = open(path)
    borne_sup = len(log.readlines())
    log.close()
    while borne_sup - borne_inf > 1:
        exceded = False
        search_index = (borne_inf + borne_sup) // 2
        try:
            pd.read_csv(path,...,...,nrows=search_index)
        except:
            exceded = True
        if exceded:
            borne_sup = search_index
        else:
            borne_inf = search_index
    return search_index
What I'd like to have is something like this :
try:
pd.read_csv(..........................)
except MyError as e:
print(e.row_number)
where e.row_number is the number of the messy line.
Thank you in advance.
SOLUTION
All credit to devssh, whose suggestion not only makes the process quicker but also lets me get all the unexpected lines at once. Here's what I did with it:
Load the dataframe without converters.
df = pd.read_csv(path,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'])
Add an 'index' column using .reset_index().
df = df.reset_index()
Write a custom function (to be used with apply) that converts to int if possible, and otherwise saves the entry and its 'index' in a dictionary wrong_lines.
wrong_lines = {}

def convert_int_feedback_index(row, col):
    try:
        ans = int(row[col])
    except:
        wrong_lines[row['index']] = row[col]
        ans = pd.np.nan
    return ans
Use apply on the columns I want to convert (e.g. col = 'status', 'size', or 'req_time'):
df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)
Did you try pd.read_csv(..., nrows=10) to see if it works on even 10 lines?
Perhaps you should not use converters to specify the dtypes.
Load the DataFrame, then apply the dtype to columns like df["column"] = df["column"].astype(np.int64), or use a custom function like df["column"] = df["column"].apply(lambda x: convert_type(x)) and handle the errors yourself in the function convert_type.
Finally, update the csv by calling df.to_csv("preprocessed.csv", header=True, index=False).
I don't think you can get the line number from pd.read_csv itself, and that separator looks too complex.
Or you can try just reading the csv as a single-column DataFrame and use df["column"].str.extract with a regex to pull out the columns. That way you control how the exception is raised or what default value is used to handle the error.
df.reset_index() will give you the row numbers as a column. Combine that with apply over multiple columns and you will get the row number of every bad entry, which lets you customize everything.
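A rough sketch of the single-column idea; the regex here is a deliberately simplified stand-in for the full log format:
import pandas as pd

with open('mylog.log') as f:
    raw = pd.DataFrame({'line': f.read().splitlines()})

# rows that do not match the pattern come back as NaN
parts = raw['line'].str.extract(r'"(?P<request>[^"]*)"\s+(?P<status>\d+|-)', expand=True)

# the index of the non-matching rows gives the (zero-based) numbers of the messy lines
bad_lines = raw.index[parts['status'].isnull()]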
For a given path, I process many gigabytes of files inside it and yield a dataframe for every processed file.
Every dataframe that is yielded includes two string columns of varying size, and I want to dump it to disk using the very efficient HDF5 format. The error is raised when the HDFStore.append procedure is called, on the 4th or 5th iteration.
I use the following (simplified) routine to build the dataframes:
# imports implied by the snippet
import os
import re
import zipfile
from zipfile import ZipFile as zf
from pandas import DataFrame as df

def build_data_frames(path):
    data = df({'headline': [],
               'content': [],
               'publication': [],
               'file_ref': []},
              columns=['publication', 'file_ref', 'headline', 'content'])
    for curdir, subdirs, filenames in os.walk(path):
        for file in filenames:
            if zipfile.is_zipfile(os.path.join(curdir, file)):
                with zf(os.path.join(curdir, file), 'r') as arch:
                    for arch_file_name in arch.namelist():
                        if re.search('A[r|d]\d+.xml', arch_file_name) is not None:
                            xml_file_ref = arch.open(arch_file_name, 'r')
                            xml_file = xml_file_ref.read()
                            metadata = XML2MetaData(xml_file)
                            headlineTokens, contentTokens = XML2TokensParser(xml_file)
                            rows = [{'headline': " ".join(headlineTokens),
                                     'content': " ".join(contentTokens)}]
                            rows[0].update(metadata)
                            data = data.append(df(rows,
                                                  columns=['publication',
                                                           'file_ref',
                                                           'headline',
                                                           'content']),
                                               ignore_index=True)
                arch.close()
                yield data
Then I use the following method to write these dataframes to disk:
def extract_data(path):
    hdf_fname = extract_name(path)
    hdf_fname += ".h5"
    data_store = HDFStore(hdf_fname)
    for dataframe in build_data_frames(path):
        data_store.append('df', dataframe, data_columns=True)
        ## passing min_itemsize doesn't work either
        ## data_store.append('df', dataframe, min_itemsize=8000)
        ## trying the "alternative" command didn't help
        ## dataframe.to_hdf(hdf_fname, 'df', format='table', append=True,
        ##                  min_itemsize=80000)
    data_store.close()
I then call:
%time load_data(publications_path)
And the ValueError I get is:
...
ValueError: Trying to store a string with len [5761] in [values_block_0]
column but this column has a limit of [4430]!
Consider using min_itemsize to preset the sizes on these columns
I have tried all the options, gone through all the documentation relevant to this task, and tried every trick I found on the internet, yet I have no idea why it happens.
I am using pandas version 0.17.0.
I appreciate your help very much!
Have you seen this post? stackoverflow
data_store.append('df', dataframe, min_itemsize={'string': 5761})
Replace 'string' with the name of the column that holds the long strings (or use the key 'values' to apply the limit to all value blocks).
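Applied to the loop from the question, that might look roughly like this; headline and content are the string columns from the question, and the sizes are just guesses that need to be at least as large as the longest strings you expect:
data_store = HDFStore(hdf_fname)
for dataframe in build_data_frames(path):
    # sizes below are assumptions, not measured values
    data_store.append('df', dataframe, data_columns=True,
                      min_itemsize={'headline': 2000, 'content': 10000})
data_store.close()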
Problem writing pandas dataframe (timeseries) to HDF5 using pytables/tstables:
import pandas
import tables
import tstables
# example dataframe
valfloat = [512.3, 918.8]
valstr = ['abc','cba']
tstamp = [1445464064, 1445464013]
df = pandas.DataFrame(data = zip(valfloat, valstr, tstamp), columns = ['colfloat', 'colstr', 'timestamp'])
df.set_index(pandas.to_datetime(df['timestamp'].astype(int), unit='s'), inplace=True)
df.index = df.index.tz_localize('UTC')
colsel = ['colfloat', 'colstr']
dftoadd = df[colsel].sort_index()
# try string conversion from object-type (no type mixing here ?)
##dftoadd.loc[:,'colstr'] = dftoadd['colstr'].map(str)
h5fname = 'df.h5'
# class to use as tstable description
class TsExample(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)
    colfloat = tables.Float64Col(pos=1)
    colstr = tables.StringCol(itemsize=8, pos=2)
# create new time series
h5f = tables.open_file(h5fname, 'a')
ts = h5f.create_ts('/','example',TsExample)
# append to HDF5
ts.append(dftoadd, convert_strings=True)
# save data and close file
h5f.flush()
h5f.close()
Exception:
ValueError: rows parameter cannot be converted into a recarray object
compliant with table tstables.tstable.TsTable instance at ...
The error was: cannot view Object as non-Object type
While this particular error happens with TsTables, the code chunk responsible for it is identical to PyTables try-section here.
The error is happening after I upgraded pandas to 0.17.0; the same code was running error-free with 0.16.2.
NOTE: if a string column is excluded then everything works fine, so this problem must be related to string-column type representation in the dataframe.
The issue could be related to this question. Is there some conversion required for 'colstr' column of the dataframe that I am missing?
This is not going to work with a newer pandas as the index is timezone aware, see here
You can either:
convert the index to a type PyTables understands, which requires dropping the timezone (e.g. with index.tz_localize(None)), or
use HDFStore to write the frame.
Note that what you are doing is the reason HDFStore exists in the first place: to make reading/writing with PyTables friendly for pandas objects. Doing this 'manually' is full of pitfalls.
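A rough sketch of the two options, reusing dftoadd from the question (the min_itemsize value is just a guess for the short strings in the example):
# option 1: drop the timezone so PyTables sees a plain datetime64 index
dftoadd_naive = dftoadd.copy()
dftoadd_naive.index = dftoadd_naive.index.tz_localize(None)

# option 2: let HDFStore handle the conversion for you
store = pandas.HDFStore('df.h5')
store.append('example', dftoadd, data_columns=True, min_itemsize={'colstr': 8})
store.close()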
The problem: I have data stored in csv files with the columns date/id/value. I have 15 files, each containing around 10-20 million rows. Each csv file covers a distinct period, so the time indexes are non-overlapping, but the columns are (new ids enter from time to time, old ones disappear). What I originally did was run the script without the pivot call, but then I ran into memory issues on my local machine (only 8GB). Since there is lots of redundancy in each file, pivot seemed at first a nice way out (roughly 2/3 less data), but now performance kicks in. If I run the following script, the concat function will run "forever" (so far I have always interrupted it manually, after more than 2h). Do concat/append have limitations in terms of size (I have roughly 10000-20000 columns), or am I missing something here? Any suggestions?
import pandas as pd
path = 'D:\\'
data = pd.DataFrame()
# loop through list of raw file names
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    data = pd.concat([data, data_tmp])
    del data_tmp
EDIT I: To clarify, each csv file has about 10-20 million rows and three columns; after pivot is applied, this reduces to about 2000 rows but leads to 10000 columns.
I can solve the memory issue by simply splitting the full set of ids into subsets and running the needed calculations on each subset, as they are independent for each id. I know this makes me reload the same files n times, where n is the number of subsets used, but it is still reasonably fast. I still wonder why append is not performing.
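Roughly, the subset splitting looks like this (all_ids and the number of subsets are placeholders, not my exact code):
all_ids = []   # placeholder: fill with the full list of ids across all files
n_subsets = 4  # placeholder: chosen so each subset fits in memory
id_subsets = [all_ids[i::n_subsets] for i in range(n_subsets)]
for ids in id_subsets:
    frames = []
    for file in raw_files:
        tmp = pd.read_csv(path + file, engine='c', compression='gzip',
                          low_memory=False, usecols=['date', 'Value', 'ID'])
        tmp = tmp[tmp['ID'].isin(ids)].pivot(index='date', columns='ID', values='Value')
        frames.append(tmp)
    subset_data = pd.concat(frames)
    # ... run the per-id calculations on subset_data ...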
EDIT II: I have tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope it is clear; I didn't spend too much time minimizing simulation time, but it runs reasonably fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math

# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------

# Simulation column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
    id_list.append(''.join(
        random.choice(string.ascii_uppercase + string.digits) for _
        in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]

time_index = pd.bdate_range(start_date, end_date, freq='D')
chunk_size = math.ceil(len(time_index)/num_files)

data = []
# Simulate files
for file in range(0, run_to_file):
    tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
    # TODO not all cases cover, make sure ints are obtained
    tmp_ids = id_list[file * id_interval:
                      start_ids + (file + 1) * id_interval]
    tmp_data = pd.DataFrame(np.random.standard_normal(
        (len(tmp_time), len(tmp_ids))), index=tmp_time,
        columns=tmp_ids)
    tmp_file = tmp_data.stack().sortlevel(1).reset_index()
    # final simulated data structure of the parsed csv file
    tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
                                        'ID', 0: 'Value'})
    # comment/uncomment if pivot takes place on aggregate level or not
    tmp_file = tmp_file.pivot(index='Date', columns='ID',
                              values='Value')
    data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')
Using your reproducible example code, I can indeed confirm that concatenating only two of the dataframes already takes a very long time. However, if you first align them (make the column names equal), then concatenating is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier approach when dealing with a full list of dataframes.
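For example, a sketch of the same idea applied to a whole list of pivoted frames (frames here is an assumed name for that list):
# frames is assumed to hold all the pivoted dataframes
common_columns = frames[0].columns
for frame in frames[1:]:
    common_columns = common_columns.union(frame.columns)

aligned = [frame.reindex(columns=common_columns) for frame in frames]
result = pd.concat(aligned)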
The reason that pd.concat is slower is that it does more. For example, when the column names are not equal, it checks for every column whether the dtype has to be upcast to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this. But in this case, where you are sure all columns have the same dtype, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
Summary: three key performance drivers, depending on the set-up:
1) Make sure the data types are the same when concatenating two dataframes.
2) Use integer-based column names if possible.
3) When using string-based columns, make sure to use the align method before concat is called, as suggested by joris.
As #joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    dfs.append(data_tmp)
    del data_tmp
data = pd.concat(dfs)