Using a regex pattern to read files from directories - Python

I have directories with the following names:
s3://bucket/elig_date=2020-06-01/
s3://bucket/elig_date=2020-06-02/
....
s3://bucket/elig_date=2020-09-30/
s3://bucket/elig_date=2020-10-01/
...
s3://bucket/elig_date=2020-12-31/
When I want to read all files inside all directories from 2020-06-01 to 2020-09-30, I use the following and it works:
import dask.dataframe as dd
all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]-*/*")
But I want to extend this up to the directory 2020-12-31. I am trying the following and it doesn't work:
all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]|1[0-2]-*/*")
This throws the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-61-60da829cf51e> in <module>
----> 1 all_data = dd.read_parquet("s3://bucket/elig_date=2020-0[6-9]|1[0-2]-*/*")
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, **kwargs)
333 index = [index]
334
--> 335 meta, statistics, parts, index = engine.read_metadata(
336 fs,
337 paths,
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, engine, **kwargs)
497 split_row_groups,
498 gather_statistics,
--> 499 ) = cls._gather_metadata(
500 paths,
501 fs,
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, read_from_paths, dataset_kwargs)
1647
1648 # Step 1: Create a ParquetDataset object
-> 1649 dataset, base, fns = _get_dataset_object(paths, fs, filters, dataset_kwargs)
1650 if fns == [None]:
1651 # This is a single file. No danger in gathering statistics
~/anaconda3/envs/3.8.1/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _get_dataset_object(paths, fs, filters, dataset_kwargs)
1600 if proxy_metadata:
1601 dataset.metadata = proxy_metadata
-> 1602 elif fs.isdir(paths[0]):
1603 # This is a directory. We can let pyarrow do its thing.
1604 # Note: In the future, it may be best to avoid listing the
IndexError: list index out of range

I only tested this on RegExr because I do not have your files, but the following pattern worked there:
s3://bucket/elig_date=2020-(0[6-9]|1[0-2])-*/*
It is the same as yours, just with brackets grouping the alternation.
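If the bracketed pattern still gives trouble, an alternative sketch (not tested against your bucket, and assuming the directory layout shown in the question) is to build one glob per month and pass dd.read_parquet a list of paths, which it accepts:
import dask.dataframe as dd
import pandas as pd

# One glob per month from 2020-06 through 2020-12
months = pd.date_range("2020-06-01", "2020-12-01", freq="MS").strftime("%Y-%m")
paths = [f"s3://bucket/elig_date={m}-*/*" for m in months]
all_data = dd.read_parquet(paths)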

Related

My code broke with ValueError: Index contains duplicate entries, cannot reshape when using the Yahoo data reader

The code worked just fine but now it gives me this error after these lines:
end = dt.datetime.now()
start = dt.date(end.year - 3, end.month, end.day)
prices = reader.get_data_yahoo(tickers,start,end)['Adj Close']
I tried upgrading packages and everything, but it didn't help. The code doesn't work now even for data I previously downloaded and analysed successfully with it.
ValueError Traceback (most recent call last)
Input In [6], in <cell line: 3>()
1 end = dt.datetime.now()
2 start = dt.date(end.year - 3, end.month, end.day)
----> 3 prices = reader.get_data_yahoo(tickers,start,end)['Adj Close']
File C:\Python310\lib\site-packages\pandas_datareader\data.py:80, in get_data_yahoo(*args, **kwargs)
79 def get_data_yahoo(*args, **kwargs):
---> 80 return YahooDailyReader(*args, **kwargs).read()
File C:\Python310\lib\site-packages\pandas_datareader\base.py:256, in _DailyBaseReader.read(self)
254 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
255 elif isinstance(self.symbols, DataFrame):
--> 256 df = self._dl_mult_symbols(self.symbols.index)
257 else:
258 df = self._dl_mult_symbols(self.symbols)
File C:\Python310\lib\site-packages\pandas_datareader\base.py:285, in _DailyBaseReader._dl_mult_symbols(self, symbols)
283 stocks[sym] = df_na
284 if PANDAS_0230:
--> 285 result = concat(stocks, sort=True).unstack(level=0)
286 else:
287 result = concat(stocks).unstack(level=0)
File C:\Python310\lib\site-packages\pandas\core\frame.py:8413, in DataFrame.unstack(self, level, fill_value)
8351 """
8352 Pivot a level of the (necessarily hierarchical) index labels.
8353
(...)
8409 dtype: float64
8410 """
8411 from pandas.core.reshape.reshape import unstack
-> 8413 result = unstack(self, level, fill_value)
8415 return result.__finalize__(self, method="unstack")
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:478, in unstack(obj, level, fill_value)
476 if isinstance(obj, DataFrame):
477 if isinstance(obj.index, MultiIndex):
--> 478 return _unstack_frame(obj, level, fill_value=fill_value)
479 else:
480 return obj.T.stack(dropna=False)
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:501, in _unstack_frame(obj, level, fill_value)
499 def _unstack_frame(obj, level, fill_value=None):
500 if not obj._can_fast_transpose:
--> 501 unstacker = _Unstacker(obj.index, level=level)
502 mgr = obj._mgr.unstack(unstacker, fill_value=fill_value)
503 return obj._constructor(mgr)
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:140, in _Unstacker.__init__(self, index, level, constructor)
133 if num_cells > np.iinfo(np.int32).max:
134 warnings.warn(
135 f"The following operation may generate {num_cells} cells "
136 f"in the resulting pandas object.",
137 PerformanceWarning,
138 )
--> 140 self._make_selectors()
File C:\Python310\lib\site-packages\pandas\core\reshape\reshape.py:192, in _Unstacker._make_selectors(self)
189 mask.put(selector, True)
191 if mask.sum() < len(self.index):
--> 192 raise ValueError("Index contains duplicate entries, cannot reshape")
194 self.group_index = comp_index
195 self.mask = mask
ValueError: Index contains duplicate entries, cannot reshape
I know it can be frustrating, but for the moment you have to read each ticker individually. The API is probably broken since the latest versions of pandas:
import datetime as dt
import pandas as pd
import pandas_datareader.data as reader  # get_data_yahoo lives in pandas_datareader.data

tickers = ['AAPL', 'MSFT']
end = dt.datetime.now()
start = dt.date(end.year - 3, end.month, end.day)

data = {}
for ticker in tickers:
    # Download each ticker separately instead of passing the whole list at once
    data[ticker] = reader.get_data_yahoo(ticker, start, end)['Adj Close']
prices = pd.concat(data, axis=1)
Output:
>>> prices
AAPL MSFT
Date
2019-03-11 43.548748 109.345795
2019-03-12 44.038033 110.111404
2019-03-13 44.232773 110.964211
2019-03-14 44.724491 111.051437
2019-03-15 45.306278 112.330688
... ... ...
2022-03-07 159.300003 278.910004
2022-03-08 157.440002 275.850006
2022-03-09 162.949997 288.500000
2022-03-10 158.520004 285.589996
2022-03-10 158.520004 285.589996
[759 rows x 2 columns]
The error occurs when the query is made on a Saturday or Sunday, because Yahoo Finance repeats Friday's data twice; you can verify this by looking at the historical data on Yahoo Finance itself.
For a single stock this can be solved with:
data = data[~data.index.duplicated(keep='last')]
When downloading data for a list of stocks, the proposed solution is to iterate over that list and concatenate the resulting series to build the DataFrame, then apply the line above to remove the duplicate index entries.
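For completeness, a minimal sketch that combines the two ideas by de-duplicating each ticker's series before concatenating (assuming the same tickers, start, end, and reader as above):
data = {}
for ticker in tickers:
    s = reader.get_data_yahoo(ticker, start, end)['Adj Close']
    # Drop the repeated weekend rows before combining
    data[ticker] = s[~s.index.duplicated(keep='last')]
prices = pd.concat(data, axis=1)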

How to save a large pandas dataframe with complex arrays and load it up again?

I have a large pandas DataFrame with individual elements that are complex numpy arrays. Please see below a minimal code example to reproduce the scenario:
import numpy as np
import pandas as pd

d = {f'x{i}': [] for i in range(4)}
df = pd.DataFrame(data=d).astype(object)
for K in range(4):
    for i in range(4):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=(2,2)) + np.random.random(size=(2,2)) * 1j
df
What is the best way to save these and load them up again for use later?
The problem I am having is that when I increase the size of the stored matrices and the number of elements, I get an OverflowError when I try to save the DataFrame as an .h5 file, as shown below:
import numpy as np
import pandas as pd

size = (300, 300)
xs = 1500
d = {f'x{i}': [] for i in range(xs)}
df = pd.DataFrame(data=d).astype(object)
for K in range(10):
    for i in range(xs):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j
df.to_hdf('test.h5', key="df", mode="w")
load_test = pd.read_hdf("test.h5", "df")
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
<ipython-input-124-8cb8df1a0653> in <module>
12 df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j
13
---> 14 df.to_hdf('test.h5', key="df", mode="w")
15
16
~/PQKs/pqks/lib/python3.6/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
2447 data_columns=data_columns,
2448 errors=errors,
-> 2449 encoding=encoding,
2450 )
2451
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
268 path_or_buf, mode=mode, complevel=complevel, complib=complib
269 ) as store:
--> 270 f(store)
271 else:
272 f(path_or_buf)
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in <lambda>(store)
260 data_columns=data_columns,
261 errors=errors,
--> 262 encoding=encoding,
263 )
264
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times)
1127 encoding=encoding,
1128 errors=errors,
-> 1129 track_times=track_times,
1130 )
1131
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
1799 nan_rep=nan_rep,
1800 data_columns=data_columns,
-> 1801 track_times=track_times,
1802 )
1803
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write(self, obj, **kwargs)
3189 # I have no idea why, but writing values before items fixed #2299
3190 blk_items = data.items.take(blk.mgr_locs)
-> 3191 self.write_array(f"block{i}_values", blk.values, items=blk_items)
3192 self.write_index(f"block{i}_items", blk_items)
3193
~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write_array(self, key, value, items)
3047
3048 vlarr = self._handle.create_vlarray(self.group, key, _tables().ObjectAtom())
-> 3049 vlarr.append(value)
3050
3051 elif empty_array:
~/PQKs/pqks/lib/python3.6/site-packages/tables/vlarray.py in append(self, sequence)
526 nparr = None
527
--> 528 self._append(nparr, nobjects)
529 self.nrows += 1
530
~/PQKs/pqks/lib/python3.6/site-packages/tables/hdf5extension.pyx in tables.hdf5extension.VLArray._append()
OverflowError: value too large to convert to int
As noted in the similar issue https://stackoverflow.com/a/57133759/8896855, HDF5 (.h5) files carry more overhead and are designed for storing many dataframes in a single file-system-like container. Feather and Parquet will likely provide a better solution for saving and reloading a single large dataframe. As for the specific OverflowError, it most likely comes from storing large mixed-type columns (here, numpy arrays) under the pandas "object" dtype. One (more complicated) option would be to split the arrays in your dataframe out into separate columns, but that's probably unnecessary.
A general quick fix would be to use df.to_pickle(r'path_to/filename.pkl'), but to_feather or to_parquet likely offer more optimized solutions.
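A minimal sketch of the pickle round trip suggested above (the file name is arbitrary):
# Save the object-dtype DataFrame of complex arrays and load it back
df.to_pickle('complex_arrays.pkl')
loaded = pd.read_pickle('complex_arrays.pkl')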

Dask Dataframe: Resample partitioned data loaded from multiple parquet files

I am loading multiple parquet files containing timeseries data together. But the loaded Dask dataframe has unknown divisions, because of which I can't apply various time series operations on it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For instance, df_resampled = df.resample('1T').mean().compute() gives following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self, rule, closed, label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self, rule, closed=closed, label=label)
2630
2631 #derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self, obj, rule, **kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link https://docs.dask.org/en/latest/dataframe-design.html#partitions and it says:
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
I then tried the following, but with no success.
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_DATA_DIR, '20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***failed resolving arguments***)
3915 npartitions=npartitions,
3916 divisions=divisions,
-> 3917 **kwargs,
3918 )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions, upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self, key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can anybody suggest the right way to load multiple time series files as a Dask dataframe so that pandas time series operations can be applied to it?

TypeError in read_parquet Dask

I have a parquet file called data.parquet and I'm using the Dask library in Python. When I run the line
import dask.dataframe as dd
df = dd.read_parquet('data.parquet',engine='pyarrow')
I get the error
TypeError Traceback (most recent call last)
<ipython-input-22-807fa43763c1> in <module>
----> 1 df = dd.read_parquet('data.parquet',engine='pyarrow')
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, infer_divisions)
1395 categories=categories,
1396 index=index,
-> 1397 infer_divisions=infer_divisions,
1398 )
1399
~/anaconda3/lib/python3.7/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, fs_token, paths, columns, filters, categories, index, infer_divisions)
858 _open = lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
859 for piece in dataset.pieces:
--> 860 pf = piece.get_metadata(_open)
861 # non_empty_pieces.append(piece)
862 if pf.num_row_groups > 0:
TypeError: get_metadata() takes 1 positional argument but 2 were given
I just don't understand why this happens, since this is how it is implemented here.
Any help will be appreciated!
I faced the same problem. I resolved it by upgrading Dask to version 2.30.0.
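For reference, the upgrade can be done with the usual pip invocation (adjust if your environment was created with conda):
pip install --upgrade "dask[dataframe]==2.30.0"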

Using negative indices with the pathlib parents method

I'm trying to use groupby on a pandas data structure containing pathlib information and file sizes from a particular drive. I want to total up the storage used at a particular depth of the file directory structure to see which directories are the most full.
I was trying a summing groupby on each file's pathlib parent value, but that still doesn't tell you the total storage at a particular depth. pathlib's parents looked promising, but it starts with the full path and works backwards, so I tried reverse indexing, which doesn't seem to work.
From what I read in the documentation, parents is supposed to be a sequence, and sequences are supposed to support reverse indexing, but the error message implies it doesn't accept negative indices (see the short illustration below).
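To illustrate the ordering described above, a small sketch with a made-up path, run on Python 3.8/3.9 where parents rejects negative indices (negative indexing for parents was only added in Python 3.10):
from pathlib import PureWindowsPath

p = PureWindowsPath("c:/Program Files/Vendor/App/app.exe")
print(list(p.parents))
# [PureWindowsPath('c:/Program Files/Vendor/App'),
#  PureWindowsPath('c:/Program Files/Vendor'),
#  PureWindowsPath('c:/Program Files'),
#  PureWindowsPath('c:/')]
# p.parents[-1] raises IndexError on these Python versions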
Here is the code I've been using (with help from http://pbpython.com/pathlib-intro.html):
import pandas as pd
from pathlib import Path
import time

dir_to_scan = "c:/Program Files"
p = Path(dir_to_scan)

all_files = []
for i in p.rglob('*.*'):
    all_files.append((i.name, i.parent, i.stat().st_size))

columns = ["File_Name", "Parent", "Size"]
df = pd.DataFrame.from_records(all_files, columns=columns)
df["path_stem"] = df['Parent'].apply(lambda x: x.parent if len(x.parents) < 3 else x.parents[-2])
The error trace is as follows:
IndexError Traceback (most recent call last)
<ipython-input-3-5748b1f0a9ee> in <module>()
1 #df.groupby('Parent')['Size'].sum()
2
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
4
5 #df([apps])=df([Parent]).parents
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2549 else:
2550 values = self.asobject
-> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-3-5748b1f0a9ee> in <lambda>(x)
1 #df.groupby('Parent')['Size'].sum()
2
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
4
5 #df([apps])=df([Parent]).parents
C:\ProgramData\Anaconda3\lib\pathlib.py in __getitem__(self, idx)
592 def __getitem__(self, idx):
593 if idx < 0 or idx >= len(self):
--> 594 raise IndexError(idx)
595 return self._pathcls._from_parsed_parts(self._drv, self._root,
596 self._parts[:-idx - 1])
IndexError: -1
