Empty Pandas dataframe when attempting to threshold - python
I am attempting to threshold a pandas dataframe which contains gene IDs and statistical information. The input to my Python program is a config.yaml file that holds the initial threshold values and a path to a CSV file (the eventual dataframe). The problem I seem to be running into stems from passing my threshold variables into a "cut-down" dataframe. I am able to threshold successfully when using hard-coded values (in an older version of the code), but I receive an empty dataframe when thresholding with variables that point to values in the config file.
Below is my current implementation:
import yaml
import pandas as pd

# config.yaml is assumed to have been opened beforehand, e.g. with open('config.yaml') as file:
config = yaml.full_load(file)
# for item, doc in config.items():
# print (item, ":", doc)
input_path = config['DESeq_input']['path']
# print(input_path)
baseMean = config['baseMean']
log2FoldChange = config['log2FoldChange']
lfcSE = config['lfcSE']
pvalue = config['pvalue']
padj = config['padj']
df = pd.read_csv(input_path)
# print if 0 < than padj for test
# convert to #, most likely being read as string
# now use threshold value to cut down CSV
# only columns defined in config.yaml file
df_select = df[['genes', 'baseMean', 'log2FoldChange', 'lfcSE', 'pvalue', 'padj']]
# print(df_select)
# print(df_select['genes'])
df_threshold = df_select.loc[(df_select['baseMean'] < baseMean)
& (df_select['log2FoldChange'] < log2FoldChange)
& (df_select['lfcSE'] < lfcSE)
& (df_select['pvalue'] < pvalue)
& (df_select['padj'] < padj)]
print(df_threshold)
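One thing worth ruling out at this point (the in-code comments above already hint at it) is that the threshold values arrived from the YAML as strings rather than numbers: PyYAML parses bare numerics as int/float, but quoted values come back as str, and comparing a numeric column against a string will either raise a TypeError or behave unexpectedly. A minimal check, assuming the variable names used above:

# quick sanity check: make sure the thresholds are numbers, not strings
print(type(baseMean), type(log2FoldChange), type(lfcSE), type(pvalue), type(padj))
# if any of them come back as <class 'str'>, coerce them, e.g.:
padj = float(padj)  # repeat for the other thresholds as needed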
And below is my older, hard-coded implementation (which works):
df = pd.read_csv('/Users/nmaki/Documents/GitHub/IDEA/tests/eDESeq2.csv')
df_select = df[['genes', 'pvalue', 'padj', 'log2FoldChange']]
df_threshold = df_select.loc[(df_select['pvalue'] < 0.05)
& (df_select['padj'] < 0.1)
& (df_select['log2FoldChange'] < 0.5)]
print(df_threshold)
Upon execution of my current code I get:
Empty DataFrame
Columns: [genes, baseMean, log2FoldChange, lfcSE, pvalue, padj]
Index: []
Example contents of the csv file I am loading in as a dataframe:
"genes","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
"ENSDARG00000000001",98.1095154977918,-0.134947665995593,0.306793322887575,-0.439865068527078,0.660034837008121,0.93904992415549
"ENSDARG00000000002",731.125841719954,0.666095249996351,0.161764851506172,4.11767602043598,3.82712199388831e-05,0.00235539468663284
"ENSDARG00000000018",367.699187187462,-0.170546910862128,0.147128047078344,-1.1591733476304,0.246385533026112,0.756573630543937
"ENSDARG00000000019",1133.08821430092,-0.131148919306121,0.104742185100469,-1.25211173683576,0.210529151546469,0.718240791187956
"ENSDARG00000000068",397.13408030651,-0.111332941901299,0.161417383863387,-0.689720891496564,0.49036972534723,0.8864754582597
"ENSDARG00000000069",1886.21783387126,-0.107901197025113,0.113522109960702,-0.950486183374019,0.341865271089735,0.82295928359482
"ENSDARG00000000086",246.197553048504,0.390421091410488,0.215725761369183,1.80980282063921,0.0703263703690051,0.466064880589034
"ENSDARG00000000103",797.782152145232,0.236382332789599,0.145111727277908,1.62896781138092,0.103319833277229,0.550658656731341
"ENSDARG00000000142",26.1411622212853,0.248419645848534,0.495298350652519,0.501555568519983,0.615980180267141,0.927327861190167
"ENSDARG00000000151",121.397701922367,0.276123125224845,0.244276041791451,1.13037333993066,0.25831894300396,0.766841249972654
"ENSDARG00000000161",22.2863001989718,0.837640942615127,0.542200061816621,1.54489274643135,0.122372208261173,0.587106227452529
"ENSDARG00000000183",215.47910609869,0.567221763062732,0.188807351259458,3.00423558340829,0.00266249076445763,0.0615311290935424
"ENSDARG00000000189",620.819069705942,0.0525797819665496,0.142171888686286,0.369832478504743,0.711507313969775,0.950479626809728
"ENSDARG00000000212",54472.1417532637,0.344813324409911,0.130070467015575,2.65097321722249,0.00802602056136946,0.132041563800088
"ENSDARG00000000229",172.985864037855,-0.0814838221355631,0.22200915791162,-0.367029103222856,0.713597309421024,0.95157821096128
"ENSDARG00000000241",511.449190233542,-0.431854805500191,0.157764756166574,-2.73733383801019,0.0061939401710654,0.114238610824236
"ENSDARG00000000324",179.189751392247,0.0141623609187069,0.206197755704643,0.0686833902256096,0.945241639658214,0.992706066946251
"ENSDARG00000000349",13.6578995386995,0.86981405362392,0.716688718472183,1.21365668414338,0.224878851627296,0.731932542953245
"ENSDARG00000000369",9.43959070533812,-0.042383076946964,0.868977019485631,-0.0487735302506061,0.961099776861288,NA
"ENSDARG00000000370",129.006520833067,0.619490133053518,0.250960632807829,2.46847533863165,0.0135690001510168,0.184768676917612
"ENSDARG00000000380",17.695581482726,-0.638493654324115,0.597289695632778,-1.06898488119351,0.285076482019819,0.786103920659844
"ENSDARG00000000394",2200.41651475378,-0.00605761754099435,0.0915611724486909,-0.0661592395443486,0.947251047773153,0.992978480118812
"ENSDARG00000000423",195.477813443242,-0.18634265895713,0.188820984694016,-0.986874733542448,0.323704052061987,0.810439992736898
"ENSDARG00000000442",1102.47980192551,0.0589654622770368,0.112333519273845,0.524914225586502,0.599642819781172,0.920807266898811
"ENSDARG00000000460",8.52822266110357,0.229130838495461,0.957763036484278,0.239235416034165,0.810923041830713,NA
"ENSDARG00000000472",0.840917787550721,-0.4234502342491,3.1634759582284,-0.133855998857105,0.893516444899853,NA
"ENSDARG00000000474",5.12612778660879,0.394871266508097,1.07671345623418,0.366737560696199,0.713814786364707,NA
"ENSDARG00000000476",75.8417047936895,0.242006157627571,0.349451220882324,0.692532013528336,0.488603288756242,0.885874315527816
"ENSDARG00000000489",1233.33364888202,0.0676458807753533,0.131846296650645,0.513066217965876,0.607905001380741,0.924392802283811
As it turns out, my thresholds were too restrictive: I had added two additional conditions (baseMean and lfcSE) that did not exist in my original implementation. I am receiving a populated dataframe now.
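For anyone hitting the same symptom, a quick way to see which condition is doing the filtering is to count how many rows pass each threshold on its own. This is a small sketch that reuses df_select and the threshold variables from the code above:

# count how many rows survive each threshold individually
conditions = {
    'baseMean': df_select['baseMean'] < baseMean,
    'log2FoldChange': df_select['log2FoldChange'] < log2FoldChange,
    'lfcSE': df_select['lfcSE'] < lfcSE,
    'pvalue': df_select['pvalue'] < pvalue,
    'padj': df_select['padj'] < padj,
}
for name, mask in conditions.items():
    print(f'{name}: {mask.sum()} of {len(df_select)} rows pass')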
Related
Issues in converting sas macro to pandas
I am new to pandas, and I'm learning it through its web documentation. I am facing issues in converting the following SAS code to pandas. My SAS code:

data tmp2;
set tmp1;
retain group 0;
if _n_=1 and group_v1 = -1 then group = group_v1;
else if _n_=1 and group_v1 ne -1 then group=0;
else group=group+1;
run;

Note: In the above code group_v1 is a column from tmp1.
There may be a more succinct and efficient way to do this in pandas, but this approach quite closely matches what SAS does internally when your code is run:

import pandas as pd

tmp1 = pd.DataFrame({"group_v1": [-1, 0, 1]})

def build_tmp2(tmp1):
    # Contains the new rows for tmp2
    _tmp2 = []
    # Loop over the rows of tmp1 - like a data step does
    for i, row in tmp1.iterrows():
        # equivalent to the data statement - copy the current row to memory
        tmp2 = row.copy()
        # _N_ is equivalent to i, except i starts at zero in Pandas/Python
        if i == 0:
            # Create a new variable called pdv to carry values across loops
            # This is equivalent to the Program Data Vector in SAS
            pdv = {}
            if row['group_v1'] == -1:
                pdv['group'] = row['group_v1']
            else:
                pdv['group'] = 0
        else:
            # Equivalent to both 'retain group' and 'group=group+1'
            pdv['group'] += 1
        # Copy the accumulating group variable to the target row
        tmp2['group'] = pdv['group']
        # Append the updated row to the list
        _tmp2.append(tmp2.copy())
    # After the loop has finished build the new DataFrame from the list
    return pd.DataFrame(_tmp2)

build_tmp2(tmp1)
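As the answer notes, a shorter vectorized version is possible. Here is a hedged sketch of one, assuming (as in the loop above) that only the first row's group_v1 decides the starting value and every later row simply increments the retained counter:

import numpy as np
import pandas as pd

tmp1 = pd.DataFrame({"group_v1": [-1, 0, 1]})

# start at the first row's group_v1 only if it equals -1, otherwise start at 0,
# then add one per subsequent row (the retained, incrementing counter)
start = tmp1['group_v1'].iloc[0] if tmp1['group_v1'].iloc[0] == -1 else 0
tmp2 = tmp1.assign(group=start + np.arange(len(tmp1)))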
How to filter multiple dataframes and append a string to the save filenames?
The reason I'm trying to accomplish this is to use lots of variable names to create lots of new variable names containing the names of the original variables. For example, I have several pandas data frames of inventory items in each location. I want to create new data frames containing only the negative inventory items, with '_neg' appended to the original variable names (inventory locations). I want to be able to do this with a for loop, something like this:

warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')
tank1 = pd.read_excel('tank1onhand.xls')
tank2 = pd.read_excel('tank2onhand.xls')

all_stock_sites = [warehouse, retail, shed3, tank1, tank2]
all_neg_stock_sites = []

for site in all_stock_sites:
    string_value_of_new_site = (pseudo code): 'site-->string_value_of_site' + '_neg'
    string_value_of_new_site = site[site.OnHand < 0]
    all_neg_stock_sites.append(string_value_of_new_site)

to create something like this:

# create new dataframes for each stock site's negative 'OnHand' values
warehouse_neg = warehouse[warehouse.OnHand < 0]
retail_neg = retail[retail.OnHand < 0]
shed3_neg = shed3[shed3.OnHand < 0]
tank1_neg = tank1[tank1.OnHand < 0]
tank2_neg = tank2[tank2.OnHand < 0]

without having to type out all 500 different stock site locations and appending '_neg' by hand.
My recommendation would be to not use variable names as the "keys" to the data, but rather assign them proper names, in a tuple or dict. So instead of:

warehouse = pd.read_excel('warehouse.xls')
retail = pd.read_excel('retailonhand.xls')
shed3 = pd.read_excel('shed3onhand.xls')

you would have:

sites = {}
sites['warehouse'] = pd.read_excel('warehouse.xls')
sites['retail'] = pd.read_excel('retailonhand.xls')
sites['shed3'] = pd.read_excel('shed3onhand.xls')
# ...etc

Then you could create the negative keys like so:

sites_neg = {}
for site_name, site in sites.items():
    neg_key = site_name + '_neg'
    sites_neg[neg_key] = site[site.OnHand < 0]
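If the goal (as the question title suggests) is also to save each filtered frame under a name with '_neg' appended, the same dict keys can drive the file names. A small sketch, assuming the sites_neg dict built above; the output format and paths are of course up to you:

# write each filtered frame to its own file, reusing the dict key as the file stem
for name, df_neg in sites_neg.items():
    df_neg.to_excel(f'{name}.xlsx', index=False)  # e.g. warehouse_neg.xlsx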
Use rglob from the pathlib module to create a list of existing files (see Python 3's pathlib Module: Taming the File System); .parent, .stem and .suffix give the relevant pieces of each path. Use f-strings to build the new file names (PEP 498 - Literal String Interpolation). Then iterate through each file: create a dataframe; filter the dataframe (an error will occur if the column doesn't exist, e.g. AttributeError: 'DataFrame' object has no attribute 'OnHand', so we put the code in a try-except block, and the continue statement moves on to the next iteration of the loop); check that the dataframe is not empty; if it's not empty, add the dataframe to a dictionary for additional processing if desired, and save the dataframe as a new file with _neg added to the file name.

from pathlib import Path
import pandas as pd

# set path to top file directory
d = Path(r'e:\PythonProjects\stack_overflow\stock_sites')

# get all xls files
files = list(d.rglob('*.xls'))

# create, filter and save dict of dataframes
df_dict = dict()
for file in files:
    # create dataframe
    df = pd.read_excel(file)
    try:
        # filter df and add to dict
        df = df[df.OnHand < 0]
    except AttributeError as e:
        print(f'{file} caused:\n{e}\n')
        continue
    if not df.empty:
        df_dict[f'{file.stem}_neg'] = df
        # save to new file
        new_path = file.parent / f'{file.stem}_neg{file.suffix}'
        df.to_excel(new_path, index=False)

print(df_dict.keys())
>>> dict_keys(['retailonhand_neg', 'shed3onhand_neg', 'tank1onhand_neg', 'tank2onhand_neg', 'warehouse_neg'])

# access individual dataframes as you would any dict
df_dict['retailonhand_neg']
Randomization of a list with conditions using Pandas
I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential code, and that's actually my problem. My goal is to create a somewhat-automated script, probably including a for-loop (I've tried unsuccessfully). The main aim is a randomization loop which takes the original dataset (shown as an image in the original post), picks rows from it at random and saves them one by one to another excel list. The point is that the row values in the columns position01 and position02 should never match either of those two column values in the previous pick. That should eventually create an excel sheet of randomized rows in which each row shares no values with the previous pick: row02 should not include any of the position01/position02 values of row01, row03 should not contain values of row02, etc. It should also iterate over the range of the list length, which is 0-11. The excel output is important, since I need the rest of the columns; I just need to shuffle the order. I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)

import pandas as pd
import random

dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")  # original data set used for comparisons
imageDataset = dataset.loc[0:11, :]

# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()

randomPick = imageDataset.sample()  # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick)  # append a row to empty df

randomPickIndex = randomPick.index.tolist()  # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex)  # delete the row with index selected before

# getting raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]

# getting a dataset which does not include the row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1)
                     & (imageDataset2.position02 != randomPickTemp1)
                     & (imageDataset2.position01 != randomPickTemp2)
                     & (imageDataset2.position02 != randomPickTemp2)]

# pick another row from the dataset not matching the row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in the empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in the next step
randomPick2Index = randomPick2.index.tolist()
# delete that row as well
imageDataset3 = imageDataset2.drop(index=randomPick2Index)

# AND REPEAT the comparison of the raw values with the dataset that no longer includes the earlier rows:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]

isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1)
                      & (imageDataset3.position02 != randomPickTemp1)
                      & (imageDataset3.position01 != randomPickTemp2)
                      & (imageDataset3.position02 != randomPickTemp2)]

# AND REPEAT with another pick - save - match - pick again... until the end of the dataset (which is 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186 I've just adjusted the condition in the for loop to my case, like this:

remaining = [choices[x] for x in choices
             if last['position01'] != choices[x]['position01']
             and last['position01'] != choices[x]['position02']
             and last['position02'] != choices[x]['position01']
             and last['position02'] != choices[x]['position02']]

Thank you very much for the helpful answer, and hopefully I did not spam it over here too much.
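For readers without access to the linked PsychoPy thread, here is a hedged, self-contained sketch of the same 'no consecutive match' idea; all names below are invented for the illustration, and 'trials' stands in for the rows of the original spreadsheet:

import random

# toy data: pairs of positions, one dict per trial/row
trials = [{'position01': a, 'position02': b} for a in range(4) for b in range(4) if a != b]

ordered = [trials.pop(random.randrange(len(trials)))]
while trials:
    last = ordered[-1]
    # keep only trials that share no position value with the previous pick
    remaining = [t for t in trials
                 if last['position01'] not in (t['position01'], t['position02'])
                 and last['position02'] not in (t['position01'], t['position02'])]
    if not remaining:
        break  # dead end; in practice, restart the whole shuffle
    pick = random.choice(remaining)
    trials.remove(pick)
    ordered.append(pick)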
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])

df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index=True)
df = df.drop(index=i)

while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1])
             & (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
Equivalent of arcpy.Statistics_analysis using NumPy (or other)
I am having a problem (I think memory related) when trying to do an arcpy.Statistics_analysis on an approximately 40 million row table. I am trying to count the number of non-null values in various columns of the table per category (e.g. there are x non-null values in column 1 for category A). After this, I need to join the statistics results to the input table. Is there a way of doing this using numpy (or something else)? The code I currently have is like this:

arcpy.Statistics_analysis(input_layer, output_layer,
                          "'Column1' COUNT; 'Column2' COUNT; 'Column3' COUNT",
                          "Categories")

I am very much a novice with arcpy/numpy so any help much appreciated!
You can convert a table to a numpy array using the function arcpy.da.TableToNumPyArray, and then convert the array to a pandas.DataFrame object. Here is an example of code (I assume you are working with a Feature Class because you use the term null values; if you work with a shapefile you will need to change the code, as null values are not supported and are replaced with a single-space string ' '):

import arcpy
import pandas as pd

# Change these values
gdb_path = 'path/to/your/geodatabase.gdb'
table_name = 'your_table_name'
cat_field = 'Categorie'
fields = ['Column1', 'column2', 'Column3', 'Column4']

# Do not change
null_value = -9999
input_table = gdb_path + '\\' + table_name

# Convert to pandas DataFrame
array = arcpy.da.TableToNumPyArray(input_table,
                                   [cat_field] + fields,
                                   skip_nulls=False,
                                   null_value=null_value)
df = pd.DataFrame(array)

# Count number of non-null values
not_null_count = {field: {cat: 0 for cat in df[cat_field].unique()}
                  for field in fields}
for cat in df[cat_field].unique():
    _df = df.loc[df[cat_field] == cat]
    len_cat = len(_df)
    for field in fields:
        try:
            # If your field contains integer or float values
            null_count = _df[field].value_counts()[int(null_value)]
        except IndexError:
            # If it contains text (string)
            null_count = _df[field].value_counts()[str(null_value)]
        except KeyError:
            # There is no null value
            null_count = 0
        not_null_count[field][cat] = len_cat - null_count

Concerning joining the results to the input table: without more information it's complicated to give an exact answer that will meet your expectations (there are multiple columns, so it is unclear which value you want to add).

EDIT: Here is some additional code following your clarifications:

# Create a copy of the table
copy_name = ''  # name of the copied table
copy_path = gdb_path + '\\' + copy_name
arcpy.Copy_management(input_table, copy_path)

# Dividing copy data with summary
# This step doesn't need the dict (not_null_count) to be converted to a table
with arcpy.da.UpdateCursor(copy_path, [cat_field] + fields) as cur:
    for row in cur:
        category = row[0]
        for i, fld in enumerate(fields):
            row[i+1] /= not_null_count[fld][category]
        cur.updateRow(row)

# Save the summary table as a csv file (if needed)
df_summary = pd.DataFrame(not_null_count)
df_summary.index.name = 'Food Area'  # Or any name
df_summary.to_csv('path/to/file.csv')  # Change path

# Summary to ArcMap Table (also if needed)
arcpy.TableToTable_conversion('path/to/file.csv', gdb_path, 'name_of_your_new_table')
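Once the data is in a DataFrame, the 'count non-null values per category' part can also be done more directly with groupby, and the per-category counts can then be merged back onto the rows. This is only a sketch under the assumption that the -9999 sentinel marks the nulls, as in the answer above:

import numpy as np

# turn the sentinel back into NaN, then count non-nulls per category and field
# (for text fields the sentinel may be the string '-9999'; replace that too if needed)
counts = (df[fields].replace(null_value, np.nan)
          .groupby(df[cat_field])
          .count())

# join the per-category counts back onto the original rows
df_joined = df.merge(counts, left_on=cat_field, right_index=True,
                     suffixes=('', '_nonnull_count'))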
Pandas + HDF5 Panel data storage for large data
As part of my research, I am searching for a good storage design for my panel data. I am using pandas for all in-memory operations. I've had a look at the following two questions/contributions, Large Data Work flows using Pandas and Query HDF5 Pandas, as they come closest to my set-up. However, I have a couple of questions left. First, let me define my data and some requirements:

Size: I have around 800 dates, 9000 IDs and up to 200 variables. Hence, flattening the panel (along dates and IDs) corresponds to 7.2mio rows and 200 columns. This might all fit in memory or not; let's assume it does not. Disk-space is not an issue.
Variables are typically calculated once, but updates/changes probably happen from time to time. Once updates occur, old versions don't matter anymore.
New variables are added from time to time, mostly one at a time only. New rows are not added.
Querying takes place. For example, often I need to select only a certain date range like date > start_date & date < end_date. But some queries need to consider rank conditions on dates. For example, get all data (i.e. columns) where rank(var1) > 500 & rank(var1) < 1000, where rank is as of date.

The objective is to achieve fast reading/querying of data. Data writing is not so critical.

I thought of the following HDF5 design:

1. Follow the groups_map approach (of 1) to store variables in different tables. Limit the number of columns for each group to 10 (to avoid large memory loads when updating single variables, see point 3).
2. Each group represents one table, where I use a multi-index based on dates & ids for each table stored.
3. Create an update function to update variables. The function loads the table with all (10) columns to memory as a df, deletes the table on disk, replaces the updated variable in df and saves the table from memory back to disk.
4. Create an add function: add var1 to a group with fewer than 10 columns, or create a new group if required. Saving is similar to point 3: load the current group to memory, delete the table on disk, add the new column and save it back to disk.
5. Calculate ranks as of date for relevant variables and add them to disk-storage as rank_var1, which should reduce the as-of-date query to simply rank_var1 > 500 & rank_var1 < 1000.

I have the following questions:

Updating an HDFTable, I suppose I have to delete the entire table in order to update a single column?
When to use 'data_columns', or should I simply assign True in HDFStore.append()?
If I want to query based on the condition rank_var1 > 500 & rank_var1 < 1000, but I need columns from other groups: can I enter the index received from the rank_var1 condition into the query to get other columns based on this index (the index is a multi-index with date and ID)? Or would I need to loop this index by date and then chunk the IDs similar to what is proposed in 2 and repeat the procedure for each group where I need columns? Alternatively, (a) I could add rank columns to each group's table, but that seems extremely inefficient in terms of disk-storage. Note, the number of variables where rank filtering is relevant is limited (say 5). Or (b) I could simply use the df_rank received from the rank_var1 query and use in-memory operations via df_rank.merge(df_tmp, left_index=True, right_index=True, how='left'), looping through the groups (df_tmp) where I select the desired columns.
Say I have some data in different frequencies. Having different group_maps (or different storages) for different frequencies is the way to go, I suppose?
Copies of the storage might be used on win/ux systems.
I assume it is perfectly compatible, anything to consider here?
I plan to use pd.HDFStore(str(self.path), mode='a', complevel=9, complib='blosc'). Any concerns regarding complevel or complib?
I've started to write up some code; once I have something to show I'll edit and add it if desired. Please let me know if you need any more information.

EDIT: Here is a first version of my storage class; please adjust the path at the bottom accordingly. Sorry for the length of the code, comments welcome.

import pandas as pd
import numpy as np
import string


class LargeDFStorage():
    # TODO add index features to ensure correct indexes
    # index_names = ('date', 'id')

    def __init__(self, h5_path, groups_map):
        """
        Parameters
        ----------
        h5_path: str
            hdf5 storage path
        groups_map: dict
            where keys are group_names and values are dicts, with at least
            the key 'columns' whose value is a list of column names.
            A special group_name is reserved for group_name/key "query",
            which can be used as the querying and conditioning table when
            getting data, see :meth:`.get`.
        """
        self.path = str(h5_path)
        self.groups_map = groups_map
        self.column_map = self._get_column_map()
        # if desired make part of arguments
        self.complib = 'blosc'
        self.complevel = 9

    def _get_column_map(self):
        """ Calc the inverse of the groups_map / ensures uniqueness of cols

        Returns
        -------
        dict: with cols as keys and group_names as values
        """
        column_map = dict()
        for g, value in self.groups_map.items():
            if len(set(column_map.keys()) & set(value['columns'])) > 0:
                raise ValueError('Columns have to be unique')
            for col in value['columns']:
                column_map[col] = g
        return column_map

    @staticmethod
    def group_col_names(store, group_name):
        """ Returns all column names of a specific group

        Parameters
        ----------
        store: pd.HDFStore
        group_name: str

        Returns
        -------
        list: of all column names in the group
        """
        if group_name not in store:
            return []
        # hack to get column names, straightforward way!?
        return store.select(group_name, start=0, stop=0).columns.tolist()

    @staticmethod
    def stored_cols(store):
        """ Collects all columns stored in the HDF5 store

        Parameters
        ----------
        store: pd.HDFStore

        Returns
        -------
        list: a list of all columns currently in the store
        """
        stored_cols = list()
        for x in store.items():
            group_name = x[0][1:]
            stored_cols += LargeDFStorage.group_col_names(store, group_name)
        return stored_cols

    def _find_groups(self, columns):
        """ Searches all groups required for covering columns

        Parameters
        ----------
        columns: list
            list of valid columns

        Returns
        -------
        list: of unique groups
        """
        groups = list()
        for column in columns:
            groups.append(self.column_map[column])
        return list(set(groups))

    def add_columns(self, df):
        """ Adds columns to storage for the first time. If columns should
        be updated, use :meth:`.update` instead.

        Parameters
        ----------
        df: pandas.DataFrame
            with new columns (not yet stored in any of the tables)
        """
        store = pd.HDFStore(self.path, mode='a',
                            complevel=self.complevel, complib=self.complib)
        # check if any column has been stored already
        if df.columns.isin(self.stored_cols(store)).any():
            store.close()
            raise ValueError('Some cols are already in the store')
        # find all groups needed to store the data
        groups = self._find_groups(df.columns)
        for group in groups:
            v = self.groups_map[group]
            # select columns of current group in df
            select_cols = df.columns[df.columns.isin(v['columns'])].tolist()
            tmp = df.reindex(columns=select_cols, copy=False)
            # set data columns only in case of query data
            dc = None
            if group == 'query':
                dc = True
            stored_cols = self.group_col_names(store, group)
            # no columns in group (group does not exist yet)
            if len(stored_cols) == 0:
                store.append(group, tmp, data_columns=dc)
            else:
                # load current disk data to memory
                df_grp = store.get(group)
                # remove data from disk
                store.remove(group)
                # add new column(s) to df_disk
                df_grp = df_grp.merge(tmp, left_index=True, right_index=True,
                                      how='left')
                # save old data with new, additional columns
                store.append(group, df_grp, data_columns=dc)
        store.close()

    def _query_table(self, store, columns, where):
        """ Selects data from table 'query' and uses the where expression

        Parameters
        ----------
        store: pd.HDFStore
        columns: list
            desired data columns
        where: str
            a valid select expression
        """
        query_cols = self.group_col_names(store, 'query')
        if len(query_cols) == 0:
            store.close()
            raise ValueError('No data to query table')
        get_cols = list(set(query_cols) & set(columns))
        if len(get_cols) == 0:
            # load only one column to minimize memory usage
            df_query = store.select('query', columns=query_cols[0],
                                    where=where)
            add_query = False
        else:
            # load columns which are needed anyway
            df_query = store.select('query', columns=get_cols, where=where)
            add_query = True
        return df_query, add_query

    def get(self, columns, where=None):
        """ Retrieve data from storage

        Parameters
        ----------
        columns: list/str
            list of columns to use, or 'all' if all columns should be
            retrieved
        where: str
            a valid select statement

        Returns
        -------
        pandas.DataFrame
            with all requested columns, considering where
        """
        store = pd.HDFStore(str(self.path), mode='r')
        # get all columns stored in HDFStorage
        stored_cols = self.stored_cols(store)
        if columns == 'all':
            columns = stored_cols
        # check if all desired columns can be found in storage
        if len(set(columns) - set(stored_cols)) > 0:
            store.close()
            raise ValueError('Column(s): {} not in storage'.format(
                set(columns) - set(stored_cols)))
        # get all relevant groups (where columns are taken from)
        groups = self._find_groups(columns)
        # if a where query is defined retrieve data from storage, eventually
        # only the index of df_query might be used
        if where is not None:
            df_query, add_df_query = self._query_table(store, columns, where)
        else:
            df_query, add_df_query = None, False
        # df collector
        df = list()
        for group in groups:
            # skip in case where was used and columns were taken from query
            if where is not None and group == 'query':
                continue
            # all columns which are in the group but also requested
            get_cols = list(
                set(self.group_col_names(store, group)) & set(columns))
            tmp_df = store.select(group, columns=get_cols)
            if df_query is None:
                df.append(tmp_df)
            else:
                # align query index with df index from storage
                df_query, tmp_df = df_query.align(tmp_df, join='left', axis=0)
                df.append(tmp_df)
        store.close()
        # if any data of query should be added
        if add_df_query:
            df.append(df_query)
        # combine all columns
        df = pd.concat(df, axis=1)
        return df

    def update(self, df):
        """ Updates data in storage; all columns have to be stored already
        in order to be accepted for updating (for new columns use
        :meth:`.add_columns` instead)

        Parameters
        ----------
        df: pd.DataFrame
            with index as in storage, and columns as desired
        """
        store = pd.HDFStore(self.path, mode='a',
                            complevel=self.complevel, complib=self.complib)
        # check if all columns have been stored already
        if not df.columns.isin(self.stored_cols(store)).all():
            store.close()
            raise ValueError('Some cols have not been stored yet')
        # find all groups needed to store the data
        groups = self._find_groups(df.columns)
        for group in groups:
            dc = None
            if group == 'query':
                dc = True
            # load current disk data to memory
            group_df = store.get(group)
            # remove data from disk
            store.remove(group)
            # update with new data
            group_df.update(df)
            # save updated df back to disk
            store.append(group, group_df, data_columns=dc)
        store.close()


class DataGenerator():
    np.random.seed(1282)

    @staticmethod
    def get_df(rows=100, cols=10, freq='M'):
        """ Simulate data frame """
        if cols < 26:
            col_name = list(string.ascii_lowercase[:cols])
        else:
            col_name = range(cols)
        if rows > 2000:
            freq = 'Min'
        index = pd.date_range('19870825', periods=rows, freq=freq)
        df = pd.DataFrame(np.random.standard_normal((rows, cols)),
                          columns=col_name, index=index)
        df.index.name = 'date'
        df.columns.name = 'ID'
        return df

    @staticmethod
    def get_panel(rows=1000, cols=500, items=10):
        """ Simulate panel data """
        if items < 26:
            item_names = list(string.ascii_lowercase[:items])
        else:
            item_names = range(items)
        panel_ = dict()
        for item in item_names:
            panel_[item] = DataGenerator.get_df(rows=rows, cols=cols)
        return pd.Panel(panel_)


def main():
    # Example with a DataFrame
    path = 'D:\\fc_storage.h5'
    groups_map = dict(
        a=dict(columns=['a', 'b', 'c', 'd', 'k']),
        query=dict(columns=['e', 'f', 'g', 'rank_a']),
    )
    storage = LargeDFStorage(path, groups_map=groups_map)
    df = DataGenerator.get_df(rows=200000, cols=15)
    storage.add_columns(df[['a', 'b', 'c', 'e', 'f']])
    storage.update(df[['a']]*3)
    storage.add_columns(df[['d', 'g']])
    print(storage.get(columns=['a', 'b', 'f'], where='f<0 & e<0'))

    # Example with panel and rank condition
    path2 = 'D:\\panel_storage.h5'
    storage_pnl = LargeDFStorage(path2, groups_map=groups_map)
    panel = DataGenerator.get_panel(rows=800, cols=2000, items=24)
    df = panel.to_frame()
    df['rank_a'] = df[['a']].groupby(level='date').rank()
    storage_pnl.add_columns(df[['a', 'b', 'c', 'e', 'f']])
    storage_pnl.update(df[['a']]*3)
    storage_pnl.add_columns(df[['d', 'g', 'rank_a']])
    print(storage_pnl.get(columns=['a', 'b', 'e', 'f', 'rank_a'],
                          where='f>0 & e>0 & rank_a <100'))


if __name__ == '__main__':
    main()
It's a bit difficult to answer those questions without particular examples...

Updating an HDFTable, I suppose I have to delete the entire table in order to update a single column?

AFAIK yes, unless you are storing single columns separately, but it will be done automatically, you just have to write your DF/Panel back to the HDF Store.

When to use 'data_columns', or should I simply assign True in HDFStore.append()?

data_columns=True will index all your columns - IMO it's a waste of resources unless you are going to use all columns in the where parameter (i.e. if all columns should be indexed). I would specify there only those columns that will be used often for searching in the where= clause. Consider those columns as indexed columns in a database table.

If I want to query based on the condition rank_var1 > 500 & rank_var1 < 1000, but I need columns from other groups, can I enter the index received from the rank_var1 condition into the query to get other columns based on this index (the index is a multi-index with date and ID)?

I think we would need some reproducible sample data and examples of your queries in order to give a reasonable answer...

Copies of the storage might be used on win/ux systems. I assume it is perfectly compatible, anything to consider here?

Yes, it should be fully compatible.

I plan to use pd.HDFStore(str(self.path), mode='a', complevel=9, complib='blosc'). Any concerns regarding complevel or complib?

Test it with your data - results might depend on dtypes, number of unique values, etc. You may also want to consider the lzo complib - it might be faster in some use-cases. Check this. Sometimes a high complevel doesn't give you a better compression ratio, but will be slower (see the results of my old comparison).
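To make the data_columns point concrete, here is a small sketch; the file name, group name and columns are invented for the illustration. Only the columns listed in data_columns become usable in a where= clause, which keeps the on-disk indexes small:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2016-01-01', periods=1000),
                   'rank_a': np.random.randint(0, 2000, size=1000),
                   'value': np.random.randn(1000)})

with pd.HDFStore('demo_storage.h5', mode='w', complevel=9, complib='blosc') as store:
    # index only the columns that will appear in where= queries
    store.append('grp', df, data_columns=['date', 'rank_a'])
    # fast selection on an indexed column; 'value' itself is not queryable here
    subset = store.select('grp', where='rank_a > 500 & rank_a < 1000')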