Pandas dataframe to Object instances array efficiency for bulk DB insert - python

I have a Pandas dataframe in the form of:
Time Temperature Voltage Current
0.0 7.8 14 56
0.1 7.9 12 58
0.2 7.6 15 55
... So on for a few hundred thousand rows...
I need to bulk insert the data into a PostgreSQL database, as fast as possible. This is for a Django project, and I'm currently using the ORM for DB operations and building queries, but open to suggestions if there are more efficient ways to accomplish the task.
My data model looks like this:
class Data(models.Model):
    time = models.DateTimeField(db_index=True)
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    parameter_value = models.FloatField()
So Time is row[0] of the DataFrame, and then for each header column, I grab the value that corresponds to it, using the header as parameter. So row[0] of the example table would generate 3 Data objects in my database:
Data(time=0.0, parameter="Temperature", parameter_value=7.8)
Data(time=0.0, parameter="Voltage", parameter_value=14)
Data(time=0.0, parameter="Current", parameter_value=56)
Our application allows the user to parse data files that are measured in milliseconds. So we generate a LOT of individual data objects from a single file. My current task is to improve the parser to make it much more efficient, until we hit I/O constraints on a hardware level.
My current solution is to go through each row, create one Data object for each row on time + parameter + value and append said object to an array so I can Data.objects.bulk_create(all_data_objects) through Django. Of course I am aware that this is inefficient and could probably be improved a lot.
Using this code:
# Convert DataFrame to a list of per-row dicts
df_records = df.to_dict('records')

# Start with an empty data array
all_data_objects = []

# Go through each row, creating objects and appending them to the data array
for row in df_records:
    for parameter, parameter_value in row.items():
        if parameter != "Time":
            all_data_objects.append(Data(
                time=row["Time"],
                parameter_value=parameter_value,
                parameter=parameter))

# Commit data to the Postgres DB
Data.objects.bulk_create(all_data_objects)
Currently, just generating the Data objects array (i.e., excluding the DB insert itself, which writes to disk) takes around 370 seconds for a 55 MB file that produces about 6 million individual Data objects. The df_records = df.to_dict('records') line alone takes about 83 seconds. Times were measured using time.time() at both ends of each section and taking the difference.
How can I improve these times?

If you really need a fast solution, I suggest you dump the table directly using pandas.
First let's create the data for your example:
import pandas as pd
data = {
'Time': {0: 0.0, 1: 0.1, 2: 0.2},
'Temperature': {0: 7.8, 1: 7.9, 2: 7.6},
'Voltage': {0: 14, 1: 12, 2: 15},
'Current': {0: 56, 1: 58, 2: 55}
}
df = pd.DataFrame(data)
Now you should transform the dataframe so that it has the desired columns, using melt:
df = df.melt(["Time"], var_name="parameter", value_name="parameter_value")
At this point you should map the parameter names to their foreign-key ids. I will use params as an example:
params = {"Temperature": 1, "Voltage": 2, "Current": 3}
df["parameter"] = df["parameter"].map(params)
At this point the dataframe will look like:
Time parameter parameter_value
0 0.0 1 7.8
1 0.1 1 7.9
2 0.2 1 7.6
3 0.0 2 14.0
4 0.1 2 12.0
5 0.2 2 15.0
6 0.0 3 56.0
7 0.1 3 58.0
8 0.2 3 55.0
And now to export using pandas you can use:
import sqlalchemy as sa
engine = sa.create_engine("use your connection data")
df.to_sql(name="my_table", con=engine, if_exists="append", index=False)
However, when I used that it was not fast enough to meet our requirements, so I suggest you use cursor.copy_from instead, since it is faster:
from io import StringIO

output = StringIO()
df.to_csv(output, sep=';', header=False, index=False, columns=df.columns)
# jump to the start of the stream
output.seek(0)

# Insert df into Postgres
connection = engine.raw_connection()
with connection.cursor() as cursor:
    cursor.copy_from(output, "my_table", sep=';', null="NULL", columns=df.columns)
connection.commit()
We tried this with a few million rows and it was the fastest method we found when using PostgreSQL.
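As a hedged sketch, the same approach can be wrapped in a reusable function using cursor.copy_expert instead of copy_from (copy_expert takes the full COPY statement, which also works for schema-qualified table names where copy_from can fail; the table name and connection object here are assumptions):

```python
from io import StringIO

def copy_df(df, connection, table="my_table"):
    # Serialize the frame to an in-memory CSV buffer
    buf = StringIO()
    df.to_csv(buf, sep=';', header=False, index=False)
    buf.seek(0)
    # COPY ... FROM STDIN streams the buffer straight into Postgres
    cols = ', '.join(df.columns)
    sql = "COPY %s (%s) FROM STDIN WITH (FORMAT csv, DELIMITER ';')" % (table, cols)
    with connection.cursor() as cursor:
        cursor.copy_expert(sql, buf)
    connection.commit()
```

Called as copy_df(df, engine.raw_connection()), this mirrors the snippet above but keeps the COPY statement explicit.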

You don't need to create a Data object for every row. SQLAlchemy also supports bulk insert in this way:
data.insert().values([
    dict(time=0.0, parameter="Temperature", parameter_value=7.8),
    dict(time=0.0, parameter="Voltage", parameter_value=14)
])
See https://docs.sqlalchemy.org/en/13/core/dml.html?highlight=insert%20values#sqlalchemy.sql.expression.ValuesBase.values for more details.
If you only need to insert the data, you don't need pandas and can use a different parser for your data file (or write your own, depending on its format). Also, it would probably make sense to split the dataset into smaller parts and parallelize the insert commands.
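As a sketch of the splitting idea (the chunk size of 10,000 is an arbitrary assumption; note that Django's bulk_create also accepts a batch_size argument that does this splitting for you):

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# e.g. for batch in chunked(all_data_objects, 10000):
#          Data.objects.bulk_create(batch)
# or simply: Data.objects.bulk_create(all_data_objects, batch_size=10000)
```

Each batch can then be handed to a separate worker or inserted sequentially in fixed-size statements.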

Related

Convert win32com.client Range to Pandas Dataframe?

I am writing some macros that call Python code to perform operations on ranges in Excel. It is much easier to do a lot of the required operations with pandas in Python. Because I want to do this while the spreadsheet is open (and may not have been saved), I am using the win32com.client to read in a range of cells to convert to a Pandas dataframe. However, this is extremely slow, presumably because the way I calculate it is very inefficient:
import datetime
import pytz
import pandas
import time
import win32com.client
def range_to_table(excelRange, tsy, tsx, height, width, add_cell_refs=True):
    # Read the header row
    ii = 0
    keys = []
    while ii < width:
        keys.append(str(excelRange[ii]))
        ii += 1
    colnumbers = {key: jj + tsx for jj, key in enumerate(keys)}
    keys.append('rownumber')
    mydict = {key: [] for key in keys}
    # Read the body cells one at a time (this is the slow part)
    while ii < width * height:
        mydict[keys[ii % width]].append(excelRange[ii].value)
        ii += 1
    for yy in range(tsy + 1, tsy + 1 + height - 1):  # add 1 to not include header
        mydict['rownumber'].append(yy)
    return (mydict, colnumbers)
ExcelApp = win32com.client.GetActiveObject('Excel.Application')
wb = ExcelApp.Workbooks('myworkbook.xlsm')
sheet_num = [sheet.Name for sheet in wb.Sheets].index("myworksheet name") + 1
ws = wb.Worksheets(sheet_num)
height = int(ws.Cells(1, 3)) # obtain table height from formula in excel spreadsheet
width = int(ws.Cells(1, 2)) # obtain table width from formula in excel spreadsheet
myrange = ws.Range(ws.Cells(2, 1), ws.Cells(2 + height - 1, 1 + width - 1))
df, colnumbers = range_to_table(myrange, 1, 1, height, width)
df = pandas.DataFrame.from_dict(df)
This works, but the range_to_table function I wrote is extremely slow for large tables since it iterates over each cell one by one.
I suspect there is probably a much better way to convert the Excel Range object to a Pandas dataframe. Do you know of a better way?
Here is a simplified example of what my range would look like:
The height and width variables in the code are just taken from cells immediately above the table:
Any ideas here, or am I just going to have to save the workbook and use Pandas to read in the table from the saved file?
There are two parts to the operation: defining the spreadsheet range and then getting the data into Python. Here is the test data that I'm working with:
1. Defining the range: Excel has a feature called Dynamic Ranges. This allows you to give a name to a range whose extent is variable.
I've set up a dynamic range called 'DynRange', and you can see that it uses the row and column counts from $C$1 and $C$2 to define the size of the array.
Once you have this definition, the range can be used by Name in Python, and saves you having to access the row and column count explicitly.
2. Using this range in Python via win32.com: Once you have defined the name in Excel, handling it in Python is much easier.
import win32com.client as wc
import pandas as pd
#Create a dispatch interface
xl = wc.gencache.EnsureDispatch('Excel.Application')
filepath = 'SomeFilePath\\TestBook.xlsx'
#Open the workbook
wb = xl.Workbooks.Open(filepath)
#Get the Worksheet by name
ws = wb.Sheets('Sheet1')
#Use the Value property to get all the data in the range
listVals = ws.Range('DynRange').Value
#Construct the dataframe, using first row as headers
df = pd.DataFrame(listVals[1:],columns=listVals[0])
#Optionally process the datetime value to avoid tz warnings
df['Datetime'] = df['Datetime'].dt.tz_convert(None)
print(df)
wb.Close()
Output:
Datetime Principal Source Amt Cost Basis
0 2021-04-21 04:59:00 -5.0 1.001 5.0
1 2021-04-25 15:16:00 -348.26 1.001 10.0
2 2021-04-29 11:04:00 0.0 1.001 5.0
3 2021-04-29 21:26:00 0.0 1.001 5.0
4 2021-04-29 23:39:00 0.0 1.001 5.0
5 2021-05-02 14:00:00 -2488.4 1.001 5.0
As the OP suspects, iterating over the range cell-by-cell performs slowly. The COM infrastructure has to do a good deal of processing to pass data from one process (Excel) to another (Python). This is known as 'marshalling'. Most of the time is spent packing up the variables on one side and unpacking on the other. It is much more efficient to marshal the entire contents of an Excel Range in one go (as a 2D array) and Excel allows this by exposing the Value property on the Range as a whole, rather than by cell.
You can try using multiprocessing for this. You could have each worker scan a different column, for example, or even do the same on the rows.
Minor changes to your code are needed:
Create a function that iterates over the columns and stores the information in a dict.
Use the simple multiprocessing example at https://pymotw.com/2/multiprocessing/basics.html.
Create a function that merges all the different dicts created by each worker into a single one.
That should divide your compute time by roughly the number of workers used.
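A minimal sketch of the per-column worker idea, assuming the cell values have already been pulled out of Excel into plain Python lists (the function names are illustrative, the worker here is just a pass-through, and on Windows the Pool code must run under an if __name__ == '__main__' guard):

```python
from multiprocessing import Pool

def process_column(args):
    # Worker: process one column's raw cells (pass-through placeholder)
    name, cells = args
    return name, [c for c in cells]

def process_columns_parallel(columns, workers=2):
    # Fan the columns out to the workers, then merge results into one dict
    with Pool(workers) as pool:
        results = pool.map(process_column, columns.items())
    return dict(results)
```

Note that the COM objects themselves cannot be shared across processes, so the marshalling from Excel still has to happen once, up front.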

How to read through large csv or database and join columns when memory is an issue?

I have a large dataset that I pulled from Data.Medicare.gov (https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6)
It's a CSV of all physicians (2.4 million rows by 41 columns, 750 MB); let's call this physician_df. However, I cannot load it into memory on my computer (memory error).
I have another df loaded in memory (summary_df) and I want to join columns (NPI, Last Name, First Name) from physician_df.
Is there any way to do this without having to load the data to memory? I first attempted by using their API but I get capped out (I have about 500k rows in my final df and this will always be changing). Would storing the physician_df into a SQL database make this easier?
Here are snippets of each df (fyi, the summary_df is all fake information).
summary_df
DOS Readmit SurgeonNPI
1-1-2018 1 1184809691
2-2-2018 0 1184809691
2-5-2017 1 1093707960
physician_df
NPI PAC ID Professional Enrollment LastName FirstName
1184809691 2668563156 I20120119000086 GOLDMAN SALUJA
1184809691 4688750714 I20080416000055 NOLTE KIMBERLY
1093707960 7618879354 I20040127000771 KHANDUJA KARAMJIT
Final df:
DOS Readmit SurgeonNPI LastName FirstName
1-1-2018 1 1184809691 GOLDMAN SALUJA
2-2-2018 0 1184809691 GOLDMAN SALUJA
2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If I could load the physician_df then I would use the below code..
pandas.merge(summary_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
For your desired output, you only need 3 columns from physician_df. It is more likely that 2.4 million rows of 3 columns will fit in memory than 5 columns (or, of course, all 41).
So I would first try extracting what you need from a 3-column dataset, convert to a dictionary, then use it to map required columns.
Note, to produce your desired output, it is necessary to drop duplicates (keeping first) from physicians_df, so I have included this logic.
import pandas as pd
from operator import itemgetter as iget

d = pd.read_csv('physicians.csv', usecols=['NPI', 'LastName', 'FirstName'])\
      .drop_duplicates('NPI')\
      .set_index('NPI')[['LastName', 'FirstName']]\
      .to_dict(orient='index')
# {1093707960: {'FirstName': 'KARAMJIT', 'LastName': 'KHANDUJA'},
# 1184809691: {'FirstName': 'SALUJA', 'LastName': 'GOLDMAN'}}
df_summary['LastName'] = df_summary['SurgeonNPI'].map(d).map(iget('LastName'))
df_summary['FirstName'] = df_summary['SurgeonNPI'].map(d).map(iget('FirstName'))
# DOS Readmit SurgeonNPI LastName FirstName
# 0 1-1-2018 1 1184809691 GOLDMAN SALUJA
# 1 2-2-2018 0 1184809691 GOLDMAN SALUJA
# 2 2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If your final dataframe is too large to store in memory, then I would consider these options:
Chunking: split your dataframe into small chunks and output as you go along.
PyTables: based on numpy + HDF5.
dask.dataframe: based on pandas and uses out-of-core processing.
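As an example of the chunking option, a read-and-merge sketch (the inline CSV is a tiny stand-in for the real 750 MB physician file; the inner merge per chunk is correct provided duplicates are dropped first so each NPI appears in at most one chunk):

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for the physician file
physician_csv = StringIO(
    "NPI,PAC ID,LastName,FirstName\n"
    "1184809691,2668563156,GOLDMAN,SALUJA\n"
    "1093707960,7618879354,KHANDUJA,KARAMJIT\n"
)
summary_df = pd.DataFrame({'SurgeonNPI': [1184809691, 1093707960]})

pieces = []
reader = pd.read_csv(physician_csv, usecols=['NPI', 'LastName', 'FirstName'],
                     chunksize=1)
for chunk in reader:
    # Only summary rows whose NPI appears in this chunk survive the inner merge
    pieces.append(summary_df.merge(chunk, how='inner',
                                   left_on='SurgeonNPI', right_on='NPI'))
final_df = pd.concat(pieces, ignore_index=True)
```

With the real file you would raise chunksize to something like 100,000 rows, so that only one chunk is in memory at a time.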
I would try to import the data into a database and do the joins there (e.g. Postgres if you want a relational DB; there are pretty nice ORMs for it, like peewee). Maybe you can then use SQL operations to get a subset of the data you are most interested in, export it, and process it using Pandas. Also, take a look at Ibis for working with databases directly, another project that Wes McKinney, the author of Pandas, worked on.
It would be great to use Pandas with an on-disk storage system, but as far as I know that's not an entirely solved problem yet. There's PyTables (a bit more on using PyTables with Pandas here), but it doesn't support joins in the same SQL-like way that Pandas does.
Sampling!
import pandas as pd
import random

n = int(2.4E6)          # total number of rows
n_sample = int(2.4E5)   # sample size
filename = "https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6"
# skip a random subset of data rows, keeping the header (row 0)
skip = sorted(random.sample(range(1, n + 1), n - n_sample))
physician_df = pd.read_csv(filename, skiprows=skip)
Then this should work fine
summary_sample_df = summary_df[summary_df.SurgeonNPI.isin(physician_df.NPI)]
merge_sample_df = pd.merge(summary_sample_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
Pickle your merge_sample_df. Sample again. Wash, rinse, repeat to desired confidence.

How to Import multiple CSV files then make a Master Table?

I am a research chemist and have carried out a measurement where I record 'signal intensity' vs 'mass-to-charge (m/z)' . I have repeated this experiment 15x, by changing a specific parameter (Collision Energy). As a result, I have 15 CSV files and would like to align/join them within the same range of m/z values and same interval values. Due to the instrument thresholding rules, certain m/z values were not recorded, thus I have files that cannot simply be exported into excel and copy/pasted. The data looks a bit like the tables posted below
Dataset 1:  x  | y      Dataset 2:  x  | y
           ---------               ---------
           0.0 | 5                 0.0 | 2
           0.5 | 3                 0.5 | 6
           2.0 | 7                 1.0 | 9
           3.0 | 1                 2.5 | 1
                                   3.0 | 4
Using matlab I started with this code:
%% Create a table for the set m/z range with an interval of 0.1 Da
mzrange = 50:0.1:620;
mzrange = mzrange';
mzrange = array2table(mzrange,'VariableNames',{'XThompsons'});
Then I manually imported 1 X/Y CSV (Xtitle=XThompson, Ytitle=YCounts) to align with the specified m/z range.
%% Join/merge the two tables using a common Key variable 'XThompson' (m/z value)
mzspectrum = outerjoin(mzrange,ReserpineCE00,'MergeKeys',true);
% Replace all NaN values with zero
mzspectrum.YCounts(isnan(mzspectrum.YCounts)) = 0;
At this point I am stuck, because repeating this process with a separate file will overwrite my YCounts column. The title of the YCounts column doesn't matter to me as I can change it later; however, I would like the table to continue like this:
XThompson | YCounts_1 | YCounts_2 | YCounts_3 | etc...
--------------------------------------------------------
How can I carry this out in Matlab so that it is at least semi-automated? I posted earlier describing a similar scenario, but it turned out that approach could not do what I need. I must admit my mind is not that of a programmer, so I have been struggling with this problem quite a bit.
PS- Is this problem best executed in Matlab or Python?
I don't know or use Matlab, so my answer is purely Python-based. I think Python and Matlab should be equally well suited to reading CSV files and generating a master table.
Please consider this answer more as a pointer to how to address the problem in Python.
In python one would typically address this problem using the pandas package. This package provides "high-performance, easy-to-use data structures and data analysis tools" and can read natively a large set of file formats including CSV files. A master table from two CSV files "foo.csv" and "bar.csv" could be generated e.g. as follows:
import pandas as pd
df = pd.read_csv('foo.csv')
df2 = pd.read_csv('bar.csv')
master_table = pd.concat([df, df2])
Pandas further allows you to group and structure the data in many ways. The pandas documentation has very good descriptions of its various features.
One can install pandas with the python package installer pip:
sudo pip install pandas
if on Linux or OSX.
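Since the datasets here share an x column rather than stacking cleanly, a merge-based sketch may fit the alignment problem better than concat (the column names follow the example tables; the YCounts_* suffixes and values are illustrative):

```python
import pandas as pd
from functools import reduce

# Two of the 15 datasets, aligned on the shared x (m/z) column
d1 = pd.DataFrame({'x': [0.0, 0.5, 2.0], 'YCounts_1': [5, 3, 7]})
d2 = pd.DataFrame({'x': [0.0, 0.5, 1.0], 'YCounts_2': [2, 6, 9]})

# Outer merges keep every x seen in any file; missing readings become 0
master = reduce(lambda a, b: pd.merge(a, b, on='x', how='outer'), [d1, d2])
master = master.sort_values('x').fillna(0).reset_index(drop=True)
```

With all 15 files in the list, the reduce call folds them into one master table with one YCounts_* column per file, which mirrors the XThompson | YCounts_1 | YCounts_2 | ... layout asked for.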
The counts from the different analyses should be named differently, i.e., YCounts_1, YCounts_2, and YCounts_3 from analyses 1, 2, and 3, respectively, in the different datasets before joining them. However, the M/Z name (i.e., XThompson) should be the same since this is the key that will be used to join the datasets. The code below is for MATLAB.
This step is not needed (it just recreates your tables), and I copied dataset2 to create dataset3 for illustration. You could use readtable to import your data, i.e., imported_data = readtable('filename');
dataset1 = table([0.0; 0.5; 2.0; 3.0], [5; 3; 7; 1], 'VariableNames', {'XThompson', 'YCounts_1'});
dataset2 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_2'});
dataset3 = table([0.0; 0.5; 1.0; 2.5; 3.0], [2; 6; 9; 1; 4], 'VariableNames', {'XThompson', 'YCounts_3'});
Merge tables using outerjoin. You could use loop if you have many datasets.
combined_dataset = outerjoin(dataset1,dataset2, 'MergeKeys', true);
Add dataset3 to the combined_dataset
combined_dataset = outerjoin(combined_dataset,dataset3, 'MergeKeys', true);
You could export the combined data as Excel Sheet by using writetable
writetable(combined_dataset, 'joined_icp_ms_data.xlsx');
I managed to create a solution to my problem based on learning from everyone's input and taking an online Matlab course. I am not a natural coder, so my script is not as elegant as those of the geniuses here, but hopefully it is clear enough for other non-programming scientists to use.
Here's the result that works for me:
% Reads a directory containing *.csv files and corrects the x-axis to an evenly spaced (0.1 unit) interval.

% Create a matrix with the input x range, then convert it to a table
prompt = 'Input recorded min/max data range separated by space \n(ex. 1 to 100 = 1 100): ';
inputrange = input(prompt,'s');
min_max = str2num(inputrange);
datarange = (min_max(1):0.1:min_max(2))';
datarange = array2table(datarange,'VariableNames',{'XAxis'});

files = dir('*.csv');
for q = 1:length(files)
    % Extract each XY pair from the csvread cell and convert it to an array, then back to a table.
    data{q} = csvread(files(q).name,2,1);
    data1 = data(q);
    data2 = cell2mat(data1);
    data3 = array2table(data2,'VariableNames',{'XAxis','YAxis'});

    % Join the datarange table and the intensity table to obtain an evenly spaced m/z range
    data3 = outerjoin(datarange,data3,'MergeKeys',true);
    data3.YAxis(isnan(data3.YAxis)) = 0;
    data3.XAxis = round(data3.XAxis,1);

    % Remove duplicate values
    data4 = sortrows(data3,[1 -2]);
    [~, idx] = unique(data4.XAxis);
    data4 = data4(idx,:);

    % Save the file under the same name as CSV, without underscores or dashes
    filename = files(q).name;
    filename = strrep(filename,'_','');
    filename = strrep(filename,'-','');
    filename = strrep(filename,'.csv','');
    writetable(data4,filename,'FileType','text');
    clear data data1 data2 data3 data4 filename
end
clear
clear

Compute values from sequential pandas rows

I'm a python novice trying to preprocess timeseries data so that I can compute some changes as an object moves over a series of nodes and edges so that I can count stops, aggregate them into routes, and understand behavior over the route. Data originally comes in the form of two CSV files (entrance, Typedoc = 0 and clearance, Typedoc = 1, each about 85k rows / 19MB) that I merged into 1 file and performed some dimensionality reduction. I've managed to get it into a multi-index dataframe. Here's a snippet:
In [1]: movements.head()
Out[1]:
Typedoc Port NRT GRT Draft
Vessname ECDate
400 L 2012-01-19 0 2394 2328 7762 4.166667
2012-07-22 1 2394 2328 7762 17.000000
2012-10-29 0 2395 2328 7762 6.000000
A 397 2012-05-27 1 3315 2928 2928 18.833333
2012-06-01 0 3315 2928 2928 5.250000
I'm interested in understanding the changes for each level as it traverses through its timeseries. I'm going to represent this as a graph eventually. I think I'd really like this data in dictionary form where each entry for a unique Vessname is essentially a tokenized string of stops along the route:
stops_dict = {'400 L': [
    ['2012-01-19', 0, 2394, 4.166667],
    ['2012-07-22', 1, 2394, 17.000000],
    ['2012-10-29', 0, 2395, 6.000000]
]}
Where the nested list values are:
[ECDate, Typedoc, Port, Draft]
If i = 0, then the values I'm interested in are the Dwell and Transit times and the Draft Change, calculated as:
t_dwell = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
d_draft = stops_dict['400 L'][i+1][3] - stops_dict['400 L'][i][3]
i += 1
and
t_transit = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
assuming all of the dtypes are correct (a big if, since I have not mastered getting pandas to parse my dates). I'm then going to extract the links in some form like:
link = str(stops_dict['400 L'][i][2])+'->'+str(stops_dict['400 L'][i+1][2]),t_transit,d_draft
The t_transit and d_draft values as edge weights. The nodes are list of unique Port values that get assigned the '400 L':[t_dwell,NRT,GRT] k,v pairs (somehow). I haven't figured that out exactly, but I don't think I need help with that process.
I couldn't figure out a simpler way, so I've tried defining a function that required starting over by writing my sorted dataframe out and reading it back in using:
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile, delimiter=",")
    next(datareader, None)  # skip the header
    <FLOW CONTROL>  # based on Typedoc and ECDate values
The function adds to an empty dictionary:
stops_dict = {}

def createStopsDict(row):
    # reads each row of a csv file;
    # creates a dict entry keyed on row[0] (Vessname) if not in the dict,
    # or appends everything after row[0] to the entry if Vessname is in the dict
    ves = row[0]
    if ves in stops_dict:
        stops_dict[ves].append(row[1:])
    else:
        stops_dict[ves] = [row[1:]]
    return
This is an inefficient way of doing things...
I could possibly be using iterrows instead of a csv reader...
I've looked into melt and unstack and I don't think those are correct...
This seems essentially like a groupby effort, but I haven't managed to implement that correctly because of the multi-index...
Is there a simpler, dare I say 'elegant', way to map the dataframe rows based on the multi index value directly into a reusable data structure (right now the dictionary stop_dict).
I'm not tied to the dictionary or its structure, so if there's a better way I am open to suggestions.
Thanks!
UPDATE 2:
I think I have this mostly figured out...
Beginning with my original data frame movements:
movements.reset_index().apply(
    lambda x: makeRoute(x.Vessname,
                        [x.ECDate,
                         x.Typedoc,
                         x.Port,
                         x.NRT,
                         x.GRT,
                         x.Draft]),
    axis=1
)
where:
routemap = {}

def makeRoute(Vessname, info):
    if Vessname in routemap:
        route = routemap[Vessname]
        route.append(info)
    else:
        routemap[Vessname] = [info]
    return
returns a dictionary keyed to Vessname in the structure I need to compute things by calling list elements.
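For what it's worth, the same dictionary can also be built without an explicit callback by grouping on Vessname. A sketch on a tiny stand-in frame (column names follow the question; the data is a two-row example):

```python
import pandas as pd

df = pd.DataFrame({
    'Vessname': ['400 L', '400 L'],
    'ECDate': ['2012-01-19', '2012-07-22'],
    'Typedoc': [0, 1],
    'Port': [2394, 2394],
    'Draft': [4.166667, 17.0],
})

# One dict entry per vessel, each a list of [ECDate, Typedoc, Port, Draft] stops
routemap = (
    df.groupby('Vessname')[['ECDate', 'Typedoc', 'Port', 'Draft']]
      .apply(lambda g: g.values.tolist())
      .to_dict()
)
```

On the real frame you would call movements.reset_index() first, as in the update above, so Vessname is an ordinary column.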

Pandas bottleneck, quicker way of slicing?

On an 8-core, 14 GB instance, a similar job took ~2 weeks to complete and cost a chunk of change, hence any help with a speed-up will be greatly appreciated.
I have an SQL table with ~ 6.6 million rows, two columns, integers in each. The integers denote pandas data-frame locations (bear with me, populating these frame locations is purely to take off some processing time, really not making a dent though):
The integers go up to 26000, and for every integer we look forward in ranges 5-250:
it_starts it_ends
1 5
2 6
...
25,996 26000
...
...
1 6
2 7
...
25,995 26000
...
...
1 7
2 8
...
25,994 26000
If that's not clear enough, the tables were populated with something like this:
chunks = range(5, 250)
for chunk_size in chunks:
    x = np.array(range(1, len(df) - chunk_size))
    y = [k + chunk_size for k in x]
    rng_tup = zip(x, y)
    # do table inserts
I use this table, as I have said, to take slices from a pandas dataframe, with the following:
rngs = c.execute("SELECT it_starts,it_ends FROM rngtab").fetchall()
for k in rngs:
    it_sts = k[0]
    it_end = k[1]
    fms = df_frame[it_sts:it_end]
Where I have used the following pandas code for 'df_frame', and db is the database in question:
with db:
    sqla = "SELECT ctime, Date, Time, high, low FROM quote_meta"
    df = psql.read_sql(sqla, db)
    df_frame = df.iloc[:, [0, 1, 2, 3, 4]]
Hence, putting it all together for clarity:
import sqlite3
import pandas.io.sql as psql
import pandas as pd

db = sqlite3.connect("HIST.db")
c = db.cursor()
c.execute("PRAGMA synchronous = OFF")
c.execute("PRAGMA journal_mode = OFF")

with db:
    sqla = "SELECT ctime, Date, Time, high, low FROM quote_meta"
    df = psql.read_sql(sqla, db)
    df_frame = df.iloc[:, [0, 1, 2, 3, 4]]

rngs = c.execute("SELECT it_starts,it_ends FROM rngtab").fetchall()
for k in rngs:
    it_sts = k[0]
    it_end = k[1]
    fms = df_frame[it_sts:it_end]
    for i in range(len(fms)):
        # perform trivial (ish) operations on the slice;
        # trivial compared to the overall time, that is.
        pass
So as you probably guessed, df_frame[it_sts:it_end] is causing a massive bottleneck, as it needs to create ~6 million slices (times 40 separate databases in total). Hence I think it's wise to invest a little time in asking the question before I throw more money at it: am I making any cardinal errors here? Is there anything anyone can suggest as a speed-up? Thanks.
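One possible speed-up sketch for these fixed-width slices: NumPy's sliding_window_view (NumPy 1.20+) builds every window of a given chunk_size as views into a single array, without materializing millions of Python-level slices. The array below is a tiny stand-in for one numeric column of df_frame:

```python
import numpy as np

arr = np.arange(20.0)   # stand-in for one numeric column of df_frame
chunk_size = 5

# One (n - chunk_size + 1) x chunk_size array of views; no data is copied
windows = np.lib.stride_tricks.sliding_window_view(arr, chunk_size)

# The per-slice operation can then be vectorized across axis 1, e.g.:
ranges = windows.max(axis=1) - windows.min(axis=1)
```

Whether this applies depends on what the trivial per-slice operation actually is, but anything expressible as a reduction over axis 1 avoids the Python loop entirely.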
