On an 8-core, 14 GB instance, a similar job took ~2 weeks to complete and cost a chunk of change, so any help with a speed-up will be greatly appreciated.
I have an SQL table with ~6.6 million rows and two integer columns. The integers denote pandas dataframe locations (bear with me; pre-populating these frame locations is purely to shave off some processing time, though it's really not making a dent):
The integers go up to 26,000, and for every starting integer we look forward over window sizes of 5 to 250:
it_starts  it_ends
1          5
2          6
...
25996      26000

1          6
2          7
...
25995      26000

1          7
2          8
...
25994      26000
If that's not clear enough, the table was populated with something like this:
import numpy as np

chunks = range(5, 250)
for chunk_size in chunks:
    x = np.array(range(1, len(df) - chunk_size))
    y = [k + chunk_size for k in x]
    rng_tup = zip(x, y)
    # do table inserts
I use this table, as I have said, to take slices from a pandas dataframe, with the following:
rngs = c.execute("SELECT it_starts, it_ends FROM rngtab").fetchall()

for k in rngs:
    it_sts = k[0]
    it_end = k[1]
    fms = df_frame[it_sts:it_end]
Where I have used the following pandas code for 'df_frame', and db is the database in question:
with db:
    sqla = "SELECT ctime, Date, Time, high, low FROM quote_meta"
    df = psql.read_sql(sqla, db)
    df_frame = df.iloc[:, [0, 1, 2, 3, 4]]
Hence, putting it all together for clarity:
import sqlite3
import pandas.io.sql as psql
import pandas as pd

db = sqlite3.connect("HIST.db")
c = db.cursor()
c.execute("PRAGMA synchronous = OFF")
c.execute("PRAGMA journal_mode = OFF")

with db:
    sqla = "SELECT ctime, Date, Time, high, low FROM quote_meta"
    df = psql.read_sql(sqla, db)
    df_frame = df.iloc[:, [0, 1, 2, 3, 4]]

rngs = c.execute("SELECT it_starts, it_ends FROM rngtab").fetchall()

for k in rngs:
    it_sts = k[0]
    it_end = k[1]
    fms = df_frame[it_sts:it_end]
    for i in xrange(0, len(fms)):
        # perform trivial(ish) operations on the slice
        # (trivial compared to the overall time, that is)
        pass
So as you probably guessed, 'df_frame[it_sts:it_end]' is causing a massive bottleneck, as it needs to create ~6 million slices (times 40 separate databases in total). Hence I think it's wise to invest a little time in asking the question before I throw more money at it: am I making any cardinal errors here? Is there anything anyone can suggest as a speed-up? Thanks.
I have a data table which has two columns, TimeIn and TimeOut, as shown below.
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
I need to create a third column named 'Counter' which updates in such a way that when the TimeIn of the ith occurrence is later than the TimeOut of the (i-1)th, that counter remains the same; otherwise it increases by 1. Consider it as people assigned to a task: if a person is free after their time out, then they can take up the next job. Also, if at a particular instant more than one counter is free, I need to take the one which became free first, so the above table would look like this:
TimeIn TimeOut Counter
01:23AM 01:45AM 1
01:34AM 01:53AM 2
01:43AM 01:59AM 3
02:01AM 02:09AM 1 (in this case 1,2,3 all are also free but 1 became free first)
02:34AM 03:11AM 2 (in this case 1,2,3 all are also free but 2 became free first)
02:39AM 02:48AM 3 (in this case 1 is also free but 3 became free first)
02:56AM 03:12AM 1 (in this case 3 is also free but 1 became free first)
I was hoping there could be a way in pandas to do it without a loop, since my dataset could be large, but please let me know even if it can only be achieved efficiently with a loop; that would be fine as well.
Many thanks in advance.
I couldn't figure out an efficient way with native pandas methods. But if I'm not completely mistaken, a heap queue seems to be an adequate tool for the problem.
With
df =
TimeIn TimeOut
0 01:23AM 01:45AM
1 01:34AM 01:53AM
2 01:43AM 01:59AM
3 02:01AM 02:09AM
4 02:34AM 03:11AM
5 02:39AM 02:48AM
6 02:56AM 03:12AM
and
for col in ("TimeIn", "TimeOut"):
    df[col] = pd.to_datetime(df[col])
this
from heapq import heappush, heappop

w_count = 1                         # number of workers created so far
counter = [1]                       # result column
heap = []                           # (time_out, worker) pairs, ordered by time_out
w_time_out, w = df.TimeOut[0], 1    # worker that will become free next

for time_in, time_out in zip(
    df.TimeIn.tolist()[1:], df.TimeOut.tolist()[1:]
):
    if time_in > w_time_out:
        # the tracked worker is free again: assign it, push its new time_out,
        # and track the next worker to become free
        heappush(heap, (time_out, w))
        counter.append(w)
        w_time_out, w = heappop(heap)
    else:
        # no worker is free yet: create a new one
        w_count += 1
        counter.append(w_count)
        if time_out > w_time_out:
            heappush(heap, (time_out, w_count))
        else:
            heappush(heap, (w_time_out, w))
            w_time_out, w = time_out, w_count
produces the counter-list
[1, 2, 3, 1, 2, 3, 1]
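If you want the result as a column on the dataframe (a small usage note, not part of the original answer), something like this should do:

df["Counter"] = counter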
Regarding your input data: you don't have complete timestamps, so pd.to_datetime uses the current day as the date part. If the range of your times isn't contained within one day, you'll run into trouble.
EDIT: Fixed a mistake in the last else-branch.
For the sake of completeness, I'm including a pandas/numpy based solution. Performance is roughly 3x better (I saw 12s vs 34s for 10 million records) than the heapq based one, but the implementation is significantly harder to follow. Unless you really need the performance, I'd recommend @Timus's solution.
The idea here is:
We identify sessions where we have to increment the counter. We can immediately assign counter values to these sessions.
For the remaining sessions, we create a sequence of sessions that the same worker handles. We can then map any session to a "root session" where the worker was created.
To accomplish step (2):
We get two lists of session IDs, one sorted by start time and the other by end time.
Pair each session start with the least recent session end. This corresponds to the earliest available worker taking on the next incoming request.
Work up the tree to map any given session to the first session handled by that worker.
# setup
from io import StringIO

import numpy as np
import pandas as pd
text = StringIO(
"""
TimeIn TimeOut
01:23AM 01:45AM
01:34AM 01:53AM
01:43AM 01:59AM
02:01AM 02:09AM
02:34AM 03:11AM
02:39AM 02:48AM
02:56AM 03:12AM
""".strip()
)
sessions = pd.read_csv(text, sep=" ", parse_dates=["TimeIn", "TimeOut"])
# transform the data from wide format to long format
# event_log has the following columns:
# - Session: corresponding to the index of the input data
# - EventType: either TimeIn or TimeOut
# - EventTime: the event's time value
event_log = pd.melt(
sessions.rename_axis(index="Session").reset_index(),
id_vars=["Session"],
value_vars=["TimeIn", "TimeOut"],
var_name="EventType",
value_name="EventTime",
)
# sort the entire log by time
event_log.sort_values("EventTime", inplace=True, kind="mergesort")
# concurrency is the number of active workers at the time of that log entry
concurrency = event_log["EventType"].replace({"TimeIn": 1, "TimeOut": -1}).cumsum()
# new workers occur when the running maximum concurrency increases
new_worker = concurrency.cummax().diff().astype(bool)
new_worker_sessions = event_log.loc[new_worker, "Session"]
root_session = np.empty_like(sessions.index)
root_session[new_worker_sessions] = new_worker_sessions
# we could use the `sessions` DataFrame to avoid searching, but we'd need to sort on TimeOut
new_session = event_log.query("~@new_worker & (EventType == 'TimeIn')")["Session"]
old_session = event_log.query("~@new_worker & (EventType == 'TimeOut')")["Session"]
# Pair each session start with the session that ended least recently
root_session[new_session] = old_session[: new_session.shape[0]]
# Find the root session
# maybe something can be optimized here?
while not np.array_equal((_root_session := root_session.take(root_session)), root_session):
root_session = _root_session
counter = np.empty_like(root_session)
counter[new_worker_sessions] = np.arange(start=1, stop=new_worker_sessions.shape[0] + 1)
sessions["Counter"] = counter.take(root_session)
Quick bit of code to generate more fake data:
N = 10 ** 6
start = pd.Timestamp("2021-08-12T01:23:00")
_base = pd.date_range(start=start, periods=N, freq=pd.Timedelta(1, "seconds"))
time_in = (
_base.values
+ np.random.exponential(1000, size=N).astype("timedelta64[ms]")
+ np.random.exponential(10000, size=N).astype("timedelta64[ns]")
+ np.timedelta64(1, "ms")
)
time_out = (
time_in
+ np.random.exponential(10, size=N).astype("timedelta64[s]")
+ np.random.exponential(1000, size=N).astype("timedelta64[ms]")
+ np.random.exponential(10000, size=N).astype("timedelta64[ns]")
+ np.timedelta64(1, "s")
)
sessions = (
pd.DataFrame({"TimeIn": time_in, "TimeOut": time_out})
.sort_values("TimeIn")
.reset_index(drop=True)
)
Preamble:
I am working with L2 tick data.
The bid/offer will not necessarily be balanced in terms of number of levels
The number of levels could range from 0 to 20.
I want to save the full book to disk every time it is updated
I believe I want to use a numpy array so that I can use h5py/vaex to perform offline data processing.
I'll ideally be writing (appending) to disk every x updates or on a timer.
If we assume an example book looks like this:
array([datetime.datetime(2017, 11, 6, 14, 57, 8, 532152), # book creation time
array(['20171106-14:57:08.528', '20171106-14:57:08.428'], dtype='<U21'), # quote entry (bid)
array([1.30699, 1.30698]), # quote price (bid)
array([100000., 250000.]), # quote size (bid)
array(['20171106-14:57:08.528'], dtype='<U21'), # quote entry (offer)
array([1.30709]), # quote price (offer)
array([100000.])], # quote size (offer)
dtype=object)
Numpy doesn't like the jaggedness of the array, and whilst I'm happy (enough) to use np.pad to pad the times/prices/sizes to a length of 20, I don't think I want to be creating an array for the book creation time.
Could/should I be going about this differently? Ultimately I'll want to do asof joins against a list of trades, hence I'd like a column-store approach. How is everyone else doing this? Are they storing multiple rows, or multiple columns?
EDIT:
I want to be able to do something like:
with h5py.File("foo.h5", "w") as f:
    f.create_dataset("book", data=my_np_array)  # create_dataset needs a name; "book" is just a placeholder
and then later perform an asof join between my hdf5 tickdata and a dataframe of trades.
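For reference, the kind of asof join pandas offers looks roughly like this; a minimal sketch where 'trades', 'quotes' and their column names are hypothetical placeholders, not from the original setup:

import pandas as pd

# hypothetical frames; both must be sorted by the join key
trades = pd.DataFrame({"time": pd.to_datetime(["2017-11-06 14:57:09"]),
                       "qty": [100000.0]})
quotes = pd.DataFrame({"time": pd.to_datetime(["2017-11-06 14:57:08.532152"]),
                       "bid_px": [1.30699], "ask_px": [1.30709]})

# for each trade, take the most recent quote at or before its timestamp
joined = pd.merge_asof(trades, quotes, on="time")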
EDIT2:
In KDB the entry would look like:
q)t:([]time:2017.11.06D14:57:08.528;sym:`EURUSD;bid_time:enlist 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428;bid_px:enlist 1.30699, 1.30698;bid_size:enlist 100000. 250000.;ask_time:enlist 2017.11.06T14:57:08.528;ask_px:enlist 1.30709;ask_size:enlist 100000.)
q)t
time sym bid_time bid_px bid_size ask_time ask_px ask_size
-----------------------------------------------------------------------------------------------------------------------------------------------------------
2017.11.06D14:57:08.528000000 EURUSD 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428 1.30699 1.30698 100000 250000 2017.11.06T14:57:08.528 1.30709 100000
q)first t
time | 2017.11.06D14:57:08.528000000
sym | `EURUSD
bid_time| 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428
bid_px | 1.30699 1.30698
bid_size| 100000 250000f
ask_time| 2017.11.06T14:57:08.528
ask_px | 1.30709
ask_size| 100000f
EDIT3:
Should I just give up on the idea of a nested column and have 120 columns (20 * (bid_times + bid_prices + bid_sizes + ask_times + ask_prices + ask_sizes))? Seems excessive, and unwieldy to work with...
For anyone stumbling across this ~2 years later: I have recently revisited this code and have swapped out h5py for pyarrow + Parquet.
This means I can create a schema with nested columns and read that back into a pandas DataFrame with ease:
import pyarrow as pa
schema = pa.schema([
("Time", pa.timestamp("ns")),
("Symbol", pa.string()),
("BidTimes", pa.list_(pa.timestamp("ns"))),
("BidPrices", pa.list_(pa.float64())),
("BidSizes", pa.list_(pa.float64())),
("BidProviders", pa.list_(pa.string())),
("AskTimes", pa.list_(pa.timestamp("ns"))),
("AskPrices", pa.list_(pa.float64())),
("AskSizes", pa.list_(pa.float64())),
("AskProviders", pa.list_(pa.string())),
])
In terms of streaming the data to disk, I use pq.ParquetWriter.write_table, keeping track of open file handles (one per Symbol) so that I can append to the file, only closing (and thus writing the metadata) when I'm done.
Rather than streaming pyarrow tables, I stream regular Python dictionaries, creating a pandas DataFrame when I hit a given size (e.g. 1024 rows), which I then pass to the ParquetWriter to write out.
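A minimal sketch of that writer pattern, assuming the schema above (the file name, batch size and buffering details here are illustrative, not the exact production code):

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

writer = pq.ParquetWriter("EURUSD.parquet", schema)  # one writer per symbol
buffer = []  # per-update dicts whose keys match the schema fields

def flush(rows):
    # convert the buffered dicts to a DataFrame, then to an Arrow table
    table = pa.Table.from_pandas(pd.DataFrame(rows), schema=schema, preserve_index=False)
    writer.write_table(table)

# ... append dicts to `buffer` as book updates arrive, then e.g.:
# if len(buffer) >= 1024:
#     flush(buffer)
#     buffer = []
# and when finished, writer.close() writes the Parquet footer/metadata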
I have a Pandas dataframe in the form of:
Time Temperature Voltage Current
0.0 7.8 14 56
0.1 7.9 12 58
0.2 7.6 15 55
... So on for a few hundred thousand rows...
I need to bulk insert the data into a PostgreSQL database, as fast as possible. This is for a Django project, and I'm currently using the ORM for DB operations and building queries, but open to suggestions if there are more efficient ways to accomplish the task.
My data model looks like this:
class Data(models.Model):
    time = models.DateTimeField(db_index=True)
    parameter = models.ForeignKey(Parameter, on_delete=models.CASCADE)
    parameter_value = models.FloatField()
So Time is row[0] of the DataFrame, and then for each header column, I grab the value that corresponds to it, using the header as parameter. So row[0] of the example table would generate 3 Data objects in my database:
Data(time=0.0, parameter="Temperature", parameter_value=7.8)
Data(time=0.0, parameter="Voltage", parameter_value=14)
Data(time=0.0, parameter="Current", parameter_value=56)
Our application allows the user to parse data files that are measured in milliseconds. So we generate a LOT of individual data objects from a single file. My current task is to improve the parser to make it much more efficient, until we hit I/O constraints on a hardware level.
My current solution is to go through each row, create one Data object per time + parameter + value combination, and append said object to an array so I can Data.objects.bulk_create(all_data_objects) through Django. Of course I am aware that this is inefficient and could probably be improved a lot.
Using this code:
# Convert DataFrame to dict
df_records = df.to_dict('records')
# Start empty dta array
all_data_objects = []
# Go through each row creating objects and appending to data array
for row in df_records:
for parameter, parameter_value in row.items():
if parameter != "Time":
all_data_objects.append(Data(
time=row["Time"],
parameter_value=parameter_value,
parameter=parameter))
# Commit data to Postgres DB
Data.objects.bulk_create(all_data)
Currently the entire operation, excluding the DB insert itself (i.e. just generating the Data objects array), takes around 370 seconds for a 55 MB file that produces about 6 million individual Data objects. The df_records = df.to_dict('records') line alone takes about 83 seconds. Times were measured using time.time() at both ends of each section and calculating the difference.
How can I improve these times?
If you really need a fast solution, I suggest you dump the table directly using pandas.
First let's create the data for your example:
import pandas as pd
data = {
'Time': {0: 0.0, 1: 0.1, 2: 0.2},
'Temperature': {0: 7.8, 1: 7.9, 2: 7.6},
'Voltage': {0: 14, 1: 12, 2: 15},
'Current': {0: 56, 1: 58, 2: 55}
}
df = pd.DataFrame(data)
Now you should transform the dataframe so that you have the desired columns with melt:
df = df.melt(["Time"], var_name="parameter", value_name="parameter_value")
At this point you should map the parameter names to their foreign key ids. I will use params as an example:
params = {"Temperature": 1, "Voltage": 2, "Current": 3}
df["parameter"] = df["parameter"].map(params)
At this point the dataframe will look like:
Time parameter parameter_value
0 0.0 1 7.8
1 0.1 1 7.9
2 0.2 1 7.6
3 0.0 2 14.0
4 0.1 2 12.0
5 0.2 2 15.0
6 0.0 3 56.0
7 0.1 3 58.0
8 0.2 3 55.0
And now to export using pandas you can use:
import sqlalchemy as sa
engine = sa.create_engine("use your connection data")
df.to_sql(name="my_table", con=engine, if_exists="append", index=False)
However, when I used that it was not fast enough to meet our requirements, so I suggest you use cursor.copy_from instead, since it is faster:
from io import StringIO

output = StringIO()
df.to_csv(output, sep=';', header=False, index=False, columns=df.columns)

# jump to the start of the stream
output.seek(0)

# Insert df into the Postgres table
connection = engine.raw_connection()
with connection.cursor() as cursor:
    cursor.copy_from(output, "my_table", sep=';', null="NULL", columns=df.columns)
connection.commit()
We tried this with a few million rows and it was the fastest way when using PostgreSQL.
You don't need to create a Data object for every row. SQLAlchemy also supports bulk insert in this way:
data.insert().values([
    dict(time=0.0, parameter="Temperature", parameter_value=7.8),
    dict(time=0.0, parameter="Voltage", parameter_value=14)
])
See https://docs.sqlalchemy.org/en/13/core/dml.html?highlight=insert%20values#sqlalchemy.sql.expression.ValuesBase.values for more details.
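A minimal usage sketch of executing that bulk insert; the table definition, name, and column types below are illustrative placeholders rather than the actual Django-generated schema:

import sqlalchemy as sa

engine = sa.create_engine("use your connection data")

# illustrative table definition; in practice this would match the Django table
data = sa.Table(
    "my_table", sa.MetaData(),
    sa.Column("time", sa.Float),
    sa.Column("parameter", sa.String),
    sa.Column("parameter_value", sa.Float),
)

stmt = data.insert().values([
    dict(time=0.0, parameter="Temperature", parameter_value=7.8),
    dict(time=0.0, parameter="Voltage", parameter_value=14),
])

# engine.begin() opens a transaction and commits it on successful exit
with engine.begin() as conn:
    conn.execute(stmt)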
If you only need to insert the data, you don't need pandas and can use a different parser for your data file (or write your own, depending on the format of your data file). Also, it would probably make sense to split the dataset into smaller parts and parallelize the insert command (a rough sketch of the splitting idea follows).
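As a rough sketch of that splitting idea, reusing the melted df and engine from the earlier answer (the chunk count is arbitrary; each chunk could also be handed to a separate worker process):

import numpy as np

# split the frame into roughly equal chunks and insert each one separately
n_chunks = 8  # arbitrary; tune to taste, or map the chunks over a process pool
for chunk in np.array_split(df, n_chunks):
    chunk.to_sql(name="my_table", con=engine, if_exists="append", index=False)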
I have a large dataset that I pulled from Data.Medicare.gov (https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6)
It's a CSV of all physicians (2.4 million rows by 41 columns, 750 MB); let's call this physician_df. However, I cannot load it into memory on my computer (memory error).
I have another df loaded in memory (summary_df) and I want to join columns (NPI, Last Name, First Name) from physician_df.
Is there any way to do this without having to load the data into memory? I first attempted to use their API, but I get capped out (I have about 500k rows in my final df and this will always be changing). Would storing physician_df in a SQL database make this easier?
Here are snippets of each df (fyi, the summary_df is all fake information).
summary_df
DOS Readmit SurgeonNPI
1-1-2018 1 1184809691
2-2-2018 0 1184809691
2-5-2017 1 1093707960
physician_df
NPI PAC ID Professional Enrollment LastName FirstName
1184809691 2668563156 I20120119000086 GOLDMAN SALUJA
1184809691 4688750714 I20080416000055 NOLTE KIMBERLY
1093707960 7618879354 I20040127000771 KHANDUJA KARAMJIT
Final df:
DOS Readmit SurgeonNPI LastName FirstName
1-1-2018 1 1184809691 GOLDMAN SALUJA
2-2-2018 0 1184809691 GOLDMAN SALUJA
2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If I could load the physician_df, then I would use the code below:
pandas.merge(summary_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
For your desired output, you only need 3 columns from physician_df. It is more likely that 2.4 million rows of 3 columns can fit in memory than 5 columns (or, of course, all 41).
So I would first try extracting what you need from a 3-column dataset, convert to a dictionary, then use it to map required columns.
Note, to produce your desired output, it is necessary to drop duplicates (keeping the first) from physician_df, so I have included this logic.
from operator import itemgetter as iget
d = pd.read_csv('physicians.csv', usecols=['NPI', 'LastName', 'FirstName'])\
.drop_duplicates('NPI')\
.set_index('NPI')[['LastName', 'FirstName']]\
.to_dict(orient='index')
# {1093707960: {'FirstName': 'KARAMJIT', 'LastName': 'KHANDUJA'},
# 1184809691: {'FirstName': 'SALUJA', 'LastName': 'GOLDMAN'}}
df_summary['LastName'] = df_summary['SurgeonNPI'].map(d).map(iget('LastName'))
df_summary['FirstName'] = df_summary['SurgeonNPI'].map(d).map(iget('FirstName'))
# DOS Readmit SurgeonNPI LastName FirstName
# 0 1-1-2018 1 1184809691 GOLDMAN SALUJA
# 1 2-2-2018 0 1184809691 GOLDMAN SALUJA
# 2 2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If your final dataframe is too large to store in memory, then I would consider these options:
Chunking: split your dataframe into small chunks and output as you go along (a rough sketch follows this list).
PyTables: based on numpy + HDF5.
dask.dataframe: based on pandas and uses out-of-core processing.
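A rough sketch of the chunking option, assuming a local copy of the file and the column names used above (the file name and chunk size are placeholders):

import pandas as pd

pieces = []
for chunk in pd.read_csv('physicians.csv',
                         usecols=['NPI', 'LastName', 'FirstName'],
                         chunksize=100000):
    # keep only the physicians that actually appear in summary_df
    pieces.append(chunk[chunk['NPI'].isin(summary_df['SurgeonNPI'])])

physician_small = pd.concat(pieces).drop_duplicates('NPI')
final_df = summary_df.merge(physician_small, how='left',
                            left_on='SurgeonNPI', right_on='NPI')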
I would try to import the data into a database and do the joins there (e.g. Postgres if you want a relational DB – there are pretty nice ORMs for it, like peewee). Maybe you can then use SQL operations to get a subset of the data you are most interested in, export it, and process it using pandas. Also, take a look at Ibis for working with databases directly – another project Wes McKinney, the author of pandas, worked on.
It would be great to use Pandas with an on-disk storage system, but as far as I know that's not an entirely solved problem yet. There's PyTables (a bit more on using PyTables with Pandas here), but it doesn't support joins in the same SQL-like way that Pandas does.
Sampling!
import pandas as pd
import random
n = int(2.4E6)          # total rows in the file (~2.4 million)
n_sample = int(2.4E5)   # rows to keep in the sample
filename = "https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6"
skip = sorted(random.sample(range(1, n + 1), n - n_sample))  # keep the header row
physician_df = pd.read_csv(filename, skiprows=skip)
Then this should work fine
summary_sample_df = summary_df[summary_df.SurgeonNPI.isin(physician_df.NPI)]
merge_sample_df = pd.merge(summary_sample_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
Pickle your merge_sample_df. Sample again. Wash, rinse, repeat to desired confidence.
I have a Python program to process big data on one computer (16 CPU cores). Because the data is getting bigger and bigger, I need it to run on 5 computers. I am new to Spark and still feel confused after reading some docs. I would appreciate it if anyone could tell me the best way to set up a small cluster.
Here are some details:
The program counts the trade volume at every price for each stock (one day at a time) from tick-transaction pandas dataframe data.
There are more than 3,000 stocks and about 1 billion transactions in one day. The size of one day's data file (dataframe) is between 1 and 2 GB.
Getting the results for 300 days currently takes about 3 days on one computer; I hope to add 4 more computers to shorten the time.
Here is the sample code in Python:
import os
import multiprocessing as mp

import pandas as pd
import sharedmem


def ticks_to_priceline(day=None):
    # file name for the tick dataframe file, one day per file
    fn = get_tick_dataframe_filename_byday(day)
    with pd.HDFStore(fn, 'r') as tick_store:
        tick_dataframe = tick_store.select("tick")

    all_stock_symbols = tick_dataframe.symbol.drop_duplicates()
    sblist = []

    # cut into small chunks
    chunk = 300
    for xx in range(len(all_stock_symbols) // chunk + 1):
        sblist.append(all_stock_symbols[xx * chunk:(xx + 1) * chunk])

    # run with all cpus
    with sharedmem.MapReduce(np=mp.cpu_count()) as pool:

        def work(chunk_list):
            result = {}
            for symbol in chunk_list:
                data = tick_dataframe[tick_dataframe.symbol == symbol]
                if not data.empty and len(data) > 99:
                    df1 = data.loc[:, [u'timestamp', u'price', u'volume']]
                    df1['vol_diff'] = df1.volume.diff().fillna(0)
                    df2 = df1.loc[:, ['price', 'vol_diff']]
                    df2.price = df2.price.apply(int)
                    rs = df2.groupby('price').sum()
                    rs = rs.sort_index(ascending=False).reset_index()
                    result[symbol] = rs
            return result

        rslist = pool.map(work, sblist)

    return rslist
I have already set up a Spark cluster in standalone mode for testing. My main problem is how to rewrite the code above.
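For illustration only, a rough sketch of what the per-price aggregation might look like in PySpark; this assumes the tick data has first been exported to a format Spark reads natively (e.g. Parquet), and the path and column names are placeholders mirroring the pandas code above:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("priceline").getOrCreate()

# hypothetical: one day's ticks exported from the HDF5 store to Parquet
ticks = spark.read.parquet("ticks_oneday.parquet")

# per-symbol volume difference between consecutive ticks, ordered by time
w = Window.partitionBy("symbol").orderBy("timestamp")
priceline = (
    ticks
    .withColumn("vol_diff",
                F.coalesce(F.col("volume") - F.lag("volume").over(w), F.lit(0)))
    .withColumn("price", F.col("price").cast("int"))
    .groupBy("symbol", "price")
    .agg(F.sum("vol_diff").alias("volume"))
    .orderBy("symbol", F.col("price").desc())
)
priceline.write.parquet("priceline_oneday.parquet")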