I've got some data (NOAA-provided weather forecasts) I'm trying to work with. There are various data series (temperature, humidity, etc.), each of which contains a series of data points and indexes into an array of datetimes, on various time scales (some series are hourly, others 3-hourly, some daily). Is there any sort of library for dealing with data like this and accessing it in a user-friendly way?
Ideal usage would be something like:
db = TimeData()
db.set_val('2010-12-01 12:00','temp',34)
db.set_val('2010-12-01 15:00','temp',37)
db.set_val('2010-12-01 12:00','wind',5)
db.set_val('2010-12-01 13:00','wind',6)
db.query('2010-12-01 13:00') # {'wind':6, 'temp':34}
Basically the query would return the most recent value of each series.
I looked at scikits.timeseries, but it isn't very amenable to this use case, due to the amount of pre-computation involved (it expects all the data in one shot, no random-access setting).
If your data is sorted you can use the bisect module to quickly get the entry with the greatest time less than or equal to the specified time.
Something like:
from bisect import bisect_right

# times must be sorted; values[k] is the reading taken at times[k]
i = bisect_right(times, time)
# times[j] <= time for j < i
# times[j] >  time for j >= i
if i == 0:
    raise ValueError("requested time precedes the first sample")
if times[i - 1] == time:
    # exact match
    value = values[i - 1]
elif i < len(times):
    # between two samples: midpoint shown; weight by the time gap for true interpolation
    value = (values[i - 1] + values[i]) / 2
else:
    # after the last sample: fall back to the most recent value
    value = values[-1]
SQLite has built-in date and time functions (dates are stored as TEXT, REAL, or INTEGER). You can also convert all the times to seconds since the epoch (e.g. time.strptime() followed by calendar.timegm() for UTC timestamps), which makes comparisons trivial.
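For illustration, a small sketch of that epoch conversion, assuming the timestamps are UTC strings in the question's 'YYYY-MM-DD HH:MM' form:

import calendar
import time

def to_epoch(ts):
    # Parse the timestamp string and convert it to seconds since the epoch (UTC).
    return calendar.timegm(time.strptime(ts, '%Y-%m-%d %H:%M'))

to_epoch('2010-12-01 13:00')  # 1291208400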
This is a classic row-to-column problem; in a good SQL DBMS you can use unions:
SELECT MAX(d_t) AS d_t, SUM(temp) AS temp, SUM(wind) AS wind, ... FROM (
    SELECT d_t, 0 AS temp, value AS wind FROM table
        WHERE type='wind' AND d_t <= some_date
        ORDER BY d_t DESC LIMIT 1
    UNION
    SELECT d_t, value, 0 FROM table
        WHERE type='temp' AND d_t <= some_date
        ORDER BY d_t DESC LIMIT 1
    UNION
    ...
) q1;
The trick is to make a subquery for each dimension while providing placeholder columns for the other dimensions. In Python you can use SQLAlchemy to dynamically generate a query like this.
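For illustration, a minimal sketch of building such a query dynamically; it uses plain string construction (rather than SQLAlchemy) to stay self-contained, and it assumes an SQLite table named readings with columns (d_t, type, value) whose series names come from trusted configuration, not user input:

def latest_values_sql(series_names):
    # Build one UNION branch per series, with 0 placeholders for the other series.
    branches = []
    for name in series_names:
        cols = ["d_t"] + [
            ("value AS %s" % other) if other == name else ("0 AS %s" % other)
            for other in series_names
        ]
        # Wrap ORDER BY/LIMIT in a subquery so the UNION stays valid in SQLite.
        branches.append(
            "SELECT * FROM (SELECT %s FROM readings "
            "WHERE type = '%s' AND d_t <= :as_of "
            "ORDER BY d_t DESC LIMIT 1)" % (", ".join(cols), name)
        )
    outer = ["MAX(d_t) AS d_t"] + [
        "SUM(%s) AS %s" % (n, n) for n in series_names
    ]
    return "SELECT %s FROM (%s) q1" % (", ".join(outer), " UNION ALL ".join(branches))

# Example usage with sqlite3:
# sql = latest_values_sql(['temp', 'wind'])
# row = conn.execute(sql, {"as_of": '2010-12-01 13:00'}).fetchone()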
I am a newbie to Pandas, and somewhat of a newbie to Python.
I am looking at stock data, which I read in as CSV; typical size is 500,000 rows.
The data looks like this
'''
'''
I need to check the data against itself - the basic algorithm is a loop similar to:
row = 0
while row < number of rows:
    x = get the "Low" price in row `row`
    y = CalculateSomething(x)
    # go through the rest of the data, compare against y
    if (a):
        append "A" at the end of row `row`  # in the dataframe
    else:
        print "B" at the end of row `row`
    row = row + 1
On the next iteration, the data pointer should reset to row 1 and then go through the same process.
Each time, it adds notes to the dataframe at that row's index.
I looked at Pandas and figured the way to try this would be to use two loops, copying the dataframe to maintain two separate instances.
The actual code looks like this (simplified):
import pandas as pd

df = pd.read_csv('data.csv')

calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex, "Open"]
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"
    return dfouter
This method is taking a long time (7+ minutes) per 5K rows, which will be far too long for 500,000 rows. There may be data exceeding 1 million rows.
I have tried using the two-loop method with the following variants:
using iloc - e.g. df.iloc[index, 2]
using at - e.g. df.at[index, "low"]
using numpy & at - e.g. df.at[index, "low"] = np.where((df.at[index, "low"] < ...
The data is floating-point values plus a datetime string.
Is it better to use numpy? Maybe an alternative to using two loops?
Any other methods (R, Mongo, some other database, etc., i.e. different from Python) would also be useful; I just need the results, and I'm not necessarily tied to Python.
Any help and constructs would be greatly appreciated.
Thanks in advance
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
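For the simplified version in the question (where calc1 and calc2 are constants and the frame has a default integer index), a hedged sketch of what a vectorized equivalent of the double loop might look like; it uses NumPy to find, for each row, the first row at or after it that trips either condition:

import numpy as np
import pandas as pd

def func3_vectorized(df, calc1=1, calc2=2):
    low_hit = (df["Low"] <= calc1).to_numpy()
    high_hit = (df["High"] >= calc2).to_numpy()
    n = len(df)

    def first_true_at_or_after(mask):
        # For each position i, the smallest j >= i with mask[j] True (n if none).
        idx = np.where(mask, np.arange(n), n)
        return np.minimum.accumulate(idx[::-1])[::-1]

    first_low = first_true_at_or_after(low_hit)
    first_high = first_true_at_or_after(high_hit)

    out = df.copy()
    # "Low" hits win ties, matching the if/elif order in the original loop.
    out["notes"] = np.where(
        (first_low < n) & (first_low <= first_high), "message 1",
        np.where(first_high < n, "message2", "message3"),
    )
    return out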
If you only care about one row at a time, you can simply use the csv module.
numpy is not "better"; pandas internally uses numpy
Alternatively, load the data into a database (examples include SQLite, MySQL/MariaDB, Postgres, or maybe DuckDB) and run query commands against that. This has the added advantage of allowing type conversion from strings to floats, so numerical analysis is easier.
If you really want to process a file in parallel directly from Python, you could move to Dask or PySpark, although Pandas should work with some tuning; for a start, the Pandas read_sql function (reading from a database) would work better.
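If you do go the Dask route, a minimal sketch (illustrative only; 'data.csv' and the column name are taken from the question):

import dask.dataframe as dd

calc1 = 1                            # placeholder threshold, as in the question
ddf = dd.read_csv('data.csv')        # lazily splits the file into partitions
low_hits = (ddf['Low'] <= calc1).sum()
print(low_hits.compute())            # triggers the parallel computation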
You could split the main dataset into smaller datasets, e.g. 50 sub-datasets with 10,000 rows each, to increase speed. Run your functions on each sub-dataset using threading or multiprocessing and then combine the final results.
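A hedged sketch of that chunk-and-combine idea with multiprocessing; the per-row logic is left as a placeholder (it is the question's confidential calculation), and shipping the full frame to each worker has its own overhead:

import numpy as np
import pandas as pd
from multiprocessing import Pool

def annotate_chunk(args):
    df, start, stop = args
    out = df.iloc[start:stop].copy()
    # ... run the per-row logic for rows start..stop-1 against the full df ...
    return out

def annotate_parallel(df, n_chunks=50, processes=4):
    # Split the outer rows into n_chunks slices and process them in parallel.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    tasks = [(df, b, e) for b, e in zip(bounds[:-1], bounds[1:])]
    with Pool(processes) as pool:
        parts = pool.map(annotate_chunk, tasks)
    return pd.concat(parts)

# Usage (the __main__ guard is needed on platforms that spawn workers):
# if __name__ == "__main__":
#     result = annotate_parallel(pd.read_csv('data.csv'))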
I have an 11-column x 13,470,621-row PyTables table. The first column of the table contains a unique identifier for each row (this identifier is only ever present once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations

# Loop through the table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where(
        "(gene_id == b'gene_id_36624') | (gene_id == b'gene_id_14701') | (gene_id == b'gene_id_14702')"):
    pass  # do something with the data
Now this works fine with small datasets, but I will need to routinely perform queries in which I have many thousands of unique identifiers to match in the table's gene_id column. For these larger queries, the query string quickly gets very large and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but it was not satisfactory.
I come from an R background where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]) and was wondering if there is a comparable solution that I could use with PyTables.
Thanks very much,
Another approach to consider is combining 2 functions: Table.get_where_list() with Table.read_coordinates()
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): Gets a set of rows given their coordinates (in a list), and returns as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']

# Loop through gene names and get rows that match gene identifiers (column labeled gene_id)
gene_row_list = []
for gene_name in gene_name_list:
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)

# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to make some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html)
Create table. Make sure to specify the expectedrows=<int> arg as it has the potential to increase the query speed.
# tb comes from: import tables as tb
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345).
Once the table is populated with its 13,470,621 entries (i.e. rows),
I created a complete sorted index based on the gene_id column (Column.create_csindex()) and sorted it.
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty list if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd - The table is optimized, but I still can't query thousands of gene_ids at a time. So I simply separated them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too many).
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
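For illustration, a hedged sketch of the chunked querying described above, using the 31-id limit the author found, the integer gene ids from the earlier step, and PyTables' read_where (function and variable names are illustrative):

CHUNK = 31  # the largest number of OR-ed terms that compiled without the RecursionError

def query_gene_ids(table, gene_ids):
    rows = []
    for i in range(0, len(gene_ids), CHUNK):
        chunk = gene_ids[i:i + CHUNK]
        # Build a small OR-condition for this chunk only.
        condition = " | ".join("(gene_id == %d)" % g for g in chunk)
        rows.extend(table.read_where(condition))
    return rows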
I have the following code, which will be executed more than 1.5 million times with different queries that are dynamically generated from a configuration file.
I'm not trying to optimize the query conditions; I'm trying to see whether, rather than performing the query 3 times on different columns, I can do the query once and get the same result.
import pandas as pd

csv_file_profit = pd.read_csv('C:\\Users\\test_data.csv')

if query_str:
    profit_sum = csv_file_profit.query(query_str)['P/L'].sum()
    trans_count = csv_file_profit.query(query_str)['Tran ID'].count()
    atr_profit_sum = csv_file_profit.query(query_str)['Max ATR Profit'].sum()
Is there a faster way to get the same result?
One improvement can be to compute the filtered DataFrame once and then perform further computations on this filtered result. Something like:
if query_str:
    filtered = csv_file_profit.query(query_str)
    profit_sum = filtered['P/L'].sum()
    trans_count = filtered['Tran ID'].count()
    atr_profit_sum = filtered['Max ATR Profit'].sum()
Execution time is about 50% of your code's, measured on a very small DataFrame (4 rows). For a bigger DataFrame the difference should be larger.
I have a large Twitter data stream, and I am interested in analyzing the relationships of hashtags in each tweet. For example, if hashtag A and hashtag B appear in the same tweet, I would record this tweet as "A-B" together with the timestamp of the tweet.
As such, sample inputs are:
hashtags, Timestamp
A-B, created_time: 2016-04-07T01:33:19Z
B-C, created_time: 2016-04-07T03:53:19Z
C, created_time: 2016-04-08T03:31:19Z
C-A, created_time: 2016-04-08T04:33:19Z
A-D, created_time: 2016-04-07T07:33:19Z # (Note: an example of out of order)
B-D, created_time: 2016-04-09T09:33:19Z
Note that the stream data might not be ordered by time.
Tasks:
1) Use the stream data to build a graph of hashtags (A, B, C, D, ...) and their relationships with one another.
2) Calculate the average degree of a vertex in the graph and update it each time new stream data appears (over a one-day sliding window).
The average degree of a vertex is defined as: degree = number of edges/number of nodes. For example, if the current graph is A-B, then the average degree = 1(edge)/2 (# of nodes).
Sample outputs:
Output: 1/2, 2/3, 1/2, 1/2, 2/3, 1/2
What is the most efficient Python data structure to store such timestamped data in order to calculate the average degree of a vertex over a one-day rolling window?
My intuition is to use a dictionary to store and maintain the hashtags as keys and the created_time as values. So in order to maintain a one-day window, I would first need to sort the dictionary, which takes a lot of time. Is there a more efficient way to automatically store the timestamped data based on time (with no need to sort)?
I found posts using the Pandas DataFrame and rolling functions to do similar tasks. But in my case, I am looking for the most efficient data structure to do the task.
Updates:
After more research about my question, I found that this question is a good match for mine:
Ideal data structure with fast lookup, fast update and easy comparison/sorting
The key idea is to use heapq.
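A minimal sketch of what a heapq-based one-day window could look like (names are illustrative; edges/nodes are counted the way the question defines the average degree):

import heapq
from collections import Counter
from datetime import timedelta

WINDOW = timedelta(days=1)
heap = []                 # (timestamp, edge) pairs, a min-heap ordered by timestamp
edge_counts = Counter()   # multiset of edges currently inside the window
node_degrees = Counter()  # per-node count of distinct incident edges

def add_edge(timestamp, a, b, latest):
    """Add edge a-b seen at `timestamp`; `latest` is the newest timestamp seen so far."""
    edge = tuple(sorted((a, b)))
    heapq.heappush(heap, (timestamp, edge))
    edge_counts[edge] += 1
    if edge_counts[edge] == 1:          # a genuinely new edge in the window
        node_degrees[a] += 1
        node_degrees[b] += 1
    # Evict tweets older than one day relative to the newest timestamp.
    while heap and heap[0][0] < latest - WINDOW:
        _, old_edge = heapq.heappop(heap)
        edge_counts[old_edge] -= 1
        if edge_counts[old_edge] == 0:
            del edge_counts[old_edge]
            for node in old_edge:
                node_degrees[node] -= 1
                if node_degrees[node] == 0:
                    del node_degrees[node]
    nodes = len(node_degrees)
    return len(edge_counts) / nodes if nodes else 0.0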
The tweets can be expected to be mostly sorted, so a sequence type with insertion sort should be a good way to get them ordered. Add a rolling window to replace the oldest ones after you reach 24 hours.
For efficient insertions, you'll want a sequence type with better insertion support than list. I'd give blist a try. In fact it provides a sortedlist type, so you could try that out and see what kind of performance it achieves.
This all assumes that your stream doesn't grow too fast to keep a whole day's tweets in memory. If it does, you'll have to delegate to some kind of database.
I would use pandas. Here is an example implementation which filters out timestamps outside a window. You would need to copy your data into a DataFrame first.
import datetime
import dateutil.relativedelta

days_back = 1
datetimeFormat = '%Y-%m-%d %H:%M:%S'

# Compute the start of the window, formatted the same way as the 'time_stamp' column.
dt_now = datetime.datetime.now()
start_date = dt_now - dateutil.relativedelta.relativedelta(days=days_back)
start_date = start_date.strftime(datetimeFormat)

# Keep only the rows inside the window (assumes df has a string 'time_stamp' column).
df2 = df[df['time_stamp'] > start_date]
I have a table in SQLite, created using pysqlite:
create table t
(
id integer primary key not null,
time datetime not null,
price decimal(5,2)
)
How can I calculate, from this data, a moving average with a window X seconds wide, using an SQL statement?
As far as I understand your question, you do not want the average over the last N items but over the last x seconds, am I correct?
Well, this gives you the list of all prices recorded in the last 720 seconds:
>>> cur.execute("SELECT price FROM t WHERE datetime(time) > datetime('now','-720 seconds')").fetchall()
Of course, you can feed that to the AVG SQL function to get the average price in that window:
>>> cur.execute("SELECT AVG(price) FROM t WHERE datetime(time) > datetime('now','-720 seconds')").fetchall()
You can also use other time units, and even chain them.
For example, to obtain the average price for the last one and a half hour, do:
>>> cur.execute("SELECT AVG(price) FROM t WHERE datetime(time) > datetime('now','-30 minutes','-1 hour')").fetchall()
Edit: SQLite datetime reference can be found here
The moving average with a window N samples wide is given at time i by:
(v[i] + v[i+1] + ... + v[i+N-1]) / N
To compute it, you want to keep a FIFO queue of size N together with its running sum. You can then update the sum by adding the new value and subtracting the oldest one; you get the new value from the database and the oldest one by popping it off the front of the queue.
Does that make sense?
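A hedged sketch of that running-sum idea in Python, using a fixed-size FIFO queue (for a window defined in seconds rather than samples, you would evict by timestamp instead of by count):

from collections import deque

class MovingAverage:
    def __init__(self, size):
        self.size = size
        self.window = deque()   # FIFO queue of the most recent `size` prices
        self.total = 0.0

    def add(self, price):
        self.window.append(price)
        self.total += price
        if len(self.window) > self.size:
            self.total -= self.window.popleft()   # drop the oldest price
        return self.total / len(self.window)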