I have a bunch of data points in the format (x, y, z, a, b, c) where x is an integer, y is a date, z is an integer, and a, b, and c are different numerical values (integers and floats).
My goal is to allow the user to provide two dates (so two y values), and then be presented with the values of (delta_a, delta_b, delta_c) for all existing (x, z) values; delta_a would be the increase/decrease in the value of a between the two dates, etc.
For example, let's say there are just 3 possible values of x and 3 possible values of z. The user provides two dates, y1=date(2023,2,7) and y2=date(2023,2,15). The data is then presented in a table with one row per x value and one column per z value, where each cell holds (delta_a, delta_b, delta_c) for that (x, z) pair.
Now in reality, there's about 30 different values of x and about 400 different values of z, so this table would actually have about 30 rows and about 400 columns (I have a horizontal scrollbar inside the table to look through the data).
Also, the value of y can be any date (at least since I started importing this data, about a month ago). So every day, about 12,000 new data entries are added to the database.
The way I'm currently handling this is I have a model DataEntry which basically looks like this:
class DataEntry(models.Model):
    x = models.IntegerField()
    y = models.DateField()
    z = models.IntegerField()
    a = models.IntegerField()
    b = models.FloatField()
    c = models.FloatField()
So every time the user generates a data table by inputting two dates, it can take quite a while since the system is comparing the ~12,000 data entries for y1 with the ~12,000 data entries for y2 and displaying all the different values. I will say that not every z value is actually displayed, because the user also inputs a minimum delta_a value, which is 5 by default - so if a did not increase by at least 5, then that table cell is empty. And if an entire column is just empty data cells, i.e. there's no x value for that given z column which had an a value increase by at least 5, then the column is hidden. So sometimes there's as few as 20 columns actually showing, but sometimes there's closer to 100. The user can choose to display all data though, meaning the full ~400 columns.
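Roughly, the per-request work amounts to something like this (a simplified sketch, not my exact view code; min_delta_a stands for the user-supplied minimum, which defaults to 5):

entries_y1 = DataEntry.objects.filter(y=y1).values("x", "z", "a", "b", "c")
entries_y2 = DataEntry.objects.filter(y=y2).values("x", "z", "a", "b", "c")

# Index the first date's entries by (x, z) so the second pass is a dict lookup
# rather than a nested scan over ~12,000 rows.
by_key = {(e["x"], e["z"]): e for e in entries_y1}

deltas = {}
for e in entries_y2:
    old = by_key.get((e["x"], e["z"]))
    if old is None:
        continue
    delta_a = e["a"] - old["a"]
    if delta_a >= min_delta_a:  # otherwise the table cell stays empty
        deltas[(e["x"], e["z"])] = (delta_a, e["b"] - old["b"], e["c"] - old["c"])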
I hope I've explained this sufficiently. Is there a more efficient way to be handling all this? Does it actually make sense to have distinct objects for every single data entry or is there some way I could condense this down to maybe speed up the process?
Any pointers?
I have a use-case in which a calculation happens in an excel-like way:
I show a tabular view to the user in which the user can manipulate the data in cells. When a cell is edited, all dependent columns have to be recalculated as well. This concerns all possible data types: booleans, strings, integers and other numerical values. Is there a way to optimize the speed of the calculation? Every row in the table has around 150 columns, and the values are looked up in other dataframes that are loaded in memory.
EXAMPLE:
When a cell in column B is edited, the row which is edited is selected as a Pandas Series. Then, column D might be recalculated as:
# "df" here is the edited row, selected as a pandas Series
if df.at['A'] == 'condition1':
    df.at['D'] = df.at['B'] + df.at['C']
else:
    df.at['D'] = df.at['B'] - df.at['C']
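If the recalculation can be expressed over whole columns instead of one row at a time, a vectorized version is usually much faster. A minimal sketch of the same rule using numpy.where, with the column names A-D assumed as in the example above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["condition1", "other", "condition1"],
    "B": [1, 2, 3],
    "C": [10, 20, 30],
})

# Recompute column D for every row in one vectorized pass.
df["D"] = np.where(df["A"] == "condition1", df["B"] + df["C"], df["B"] - df["C"])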
I am trying to calculate true positives/negatives and false positives/negatives. For this, I want to compare the y-values of the two functions. Both have time as their x-value, but one has 600001 values over 1200 seconds and the other has 5990. How can I compare the values of y at the same points in time on the graph?
I can not make chunks, because 600001/5990 isn't an integer.
Does someone know where to start searching for an answer?
The plot in your question is not interpolating between points; it is holding the previous y value until a new one comes along (it is showing step changes). If that plot visually makes sense to you, then computationally you should follow the same process.
Comparing at time now
find the point in motion score mat that has the highest x value not greater than now
find the point in Gross Motion that has the highest x value not greater than now
compare the y values of those two points.
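A minimal sketch of that lookup using pandas.merge_asof; the frame and column names here are placeholders, not taken from the question's data:

import pandas as pd

# Two signals sampled at different rates; "t" is time, the y columns are the scores.
fast = pd.DataFrame({"t": [0.0, 0.002, 0.004, 0.006], "y_fast": [0, 1, 1, 0]})
slow = pd.DataFrame({"t": [0.0, 0.005], "y_slow": [0, 1]})

# For every timestamp in the fast signal, take the slow signal's most recent
# value at or before that time (the "step" behaviour described above).
aligned = pd.merge_asof(fast.sort_values("t"), slow.sort_values("t"),
                        on="t", direction="backward")

# y_fast and y_slow can now be compared element-wise, e.g. for TP/FP counts.
print(aligned)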
Could you put these values in a dataframe?
If you get a dataframe with the column headers x1, y1, x2, y2, then you can do:
# get equal values and reset index
df1 = df[df['x1'].isin(df['x2'])].reset_index(drop=True)
df2 = df[df['x2'].isin(df['x1'])].reset_index(drop=True)

# combine the columns from each dataframe side by side;
# returns all rows where x1 == x2 so you can compare the y values
pd.concat([df1[['x1', 'y1']], df2[['x2', 'y2']]], axis=1)
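If the two signals live in separate frames of different lengths, one way to get a single frame with the x1, y1, x2, y2 columns assumed above is a column-wise concat, which pads the shorter signal with NaN. A minimal sketch with made-up values:

import pandas as pd

# Two signals of different lengths ("x" is time, "y" is the score).
s1 = pd.DataFrame({"x1": [0, 1, 2, 3], "y1": [10, 11, 12, 13]})
s2 = pd.DataFrame({"x2": [0, 2], "y2": [20, 22]})

# Concatenating column-wise pads the shorter signal with NaN,
# giving one dataframe with the x1, y1, x2, y2 columns assumed above.
df = pd.concat([s1, s2], axis=1)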
Coming from a Matlab background, where everything is a matrix/vector, it was very easy to loop through a given data set and build a matrix successively. Since the final object was a matrix, it was also very easy to extract specific elements of the matrix. I'm finding it rather problematic in Python. I've reproduced the code here to explain where I am getting stuck.
The original data is just a time series with a month and a price. The goal is to simulate select subsets of these prices. The loop starts by collecting all months into one set, and then drops one month in each successive loop. For 12 months, I will have (n^2 - n)/2 + n columns, 78 in total in this example (for n = 12: (144 - 12)/2 + 12 = 78). To be clear, n is the total number of time periods; 12 in this data set. The rows of the matrix will be the Z scores sampled from the standard normal distribution - the goal is to simulate all 78 prices in one go in a matrix. The number of Z scores is determined by the variable num_terminal_values, currently set to 5 to keep things simple and easy to visualize at this point.
Here's a link to a Google Sheet with the original matrix: google sheet with corr mat. The code below does not read from the Google Sheet; the sheet is only intended to show what the original data looks like. My steps (and Python code) are as follows:
#1 read the data
dfCrv = pd.read_excel(xl, sheet_name = 'upload', usecols = range(0,2)).dropna(axis=0)
#2 create the looper variables and then loop through the data to build a matrix. The rows in the matrix are Z values sampled from the standard normal (this is the variable num_terminal_values). The columns refers to each individual simulation month.
import datetime as dt
import numpy as np
import numpy.random as npr
import pandas as pd

lst_zUCorr = []
num_terminal_values = 5
as_of = dt.datetime(2020, 12, 1)
max_months = dfCrv.shape[0]
sim_months = pd.date_range(dfCrv['term'].iloc[0], dfCrv['term'].iloc[-1], freq='MS')
end_month = dfCrv['term'].iloc[-1]
dfCrv = dfCrv.set_index('term', drop=False)

for runNum in range(max_months):
    sim_month = dfCrv['term'].iloc[runNum]
    ttm = ((sim_month - as_of).days) / 365
    num_months = (end_month.year - sim_month.year) * 12 + (end_month.month - sim_month.month) + 1
    zUCorr = npr.standard_normal(size=(num_terminal_values, num_months))
    lst_zUCorr.append(zUCorr)
#3 investigate the objects
lst_zUCorr
z = np.hstack(lst_zUCorr)
z
So far, everything works fine. However, I don't know how to transform the object lst_zUCorr into a simple matrix. I've tried hstack etc., but the result still doesn't look like a matrix to me.
The next set of operations requires the data in simple matrix form, but what I'm getting here isn't a matrix.
Key point/question - the final 5x78 matrix in Matlab can be used to do more operations. Is there a way to convert the equivalent Python object into a 5x78 matrix, or will I now need to do more coding to access specific subsets of the Python objects?
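For what it's worth, np.hstack over the list of 2-D blocks already returns a single 2-D NumPy array, which is the usual Python stand-in for a Matlab matrix and can be sliced the same way. A self-contained sketch using random blocks of the same shapes as in the loop above:

import numpy as np
import numpy.random as npr

num_terminal_values = 5
# Blocks of decreasing width, mimicking the loop above (12, 11, ..., 1 columns).
lst_zUCorr = [npr.standard_normal(size=(num_terminal_values, m)) for m in range(12, 0, -1)]

z = np.hstack(lst_zUCorr)
print(z.shape)      # (5, 78) -- a plain 2-D array, i.e. the 5x78 "matrix"
print(z[:, 0:12])   # e.g. the columns that came from the first simulation month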
I created a table using plotly to calculate some financials, and I would like to show the whole table in the graph interface (not just a few rows).
As you can see in the image, only 11 of my 30 rows are shown. I would like to show all the data of the table (all 30 rows, with no scrollbar).
The code for the table is the following:
import plotly.graph_objects as go

fig6 = go.Figure(data=[go.Table(
    header=dict(values=list(df_table.columns),
                fill_color='#d3d3d3',
                align='left'),
    cells=dict(values=[df_table['date'],
                       df_table['P/E_Ratio'],
                       df_table['Stock Price']],
               fill_color='white',
               align='left'))
])
As Juan correctly stated, adding height to fig6.update_layout() will do the trick. If you are, however, looking for a more dynamic workaround, you can use this function to calculate the height from the dataframe you pass in:
def calc_table_height(df, base=208, height_per_row=20, char_limit=30, height_padding=16.5):
    '''
    df: The dataframe with only the columns you want to plot
    base: The base height of the table (header without any rows)
    height_per_row: The height that one row requires
    char_limit: If the length of a value crosses this limit, the row's height needs to be expanded to fit the value
    height_padding: Extra height added to a row when the length of a value exceeds char_limit
    '''
    total_height = 0 + base
    for x in range(df.shape[0]):
        total_height += height_per_row
        for y in range(df.shape[1]):
            if len(str(df.iloc[x][y])) > char_limit:
                total_height += height_padding
    return total_height
You might have to play around with the other parameters if you use a font_size other than the default, or if you change the margins from the default. The char_limit argument is the function's other weak point, as some characters take up more space than others, capital characters take up more space, and a single long word can force a row to be extended. char_limit should also be increased if the table has fewer columns, and decreased if it has more. The function was written with 4 table columns in mind.
You would have to add fig6.update_layout(height=calc_table_height(df_table)) to make it work.
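Putting the two together, a minimal usage sketch (fig6 and df_table as defined in the snippets above):

# fig6 and df_table are assumed to be defined as in the snippets above.
fig6.update_layout(height=calc_table_height(df_table))
fig6.show()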
In case this has been answered in the past, I want to apologize; I was not sure how to phrase the question.
I have a dataframe with 3D coordinates and a scalar value (magnetic field in this case) for each point in space. For each point I calculated the radius as its distance from the line at (x,y)=(0,0). The unique radius and z values are transferred into a new dataframe. Now I want to calculate the scalar value for every point (Z, R) in the volume by averaging over all points in the 3D system with equal radius and z value.
Currently I am iterating over all unique Z and R values. It works but is awfully slow.
df is the original dataframe, dfn is the new one which - in the beginning - only contains the unique combinations of R and Z values.
for r in dfn.R.unique():
    for z in df.Z.unique():
        dfn.loc[(df["R"]==r)&(df["Z"]==z), "B"] = df["B"][(df["R"]==r)&(df["Z"]==z)].mean()
Is there any way to speed this up by writing a single line of code, in which pandas is given the command to grab all rows from the original dataframe, where Z and R have the values according to each row in the new dataframe?
Thank you in advance for your help.
Try groupby!!!
It looks like you can achieve this with something like:
df[['R', 'Z', 'B']].groupby(['R', 'Z']).mean()
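If the averaged values then need to go back into dfn (one row per unique (R, Z) pair), a merge on those two columns should do it. A minimal sketch, assuming dfn already holds the unique R/Z combinations:

import pandas as pd

# Mean of B for every (R, Z) combination, as a regular dataframe.
means = df[['R', 'Z', 'B']].groupby(['R', 'Z'], as_index=False).mean()

# Attach the averaged B values to the dataframe of unique (R, Z) pairs.
dfn = dfn.merge(means, on=['R', 'Z'], how='left')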