I have a use-case in which a calculation happens in an excel-like way:
I show a tabular view to the user in which the user can manipulate the data in cells. When a cell is edited, all dependent columns have to be recalculated as well. All possible data types are involved: booleans, strings, integers, and floats. Is there a way to optimize the speed of the calculation? Every row in the table has around 150 columns, and the values are looked up in other dataframes that are loaded in memory.
EXAMPLE:
When a cell in column B is edited, the row which is edited is selected as a Pandas Series. Then, column D might be recalculated as:
if df.at['A'] == 'condition1':
    df.at['D'] = df.at['B'] + df.at['C']
else:
    df.at['D'] = df.at['B'] - df.at['C']
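For comparison, here is a minimal vectorized sketch of the same recalculation applied to the whole table at once (column names as in the example; the lookup logic is simplified and the values are made up):

import numpy as np
import pandas as pd

# Toy table standing in for the real one; in practice the values come from other in-memory dataframes.
df = pd.DataFrame({'A': ['condition1', 'other'], 'B': [1.0, 2.0], 'C': [3.0, 4.0]})

# Recalculate column D for every row in one vectorized step instead of per-cell .at lookups.
df['D'] = np.where(df['A'] == 'condition1', df['B'] + df['C'], df['B'] - df['C'])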
I'm trying to convert a DF I have so that a piece of code like this returns True (cat_totals being a df):
assert_equal(cat_totals["Interdisciplinary"],12296)
assert_equal(cat_totals["Engineering"],537583)
For this, I'm supposed to filter the main dataframe (df) down to a subset containing only the subject and the total number of students. Then I group by the subject (Major_category) and sum. The numbers in my dataframe are correct, but if I run the above code, it raises an error. How can I convert the dataframe so that the above assert_equal calls return True?
cat_totals = df[['Major_category', 'Total']]
cat_totals = cat_totals.groupby(['Major_category']).sum()
cat_totals['Total'] = cat_totals['Total'].astype(int)
display(cat_totals.head(10))
With this the DataFrame looks correct, but cat_totals['Interdisciplinary'] does not give the value I'm looking for. In the table the number for that major is right, so the calculation is correct; the format of the return value just doesn't seem to be what assert_equal expects.
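If it helps, here is a minimal sketch of what I suspect is going on (just a guess on my part, building on the code above): after groupby().sum(), cat_totals is still a DataFrame, so cat_totals["Interdisciplinary"] looks for a column with that name rather than a row.

# Selecting the 'Total' column gives a Series indexed by Major_category,
# so looking up a major name returns a plain number:
totals_by_major = cat_totals['Total']
print(totals_by_major['Interdisciplinary'])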
Any help would be much appreciated! I'm quite new to working with Pandas, so it's a bit of a struggle.
I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple excel files.)
I use a for-loop and append(), but the result only contains the last values, not all of them.
Here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour) ## it works correctly, reads individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) back to the variable:
final = final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
Update
If time efficiency is of interest to you, it is suggested to collect your row dicts ({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}) in a list and build the DataFrame from that list at the end; this is reported to perform better than repeated appends. (Thanks to @stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour) ## it works correctly
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    lister.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
Conceptually you're just aggregating by hour with the two functions std and mean, then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note that .agg()/.aggregate() accepts a dict of {'result_col': aggregating_function}, so you can pass multiple aggregating functions and name their result columns directly, with no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), there is no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
month = 1  # or whatever month you are processing
for hour in hh:
    # Read in columns-of-interest from the individual Excel sheet for this month and hour...
    data = get_data(month, hour)
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    # Compute the corresponding row of the aggregate, naming the results directly in .agg()...
    stats = data['Total Load (MWh)'].agg({'standard deviation': 'std', 'average': 'mean'})
    dat_hh_aggregate = pd.DataFrame([{'Month': month, 'Hour': hour, **stats.to_dict()}])
    final = final.append(dat_hh_aggregate, ignore_index=True)
Notes:
pd.read_excel's usecols=['Flowday','Interval',...] lets you avoid reading in columns you aren't interested in in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass in the list of columns-of-interest; you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway (see the short sketch after these notes).
There's no need to store the separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just have one results dataframe, final, and append to it, ignoring the index. (If you run into issues with that, post updated code here and make sure it's reproducible.)
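For illustration, a minimal usecols sketch (the workbook name is a made-up placeholder; adapt it to however get_data() actually opens the file):

import pandas as pd

# Hypothetical workbook name; the point is to read only the column you will aggregate.
data = pd.read_excel('demand_data.xlsx', usecols=['Total Load (MWh)'])
print(data['Total Load (MWh)'].agg(['std', 'mean']))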
I first found a cycle in a manufacturing process. I collected the 2 largest pressure values from the given cycles and printed them to a new sheet. I now need to capture the corresponding times at which those largest values occur. This portion of my code looks like this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group','row_index'])
df2 = df.groupby('group')['Date/Time']
A sample snippet of the data I am trying to extract can be seen here:
Any help on this would be appreciated!
You can sort the data frame and take the last 2 rows per group. Typing this blind, as you did not provide sample data:
df2 = (
    df.sort_values(['group', 'Pressure'])
      .groupby('group', sort=False)
      .tail(2)
)
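If you would rather keep the nlargest() call from your question, here is a sketch along those lines (also untested without your sample data) that reuses its index to pull the matching Date/Time values:

# nlargest() returns a Series with a (group, original_row_index) MultiIndex;
# the second level points back at rows in df, which still carry their Date/Time.
top2 = df.groupby('group')['Pressure'].nlargest(2)
df2 = df.loc[top2.index.get_level_values(1), ['group', 'Date/Time', 'Pressure']]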
I am querying a database for a few variables from an experiment, one at a time and storing the data in a Pandas DataFrame. I can get the data that I need, looks as below for instance:
   file     time  variableid  data
0     1  1503657           1    11
1     1  1503757           1    22
There is data for several variables that I will be grabbing like this, and then I will combine them into a single DataFrame to be output to a CSV. Each variable's data column will be added as a new column named after that variable (the file id should always be the same). The time column values might differ (one DataFrame could be longer than the other, the data wasn't sampled at all the same times, etc.), but if I merge the tables on the time (and file) columns, any discrepancies are filled in with NaN (which I will fill with DataFrame.fillna(0)) and the result can be re-sorted by time, roughly as sketched below.
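Here is roughly what I mean by that merge step, with made-up numbers shaped like the sample above:

import pandas as pd

# Two variables' data, as returned by separate queries (values are made up).
var1 = pd.DataFrame({'file': [1, 1], 'time': [1503657, 1503757], 'var1': [11, 22]})
var2 = pd.DataFrame({'file': [1, 1], 'time': [1503657, 1503800], 'var2': [5, 7]})

# Outer-merge on file and time; unmatched times become NaN, which I then zero-fill.
combined = (pd.merge(var1, var2, on=['file', 'time'], how='outer')
              .fillna(0)
              .sort_values('time'))
print(combined)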
What I need though is a way to filter the data so that it fits a certain rate, such as every 100 milliseconds (1503700,1503800,...). The datapoint itself doesn't have to fit that rate exactly (and in fact the data rarely falls on a time that ends in 00 for instance), but it should be the closest matching data for that time (it could be the closest before or after that time actually, as long as it is consistent throughout).
I thought about iterating over all the values in the time column and adding the row with the closest time one by one (after first creating a blank DataFrame with the desired times), but there are sometimes 50,000+ rows in a sample table. I found an answer about interpolating (link below), but I don't really want to add or modify any of the data itself; I just want to pull the rows that most closely match the rate I want to sample at. One reason is that some of the data is binary, and I wouldn't want to end up with something like 0.5 because the values before and after the desired time were 0 and 1. Any help is greatly appreciated, thanks.
combining pandas dataframes of different sampling rates
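For reference, a hedged sketch of one common way to do this kind of nearest-match selection is pd.merge_asof with direction='nearest'; the 100 ms grid and the numbers below are made up to match the sample above:

import pandas as pd

# Query result for one variable, shaped like the sample above (values made up).
var1 = pd.DataFrame({'file': [1, 1, 1], 'time': [1503657, 1503757, 1503845], 'var1': [11, 22, 33]})

# Desired sampling grid: every 100 ms across the span of the data.
grid = pd.DataFrame({'time': range(1503600, 1503901, 100)})

# For each grid time, take the existing row whose time is closest (before or after),
# without modifying or interpolating any of the original values.
aligned = pd.merge_asof(grid.sort_values('time'), var1.sort_values('time'),
                        on='time', direction='nearest')
print(aligned)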