I have a use-case in which a calculation happens in an excel-like way:
I show a tabular view to the user in which the user can manipulate the data in cells. When a cell is edited, all dependent columns have to be recalculated as well. All possible data types are involved: booleans, strings, integers, and floats. Is there a way to optimize the speed of the calculation? Every row in the table has around 150 columns, and the values are looked up in other dataframes that are loaded in memory.
EXAMPLE:
When a cell in column B is edited, the row which is edited is selected as a Pandas Series. Then, column D might be recalculated as:
if df.at['A'] == 'condition1':
    df.at['D'] = df.at['B'] + df.at['C']
else:
    df.at['D'] = df.at['B'] - df.at['C']
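For comparison, here is a minimal vectorized sketch of the same recalculation applied to the whole table at once (column names as in the example; the lookup logic is simplified and the values are made up):

import numpy as np
import pandas as pd

# Toy table standing in for the real one; in practice the values come from other in-memory dataframes.
df = pd.DataFrame({'A': ['condition1', 'other'], 'B': [1.0, 2.0], 'C': [3.0, 4.0]})

# Recalculate column D for every row in one vectorized step instead of per-cell .at lookups.
df['D'] = np.where(df['A'] == 'condition1', df['B'] + df['C'], df['B'] - df['C'])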
I'm trying to convert a DF I have so that a piece of code like this returns True (cat_totals being a df):
assert_equal(cat_totals["Interdisciplinary"],12296)
assert_equal(cat_totals["Engineering"],537583)
For this, I'm supposed to filter the main dataframe (df) down to a subset containing only the subject and the total number of students. Then I group by the subject (Major_category) and sum. The numbers in my dataframe are correct, but if I run the above code, it raises an error. How can I convert the dataframe so that the above assert_equal calls return True?
cat_totals = df[['Major_category', 'Total']]
cat_totals = cat_totals.groupby(['Major_category']).sum()
cat_totals['Total'] = cat_totals['Total'].astype(int)
display(cat_totals.head(10))
With this the DataFrame looks correct, but cat_totals['Interdisciplinary'] does not give the value I'm looking for. In the table the number for that major is right, so the calculation is correct; the format of the return value just doesn't seem to be what assert_equal expects.
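If it helps, here is a minimal sketch of what I suspect is going on (just a guess on my part, building on the code above): after groupby().sum(), cat_totals is still a DataFrame, so cat_totals["Interdisciplinary"] looks for a column with that name rather than a row.

# Selecting the 'Total' column gives a Series indexed by Major_category,
# so looking up a major name returns a plain number:
totals_by_major = cat_totals['Total']
print(totals_by_major['Interdisciplinary'])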
Any help would be much appreciated! I'm quite new to working with Pandas, so it's a bit of a struggle.
I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple excel files.)
I use a for-loop and append(), but the result only contains the last values, not all of them.
Here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour) ## it works correctly, reads individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) back to the variable:
final = final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
Update
If time efficiency is of interest to you, it is suggested to collect your row dicts ({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}) in a list and build the DataFrame from that list at the end; this is reported to perform better than repeated appends. (Thanks to @stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour) ## it works correctly
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    lister.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
Conceptually you're just aggregating by hour with the two functions std and mean, then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note that .agg()/.aggregate() accepts a dict of {'result_col': aggregating_function}, so you can pass multiple aggregating functions and name their result columns directly, with no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), there is no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
month = 1  # or whatever month you are processing
for hour in hh:
    # Read in columns-of-interest from the individual Excel sheet for this month and hour...
    data = get_data(month, hour)
    data = pd.DataFrame(data, columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    # Compute the corresponding row of the aggregate, naming the results directly in .agg()...
    stats = data['Total Load (MWh)'].agg({'standard deviation': 'std', 'average': 'mean'})
    dat_hh_aggregate = pd.DataFrame([{'Month': month, 'Hour': hour, **stats.to_dict()}])
    final = final.append(dat_hh_aggregate, ignore_index=True)
Notes:
pd.read_excel's usecols=['Flowday','Interval',...] lets you avoid reading in columns you aren't interested in in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass in the list of columns-of-interest; you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway (see the short sketch after these notes).
There's no need to store the separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just have one results dataframe, final, and append to it, ignoring the index. (If you run into issues with that, post updated code here and make sure it's reproducible.)
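For illustration, a minimal usecols sketch (the workbook name is a made-up placeholder; adapt it to however get_data() actually opens the file):

import pandas as pd

# Hypothetical workbook name; the point is to read only the column you will aggregate.
data = pd.read_excel('demand_data.xlsx', usecols=['Total Load (MWh)'])
print(data['Total Load (MWh)'].agg(['std', 'mean']))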
I first found a cycle in a manufacturing process. I collected the 2 largest pressure values from the given cycles and printed them to a new sheet. I now need to capture the corresponding times at which those largest values occur. This portion of my code looks like this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group','row_index'])
df2 = df.groupby('group')['Date/Time']
A sample snippet of the data I am trying to extract can be seen here:
Any help on this would be appreciated!
You can sort the data frame and take the last 2 rows per group. Typing this blind, as you did not provide sample data:
df2 = (
    df.sort_values(['group', 'Pressure'])
      .groupby('group', sort=False)
      .tail(2)
)
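If you would rather keep the nlargest() call from your question, here is a sketch along those lines (also untested without your sample data) that reuses its index to pull the matching Date/Time values:

# nlargest() returns a Series with a (group, original_row_index) MultiIndex;
# the second level points back at rows in df, which still carry their Date/Time.
top2 = df.groupby('group')['Pressure'].nlargest(2)
df2 = df.loc[top2.index.get_level_values(1), ['group', 'Date/Time', 'Pressure']]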
I am querying a database for a few variables from an experiment, one at a time and storing the data in a Pandas DataFrame. I can get the data that I need, looks as below for instance:
   file     time  variableid  data
0     1  1503657           1    11
1     1  1503757           1    22
There is data for several variables that I will be grabbing like this, and then I will combine them into a single DataFrame to be output to a CSV. Each variable's data column will be added as a new column named after that variable (the file id should always be the same). The time column values might differ (one DataFrame could be longer than the other, the data wasn't sampled at all the same times, etc.), but if I merge the tables on the time (and file) columns, any discrepancies are filled in with NaN (which I will fill with DataFrame.fillna(0)) and the result can be re-sorted by time, roughly as sketched below.
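Here is roughly what I mean by that merge step, with made-up numbers shaped like the sample above:

import pandas as pd

# Two variables' data, as returned by separate queries (values are made up).
var1 = pd.DataFrame({'file': [1, 1], 'time': [1503657, 1503757], 'var1': [11, 22]})
var2 = pd.DataFrame({'file': [1, 1], 'time': [1503657, 1503800], 'var2': [5, 7]})

# Outer-merge on file and time; unmatched times become NaN, which I then zero-fill.
combined = (pd.merge(var1, var2, on=['file', 'time'], how='outer')
              .fillna(0)
              .sort_values('time'))
print(combined)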
What I need though is a way to filter the data so that it fits a certain rate, such as every 100 milliseconds (1503700,1503800,...). The datapoint itself doesn't have to fit that rate exactly (and in fact the data rarely falls on a time that ends in 00 for instance), but it should be the closest matching data for that time (it could be the closest before or after that time actually, as long as it is consistent throughout).
I thought about iterating over all the values in the time column and adding the row with the closest time one by one (after first creating a blank DataFrame with the desired times), but there are sometimes 50,000+ rows in a sample table. I found an answer about interpolating (link below), but I don't really want to add or modify any of the data itself; I just want to pull the rows that most closely match the rate I want to sample at. One reason is that some of the data is binary, and I wouldn't want to end up with something like 0.5 because the values before and after the desired time were 0 and 1. Any help is greatly appreciated, thanks.
combining pandas dataframes of different sampling rates
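For reference, a hedged sketch of one common way to do this kind of nearest-match selection is pd.merge_asof with direction='nearest'; the 100 ms grid and the numbers below are made up to match the sample above:

import pandas as pd

# Query result for one variable, shaped like the sample above (values made up).
var1 = pd.DataFrame({'file': [1, 1, 1], 'time': [1503657, 1503757, 1503845], 'var1': [11, 22, 33]})

# Desired sampling grid: every 100 ms across the span of the data.
grid = pd.DataFrame({'time': range(1503600, 1503901, 100)})

# For each grid time, take the existing row whose time is closest (before or after),
# without modifying or interpolating any of the original values.
aligned = pd.merge_asof(grid.sort_values('time'), var1.sort_values('time'),
                        on='time', direction='nearest')
print(aligned)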