I need to style a Pandas dataframe that has a multi-index column arrangement. As an example of my dataframe, consider:
df = pd.DataFrame([[True for a in range(12)]])
df.columns = pd.MultiIndex.from_tuples([(a,b,c) for a in ['Tim','Sarah'] for b in ['house','car','boat'] for c in ['new','used']])
df
This displays the multi-index columns just as I would expect it to, and is easy to read:
However, when I convert it to a styled dataframe:
df.style
The column headers suddenly shift and it becomes confusing to figure out where each level begins and ends:
Can anyone help me undo this, to return it to the more readable left-justified setup? I looked through about 10 other posts on SO but none addressed this issue.
TIA.
UPDATE 12/9:
My primary product is a .to_excel() output, which I just learned displays correctly. So while this issue is still open for solutions, I am not in urgent need of one, but am in search of one nonetheless. Thanks.
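One direction I may try for the HTML view, based on the Styler docs: set_table_styles accepts CSS rules applied to the rendered table, so left-aligning the header cells might restore the readable layout. A minimal sketch (unverified against every pandas version):

import pandas as pd

df = pd.DataFrame([[True for a in range(12)]])
df.columns = pd.MultiIndex.from_tuples(
    [(a, b, c) for a in ['Tim', 'Sarah'] for b in ['house', 'car', 'boat'] for c in ['new', 'used']])

# Apply a CSS rule to every header cell (<th>) of the rendered HTML table.
styled = df.style.set_table_styles(
    [{'selector': 'th', 'props': [('text-align', 'left')]}])
styled  # in a notebook, this renders with left-justified headers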
The title does say a lot, but I'm completely new to Python and Pandas, so I want to know how to do this. I have a dataframe of roughly 21,000 rows with 4 headers. I want to assign a keyword (or reuse the original header) to each column and feed each column's data into an equation together with the data from the other columns. This data will eventually be exported to ArcGIS for processing into visual markers. I've gotten as far as reading the data into Pandas, and now I'm stuck.
Thank you in advance.
I can't comment as I don't have enough rep. Have you tried df['column name']? That is how you use a column in equations. So, without knowing what you actually need: if you want to use columns in equations, you can do something like this:
df['new column'] = 3 * df['column name']
Or if you want to multiply 2 columns:
df['c'] = df['a'] * df['b']
Or use the multiply method (check here for the original post):
df[['a', 'b']] = df[['a', 'b']].multiply(df['c'], axis="index")
Without actually seeing a df sample or the intended output, I can't be sure any of these is what you're actually after.
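To make these concrete, here is a small self-contained sketch with made-up column names:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30], 'c': [2, 2, 2]})
df['new column'] = 3 * df['a']      # scale a single column
df['d'] = df['a'] * df['b']         # element-wise product of two columns
df[['a', 'b']] = df[['a', 'b']].multiply(df['c'], axis="index")  # scale several columns at once
print(df)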
I am working on a calculator to determine what to feed your fish as a fun project to learn python, pandas, and numpy.
My data is organized like this:
As you can see, my fish are rows, and the different foods are columns.
What I am hoping to do is have the user (me) input a food, and have the program output all the values that are not NaN.
The reason I would prefer to leave them as NaN rather than 0 is that I use different numbers in different spots to indicate preference: 1 is natural diet, 2 is OK but not ideal, 3 is live only.
Is there any way to do this using pandas? Everywhere I look online helps me filter rows based on column values, but it is quite difficult to find info on filtering columns based on rows.
Currently, my code looks like this:
import pandas as pd
import numpy as np
df = pd.read_excel(r'C:\Users\Daniel\OneDrive\Documents\AquariumAiMVP.xlsx')
clownfish = df[0:1]
angelfish = df[1:2]
damselfish = df[2:3]
So, as you can see, I haven't really gotten anywhere yet. I tried filtering out the nulls using the following idea:
clownfish_wild_diet = pd.isnull(df.clownfish)
But it results in an error, saying:
AttributeError: 'DataFrame' object has no attribute 'clownfish'
Thanks for the help guys. I'm a total pandas noob so it is much appreciated.
You can use masks in pandas:
food = 'Amphipods'
mask = df[food].notnull()
result_set = df[mask]
df[food].notnull() returns a mask (a Series of boolean values indicating if the condition is met for each row), and you can use that mask to filter the real DF using df[mask].
Usually you can combine these two lines for more Pythonic code, but that's up to you:
result_set = df[df[food].notnull()]
This returns a new DF with the subset of rows that meet the condition (including all columns from the original DF), so you can apply further operations to this new DF (e.g. selecting a subset of columns, dropping other missing values, etc.).
See more about .notnull(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html
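Put together with data shaped like the question's (the fish and food names here are made up for illustration), a minimal sketch:

import numpy as np
import pandas as pd

# Hypothetical layout: fish as rows, foods as columns, NaN = not eaten.
df = pd.DataFrame(
    {'Amphipods': [1.0, np.nan, 2.0], 'Brine shrimp': [np.nan, 1.0, 3.0]},
    index=['clownfish', 'angelfish', 'damselfish'])

food = 'Amphipods'
print(df[df[food].notnull()])        # fish that have a value for this food
print(df.loc['clownfish'].dropna())  # the other direction: non-NaN foods for one fish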
I'm looking to insert information into an existing dataframe. This dataframe's shape is 2001 rows × 13 columns; however, only the first column has information.
I have 12 more columns, but these do not have the same dimensions as the main dataframe, so I'd like to insert these additional columns into the main one using a conditional.
Example dataframe:
This is an example. I want to insert the var column into the 2001 × 13 dataframe, using the date as a conditional; if there is no matching date, it skips the row or simply adds a 0.
I'm really new to python and programming in general.
Without a minimal working example it is hard to provide clear recommendations, but I think what you are looking for is the .loc accessor of a pd.DataFrame. Here is what I would recommend doing:
Selecting rows with .loc works better in your case if the dates are first converted to datetime, so a first step is to make this conversion:
# Pandas is quite smart about guessing date format. If this fails, please check the
# documentation https://docs.python.org/3/library/datetime.html to learn more about
# format strings.
df['date'] = pd.to_datetime(df['date'])
# Make this the index of your data frame.
df.set_index('date', inplace=True)
It is not clear how you intend to use the conditionals or what the content of your other columns is. With .loc this is pretty straightforward:
# At Feb 1, 2020, add a value to columns 'var'.
df.loc['2020-02-01', 'var'] = 0.727868
This could also be used for ranges:
# Assuming you have a second `df2` which has a datetime column 'date' with the
# data you wish to add to `df`. This will only work if all df2['date'] are found
# in df.index; .values avoids index alignment on assignment. You can work out
# the logic for your case.
df.loc[df2['date'], 'var2'] = df2['vals'].values
If the logic is too complex and the dataframe is not too large, iterating with .iterrows could be easier, especially if you are beginning with Python.
for idx, row in df.iterrows():
    if idx in list_of_other_dates:
        df.loc[idx, 'var'] = (some code here)
Please clarify a bit your problem and you will get better answers. Do not forget to check the documentation.
I am trying to create a reference matrix in Pandas that looks like the below image in Excel. I decided upon the index and column values by simply entering the values for the dates myself. Then I am able to reference each column and index value for every calculation in the matrix. The calculations below are just for display.
In Pandas, I have been using the pivot_table function to produce a similar table. However, pivot_table only includes column values that are present in the data. See the screenshot below for the issue: I have values for 2018-05 in the index, but it doesn't appear in the columns, so the data is incomplete.
Therefore the pivot_table functionality does not work for me as-is. I need to be able to decide on the column headers and index values manually, similar to the Excel example above.
Any help would be greatly appreciated as I cannot figure this one out!
mask = ((repayments.top_repayment_delinquency_reason == 'misappropriation_of_funds')
        & (repayments.repaid_date < date.today() - pd.offsets.MonthBegin(1)))
repayments[mask].pivot_table(values='amount_principal',
                             index='top_repayment_due_month',
                             columns='repaid_month',
                             aggfunc=sum)
I found an answer in the end.
import pandas as pd
from datetime import date
from dateutil.relativedelta import relativedelta

dates_eom = pd.date_range('2018-5-31', (date.today() + relativedelta(months=+0)), freq='M')
dates_eom = dates_eom.to_period('M')  # monthly periods for both axes
df = pd.DataFrame(index=dates_eom, columns=dates_eom)
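If you also want the pivot table's data inside that manually built grid, one option (my suggestion, not part of the original answer) is to reindex the pivot result onto those axes, so months missing from the data still appear as empty columns. Continuing from the block above:

# `pivot` stands in for the pivot_table result; here a tiny hypothetical frame
# that covers only one cell of the grid.
pivot = pd.DataFrame({pd.Period('2018-06', 'M'): [100.0]},
                     index=[pd.Period('2018-05', 'M')])

# Align to the full grid; months absent from the data become all-NaN columns/rows.
full = pivot.reindex(index=dates_eom, columns=dates_eom)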
I am looking for a cleaner way to achieve the following:
I have a DataFrame with certain columns that I want to update when new information arrives. This "new information", in the form of a pandas DataFrame (from a CSV file), can have more or fewer rows; however, I am only interested in updating the rows that already exist in the original DataFrame.
Original DataFrame
DataFrame with new information
(Note the missing name "c" here and the change in "status" for name "a")
Now, I wrote the following "inconvenient" code to update the original DataFrame with the new information
Updating the "status" column based on the "name" column
for idx,row in df_base.iterrows():
if not df_upd[df_upd['name'] == row['name']].empty:
df_base.loc[idx, 'status'] = df_upd.loc[df_upd['name'] == row['name'], 'status'].values
It achieves exactly what I want, but it neither looks nice nor is efficient, and I hope there might be a cleaner way. I tried the pd.merge method; however, the problem is that it adds new columns instead of updating the cells in the existing column.
pd.merge(left=df_base, right=df_upd, on=['name'], how='left')
I am looking forward to your tips and ideas.
You could set_index("name") and then call .update:
>>> df_base = df_base.set_index("name")
>>> df_upd = df_upd.set_index("name")
>>> df_base.update(df_upd)
>>> df_base
      status
name
a          0
b          1
c          0
d          1
More generally, you can set the index to whatever seems appropriate, update, and then reset as needed.
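For completeness, a self-contained round trip with sample values mirroring the example above (note that update may upcast integer columns to float where the update frame has gaps):

import pandas as pd

df_base = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'status': [1, 1, 0, 1]})
df_upd = pd.DataFrame({'name': ['a', 'd'], 'status': [0, 1]})

df_base = df_base.set_index('name')
df_upd = df_upd.set_index('name')
df_base.update(df_upd)            # overwrite matching cells, aligned on 'name'
df_base = df_base.reset_index()   # restore 'name' as a regular column if needed
print(df_base)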