I have a dataframe with a 3-level deep multi-index on the columns. I would like to compute subtotals across rows (sum(axis=1)) where I sum across one of the levels while preserving the others. I think I know how to do this using the level keyword argument of pd.DataFrame.sum. However, I'm having trouble thinking of how to incorporate the result of this sum back into the original table.
Setup:
import numpy as np
import pandas as pd
from itertools import product
np.random.seed(0)
colors = ['red', 'green']
shapes = ['square', 'circle']
obsnum = range(5)
rows = list(product(colors, shapes, obsnum))
idx = pd.MultiIndex.from_tuples(rows)
idx.names = ['color', 'shape', 'obsnum']
df = pd.DataFrame({'attr1': np.random.randn(len(rows)),
'attr2': 100 * np.random.randn(len(rows))},
index=idx)
df.columns.names = ['attribute']
df = df.unstack(['color', 'shape'])
Gives a nice frame like so:
Say I wanted to reduce the shape level. I could run:
tots = df.sum(axis=1, level=['attribute', 'color'])
to get my totals like so:
Once I have this, I'd like to tack it on to the original frame. I think I can do this in a somewhat cumbersome way:
tots = df.sum(axis=1, level=['attribute', 'color'])
newcols = pd.MultiIndex.from_tuples(list((i[0], i[1], 'sum(shape)') for i in tots.columns))
tots.columns = newcols
bigframe = pd.concat([df, tots], axis=1).sort_index(axis=1)
Is there a more natural way to do this?
Here is a way without loops:
s = df.sum(axis=1, level=[0,1]).T
s["shape"] = "sum(shape)"
s.set_index("shape", append=True, inplace=True)
df.combine_first(s.T)
The trick is to use the transposed sum. So we can insert another column (i.e. row) with the name of the additional level, which we name exactly like the one we summed over. This column can be converted to a level in the index with set_index. Then we combine df with the transposed sum. If the summed level is not the last one you might need some level reordering.
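In recent pandas releases the level keyword of DataFrame.sum is no longer available, so the subtotal step may need a different spelling. Here is a minimal sketch of the same idea using groupby on the transposed frame, under that assumption; the frame is rebuilt as in the question's setup (with a seeded generator), and the 'sum(shape)' label is taken from the question:

```python
import numpy as np
import pandas as pd
from itertools import product

# Rebuild a frame shaped like the one in the question.
rng = np.random.default_rng(0)
rows = list(product(['red', 'green'], ['square', 'circle'], range(5)))
idx = pd.MultiIndex.from_tuples(rows, names=['color', 'shape', 'obsnum'])
df = pd.DataFrame({'attr1': rng.standard_normal(len(rows)),
                   'attr2': 100 * rng.standard_normal(len(rows))},
                  index=idx)
df.columns.names = ['attribute']
df = df.unstack(['color', 'shape'])

# Sum over 'shape' by grouping the transposed frame on the levels we keep.
tots = df.T.groupby(level=['attribute', 'color']).sum().T

# Re-attach a 'shape' level that labels the subtotal columns.
tots.columns = pd.MultiIndex.from_arrays(
    [tots.columns.get_level_values('attribute'),
     tots.columns.get_level_values('color'),
     ['sum(shape)'] * tots.shape[1]],
    names=df.columns.names)

# Put the subtotals next to the columns they summarize.
bigframe = pd.concat([df, tots], axis=1).sort_index(axis=1)
```

Sorting the columns places each 'sum(shape)' column inside its (attribute, color) group, which is the layout the question asks for.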
Here's my brute-force way of doing it.
After running your well written (thank you) sample code, I did this:
attributes = pd.unique(df.columns.get_level_values('attribute'))
colors = pd.unique(df.columns.get_level_values('color'))
for attr in attributes:
    for clr in colors:
        # a tuple key selects one label per listed level
        df[(attr, clr, 'sum')] = df.xs((attr, clr), level=['attribute', 'color'], axis=1).sum(axis=1)
df
Which gives me:
I'm using a for loop to generate an Excel file for graphing the data from a df, so I'm using value_counts. I would like to add a second frame under this one with the same data expressed as percentages. My code is this:
li = []
for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()#.drop(columns='index')
    value_percentage = (value_percentage*100).astype(str)+'%'
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
data.to_excel("resultdf.xlsx") #index cleaned
Basically I need it to look like this:
As long as the column names match between the two data frames, you should be able to use pd.concat() to concatenate them. To concatenate them vertically, use axis=0 instead of axis=1 (see the pd.concat documentation).
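To make the difference between the two axes concrete, here is a small sketch on made-up data (the 'grade' series and the 'value' column name are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical grades; any Series works the same way.
s = pd.Series([1, 2, 2, 3, 3, 3], name='grade')

# Give both frames the same column name so they line up when stacked.
counts = s.value_counts().rename('value').to_frame()
percent = s.value_counts(normalize=True).rename('value').to_frame()

stacked = pd.concat([counts, percent], axis=0)       # percent rows go under the counts
side_by_side = pd.concat([counts, percent], axis=1)  # percent column goes next to the counts
```

With axis=0 the result has twice as many rows and a single column; with axis=1 it keeps the row count and gains a second column.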
Data
Let's prepare some dummy data to work with. Based on the provided screenshot, I'm assuming that the raw data are music genres graded on a scale of 1 to 5. So I'm going to use something like this as data:
import pandas as pd
from numpy.random import default_rng
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
Notes on the original code
There's no need to iterate by a column index. We can iterate through column names, as in for column in df.columns: df[column] ...
I think it's better to format the data with map('{:.0%}'.format) before transforming them to a frame.
Instead of appending counted and normalized values one by one, it's better to pd.concat them vertically into a single frame and append that frame to the list.
So the original code may be rewritten like this:
li = []
for col in df.columns:
    value_counts = df[col].value_counts()
    value_percentage = df[col].value_counts(normalize=True).map('{:.0%}'.format)
    li.append(pd.concat([value_counts, value_percentage]).to_frame().reset_index())
resultdf = pd.concat(li, axis=1)
resultdf.to_excel("resultdf.xlsx")
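As a quick illustration of what the map('{:.0%}'.format) step does, here is a sketch on a few made-up fractions (the values and index are arbitrary):

```python
import pandas as pd

# Hypothetical normalized counts, indexed by grade.
shares = pd.Series([0.256, 0.5, 0.244], index=[3, 1, 2])

# Each float becomes a string such as '26%'; the index is untouched.
labels = shares.map('{:.0%}'.format)
```

Note that the result holds strings rather than numbers, so Excel will treat those cells as text.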
Let Excel do formatting
What if we let Excel format the data as percentages on its own? I think the easiest way to do this is to use Styler. But before that, I suggest getting rid of the Index columns. As far as I can see, all of them refer to the same grades 1, 2, 3, 4, 5, so we can use those grades as a common index, which makes the index meaningful. I'm also going to use a MultiIndex to separate counted and normalized values, like this:
formula = ['counts', 'percent']
values = [1, 2, 3, 4, 5]
counted = pd.DataFrame(index=pd.MultiIndex.from_product([formula, values], names=['formula', 'values']))
counted is our data container and it's empty at the moment. Let's fill it in:
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = pd.concat([counts, percent], keys=formula)
Having these data, let's apply some style to them and only then transform into an Excel file:
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=pd.IndexSlice['counts', columns])
    .set_properties(**{'number-format': '0%'}, subset=pd.IndexSlice['percent', columns])
)
styled_data.to_excel('test.xlsx')
Now our data in Excel are looking like this:
All of them are numbers and we can use them in further calculations.
Full code
from pandas import DataFrame, MultiIndex, IndexSlice, concat
from numpy.random import default_rng
# Initial parameters
rng = default_rng(0)
data_length = 100
genres = ['Pop', 'Dance', 'Rock', 'Jazz']
values = [1, 2, 3, 4, 5]
formula = ['counts', 'percent']
file_name = 'test.xlsx'
# Prepare data
data = rng.integers(min(values), max(values), size=(data_length, len(genres)), endpoint=True)
df = DataFrame(data, columns=genres)
# Prepare a container for counted data
index = MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = DataFrame(index=index)
# Fill in counted data
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = concat([counts, percent], keys=formula)
# Apply number formatting and save the data in an Excel file
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=IndexSlice['counts', :])
    .set_properties(**{'number-format': '0%'}, subset=IndexSlice['percent', :])
)
styled_data.to_excel(file_name)
P.S.
Don't be confused by the identical values in the counts and percent parts for this dummy data. That's a consequence of how the data were built: the initial data frame df has 100 rows, so each count equals its percentage.
python 3.11.0
pandas 1.5.1
numpy 1.23.4
Update
If we want to keep the values for each column of the original data, but use Styler to set a number format for the second half of the output frame, then we have to rename the Index columns, because Styler requires unique column/index labels in the passed DataFrame. We can either rename them (e.g. "Values.Pop", etc.) or use a MultiIndex for the columns, which IMO looks better. Also let's take into account that the number of unique values may differ between columns, which means we have to collect the counts and percent values separately before combining them:
import pandas as pd
from numpy.random import default_rng
# Prepare dummy data with missing values in some columns
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
df['Pop'].replace([1,5], 2, inplace=True)
df['Dance'].replace(3, 5, inplace=True)
# Collect counted values and their percentage
counts, percent = [], []
for col in df.columns:
    item = (
        df[col].value_counts()
        .rename('count')
        .rename_axis('value')
        .to_frame()
        .reset_index()
    )
    counts.append(item)
    percent.append(item.assign(count=item['count']/item['count'].sum()))
# Combine counts and percent in a single data frame
counts = pd.concat(counts, axis=1, keys=df.columns)
percent = pd.concat(percent, axis=1, keys=df.columns)
resultdf = pd.concat([counts, percent], ignore_index=True)
# Note: In order to use resultdf in styling we should produce
# unique index labels for the output data.
# For this purpose we can use ignore_index=True
# or assign some keys for each part, e.g. key=['counted', 'percent']
# Format the second half of resultdf as Percent, ie. "0%" in Excel terminology
styled_result = (
    resultdf.style
    .set_properties(
        **{'number-format': '0%'},
        subset=pd.IndexSlice[len(resultdf)//2:, pd.IndexSlice[:,'count']])
    # if we used keys instead of ignore_index to produce resultdf
    # then len(resultdf)//2: should be replaced with 'percent'
    # i.e. the name of the percent part.
)
styled_result.to_excel('my_new_excel.xlsx')
The output in this case is gonna look like this:
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1()
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or less values than expected. In cases where there are less values than expected (ie the df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then on how to handle the case where there are more columns than expected.
You can take a set difference with the - operator:
Assuming df1 has these columns:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, axis=1, inplace=True)
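Since a set does not preserve column order, and the question also asks about missing columns, here is a hedged sketch that handles both cases with plain list comprehensions. The 'origin' column is a hypothetical expected-but-missing column added only for illustration:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]],
                   columns=['type_of_fruit', 'name_of_fruit', 'price'])
# 'origin' is hypothetical: an expected column that df1 does not have.
expected_cols = ['name_of_fruit', 'price', 'origin']

# Columns present but not expected, in their original order.
extra_cols = [c for c in df1.columns if c not in expected_cols]
# Columns expected but absent.
missing_cols = [c for c in expected_cols if c not in df1.columns]

dropped = df1[extra_cols]                       # keep the "dropped" columns
checked = df1.reindex(columns=expected_cols)    # missing columns are filled with NaN
```

Here reindex both removes the extra columns and adds the missing ones, so checked always has exactly the expected columns in the expected order.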
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
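For pandas versions where grouping along the columns axis is deprecated or unavailable, the same True/False split can be sketched with a boolean mask. This is an alternative spelling, not the code from the answer above:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']

# True where the column name is in the expected list.
mask = df.columns.isin(expected_cols)

# Same dict layout as before: True -> expected columns, False -> the rest.
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}
```

Both pieces keep the original column order and the original row index.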
I am trying to replicate the following operation on a dask dataframe where I have to filter the dataframe based on column value and multiply another column on that.
Following is pandas equivalent -
import dask.dataframe as dd
df['adjusted_revenue'] = 0
df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
I am trying to do this on a dask dataframe but it doesn't support assignment.
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] ==1]
df1['adjusted_revenue'] = 0.7*df1['gross_revenue']
df2 = df.loc[df['tracked'] ==0]
df2['adjusted_revenue'] = 0.3*df['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping if there is any simpler way to do this.
Thanks!
You should use .apply, which is probably the right thing to do with Pandas too; or perhaps where. However, to keep things as similar to your original as possible, here it is with map_partitions, in which you act on each piece of the dataframe independently, and those pieces really are Pandas dataframes.
def make_col(df):
    df['adjusted_revenue'] = 0
    df.loc[(df.tracked ==1), 'adjusted_revenue'] = 0.7*df['gross_revenue']
    df.loc[(df.tracked ==0), 'adjusted_revenue'] = 0.3*df['gross_revenue']
    return df
new_df = df.map_partitions(make_col)
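Because each partition is an ordinary Pandas dataframe, the per-partition function can be tried out on a small Pandas frame first. Here is a minimal sketch of a non-mutating variant built on numpy.where; it assumes tracked only ever takes the values 0 and 1 (unlike the version above, it has no fallback of 0 for other values), and the sample data are made up:

```python
import numpy as np
import pandas as pd

def make_col(part):
    # Pick the rate row by row: 0.7 where tracked == 1, otherwise 0.3.
    # Assumption: tracked contains only 0 and 1.
    rate = np.where(part['tracked'] == 1, 0.7, 0.3)
    # assign returns a new frame instead of modifying the partition in place.
    return part.assign(adjusted_revenue=rate * part['gross_revenue'])

# Hypothetical sample standing in for one partition.
pdf = pd.DataFrame({'tracked': [1, 0, 1],
                    'gross_revenue': [100.0, 200.0, 50.0]})
out = make_col(pdf)
```

A function written this way can be passed to map_partitions exactly like the one above.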
Suppose I have df1:
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
I would like to create df2 with the same shape, index and columns. I often find myself doing something like this:
df2 = pd.DataFrame(np.ones(df1.shape), index=df1.index, columns=df1.columns)
This is less than ideal. What's the pythonic way?
How about this:
df2 = df1.copy()
df2[:] = 1 # Or any other value, for that matter
The last line is not even necessary if all you want is to preserve the shape and the row/column headers.
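If the goal is a frame filled with one constant, another option worth sketching is the DataFrame constructor itself, which accepts a scalar when index and columns are given. The fill value 1.0 here is just an example:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=20)
df1 = pd.DataFrame(np.random.randint(10, size=(20, 3)),
                   index=dates, columns=['foo', 'bar', 'see'])

# A scalar is broadcast to every cell of the given index/columns grid.
df2 = pd.DataFrame(1.0, index=df1.index, columns=df1.columns)
```

This avoids copying the data of df1 only to overwrite it.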
You can also use the dataframe method "where" which will allow you to keep data based on condition and preserve the shape/index of the original df.
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
df2= df1.where(df1['foo'] % 2 == 0, 9999)
df2
I would like to create views or dataframes from an existing dataframe based on column selections.
For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn't work:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')
# This returns an error ("unhashable type: set")
df2 = df[my_cols]
What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?
You can either Drop the columns you do not need OR Select the ones you need
# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by Name
df1 = df1.drop(['B', 'C'], axis=1)
# Select the ones you want
df1 = df[['A', 'D']]
There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.
Here, the result is used to remove columns B and D from df:
df2 = df[df.columns.difference(['B', 'D'])]
Note that it's a set-based method, so duplicate column names will cause issues, and the column order may be changed.
Advantage over drop: you don't create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:
# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns
# does not create a copy of the dataframe
subset = df.columns.difference(['B', 'D'])
df = df.drop_duplicates(subset=subset)
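The note on column order can be illustrated with a small sketch. The frame below deliberately has its columns in reverse alphabetical order; sort=False is the documented switch for skipping the sort, and the list comprehension is shown as a fallback that keeps the original order by construction:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['D', 'C', 'B', 'A'])

# Default behaviour: the result is sorted.
sorted_cols = df.columns.difference(['B'])

# sort=False asks pandas not to sort the result.
unsorted_cols = df.columns.difference(['B'], sort=False)

# Plain Python alternative that preserves the original order.
kept = [c for c in df.columns if c != 'B']
```

Here sorted_cols comes back as A, C, D, while kept stays in the frame's own order D, C, A.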
Another option, without dropping or filtering in a loop:
import numpy as np
import pandas as pd
# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]
# or more simply include columns:
df[['A', 'B']]
# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]
# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically
df[df.columns.difference(['C', 'D'])]
You don't really need to convert that into a set:
cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
Also have a look into the built-in DataFrame.filter function.
Minimalistic but greedy approach (sufficient for the given df):
df.filter(regex="[^BD]")
Conservative/lazy approach (exact matches only):
df.filter(regex="^(?!(B|D)$).*$")
Conservative and generic:
exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
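To see where the greedy and the exact patterns diverge, here is a sketch on a frame with a hypothetical extra column named 'BD'. It also escapes the column names with re.escape, which is an addition for names containing regex metacharacters:

```python
import re
import pandas as pd

# 'BD' is a made-up column name used to show the difference.
df = pd.DataFrame([[1, 2, 3, 4]], columns=['A', 'B', 'BD', 'D'])

# Greedy: keeps any name containing a character other than B or D,
# so 'BD' is dropped along with 'B' and 'D'.
greedy = df.filter(regex='[^BD]')

# Exact: drops only names that are exactly 'B' or exactly 'D'.
exclude_cols = ['B', 'D']
pattern = '^(?!({})$).*$'.format('|'.join(map(re.escape, exclude_cols)))
exact = df.filter(regex=pattern)
```

The greedy pattern keeps only A, while the exact pattern keeps A and BD.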
You have 4 columns: A, B, C, D.
Here is a better way to select the columns you need for the new dataframe:
df2 = df1[['A','D']]
If you wish to use column positions instead, use:
df2 = df1.iloc[:, [0, 3]]
You just need to convert your set to a list
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
Here's how to create a copy of a DataFrame excluding a list of columns:
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)
But be careful! You mention views in your question, suggesting that if you changed df, you'd want df2 to change too. (Like a view would in a database.)
This method doesn't achieve that:
>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
A B C D
0 999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
A C
0 0.251262 -1.980673
Note that this is also true of @piggybox's method. (Although that method is nice and slick and Pythonic. I'm not doing it down!!)
For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.
In a similar vein, when reading a file, one may wish to exclude columns upfront, rather than wastefully reading unwanted data into memory and later discarding them.
As of pandas 0.20.0, usecols accepts callables. This allows more flexible options for reading columns:
skipcols = [...]
read_csv(..., usecols=lambda x: x not in skipcols)
The latter pattern is essentially the inverse of the traditional usecols method - only specified columns are skipped.
Given
Data in a file
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
filename = "foo.csv"
df.to_csv(filename)
Code
skipcols = ["B", "D"]
df1 = pd.read_csv(filename, usecols=lambda x: x not in skipcols, index_col=0)
df1
Output
A C
0 0.062350 0.076924
1 -0.016872 1.091446
2 0.213050 1.646109
3 -1.196928 1.153497
4 -0.628839 -0.856529
...
Details
A DataFrame was written to a file. It was then read back as a separate DataFrame, now skipping unwanted columns (B and D).
Note that for the OP's situation, since data is already created, the better approach is the accepted answer, which drops unwanted columns from an extant object. However, the technique presented here is most useful when directly reading data from files into a DataFrame.
A request was raised for a "skipcols" option in this issue and was addressed in a later issue.