Add a df under another df in pandas - Python

I'm using a for loop to generate an Excel file to graph the data from a df, so I'm using value_counts, but I would like to add under this df a second one with the same data expressed as percentages. My code is this:
li = []
for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()  # .drop(columns='index')
    value_percentage = (value_percentage * 100).astype(str) + '%'
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
data.to_excel("resultdf.xlsx")  # index cleaned
Basically I need it to look like this:

As long as the column names match between the two data frames, you should be able to use pd.concat() to concatenate them. To concatenate them vertically, I think you should use axis=0 instead of axis=1 (see the docs).
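As a minimal sketch of that idea (assuming a single hypothetical column 'Pop', pandas imported as pd, and pandas 1.x, where reset_index names the label column 'index'), build the counts and the formatted percentages, then stack them with axis=0:

counts = df['Pop'].value_counts().to_frame().reset_index()
percent = df['Pop'].value_counts(normalize=True).to_frame().reset_index()
percent.iloc[:, 1] = (percent.iloc[:, 1] * 100).astype(str) + '%'
# axis=0 stacks the second frame under the first instead of beside it
stacked = pd.concat([counts, percent], axis=0, ignore_index=True)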

Data
Let's prepare some dummy data to work with. Based on the provided screenshot, I'm assuming that the raw data are music genres graded on a scale of 1 to 5. So I'm going to use something like this as data:
import pandas as pd
from numpy.random import default_rng
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
Notes on the original code
There's no need to iterate by a column index. We can iterate through column names, as in for column in df.columns: df[column] ...
I think it's better to format the data with the help of map('{:.0%}'.format) before transforming them to a frame.
Instead of appending counted and normalized values one by one, we can pd.concat them vertically into a single frame and append that to the list.
So the original code may be rewritten like this:
li = []
for col in df.columns:
    value_counts = df[col].value_counts()
    value_percentage = df[col].value_counts(normalize=True).map('{:.0%}'.format)
    li.append(pd.concat([value_counts, value_percentage]).to_frame().reset_index())
resultdf = pd.concat(li, axis=1)
resultdf.to_excel("resultdf.xlsx")
Let Excel do the formatting
What if we let Excel format the data as percentages on its own? I think the easiest way to do this is to use Styler. But before that, I suggest getting rid of the index columns. As far as I can see, all of them refer to the same grades 1, 2, 3, 4, 5, so we can use those grades as a common index, thus making the index meaningful. Also, I'm going to use a MultiIndex to separate counted and normalized values like this:
formula = ['counts', 'percent']
values = [1, 2, 3, 4, 5]
counted = pd.DataFrame(index=pd.MultiIndex.from_product([formula, values], names=['formula', 'values']))
counted is our data container and it's empty at the moment. Let's fill it in:
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = pd.concat([counts, percent], keys=formula)
Having these data, let's apply some styling and only then export them to an Excel file:
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=pd.IndexSlice['counts', columns])
    .set_properties(**{'number-format': '0%'}, subset=pd.IndexSlice['percent', columns])
)
styled_data.to_excel('test.xlsx')
Now our data in Excel look like this:
All of them are numbers and we can use them in further calculations.
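For instance (a quick, hypothetical check, assuming the test.xlsx file written above), reading the workbook back confirms that the cells hold real numbers rather than formatted strings:

import pandas as pd
# the two leftmost columns hold the ('formula', 'values') MultiIndex
roundtrip = pd.read_excel('test.xlsx', index_col=[0, 1])
print(roundtrip.dtypes)  # float64 for every genre column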
Full code
from pandas import DataFrame, MultiIndex, IndexSlice, concat
from numpy.random import default_rng

# Initial parameters
rng = default_rng(0)
data_length = 100
genres = ['Pop', 'Dance', 'Rock', 'Jazz']
values = [1, 2, 3, 4, 5]
formula = ['counts', 'percent']
file_name = 'test.xlsx'

# Prepare data
data = rng.integers(min(values), max(values), size=(data_length, len(genres)), endpoint=True)
df = DataFrame(data, columns=genres)

# Prepare a container for counted data
index = MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = DataFrame(index=index)

# Fill in counted data
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = concat([counts, percent], keys=formula)

# Apply number formatting and save the data in an Excel file
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=IndexSlice['counts', :])
    .set_properties(**{'number-format': '0%'}, subset=IndexSlice['percent', :])
)
styled_data.to_excel(file_name)
P.S.
A note so you don't get confused: with this dummy data, the counts and percent parts show identical values. That's because of how the data were built: the initial data frame df has exactly 100 rows, so each count coincides with its percentage (e.g. a grade counted 27 times out of 100 is also 27%).
python 3.11.0
pandas 1.5.1
numpy 1.23.4
Update
If we want to keep the values for each column of the original data, but use Styler to set a number format for the second half of the output frame, then we have to somehow rename the index columns, because Styler requires unique column/index labels in a passed DataFrame. We can either rename them individually (e.g. "Values.Pop", etc.) or use a MultiIndex for the columns, which IMO looks better. Let's also take into account that the number of unique values may differ between columns, which means we have to collect the counts and percent values separately before combining them:
import pandas as pd
from numpy.random import default_rng

# Prepare dummy data with missing values in some columns
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
df['Pop'].replace([1, 5], 2, inplace=True)
df['Dance'].replace(3, 5, inplace=True)

# Collect counted values and their percentage
counts, percent = [], []
for col in df.columns:
    item = (
        df[col].value_counts()
        .rename('count')
        .rename_axis('value')
        .to_frame()
        .reset_index()
    )
    counts.append(item)
    percent.append(item.assign(count=item['count'] / item['count'].sum()))

# Combine counts and percent in a single data frame
counts = pd.concat(counts, axis=1, keys=df.columns)
percent = pd.concat(percent, axis=1, keys=df.columns)
resultdf = pd.concat([counts, percent], ignore_index=True)
# Note: in order to use resultdf in styling we should produce
# unique index labels for the output data.
# For this purpose we can use ignore_index=True
# or assign a key for each part, e.g. keys=['counts', 'percent']

# Format the second half of resultdf as Percent, i.e. "0%" in Excel terminology
styled_result = (
    resultdf.style
    .set_properties(
        **{'number-format': '0%'},
        subset=pd.IndexSlice[len(resultdf) // 2:, pd.IndexSlice[:, 'count']])
    # if we used keys instead of ignore_index to produce resultdf,
    # then len(resultdf) // 2: should be replaced with 'percent',
    # i.e. the name of the percent part
)
styled_result.to_excel('my_new_excel.xlsx')
The output in this case is going to look like this:

Related

When I name columns in dataframe it deletes my data

I created a pandas DataFrame that holds various summary statistics for several variables in my dataset. I want to name the columns of the dataframe, but every time I try it deletes all my data. Here is what it looks like without column names:
MIN = df.min(axis=0, numeric_only=True)
MAX = df.max(axis=0, numeric_only=True)
RANGE = MAX-MIN
MEAN = df.mean(axis=0, numeric_only=True)
MED = df.median(axis=0, numeric_only=True)
sum_stats = pd.concat([MIN, MAX, RANGE, MEAN, MED], axis=1)
sum_stats = pd.DataFrame(data=sum_stats)
sum_stats
My output looks like this:
But for some reason when I add column names:
sum_stats = pd.concat([MIN, MAX, RANGE, MEAN, MED], axis=1)
columns = ['MIN', 'MAX', 'RANGE', 'MEAN', 'MED']
sum_stats = pd.DataFrame(data=sum_stats, columns=columns)
sum_stats
My output becomes this:
Any idea why this is happening?
From the documentation for the columns parameter of the pd.DataFrame constructor:
[...] If data contains column labels, will perform column selection instead.
That means that, if the data passed is already a dataframe, for example, the columns parameter will act as a list of columns to select from the data.
If you change columns to equal a list of some columns that already exist in the dataframe that you're using, e.g. columns=[1, 4], you'll see that the resulting dataframe only contains those two columns, copied from the original dataframe.
Instead, you can assign the columns after you create the dataframe:
sum_stats.columns = ['MIN', 'MAX', 'RANGE', 'MEAN', 'MED']
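To make the selection behavior concrete, here is a minimal sketch (a hypothetical frame, not the asker's data):

import pandas as pd

stats = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0], 4: [5.0, 6.0]})
pd.DataFrame(data=stats, columns=[1, 4])    # selects columns 1 and 4
pd.DataFrame(data=stats, columns=['MIN'])   # no such label, so an all-NaN column
stats.columns = ['MIN', 'MAX', 'RANGE']     # assigning afterwards renames instead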

How to append several rows to an existing pandas dataframe with number of rows depending on a comprehension list

I am trying to fill an existing dataframe in pandas by adding several rows at a time; the number of rows depends on a comprehension list, so it is variable. The initial dataframe is filled as follows:
import pandas as pd
import portion as P
columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)
RANGE = [(212, 222),(866, 888),(152, 158)]
INTERVAL= P.Interval(*[P.closed(x, y) for x, y in RANGE])
def fill_df(df, junction, chr, type):
    df['Start'] = [x.lower for x in junction]
    df['End'] = [x.upper for x in junction]
    df['chr'] = chr
    df['type'] = type
    return df
z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows to the dataframe from different intervals (so a variable number of rows each time) onto the existing dataframe.
Here I have found different ways to add several rows, but none of them are easy to apply unless I write a function to convert my data into tuples or lists, which I am not sure would be efficient. I have also tried pandas append, but I was not able to do it for a bunch of lines.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start.pos'] = pd.Series([x.lower for x in junction])
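For context, a minimal sketch of why the pd.Series wrapper helps (a hypothetical frame, not the asker's data): assigning a bare list requires an exact length match, while a Series is aligned on the index and padded with NaN:

import pandas as pd

df = pd.DataFrame(index=range(5))   # 5 rows
values = [212, 866, 152]            # only 3 new values
# df['Start'] = values              # would raise a length-mismatch error
df['Start'] = pd.Series(values)     # aligns on the index, pads rows 3-4 with NaN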
If you want to use append and add several elements at once, you can create a second DataFrame and simply append it to the first one. It looks like this:
import intvalpy as ip
import pandas as pd
inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10,20])
df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
df = df.append(df2, ignore_index=True)
print(df.head(10))
The intvalpy library, specialized for classical and full interval arithmetic, is used here. To create an interval or intervals, use the Interval function, where the first argument is the left end and the second is the right end of the intervals.
The ignore_index parameter lets the index numbering of the first table continue across the appended rows.
In case you want to add one line, you can do it as follows:
for k in range(len(intervals)):
    df = df.append({'start': intervals[k].a, 'end': intervals[k].b}, ignore_index=True)
print(df.head(10))
I purposely did it with a loop to show that you can do it without creating a second table if you only want to add a few rows.
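One caveat worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea would be written with pd.concat. A minimal sketch, reusing df and intervals from the example above:

import pandas as pd
# build one frame from all the intervals, then concatenate it onto df
rows = pd.DataFrame({'start': [intervals[k].a for k in range(len(intervals))],
                     'end': [intervals[k].b for k in range(len(intervals))]})
df = pd.concat([df, rows], ignore_index=True)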

How to plot similarly named columns using pandas?

I've read in some csv files using pandas. For now it's only two files, but in a few weeks I'll be working with several hundred csv files with the same data variables.
I've used a for loop to read in the files and appended the dataframes to a single list, and then used this for loop to differentiate the names some:
for i, df in enumerate(separate_data, 1):
    df.columns = [col_name + '_df{}'.format(i) for col_name in df.columns]
My question is this, how can I compare the variables between the files using a bar plot? For example, one of the common variables is temperature, so after differentiating the column names I now have temp_df1 and temp_df2. How would I go about calling all temperature columns to compare them in a bar plot?
I tried using this, but could not get it to work:
for df in separate_data:
    temp_comp = separate_data.plot.bar(y='temp*')
Let's say you have the three dataframes below, each with a temp column. Here is how you iteratively combine the temp columns into a single new dataframe and plot them:
import matplotlib.pyplot as plt
import pandas as pd
df1 = pd.DataFrame({'temp':[100,150,200], 'pressure': [10,20,30]})
df2 = pd.DataFrame({'temp':[50,70,100], 'pressure': [10,25,40]})
df3 = pd.DataFrame({'temp':[110,80,120], 'pressure': [8,20,50]})
df_list = [df1,df2,df3]
df_combined = pd.DataFrame()
for i, df in enumerate(df_list):
    df_combined[f'df{i+1}'] = df['temp']
print('Combined Dataframe\n', df_combined)
df_combined.plot(kind = 'bar')
plt.ylabel('Temp')
plt.show()
#Combined Dataframe
   df1  df2  df3
0  100   50  110
1  150   70   80
2  200  100  120
Note that this assumes that all your dataframes have the same length. If this is not true, you can just read the first n (e.g. 50) rows from each dataframe to ensure equal lengths with:
df = pd.read_csv('sample.csv', nrows=50)
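Alternatively (a sketch under the same df_list assumption as above), pd.concat with axis=1 aligns on the index and pads shorter frames with NaN, so unequal lengths are handled without truncating anything:

df_combined = pd.concat(
    [df['temp'] for df in df_list], axis=1,
    keys=[f'df{i + 1}' for i in range(len(df_list))])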
If you can, I would read into one data frame with an identifier to make the aggregation easier. For example:
import pandas as pd
filenames = ["file_1.csv", "file_2.csv"]
df = pd.concat(
    [
        pd.read_csv(filename).assign(filename=filename.split(".")[0])
        for filename in filenames
    ]
)
df.groupby("filename")["column_to_plot"].mean().plot.bar()

Multiconditional mapping in Python pandas

I am looking for a way to do some conditional mapping using multiple comparisons.
I have millions and millions of rows that I am investigating using sample SQL extracts in pandas. Along with the SQL extracts read into pandas DataFrames, I also have some rules tables, each with a few columns (these are also loaded into dataframes).
This is what I want to do: where a row in my SQL extract matches the conditions expressed in any one row of my rules table, I would like to generate a 1, else a 0. In the end I would like to add a column to my SQL extract called Rule Result containing those 1's and 0's.
I have got a system that works using df.merge, but it produces many extra duplicate rows in the process that must then be removed afterwards. I am looking for a better, faster, more elegant solution and would be grateful for any suggestions.
Here is a working example of the problem and the current solution code:
import pandas as pd
import numpy as np
#Create a set of test data
test_df = pd.DataFrame()
test_df['A'] = [1,120,982,1568,29,455,None, None, None]
test_df['B'] = ['EU','US',None, 'GB','DE','EU','US', 'GB',None]
test_df['C'] = [1111,1121,1111,1821,1131,1111,1121,1821,1723]
test_df['C_C'] = test_df['C']
test_df
#Create a rules_table
rules_df = pd.DataFrame()
rules_df['A_MIN'] = [0,500,None,600,200]
rules_df['A_MAX'] = [10,1000,500,1200,800]
rules_df['B_B'] = ['GB','GB','US','EU','EU']
rules_df['C_C'] = [1111,1821,1111,1111,None]
rules_df
def map_rules_to_df(df, rules):
    # create a column that mimics the index to keep track of later duplication
    df['IND'] = df.index
    # merge the rules with the data on C values
    df = df.merge(rules, left_on='C_C', right_on='C_C', how='left')
    # create a rule result column with a default value of zero
    df['RULE_RESULT'] = 0
    # create a mask identifying those test_df rows that fit with a rules_df row
    mask = df[
        ((df['A'] > df['A_MIN']) | (df['A_MIN'].isnull())) &
        ((df['A'] < df['A_MAX']) | (df['A_MAX'].isnull())) &
        ((df['B'] == df['B_B']) | (df['B_B'].isnull())) &
        ((df['C'] == df['C_C']) | (df['C_C'].isnull()))
    ]
    # use mask.index to replace 0's in the result column with a 1
    df.loc[mask.index.tolist(), 'RULE_RESULT'] = 1
    # drop the redundant rules_df columns
    df = df.drop(['B_B', 'C_C', 'A_MIN', 'A_MAX'], axis=1)
    # drop duplicate rows
    df = df.drop_duplicates(keep='first')
    # drop rows where the original index is duplicated and the rule result is zero
    df = df[((df['IND'].duplicated(keep=False)) & (df['RULE_RESULT'] == 0)) == False]
    # reset the df index with the original index
    df.index = df['IND'].values
    # drop the now redundant second index column (IND)
    df = df.drop('IND', axis=1)
    print('df shape', df.shape)
    return df
#map the rules
result_df = map_rules_to_df(test_df,rules_df)
result_df
I hope I have made what I would like to do clear, and thank you for your help.
PS: my rep is non-existent, so I was not allowed to post more than two supporting images.
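For what it's worth, one merge-free sketch of the same logic (assuming test_df and rules_df exactly as defined above; the helper names a, b, c, hits are mine): compare every data row against every rule row at once with numpy broadcasting, treat null rule fields as wildcards, and reduce across rules with any():

import numpy as np
import pandas as pd

a = test_df['A'].to_numpy(dtype=float)[:, None]            # shape (n_rows, 1)
b = test_df['B'].to_numpy()[:, None]
c = test_df['C'].to_numpy(dtype=float)[:, None]
a_min = rules_df['A_MIN'].to_numpy(dtype=float)[None, :]   # shape (1, n_rules)
a_max = rules_df['A_MAX'].to_numpy(dtype=float)[None, :]
b_b = rules_df['B_B'].to_numpy()[None, :]
c_c = rules_df['C_C'].to_numpy(dtype=float)[None, :]

hits = (
    ((a > a_min) | np.isnan(a_min)) &
    ((a < a_max) | np.isnan(a_max)) &
    # note: with object dtype, None == None compares True, unlike pandas'
    # NaN semantics; B_B has no nulls in this rules table, so it doesn't matter here
    ((b == b_b) | pd.isnull(b_b)) &
    ((c == c_c) | np.isnan(c_c))
)
test_df['RULE_RESULT'] = hits.any(axis=1).astype(int)

This produces the same 1/0 column without creating intermediate duplicate rows; whether it actually beats the merge at millions of rows would need measuring.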

Add subtotal columns in pandas with multi-index

I have a dataframe with a 3-level deep multi-index on the columns. I would like to compute subtotals across rows (sum(axis=1)) where I sum across one of the levels while preserving the others. I think I know how to do this using the level keyword argument of pd.DataFrame.sum. However, I'm having trouble thinking of how to incorporate the result of this sum back into the original table.
Setup:
import numpy as np
import pandas as pd
from itertools import product
np.random.seed(0)
colors = ['red', 'green']
shapes = ['square', 'circle']
obsnum = range(5)
rows = list(product(colors, shapes, obsnum))
idx = pd.MultiIndex.from_tuples(rows)
idx.names = ['color', 'shape', 'obsnum']
df = pd.DataFrame({'attr1': np.random.randn(len(rows)),
                   'attr2': 100 * np.random.randn(len(rows))},
                  index=idx)
df.columns.names = ['attribute']
df = df.unstack(['color', 'shape'])
Gives a nice frame like so:
Say I wanted to reduce the shape level. I could run:
tots = df.sum(axis=1, level=['attribute', 'color'])
to get my totals like so:
Once I have this, I'd like to tack it on to the original frame. I think I can do this in a somewhat cumbersome way:
tots = df.sum(axis=1, level=['attribute', 'color'])
newcols = pd.MultiIndex.from_tuples(list((i[0], i[1], 'sum(shape)') for i in tots.columns))
tots.columns = newcols
bigframe = pd.concat([df, tots], axis=1).sort_index(axis=1)
Is there a more natural way to do this?
Here is a way without loops:
s = df.sum(axis=1, level=[0,1]).T
s["shape"] = "sum(shape)"
s.set_index("shape", append=True, inplace=True)
df.combine_first(s.T)
The trick is to work with the transposed sum, so we can insert another column (i.e. a row in the original orientation) holding the label for the reduced level, named exactly like the level we summed over. This column can be converted into an index level with set_index. Then we combine df with the transposed sum. If the summed level is not the last one, you might need some level reordering.
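For instance, a hedged sketch of that reordering, under the same setup, if we had summed over 'color' instead of 'shape' (s2 and out are my names):

s2 = df.sum(axis=1, level=['attribute', 'shape']).T
s2['color'] = 'sum(color)'
s2.set_index('color', append=True, inplace=True)
# the appended level ends up last, so restore the original level order
out = df.combine_first(s2.T.reorder_levels(['attribute', 'color', 'shape'], axis=1))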
Here's my brute-force way of doing it.
After running your well-written (thank you) sample code, I did this:
attributes = pd.unique(df.columns.get_level_values('attribute'))
colors = pd.unique(df.columns.get_level_values('color'))
for attr in attributes:
    for clr in colors:
        df[(attr, clr, 'sum')] = df.xs([attr, clr], level=['attribute', 'color'], axis=1).sum(axis=1)
df
Which gives me:
