When I name columns in a dataframe it deletes my data - python

I created a pandas DataFrame that holds various summary statistics for several variables in my dataset. I want to name the columns of the dataframe, but every time I try it deletes all my data. Here is what it looks like without column names:
MIN = df.min(axis=0, numeric_only=True)
MAX = df.max(axis=0, numeric_only=True)
RANGE = MAX-MIN
MEAN = df.mean(axis=0, numeric_only=True)
MED = df.median(axis=0, numeric_only=True)
sum_stats = pd.concat([MIN, MAX, RANGE, MEAN, MED], axis=1)
sum_stats = pd.DataFrame(data=sum_stats)
sum_stats
My output looks like this:
But for some reason when I add column names:
sum_stats = pd.concat([MIN, MAX, RANGE, MEAN, MED], axis=1)
columns = ['MIN', 'MAX', 'RANGE', 'MEAN', 'MED']
sum_stats = pd.DataFrame(data=sum_stats, columns=columns)
sum_stats
My output becomes this:
Any idea why this is happening?

From the documentation for the columns parameter of the pd.DataFrame constructor:
[...] If data contains column labels, will perform column selection instead.
That means that, if the data passed is already a dataframe, for example, the columns parameter will act as a list of columns to select from the data.
If you change columns to equal a list of some columns that already exist in the dataframe that you're using, e.g. columns=[1, 4], you'll see that the resulting dataframe only contains those two columns, copied from the original dataframe.
Instead, you can assign the columns after you create the dataframe:
sum_stats.columns = ['MIN', 'MAX', 'RANGE', 'MEAN', 'MED']
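As a side note (this alternative is my addition, not from the original post): pd.concat itself accepts a keys argument that labels the concatenated pieces, so the columns can be named in the same step. A minimal sketch with a made-up two-column df:

```python
import pandas as pd

# Hypothetical small dataset standing in for the original df
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

MIN = df.min(axis=0, numeric_only=True)
MAX = df.max(axis=0, numeric_only=True)

# keys= names each concatenated piece, so the columns come out labelled
sum_stats = pd.concat([MIN, MAX, MAX - MIN], axis=1,
                      keys=['MIN', 'MAX', 'RANGE'])
print(sum_stats.columns.tolist())  # ['MIN', 'MAX', 'RANGE']
```

This avoids touching the columns parameter of the DataFrame constructor entirely, so the column-selection behavior described above never comes into play.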

Related

Add df under other df Pandas

I'm using a for loop to generate an Excel file to graph the data from a df, using value_counts. I would like to add, under this df, a second one with the same data but as percentages. My code is this:
li = []
for i in range(0, len(df.columns)):
    value_counts = df.iloc[:, i].value_counts().to_frame().reset_index()
    value_percentage = df.iloc[:, i].value_counts(normalize=True).to_frame().reset_index()#.drop(columns='index')
    value_percentage = (value_percentage*100).astype(str)+'%'
    li.append(value_counts)
    li.append(value_percentage)
data = pd.concat(li, axis=1)
data.to_excel("resultdf.xlsx") #index cleaned
Basically I need it to look like this:
As long as the column names match between the two data frames, you should be able to use pd.concat() to concatenate them. To concatenate them vertically, use axis=0 instead of axis=1 (see the docs).
Data
Let's prepare some dummy data to work with. Based on the provided screenshot, I'm assuming the raw data are music genres graded on a scale of 1 to 5, so I'll use data like this:
import pandas as pd
from numpy.random import default_rng
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
Notes on the original code
There's no need to iterate by column index. We can iterate over column names, as in for column in df.columns: df[column] ...
I think it's better to format the data with the help of map('{:.0%}'.format) before transforming them to a frame.
Instead of appending counted and normalized values one by one, we'd better pd.concat them vertically into a single frame and append that to the list.
So the original code may be rewritten like this:
li = []
for col in df.columns:
    value_counts = df[col].value_counts()
    value_percentage = df[col].value_counts(normalize=True).map('{:.0%}'.format)
    li.append(pd.concat([value_counts, value_percentage]).to_frame().reset_index())
resultdf = pd.concat(li, axis=1)
resultdf.to_excel("resultdf.xlsx")
Let Excel do formatting
What if we let Excel format the data as percentages on its own? I think the easiest way to do this is to use Styler. But before that, I suggest getting rid of the Index columns. As far as I can see, all of them refer to the same grades 1, 2, 3, 4, 5, so we can use those as a common index, making the indexes meaningful. I'm also going to use a MultiIndex to separate counted and normalized values, like this:
formula = ['counts', 'percent']
values = [1, 2, 3, 4, 5]
counted = pd.DataFrame(index=pd.MultiIndex.from_product([formula, values], names=['formula', 'values']))
counted is our data container and it's empty at the moment. Let's fill it in:
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = pd.concat([counts, percent], keys=formula)
Having these data, let's apply some style to them and only then transform into an Excel file:
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=pd.IndexSlice['counts', columns])
    .set_properties(**{'number-format': '0%'}, subset=pd.IndexSlice['percent', columns])
)
styled_data.to_excel('test.xlsx')
Now our data in Excel are looking like this:
All of them are numbers and we can use them in further calculations.
Full code
from pandas import DataFrame, MultiIndex, IndexSlice, concat
from numpy.random import default_rng
# Initial parameters
rng = default_rng(0)
data_length = 100
genres = ['Pop', 'Dance', 'Rock', 'Jazz']
values = [1, 2, 3, 4, 5]
formula = ['counts', 'percent']
file_name = 'test.xlsx'
# Prepare data
data = rng.integers(min(values), max(values), size=(data_length, len(genres)), endpoint=True)
df = DataFrame(data, columns=genres)
# Prepare a container for counted data
index = MultiIndex.from_product([formula, values], names=['formula', 'values'])
counted = DataFrame(index=index)
# Fill in counted data
for col in df.columns:
    counts = df[col].value_counts()
    percent = counts / counts.sum()
    counted[col] = concat([counts, percent], keys=formula)
# Apply number formatting and save the data in an Excel file
styled_data = (
    counted.style
    .set_properties(**{'number-format': '0'}, subset=IndexSlice['counts', :])
    .set_properties(**{'number-format': '0%'}, subset=IndexSlice['percent', :])
)
styled_data.to_excel(file_name)
P.S.
A note to avoid confusion: with this dummy data, the counts and percent parts show identical values. That's a consequence of how the data were built. The initial data frame df holds 100 values per column, so each count coincides with its percentage.
python 3.11.0
pandas 1.5.1
numpy 1.23.4
Update
If we want to keep the values for each column of the original data, but use Styler to set a number format for the second half of the output frame, then we need to somehow rename the Index columns, because Styler requires unique column/index labels in the DataFrame it is passed. We can either rename them (e.g. "Values.Pop", etc.) or use multi-indexing for the columns, which IMO looks better. Let's also take into account that the number of unique values may differ between columns, which means we have to collect data separately for counts and percent values before combining them:
import pandas as pd
from numpy.random import default_rng
# Prepare dummy data with missing values in some columns
rng = default_rng(0)
columns = ['Pop', 'Dance', 'Rock', 'Jazz']
data = rng.integers(1, 5, size=(100, len(columns)), endpoint=True)
df = pd.DataFrame(data, columns=columns)
df['Pop'].replace([1,5], 2, inplace=True)
df['Dance'].replace(3, 5, inplace=True)
# Collect counted values and their percentage
counts, percent = [], []
for col in df.columns:
    item = (
        df[col].value_counts()
        .rename('count')
        .rename_axis('value')
        .to_frame()
        .reset_index()
    )
    counts.append(item)
    percent.append(item.assign(count=item['count']/item['count'].sum()))
# Combine counts and percent in a single data frame
counts = pd.concat(counts, axis=1, keys=df.columns)
percent = pd.concat(percent, axis=1, keys=df.columns)
resultdf = pd.concat([counts, percent], ignore_index=True)
# Note: In order to use resultdf in styling we should produce
# unique index labels for the output data.
# For this purpose we can use ignore_index=True
# or assign some keys for each part, e.g. key=['counted', 'percent']
# Format the second half of resultdf as Percent, ie. "0%" in Excel terminology
styled_result = (
    resultdf.style
    .set_properties(
        **{'number-format': '0%'},
        subset=pd.IndexSlice[len(resultdf)//2:, pd.IndexSlice[:, 'count']])
    # if we used keys instead of ignore_index to produce resultdf,
    # then len(resultdf)//2: should be replaced with 'percent',
    # i.e. the name of the percent part
)
styled_result.to_excel('my_new_excel.xlsx')
The output in this case is gonna look like this:

Creating a function to describe a set of dataframes

I want to create a function which will give the output of df.describe for all the dataframes passed to the function argument.
My idea was to store the names of all the dataframes I need to describe as columns in a separate dataframe (x) and then pass this to the function.
Here is what I have made and the output:
The problem is that it's only showing the description of one dataframe.
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\'+column+'.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True, format='%d&m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
    return(column.describe())
data = {'UGCAA':[],'FAPG1':[],'ACSO5':[],'LGHF2':[],'LGMP8':[],'GGAF1':[]}
df=pd.DataFrame(data)
df
des(df)
Sales
count 948.000000
mean 876.415612
std 874.373236
min 1.000000
25% 298.750000
50% 619.500000
75% 1148.500000
max 7345.00000
I believe you can create a list of DataFrames and concat them all together at the end:
def des(df):
    dfs = []
    for column in df.columns:
        df1 = pd.read_csv('SKUs\\'+column+'.csv')
        df1['Date'] = pd.to_datetime(df1['Date'].astype(str),
                                     format='%d%m%y', infer_datetime_format=True)
        df1.dropna(inplace=True)
        dfs.append(df1.describe())
    return pd.concat(dfs, axis=1, keys=df.columns)
It's because you are looping over and resetting column each time while returning only once. To see what's happening, you can print the describe in each loop iteration, or store the results together in one variable and handle them after the loop.
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\'+column+'.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True, format='%d%m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
        print(column.describe())

Sorting a grouped dataframe

I have a dataframe with columns ['name', 'sex', 'births', 'year']. I then group the dataframe on the basis of name to create 2 new columns "max" and "total".
trendy_names['max'] = trendy_names.groupby(['name'], as_index = False)['births'].transform('max')
trendy_names['total'] = trendy_names.groupby(['name'], as_index = False)['births'].transform('sum')
Using these 2 columns, I create a calculated column "trendiness".
trendy_names['trendiness'] = trendy_names['max']/trendy_names['total']
Then, I segregate those that have a total number of births greater than 1000.
trendy_names = trendy_names[trendy_names.total >= 1000]
Now, I want to sort the dataframe on the basis of "trendiness" column. Any thoughts?
To sort the dataframe by the "trendiness" column:
1. trendy_names = trendy_names.reset_index()
reset_index() converts back to a regular index, i.e. from pandas.core.groupby.DataFrameGroupBy to pandas.core.frame.DataFrame.
2. trendy_names = trendy_names.sort_values(by='trendiness')
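A minimal sketch of the whole pipeline (the names and birth counts below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample of the birth-name data
trendy_names = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob', 'Bob'],
    'sex': ['F', 'F', 'M', 'M'],
    'births': [900, 600, 800, 800],
    'year': [2000, 2001, 2000, 2001],
})

# Per-name max and total, aligned back onto the original rows
trendy_names['max'] = trendy_names.groupby('name')['births'].transform('max')
trendy_names['total'] = trendy_names.groupby('name')['births'].transform('sum')
trendy_names['trendiness'] = trendy_names['max'] / trendy_names['total']

# Keep names with at least 1000 total births, then sort, highest first
trendy_names = trendy_names[trendy_names.total >= 1000]
trendy_names = trendy_names.sort_values(by='trendiness', ascending=False)
print(trendy_names[['name', 'trendiness']].drop_duplicates('name'))
```

With this toy data, 'Ann' peaks at 900 of 1500 total births (trendiness 0.6) and sorts above 'Bob' (800 of 1600, trendiness 0.5).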

Group by a column and return multiple aggregates as a dataframe

I have a csv having multiple columns.
As an example, here is the header and the first 2 rows of the file:
ACC;SYM;SumRealPNL;Count;MinAVG;PerLotPNL;SumOneLotPNL;ProfitOnly;ProfitOnlyCount;ProfitOnlyMinAVG;LossOnly;LossOnlyCount;LossOnlyMinAVG;Period;-;P;Q;R;S;Total;U;AS;W;YEAH;Y
31942;EURUSD;4.593,00;17;730;336,47;5.720,00;5.720,00;17;730;0,00;0;0;4;;1;2;0;1;4;A;31942EURUSD1;12;16;18
34887;XAUUSD;16.150,00;7;276;588,43;4.119,00;4.119,00;7;276;0,00;0;0;4;;1;2;0;1;4;A;34887XAUUSD1;12;16;18
I load the csv file to a dataframe:
df = pd.read_csv('aaaa.csv', header=0, sep=';')
I grouped the dataframe by AS column:
byAS = df.groupby('AS')
Now I want to create a new dataframe having the following columns using the DataFrameGroupBy object (byAS):
AS column
First value of ACC column
First value of U column
Average of PerLotPNL column
Sum of SumOneLotPNL column
Sum of Y column
How can I do that?
Once you have your dataframe df and have grouped on the AS column as in your post, you can use the agg function to obtain the desired output. Note that reset_index(inplace=True) returns None, so don't assign its result; call reset_index() without inplace instead:
byAS = df.groupby('AS')
result = byAS.agg({'ACC': 'first',
                   'U': 'first',
                   'PerLotPNL': 'mean',
                   'SumOneLotPNL': 'sum',
                   'Y': 'sum'}).reset_index()
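Since pandas 0.25 the same result can also be written with named aggregation, which makes the output column names explicit. A runnable sketch with a made-up miniature of the CSV (values are illustrative only):

```python
import pandas as pd

# Hypothetical miniature of the CSV data
df = pd.DataFrame({
    'AS': ['31942EURUSD1', '31942EURUSD1', '34887XAUUSD1'],
    'ACC': [31942, 31942, 34887],
    'U': [12, 12, 12],
    'PerLotPNL': [2.0, 4.0, 6.0],
    'SumOneLotPNL': [1.0, 2.0, 3.0],
    'Y': [5, 5, 5],
})

# Named aggregation: output_column=(input_column, aggregation)
result = df.groupby('AS').agg(
    ACC=('ACC', 'first'),
    U=('U', 'first'),
    PerLotPNL=('PerLotPNL', 'mean'),
    SumOneLotPNL=('SumOneLotPNL', 'sum'),
    Y=('Y', 'sum'),
).reset_index()
print(result)
```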

Python Pandas Timeseries Sum Daily Column Data

I'm stuck trying to figure out how to sum one of the columns in my dataframe based on day/month/year etc. I don't want to perform the aggregation on the other columns. As the dataframe will become shorter, I would like to use the minimum value from the other columns of the dataframe.
This is what I have, but it does not produce what I want. It only sums the first and last part and then gives me NaN values for the rest.
df = pd.DataFrame(zip(points, data, junk), columns=['Dates', 'Data', 'Junk'])
df.set_index('Dates', inplace=True)
_add = {'Data': np.sum, 'Junk': np.min}
newdf = df.resample('D', how=_add)
Thanks
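No answer is attached to this question in the thread, but a sketch of one way to do it with the current API might look like the following. (The how= argument to resample used in the question was removed in later pandas versions; .agg with a per-column dict is the replacement. The data below are made up.)

```python
import pandas as pd

# Hypothetical hourly data spanning two days
dates = pd.date_range('2024-01-01', periods=48, freq='h')
df = pd.DataFrame({'Data': 1, 'Junk': range(48)}, index=dates)
df.index.name = 'Dates'

# Per day: sum the 'Data' column, take the minimum of 'Junk'
newdf = df.resample('D').agg({'Data': 'sum', 'Junk': 'min'})
print(newdf)
```

Each daily row sums 24 hourly 'Data' values while keeping the smallest 'Junk' value of that day, which matches the question's goal of aggregating one column and taking the minimum of the others.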