Creating a function to describe a set of dataframes - python

I want to create a function that gives the output of df.describe() for every DataFrame passed to it as an argument.
My idea was to store the names of all the DataFrames I need to describe as columns in a separate DataFrame (x) and then pass that to the function.
Here is what I have made, along with the output.
The problem is that it only shows the description of one DataFrame:
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\' + column + '.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True,
                                        format='%d&m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
    return column.describe()
data = {'UGCAA': [], 'FAPG1': [], 'ACSO5': [], 'LGHF2': [], 'LGMP8': [], 'GGAF1': []}
df = pd.DataFrame(data)
df
des(df)
Sales
count     948.000000
mean      876.415612
std       874.373236
min         1.000000
25%       298.750000
50%       619.500000
75%      1148.500000
max      7345.000000

I believe you can create a list of DataFrames and concatenate them together at the end:
def des(df):
    dfs = []
    for column in df.columns:
        df1 = pd.read_csv('SKUs\\' + column + '.csv')
        df1['Date'] = pd.to_datetime(df1['Date'].astype(str),
                                     format='%d%m%y', infer_datetime_format=True)
        df1.dropna(inplace=True)
        dfs.append(df1.describe())
    return pd.concat(dfs, axis=1, keys=df.columns)
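A usage sketch, assuming the SKU CSV files named in the question exist under SKUs\:

data = {'UGCAA': [], 'FAPG1': [], 'ACSO5': [], 'LGHF2': [], 'LGMP8': [], 'GGAF1': []}
df = pd.DataFrame(data)
summary = des(df)        # one describe() block per SKU, side by side
print(summary['UGCAA'])  # statistics for a single SKU via the column keys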

It's because you are looping over and resetting column each time while only returning one. To visualize it, you can print the describe in each loop, or store the results together in one variable and handle them after the loop.
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\' + column + '.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True,
                                        format='%d&m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
        print(column.describe())
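If you would rather keep the results than print them, one option (a small sketch along the same lines, with illustrative names) is to collect them in a dict keyed by the SKU name:

def des(df):
    described = {}
    for column in df.columns:
        sku = pd.read_csv('SKUs\\' + column + '.csv')
        sku['Date'] = pd.to_datetime(sku['Date'].astype(str), dayfirst=True)
        sku.dropna(inplace=True)
        described[column] = sku.describe()
    return described  # maps each SKU name to its describe() table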

Related

Apply function to multiple dataframes and create multiple dataframes from that

I have a list of multiple data frames on cryptocurrency. I want to apply a function to all of these data frames that converts them so that I am only left with data from 2021.
The function looks like this:
dataframe_list = [bitcoin, aave, binance, cardano, chainlink, cosmos, crypto_com,
                  dogecoin, eos, ethereum, iota, litecoin, monero, nem, polkadot,
                  solana, stellar, tether, uniswap, usdcoin, wrapped, xrp]

def date_func(i):
    i['Date'] = pd.to_datetime(i['Date'])
    i = i.set_index(i['Date'])
    i = i.sort_index()
    i = i['2021-01-01':]
    return i

for dataframe in dataframe_list:
    dataframe = date_func(dataframe)
However, I am only left with one data frame called 'dataframe', which only contains the values of the xrp dataframe.
I would like a new dataframe created from each dataframe, called aave21, bitcoin21, ..., which only contains values from 2021 onwards.
What am I doing wrong?
Best regards and thanks in advance.
You are overwriting dataframe when iterating over dataframe_list, i.e. you only keep the latest dataframe.
You can either try:
dataframe = pd.DataFrame()
for df in dataframe_list:
    dataframe = dataframe.append(date_func(df))
Or shorter:
dataframe = pd.concat([date_func(df) for df in dataframe_list])
You are overwriting the dataframe variable in your for loop when iterating over dataframe_list. You need to keep appending the results into a new variable.
final_df = pd.DataFrame()
for dataframe in dataframe_list:
    final_df = final_df.append(date_func(dataframe))
print(final_df)
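If the goal is one named result per coin (aave21, bitcoin21, ...), a common pattern is a dict keyed by name rather than separate variables. A sketch, assuming a names list that matches the order of dataframe_list (only the first few shown here):

names = ['bitcoin', 'aave', 'binance']  # one name per entry in dataframe_list
results_2021 = {name + '21': date_func(df)
                for name, df in zip(names, dataframe_list)}
results_2021['bitcoin21'].head()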

Create a new column to data frame from existing columns using Pandas

I am not sure how to build a data frame here, but I am looking for a way to take the data from multiple columns and combine them into one column, not as a sum but as a joined value.
Ex. MB|Val|34567|W123 -> MB|Val|34567|W123|MB_Val_34567_W123
What I have tried so far is creating a conditions variable that checks whether a particular column matches a given value,
conditions = [(Groupings_df['GroupingCriteria1'] == 'MB')]
then a values variable that would include what I want in the new column
values = ['MB_Val_34567_W123']
and lastly grouping it
Groupings_df['GroupingColumn'] = np.select(conditions,values)
This works for 1 row but it would be inefficient to keep manually changing the number in the values variable (34567) over a df with thousands of rows
IIUC, you want to create a new column as a concatenation of each row:
df = pd.DataFrame({'GC1': ['MB'], 'GC2': ['Val'], 'GC3': [34567], 'GC4': ['W123'],
                   'Dummy': [10], 'Other': ['Hello']})
df['GC'] = df.filter(like='GC').astype(str).apply(lambda x: '_'.join(x), axis=1)
print(df)
# Output
  GC1  GC2    GC3   GC4  Dummy  Other                 GC
0  MB  Val  34567  W123     10  Hello  MB_Val_34567_W123
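If the relevant columns are known in advance, listing them explicitly avoids accidentally picking up any other column whose name happens to contain 'GC' (a sketch using the same sample frame):

df['GC'] = df[['GC1', 'GC2', 'GC3', 'GC4']].astype(str).agg('_'.join, axis=1)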

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1()
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate checking df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or fewer. In cases where there are fewer values than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then how to handle the issue where there are more columns than expected.
You can use set difference using -:
Assuming df1 having cols:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, axis=1, inplace=True)
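Note that set difference does not preserve the original column order; if order matters, a list comprehension keeps it (same variables as above):

unwanted_cols = [c for c in df1.columns if c not in expected_cols]
df2 = df1[unwanted_cols]
df1 = df1.drop(columns=unwanted_cols)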
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
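In recent pandas versions, grouping along the columns axis is deprecated; if that applies to you, the same split can be done with a boolean mask over the columns (a sketch using the sample data above):

mask = df.columns.isin(expected_cols)
in_list = df.loc[:, mask]       # name_of_fruit, price
not_in_list = df.loc[:, ~mask]  # type_of_fruit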

Create multiple dataframes by changing one column with a for loop?

I am calculating heat decay from spent fuel rods using variable cooling times. How can I create multiple dataframes, by varying the cooling time column with a for loop, then write them to a file?
Using datetime objects, I am creating multiple columns of cooling time values by subtracting a future date from the date the fuel rod was discharged.
I then tried to use a for loop to index these columns into a new dataframe, with the intent to streamline multiple files by using newly created dataframes in a new function.
df = pd.read_excel('data')
df.columns = ['ID','Enr','Dis','Mtu']
# Discharge Dates
_0 = dt.datetime(2020,12,1)
_1 = dt.datetime(2021,6,1)
_2 = dt.datetime(2021,12,1)
_3 = dt.datetime(2022,6,1)
# Variable Cooling Time Columns
df['Ct_0[Years]'] = df['Dis'].apply(lambda x: (((_0 - x).days)/365))
df['Ct_1[Years]'] = df['Dis'].apply(lambda x: (((_1 - x).days)/365))
df['Ct_2[Years]'] = df['Dis'].apply(lambda x: (((_2 - x).days)/365))
df['Ct_3[Years]'] = df['Dis'].apply(lambda x: (((_3 - x).days)/365))
# Attempting to index columns into new data frame
for i in range(4):
    df = df[['ID','Mtu','Enr','Ct_%i[Years]'%i]]
    tfile = open('Inventory_FA_%s.prn'%i,'w')
    ### Apply conditions for flagging
    tfile.close()
I was expecting the created cooling time columns to be indexed into the newly defined dataframe df. Instead I received the following error:
KeyError: "['Ct_1[Years]'] not in index"
Thank you for the help.
You are overwriting your dataframe in each iteration of your loop with the line:
df = df[['ID','Mtu','Enr','Ct_%i[Years]'%i]]
which is why you are fine on your first iteration (error doesn't say anything about 'Ct_0[Years]' not being in the index), and then die on your second iteration. You've dropped everything but the columns you selected in your first iteration. Select your columns into a temporary df instead:
for i in range(4):
    df_temp = df[['ID','Mtu','Enr','Ct_%i[Years]'%i]]
    tfile = open('Inventory_FA_%s.prn'%i,'w')
    ### Apply conditions for flagging using df_temp
    tfile.close()
Depending on what your conditions are, there might be a better way to do this that doesn't require making a temporary view into the dataframe, but this should help.
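A minimal sketch of the write step inside that loop, assuming a plain text dump of the selected columns is acceptable for the .prn files (the flagging logic is left as a placeholder):

for i in range(4):
    df_temp = df[['ID', 'Mtu', 'Enr', 'Ct_%i[Years]' % i]]
    # ... apply flagging conditions on df_temp here ...
    with open('Inventory_FA_%s.prn' % i, 'w') as tfile:
        tfile.write(df_temp.to_string(index=False))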
Why are you creating a new dataframe? Is it only to reorganize/drop columns? Engineero is right: you are effectively rewriting df on each iteration.
Anyway, you could try:
dfnew = df[['ID','Mtu','Enr']]
for i in range(4):
    dftemp = df[['Ct_%i[Years]'%i]]
    dfnew = dfnew.join(dftemp)

Dataframe reverse for drop(column = )

I'm trying to manipulate a dataframe using a cumsum function.
My data looks like this:
To perform my cumsum, I use
df = pd.read_excel(excel_sheet, sheet_name='Sheet1').drop(columns=['Material']) # Dropping material column
I run the rest of my code, and get my expected outcome of a dataframe cumsum without the material listed:
df2 = df.as_matrix() #Specifying Array format
new = df2.cumsum(axis=1)
print(new)
However, at the end, I need to replace this material column. I'm unsure how to use the add function to get this back to the beginning of the dataframe.
IIUC, you can just set the Material column as the index, do your cumsum, and put it back at the end:
df2 = df.set_index('Material').cumsum(1).reset_index()
An alternative would be to do your cumsum on all but the first column:
df.iloc[:,1:] = df.iloc[:,1:].cumsum(1)
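A small self-contained example of the first approach, with made-up column names that mirror the question:

import pandas as pd

df = pd.DataFrame({'Material': ['A', 'B'],
                   'Jan': [1, 2], 'Feb': [3, 4], 'Mar': [5, 6]})
df2 = df.set_index('Material').cumsum(axis=1).reset_index()
print(df2)
#   Material  Jan  Feb  Mar
# 0        A    1    4    9
# 1        B    2    6   12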
