Using Classes in Pandas - python

I'm trying to write better, more readable code and I've started using classes.
It's been confusing so far, but I can see the positives.
That said, I'm simply trying to merge 2 dataframes.
Previously I've achieved this by using:
import pandas as pd
path = 'path/to/file.xlsx'
df = pd.read_excel(path, 'sheet1')
df2 = pd.read_excel(path, 'sheet2')
df3 = df.merge(df2, how='left', on='column1')
Trying to implement this using classes, I have the following so far, which could be incorrect:
import pandas as pd

path = 'path/to/file.xlsx'

class CreateOpFile:
    def __init__(self, path):
        self.df = pd.read_excel(path, 'sheet1')
        self.df2 = pd.read_excel(path, 'sheet2')

    def MergeDataFrames(self):
        pd.merge(self.df, self.df2, how='left', on='column1')
So I'm confused as to how I create a new variable, let's say df3, outside of the class CreateOpFile, as I did with df3 = df.merge(df2, how='left', on='column1') in the first approach?

One way to do it is to return the new, merged DataFrame and assign it to df3 outside of the class.
import pandas as pd

path = 'path/to/file.xlsx'

class CreateOpFile:
    def __init__(self, path):
        self.df = pd.read_excel(path, 'sheet1')
        self.df2 = pd.read_excel(path, 'sheet2')

    def MergeDataFrames(self):
        return pd.merge(self.df, self.df2, how='left', on='column1')

df3 = CreateOpFile(path).MergeDataFrames()
Btw, according to the naming conventions in PEP 8, method names should consist of lowercase letters with underscores separating words, so merge_data_frames() seems like a better name.
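For illustration, a minimal sketch of the same class with that naming (same sheets and merge column as above):
import pandas as pd

class CreateOpFile:
    def __init__(self, path):
        self.df = pd.read_excel(path, 'sheet1')
        self.df2 = pd.read_excel(path, 'sheet2')

    def merge_data_frames(self):
        # returning the result lets the caller assign it, e.g. to df3
        return pd.merge(self.df, self.df2, how='left', on='column1')

df3 = CreateOpFile('path/to/file.xlsx').merge_data_frames()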

Related

Is there a more elegant way to initialize empty dataframes

isd = pd.DataFrame()
ind = pd.DataFrame()
exd = pd.DataFrame()
psd = pd.DataFrame()
visd = pd.DataFrame()
vind = pd.DataFrame()
vexd = pd.DataFrame()
sd = pd.DataFrame()
ise = pd.DataFrame()
idb = pd.DataFrame()
mdd = pd.DataFrame()
add = pd.DataFrame()
Is there any alternative way to make this more elegant and faster?
Use a dictionary of dataframes, especially if the code for some of the dataframes is going to share some similarities. This allows doing operations on them with loops or functions:
dct = {n: pd.DataFrame() for n in ['isd', 'ind', 'exd']}
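For instance, a short sketch using the names from the question (purely illustrative):
import pandas as pd

names = ['isd', 'ind', 'exd', 'psd', 'visd', 'vind', 'vexd',
         'sd', 'ise', 'idb', 'mdd', 'add']
dct = {n: pd.DataFrame() for n in names}

# individual frames are accessed by name...
isd = dct['isd']

# ...and bulk operations become a simple loop
for name, frame in dct.items():
    print(name, frame.shape)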
If you want to avoid needing to numerically index each of the DataFrames, but would rather be able to access them directly by their name:
import pandas as pd

table_names = ['df1', 'df2', 'df3']
for name in table_names:
    exec('%s = pd.DataFrame()' % name, locals(), locals())

print(df1)
This approach uses exec, which essentially runs a string as if it were Python code. I'm just formatting each of the predetermined names into the string in a for loop.
You can do something like this:
dfs = ['isd', 'ind', 'exd']
df_list = [pd.DataFrame() for df in dfs]
I think you can go this way:
import pandas as pd

# a comprehension gives each name its own DataFrame
# ([pd.DataFrame()] * 4 would bind all four names to the same object)
a, b, c, d = (pd.DataFrame() for _ in range(4))

Defining function to open an Excel file (openpyxl) and save as a DataFrame

I have a typical method that I use to pull data from an Excel file into a DataFrame:
import pandas as pd
import openpyxl as op
path = r'thisisafilepath\filename.xlsx'
book = op.load_workbook(filename=path, data_only=True)
tab = book['sheetname']
data = tab.values
columns = next(data)[0:]
df = pd.DataFrame(data, columns=columns)
I'm trying to define this method as a function to make the code simpler/more readable.
I have tried the following:
def openthis(path, sheet):
    book = op.load_workbook(filename=path, data_only=True)
    tab = book[sheet]
    data = tab.values
    columns = next(data)[0:]
    df = pd.DataFrame(data, columns=columns)
    return df
When I then call openthis() the output is a printed version of the DataFrame in my console, but no variable has actually been created for me to work with.
What am I missing? Also, is there a way to define what the DataFrame variable is called when it is produced?
You didn't show how you actually call it, but I'm guessing you didn't assign the output to a variable.
Notice the return df in your function.
That statement means that when you call openthis() it returns a value. Unless you assign that output to a variable, it's gone forever.
Try this
df = openthis(some_arguments)
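For example, with the path and sheet name from your original snippet:
df = openthis(r'thisisafilepath\filename.xlsx', 'sheetname')
print(df.head())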

Save file with a specific name using tk filedialog

Is there a way to save a DataFrame into an Excel file with filedialog, but using a specific name, 'my_file' for example?
I usually use this code
path_to_save = filedialog.asksaveasfilename(defaultextension='.xlsx')
df.to_excel(path_to_save, index=False)
and this opens a window where I can choose the location and name of my file. Now I would like the name 'my_file' to be filled in by default so that typing it is not necessary.
Is there a way of doing it? Many thanks in advance.
The saved Excel file is empty:
a_row['column1'] = df['column1']
new_df = a_row
new_df2 = pd.DataFrame({'column2': [], '': []})
new_df3 = pd.concat([new_df, new_df2])
new_df3['column2'] = 'some value'
new_df3 = new_df3.set_index(['column1', 'column2'])
path_to_save1 = filedialog.asksaveasfilename(defaultextension='.xlsx', initialfile = 'my_file')
new_df3.to_excel(path_to_save1, index=False)
Is there maybe a way to insert a row on top of the column names, like in this image? I couldn't find anything in the pandas docs about this.
Main Question
Try using the parameter initialfile for the asksaveasfilename function, e.g.:
path_to_save = filedialog.asksaveasfilename(defaultextension='.xlsx', initialfile = 'my_file')
Comment Question
Regarding the DataFrame being empty: that is because you are not assigning any values to it. To add values, you can use loc:
df = pd.DataFrame({'column1':[],'column2':[]})
df.loc[0] = ['value1','value2']
You can also do that using concat, but make sure your dataframes have the same number of columns for that.
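For instance, a minimal sketch of the concat approach (the column names just follow the example above):
import pandas as pd

df = pd.DataFrame({'column1': [], 'column2': []})
new_row = pd.DataFrame({'column1': ['value1'], 'column2': ['value2']})
# concat stacks the frames; ignore_index renumbers the rows 0..n-1
df = pd.concat([df, new_row], ignore_index=True)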
And to add a row on top, there was this interesting solution by @edyvedy13 found here:
df.loc[-1] = ['value1.0','value2.0']
df.insert(0,1,{'value2'})
df.index = df.index + 1
df.sort_index(inplace=True)

Python: Manipulate dataframe in user function

I want to manipulate a dataframe in a user-written function in Python. The manipulation code works fine when I run it outside the function. However, when I put it in the function and run the function, it runs without an error but does not return any dataframe. My code looks like the following:
def reshape(file):
    from IPython import get_ipython
    get_ipython().magic('reset -sf')

    # import packages
    import pandas as pd
    import datetime
    import calendar

    # define file path and import files
    path = "X:/TEMP/"
    file_path = path + file
    df = pd.read_excel(file_path, "Sheet1", parse_dates=["Date"])

    # reshape data to panel
    melted = pd.melt(df, id_vars="Date", var_name="id", value_name="Market_Cap")
    melted["id"] = melted["id"].str.replace("id", "")
    melted.id = melted.id.astype(int)
    melted.reset_index(inplace=True, drop=True)

    id_to_string = pd.read_excel(file_path, "Sheet2")
    id_to_string = id_to_string.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns={0: 'id'}, inplace=True)
    id_to_string.rename(columns={"index": 'Ticker'}, inplace=True)

    merged = pd.merge(melted, id_to_string, how="left", on="id")
    merged = merged.sort(["Date", "Market_Cap"], ascending=[1, 0])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)

    df = pd.read_excel(file_path, "hardcopy_return", parse_dates=["Date"])
    df = df.sort("Date", ascending=1)
    old = merged
    merged = pd.merge(old, df, on=["Date", "id"])
    merged = merged.set_index("Date")
    return merged

reshape("sample.xlsx")
This code runs through but returns nothing. Did I make a mistake in the def statement or in calling the function? Any help is greatly appreciated.
I assume this is run with IPython or a Jupyter notebook?
It might have worked before because the kernel remembers some state. Before turning something into a separate function instead of a straight script, I do a Restart Kernel & Run All.
On the code itself, I would split the different parts so it becomes easier to test the individual pieces.
Imports
import pandas as pd
import datetime
import calendar
from IPython import get_ipython
get_ipython().magic('reset -sf')
Read 'Sheet1'
Get the data from the first worksheet and do the first processing:
def read_melted(file_path):
    df1 = pd.read_excel(file_path, sheetname='Sheet1', parse_dates=["Date"])
    melted = pd.melt(df1, id_vars="Date", var_name="id", value_name="Market_Cap")
    melted["id"] = melted["id"].str.replace("id", "")  # strip the prefix so astype(int) works, as in the original code
    melted.id = melted.id.astype(int)
    melted.reset_index(inplace=True, drop=True)
    return melted
Read 'Sheet2'
def read_id_to_string(file_path):
    df2 = pd.read_excel(file_path, sheetname='Sheet2')
    id_to_string = df2.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns={0: 'id'}, inplace=True)
    id_to_string.rename(columns={"index": 'Ticker'}, inplace=True)
    return id_to_string
Read 'hardcopy_return'
def read_hardcopy_return(file_path):
    df = pd.read_excel(file_path, sheetname='hardcopy_return', parse_dates=["Date"])
    return df.sort("Date", ascending=1)
Tying it together
def reshape(df1, df2, df_hardcopy_return):
    merged = pd.merge(df1, df2, how="left", on="id").sort(["Date", "Market_Cap"], ascending=[1, 0])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)  # what does this line do?
    merged_all = pd.merge(merged, df_hardcopy_return, on=["Date", "id"]).set_index("Date")
    return merged_all
Calling everything
path="X:/TEMP/"
file_path =path+file
df1 = read_melted(file_path)
df2 = read_id_to_spring(file_path)
df_hardcopy_return = read_hardcopy_return(file_path)
reshape(df1, df2, df_hardcopy_return)
The only thing that still strikes me as odd is the line merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)
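For reference, a minimal sketch (with made-up numbers) of what that line produces: within each Date group, Market_Cap values get a rank, with the smallest value ranked 1.0:
import pandas as pd

demo = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'Market_Cap': [100, 300, 200, 50],
})
demo['Rank'] = demo.groupby('Date')['Market_Cap'].rank(ascending=True)
print(demo)
#          Date  Market_Cap  Rank
# 0  2020-01-01         100   1.0
# 1  2020-01-01         300   2.0
# 2  2020-01-02         200   2.0
# 3  2020-01-02          50   1.0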
read_excel sheetname
pandas.read_excel also has a sheetname argument, which you can use to open the Excel file only once. Reading Excel files can be slow sometimes, so this might make it faster too.
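For example, a sketch of reading the three sheets from the question in one call (the variable names are just illustrative; newer pandas versions spell the argument sheet_name):
import pandas as pd

file_path = "X:/TEMP/sample.xlsx"
# returns a dict of DataFrames keyed by sheet name
sheets = pd.read_excel(file_path, sheet_name=['Sheet1', 'Sheet2', 'hardcopy_return'])
df1 = sheets['Sheet1']
df2 = sheets['Sheet2']
df_hardcopy_return = sheets['hardcopy_return']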

pandas Combine Excel Spreadsheets

I have an Excel workbook with many tabs.
Each tab has the same set of headers as all others.
I want to combine all of the data from each tab into one data frame (without repeating the headers for each tab).
So far, I've tried:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()
Can I use something for the parse argument that will mean "all spreadsheets"?
Or is this the wrong approach?
Thanks in advance!
Update: I tried:
a = xl.sheet_names
b = pd.DataFrame()
for i in a:
    b.append(xl.parse(i))
b
But it's not "working".
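A likely reason it isn't "working": DataFrame.append returns a new DataFrame rather than modifying b in place, so each result is discarded. Reassigning the result would fix the loop (note that append is deprecated in recent pandas in favour of pd.concat):
b = pd.DataFrame()
for i in a:
    b = b.append(xl.parse(i))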
This is one way to do it -- load all sheets into a dictionary of dataframes and then concatenate all the values in the dictionary into one dataframe.
import pandas as pd
Set sheet_name to None in order to load all sheets into a dict of dataframes, and ignore the index to avoid overlapping values later (see the comment by @bunji):
df = pd.read_excel('tmp.xlsx', sheet_name=None, index_col=None)
Then concatenate all dataframes
cdf = pd.concat(df.values())
print(cdf)
import pandas as pd

f = 'file.xlsx'
df = pd.read_excel(f, sheet_name=None)
# concatenate the dict of frames; ignore_index gives one continuous index
df2 = pd.concat(df, ignore_index=True, sort=True)
df2.to_excel('merged.xlsx',
             engine='xlsxwriter',
             sheet_name='Merged',
             header=True,
             index=False)
