Python: Manipulate dataframe in a user-written function

I want to manipulate a dataframe in a user-written function in Python. The manipulating code works fine when I run it outside the function. However, when I put it in the function and run the function, it runs without an error but does not return any dataframe. My code looks like the following:
def reshape(file):
    from IPython import get_ipython
    get_ipython().magic('reset -sf')
    # import packages
    import pandas as pd
    import datetime
    import calendar
    # define file path and import files
    path = "X:/TEMP/"
    file_path = path + file
    df = pd.read_excel(file_path, "Sheet1", parse_dates=["Date"])
    # reshape data to panel
    melted = pd.melt(df, id_vars="Date", var_name="id", value_name="Market_Cap")
    melted["id"] = melted["id"].str.replace("id", "")
    melted.id = melted.id.astype(int)
    melted.reset_index(inplace=True, drop=True)
    id_to_string = pd.read_excel(file_path, "Sheet2")
    id_to_string = id_to_string.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns={0: 'id'}, inplace=True)
    id_to_string.rename(columns={"index": 'Ticker'}, inplace=True)
    merged = pd.merge(melted, id_to_string, how="left", on="id")
    merged = merged.sort(["Date", "Market_Cap"], ascending=[1, 0])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)
    df = pd.read_excel(file_path, "hardcopy_return", parse_dates=["Date"])
    df = df.sort("Date", ascending=1)
    old = merged
    merged = pd.merge(old, df, on=["Date", "id"])
    merged = merged.set_index("Date")
    return merged

reshape("sample.xlsx")
This code runs through but returns nothing. Did I make a mistake in the def statement or in calling the function? Any help is greatly appreciated.

I assume this is run with IPython or a Jupyter notebook? Note that the function does return merged, but since you call reshape("sample.xlsx") without assigning or printing the result, you never see it.
It might have worked before because the kernel remembers some state. Before turning a straight script into a separate function, I do a Restart Kernel & Run All.
On the code itself, I would split the different parts, so it becomes easier to test the individual pieces.
Imports
from IPython import get_ipython
get_ipython().magic('reset -sf')  # reset first, otherwise it would wipe the imports below

import pandas as pd
import datetime
import calendar
Read 'Sheet1'
Get the data from the first worksheet and do the first processing:
def read_melted(file_path):
    df1 = pd.read_excel(file_path, sheet_name='Sheet1', parse_dates=["Date"])
    melted = pd.melt(df1, id_vars="Date", var_name="id", value_name="Market_Cap")
    melted["id"] = melted["id"].str.replace("id", "")  # strip the "id" prefix before casting, as in the original
    melted.id = melted.id.astype(int)
    melted.reset_index(inplace=True, drop=True)
    return melted
Read 'Sheet2'
def read_id_to_string(file_path):
    df2 = pd.read_excel(file_path, sheet_name='Sheet2')
    id_to_string = df2.transpose()
    id_to_string.reset_index(level=0, inplace=True)
    id_to_string.rename(columns={0: 'id'}, inplace=True)
    id_to_string.rename(columns={"index": 'Ticker'}, inplace=True)
    return id_to_string
Read 'hardcopy_return'
def read_hardcopy_return(file_path):
    df = pd.read_excel(file_path, sheet_name='hardcopy_return', parse_dates=["Date"])
    return df.sort_values("Date", ascending=True)
Tying it together
def reshape(df1, df2, df_hardcopy_return):
    merged = pd.merge(df1, df2, how="left", on="id").sort_values(["Date", "Market_Cap"], ascending=[True, False])
    merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True)  # what does this line do?
    merged_all = pd.merge(merged, df_hardcopy_return, on=["Date", "id"]).set_index("Date")
    return merged_all
Calling everything
path="X:/TEMP/"
file_path =path+file
df1 = read_melted(file_path)
df2 = read_id_to_spring(file_path)
df_hardcopy_return = read_hardcopy_return(file_path)
reshape(df1, df2, df_hardcopy_return)
The only thing that still strikes me as odd is the line merged["Rank"] = merged.groupby(["Date"])["Market_Cap"].rank(ascending=True), which, as far as I can tell, assigns each id its Market_Cap rank within each Date group.
read_excel sheet_name
pandas.read_excel also has a sheet_name argument (called sheetname in older versions), which accepts a list of sheet names so you can open the Excel file only once. Reading Excel files can be slow sometimes, so this might make it faster too.
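For example, a minimal sketch using the sheet names from the question (with a list, read_excel returns a dict of DataFrames keyed by sheet name):

# read all three sheets in one pass; the file is opened only once
sheets = pd.read_excel(file_path, sheet_name=["Sheet1", "Sheet2", "hardcopy_return"])
df1 = sheets["Sheet1"]
df2 = sheets["Sheet2"]
df_hardcopy_return = sheets["hardcopy_return"]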

Related

Pandas Only Exporting 1 Table to Excel but Printing all

The code below only exports the last table on the page to Excel, but when I run the print function it prints all of them. Is there an issue with my code causing it not to export all the data to Excel?
I've also tried exporting as a .csv file, with no luck.
import pandas as pd

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index=False)
        #print(df)
Your problem is that each time df.to_excel is called, you are overwriting the file, so only the last df will be left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
url = 'https://www.vegasinsider.com/college-football/matchups/'
writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)
counter = 0
for df in dfs:
    if len(df.columns) > 4:
        counter += 1
        df.to_excel(writer, sheet_name=f"sheet_{counter}", index=False)
writer.save()  # in pandas >= 2.0, use writer.close() instead
You might need to pip install xlsxwriter xlwt to make it work.
Exporting to a csv will never work this way, since a csv is a single data table (like a single sheet in Excel), so in that case you would need to use a new csv for each df.
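A minimal sketch of that csv route, assuming the same dfs list from above (the file naming scheme is made up):

# one csv per table; enumerate provides a distinct (made-up) file name for each
tables = [df for df in dfs if len(df.columns) > 4]
for i, df in enumerate(tables):
    df.to_csv(f"VegasInsiderCFB_{i}.csv", index=False)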
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]
columns = ["gameid", "game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
values = np.empty((2*N, len(columns)), dtype=object)  # np.object was removed from newer numpy
for i, df in enumerate(dfs):
    time = df.iloc[0, 0].replace(" Game Time", "")
    values[2*i:2*i+2, 2:] = df.iloc[2:, :]
    values[2*i:2*i+2, :2] = np.array([[i, time], [i, time]])
newdf = pd.DataFrame(values, columns=columns)
newdf.to_excel("output.xlsx", index=False)
I used a numpy array of object dtype to be able to copy a submatrix from the original dataframes easily into its intended place. I also needed to create a gameid that connects the games across rows. It should now be trivial to rewrite this so you loop through a list of urls and write each to a separate sheet.
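A rough sketch of that rewrite, assuming a hypothetical build_table(url) helper that wraps the merge logic above and returns the combined DataFrame for a single url:

# one sheet per url; the urls list and build_table are assumptions, not part of the original
urls = ['https://www.vegasinsider.com/college-football/matchups/']
with pd.ExcelWriter("output.xlsx") as writer:
    for i, url in enumerate(urls):
        build_table(url).to_excel(writer, sheet_name=f"url_{i}", index=False)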

Using Classes in Pandas

I'm trying to write better, more readable code, and I've begun to use classes.
Confusing thus far, but I can see the positives.
That said, I'm simply trying to merge two dataframes.
Previously I've achieved this by using:
import pandas as pd

path = 'path/to/file.xlsx'
df = pd.read_excel(path, 'sheet1')
df2 = pd.read_excel(path, 'sheet2')
df3 = df.merge(df2, how='left', on='column1')
Trying to implement this using classes, I have the following so far, which could be incorrect:
import pandas as pd

path = 'path/to/file.xlsx'

class CreateOpFile:
    def __init__(self, path):
        self.df = pd.read_excel(path, 'sheet1')
        self.df2 = pd.read_excel(path, 'sheet2')

    def MergeDataFrames(self):
        pd.merge(self.df, self.df2, how='left', on='column1')
So I'm confused as to how to create a new variable, let's say df3, outside of the class CreateOpFile, as I did with df3 = df.merge(df2, how='left', on='column1') in the first approach?
One way to do it is to return the new, merged df and assign it to df3 outside of the class:
import pandas as pd

path = 'path/to/file.xlsx'

class CreateOpFile:
    def __init__(self, path):
        self.df = pd.read_excel(path, 'sheet1')
        self.df2 = pd.read_excel(path, 'sheet2')

    def MergeDataFrames(self):
        return pd.merge(self.df, self.df2, how='left', on='column1')

df3 = CreateOpFile(path).MergeDataFrames()
Btw, according to the naming conventions in PEP 8, method names should consist of lowercase letters, with underscores to separate words. Therefore merge_data_frames() seems a better name.
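For instance, the same class with a PEP 8 style method name (a sketch; behavior is unchanged):

class CreateOpFile:
    def __init__(self, path):
        self.df = pd.read_excel(path, 'sheet1')
        self.df2 = pd.read_excel(path, 'sheet2')

    def merge_data_frames(self):  # lowercase_with_underscores per PEP 8
        return pd.merge(self.df, self.df2, how='left', on='column1')

df3 = CreateOpFile(path).merge_data_frames()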

Defining function to open an Excel file (openpyxl) and save as a DataFrame

I have a typical method that I use to pull data from an Excel file into a DataFrame:
import pandas as pd
import openpyxl as op
path = r'thisisafilepath\filename.xlsx'
book = op.load_workbook(filename=path, data_only=True)
tab = book['sheetname']
data = tab.values
columns = next(data)[0:]
df = pd.DataFrame(data, columns=columns)
I'm trying to define this method as a function to make the code simpler/more readable.
I have tried the following:
def openthis(path, sheet):
    book = op.load_workbook(filename=path, data_only=True)
    tab = book[sheet]
    data = tab.values
    columns = next(data)[0:]
    df = pd.DataFrame(data, columns=columns)
    return df
When I then call openthis() the output is a printed version of the DataFrame in my console, but no variable has actually been created for me to work with.
What am I missing? Also, is there a way to define what the DataFrame variable is called when it is produced?
You didn't show how you actually call it, but I'm guessing you didn't assign the output to a variable.
Notice the return df in your function.
This statement means that when you call openthis() it returns a value. Unless you assign that output to a variable, it's gone forever.
Try this:
df = openthis(some_arguments)
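As for the second question: the DataFrame is called whatever name you assign the result to at the call site, e.g. (path and sheet name reused from the question):

my_df = openthis(r'thisisafilepath\filename.xlsx', 'sheetname')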

Excel file overwritten instead of concat - Python - Pandas

I'm trying to concat all Excel files, and all worksheets in them, into one file using the below script. It kind of works, but the Excel file c.xlsx is overwritten per input file, so only the last Excel file ends up concated; not sure why?
import pandas as pd
import os
import ntpath
import glob

dir_path = os.path.dirname(os.path.realpath(__file__))
os.chdir(dir_path)
cdf = None
for excel_names in glob.glob('*.xlsx'):
    print(excel_names)
    df = pd.read_excel(excel_names, sheet_name=None, ignore_index=True)
    cdf = pd.concat(df.values())
    cdf.to_excel("c.xlsx", header=False, index=False)
The idea is to create a list of DataFrames in a list comprehension, but because read_excel with sheet_name=None returns an ordered dict of DataFrames (one per sheet), it is necessary to concat each file's sheets in a loop and then concat again into one big final DataFrame:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
       for excel_names in glob.glob('files/*.xlsx')]

df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
#print (df)
df.to_excel("c.xlsx", index=False)
I just tested the code below. It merges data from all Excel files in a folder into one single Excel file.
import pandas as pd
import glob

frames = []
for f in glob.glob("C:\\your_path\\*.xlsx"):
    df = pd.read_excel(f)
    frames.append(df)  # collect frames; DataFrame.append was removed in pandas 2.0
all_data = pd.concat(frames, ignore_index=True)
print(all_data)
print(all_data.shape)
all_data.to_excel("C:\\your_path\\final.xlsx", sheet_name='Sheet1')
I got it working using the below script, which uses @ryguy72's answer but works on all worksheets as well as the header row.
import pandas as pd
import glob

frames = []
for f in glob.glob("my_path/*.xlsx"):
    sheets = pd.read_excel(f, sheet_name=None)  # dict of DataFrames, one per worksheet
    frames.append(pd.concat(sheets.values()))
all_data = pd.concat(frames, ignore_index=True)
print(all_data)
print(all_data.shape)
all_data.to_excel("my_path/final.xlsx", sheet_name='Sheet1')

Append only unlike data from one .csv to another .csv

I have managed to use Python with the speedtest-cli package to run a speedtest of my Internet speed. I run this every 15 min and append the results to a .csv file I call "speedtest.csv". I then have this .csv file emailed to me every 12 hours, which is a lot of data.
I am only interested in keeping the rows of data that return less than 13mbps Download speed. Using the following code, I am able to filter for this data and append it to a second .csv file I call speedtestfilteronly.csv.
import pandas as pd

df = pd.read_csv(r'c:\speedtest.csv', header=0)
df = df[df['Download'] < 13000000.0]
df.to_csv(r'c:\speedtestfilteronly.csv', mode='a', header=False)
The problem now is it appends all the rows that match my filter criteria every time I run this code. So if I run this code 4 times, I receive the same 4 sets of appended data in the "speedtestfilteronly.csv" file.
I am looking to only append unlike rows from speedtest.csv to speedtestfilteronly.csv.
How can I achieve this?
I have got the following code to work, except the only thing it is not doing is filtering the results to < 13000000.0 mb/s. Any other ideas?
import pandas as pd

df = pd.read_csv(r'c:\speedtest.csv', header=0)
df = df[df['Download'] < 13000000.0]
history_df = pd.read_csv(r'c:\speedtest.csv')
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv(r'c:\emailspeedtest.csv', header=None, index=False)
There are a few different ways you could approach this; one would be to read in your filtered dataset, append the new rows in memory, and then drop duplicates. (Note that your attempt above reads the unfiltered c:\speedtest.csv back in as the history, which is why the output ends up unfiltered.) For example:
import pandas as pd

df = pd.read_csv(r'c:\speedtest.csv', header=0)
df = df[df['Download'] < 13000000.0]
# the filtered file was written without a header, so reuse df's column names
# to keep the columns aligned for drop_duplicates
history_df = pd.read_csv(r'c:\speedtestfilteronly.csv', header=None, names=df.columns)
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv(r'c:\speedtestfilteronly.csv', header=None, index=False)
