Python: outputting lists to excel - python

For my master thesis, I need to calculate expected returns for x number of stocks on a given event date. I have written the following code, which does what I intends (match Fama & French factors with a sample of event dates). However, when I try to export it to excel I can't seem to get the correct output. I.e. it doesn't contain column headings such as Dates, names of fama & french factors and the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here are my code:
import pandas as pd
# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx',
infer_datetime_format=True)
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True,
infer_datetime_format=True)
# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
# Creating an empty placeholder
end_date = []
# Iterating over the event dates, creating a start and end date 60 months
apart
for index, row in df.iterrows():
end_da = row['Date']-pd.DateOffset(months=60)
end_date.append(end_da)
end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df,df,left_index=True,right_index=True)
m.columns = ['Start','End']
ff_factors = []
for index, row in m.iterrows():
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
time_range= (ff_five['Date'] > row['Start']) & (ff_five['Date'] <=
row['End'])
df = ff_five.loc[time_range]
ff_factors.append(df)
EDIT:
Here are my attempt at getting the data from python to excel.
ff_factors_df = pd.DataFrame(data=ff_factors)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()

To output a dataframe to csv or excel should be able to be done with
ff_five.to_excel('Filename.xls')
Change excel to csv if you want it to a csv.
Ok I tried to interpret what you were trying to do without it being very clear. But if I was interpreting it correctly you are trying to create some addition columns based on other data. Instead of creating separate lists you could possibly just put them in as new columns and then just output the columns you want potentially. Something like this maybe (had to make some assumptions and create some fake data to see if this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
df = pd.DataFrame()
df['Date'] = ["2012-12-01", "2012-12-30"]
df['Date'] = pd.to_datetime(df['Date'])
df['End'] = df['Date'] - pd.DateOffset(months=60)
df.columns = ['Start', 'End']
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')

Related

Filtering a CSV File using two columns

I am a newbie to python. I am working on a CSV file where it has over a million records. In the data, every Location has a unique ID (SiteID). I want to filter for and remove any records where there is no value or mismatch between SiteID and Location in my CSV file. (Note: This script should print the lines number and mismatch field values for each record.)
I have the following code. Please help me out:
import pandas as pd
pd = pandas.read_csv ('car-auction-data-from-ghana', delimiter = ";")
pd.head()
date_time = (pd['Date Time'] >= '2010-01-01T00:00:00+00:00') #to filter from a specific date
comparison_column = pd.where(pd['SiteID'] == pd['Location'], True, False)
comparison_column
This should be your solution:
df = pd.read_csv('car-auction-data-from-ghana', delimiter = ";")
print(df.head())
date_time = (df['Date Time'] >= '2010-01-01T00:00:00+00:00') #to filter from a specific date
df = df[df['SiteID'] == df['Location']]
print(df)
You need to call read_csv as a member of pd because it is the alias to the imported package, and use df as the variable for your data frame. The line with the comparison drops rows in which the boolean is not equal, the two not being equal in this case.

Pandas Only Exporting 1 Table to Excel but Printing all

The code below only exports the last table on the page to excel, but when I run the print function, it will print all of them. Is there an issue with my code causing not to export all data to excel?
I've also tried exporting as .csv file with no luck.
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
if len(df.columns) > 1:
df.to_excel(r'VegasInsiderCFB.xlsx', index = False)
#print(df)
Your problem is that each time df.to_excel is called, you are overwriting the file, so only the last df will be left. What you need to do is use a writer and specify a sheet name for each separate df e.g:
url = 'https://www.vegasinsider.com/college-football/matchups/'
writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)
counter = 0
for df in dfs:
if len(df.columns) > 4:
counter += 1
df.to_excel(writer, sheet_name = f"sheet_{counter}", index = False)
writer.save()
You might need pip install xlsxwriter xlwt to make it work.
Exporting to a csv will never work, since a csv is a single data table (like a single sheet in excel), so in that case you would need to use a new csv for each df.
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]
columns = ["gameid","game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
values = np.empty((2*N,len(columns)),dtype=np.object)
for i,df in enumerate(dfs):
time = df.iloc[0,0].replace(" Game Time","")
values[2*i:2*i+2,2:] = df.iloc[2:,:]
values[2*i:2*i+2,:2] = np.array([[i,time],[i,time]])
newdf = pd.DataFrame(values,columns = columns)
newdf.to_excel("output.xlsx",index = False)
I used a numpy.array of object type to be able to copy a submatrix from the original dataframes easily into their intended place. I also needed to create a gameid, that connects the games across rows. It should be now trivial to rewrite this so you loop through a list of urls and write these to separate sheets.

Python How to convert collections.OrderedDict to dataFrame

I have the following task:
1) I have an excel file with a few spreadsheets. From these spreadsheets I need information from columns "A:CU", rows 41 - 51
2) Then I need to collect information from from columns "A:CU", rows 41 - 51 from all spreadsheets (they have the same structure) and to create a database.
3) There should be a column that indicates from which spreadsheet data was collected
I did following:
import pandas as pd
file='January2020.xlsx'
#getting info from spreadsheets C(1), C(2) and so on
days = range(1,32)
sheets = []
for day in days:
sheets.append('C(' + str(day)+')')
#importing data
all_sales=pd.read_excel(file,header=None,skiprows=41, usecols="A:CU", sheet_name=sheets,
skipfooter=10)
Now I have collections.OrderedDict and struggle to put it into dataFrame.
What I need to have is a dataframe like this:
Try pd.concat
df = pd.concat(all_sales, ignore_index = True)
I used this code and it worked:
file='January2020.xlsx'
days = range(1,32)
all_sales=pd.DataFrame()
df = pd.DataFrame()
all_df = []
for day in days:
sheet_name = "C("+str(day)+")"
all_sales=pd.read_excel(file,header=None,skiprows=41,usecols="A:CU", sheet_name=sheet_name,
skipfooter=10)
all_sales["Date"] = sheet_name
all_df.append(all_sales)
df_final = pd.concat(all_df)

DataFrame Split On Rows and apply on header one column using Python Pandas

I'm working on some project and came up with the messy situation across where I've to split the data frame based on the first column of a data frame, So the situation is here the data frame I've with me is coming from SQL queries and I'm doing so much manipulation on that. So that is why not posting the code here.
Target: The data frame I've with me is like the below screenshot, and its available as an xlsx file.
Output: I'm looking for output like the attached file here:
The thing is I'm not able to put any logic here that how do I get this done on dataframe itself as I'm newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
df['Date'] = df['Date'].dt.strftime('%M-%d-%Y')
df_sub = df[['Delivered Impressions','Clicks','Conversion','Spend']].sum(level=0)\
.assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date',append=True).sort_index(level=0)
startline = 0
writer = pd.ExcelWriter('testxls.xlsx', engine='openpyxl')
for n,g in df_out.groupby(level=0):
g.to_excel(writer, startrow=startline, index=True)
startline += len(g)+2
writer.save()
Load the Excel file into a Pandas dataframe, then extract rows based on condition.
dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Where "needed value" would be the value of one of those rows.

python pandas exported csv format different from imported issue

I have a strange issue on pandas.read_csv function. I exported my dataframe into a csv, but when I re-imported the same csv, the data that has been imported back does not work when I try to merge(The merge shows all the data on the left and none that I have tried to merge it with). If I use the original data before it was exported to the csv, it works completely fine.(The merge was perfect).
df = df.values_list('id','teacher_id','uniquecount','nonuniquecount','msgcount','ordercount','date','updated','timestamp', flat=False)
#inserting the collected data into a dateframe for manipulation
df = pd.DataFrame(list(df))
#giving the dataframe column names
df.columns = ['id','teacher_id','uniquecount','nonuniquecount','msgcount','ordercount','date','updated','timestamp']
df = df[['id','teacher_id','uniquecount','nonuniquecount','msgcount','ordercount','date']]
#rename required columns
df.rename(columns={'uniquecount':'Unique Views','nonuniquecount':'Views','msgcount':'Messages','ordercount':'Orders'}, inplace=True)
print df
print df.dtypes
# exporting df out to a csv
# df.to_csv('test.csv', header=True)
# importing the df back from a csv
df = pd.read_csv('test.csv', index_col=0)
print df
print df.dtypes
#insert dates
numdays = 14
base = datetime.datetime.today().date()
date_list = [base - datetime.timedelta(days=x) for x in range(0, numdays)]
dates = pd.DataFrame(date_list)
dates.columns = ['date']
#merge the complete dates with the dateframe
df = pd.merge(dates ,df , on=['date'] , how='left')
# print df
I have checked and compared that the dataframes look exactly the same before export and after importing from the csv.(I printed the output twice, once before export and one after) I have also checked and the datetypes are all the same.
I need to export the csv to work with an external environment because I cant attach my local database.
attached a copy of the cmdline print which shows that both dataframes are exactly similar
attached below is a sample of my exported csv
,id,teacher_id,Unique Views,Views,Messages,Orders,date
0,47,31,1,6,0,0,2017-05-09
1,56,31,1,9,0,0,2017-05-10
2,67,31,2,11,0,0,2017-05-14
3,71,31,3,15,0,0,2017-05-15
4,79,31,3,17,0,0,2017-06-12
5,83,31,3,18,0,1,2017-06-18
Does anyone have any idea on this strange issue?
Before calling merge, try converting both dates using to_datetime first as referred in answer here
df.date = pd.to_datetime(df.date)
dates.date = pd.to_datetime(dates.date)
#merge the complete dates with the dateframe
df = pd.merge(dates ,df , on=['date'] , how='left')

Categories