The code below only exports the last table on the page to Excel, but when I run the print function, it prints all of them. Is there an issue in my code that prevents it from exporting all of the data to Excel?
I've also tried exporting to a .csv file, with no luck.
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index = False)
        #print(df)
Your problem is that each call to df.to_excel overwrites the file, so only the last df is left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
counter = 0
# The context manager saves and closes the file on exit
# (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter') as writer:
    for df in dfs:
        if len(df.columns) > 4:
            counter += 1
            df.to_excel(writer, sheet_name = f"sheet_{counter}", index = False)
You might need to run pip install xlsxwriter to make it work.
Exporting everything to one csv will not work the same way, since a csv holds a single data table (like a single sheet in Excel), so in that case you would need to write a new csv for each df.
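If separate csv files are acceptable, the loop can write one file per table. This is only a minimal sketch: the helper name write_tables_to_csv, the file-name pattern and the sample data are made up for illustration, and in the question the list would come from pd.read_html(url).

```python
from pathlib import Path

import pandas as pd


def write_tables_to_csv(dfs, out_dir, prefix="table"):
    """Write each DataFrame in dfs to its own csv file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for counter, df in enumerate(dfs, start=1):
        # One file per table, e.g. table_1.csv, table_2.csv, ...
        path = out_dir / f"{prefix}_{counter}.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths


# Stand-in data (hypothetical); replace with the filtered result of pd.read_html(url).
sample_dfs = [
    pd.DataFrame({"team": ["A", "B"], "score": [1, 2]}),
    pd.DataFrame({"team": ["C", "D"], "score": [3, 4]}),
]
```

Calling write_tables_to_csv(sample_dfs, "csv_out") would then leave one csv per table in the csv_out folder.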
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]
columns = ["gameid","game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
# Plain object dtype (the np.object alias was removed in NumPy 1.24)
values = np.empty((2*N,len(columns)),dtype=object)
for i,df in enumerate(dfs):
    time = df.iloc[0,0].replace(" Game Time","")
    # Rows 2 and 3 of each table hold the two teams of one game
    values[2*i:2*i+2,2:] = df.iloc[2:,:].to_numpy()
    values[2*i:2*i+2,:2] = np.array([[i,time],[i,time]],dtype=object)
newdf = pd.DataFrame(values,columns = columns)
newdf.to_excel("output.xlsx",index = False)
I used a numpy.array of object type so that I could easily copy a submatrix from each original dataframe into its intended place. I also needed to create a gameid that connects the two rows belonging to the same game. It should now be straightforward to rewrite this so that you loop through a list of urls and write the results to separate sheets.
Running dataframe.to_excel() automatically saves the dataframe as the last sheet in the Excel file.
Is there a way to save a dataframe as the very first sheet, so that, when you open the spreadsheet, Excel shows it as the first on the left?
The only workaround I have found is to first export an empty dataframe to the tab with the name I want as first, then export the others, then export the real dataframe I want to the tab with the name I want. Example in the code below. Is there a more elegant way? More generically, is there a way to specifically choose the position of the sheet you are exporting to (first, third, etc)?
Of course, this arises because the dataframe I want first is the result of calculations based on all the others, so I cannot export it before them.
import pandas as pd
import numpy as np
writer = pd.ExcelWriter('My excel test.xlsx')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
pd.DataFrame().to_excel(writer, sheet_name='this should be the 1st')
other_df.to_excel(writer, sheet_name='other df')
first_df.to_excel(writer, sheet_name='this should be the 1st')
# close() also saves the file; writer.save() was removed in pandas 2.0
writer.close()
It is possible to re-arrange the sheets after they have been created:
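import pandas as pd
import numpy as np
# worksheets_objs (used below) belongs to the xlsxwriter workbook object
writer = pd.ExcelWriter('My excel test.xlsx', engine='xlsxwriter')
first_df = pd.DataFrame()
first_df['x'] = np.arange(0,100)
first_df['y'] = 2 * first_df['x']
other_df = pd.DataFrame()
other_df['z'] = np.arange(100,201)
other_df.to_excel(writer, sheet_name='Sheet2')
first_df.to_excel(writer, sheet_name='Sheet1')
# close() also saves the file; writer.save() was removed in pandas 2.0
writer.close()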
This writes the sheets in the order they were exported, with Sheet2 before Sheet1.
To sort them by name instead, add this before you save the workbook:
workbook = writer.book
workbook.worksheets_objs.sort(key=lambda x: x.name)
I would like to import excel tables (made by using the Excel 2007 and above tabulating feature) in a workbook into separate dataframes. Apologies if this has been asked before but from my searches I couldn't find what I wanted. I know you can easily do this using the read_excel function however this requires the specification of a Sheetname or returns a dict of dataframes for each sheet.
Instead of specifying sheetname, I was wondering whether there was a way of specifying tablename or better yet return a dict of dataframes for each table in the workbook.
I know this can be done by combining xlwings with pandas but was wondering whether this was built-into any of the pandas functions already (maybe ExcelFile).
Something like this:-
import pandas as pd
xls = pd.ExcelFile('excel_file_path.xls')
# to read all tables to a map
# (table_names is the wished-for attribute, it does not exist in pandas)
tables_to_df_map = {}
for table_name in xls.table_names:
    tables_to_df_map[table_name] = xls.parse(table_name)
Although not exactly what I was after, I have found a way to get table names with the caveat that it's restricted to sheet name.
Here's an excerpt from the code that I'm currently using:
import pandas as pd
import openpyxl as op
wb=op.load_workbook(file_location)
# Connecting to the specified worksheet
ws = wb[sheetname]
# Initialising an empty list where the excel tables will be imported
# into
var_tables = []
# Importing table details from excel: Table_Name and Sheet_Range
# (ws.tables is a dict of name -> Table in recent openpyxl versions)
for table in ws.tables.values():
    sht_range = ws[table.ref]
    data_rows = []
    i = 0
    j = 0
    for row in sht_range:
        j += 1
        data_cols = []
        for cell in row:
            i += 1
            data_cols.append(cell.value)
            # After the last cell of a row, add an extra column:
            # its header in the first row, the table name in the others
            if (i == len(row)) and (j == 1):
                data_cols.append('Table_Name')
            elif i == len(row):
                data_cols.append(table.name)
        data_rows.append(data_cols)
        i = 0
    var_tables.append(data_rows)
# Creating an empty list where all the dfs will be appended
# into
var_df = []
# Appending each table extracted from excel into the list
for tb in var_tables:
    df = pd.DataFrame(tb[1:], columns=tb[0])
    var_df.append(df)
# Merging all in one big df
df = pd.concat(var_df,axis=1) # This merges on columns
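As far as I know pandas has no built-in reader for named tables, but the same openpyxl idea can be extended to return one dataframe per table across all sheets. The following is only a sketch: tables_to_frames is a made-up helper name, it assumes every table has a header row, and it relies on ws.tables, which I believe needs openpyxl 3.0.4 or later. The demo workbook is built in memory so the example runs without an existing file.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.worksheet.table import Table


def tables_to_frames(wb):
    """Return {table name: DataFrame} for every Excel table in every sheet of wb."""
    frames = {}
    for ws in wb.worksheets:
        for table in ws.tables.values():
            # table.ref is a range such as "A1:B3"; the first row holds the headers
            rows = [[cell.value for cell in row] for row in ws[table.ref]]
            frames[table.name] = pd.DataFrame(rows[1:], columns=rows[0])
    return frames


# Small in-memory workbook with one table (hypothetical data).
demo = Workbook()
sheet = demo.active
for row in [["city", "pop"], ["Oslo", 1], ["Bergen", 2]]:
    sheet.append(row)
sheet.add_table(Table(displayName="Cities", ref="A1:B3"))

frames = tables_to_frames(demo)
print(frames["Cities"])
```

For a real file, the workbook would come from openpyxl.load_workbook(file_location) instead of being built by hand.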
I have managed to use Python with the speedtest-cli package to run a speedtest of my Internet speed. I run this every 15 min and append the results to a .csv file I call "speedtest.csv". I then have this .csv file emailed to me every 12 hours, which is a lot of data.
I am only interested in keeping the rows that show a download speed below 13 Mbps. Using the following code, I am able to filter for this data and append it to a second .csv file I call speedtestfilteronly.csv.
import pandas as pd
# Raw strings keep the backslashes in the Windows paths literal
df = pd.read_csv(r'c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
df.to_csv(r'c:\speedtestfilteronly.csv', mode='a', header=False)
The problem now is it appends all the rows that match my filter criteria every time I run this code. So if I run this code 4 times, I receive the same 4 sets of appended data in the "speedtestfilteronly.csv" file.
I am looking to only append unlike rows from speedtest.csv to speedtestfilteronly.csv.
How can I achieve this?
I have got the following code to work, except that it no longer filters the results to values below 13000000.0. Any other ideas?
import pandas as pd
df = pd.read_csv(r'c:\speedtest.csv', header=0)
df = df[df['Download'].map(lambda x: x < 13000000.0,)]
history_df = pd.read_csv(r'c:\speedtest.csv')
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv(r'c:\emailspeedtest.csv', header=None, index=False)
There are a few different ways you could approach this. One would be to read in your filtered dataset, append the new one in memory and then drop duplicates, like this:
import pandas as pd
df = pd.read_csv(r'c:\speedtest.csv', header=0)
df = df[df['Download'] < 13000000.0]
# The filtered file has no header row, so reuse the column names from df
# to make both frames line up in the concat
history_df = pd.read_csv(r'c:\speedtestfilteronly.csv', header=None, names=df.columns)
master_df = pd.concat([history_df, df], axis=0)
new_master_df = master_df.drop_duplicates(keep="first")
new_master_df.to_csv(r'c:\speedtestfilteronly.csv', header=False, index=False)
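To see the concat-then-drop_duplicates idea without touching any files, here is a small in-memory sketch. The helper name merge_without_duplicates, the Timestamp column and the sample values are assumptions for illustration, not the real speedtest-cli output.

```python
import pandas as pd


def merge_without_duplicates(history_df, new_df):
    """Stack history and new rows, keeping only the first copy of identical rows."""
    combined = pd.concat([history_df, new_df], ignore_index=True)
    return combined.drop_duplicates(keep="first").reset_index(drop=True)


# Hypothetical measurements; column names are assumed.
history = pd.DataFrame({
    "Timestamp": ["2020-01-01T00:00", "2020-01-01T00:15"],
    "Download": [12_000_000.0, 9_500_000.0],
})
latest = pd.DataFrame({
    "Timestamp": ["2020-01-01T00:15", "2020-01-01T00:30", "2020-01-01T00:45"],
    "Download": [9_500_000.0, 25_000_000.0, 11_000_000.0],
})

# Keep only the slow results, then add the ones not already in the history.
slow = latest[latest["Download"] < 13_000_000.0]
result = merge_without_duplicates(history, slow)
print(result)
```

The 00:15 row appears in both frames but only once in the result. Note that drop_duplicates compares whole rows, so a row only counts as a duplicate if every column matches.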
For my master's thesis, I need to calculate expected returns for x number of stocks on a given event date. I have written the following code, which does what I intend (match Fama & French factors with a sample of event dates). However, when I try to export it to Excel I can't seem to get the correct output, i.e. it doesn't contain column headings such as Dates, the names of the Fama & French factors and the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here is my code:
import pandas as pd
# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx')
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True)
# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Creating an empty placeholder
end_date = []
# Iterating over the event dates, creating a start and end date 60 months apart
for index, row in df.iterrows():
    end_da = row['Date']-pd.DateOffset(months=60)
    end_date.append(end_da)
end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df,df,left_index=True,right_index=True)
m.columns = ['Start','End']
ff_factors = []
for index, row in m.iterrows():
    ff_five['Date'] = pd.to_datetime(ff_five['Date'])
    time_range= (ff_five['Date'] > row['Start']) & (ff_five['Date'] <= row['End'])
    df = ff_five.loc[time_range]
    ff_factors.append(df)
EDIT:
Here is my attempt at getting the data from Python to Excel.
ff_factors_df = pd.DataFrame(data=ff_factors)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Writing a dataframe to csv or Excel can be done with:
ff_five.to_excel('Filename.xlsx')
Use to_csv instead of to_excel if you want a csv file.
OK, I tried to interpret what you are trying to do, although it was not entirely clear. If I read it correctly, you are trying to create some additional columns based on other data. Instead of building separate lists, you could add them as new columns and then output only the columns you want. Something like this, maybe (I had to make some assumptions and create some fake data to see whether this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
df = pd.DataFrame()
df['Date'] = ["2012-12-01", "2012-12-30"]
df['Date'] = pd.to_datetime(df['Date'])
# The window starts 60 months before the event date and ends on it
df['Start'] = df['Date'] - pd.DateOffset(months=60)
df.columns = ['End', 'Start']
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
# Row-by-row check of whether each factor date falls inside the window
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')
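One likely reason the export in the question loses its headings is that ff_factors is a list of dataframes, and passing such a list to pd.DataFrame(data=...) does not stack them. A possible alternative is pd.concat with one key per event. The sketch below uses made-up factor values, a single assumed factor column called Mkt-RF, and a 3-month window instead of 60 so that the sample stays small.

```python
import pandas as pd

# Made-up monthly factor data (hypothetical values and column name).
ff_five = pd.DataFrame({
    "Date": pd.to_datetime([
        "2012-01-31", "2012-02-29", "2012-03-31",
        "2012-04-30", "2012-05-31", "2012-06-30",
    ]),
    "Mkt-RF": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# One row per event; the question uses months=60 for the window length.
events = pd.DataFrame({"End": pd.to_datetime(["2012-04-30", "2012-06-30"])})
events["Start"] = events["End"] - pd.DateOffset(months=3)

# Collect the factor rows that fall inside each event window.
windows = []
for row in events.itertuples():
    in_window = (ff_five["Date"] > row.Start) & (ff_five["Date"] <= row.End)
    windows.append(ff_five.loc[in_window])

# Stack the windows and label every row with the event date it belongs to.
keys = events["End"].dt.strftime("%Y-%m-%d").tolist()
estimation = pd.concat(windows, keys=keys, names=["event", "row"])
estimation = estimation.reset_index(level="event").reset_index(drop=True)
print(estimation)

# estimation.to_excel("estimation_data.xlsx", index=False) would then keep the headers.
```

The result is one flat table with an event column, so the column names survive the export.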
I have an Excel workbook with many tabs.
Each tab has the same set of headers as all others.
I want to combine all of the data from each tab into one data frame (without repeating the headers for each tab).
So far, I've tried:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
df = xl.parse()
Can I use something for the parse argument that will mean "all spreadsheets"?
Or is this the wrong approach?
Thanks in advance!
Update: I tried:
a=xl.sheet_names
b = pd.DataFrame()
for i in a:
    b.append(xl.parse(i))
b
But it's not "working".
This is one way to do it -- load all sheets into a dictionary of dataframes and then concatenate all the values in the dictionary into one dataframe.
import pandas as pd
# Set sheet_name to None in order to load all sheets into a dict of dataframes
# and ignore index to avoid overlapping values later (see comment by @bunji)
df = pd.read_excel('tmp.xlsx', sheet_name=None, index_col=None)
# Then concatenate all dataframes
cdf = pd.concat(df.values())
print(cdf)
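If it matters which tab each row came from, the dict can be passed to pd.concat directly, so the sheet names become part of the index. A small sketch with a hand-made dict standing in for the result of pd.read_excel(..., sheet_name=None); the sheet names and columns are invented for the example.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None), which returns {sheet name: DataFrame}.
sheets = {
    "Jan": pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]}),
    "Feb": pd.DataFrame({"item": ["c"], "qty": [3]}),
}

# Passing the dict itself makes the sheet names the outer index level;
# reset_index then turns that level into an ordinary column.
combined = (
    pd.concat(sheets, names=["sheet", "row"])
    .reset_index(level="sheet")
    .reset_index(drop=True)
)
print(combined)
```

The headers appear only once, and the extra sheet column can be dropped again if it is not needed.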
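import pandas as pd
f = 'file.xlsx'
# sheet_name=None loads every sheet into a dict of dataframes
df = pd.read_excel(f, sheet_name=None)
# ignore_index belongs to concat, not read_excel
df2 = pd.concat(df, ignore_index=True, sort=True)
df2.to_excel('merged.xlsx',
             engine='xlsxwriter',
             sheet_name='Merged',
             header = True,
             index=False)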