Python: How to convert collections.OrderedDict to a DataFrame

I have the following task:
1) I have an Excel file with several sheets. From each sheet I need the data in columns "A:CU", rows 41 - 51.
2) I need to collect that range from all sheets (they have the same structure) and create a database from it.
3) There should be a column that indicates which sheet each row was collected from.
I did the following:
import pandas as pd
file = 'January2020.xlsx'
# getting info from sheets C(1), C(2) and so on
days = range(1, 32)
sheets = []
for day in days:
    sheets.append('C(' + str(day) + ')')
# importing data
all_sales = pd.read_excel(file, header=None, skiprows=41, usecols="A:CU", sheet_name=sheets,
                          skipfooter=10)
Now I have a collections.OrderedDict and am struggling to turn it into a DataFrame.
What I need to have is a dataframe like this:

Try pd.concat
df = pd.concat(all_sales, ignore_index = True)
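Note that ignore_index=True discards the dict keys, so the sheet names are lost. As a minimal sketch of one way to keep them: pd.concat accepts a dict directly and turns its keys into an outer index level, which can then be moved into a column. The small all_sales dict below is made-up stand-in data for what pd.read_excel returns, and the column name "Sheet" is an arbitrary choice.

```python
import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=[...]);
# the values here are invented for illustration.
all_sales = {
    "C(1)": pd.DataFrame({0: [10, 20], 1: [1.5, 2.5]}),
    "C(2)": pd.DataFrame({0: [30], 1: [3.5]}),
}

# The dict keys become the outer index level, named "Sheet" here.
df = pd.concat(all_sales, names=["Sheet", "row"])
# Move the sheet name into an ordinary column and renumber the rows.
df = df.reset_index(level="Sheet").reset_index(drop=True)
print(df)
```

Each row then carries the name of the sheet it was read from, which covers point 3 of the task.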

I used this code and it worked:
import pandas as pd
file = 'January2020.xlsx'
days = range(1, 32)
all_df = []
for day in days:
    sheet_name = "C(" + str(day) + ")"
    all_sales = pd.read_excel(file, header=None, skiprows=41, usecols="A:CU",
                              sheet_name=sheet_name, skipfooter=10)
    # record which sheet each row came from
    all_sales["Date"] = sheet_name
    all_df.append(all_sales)
df_final = pd.concat(all_df)

Related

Pandas Only Exporting 1 Table to Excel but Printing all

The code below only exports the last table on the page to Excel, but when I run the print function, it prints all of them. Is there an issue in my code that stops it from exporting all the data to Excel?
I've also tried exporting as a .csv file, with no luck.
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index = False)
        #print(df)
Your problem is that each call to df.to_excel overwrites the file, so only the last df is left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
import pandas as pd
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
counter = 0
# the with block saves and closes the workbook on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter') as writer:
    for df in dfs:
        if len(df.columns) > 4:
            counter += 1
            df.to_excel(writer, sheet_name = f"sheet_{counter}", index = False)
You might need to pip install xlsxwriter to make it work.
Exporting to a csv cannot hold several tables this way, since a csv is a single data table (like a single sheet in Excel), so in that case you would need a separate csv file for each df.
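A minimal sketch of the one-csv-per-table approach, using two small made-up dataframes in place of the result of pd.read_html and a temporary directory as the output location. The file name pattern table_N.csv is an arbitrary choice.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-ins for the tables returned by pd.read_html(url).
dfs = [
    pd.DataFrame({"team": ["A", "B"], "score": [1, 2]}),
    pd.DataFrame({"team": ["C", "D"], "score": [3, 4]}),
]

out_dir = Path(tempfile.mkdtemp())
paths = []
for n, df in enumerate(dfs, start=1):
    # A distinct file name per table, so nothing is overwritten.
    path = out_dir / f"table_{n}.csv"
    df.to_csv(path, index=False)
    paths.append(path)

print([p.name for p in paths])
```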
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np
url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]
columns = ["gameid","game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
# np.object was removed from NumPy; the builtin object is the equivalent dtype
values = np.empty((2*N,len(columns)),dtype=object)
for i,df in enumerate(dfs):
    time = df.iloc[0,0].replace(" Game Time","")
    values[2*i:2*i+2,2:] = df.iloc[2:,:]
    values[2*i:2*i+2,:2] = np.array([[i,time],[i,time]])
newdf = pd.DataFrame(values,columns = columns)
newdf.to_excel("output.xlsx",index = False)
I used a numpy.array of object type so that I could easily copy a submatrix from each original dataframe into its intended place. I also needed to create a gameid that connects the two rows of each game. It should now be straightforward to rewrite this so that you loop through a list of urls and write the results to separate sheets.

Error when trying to read multiple sheets from excel file

In the following code I'm trying to read multiple sheets from an Excel file, remove the empty cells, group the columns and store the result in another Excel file:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
df = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
df.dropna(inplace = True)
df['Dt'] = pd.to_datetime(df['Dt']).dt.date
df1 = df.groupby(['Dt','webApp','mw'])['chgtCh','accessRecordModule','playerPlay startOver',
    'playerPlay PdL','playerPlay PVR','contentHasAds','pdlComplete','lirePdl','lireVod'].sum()
df1.reset_index(inplace=True)
df1.to_excel(r'logs1.xlsx', index = False)
When I execute my script I get the following error:
AttributeError: 'dict' object has no attribute 'dropna'
How can I fix it?
When you pass a list of sheets as the sheet_name parameter, the returned object is a dict of DataFrames, as described in the read_excel documentation.
dropna is a DataFrame method, so you have to select the sheet first, for example:
df['R9_14062021'].dropna(inplace=True)
According to the pandas documentation for pd.read_excel:
If you give sheet_name a list, you will receive a dict of DataFrames keyed by sheet name.
This means you have to go over each DataFrame and call dropna() on it separately, because a dict has no dropna(). Your code will look like this:
import pandas as pd
sheets = ['R9_14062021','LOGS R9','LOGS R7 01032021']
value_cols = ['chgtCh','accessRecordModule','playerPlay startOver','playerPlay PdL',
              'playerPlay PVR','contentHasAds','pdlComplete','lirePdl','lireVod']
dfs_list = pd.read_excel('LOGS.xlsx',sheet_name = sheets )
# one output sheet per input sheet, so an iteration does not overwrite the previous result
with pd.ExcelWriter('logs1.xlsx') as writer:
    for i in dfs_list:
        df = dfs_list[i]
        df.dropna(inplace = True)
        df['Dt'] = pd.to_datetime(df['Dt']).dt.date
        # selecting several columns after groupby needs a list, not a bare tuple
        df1 = df.groupby(['Dt','webApp','mw'])[value_cols].sum()
        df1.reset_index(inplace=True)
        df1.to_excel(writer, sheet_name = i, index = False)
The main difference here is the use of
for i in dfs_list:
    df = dfs_list[i]
to apply each change to every DataFrame in the dict. If you want one specific DataFrame, select it by its sheet name, for example dfs_list['R9_14062021'].dropna().
Hope this helps and this is what you were aiming for.
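If a single combined result across all sheets is wanted instead, one possible variation is to concatenate the dict values first and group once. This is only a sketch under that assumption: the sheets dict below is invented stand-in data for what pd.read_excel returns, and it uses a reduced set of the question's columns.

```python
import pandas as pd

# Invented stand-in for pd.read_excel('LOGS.xlsx', sheet_name=[...]).
sheets = {
    "LOGS R9": pd.DataFrame({
        "Dt": ["2021-06-14", "2021-06-14", None],
        "webApp": ["a", "a", "b"],
        "chgtCh": [1, 2, 3],
    }),
    "LOGS R7": pd.DataFrame({
        "Dt": ["2021-06-15"],
        "webApp": ["a"],
        "chgtCh": [5],
    }),
}

# Stack all sheets, then drop incomplete rows once.
combined = pd.concat(list(sheets.values()), ignore_index=True)
combined = combined.dropna().reset_index(drop=True)
combined["Dt"] = pd.to_datetime(combined["Dt"]).dt.date

# A list of column names (double brackets) selects the columns to sum.
totals = combined.groupby(["Dt", "webApp"])[["chgtCh"]].sum().reset_index()
print(totals)
```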

Reading multiple CSV files in Spark and make a DataFrame

I am using the following code to read multiple csv files, convert each one to a pandas df, and concat them into a single pandas df, which I finally convert back into a Spark DataFrame. I want to skip the conversion to pandas and simply end up with a Spark DataFrame.
File Paths
abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=1/*.csv
abfss://xxxxxx/abc/year=2021/month=1/dayofmonth=1/hour=2/*.csv
......
Code
import pandas as pd
from pyspark.sql.utils import AnalysisException
pandas_dfs = []  # renamed from `list` to avoid shadowing the builtin
for month in range(1,3,1):
    for day in range(1,31,1):
        for hour in range(0,24,1):
            file_location = "//xxxxxx/abc/year=2021/month="+str(month)+"/dayofmonth="+str(day)+"/hour="+str(hour)+"/*.csv"
            try:
                spark_df = spark.read.format("csv").option("header", "true").load(file_location)
                pandas_df = spark_df.toPandas()
                pandas_dfs.append(pandas_df)
            except AnalysisException as e:
                print(e)
final_pandas_df = pd.concat(pandas_dfs)
df = spark.createDataFrame(final_pandas_df)
You can load all the files from the base path and apply a filter on the partitioning columns, which Spark infers from the key=value directory names (year, month, dayofmonth, hour):
df = spark.read.format("csv").option("header", "true").load("abfss://xxxxxx/abc/").filter(
    'year = 2021 and month between 1 and 2 and dayofmonth between 1 and 30 and hour between 0 and 23'
)

DataFrame Split On Rows and apply on header one column using Python Pandas

I'm working on a project and ran into a messy situation where I have to split a data frame based on its first column. The data frame comes from SQL queries and goes through a lot of manipulation, which is why I'm not posting the code here.
Target: The data frame I have looks like the screenshot below, and it is available as an xlsx file.
Output: I'm looking for output like the attached file here:
I can't work out the logic to get this done on the dataframe itself, as I'm a newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
# %m is the month; %M would format the minutes
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df_sub = df[['Delivered Impressions','Clicks','Conversion','Spend']].groupby(level=0).sum()\
    .assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date',append=True).sort_index(level=0)
startline = 0
# the with block saves the workbook on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('testxls.xlsx', engine='openpyxl') as writer:
    for n,g in df_out.groupby(level=0):
        g.to_excel(writer, startrow=startline, index=True)
        startline += len(g)+2
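To see what the subtotal step produces without an Excel file, here is a small self-contained sketch on made-up data. It uses only a subset of the columns from the answer (Clicks and Delivered Impressions), and the placement names and numbers are invented.

```python
import pandas as pd

# Invented sample data with the same column names as in the question.
df = pd.DataFrame({
    "Placement# Name": ["P1", "P1", "P2"],
    "Date": ["01-01-2020", "01-02-2020", "01-01-2020"],
    "Clicks": [1, 2, 3],
    "Delivered Impressions": [10, 20, 30],
}).set_index("Placement# Name")

# One subtotal row per placement, labelled "Subtotal" in the Date column.
sub = (
    df[["Clicks", "Delivered Impressions"]]
    .groupby(level=0)
    .sum()
    .assign(Date="Subtotal")
)
sub["CTR"] = sub["Clicks"] / sub["Delivered Impressions"]

# Sorting by placement and then Date puts each "Subtotal" row after the
# dated rows here, because the date strings start with digits.
out = pd.concat([df, sub]).set_index("Date", append=True).sort_index(level=0)
print(out)
```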
Load the Excel file into a pandas dataframe, then extract rows based on a condition.
import pandas as pd
dframe = pd.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Here "Needed value" is the value in the "Placement# Name" column of the rows you want to keep.

Python: outputting lists to excel

For my master's thesis, I need to calculate expected returns for x number of stocks on a given event date. I have written the following code, which does what I intend (matching Fama & French factors with a sample of event dates). However, when I try to export it to Excel I can't seem to get the correct output: it doesn't contain the column headings, such as the dates and the names of the Fama & French factors, or the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here is my code:
import pandas as pd
# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx',
                        infer_datetime_format=True)
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True,
                 infer_datetime_format=True)
# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
# Creating an empty placeholder
end_date = []
# Iterating over the event dates, creating a start and end date 60 months apart
for index, row in df.iterrows():
    end_da = row['Date']-pd.DateOffset(months=60)
    end_date.append(end_da)
end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df,df,left_index=True,right_index=True)
m.columns = ['Start','End']
ff_factors = []
for index, row in m.iterrows():
    ff_five['Date'] = pd.to_datetime(ff_five['Date'])
    time_range = (ff_five['Date'] > row['Start']) & (ff_five['Date'] <= row['End'])
    df = ff_five.loc[time_range]
    ff_factors.append(df)
EDIT:
Here is my attempt at getting the data from Python to Excel.
ff_factors_df = pd.DataFrame(data=ff_factors)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Writing a dataframe to Excel or csv can be done with
ff_five.to_excel('Filename.xlsx')
Use to_csv instead of to_excel if you want a csv file.
OK, I tried to interpret what you were trying to do, although it was not very clear. If I understood correctly, you are trying to create some additional columns based on other data. Instead of building separate lists, you could add them as new columns and then output only the columns you want. Something like this, maybe (I had to make some assumptions and create some fake data to see whether this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
df = pd.DataFrame()
df['End'] = pd.to_datetime(["2012-12-01", "2012-12-30"])
# the window starts 60 months before each event date
df['Start'] = df['End'] - pd.DateOffset(months=60)
# row-wise comparison; relies on ff_five and df sharing the same index
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')
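Another possible way to get column headings into the export is to keep the per-event windows from the question but combine them with pd.concat instead of pd.DataFrame(data=ff_factors). The sketch below is only an illustration under that assumption: the dates, the factor column "Mkt-RF" and its values, and the added "Event" column are all made up.

```python
import pandas as pd

# Made-up factor data and event dates standing in for the two input files.
ff_five = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-31", "2012-11-30", "2015-06-30"]),
    "Mkt-RF": [0.1, 0.2, 0.3],
})
events = pd.DataFrame({"Date": pd.to_datetime(["2012-12-01", "2016-01-15"])})

windows = []
for event_date in events["Date"]:
    # Estimation window: the 60 months up to and including the event date.
    start = event_date - pd.DateOffset(months=60)
    mask = (ff_five["Date"] > start) & (ff_five["Date"] <= event_date)
    # Tag each window with its event so rows stay identifiable after stacking.
    windows.append(ff_five.loc[mask].assign(Event=event_date))

# Stacking dataframes keeps their column headings.
estimation = pd.concat(windows, ignore_index=True)
print(estimation)
# estimation.to_excel('estimation_data.xlsx', index=False) would then write the headings too.
```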
