Background: My first Excel-related script. Using openpyxl.
There is an Excel sheet with loads of different types of data about products in different columns.
My goal is to extract certain types of data from certain columns (e.g. price, barcode, status), assign those to the unique product code and then output product code, price, barcode and status to a new Excel document.
I have succeeded in extracting the data and putting it into the following dictionary format:
productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'}}
My general thinking on getting this output to a new report is something like this (although I know that this is wrong):
newReport = openpyxl.Workbook()
newSheet = newReport.active
newSheet.title = 'Output'
newSheet['A1'].value = 'Product Code'
newSheet['B1'].value = 'Price'
newSheet['C1'].value = 'Barcode'
newSheet['D1'].value = 'Status'
for row in range(2, len(productData) + 1):
    newSheet['A' + str(row)].value = productData[productCode]
    newSheet['B' + str(row)].value = productPrice
    newSheet['C' + str(row)].value = productBarcode
    newSheet['D' + str(row)].value = productStatus
newReport.save('ihopethisworks.xlsx')
What do I actually need to do to output the data?
I would suggest using Pandas for that. It has the following syntax:
df = pd.read_excel('your_file.xlsx')
df['Column name you want'].to_excel('new_file.xlsx')
You can do a lot more with it. Openpyxl might not be the right tool for your task (Openpyxl is too general).
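For the dictionary in the question, a minimal pandas sketch might look like the following. It assumes productData has the shape shown in the question; the second product, the index label and the output file name are made up for illustration, and writing .xlsx assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd

# Same shape as the dictionary in the question; 'CD456' is a made-up second entry.
productData = {
    'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'},
    'CD456': {'barcode': 654321, 'price': 75, 'status': 'OLD'},
}

# One row per product code, one column per inner key.
report = pd.DataFrame.from_dict(productData, orient='index')

# Put the columns in the order asked for and label the index column.
report = report[['price', 'barcode', 'status']]
report.index.name = 'Product Code'

# Writing the file needs an Excel engine such as openpyxl, so it is left commented out.
# report.to_excel('new_report.xlsx', sheet_name='Output')
```

The product code ends up as the index, so it is written as the first column when the frame is exported.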
P.S. I would have left this as a comment, but Stack Overflow, in its wisdom, lets anyone post answers but not comments.
The logic you use to extract the data is missing, but I suspect the best approach is to use it to loop over the two worksheets in parallel. You can then avoid using a dictionary entirely and just append rows to the new worksheet.
Pseudocode:
ws1 # source worksheet
ws2 # new worksheet
product = []
code = ws1[…] # some lookup
barcode = ws1[…]
price = ws1[…]
status = ws1[…]
ws2.append([code, price, barcode, status])
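If the data is already in a dictionary like the one in the question, the same append idea could be sketched as below. product_rows is a hypothetical helper, not part of openpyxl; it only builds the rows, each of which could then be passed to ws2.append(...).

```python
def product_rows(product_data):
    """Yield a header row, then one [code, price, barcode, status] row per product."""
    yield ['Product Code', 'Price', 'Barcode', 'Status']
    for code, details in product_data.items():
        yield [code, details['price'], details['barcode'], details['status']]


# The dictionary from the question.
productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'}}
rows = list(product_rows(productData))

# With openpyxl, each row would be written with:
#     for row in rows:
#         ws2.append(row)
```

Because append always writes to the next free row, there is no need to compute cell addresses such as 'A' + str(row).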
Pandas will work best for this.
Here is an example:
import pandas as pd
#df columns: Date Open High Low Close Volume
#reading data from an excel
df = pd.read_excel('GOOG-NYSE_SPY.xls')
#set index to the column of your choice, in this case it would be date
df.set_index('Date', inplace = True)
#choosing the columns of your choice for further manipulation
df = df[['Open', 'Close']]
#take the open-close difference as a percentage of the close
pct = (df['Open'] - df['Close']) / df['Close'] * 100
print(pct.head())
I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop pandas code that, at least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way that I would tackle it, and I am assuming there are likely to be more efficient ways, is to import the excel file into a dataframe, and then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line into a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd
# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])
# remove blank rows
df.dropna(inplace=True)
# reset the index of df
df.reset_index(drop=True, inplace=True)
# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []
# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(name_pos[1][-1])
        p_index = counter
        counter += 1
    else:
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date': date, 'amount': amount}
        list_of_lines.append(line_dict)
final_df = pd.DataFrame(list_of_lines)
OUTPUT:
I recently started parsing data from an Excel sheet using Python. The parsing works, but I don't need some of the rows from that sheet. How do I skip them (maybe using a loop)? Here is the code I wrote to parse the Excel sheet:
import xlrd
book = xlrd.open_workbook("Excel.xlsx")
sheet = book.sheet_by_index(0)
firstcol = sheet.col_values(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)]
        for r in range(sheet.nrows)]
ele = ''
year = []
for j in range(len(data)):
    if j == 1:
        year = data[j]
    if j > 2:
        ele = data[j][0]
        for i in range(1, len(data[j])):
            if ele != "":
                if data[j][i] != "":
                    if year[i] != "":
                        print([ele, data[j][i], year[i]])
With that, all rows are parsed into the list format I want, but I don't want some rows (like Total age, Total IDs, Total Result) from the Excel file. How can I exclude them in the same code, or is there a more effective way (maybe pandas) to reduce the code? The Excel file I'm referring to:
Click to see Excel.xlsx
Thanks in Advance...
If I understand correctly, you can do this much more simply. You have some list of rows to exclude:
rows_to_exclude = ['Total age', 'Total IDs', 'Total Result']
You can read in the dataframe using pd.read_excel without calling xlrd yourself (no need to specify the sheet index if it's the first sheet, which is read by default). Pass index_col=0 so that the first column becomes the row labels. Then you can drop the rows with missing values, and drop all rows whose index is in your list of excluded row labels:
df = pd.read_excel('Excel.xlsx', index_col=0)
df = df.dropna().drop(rows_to_exclude)
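As a quick check of the idea on made-up data (the row labels and values below are invented, since the real Excel.xlsx is not shown), an isin mask is a variant that, unlike drop, does not raise when one of the labels is absent:

```python
import pandas as pd

rows_to_exclude = ['Total age', 'Total IDs', 'Total Result']

# Made-up stand-in for the sheet: the first column acts as the row label.
df = pd.DataFrame(
    {'2016': [10, 20, 30, 5], '2017': [11, 21, 32, 6]},
    index=['age A', 'age B', 'Total age', 'ID A'],
)

# Keep only the rows whose label is not in the exclusion list.
filtered = df[~df.index.isin(rows_to_exclude)]
```

Here only 'Total age' is present in the frame, and the other two labels are simply ignored.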
I have an Excel workbook with multiple sheets containing sales data. I am trying to sort them so that each customer has a separate sheet (in a different workbook) with the item details. I have created a dictionary with all customer names.
for name in cust_dict.keys():
    cust_dict[name] = pd.DataFrame(columns=cols)
for sheet in sheets:
    ws = sales_wb.sheet_by_name(sheet)
    code = ws.cell(4, 0).value  # This is the item code
    df = pd.read_excel(sales_wb, engine='xlrd', sheet_name=sheet, skiprows=7)
    df = df.fillna(0)
    count = 0
    for index, row in df.iterrows():
        print('rotation ' + str(count))
        count += 1
        if row['Particulars'] != 0 and row['Particulars'] not in no_cust:
            cust_name = row['Particulars']
            # try:
            cust_dict[cust_name] = cust_dict[cust_name].append(df.loc[df['Particulars'] == cust_name], ignore_index=False)
            cust_dict[cust_name] = cust_dict[cust_name].drop_duplicates()
            cust_dict[cust_name]['Particulars'] = code
Right now I have to drop duplicates because Particulars contains the client name more than once, so the code appends the same data x number of times.
I would like to avoid this but I can't seem to figure out a good way to do it.
The second problem is that the code column ends up holding the code from the last sheet for all rows, whereas I want each row to keep the code of the sheet it was pulled from.
I can't seem to figure out a way around either of these problems.
Thanks
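One possible way around both problems, sketched on made-up data: tag each sheet's rows with that sheet's item code before combining them, then split the combined table once per customer with groupby. The frames, the Qty and Item Code columns and the no_cust entry below are invented for illustration; in the real script each frame would come from pd.read_excel.

```python
import pandas as pd

# Made-up stand-ins for two sheets, keyed by the item code read from each sheet.
sheet_frames = {
    'ITEM-1': pd.DataFrame({'Particulars': ['Alice', 'Bob', 'Alice', 'Total'],
                            'Qty': [1, 2, 3, 6]}),
    'ITEM-2': pd.DataFrame({'Particulars': ['Bob', 'Total'],
                            'Qty': [5, 5]}),
}
no_cust = ['Total']  # labels that are not customer names (assumption)

tagged = []
for code, df in sheet_frames.items():
    # Drop non-customer rows and record the code in its own column,
    # so it stays tied to the rows of this sheet only.
    rows = df[~df['Particulars'].isin(no_cust)].copy()
    rows['Item Code'] = code
    tagged.append(rows)

combined = pd.concat(tagged, ignore_index=True)

# Each row is taken exactly once, so no drop_duplicates is needed.
cust_dict = {name: group.reset_index(drop=True)
             for name, group in combined.groupby('Particulars')}
```

Keeping the code in a separate column, rather than overwriting Particulars, also means genuinely repeated sales rows are not lost to drop_duplicates.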
I'm working on a project and have run into a messy situation: I need to split a data frame based on its first column. The data frame comes from SQL queries and goes through a lot of manipulation, which is why I'm not posting the code here.
Target: The data frame I have looks like the screenshot below, and it's available as an xlsx file.
Output: I'm looking for output like the attached file:
I can't work out the logic for doing this on the data frame itself, as I'm a newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
# %m is the month; %M would give minutes
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')
# per-placement subtotals (sum(level=0) was removed in pandas 2.0)
df_sub = (df[['Delivered Impressions', 'Clicks', 'Conversion', 'Spend']]
          .groupby(level=0).sum()
          .assign(Date='Subtotal'))
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']
df_out = pd.concat([df, df_sub]).set_index('Date', append=True).sort_index(level=0)
startline = 0
# the context manager saves the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('testxls.xlsx', engine='openpyxl') as writer:
    for n, g in df_out.groupby(level=0):
        g.to_excel(writer, startrow=startline, index=True)
        startline += len(g) + 2
Load the Excel file into a Pandas dataframe, then extract rows based on condition.
dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
Here "Needed value" stands for the value in the Placement# Name column that identifies the rows you want.
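To get every group at once rather than filtering one value at a time, groupby can be used. The sketch below runs on made-up rows; only the column names Placement# Name and Clicks are taken from this thread.

```python
import pandas as pd

# Made-up rows using column names from the question.
dframe = pd.DataFrame({
    'Placement# Name': ['P1', 'P1', 'P2'],
    'Clicks': [10, 20, 5],
})

# One sub-frame per distinct placement, instead of filtering value by value.
parts = {name: group for name, group in dframe.groupby('Placement# Name')}

# Per-placement totals, comparable to a subtotal row.
subtotals = dframe.groupby('Placement# Name')['Clicks'].sum()
```

Each entry of parts can then be written to its own sheet or block of rows.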
For my master's thesis, I need to calculate expected returns for x number of stocks on a given event date. I have written the following code, which does what I intend (match Fama & French factors with a sample of event dates). However, when I try to export the result to Excel, I can't seem to get the correct output: it doesn't contain column headings such as Date and the names of the Fama & French factors, or the corresponding rows.
Does anybody have a workaround for this? Any improvements are gladly appreciated. Here is my code:
import pandas as pd
# Data import
ff_five = pd.read_excel('C:/Users/MBV/Desktop/cmon.xlsx',
                        infer_datetime_format=True)
df = pd.read_csv('C:/Users/MBV/Desktop/4.csv', parse_dates=True,
                 infer_datetime_format=True)
# Converting dates to datetime
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
# Creating an empty placeholder
end_date = []
# Iterating over the event dates, creating a start and end date 60 months apart
for index, row in df.iterrows():
    end_da = row['Date']-pd.DateOffset(months=60)
    end_date.append(end_da)
end_date_df = pd.DataFrame(data=end_date)
m = pd.merge(end_date_df,df,left_index=True,right_index=True)
m.columns = ['Start','End']
ff_factors = []
for index, row in m.iterrows():
    ff_five['Date'] = pd.to_datetime(ff_five['Date'])
    time_range = (ff_five['Date'] > row['Start']) & (ff_five['Date'] <= row['End'])
    df = ff_five.loc[time_range]
    ff_factors.append(df)
EDIT:
Here is my attempt at getting the data from Python to Excel.
ff_factors_df = pd.DataFrame(data=ff_factors)
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('estimation_data.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
ff_factors_df.to_csv(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Writing a dataframe to Excel or CSV can be done with
ff_five.to_excel('Filename.xlsx')
Use to_csv instead of to_excel if you want a CSV file.
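In the question, ff_factors is a list of DataFrames, and passing such a list to pd.DataFrame(...) is unlikely to give one flat table with the original headings. A sketch of one alternative, on made-up factor data (the Mkt-RF column, the values and the event/row level names are invented), is to combine the list with pd.concat before exporting:

```python
import io
import pandas as pd

# Made-up stand-ins for the per-event slices collected in ff_factors.
ff_factors = [
    pd.DataFrame({'Date': pd.to_datetime(['2012-10-31', '2012-11-30']),
                  'Mkt-RF': [0.5, 0.7]}),
    pd.DataFrame({'Date': pd.to_datetime(['2013-01-31']),
                  'Mkt-RF': [1.2]}),
]

# Stack the slices; the 'event' level records which slice each row came from.
ff_factors_df = pd.concat(
    ff_factors,
    keys=list(range(len(ff_factors))),
    names=['event', 'row'],
)

# Export with headers; a file path works the same way as the buffer used here.
buffer = io.StringIO()
ff_factors_df.to_csv(buffer)
csv_text = buffer.getvalue()

# Writing Excel instead needs an engine such as openpyxl or XlsxWriter:
# ff_factors_df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')
```

The first line of csv_text carries the column headings, which is the part reported missing in the question.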
OK, it wasn't very clear what you were trying to do, so I had to interpret it. If I understood correctly, you are trying to create some additional columns based on other data. Instead of creating separate lists, you could add them as new columns and then output just the columns you want. Something like this, maybe (I had to make some assumptions and create some fake data to see if this is on the right track):
import pandas as pd
ff_five = pd.DataFrame()
ff_five['Date'] = ["2012-11-01", "2012-11-30"]
df = pd.DataFrame()
df['Date'] = ["2012-12-01", "2012-12-30"]
df['Date'] = pd.to_datetime(df['Date'])
# the window starts 60 months before each event date and ends on it
df['Start'] = df['Date'] - pd.DateOffset(months=60)
df = df.rename(columns={'Date': 'End'})
ff_five['Date'] = pd.to_datetime(ff_five['Date'])
# row-wise check: does the factor date fall inside that row's window?
df['ff_factor'] = (ff_five['Date'] > df['Start']) & (ff_five['Date'] <= df['End'])
df.to_excel('estimation_data.xlsx', sheet_name='Sheet1')