New to Python/pandas... I get a profit and loss report from my trading brokerage as a PDF. I can convert it into a spreadsheet that, for each page of the PDF report, splits the content into headings on one worksheet and a table on another. For now I do this simply and cleanly using export from Adobe PDF.
What I'm trying to figure out next is how to loop through the workbook and ignore tabs whose headings or data don't match the columns of the tabs I want to append. I can follow this tutorial https://pythoninoffice.com/use-python-to-combine-multiple-excel-files/ for the most part, but I have no idea how to generalise the process so that, as the PDF report grows over time, I don't need to tell pandas which tab numbers to append and which to ignore. I'd rather run a loop in Python that keeps going until it runs out of tabs/worksheets, appending only the ones that match certain headers.
Any help or pointers would be appreciated.
Ignore tabs with this header: [screenshot of the tabs highlighted in red]
Append tabs with this header: [screenshot of the tabs highlighted in green]
You could take advantage of the fact that you only want sheets with more than one row: read all the sheets, then concat only the ones with more than one row. Here goes:
import pandas as pd

# create sample data
df_many_lines = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
df_one_line = pd.DataFrame([10, 20, 30]).T

# create a sample file with a mix of one-row and multi-row sheets
with pd.ExcelWriter('output.xlsx') as writer:
    for i in range(10):
        df_one_line.to_excel(writer, sheet_name=f'Sheet_name_{i}', index=False)
        df_many_lines.to_excel(writer, sheet_name=f'Sheet_name_1{i}', index=False)

# read all the sheets into a dict of dataframes
df_dict = pd.read_excel('output.xlsx', sheet_name=None)

# concat the ones with more than one row into one df
combined = pd.concat([df for df in df_dict.values() if df.shape[0] > 1]).reset_index(drop=True)
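If you'd rather key off the headers, as the question describes, instead of the row count, here is a variation; the column names below are hypothetical placeholders for your real trade-tab headers:
import pandas as pd

# hypothetical headers of the tabs worth appending - replace with your own
wanted_cols = ['Symbol', 'Qty', 'Price']

# read every sheet, keep only the ones whose columns match, then combine
df_dict = pd.read_excel('output.xlsx', sheet_name=None)
matching = [df for df in df_dict.values() if list(df.columns) == wanted_cols]
combined = pd.concat(matching, ignore_index=True)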
I'm the original author of that tutorial you quoted at https://pythoninoffice.com
That's a lot of trades you made! If I understand your question correctly, you are trying to:
Ignore tabs that contain only the header row, and
Append tabs that contain actual trade information
Since this is a P&L report, I'm assuming that all trade tabs (the green tabs you highlighted) have the same format? And that all the header (red) tabs likewise contain only headers?
You can probably read all tabs and check whether there's actually anything inside the dataframe. If not, you know it's a header (red) tab; otherwise, if there's data, it's a trade (green) tab.
import pandas as pd

# collect the non-empty sheets, then combine them at the end
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
frames = []
with pd.ExcelFile('file_path') as excel:
    for sheet in excel.sheet_names:
        df = excel.parse(sheet)
        if not df.empty:
            frames.append(df)

temp = pd.concat(frames, ignore_index=True)
On a side note, to make your process more streamlined, you can use Python to convert the PDF to Excel instead of doing it manually in Adobe: https://pythoninoffice.com/pdf-to-excel-with-python/
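The linked tutorial covers the details; as one rough sketch of the idea, using the camelot library (my assumption, not necessarily what the tutorial uses):
import camelot
import pandas as pd

# pull every table out of the PDF, then stack them into one dataframe
# ('report.pdf' is a placeholder for your brokerage report)
tables = camelot.read_pdf('report.pdf', pages='all')
combined = pd.concat([t.df for t in tables], ignore_index=True)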
Related
I am trying to create a Hyperlink for each item in a column based on another column.
Here is an image to help you understand better:
Each title should hyperlink to the corresponding URL (when you click apple, it should go to apple.com; when you click banana, it should go to banana.com; and so on). Is there a way to do this to a CSV file in Python?
(Let's say my data is 2000 rows.)
Thanks in advance.
You can use (I used) the pandas library to read the data from CSV and later write it to Excel, leveraging the Excel function HYPERLINK to make the cell a, well, hyperlink. The HYPERLINK function requires the URL we are going to (with http:// or https:// at the beginning) and then the visible text, or friendly name.
One thing the Excel file will not have when you open it is blue underlined text for the hyperlinked cells. If that is needed, you can probably adapt this solution and also use the XlsxWriter module.
import pandas as pd

# read the file into a dataframe, basically a table
# (note the raw string: in a plain "path\test.csv" the \t would become a tab)
df = pd.read_csv(r"path\test.csv")  # put your actual path here

# the df has the columns 'title' and 'url'. This rewrites the values in
# 'title' to use the HYPERLINK formula in Excel, where the link is the value
# from the 'url' column with 'https://' added to the beginning, and the
# 'title' value is the text the user will see in Excel.
df['title'] = '=HYPERLINK("https://' + df['url'] + '","' + df['title'] + '")'

# save the new file as Excel; index=False removes the index column that is
# created whenever you make a dataframe
df.to_excel(r"save_location\test2.xlsx", index=False)
If you want your output to be just one column, the one they click, your final line will be slightly different:
df['title'].to_excel(r"save_location\test2.xlsx", index=False)
It did not work, or probably you have included too many quotation marks.
import pandas as pd

df = pd.read_csv('test.csv')
df['URL'] = 'https://url/' + df['id'].astype(str)  # build the link from the id column
keep_col = ['URL']
newurl = df[keep_col]  # keep only the URL column
newurl.to_csv("newurl.csv", index=False)
This code works, but the output file does not show a clickable URL.
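A likely reason: CSV is plain text, so it cannot store a clickable link. A minimal sketch that applies the HYPERLINK approach from the answer above and writes to .xlsx instead, keeping the snippet's column names:
import pandas as pd

df = pd.read_csv('test.csv')
# wrap the built URL in Excel's HYPERLINK formula so the cell is clickable
df['URL'] = '=HYPERLINK("https://url/' + df['id'].astype(str) + '")'
# a CSV cannot hold a live link, so save to Excel instead
df[['URL']].to_excel('newurl.xlsx', index=False)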
I'm working with openpyxl on a .xlsx file which has around 10K products, of which some are "regular items" and some are products that need to be ordered when required. For the project I'm doing, I would like to delete all the rows containing the items that need to be ordered.
I tested this with a small sample of the actual workbook and had the code working the way I wanted. However, when I tried it on the actual workbook with 10K rows, it seems to be taking forever to delete those rows (it has been running for nearly an hour now).
Here's the code that I used:
import openpyxl

wb = openpyxl.load_workbook('prod.xlsx')
sheet = wb['Sheet1']  # get_sheet_by_name is deprecated; index the workbook instead

def clean_workbook():
    for row in sheet:
        for cell in row:
            if cell.value == 'ordered':
                sheet.delete_rows(cell.row)
I would like to know: is there a faster way of doing this with some tweaks to my code? Or is there a better way to read just the regular stock from the workbook without deleting the unwanted items?
Deleting rows in loops can be slow, because openpyxl has to update all the cells below each deleted row, so you should do it as little as possible. One way is to collect a list of row numbers, check for contiguous groups, and then delete from the bottom up using this list.
A better approach might be to loop through ws.values and write to a new worksheet, filtering out the relevant rows. Copy any other relevant data, such as formatting, separately. Then you can delete the original worksheet and rename the new one.
ws1 = wb['My Sheet']
ws2 = wb.create_sheet('My Sheet New')

status_col = 0  # we can assume this is always the same column; adjust the index to your data

for row in ws1.values:
    if row[status_col] == "ordered":
        continue  # skip items that need to be ordered
    ws2.append(row)

del wb["My Sheet"]
ws2.title = "My Sheet"
For more sophisticated filtering, you will probably want to load the values into a pandas dataframe, make the changes, and then write to a new sheet.
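A sketch of that pandas route, assuming the sheet's first row is a header and 'status' is a hypothetical name for the column holding the "ordered" flag:
import pandas as pd

df = pd.read_excel('prod.xlsx', sheet_name='Sheet1')
kept = df[df['status'] != 'ordered']  # keep only the regular items
kept.to_excel('prod_filtered.xlsx', index=False)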
You can open the workbook in read-only mode and import all the content into a list; modifying a list is always a lot faster than working in Excel. After you modify the list, make a new worksheet and upload your list back to Excel. I did it this way with my 100k-item Excel file.
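A minimal sketch of that read-only approach, assuming (as in the question) that the rows to drop contain a cell whose exact value is 'ordered':
from openpyxl import Workbook, load_workbook

# read-only mode streams rows instead of building the full in-memory cell tree
wb = load_workbook('prod.xlsx', read_only=True)
ws = wb['Sheet1']

# filter in memory: keep rows where no cell equals 'ordered'
rows = [row for row in ws.values if 'ordered' not in row]

# write the kept rows to a fresh workbook
new_wb = Workbook()
new_ws = new_wb.active
for row in rows:
    new_ws.append(row)
new_wb.save('prod_filtered.xlsx')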
I have a few hundred files with data and hyperlinks in them that I was trying to load and append into a single DataFrame when I realized that pandas was not reading any of the hyperlinks.
I then tried to use openpyxl to read the hyperlinks in the input Excel files and write a new column into them with the text of each hyperlink, which pandas can hopefully read into my dataframe.
However, I am running into issues while testing the openpyxl code. It is able to read and write some of the hyperlinks but not others.
My sample file has three rows and looks like this:
My actual data has hyperlinks in the way I have it for "Google" in my test data set.
The other two hyperlinks in my test data I inserted by right-clicking on the cell and pasting the link.
Sample Test file here: Text.xlsx
Here is the code I wrote to read each hyperlink and paste it in a new column. It works for the first two rows (India and China) but fails for the third row (Google). That's unfortunate, because all of my actual data is of that type. Can someone please help me figure it out?
import openpyxl

wb = openpyxl.load_workbook('test.xlsx')
ws = wb.active

column_indices = [1]
max_col = ws.max_column
ws.cell(row=1, column=max_col + 1).value = "Hyperlink Text"

for row in range(2, ws.max_row + 1):
    for col in column_indices:
        print(ws.cell(row, column=1).hyperlink.target)
        ws.cell(column=max_col + 1, row=row).value = ws.cell(row, column=1).hyperlink.target

wb.save('test.xlsx')
The cells where you are using the HYPERLINK function (like google.com) will not be of type hyperlink, so their .hyperlink attribute is None. You will need to process the cells that contain the HYPERLINK function using re or a similar approach.
The values look like this:
>>> ws.cell(2,1).value
'China'
>>> ws.cell(3,1).value
'India'
>>> ws.cell(4,1).value
'=HYPERLINK("www.google.com","google")'
Suggested code to handle HYPERLINK:
import re

val = ws.cell(row, column).value
if isinstance(val, str) and val.startswith('=HYPERLINK'):
    # pull the target URL out of '=HYPERLINK("url","friendly name")'
    hyplink = re.match(r'=HYPERLINK\("([^"]+)"', val).group(1)
Note: the second for loop iterating over columns seems unnecessary, since you always use column=1.
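Putting both cases together, a hedged sketch of the row loop (it reuses ws and max_col from the question's code, and the regex assumes the formula's first argument is the quoted URL):
import re

for row in range(2, ws.max_row + 1):
    cell = ws.cell(row=row, column=1)
    if cell.hyperlink is not None:
        # a real hyperlink object, e.g. pasted via right-click in Excel
        target = cell.hyperlink.target
    elif isinstance(cell.value, str) and cell.value.startswith('=HYPERLINK'):
        # a HYPERLINK formula: extract the first quoted argument
        m = re.match(r'=HYPERLINK\("([^"]+)"', cell.value)
        target = m.group(1) if m else None
    else:
        target = None
    ws.cell(row=row, column=max_col + 1).value = target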
I am reading an Excel sheet and plucking data from rows containing a given PO.
import pandas as pd

xlsx = pd.ExcelFile('Book2.xlsx')
df = pd.read_excel(xlsx)

PO_arr = ['121121', '212121']

for PO in PO_arr:
    PO_DATA = df.loc[df['PONUM'] == PO]
    for line in range(1, max(PO_DATA['POLINENUM'].values) + 1):
        ...  # rest of the processing omitted here
When I take this Excel sheet straight from its source, my code works fine. But when I cut out only the rows I want and paste them into a new spreadsheet with the exact same formatting, then read the new spreadsheet, I have to change PO_DATA to look for an integer instead of a string, as such:
PO_DATA = df.loc[df['PONUM'] == int(PO)]
If not, I get a warning, and calling PO_DATA returns an empty dataframe.
C:\...\pandas\core\ops\array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
I checked the cell formatting in Excel, and in both cases the cells are formatted as 'General'.
What is going on that makes it so that when I chop up my spreadsheet, I have to look for an integer and not a string? What do I have to do to make this work for sheets I've created by pasting in the relevant data, instead of only sheets from the source?
Excel can do some funky formatting when copy and paste (Ctrl+C / Ctrl+V) is used.
I am sure you tried these, but...
A) Try copying with Ctrl+C, then pasting values only with Ctrl+Alt+V, "V", Enter, on the new sheet/file.
B) Try using the Format Painter in Excel (it looks like a paintbrush on the Home tab): select the properly formatted cells first, double-click the Format Painter, move to your new file/sheet, and select the cells you want the format to conform to.
C) Select the table you pasted into your new file, click the eraser icon on Excel's Home tab, and choose Clear Formats.
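Before changing anything in Excel, it can also help to check what dtype pandas actually inferred for the column in each file; a quick diagnostic using the question's df:
# 'object' means pandas read strings; 'int64' means it read integers
print(df['PONUM'].dtype)
print(df['PONUM'].head())  # eyeball a few values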
Update: I found an old related thread that didn't necessarily answer the question but solved the problem.
You can force pandas to import values as a certain datatype when reading from Excel, using the converters argument of read_excel:
df = pd.read_excel(xlsx, converters={'POLINENUM':int,'PONUM':int})
It would help me produce output that is a lot neater and a little more 'human-like' if I could use pandas and XlsxWriter in a way that would stack two dataframes, one on top of the other, on the same sheet of the Excel spreadsheet I am outputting.
Please note the data of the two dataframes is related but different, one being a summary of the other.
Is there a neat way I can just take my dataframe and my summary dataframe and stack them on the same sheet?
Yes, it's very much possible.
You will have to build the Excel file using XlsxWriter and keep track of the current cell as you write.
Here is some pseudo-code/syntax (I use XlsxWriter as an extension of pandas):
import pandas as pd

with pd.ExcelWriter('output.xlsx', engine='xlsxwriter') as writer:
    # write the first dataframe starting at an arbitrary cell
    row, column = 9, 1
    df1.to_excel(writer, sheet_name='My Tab', header=False,
                 startrow=row, startcol=column, index=False)

    # move past df1 (its height plus a gap), then write the summary below it
    row += len(df1) + 4
    df2.to_excel(writer, sheet_name='My Tab', header=False,
                 startrow=row, startcol=column, index=False)
I am missing some parts here, but all I really wanted to do was illustrate the point.
I've built out a flimsy Report class to do this for me. You can see some of my syntax in its .write_tab method.