I'm working with Python on Excel files. Until now I was using openpyxl. I need to iterate over the rows and delete some of them if they do not meet specific criteria. Let's say I was using something like:
current_row = 1
while current_row <= ws.max_row:
    if 'something' in ws[f'L{current_row}'].value:
        ws.delete_rows(current_row)
        continue
    current_row += 1
Everything was fine until I encountered a problem with ws.max_row. In a new Excel file which I received to process, ws.max_row was returning more rows than actually exist. After some googling I found out why this happens.
Here's a great explanation of the problem which I found in a comment section on Stack Overflow:
However, ws.max_row will not check if last rows are empty or not. If cell's content at the end of the worksheet is deleted using Del key or by removing duplicates, remaining empty rows at the end of your data will still count as a used row. If you do not want to keep these empty rows, you will have to delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete) –
V. Brunelle
Thanks to V. Brunelle for the very good explanation of the cause of the problem.
In my case it's because some of the rows were deleted by removing duplicates. For example, there are 400 rows in my file listed one after another (without any gaps), but ws.max_row returns 500.
For now I'm using a quick fix:
while current_row <= len([row for row in data_ws.iter_rows(min_row=min_row) if not all([cell.value is None for cell in row])]):
But I know that it is very inefficient, which is the reason why I'm asking this question. I'm looking for a possible solution.
From what I've found here on Stack Overflow, I can:
Create a copy of the worksheet, iterate over that copy, and call ws.delete_rows on the original worksheet, so I will need to apply my fix only once.
Iterate backwards with a for loop so I won't have to deal with ws.max_row, since for loops work fine in that case (they read the proper file dimensions). This method seems promising to me (see the sketch after this list), but I always have 4 rows at the top of the worksheet which I'm not touching at all, and any debugging would need to be done backwards as well, which might not be very enjoyable :D
Use another Python library to process Excel files, but I don't know which one would be better, because keeping the workbook styles is very important to me (and making changes to them if needed). I've read some promising things about the pywin32 library (win32com.client), but it seems to lack documentation, it might be hard to work with, and I don't know how it performs. I was also considering pandas, but to put it kindly it messes up the styles (in reality it deletes all styles in the worksheet).
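For reference, here is a minimal sketch of the backwards-iteration idea, assuming the worksheet object is called data_ws as in my quick fix above, the filter column is L and the first 4 rows are headers that should be left alone (the filename and sheet name below are placeholders):

from openpyxl import load_workbook

wb = load_workbook("input.xlsx")   # placeholder filename
data_ws = wb["Sheet1"]             # placeholder sheet name

# Walk from the bottom up so deleting a row never shifts the rows
# that are still waiting to be checked.
for row_number in range(data_ws.max_row, 4, -1):  # stop above the 4 header rows
    value = data_ws[f"L{row_number}"].value
    if isinstance(value, str) and "something" in value:
        data_ws.delete_rows(row_number)

wb.save("output.xlsx")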
I'm stuck now, because I really don't know which route I should choose.
I would appreciate any advice or opinion on the topic, and if possible I would like to have a small discussion here.
Best regards!
If max_row doesn't report what you expect, you'll need to sort the issue out as best you can. That might be by deleting the rows manually ("delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete)"), or by making some other determination in your code as to what the last row is and then programmatically deleting all the rows from there to max_row, so that it at least reports correctly on the next code run.
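As a rough sketch of that second idea (assuming the worksheet object is called data_ws, as in the question), you could scan upwards for the last row that actually holds data and then delete everything below it in one call:

# Find the last row that actually contains data, scanning from the bottom.
last_data_row = data_ws.max_row
while last_data_row > 1 and all(cell.value is None for cell in data_ws[last_data_row]):
    last_data_row -= 1

# Delete the trailing "used but empty" rows, if there are any.
if last_data_row < data_ws.max_row:
    data_ws.delete_rows(last_data_row + 1, data_ws.max_row - last_data_row)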
You could also incorporate your fix code into your example code for deleting rows that meet specific criteria.
For example, a test sheet has 9 rows of data but cell B15 is an empty string, so max_row returns 15 rather than 9.
The example code checks each used cell in the row for a None value and only processes the 9 rows with data.
from openpyxl import load_workbook

filename = "foo.xlsx"

wb = load_workbook(filename)
data_ws = wb['Sheet1']

print(f"Max Rows Reports {data_ws.max_row}")

for row in data_ws:
    # A row counts as data only if every used cell in it has a value;
    # the first fully empty row marks the end of the real data.
    if all(cell.value is not None for cell in row):
        print(f"Checking row {row[0].row}")
        if 'something' in data_ws[f'L{row[0].row}'].value:
            data_ws.delete_rows(row[0].row)
    else:
        print(f"Actual Max Rows is {row[0].row - 1}")
        break

wb.save('out_' + filename)
Output
Max Rows Reports 15
Checking row 1
Checking row 2
Checking row 3
Checking row 4
Checking row 5
Checking row 6
Checking row 7
Checking row 8
Checking row 9
Actual Max Rows is 9
Of course this is not perfect: if any of the 9 rows with data had one cell value of None, the loop would stop at that point. However, if you know that's not going to be the case, it may be all you need.
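If data rows may legitimately contain a None cell, a slightly different sketch is to scan every row and remember the last one that holds any value at all, rather than stopping at the first gap:

# Remember the last row that has at least one non-empty cell, so a single
# None value inside the data does not cut the search short.
actual_max_row = 0
for row in data_ws.iter_rows():
    if any(cell.value is not None for cell in row):
        actual_max_row = row[0].row

print(f"Actual Max Rows is {actual_max_row}")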
I am currently working on an NLP project on my own, and am having some trouble even after reading through the documentation. I scraped Reddit posts and wanted to find out which posts are duplicated in the 'selftext' and 'title' columns. The 3 code snippets shown below are what was input, and the results are shown in the picture.
May I ask why there are non-duplicated posts for snippets 2 and 3 with reference to snippet 1?
(1)investing_data[['selftext','title']][investing_data.duplicated(subset=['selftext','title'])]
(2)investing_data[['selftext', 'title']][investing_data.duplicated(subset=['selftext'])]
(3)investing_data[['selftext', 'title']][investing_data.duplicated(subset=['title'])]
(screenshot of the 3 snippets above)
You are in fact checking different data for duplicates.
What you see in all three cases are the duplicates, i.e. the second, third and so on occurrences of a row.
In line (1): you check if both selftext and title are the same.
In line (2): you check for entries that have a duplicated selftext.
In line (3): you check for entries that have a duplicated title.
Using your visualisation you only see the duplicates, but not all entries that actually are duplicates (i.e. including the first occurrence).
For that you can use something like the following:
investing_data[investing_data['selftext'].isin(investing_data[investing_data.duplicated(subset=['selftext'])]['selftext'])][['selftext', 'title']]
What this does is take the selftext values that are flagged as duplicated, search the original dataframe for exactly those entries, and display their selftext and title.
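Assuming your dataframe is called investing_data as above, the same result can usually be obtained more directly with the keep=False argument of duplicated, which marks every occurrence of a duplicated value (including the first) instead of only the later ones:

# keep=False flags every row whose 'selftext' occurs more than once,
# including the first occurrence.
investing_data[investing_data.duplicated(subset=['selftext'], keep=False)][['selftext', 'title']]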
I'm requesting data from a RESTful API. The first request is written to a csv file with no problems. In the csv file the data has 5 header rows (including column headers), 11 rows of actual data (13 fields per row), and an EOF row, so 17 rows of data in all (the data as it appears following a print(response.text) command is shown at the end of this post).
For subsequent requests to the API I simply want to append the 11 rows of data (i.e. rows 6 through 16) to the existing csv file. This is a process I will repeat numerous times in order to create a single large csv file with probably close to a million rows of data. I'm struggling to find a way to manipulate the data returned by the API so as to allow me to only write rows 6 through 16 to the csv file.
I'm pretty new to coding and Python, so I'd be grateful for suggestions as to how to proceed.
This is what the data looks like from a Python print command (the first asterisk is row 1; the fifth asterisk denotes the start of the column headings, with 'Document RevNum' being the last column heading):
*
*
*Actual Aggregated Generation Per Type (B1620) Data
*
*Document Type,Business Type,Process Type,Time Series ID,Quantity,Curve Type,Resolution,Settlement Date,Settlement Period,Power System Resource Type,Active Flag,Document ID,Document RevNum
Actual generation per type,Solar generation,Realised,NGET-EMFIP-AGPT-TS-21614701,3250,Sequential fixed size block,PT30M,2020-07-01,21,"Solar",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614702,2075.338,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Offshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614703,1486.519,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Onshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614704,258,Sequential fixed size block,PT30M,2020-07-01,21,"Other",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614705,4871,Sequential fixed size block,PT30M,2020-07-01,21,"Nuclear",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614706,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Oil",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614707,16448,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Gas",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614708,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Hard coal",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614709,783,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Run-of-river and poundage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614710,118,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Pumped Storage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614711,3029,Sequential fixed size block,PT30M,2020-07-01,21,"Biomass",Y,NGET-EMFIP-AGPT-06372506,1
<EOF>
I interpret your problem as not being able to append the newly fetched data to the CSV file. I am not sure whether you are working with some module that helps with CSV files, but I'll just assume you aren't for now.
If you just open the file with "a" as the second argument to open, like f = open("file.csv", "a"), you can easily append the new content. You would first have to strip the EOF row though and later append a new one, but I think that isn't the problem.
Hope I could help you, and please tell me whether I understood correctly what your problem is here :)
By the way, I would recommend looking into the csv module for this, or something like sqlite3.
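To illustrate that last suggestion (this is only a sketch, not your actual code), you could parse the 11 data rows of a follow-up response and append them with csv.writer; new_text and "file.csv" below are placeholders:

import csv

# Skip the 4 header rows and the column-heading row, drop the trailing <EOF>
# row, and parse what remains as CSV so quoted fields are handled correctly.
data_rows = csv.reader(new_text.splitlines()[5:-1])

with open("file.csv", "a", newline="") as f:
    csv.writer(f).writerows(data_rows)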
The solution which seems to work is as follows:
Recall that the API returns a long string of comma-separated data for a given date.
When the data is written to a csv file it presents as 4 rows of 'header' data that I'm not interested in, 1 row of column heading data (13 columns), 11 rows of the data that I AM interested in (with data in all 13 columns), and 1 ("EOF") row that I don't need.
On the first API query I want to create a csv file with only the column headings and 11 rows of data, jettisoning the first 4 rows (redundant info) and the last row ("EOF") of data.
On all subsequent API queries I simply want to append the 11 rows of data to the already created csv file.
The API response is returned as a string.
The following code excludes the first 4 rows and the last row ("EOF") but still writes the column headings and the 11 rows of useful data to the newly created csv file:
# Write to .CSV
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "w") as f:
    f.write(api_response.text[59:-5])
The first 59 characters and last 5 characters of the API response string will always be the same so I can be confident that this will work for the initial API response.
For subsequent API responses I use the following code to append to the csv file:
# Now append the api response data to the CSV file
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "a") as f:
    f.write(api_response.text[248:-5])
This excludes the first 248 and last 5 characters of the API response appended to the csv file. The first 248 characters of the string will always contain the same redundant info, and the last 5 characters will always be the '<EOF>' marker, so again I can be confident that only the 11 rows of data I am interested in will be appended to the csv file.
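For completeness, here is a line-based sketch of the same idea that does not rely on the character offsets staying fixed (first_response and next_response are placeholders for the response objects returned by the API requests):

def useful_lines(response_text, keep_headings):
    # Layout described above: 4 header rows, 1 column-heading row,
    # 11 data rows, then a trailing <EOF> row.
    lines = response_text.splitlines()
    start = 4 if keep_headings else 5   # keep the column headings only once
    return [line for line in lines[start:] if line and line != "<EOF>"]

csv_path = 'C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv'

# First response: create the file with the column headings and the data rows.
with open(csv_path, "w") as f:
    f.write("\n".join(useful_lines(first_response.text, keep_headings=True)) + "\n")

# Later responses: append the data rows only.
with open(csv_path, "a") as f:
    f.write("\n".join(useful_lines(next_response.text, keep_headings=False)) + "\n")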
The solution, for this particular case, turned out to be simpler than I had expected. Thanks to CodingGuy for directing me to think in terms of stripping data from the start and end of the API response, and for exploring the type of data I was dealing with.
Please let me know if there are problems with my solution. Likewise, I'm always interested to learn more sophisticated ways of handling data within Python.