Duplicated function returning non-duplicated results on a BLAST hit table - python

New to python (3 weeks!) and unsurprisingly having difficulties. I'm working on a BLAST hit table and am trying to identify sequences coming from the same hit by checking for duplicates on the accession number only. I do not want to discard these results but rather save them to a new file so I can take a look to see if anything interesting is popping up.
A snippet of the table (this is purely an example; the real table has 11 columns, but it seems excessive to print them all here):
Query    Accession        % identity    mismatches    gaps    s start    s end
Q112     ABCDEFG111222    90.99         9             3       1000       2000
Q112     HIJKLMN222111    80            14            98      128        900
Q112     OPQRSTUV33111    76            2             23      12         900
I'm importing the file into a data frame using pandas, then using reset_index to replace the query number with a default integer index.
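A minimal sketch of that import step (the file name and tab separator here are just placeholders, not my actual values):
import pandas as pd

# placeholder file name and separator
file = pd.read_csv('blast_hittable.tsv', sep='\t')
# reset_index() swaps the existing index for a default integer index
file = file.reset_index(drop=True)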
I have then done the following:
To find out if I have any duplicates in the Accession column:
print(file.Accession.duplicated().sum())
To put those results into a new data frame (the number of rows matches the sum above):
filedupes = file.loc[file.Accession.duplicated()]
Finally, to write it to a csv for me to look through:
filedupes.to_csv('Filedupes.csv', sep='\t', encoding='utf-8')
This does half work, as my CSV does contain duplicated entries based on Accession number only, but it also contains some unique Accession number entries. These entries only seem to share the first two letters with other entries; the rest is unique,
i.e. I have XM_JK1234343 and XM_983HSJAN and XM_83QMZBDH1 included despite no other matching entry being present (I have checked using find/replace). The other 11 columns are also unique to these strange entries.
I am stumped: is it my code? Have I not specified enough, allowing the above examples to be lumped in with the legitimate duplicates? I have tried to find whether someone else has asked a similar question, but no luck. Thank you kindly in advance for any insight, and apologies in advance if this is a silly mistake!

Related

How to compare 2 Excel files using Python and get the count of mismatched cells

I need to use Python to compare two Excel files and identify the differences.
I used the code below to mark the differences:
# rows, cols: the row/column indices of the mismatched cells (built elsewhere, e.g. with np.where)
for item in zip(rows, cols):
    self.df1.iloc[item[0], item[1]] = '{} --> {}'.format(
        self.df1.iloc[item[0], item[1]], self.df2.iloc[item[0], item[1]])
self.me = self.df1.to_excel('output.xlsx', index=False, header=True, sheet_name='output')
but I also need to obtain the following detailed information:
the total number of cells that are mismatched and matched
count of rows that are mismatched and matched
count of columns that are mismatched and matched
Please guide me towards a solution because I have no idea how to achieve this.
Thanks
Manoranjini
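For illustration, one rough sketch of getting those counts, assuming df1 and df2 are the two sheets loaded with pd.read_excel and have identical shape and column order (the file names are placeholders):
import pandas as pd

df1 = pd.read_excel('file1.xlsx')   # placeholder file names
df2 = pd.read_excel('file2.xlsx')

# True wherever the two cells differ (note: NaN vs NaN also counts as a mismatch here)
mismatch = df1.ne(df2)

mismatched_cells = int(mismatch.values.sum())
matched_cells = mismatch.size - mismatched_cells

mismatched_rows = int(mismatch.any(axis=1).sum())
matched_rows = len(mismatch) - mismatched_rows

mismatched_cols = int(mismatch.any(axis=0).sum())
matched_cols = mismatch.shape[1] - mismatched_cols

print(f"Cells: {matched_cells} matched, {mismatched_cells} mismatched")
print(f"Rows: {matched_rows} matched, {mismatched_rows} mismatched")
print(f"Columns: {matched_cols} matched, {mismatched_cols} mismatched")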

Python. Deleting Excel rows while iterating. Alternative for OpenPyXl or solution for ws.max_rows wrong output

I'm working with Python on Excel files. Until now I was using openpyxl. I need to iterate over the rows and delete some of them if they do not meet specific criteria. Let's say I was using something like:
current_row = 1
while current_row <= ws.max_row:
    if 'something' in ws[f'L{current_row}'].value:
        ws.delete_rows(current_row)
        continue
    current_row += 1
Everything was all right until I encountered a problem with ws.max_row. In a new Excel file which I received to process, ws.max_row was returning more rows than actually exist. After some googling I found out why it is happening.
Here's a great explanation of the problem which I've found in the comment section on the Stack:
However, ws.max_row will not check if last rows are empty or not. If cell's content at the end of the worksheet is deleted using Del key or by removing duplicates, remaining empty rows at the end of your data will still count as a used row. If you do not want to keep these empty rows, you will have to delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete) –
V. Brunelle
Thanks V. Brunelle for the very good explanation of the cause of the problem.
In my case it is because some of the rows were deleted by removing duplicates. For example, there are 400 rows in my file listed one after another (without any gaps), but ws.max_row is returning 500.
For now I'm using a quick fix:
while current_row <= len([row for row in data_ws.iter_rows(min_row=min_row) if not all([cell.value is None for cell in row])]):
But I know that it is very inefficient. That's the reason why I'm asking this question. I'm looking for possible solution.
From what I've found here on the Stack I can:
Create a copy of the worksheet, iterate over that copy and call ws.delete_rows on the original worksheet, so I would only need to run my fix once.
Iterate backwards with a for loop, so I won't have to deal with ws.max_row since for loops work fine in that case (they read the proper file dimensions). This method seems promising to me, but I always have 4 rows at the top of the workbook which I'm not touching at all, and any debugging would need to be done backwards as well, which might not be very enjoyable :D (a small sketch of this approach follows this question).
Use another python library to process Excel files, but I don't know which one would be better, because keeping workbook styles is very important to me (and making changes to them if needed). I've read some promising things about the pywin32 library (win32com.client), but it seems to lack documentation, it might be hard to work with, and I don't know how it performs. I was also considering pandas, but to put it kindly it messes with the styles (in reality it deletes all styles in the worksheet).
I'm stuck now because I really don't know which route I should choose.
I would appreciate any advice or opinions on the topic, and if possible I would like to have a small discussion here.
Best regards!
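A minimal sketch of the backwards iteration from option 2, reusing the data_ws worksheet and the 'something' criterion in column L from the snippet above:
# deleting from the bottom up means earlier row indices are never disturbed
for row_number in range(data_ws.max_row, 4, -1):   # stop before the 4 untouched rows at the top
    cell_value = data_ws[f'L{row_number}'].value
    if isinstance(cell_value, str) and 'something' in cell_value:
        data_ws.delete_rows(row_number)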
If max_row doesn't report what you expect, you'll need to sort the issue out as best you can. That might be by manually deleting: "delete those entire rows by selecting rows number on the left of your spreadsheet and deleting them (right click on selected row number(s) -> Delete)", or by making some other determination in your code as to what the last row is and then programmatically deleting all the rows from there to max_row, so at least it reports correctly on the next code run.
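A rough sketch of that last idea, trimming the trailing empty rows so max_row reports correctly next time (data_ws is assumed to be the worksheet in question):
# find the last row that actually contains data
last_data_row = 0
for row in data_ws.iter_rows():
    if any(cell.value is not None for cell in row):
        last_data_row = row[0].row

if last_data_row < data_ws.max_row:
    # delete everything from the row after the last real data row down to max_row
    data_ws.delete_rows(last_data_row + 1, data_ws.max_row - last_data_row)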
You could also incorporate your fix code into your example code for deleting rows that meet specific criteria.
For example: a test sheet has 9 rows of data, but cell B15 is an empty string, so max_row returns 15 rather than 9.
The example code checks each used cell in the row for a None value and only processes the 9 rows with data.
from openpyxl import load_workbook

filename = "foo.xlsx"
wb = load_workbook(filename)
data_ws = wb['Sheet1']

print(f"Max Rows Reports {data_ws.max_row}")

for row in data_ws:
    print(f"Checking row {row[0].row}")
    if all(cell.value is not None for cell in row):
        if 'something' in data_ws[f'L{row[0].row}'].value:
            data_ws.delete_rows(row[0].row)
    else:
        print(f"Actual Max Rows is {row[0].row}")
        break

wb.save('out_' + filename)
Output
Max Rows Reports 15
Checking row 1
Checking row 2
Checking row 3
Checking row 4
Checking row 5
Checking row 6
Checking row 7
Checking row 8
Checking row 9
Actual Max Rows is 9
Of course this is not perfect: if any of the 9 rows with data had one cell value of None, the loop would stop at that point. However, if you know that's not going to be the case, it may be all you need.

Why is duplicated() showing different results for similar code on the same dataframe?

I am currently working on an NLP project on my own and am having some trouble even after reading through the documentation. I scraped reddit posts and wanted to find out which posts are duplicated in the 'selftext' and 'title' columns. The 3 snippets shown below are what was run, and the results are shown in the picture.
May I ask why there are non-duplicated posts for snippets 2 and 3 with reference to snippet 1?
(1)investing_data[['selftext','title']][investing_data.duplicated(subset=['selftext','title'])]
(2)investing_data[['selftext', 'title']][investing_data.duplicated(subset=['selftext'])]
(3)investing_data[['selftext', 'title']][investing_data.duplicated(subset=['title'])]
screenshot of the 3 codes above
You are in fact checking different data for duplicates:
What you see in all three lines are the duplicates, i.e. the second, third and so forth occurrence of a row.
In line (1): you check if both selftext and title are the same.
In line (2): you check for entries that have a duplicated selftext.
In line (3): you check for entries that have a duplicated title.
Using your visualisation you only see the duplicates, but not all entries that actually have duplicates (i.e. including the first occurrence).
For that you can use something like the following:
investing_data[investing_data['selftext'].isin(investing_data[investing_data.duplicated(subset=['selftext'])]['selftext'])][['selftext', 'title']]
What this does is take the selftext values that are duplicated, search the original dataframe for exactly those entries, and display selftext and title.
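An equivalent and often simpler sketch uses keep=False, which marks every occurrence of a duplicated value, including the first one:
investing_data[investing_data.duplicated(subset=['selftext'], keep=False)][['selftext', 'title']]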

Manipulating API data in Python before appending to a csv file

I'm requesting data from a RESTful API. The first request is written to a csv file with no problems. In the csv file the data has 5 header rows (including the column headers), 11 rows of actual data (13 fields per row), and an EOF row, so 17 rows in all (the data as it appears following a print(response.text) command is shown at the end of this post).
For subsequent requests to the API I simply want to append the 11 rows of data (i.e. rows 6 through 16) to the existing csv file. This is a process I will repeat numerous times in order to create a single large csv file with probably close to a million rows of data. I'm struggling to find a way to manipulate the data returned by the API so as to allow me to only write rows 6 through 16 to the csv file.
I'm pretty new to coding and Python, so I'd be grateful for suggestions as to how to proceed.
This is what the data looks like from a Python 'print' command (the first asterisk is row 1; the fifth asterisk denotes the start of the column headings, with 'Document RevNum' being the last column heading):
*
*
*Actual Aggregated Generation Per Type (B1620) Data
*
*Document Type,Business Type,Process Type,Time Series ID,Quantity,Curve Type,Resolution,Settlement Date,Settlement Period,Power System Resource Type,Active Flag,Document ID,Document RevNum
Actual generation per type,Solar generation,Realised,NGET-EMFIP-AGPT-TS-21614701,3250,Sequential fixed size block,PT30M,2020-07-01,21,"Solar",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614702,2075.338,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Offshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Wind generation,Realised,NGET-EMFIP-AGPT-TS-21614703,1486.519,Sequential fixed size block,PT30M,2020-07-01,21,"Wind Onshore",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614704,258,Sequential fixed size block,PT30M,2020-07-01,21,"Other",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614705,4871,Sequential fixed size block,PT30M,2020-07-01,21,"Nuclear",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614706,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Oil",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614707,16448,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Gas",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614708,0,Sequential fixed size block,PT30M,2020-07-01,21,"Fossil Hard coal",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614709,783,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Run-of-river and poundage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614710,118,Sequential fixed size block,PT30M,2020-07-01,21,"Hydro Pumped Storage",Y,NGET-EMFIP-AGPT-06372506,1
Actual generation per type,Production,Realised,NGET-EMFIP-AGPT-TS-21614711,3029,Sequential fixed size block,PT30M,2020-07-01,21,"Biomass",Y,NGET-EMFIP-AGPT-06372506,1
<EOF>
I interpret your problem as not being able to append only the newly fetched data rows to the CSV file. I am not sure whether you are working with a module that helps with CSV files, so I'll assume for now that you aren't.
If you open the file with "a" as the second argument to open, like f = open("file.csv", "a"), you can easily append the new content. You would first have to strip the EOF row and later append a new one, but I don't think that is the problem.
Hope I could help you, and please tell me whether I understood your problem correctly :)
Btw, I would recommend looking into the csv module for this, or something like sqlite3.
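As a rough sketch of that idea, assuming api_response.text holds the raw response and the layout is always 4 header rows, one column-heading row, 11 data rows and a final <EOF> row (the output file name is a placeholder):
lines = api_response.text.splitlines()
# keep only the data rows: skip the 4 header rows and the column headings,
# and drop the trailing <EOF> marker and any blank lines
data_rows = [line for line in lines[5:] if line.strip() and line.strip() != '<EOF>']

with open('output.csv', 'a') as f:
    f.write('\n'.join(data_rows) + '\n')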
The solution which seems to work is as follows:
Recall that the API returns a long string of comma-separated data for a given date.
When the data is written to a csv file it presents as 4 rows of 'header' data that I'm not interested in, 1 row of column heading data (13 columns), 11 rows of the data that I AM interested in (with data in all 13 columns), and 1 ("EOF") row that I don't need.
On the first API query I want to create a csv file with only the column headings and 11 rows of data, jettisoning the first 4 rows (redundant info) and the last row ("EOF") of data.
On all subsequent API queries I simply want to append the 11 rows of data to the already created csv file.
The API response is returned as a string.
The following code excludes the first 4 rows and the last row ("EOF") but still writes the column headings and the 11 rows of useful data to the newly created csv file:
# Write to .CSV
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "w") as f:
    f.write(api_response.text[59:-5])
The first 59 characters and last 5 characters of the API response string will always be the same so I can be confident that this will work for the initial API response.
For subsequent API responses I use the following code to append to the csv file:
# Now append the api response data to the CSV file
with open('C:/Users/Paul/.spyder-py3/Elexon API Data 05.csv', "a") as f:
    f.write(api_response.text[248:-5])
This excludes the first 248 and last 5 characters of the API response from being appended to the csv file. The first 248 characters of the string will always be the same redundant info, and the last 5 characters will always be the '<EOF>' marker, so again I can be confident that only the 11 rows of data I am interested in will be appended to the csv file.
The solution, for this particular case, turned out to be simpler than I had expected, and thanks to CodingGuy for directing me to think in terms of stripping data from the start and end of the api response, and for exploring the type of data I was dealing with.
Please let me know if there are problems with my solution. Likewise, I'm always interested to learn more sophisticated ways of handling data within Python.

Most efficient way to pull information from a table

As a new-to-python person, the only way I can think of to grab information from a table (separated only by whitespace) is to call an item by its position using the row and column it's located in. I've been using numpy a lot, like so:
import numpy as np

information_table = np.array([])  # the table I'm pulling data from
info_i_need_from_table = information_table[i][j]  # where i/j is the location of whatever info I need
Is this the optimal way of grabbing information from a table? I'm new to python, so as far as I know this is the only way to do it, but I'd be willing to bet I'm wrong. Say my information_table is fairly large: thousands of rows and hundreds of columns. Would you use the same tool to pull information from, say, a much smaller table?
As an example of one of the tables I'm working with:
/SAH/SAH5/jimunoz/DUSTYlib2/models_Y100_699K/COMPACTs3300_Al_g+1.5_m1.0_t02_st_z-0.25.inp 3300 699 1.000E+02 Al2O3 1.06E+05 1.26E-05 1.70E+14 81 2.61E-10 0.360484737991 0.77871386826 1.03440307618 0.568135259544 0.157877963222 0.0791445961324 0.0398783584044 0.0159762347055 0.000741792598059
/SAH/SAH5/jimunoz/DUSTYlib2/models_Y100_699K/COMPACTs3300_Al_g+1.5_m1.0_t02_st_z-0.25.inp 3300 699 1.000E+02 Al2O3 1.06E+05 1.60E-05 1.70E+14 81 3.12E-10 0.360484737991 0.77871386826 1.03440307618 0.568135259544 0.157877963222 0.0791445961324 0.0398783584044 0.0159762347055 0.000741792620505
Those are just the first two rows. Another table I may be working with would just have numerous rows and columns of floating point numbers (at about 5 sig-figs).
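One common alternative is to load the whole table into a pandas DataFrame and pull values (or whole columns) by position. A rough sketch, assuming the file is whitespace-separated with no header row (the file name is a placeholder):
import pandas as pd

# each column gets its own dtype, so mixed string/float tables like the one above are fine
information_table = pd.read_csv('models_table.txt', sep=r'\s+', header=None)

# a single value by row/column position, as before
value = information_table.iloc[0, 5]

# or an entire column at once, which is usually faster than element-by-element access
quantities = information_table[5]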
