Sort a list based on values from dataframe - python

Is my approach here the right way to do it in Python? As I'm new to Python, I appreciate any feedback you can provide, especially if I'm way off here.
My task is to order a list of file names based on values from a dataset. Specifically, these are file names that I need to sort based on site information. The resulting list is the order in which the reports will be printed.
Site Information
key_info = pd.DataFrame({
'key_id': ['1010','3030','2020','5050','4040','4040']
, 'key_name': ['Name_A','Name_B','Name_C','Name_D','Name_E','Name_E']
, 'key_value': [1,2,3,4,5,6]
})
key_info = key_info[['key_id','key_name']].drop_duplicates()
key_info['key_id'] = key_info.key_id.astype('str').astype('int64')
Filenames
These are the file names I need to sort. In this example, I sort by just the key_id, but I assume I could easily add a column to site information and sort it by that as well.
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
Sorting
The resulting "filenames" is the final sorted list.
names_df = pd.DataFrame({'filename': filenames})
names_df['key_id'] = names_df.filename.str[:4].astype('str').astype('int64')
merged_df = pd.merge(key_info, names_df, on='key_id', how='right')
merged_df = merged_df.sort_values('key_id')
filenames = merged_df['filename'].tolist()
I'm looking for any solutions that might be better or more Pythonic. Or, if there is a more appropriate place to post "code review" questions.

I like your use of Pandas, but it isn't the most Pythonic, since Pandas brings in data structures that sit on top of the core language. Nevertheless, I think we can improve on what you have. I will show an improved Pandas version and a completely native Python way to do it; either is fine, I suppose.
The strictly Python version is best for people who don't know Pandas, as there's a large learning curve associated with it.
Common
For both examples, let's assume a function like this:
def trim_filenames(filename):
    return filename[0:4]
I use this in both examples.
Improvements
# Load the DataFrame and give it a proper index (I added some data)
key_info = pd.DataFrame(
    index=['2020', '5050', '4040', '4040', '6000', '7000', '1010', '3030'],
    data={'key_name': ['Name_C', 'Name_D', 'Name_E', 'Name_E', 'Name_F', 'Name_G', 'Name_A', 'Name_B'],
          'key_value': [1, 2, 3, 4, 5, 6, 7, 8]},
)
# Eliminate duplicates and sort in one step
key_info = key_info.groupby(key_info.index).first()
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
names_df = pd.DataFrame({'filename': filenames})
# Let's give this an index too so we can match on the index (not the function call)
names_df.index = names_df.filename.transform(trim_filenames)
combined = pd.concat([key_info,names_df], axis=1)
combined matches rows by index, but there are some keys with no filenames. It looks like this now:
key_name key_value filename
1010 Name_A 7 1010_Filename
2020 Name_C 1 2020_Filename
3030 Name_B 8 3030_Filename
4040 Name_E 3 4040_Filename
5050 Name_D 2 5050_Filename
6000 Name_F 5 NaN
7000 Name_G 6 NaN
Now we drop the NaN rows and create the list of filenames:
combined.filename.dropna().values.tolist()
['1010_Filename', '2020_Filename', '3030_Filename', '4040_Filename', '5050_Filename']
Python Only Version (no framework)
# Note: the duplicate '4040' key from the sample data collapses automatically
# in a dict literal -- the later entry wins.
key_info = {
    '2020': {'key_name': 'Name_C', 'key_value': 1},
    '5050': {'key_name': 'Name_D', 'key_value': 2},
    '4040': {'key_name': 'Name_E', 'key_value': 3},
    '4040': {'key_name': 'Name_E', 'key_value': 4},
    '6000': {'key_name': 'Name_F', 'key_value': 5},
    '7000': {'key_name': 'Name_G', 'key_value': 6},
    '1010': {'key_name': 'Name_A', 'key_value': 7},
    '3030': {'key_name': 'Name_B', 'key_value': 8},
}
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
# Let's get a dictionary of filenames that is keyed by the same key as in key_info:
hashed_filenames = {}
for filename in filenames:
    # Note here I'm using the function again
    hashed_filenames[trim_filenames(filename)] = filename
# We'll store the new filenames in new_filenames:
new_filenames = []
# Sort the key info and loop over it
for key in sorted(key_info.keys()):
    # For each key, if it appears in hashed_filenames, add its filename to the list
    if key in hashed_filenames:
        new_filenames.append(hashed_filenames[key])
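If you prefer comprehensions, the same two steps can be written more compactly (an equivalent sketch of the loop above):
hashed_filenames = {trim_filenames(f): f for f in filenames}
new_filenames = [hashed_filenames[key] for key in sorted(key_info) if key in hashed_filenames]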
Summary
Both solutions are concise, and I like Pandas, but I prefer something that is immediately readable by anyone who knows Python. The Python-only solution (of course, they are both Python) is the one you should go with, in my opinion.

out_list = []
for x in key_info.key_id:
    for f in filenames:
        if str(x) in f:
            out_list.append(f)
out_list
['1010_Filename', '3030_Filename', '2020_Filename', '5050_Filename', '4040_Filename']

Related

Keeping data together that are linked in a dataframe when only one is included in an API

I am fairly new to Python so hoping to hear your thoughts on this scenario and question:
I have a dataframe with 1803 rows and 4 columns
Two of the columns are "Sequence" and "Gene names" from which the seqs derive.
My code for a loop that iterates over each seq for the IEDB API is working fine:
url = "http://tools-cluster-interface.iedb.org/tools_api/mhci/"
method = "netmhcpan_el"
allele = "HLA-A*02:01"
length = "9"
for index, row in df_SEQS.iterrows():
    sequence = row["Sequence"]
    netmhcpan_el_query = {"method": method, "sequence_text": sequence, "allele": allele, "length": length}
    netmhcpan_el_resp = requests.post(url, netmhcpan_el_query)
    output_netmhcpan_el = netmhcpan_el_resp.text
But the output is a list of sequences that are no longer paired with their gene of origin.
How can I get the output to include sequence-gene pairs when the API does not have a variable for "Gene names" from the original df?
I've run out of ideas.
Thanks very much for any suggestions
LG
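One way to keep each sequence paired with its gene of origin is to collect the gene name alongside each response as you loop, rather than keeping only the response text. A minimal sketch along the lines of the question's own loop (df_SEQS and the "Gene names" column are as described above):
import requests

url = "http://tools-cluster-interface.iedb.org/tools_api/mhci/"
method = "netmhcpan_el"
allele = "HLA-A*02:01"
length = "9"

results = []  # one entry per row, keeping the gene name next to the API output
for index, row in df_SEQS.iterrows():
    query = {"method": method, "sequence_text": row["Sequence"], "allele": allele, "length": length}
    resp = requests.post(url, query)
    results.append({"gene": row["Gene names"], "sequence": row["Sequence"], "output": resp.text})
The API never needs to see the gene name; the pairing lives entirely on your side.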

Order file based on numbers in name

I have a bunch of file with names as follows:
tif_files = av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
However, they are not guaranteed to be in any sort of order.
I want to sort them based on names such that files from the same year are sorted together. When I do this
sorted(tif_files, key=lambda x:x.split('_')[-1][:-4])
I get the following result:
av_v5_1983_001.tif, av_v5_1984_001.tif, av_v5_1985_001.tif...av_v5_2021_001.tif
but I want this:
av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
Take the last two parts using [2:]; for example, that gives ['1984', '001.tif']:
tif_files = 'av_v5_1983_001.tif', 'av_v5_1983_002.tif', 'av_v5_1983_003.tif',\
'av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v5_2021_001.tif', 'av_v5_2021_002.tif'
sorted(tif_files, key=lambda x: x.split('_')[2:])
# ['av_v5_1983_001.tif',
# 'av_v5_1983_002.tif',
# 'av_v5_1983_003.tif',
# 'av_v5_1984_001.tif',
# 'av_v5_1984_002.tif',
# 'av_v5_2021_001.tif',
# 'av_v5_2021_002.tif']
If you have v1 or v2 or ... v5 or ..., you need to take the version number into account as well, like below:
tif_files = ['av_v1_1983_001.tif', 'av_v5_1983_002.tif', 'av_v6_1983_002.tif',
             'av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v4_2021_001.tif',
             'av_v5_2021_001.tif', 'av_v5_2021_002.tif', 'av_v4_1984_002.tif']
sorted(tif_files, key=lambda x: [x.split('_')[2:], x.split('_')[1]])
Output:
['av_v1_1983_001.tif',
'av_v5_1983_002.tif',
'av_v6_1983_002.tif',
'av_v5_1984_001.tif',
'av_v4_1984_002.tif',
'av_v5_1984_002.tif',
'av_v4_2021_001.tif',
'av_v5_2021_001.tif',
'av_v5_2021_002.tif']
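Note that comparing the version as a raw string stops working once versions reach two digits, since 'v10' sorts before 'v2' lexicographically. If that can happen, parse the number out, for example:
def version_aware_key(name):
    parts = name.split('_')  # e.g. ['av', 'v5', '1984', '002.tif']
    # year and index as before, then the numeric part of the version
    return parts[2], parts[3], int(parts[1].lstrip('v'))

sorted(tif_files, key=version_aware_key)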
Your key x.split('_')[-1][:-4] produces 001 and so on, so you were sorting by the file index alone, not by year. Python's sort is stable, so sort by the secondary key (the index) first, then sort again by the primary key (the year):
tif_files = sorted(tif_files, key=lambda x: x.split('_')[-1][:-4])  # index (secondary)
tif_files = sorted(tif_files, key=lambda x: x.split('_')[2])        # year (primary)
As long as your naming convention remains consistent, you should be able to just sort them alphanumerically. As such, the below code should work:
sorted(tif_files)
If you instead wanted to sort by the last two numbers in the file name while ignoring the prefix, you would need something a bit more dramatic that would break those numbers out and let you order by them. You could use something like the below:
import pandas as pd
tif_files_list = [[xx, int(xx.split("_")[2]), int(xx.split("_")[3])] for xx in tif_files]
tif_files_frame = pd.DataFrame(tif_files_list, columns=["Name", "Primary Index", "Secondary Index"])
tif_files_frame_ordered = tif_files_frame.sort_values(["Primary Index", "Secondary Index"], axis=0)
tif_files_ordered = tif_files_frame_ordered["Name"].tolist()
This breaks the numbers in the names out into separate columns of a Pandas Dataframe, then sorts your entries by those broken out columns, at which point you can extract the ordered name column on its own.
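If you would rather skip the pandas dependency, the same ordering can be had with a plain tuple-of-integers key; a minimal sketch, assuming the av_vX_YYYY_NNN.tif pattern:
tif_files_ordered = sorted(
    tif_files,
    key=lambda name: (int(name.split("_")[2]), int(name.split("_")[3][:-4])),
)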
If the key returns a tuple of two values, the sort function will sort by the first value and then by the second value.
Please refer to: https://stackoverflow.com/a/5292332/9532450
tif_files = [
"hea_der_1983_002.tif",
"hea_der_1983_001.tif",
"hea_der_1984_002.tif",
"hea_der_1984_001.tif",
]
def parse(filename: str) -> tuple[str, str]:
    split = filename.split("_")
    return split[2], split[3]

sort = sorted(tif_files, key=parse)
print(sort)
Output:
['hea_der_1983_001.tif', 'hea_der_1983_002.tif', 'hea_der_1984_001.tif', 'hea_der_1984_002.tif']
Right-click your folder and click Sort by >> Name.

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using python. I am currently using the import csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i][8]
label = Row[i][10]
if category != label:
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
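A minimal sketch of that comparison with the standard csv module (the filename and the 0-based column positions are assumptions based on the question):
import csv

difference = correct = totalChecked = 0
with open('data.csv', newline='') as f:  # 'data.csv' is a placeholder filename
    reader = csv.reader(f)
    next(reader)  # skip the header row, if there is one
    for row in reader:
        category = row[8]   # "Column[8]" from the question, read as a 0-based index
        label = row[10]     # "Column[10]" from the question, read as a 0-based index
        if category != label:
            difference += 1
        else:
            correct += 1
        totalChecked += 1
print(100 * correct / totalChecked, '% match')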
Convert both to pandas dataframes and compare them, similar to this example. Whatever dataset you're working on, using the pandas module (alongside any other relevant modules) and transforming the data into lists and dataframes is the first step to working with it, in my opinion.
I've taken the time to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths in this example, so that's good. I've tested the code below (Python 3.8) and it works successfully.
With only slight adaptations, it can be used for your specific data columns, objects, and purposes.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)

my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()

Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)

import re

filename = 'seq_match_compare2.csv'
f = open(filename, 'a')  # 'w' would overwrite instead of appending
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)

for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
I also tried it myself, assuming your columns contain integers, and following your specifications as best I can. You could use my code below as a benchmark for how to move forward on your question.
Basically it does what you want (gives you the skeleton): it imports the CSV into Python with the pandas module, converts it to a dataframe, works on specific columns only, makes a new results column, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's as messy as my Python is, but it works.
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace=True)
#A["Blank1"].fillna("empty data - missing value", inplace=True)
# ...etc
print(A.columns)

MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()

Ref_dict = dict(zip(My_Cats, My_Labs))  # good for comparing whole columns as a block
print(Ref_dict)

print("Given Dataframe :\n", A)

# The difference of the two numeric columns, added as a new results column
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)

# You can do other matches, comparisons and calculations here and add them to the output

# Save the original data plus the new results column to a new CSV
A.to_csv('some_name5523.csv', index=False)
Yes, I know, it's by no means perfect, but I wanted to give you the heads-up about pandas and dataframes for doing what you want moving forward.

How do I avoid repetition in python codes?

I have data from an instrument (a dendrometer) installed in 9 different trees; the variables are date and time, increment, and temperature. I go to the field once a month and download more data. Right now I have done 3 field trips, so I call the files "den11, den12, den13...", the first number being the tree (number 1) and the second the trip (1, 2, 3).
I have a few routines to run before I concatenate them and end up with only 9 files (dendrom1, dendrom2, dendrom3...) so I can do some plotting and analysis. But so far I have done a lot of copy and paste in my code, which takes a while, is boring, and looks terrible. I've tried a for loop, but I'm a Python newbie, learning by myself, and this part I haven't cracked.
For example, to read each excel file I had to:
#Tree1
den11= pd.read_excel('den11.xlsx')
den12= pd.read_excel('den12.xlsx')
den13= pd.read_excel('den13.xlsx')
#Tree2
den21= pd.read_excel('den21.xlsx')
den22= pd.read_excel('den22.xlsx')
den23= pd.read_excel('den23.xlsx')
...
#Tree9
Then, to avoid repeating 3 times for each of the nine trees I tried to recreate each of the filenames and assign it to 'f':
trips = [1, 2, 3]
trees = range(1, 10)
for tree in trees:
    for trip in trips:
        f = 'den' + str(tree) + str(trip)
        print(f)
And then maybe I could read each of them and have their names assigned as new variables, but I'm clearly missing something here:
os.chdir('...\Plantation\Dendrometers')
basepath = '...\Plantation\Dendrometers'
dlist = os.scandir(basepath)
for dendrometer in dlist:
    f = pd.read_excel(dendrometer)
(I used 'os.scandir' instead of 'os.listdir' because I read that scandir can be iterated; I thought that could be an issue)
It didn't work, then I tried assigning a list with all the filenames:
flist = ['den11','den12','den13','den21','den22','den23','den31',
'den32','den33','den41','den42','den43','den51','den52',
'den53','den61','den62','den63','den71','den72','den73',
'den81','den82','den83','den91','den92','den93']
which didn't work either; I guess I can't perform functions using tuples.
What would be best not to repeat basic routines for each of the files and be prepared for next data coming up? This is what I've done and it feels terrible:
new_columns = ['date','increment','temp']
den11.columns = new_columns
den12.columns = new_columns
den13.columns = new_columns
den21.columns = new_columns
...
den11.set_index('date', inplace=True)
den12.set_index('date', inplace=True)
...
den11 = den11.loc['2019-02-14':]
den12 = den12.loc['2019-02-14':]
...
dendrom1 = pd.concat([den11,den12,den13])
...
dendrom1 = dendrom1.loc[~dendrom1.index.duplicated(keep='first')]
...dendrom9 = dendrom9.loc[~dendrom9.index.duplicated(keep='first')]
It would be amazing if I could just add one trip, load the folder with new filenames and run the code generating the merged file 'dendrom' for each tree.
Try os.listdir:
d = {}
for i in os.listdir():
    if '.xlsx' in i:
        df = pd.read_excel(i)
        # do all your operations here that you do for every dataframe
        ...
        d[i] = df
And to extract a specific dataframe, use:
print(d['den11.xlsx'])
and it will output the dataframe you wanted.
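Building on that, here is a sketch that also merges the trips per tree using the steps from your question (the den<tree><trip>.xlsx naming is assumed):
import os
import pandas as pd

new_columns = ['date', 'increment', 'temp']
trees = {}  # tree number -> list of dataframes, one per trip

for name in sorted(os.listdir()):
    if name.startswith('den') and name.endswith('.xlsx'):
        tree = name[3]  # fourth character is the tree number, e.g. den11.xlsx -> '1'
        df = pd.read_excel(name)
        df.columns = new_columns
        df.set_index('date', inplace=True)
        df = df.loc['2019-02-14':]
        trees.setdefault(tree, []).append(df)

# One merged frame per tree, duplicates dropped; access them as dendrom['1'] etc.
dendrom = {tree: pd.concat(frames) for tree, frames in trees.items()}
dendrom = {tree: df.loc[~df.index.duplicated(keep='first')] for tree, df in dendrom.items()}
Adding a fourth trip then only means dropping the new files into the folder and rerunning.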

Looking for a better way do accomplish dataframe to dictionary by series

Here's a portion of what the Excel file looks like. Meant to include this the first time. Thanks for the help so far.
Name Phone Number Carrier
FirstName LastName1 3410142531 Alltel
FirstName LastName2 2437201754 AT&T
FirstName LastName3 9247224091 Boost Mobile
FirstName LastName4 6548310018 Cricket Wireless
FirstName LastName5 8811620411 Project Fi
I am converting a list of names, phone numbers, and carriers to a dictionary for easy reference by other code. The idea is separate code will be able to call a name and access that person's phone number and carrier.
I got the output I need, but I'm wondering whether there is an easier way I could have accomplished this task and gotten the same output. Though it's fairly concise, I'm interested in any module or built-in of which I'm not aware. My Python skills are beginner at best. I wrote this in Thonny with Python 3.6.4. Thanks!
#Imports
import pandas as pd
import math
# Assign spreadsheet filename to `file`
file = 'Phone_Numbers.xlsx'
# Load spreadsheets
xl = pd.ExcelFile(file)
# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1', header=0)
# Put the dataframe into a dictionary to start
phone_numbers = df1.to_dict(orient='records')
# Converts PhoneNumbers.xlsx to a dictionary
x = 0
temp_dict = {}
for item in phone_numbers:
    temp_list = []
    for key in phone_numbers[x]:
        tempholder = phone_numbers[x][key]
        # Checks for blanks and for phone numbers that come up as floats
        if (isinstance(tempholder, float) or isinstance(tempholder, int)) and math.isnan(tempholder) == False:
            # Converts any floats to string for use in later code
            tempholder = str(int(tempholder))
        temp_list.append(tempholder)
    # Makes the first item in the list the key and adds the rest as values
    temp_dict[temp_list[0]] = temp_list[1:]
    x += 1
print(temp_dict)
Here's the desired output:
{'FirstName LastName1': ['3410142531', 'Alltel'], 'FirstName LastName2': ['2437201754', 'AT&T'], 'FirstName LastName3': ['9247224091', 'Boost Mobile'], 'FirstName LastName4': ['6548310018', 'Cricket Wireless'], 'FirstName LastName5': ['8811620411', 'Project Fi']}
One way to do it would be to iterate through the dataframe and use a dictionary comprehension:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']] for _, row in df.iterrows()}
where df is your original dataframe (the result of xl.parse('Sheet1', header=0)). This basically iterates through all rows in your dataframe, creating a dictionary key for each Name, with phone number and carrier as its values (in a list), as you indicated in your output.
To make sure that your phone number is not null (as you did in your loop), you could add an if clause to your dict comprehension, such as this:
temp_dict = {row['Name']: [row['Phone Number'], row['Carrier']]
             for _, row in df.iterrows()
             if not math.isnan(row['Phone Number'])}
df.set_index('Name').T.to_dict('list')
should do the job. Here, df is your dataframe.
