Order file based on numbers in name - python

I have a bunch of files with names as follows:
tif_files = av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
However, they are not guaranteed to be in any sort of order.
I want to sort them based on names such that files from the same year are sorted together. When I do this
sorted(tif_files, key=lambda x:x.split('_')[-1][:-4])
I get the following result:
av_v5_1983_001.tif, av_v5_1984_001.tif, av_v5_1985_001.tif...av_v5_2021_001.tif
but I want this:
av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif

Take the last two parts using [2:], for example ['1984', '001.tif']:
tif_files = 'av_v5_1983_001.tif', 'av_v5_1983_002.tif', 'av_v5_1983_003.tif',\
'av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v5_2021_001.tif', 'av_v5_2021_002.tif'
sorted(tif_files, key=lambda x: x.split('_')[2:])
# ['av_v5_1983_001.tif',
# 'av_v5_1983_002.tif',
# 'av_v5_1983_003.tif',
# 'av_v5_1984_001.tif',
# 'av_v5_1984_002.tif',
# 'av_v5_2021_001.tif',
# 'av_v5_2021_002.tif']

If you have several versions (v1, v2, ... v5, ...), you also need to take the version number into account, as below:
tif_files = ['av_v1_1983_001.tif', 'av_v5_1983_002.tif', 'av_v6_1983_002.tif','av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v4_2021_001.tif','av_v5_2021_001.tif', 'av_v5_2021_002.tif', 'av_v4_1984_002.tif']
sorted(tif_files, key=lambda x: [x.split('_')[2:], x.split('_')[1]])
Output:
['av_v1_1983_001.tif',
'av_v5_1983_002.tif',
'av_v6_1983_002.tif',
'av_v5_1984_001.tif',
'av_v4_1984_002.tif',
'av_v5_1984_002.tif',
'av_v4_2021_001.tif',
'av_v5_2021_001.tif',
'av_v5_2021_002.tif']
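One caveat with the string-based version key above: 'v10' sorts before 'v5' when compared as text. A minimal sketch, assuming every name follows the pattern prefix_vN_YYYY_NNN.tif, that compares year, index and version as integers instead (sort_key is a hypothetical helper name, and the v10 file is made up for illustration):

```python
def sort_key(filename):
    # Assumes the pattern <prefix>_v<version>_<year>_<index>.tif
    _, version, year, index = filename.split("_")
    index = index.split(".")[0]  # drop the ".tif" extension
    return int(year), int(index), int(version.lstrip("v"))

tif_files = [
    "av_v10_1983_002.tif",  # hypothetical two-digit version
    "av_v5_1983_002.tif",
    "av_v1_1983_001.tif",
    "av_v4_1984_002.tif",
]
ordered = sorted(tif_files, key=sort_key)
print(ordered)
```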

What you did was sort by the 00x index only, since x.split('_')[-1][:-4] produces 001 and so on. Python's sort is stable, so you can sort by the index first and then sort the result by the year:
tif_files = sorted(tif_files, key=lambda x: x.split('_')[-1][:-4])
tif_files = sorted(tif_files, key=lambda x: x.split('_')[2])
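A two-pass sort only gives year-then-index order when the secondary key (the index) is sorted first and the primary key (the year) last, with each pass fed the result of the previous one. A small sketch of that idea, using a shortened sample list:

```python
tif_files = [
    "av_v5_1984_002.tif",
    "av_v5_1983_002.tif",
    "av_v5_1984_001.tif",
    "av_v5_1983_001.tif",
]

# Pass 1: secondary key (the 00x index, with ".tif" cut off)
by_index = sorted(tif_files, key=lambda x: x.split("_")[-1][:-4])
# Pass 2: primary key (the year); ties keep their pass-1 order because sorted() is stable
by_year_then_index = sorted(by_index, key=lambda x: x.split("_")[2])
print(by_year_then_index)
```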

As long as your naming convention remains consistent, you should be able to just sort them alphanumerically. As such, the below code should work;
sorted(tif_files)
If you instead wanted to sort by the last two numbers in the file name while ignoring the prefix, you would need something a bit more dramatic that would break those numbers out and let you order by them. You could use something like the below:
import pandas as pd
tif_files_list = [[xx, int(xx.split("_")[2]), int(xx.split("_")[3].split(".")[0])] for xx in tif_files]
tif_files_frame = pd.DataFrame(tif_files_list, columns=["Name", "Primary Index", "Secondary Index"])
tif_files_frame_ordered = tif_files_frame.sort_values(["Primary Index", "Secondary Index"], axis=0)
tif_files_ordered = tif_files_frame_ordered["Name"].tolist()
This breaks the numbers in the names out into separate columns of a Pandas Dataframe, then sorts your entries by those broken out columns, at which point you can extract the ordered name column on its own.
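If pulling in pandas feels heavy for this, the same "order by the last two numbers" idea can be sketched with the standard library alone. This assumes the last two runs of digits in each name are the year and the index; numeric_key is a hypothetical helper name:

```python
import re

def numeric_key(filename):
    # Take the last two runs of digits (year, index) and compare them as integers
    return [int(n) for n in re.findall(r"\d+", filename)[-2:]]

tif_files = ["av_v5_1984_001.tif", "av_v5_1983_010.tif", "av_v5_1983_002.tif"]
tif_files_ordered = sorted(tif_files, key=numeric_key)
print(tif_files_ordered)
```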

If key returns a tuple of two values, the sort compares the first value and falls back to the second value on ties.
Please refer to: https://stackoverflow.com/a/5292332/9532450
tif_files = [
"hea_der_1983_002.tif",
"hea_der_1983_001.tif",
"hea_der_1984_002.tif",
"hea_der_1984_001.tif",
]
def parse(filename: str) -> tuple[str, str]:
    split = filename.split("_")
    return split[2], split[3]
sort = sorted(tif_files, key=parse)
print(sort)
Output:
['hea_der_1983_001.tif', 'hea_der_1983_002.tif', 'hea_der_1984_001.tif', 'hea_der_1984_002.tif']

right click your folder and click sort by >> name.

Related

I have a comma-separated list of email IDs in a column in Python. I want to extract a unique list of domain names (sorted) into a new column

I have a column in a python data frame with comma separated list of email ids. I want to extract unique list of domain names, sorted in alphabetical order.
Email Ids | Required Output
jgj#myu.com | myu.com
abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com | gmail.com, svc.com, yyy.com
zya#try.com,abs#cba.com | cba.com, try.com
I tried the following code; however, it's returning the output of the first row for all rows:
def Dom1(lpo):
    mylist1 = []
    for i in lpo:
        domain = str(i).split("#")[1]
        domain1=domain.replace('>','')
        domain1=domain1.replace(']'," ")
        if domain1 not in mylist1:
            mylist1.append(domain1)
            mylist1=sorted(mylist1, key=str.lower)
    return mylist1
df['Email_Id1']=df.apply(lambda row: Dom1(df['Email_Id']),axis=1)
How to fix this issue?
I assume that the column Email_Id is a list of email ids.
Here is how your dataframe should look. All the values should be a list even if it has only 1 item. I have a feeling that a single email is not being stored as a list of strings and this is probably your source of error.
df = pd.DataFrame({ 'Email_Id': [['jgj#myu.com'], ['abc#gmail.com', 'lll#yyy.com', 'xyz#svc.com', 'abc#yyy.com'], ['zya#try.com','abs#cba.com']] })
df
Initial Dataframe
And then with a few minor changes and cleanup here is how you can apply the lambda function.
Apply it to only a series instead of the whole dataframe.
Also, I am not sure why you are calling domain1=domain.replace('>','') and domain1=domain1.replace(']'," "); domain names should not have such characters.
You don't need to sort after every insertion. Just sort it while returning the list as it will be called only once.
Change your variable names so that they make sense.
You could use a python set, but if you do not have a lot of emails in a single row, a list should do just fine
def get_domain(emails):
    domains = []
    for email in emails:
        d = str(email).split("#")[1]
        if d not in domains:
            domains.append(d)
    return sorted(domains, key=str.lower)
df['Email_Id1'] = df['Email_Id'].apply(lambda x: get_domain(x))
df
Final Dataframe
I would simply do a one-liner here, using a set so that each domain appears only once:
df["domains"] = df["emails"].apply(lambda row: sorted({email[email.find("#") + 1:] for email in row}))
import re
import pandas as pd
col1 = ['jgj#myu.com', 'abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com', 'zya#try.com,abs#cba.com']
df1 = pd.DataFrame({'Email Ids':col1})
def getUniqueEmail(st1):
    result_obj = {}
    for i in st1.split(','):
        # Strip everything up to and including '#'; dict keys keep the domains unique
        result_obj[re.sub('^.+#','', i)] = 1
    return ', '.join(sorted(result_obj.keys(), key=str.lower))
df1['Required output'] = df1['Email Ids'].apply(getUniqueEmail)
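For comparison, here is a standard-library sketch of the same per-cell logic using a set. It assumes each cell is a single comma-separated string that uses '#' as the separator, as in the question; unique_domains is a hypothetical helper name, and lowercasing the domains is my own choice rather than something the question asks for:

```python
def unique_domains(cell):
    # Assumes one comma-separated string per cell, with '#' between user and domain
    domains = {
        email.split("#", 1)[1].strip().lower()
        for email in cell.split(",")
        if "#" in email
    }
    return ", ".join(sorted(domains))

rows = [
    "jgj#myu.com",
    "abc#gmail.com, lll#yyy.com,xyz#svc.com,abc#yyy.com",
    "zya#try.com,abs#cba.com",
]
results = [unique_domains(row) for row in rows]
print(results)
```

A function like this works on one cell at a time, so with pandas it would be passed to Series.apply on the email column rather than to DataFrame.apply with the whole column as its argument.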

Extracting values to new columns with pandas

I have a dataframe where the coordinates column comes in this format
[-7.821, 37.033]
I would like to create two columns where the first is lon and the second is lat
I've tried
my_dict = df_map['coordinates'].to_dict()
df_map_new = pd.DataFrame(list(my_dict.items()),columns = ['lon','lat'])
But the dictionary that is created does not split the values between ,
Instead it creates a dict with the following format
0: '[-7.821, 37.033]'
What is the best way to extract the values within [,] and put them into two new columns in the original dataframe df_map?
Thank you in advance!
You can parse the string:
pattern = r"\[(?P<lon>.*),\s*(?P<lat>.*)\]"
out = df_map['coordinates'].str.extract(pattern).astype(float)
print(out)
# Output
     lon     lat
0 -7.821  37.033
Convert the values to lists with ast.literal_eval, then build the new DataFrame from a list of lists instead of a dict:
import ast
my_L = df_map['coordinates'].apply(ast.literal_eval).tolist()
df_map_new = pd.DataFrame(my_L,columns = ['lon','lat'])
In addition to the answers already provided, if the column already holds real lists (not strings), you can also try this:
ser_lon = df_map['coordinates'].apply(lambda x: x[0])
ser_lat = df_map['coordinates'].apply(lambda x: x[1])
df_map['lon'] = ser_lon
df_map['lat'] = ser_lat
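Whichever approach is used, the core step is turning one string such as '[-7.821, 37.033]' into two floats. A standard-library sketch of just that step, assuming every value has exactly this two-element shape (parse_coordinates is a hypothetical helper and the second sample value is made up):

```python
import ast

def parse_coordinates(text):
    # Assumes text looks like "[-7.821, 37.033]"; literal_eval only evaluates Python literals
    lon, lat = ast.literal_eval(text)
    return float(lon), float(lat)

raw = ["[-7.821, 37.033]", "[2.5, 48.0]"]  # second entry is illustrative only
lons, lats = zip(*(parse_coordinates(r) for r in raw))
print(lons)
print(lats)
```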

Looking for a better way to accomplish dataframe to dictionary by series

Here's a portion of what the Excel file looks like. Meant to include this the first time. Thanks for the help so far.
Name | Phone Number | Carrier
FirstName LastName1 | 3410142531 | Alltel
FirstName LastName2 | 2437201754 | AT&T
FirstName LastName3 | 9247224091 | Boost Mobile
FirstName LastName4 | 6548310018 | Cricket Wireless
FirstName LastName5 | 8811620411 | Project Fi
I am converting a list of names, phone numbers, and carriers to a dictionary for easy reference by other code. The idea is separate code will be able to call a name and access that person's phone number and carrier.
I got the output I need, but I'm wondering if there were an easier way I could have accomplished this task and get the same output. Though it's fairly concise, I'm interested in any module or built-in of which I'm not aware. My Python skills are beginner at best. I wrote this in Thonny with Python 3.6.4. Thanks!
#Imports
import pandas as pd
import math
# Assign spreadsheet filename to `file`
file = 'Phone_Numbers.xlsx'
# Load spreadsheets
xl = pd.ExcelFile(file)
# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1', header=0)
# Put the dataframe into a dictionary to start
phone_numbers = df1.to_dict(orient='records')
# Converts PhoneNumbers.xlsx to a dictionary
x=0
temp_dict = {}
for item in phone_numbers:
    temp_list = []
    for key in phone_numbers[x]:
        tempholder = phone_numbers[x][key]
        if (isinstance(tempholder, float) or isinstance(tempholder, int)) and math.isnan(tempholder) == False: # Checks to see if there is a blank and if the phone number comes up as a float
            # Converts any floats to string for use in later code
            tempholder = str(int(tempholder))
        else:
            pass
        temp_list.append(tempholder)
    temp_dict[temp_list[0]] = temp_list[1:] # Makes the first item in the list the key and add the rest as values
    x += 1
print(temp_dict)
Here's the desired output:
{'FirstName LastName1': ['3410142531', 'Alltel'], 'FirstName LastName2': [2437201754, 'AT&T'], 'FirstName LastName3': [9247224091, 'Boost Mobile'], 'FirstName LastName4': [6548310018, 'Cricket Wireless'], 'FirstName LastName5': [8811620411, 'Project Fi']}
One way to do it would be to iterate through the dataframe and use a dictionary comprehension:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']] for _, row in df.iterrows()}
where df is your original dataframe (the result of xl.parse('Sheet1', header=0)). This iterates through all rows in your dataframe, creating a dictionary key for each Name, with the phone number and carrier as its values (in a list), as you indicated in your output.
To make sure that your phone number is not null (as you did in your loop), you could add an if clause to your dict comprehension, such as this:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']]
             for _, row in df.iterrows()
             if not math.isnan(row['Phone Number'])}
df.set_index('Name').T.to_dict('list')
should do the job. Here df is your dataframe.
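None of the shorter versions above converts float phone numbers to strings the way the original loop does. A sketch of that conversion as a small helper, applied to plain dictionaries shaped like the output of to_dict(orient='records'); clean_phone is a hypothetical name, and the blank (NaN) phone number in the second record is invented to show the edge case:

```python
import math

def clean_phone(value):
    # Numbers read from Excel often arrive as floats; blank cells arrive as NaN
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, (int, float)):
        return str(int(value))
    return value

records = [
    {"Name": "FirstName LastName1", "Phone Number": 3410142531.0, "Carrier": "Alltel"},
    {"Name": "FirstName LastName2", "Phone Number": float("nan"), "Carrier": "AT&T"},  # hypothetical blank
]
phone_book = {r["Name"]: [clean_phone(r["Phone Number"]), r["Carrier"]] for r in records}
print(phone_book)
```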

Sort a list based on values from dataframe

Is my approach here the right way to do it in Python? As I'm new to Python, I appreciate any feedback you can provide, especially if I'm way off here.
My task is to order a list of file names based on values from a dataset. Specifically, these are file names that I need to sort based on site information. The resulting list is the order in which the reports will be printed.
Site Information
key_info = pd.DataFrame({
'key_id': ['1010','3030','2020','5050','4040','4040']
, 'key_name': ['Name_A','Name_B','Name_C','Name_D','Name_E','Name_E']
, 'key_value': [1,2,3,4,5,6]
})
key_info = key_info[['key_id','key_name']].drop_duplicates()
key_info['key_id'] = key_info.key_id.astype('str').astype('int64')
Filenames
These are the file names I need to sort. In this example, I sort by just the key_id, but I assume I could easily add a column to site information and sort it by that as well.
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
Sorting
The resulting "filenames" is the final sorted list.
names_df = pd.DataFrame({'filename': filenames})
names_df['key_id'] = names_df.filename.str[:4].astype('str').astype('int64')
merged_df = pd.merge(key_info, names_df, on='key_id', how='right')
merged_df = merged_df.sort_values('key_id')
filenames = merged_df['filename'].tolist()
I'm looking for any solutions that might be better or more Pythonic. Or, if there is a more appropriate place to post "code review" questions.
I like your use of Pandas, but it isn't the most Pythonic as it uses data structures that are a superset of Python. Nevertheless, I think we can improve on what you have. I will show an improved version and I will show a completely native Python way to do it. Either is fine I suppose?
The strictly Python version is best for people who do not know Pandas, as there's a large learning curve associated with it.
Common
For both examples, let's assume a function like this:
def trim_filenames(filename):
    return filename[0:4]
I use this in both examples.
Improvements
# Load the DataFrame and give it a proper index (I added some data)
key_info = pd.DataFrame(index=['2020','5050','4040','4040','6000','7000','1010','3030'], data={'key_name':['Name_C','Name_D','Name_E','Name_E','Name_F','Name_G','Name_A','Name_B'], 'key_value' :[1,2,3,4,5,6,7,8]})
# Eliminate duplicates and sort in one step
key_info = key_info.groupby(key_info.index).first()
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
names_df = pd.DataFrame({'filename': filenames})
# Let's give this an index too so we can match on the index (not the function call)
names_df.index=names_df.filename.transform(trim_filenames)
combined = pd.concat([key_info,names_df], axis=1)
combined matches by index, but there are some keys with no filenames. It looks like this now:
     key_name  key_value       filename
1010   Name_A          7  1010_Filename
2020   Name_C          1  2020_Filename
3030   Name_B          8  3030_Filename
4040   Name_E          3  4040_Filename
5050   Name_D          2  5050_Filename
6000   Name_F          5            NaN
7000   Name_G          6            NaN
Now we drop the NaN values and create the list of filenames:
combined.filename.dropna().values.tolist()
['1010_Filename', '2020_Filename', '3030_Filename', '4040_Filename', '5050_Filename']
Python Only Version (no framework)
key_info = {'2020' : {'key_name':'Name_C', 'key_value':1},'5050' : {'key_name':'Name_D', 'key_value':2},'4040' : {'key_name':'Name_E', 'key_value':3},'4040' : {'key_name':'Name_E', 'key_value':4},'6000' : {'key_name':'Name_F', 'key_value':5},'7000' : {'key_name':'Name_G', 'key_value':6},'1010' : {'key_name':'Name_A', 'key_value':7},'3030' : {'key_name':'Name_B', 'key_value':8}}
filenames = ['1010_Filename','2020_Filename','3030_Filename','5050_Filename','4040_Filename']
# Let's get a dictionary of filenames that is keyed by the same key as in key_info:
hashed_filenames = {}
for filename in filenames:
    # Note here I'm using the function again
    hashed_filenames[trim_filenames(filename)] = filename
# We'll store the new filenames in new_filenames:
new_filenames = []
# sort the key info and loop it
for key in sorted(key_info.keys()):
    # for each key, if the key matches in the hashed_filenames, then add it to the list
    if key in hashed_filenames:
        new_filenames.append(hashed_filenames[key])
Summary
Both solutions are concise, and I like Pandas, but I like better something that is immediately readable by anyone that knows Python. The Python only solution (of course, they are both Python) is the one you should go with, in my opinion.
out_list = []
for x in key_info.key_id:
    for f in filenames:
        if str(x) in f:
            out_list.append(f)
out_list
['1010_Filename', '3030_Filename', '2020_Filename', '5050_Filename', '4040_Filename']
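The nested loop above compares every key with every filename and uses a substring test, so a key could also match digits elsewhere in a name. A sketch of an alternative that maps each key to its position once and sorts on that, assuming the key is always the first four characters of the filename (position and key_order are hypothetical names; the order is copied from key_info in the question):

```python
key_order = ["1010", "3030", "2020", "5050", "4040"]  # order of key_id in key_info
position = {key: i for i, key in enumerate(key_order)}

filenames = ["1010_Filename", "2020_Filename", "3030_Filename", "5050_Filename", "4040_Filename"]
# Files whose prefix is not in key_order are placed last
ordered = sorted(filenames, key=lambda f: position.get(f[:4], len(position)))
print(ordered)
```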

Python Group Array by Column and Display Unique Values

I have an Array of Arrays with following format:
x = [["Username1","id3"],
["Username1", "id4"],
["Username1", "id4"],
["Username3", "id3"]]
I want to group by the ids and display all the unique usernames
How would I get an output that is like:
id3: Username1, Username3
id4: Username1
Edit: Was able to group by second column but I cannot only display unique values. Here is my code:
data={}
for key, group in groupby(sorted(x), key=lambda x: x[1]):
    data[key]=[v[0] for v in group]
print(data)
Use a dict keyed by id and Python sets to store the values (so that only unique names are stored for each key):
items = [
["Username1","id3"],
["Username1", "id4"],
["Username1", "id4"],
["Username3", "id3"]
]
data = {}
for item in items:
    if item[1] in data:  # dict.has_key() was removed in Python 3
        data[item[1]].add(item[0])
    else:
        data[item[1]] = set([item[0]])
print(data)
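The same dict-of-sets idea can be written a little more compactly with collections.defaultdict, which creates the empty set on first access. A sketch that also prints the result in the format requested in the question (the lines list is only there to make the output easy to inspect):

```python
from collections import defaultdict

items = [
    ["Username1", "id3"],
    ["Username1", "id4"],
    ["Username1", "id4"],
    ["Username3", "id3"],
]

groups = defaultdict(set)
for username, item_id in items:
    groups[item_id].add(username)  # sets silently ignore duplicates

lines = [f"{item_id}: {', '.join(sorted(groups[item_id]))}" for item_id in sorted(groups)]
print("\n".join(lines))
```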
You may use a for loop, but a LINQ-style helper might be cleaner for future use.
https://stackoverflow.com/a/3926105/4564614
has some great ways to incorporate LINQ-style operations to solve this issue. I think what you are looking for is a group-by.
Example:
from collections import defaultdict
from operator import attrgetter
def group_by(iterable, group_func):
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups
group_by((x.foo for x in ...), attrgetter('bar'))
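The group_by helper above appends to lists, so duplicates are kept. A sketch of how it might be applied to the question's data, grouping on the second element with operator.itemgetter (rather than attrgetter, since the rows are plain lists) and removing duplicates afterwards; unique_names is a hypothetical variable name:

```python
from collections import defaultdict
from operator import itemgetter

def group_by(iterable, group_func):
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups

x = [
    ["Username1", "id3"],
    ["Username1", "id4"],
    ["Username1", "id4"],
    ["Username3", "id3"],
]

grouped = group_by(x, itemgetter(1))  # group rows by their id
# Keep only the usernames, dropping duplicates within each group
unique_names = {key: sorted({row[0] for row in rows}) for key, rows in grouped.items()}
print(unique_names)
```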
