Indexing a Dataframe in steps of 0.5 - python

I have several dataframes that I want to add. Their indices range from 0 to 25 in steps of 0.5. When I try to add them, the indices are interpreted differently and the resulting dataframe has its index ordered as 0.5, 1, 1.5, 10, 10.5, ..., 19.5, 2, and so on. So 10 is listed before 2, I guess because it starts with a 1 and the dataframe sorts the indices by the first character.
I tried different ways of adding the frames:
pd.concat([df1, df2, df3...], axis=0)
df1 + df2 + df3
df1.add(df2, fill_value=0).add(df3.....)
all of them work. The only problem is the new indexing which messes up my frames.
I could of course reset the indices before adding the frames and then change the index back. But is there a more direct way?
answer to comment:
Index(['0.5', '1.0', '1.5', '2.0', '2.5', '3.0', '3.5', '4.0', '4.5', '5.0',
'5.5', '6.0', '6.5', '7.0', '7.5', '8.0', '8.5', '9.0', '9.5', '10.0',
'10.5', '11.0', '11.5', '12.0', '12.5', '13.0', '13.5', '14.0', '14.5',
'15.0', '15.5', '16.0', '16.5', '17.0', '17.5', '18.0', '18.5', '19.0',
'19.5', '20.0', '20.5', '21.0', '21.5', '22.0', '22.5', '23.0', '23.5',
'24.0', '24.5', '25.0', '25.5', '26.0', '26.5', '27.0', '27.5', '28.0',
'28.5'],
dtype='object')
Index(['0.5', '1.0', '1.5', '2.0', '2.5', '3.0', '3.5', '4.0', '4.5', '5.0',
'5.5', '6.0', '6.5', '7.0', '7.5', '8.0', '8.5', '9.0', '9.5', '10.0',
'10.5', '11.0', '11.5', '12.0', '12.5', '13.0', '13.5'],
dtype='object')
Index(['0.5', '1.0', '1.5', '2.0', '2.5', '3.0', '3.5', '4.0', '4.5', '5.0',
'5.5', '6.0', '6.5', '7.0', '7.5', '8.0', '8.5', '9.0', '9.5', '10.0',
'10.5', '11.0', '11.5', '12.0', '12.5', '13.0', '13.5', '14.0', '14.5',
'15.0', '15.5', '16.0', '16.5', '17.0', '17.5', '18.0'],
dtype='object')

The simplest solution is to convert the index of every DataFrame to float:
df1.index = df1.index.astype(float)
df2.index = df2.index.astype(float)
df3.index = df3.index.astype(float)
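The index dumps in the question show dtype='object', which means the labels are strings, and strings are compared character by character. A small pure-Python sketch of that effect (the labels list is made up for illustration and is not taken from the question's frames):

```python
# Hypothetical labels, stored as strings like the Index objects in the question.
labels = ['0.5', '1.0', '2.0', '10.0', '19.5']

# Strings compare character by character, so '10.0' sorts before '2.0'.
as_strings = sorted(labels)

# Converting to float first gives the numeric order.
as_floats = sorted(float(x) for x in labels)

print(as_strings)
print(as_floats)
```

Once the pandas index holds floats instead of strings, aligned operations such as df1 + df2 should order the combined index numerically in the same way.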

Related

convert list of lists to dictionary

How can I create a list of dictionaries from these lists?
temp = [['header1', '4', '8', '16', '32', '64', '128', '256', '512', '243,6'], ['media_range', '1,200', '2,400', '4,800', '4,800', '6,200', '38,400', '76,800', '153,600', '160,000'], ['speed', '300', '600', '1,200', '2,000', '2,000', '2,000', '2,000', '2,000', '2,000']]
The keys of the dictionaries are the first elements of the lists.
the expected Output is:
output= [{'header1': '4', 'media_range': '1,200', 'speed': '300'}, {'header1': '8', 'media_range': '2,400', 'speed': '600'}, ...]
Ideally the code should handle any amount of lists (in this case 3)
IIUC
>>> temp = [['header1', '4', '8', '16', '32', '64', '128', '256', '512', '243,6'], ['media_range', '1,200', '2,400', '4,800', '4,800', '6,200', '38,400', '76,800', '153,600', '160,000'], ['speed', '300', '600', '1,200', '2,000', '2,000', '2,000', '2,000', '2,000', '2,000']]
>>>
>>> keys = [l[0] for l in temp]
>>> values = [l[1:] for l in temp]
>>> dicts = [dict(zip(keys, sub)) for sub in zip(*values)]
>>>
>>> dicts
[{'header1': '4', 'media_range': '1,200', 'speed': '300'},
{'header1': '8', 'media_range': '2,400', 'speed': '600'},
{'header1': '16', 'media_range': '4,800', 'speed': '1,200'},
{'header1': '32', 'media_range': '4,800', 'speed': '2,000'},
{'header1': '64', 'media_range': '6,200', 'speed': '2,000'},
{'header1': '128', 'media_range': '38,400', 'speed': '2,000'},
{'header1': '256', 'media_range': '76,800', 'speed': '2,000'},
{'header1': '512', 'media_range': '153,600', 'speed': '2,000'},
{'header1': '243,6', 'media_range': '160,000', 'speed': '2,000'}]
Slightly shorter solution with zip and unpacking:
temp = [['header1', '4', '8', '16', '32', '64', '128', '256', '512', '243,6'], ['media_range', '1,200', '2,400', '4,800', '4,800', '6,200', '38,400', '76,800', '153,600', '160,000'], ['speed', '300', '600', '1,200', '2,000', '2,000', '2,000', '2,000', '2,000', '2,000']]
header, *data = zip(*temp)
result = [dict(zip(header, i)) for i in data]
Output:
[{'header1': '4', 'media_range': '1,200', 'speed': '300'}, {'header1': '8', 'media_range': '2,400', 'speed': '600'}, {'header1': '16', 'media_range': '4,800', 'speed': '1,200'}, {'header1': '32', 'media_range': '4,800', 'speed': '2,000'}, {'header1': '64', 'media_range': '6,200', 'speed': '2,000'}, {'header1': '128', 'media_range': '38,400', 'speed': '2,000'}, {'header1': '256', 'media_range': '76,800', 'speed': '2,000'}, {'header1': '512', 'media_range': '153,600', 'speed': '2,000'}, {'header1': '243,6', 'media_range': '160,000', 'speed': '2,000'}]
You could use zip(). This requires you to know how many lists there are, but it produces the expected output.
output = []
for header1, media_range, speed in zip(temp[0], temp[1], temp[2]):
    # Skip the first tuple, which holds the header names themselves.
    if header1 != "header1":
        output.append({temp[0][0]: header1, temp[1][0]: media_range, temp[2][0]: speed})
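For comparison, the zip-based idea can be wrapped in a small helper that does not depend on the number of lists. This is only a sketch; the function name lists_to_dicts and the shortened sample are made up here:

```python
def lists_to_dicts(rows):
    """Turn [[key, v1, v2, ...], ...] into one dict per value position."""
    # The first element of every inner list is used as a key.
    keys = [row[0] for row in rows]
    # Transpose the remaining values so each tuple holds one record.
    columns = zip(*(row[1:] for row in rows))
    return [dict(zip(keys, values)) for values in columns]


# Shortened version of the question's data.
sample = [['header1', '4', '8'],
          ['media_range', '1,200', '2,400'],
          ['speed', '300', '600']]
result = lists_to_dicts(sample)
print(result)
```

Note that zip stops at the shortest inner list, so lists of unequal length would be silently truncated.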

Print nested dictionary in python and export all on a csv file

I have a dictionary like this:
{'https://github.com/project1': {'Batchfile': '91', 'Gradle': '110', 'INI': '25', 'Java': '1879', 'Markdown': '393', 'QMake': '52', 'Shell': '161', 'Text': '202', 'XML': '943'}}
{'https://github.com/project2': {'Batchfile': '91', 'Gradle': '123', 'INI': '25', 'Java': '1305', 'Markdown': '121', 'QMake': '52', 'Shell': '161', 'XML': '234'}}
{'https://github.com/project3': {'Batchfile': '91', 'Gradle': '360', 'INI': '27', 'Java': '805', 'Markdown': '27', 'QMake': '156', 'Shell': '161', 'XML': '380'}}
It is structured this way:
{'url': {'lang1': 'locs', 'lang2': 'locs', ...}}
{'url2': {'lang6': 'locs', 'lang5': 'locs', ...}}
where lang stands for language and locs stands for lines of code (for that language).
What I want to do is print this dictionary in a pretty way, so I can see the results before the export.
After that I want to export the dictionary to a CSV file for further processing. The problem is that the languages are not sorted. This is what I mean:
{'https://github.com/Project4': {'HTML': '29', 'Java': '229', 'Markdown': '101', 'Maven POM': '88', 'XML': '62'}}
{'https://github.com/Project5': {'Batchfile': '85', 'Gradle': '84', 'INI': '22', 'Java': '2422', 'Markdown': '25', 'Prolog': '25', 'Shell': '173', 'XML': '3243', 'YAML': '43'}}
Any idea?
You could use pandas:
import pandas as pd
t = [{'https://github.com/project1': {'Batchfile': '91', 'Gradle': '110', 'INI': '25', 'Java': '1879', 'Markdown': '393', 'QMake': '52', 'Shell': '161', 'Text': '202', 'XML': '943'}},
{'https://github.com/project2': {'Batchfile': '91', 'Gradle': '123', 'INI': '25', 'Java': '1305', 'Markdown': '121', 'QMake': '52', 'Shell': '161', 'XML': '234'}},
{'https://github.com/project3': {'Batchfile': '91', 'Gradle': '360', 'INI': '27', 'Java': '805', 'Markdown': '27', 'QMake': '156', 'Shell': '161', 'XML': '380'}}]
# Collect every language once; sorted() gives a stable, alphabetical column order
# (recent pandas versions also reject a bare set for columns).
columns = sorted({lang for x in t for langs in x.values() for lang in langs})
index = [p for x in t for p in x.keys()]
rows = [langs for x in t for langs in x.values()]
df = pd.DataFrame(rows, columns=columns, index=index).fillna('N/A')
df.to_csv('projects.csv')
Which gives:
>>> df
                            Batchfile Gradle INI  ... Shell Text  XML
https://github.com/project1        91    110  25  ...   161  202  943
https://github.com/project2        91    123  25  ...   161  N/A  234
https://github.com/project3        91    360  27  ...   161  N/A  380
[3 rows x 9 columns]
The resulting projects.csv holds the same table, with the project URLs in the first column.
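If pandas is not available, roughly the same result can be had with the standard library. The sketch below assumes the data is a list of single-entry dicts as in the question; the sample is shortened, the 'url' column name is an arbitrary choice, and the output goes to an in-memory buffer instead of a real file:

```python
import csv
import io
from pprint import pprint

# Shortened sample in the same shape as the question's data.
projects = [
    {'https://github.com/project1': {'Java': '1879', 'XML': '943', 'Text': '202'}},
    {'https://github.com/project2': {'Java': '1305', 'XML': '234'}},
]

# Pretty-print the nested structure before exporting it.
pprint(projects)

# Collect every language once and sort the names for a stable column order.
languages = sorted({lang for p in projects for locs in p.values() for lang in locs})

# Write to an in-memory buffer; use open('projects.csv', 'w', newline='')
# instead to produce an actual file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['url'] + languages, restval='N/A')
writer.writeheader()
for p in projects:
    for url, locs in p.items():
        # Languages missing for this project are filled with restval.
        writer.writerow({'url': url, **locs})

csv_text = buffer.getvalue()
print(csv_text)
```

The restval argument plays the role of fillna('N/A') in the pandas version: any language a project does not use is written as N/A.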

Sorting elements in a list into 3 individual lists python

Let me start off with the code and explain the goal and what I'm getting.
temp1 = ['3.8', 'Weiss, Earl', '139 RATINGS', '2.3', 'Jeppson, Catherine', '114 RATINGS', '3.3', 'Kiani-Aslani, Rajabali', '88 RATINGS', '2.6', 'Lundblad, Heidemarie', '82 RATINGS', '2.4', 'Stone, Ronald', '75 RATINGS', '3.7', 'Vedd, Rishma', '66 RATINGS', '3.3', 'Foster, Robert', '60 RATINGS', '4.9', 'Basmadzhyan, Babken', '59 RATINGS', '4.3', 'Grodsky, Marilyn', '57 RATINGS', '2.4', 'Dorsey, Norris', '53 RATINGS', '2.6', 'Zvinakis, Kristina', '51 RATINGS', '3.2', 'MacKlin, James', '50 RATINGS', '2.8', 'Liu, David', '48 RATINGS', '3.2', 'Doron, Michael', '48 RATINGS', '2.1', 'Rogoff, Donald', '45 RATINGS', '3.1', 'Sangeladji, Mohammad', '43 RATINGS', '4.0', 'Fountaine, Howard', '42 RATINGS', '4.6', 'Stout, Gary', '41 RATINGS', '3.4', 'Gray, Glen', '34 RATINGS', '3.0', 'Wilson, Barbara', '31 RATINGS', '4.0', 'Yoon, Sung-Wook', '31 RATINGS', '4.5', 'Her, Young-Won', '31 RATINGS', '3.0', 'Kiddoo, Robert', '30 RATINGS', '3.0', 'Chiu, J', '27 RATINGS', '3.3', 'Barker, Robert', '25 RATINGS', '3.7', 'Qureshi, Mahmood', '23 RATINGS', '3.7', 'Primes, David', '22 RATINGS', '2.6', 'Chen, Raymond', '20 RATINGS', '3.3', 'Jones, Christopher', '20 RATINGS', '3.2', 'Zhan, Jun', '20 RATINGS', '4.6', 'Bell, Janice', '15 RATINGS', '3.8', 'Alhashim, Dhia D', '12 RATINGS', '2.9', 'Ansari, Shahid', '11 RATINGS', '4.5', 'Rousselet, Robin (rob)', '9 RATINGS', '2.4', 'Lucero, Terrence', '8 RATINGS', '1.0', 'Perez, Marlene', '7 RATINGS', '1.3', 'Crespo, Patricia', '7 RATINGS', '4.8', 'Knight, Ridgeway', '7 RATINGS', '2.5', 'Julius, Ed', '6 RATINGS', '2.9', 'Reinstein, Todd', '6 RATINGS']
So my goal is to sort this giant list into 3 different categorical lists:
professor names, professor ratings, and number of ratings.
I have developed the following for loop with the following if statements, and as much as I try to play with it, at least one of them doesn't work out. Let me show you in the following code:
counter = 1
for index in temp1:
    if counter % 1 == 0:
        pro_rating.append(index)
    if counter % 2 == 0:
        pro_name.append(index)
    if counter % 3 == 0:
        pro_amount_rating.append(index)
        counter = 0
    counter += 1
print("All Professor ratings: ", pro_rating)
print("All professor names: ", pro_name)
print("Amount of times professor rated: ", pro_amount_rating)
Now everything works out pretty well when appending the names (pro_name) and the number of ratings (pro_amount_rating), but pro_rating always ends up with the full list.
I completely understand why it's happening: I'm resetting my counter once it hits 3, and the counter adds 1 at the very end, making the first if statement always true.
I was thinking of adding a flag or a second parameter to solve this, but I just can't seem to figure it out. I know I could easily write another for loop, but I want to get them all done within this single for loop.
If anyone has any ideas I would appreciate it!
OUTPUT:
All Professor ratings: ['3.8', 'Weiss, Earl', '139 RATINGS', '2.3', 'Jeppson, Catherine', '114 RATINGS', '3.3', 'Kiani-Aslani, Rajabali', '88 RATINGS', '2.6', 'Lundblad, Heidemarie', '82 RATINGS', '2.4', 'Stone, Ronald', '75 RATINGS', '3.7', 'Vedd, Rishma', '66 RATINGS', '3.3', 'Foster, Robert', '60 RATINGS', '4.9', 'Basmadzhyan, Babken', '59 RATINGS', '4.3', 'Grodsky, Marilyn', '57 RATINGS', '2.4', 'Dorsey, Norris', '53 RATINGS', '2.6', 'Zvinakis, Kristina', '51 RATINGS', '3.2', 'MacKlin, James', '50 RATINGS', '2.8', 'Liu, David', '48 RATINGS', '3.2', 'Doron, Michael', '48 RATINGS', '2.1', 'Rogoff, Donald', '45 RATINGS', '3.1', 'Sangeladji, Mohammad', '43 RATINGS', '4.0', 'Fountaine, Howard', '42 RATINGS', '4.6', 'Stout, Gary', '41 RATINGS', '3.4', 'Gray, Glen', '34 RATINGS', '3.0', 'Wilson, Barbara', '31 RATINGS', '4.0', 'Yoon, Sung-Wook', '31 RATINGS', '4.5', 'Her, Young-Won', '31 RATINGS', '3.0', 'Kiddoo, Robert', '30 RATINGS', '3.0', 'Chiu, J', '27 RATINGS', '3.3', 'Barker, Robert', '25 RATINGS', '3.7', 'Qureshi, Mahmood', '23 RATINGS', '3.7', 'Primes, David', '22 RATINGS', '2.6', 'Chen, Raymond', '20 RATINGS', '3.3', 'Jones, Christopher', '20 RATINGS', '3.2', 'Zhan, Jun', '20 RATINGS', '4.6', 'Bell, Janice', '15 RATINGS', '3.8', 'Alhashim, Dhia D', '12 RATINGS', '2.9', 'Ansari, Shahid', '11 RATINGS', '4.5', 'Rousselet, Robin (rob)', '9 RATINGS', '2.4', 'Lucero, Terrence', '8 RATINGS', '1.0', 'Perez, Marlene', '7 RATINGS', '1.3', 'Crespo, Patricia', '7 RATINGS', '4.8', 'Knight, Ridgeway', '7 RATINGS', '2.5', 'Julius, Ed', '6 RATINGS', '2.9', 'Reinstein, Todd', '6 RATINGS']
All professor names: ['Weiss, Earl', 'Jeppson, Catherine', 'Kiani-Aslani, Rajabali', 'Lundblad, Heidemarie', 'Stone, Ronald', 'Vedd, Rishma', 'Foster, Robert', 'Basmadzhyan, Babken', 'Grodsky, Marilyn', 'Dorsey, Norris', 'Zvinakis, Kristina', 'MacKlin, James', 'Liu, David', 'Doron, Michael', 'Rogoff, Donald', 'Sangeladji, Mohammad', 'Fountaine, Howard', 'Stout, Gary', 'Gray, Glen', 'Wilson, Barbara', 'Yoon, Sung-Wook', 'Her, Young-Won', 'Kiddoo, Robert', 'Chiu, J', 'Barker, Robert', 'Qureshi, Mahmood', 'Primes, David', 'Chen, Raymond', 'Jones, Christopher', 'Zhan, Jun', 'Bell, Janice', 'Alhashim, Dhia D', 'Ansari, Shahid', 'Rousselet, Robin (rob)', 'Lucero, Terrence', 'Perez, Marlene', 'Crespo, Patricia', 'Knight, Ridgeway', 'Julius, Ed', 'Reinstein, Todd']
Amount of times professor rated: ['139 RATINGS', '114 RATINGS', '88 RATINGS', '82 RATINGS', '75 RATINGS', '66 RATINGS', '60 RATINGS', '59 RATINGS', '57 RATINGS', '53 RATINGS', '51 RATINGS', '50 RATINGS', '48 RATINGS', '48 RATINGS', '45 RATINGS', '43 RATINGS', '42 RATINGS', '41 RATINGS', '34 RATINGS', '31 RATINGS', '31 RATINGS', '31 RATINGS', '30 RATINGS', '27 RATINGS', '25 RATINGS', '23 RATINGS', '22 RATINGS', '20 RATINGS', '20 RATINGS', '20 RATINGS', '15 RATINGS', '12 RATINGS', '11 RATINGS', '9 RATINGS', '8 RATINGS', '7 RATINGS', '7 RATINGS', '7 RATINGS', '6 RATINGS', '6 RATINGS']
SOLVED:
Thank you Michael for the solution, I was overthinking clearly, Using a simple
counter == 1, 2... solved the problem, instead of using a modulo thanks again
You could use list slicing to achieve what you want with much less code.
pro_rating = temp1[0::3]
pro_name = temp1[1::3]
pro_amount_rating = temp1[2::3]
This would sort the 1st element into rating, the 2nd element into name and the 3rd into amount, repeating for every 3rd element.
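A quick check of the slicing approach on the first three professors from the question's list (the records variable is an extra added here to show how the three lists line up again):

```python
# First nine items of the question's list: rating, name, number of ratings.
temp1 = ['3.8', 'Weiss, Earl', '139 RATINGS',
         '2.3', 'Jeppson, Catherine', '114 RATINGS',
         '3.3', 'Kiani-Aslani, Rajabali', '88 RATINGS']

pro_rating = temp1[0::3]         # items 0, 3, 6, ...
pro_name = temp1[1::3]           # items 1, 4, 7, ...
pro_amount_rating = temp1[2::3]  # items 2, 5, 8, ...

# The three lists stay aligned, so zip pairs them back up per professor.
records = list(zip(pro_rating, pro_name, pro_amount_rating))
print(records)
```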
It would be cleaner to change the line counter % 1 == 0 to counter % 3 == 0. Remember that you want every third element, so take the counter modulo 3 and compare the remainder. Note that this pairing of remainders assumes the counter starts at 0 rather than 1.
Then you can stop resetting the counter in the third if block and change that condition to counter % 3 == 2. I'll leave it as an exercise to figure out what the middle if block should be.
You can create a list of namedtuples storing each professor's rating, name, and number of ratings:
from collections import namedtuple
import re
professor = namedtuple('professor', ['rating', 'name', 'ratings'])
d = ['3.8', 'Weiss, Earl', '139 RATINGS', '2.3', 'Jeppson, Catherine', '114 RATINGS', '3.3', 'Kiani-Aslani, Rajabali', '88 RATINGS', '2.6', 'Lundblad, Heidemarie', '82 RATINGS', '2.4', 'Stone, Ronald', '75 RATINGS', '3.7', 'Vedd, Rishma', '66 RATINGS', '3.3', 'Foster, Robert', '60 RATINGS', '4.9', 'Basmadzhyan, Babken', '59 RATINGS', '4.3', 'Grodsky, Marilyn', '57 RATINGS', '2.4', 'Dorsey, Norris', '53 RATINGS', '2.6', 'Zvinakis, Kristina', '51 RATINGS', '3.2', 'MacKlin, James', '50 RATINGS', '2.8', 'Liu, David', '48 RATINGS', '3.2', 'Doron, Michael', '48 RATINGS', '2.1', 'Rogoff, Donald', '45 RATINGS', '3.1', 'Sangeladji, Mohammad', '43 RATINGS', '4.0', 'Fountaine, Howard', '42 RATINGS', '4.6', 'Stout, Gary', '41 RATINGS', '3.4', 'Gray, Glen', '34 RATINGS', '3.0', 'Wilson, Barbara', '31 RATINGS', '4.0', 'Yoon, Sung-Wook', '31 RATINGS', '4.5', 'Her, Young-Won', '31 RATINGS', '3.0', 'Kiddoo, Robert', '30 RATINGS', '3.0', 'Chiu, J', '27 RATINGS', '3.3', 'Barker, Robert', '25 RATINGS', '3.7', 'Qureshi, Mahmood', '23 RATINGS', '3.7', 'Primes, David', '22 RATINGS', '2.6', 'Chen, Raymond', '20 RATINGS', '3.3', 'Jones, Christopher', '20 RATINGS', '3.2', 'Zhan, Jun', '20 RATINGS', '4.6', 'Bell, Janice', '15 RATINGS', '3.8', 'Alhashim, Dhia D', '12 RATINGS', '2.9', 'Ansari, Shahid', '11 RATINGS', '4.5', 'Rousselet, Robin (rob)', '9 RATINGS', '2.4', 'Lucero, Terrence', '8 RATINGS', '1.0', 'Perez, Marlene', '7 RATINGS', '1.3', 'Crespo, Patricia', '7 RATINGS', '4.8', 'Knight, Ridgeway', '7 RATINGS', '2.5', 'Julius, Ed', '6 RATINGS', '2.9', 'Reinstein, Todd', '6 RATINGS']
grouped_data = [d[i:i+3] for i in range(0, len(d), 3)]
results = [professor(float(a), b, int(re.findall(r'^\d+', c)[0])) for a, b, c in grouped_data]
Output:
[professor(rating=3.8, name='Weiss, Earl', ratings=139), professor(rating=2.3, name='Jeppson, Catherine', ratings=114), professor(rating=3.3, name='Kiani-Aslani, Rajabali', ratings=88), professor(rating=2.6, name='Lundblad, Heidemarie', ratings=82), professor(rating=2.4, name='Stone, Ronald', ratings=75), professor(rating=3.7, name='Vedd, Rishma', ratings=66), professor(rating=3.3, name='Foster, Robert', ratings=60), professor(rating=4.9, name='Basmadzhyan, Babken', ratings=59), professor(rating=4.3, name='Grodsky, Marilyn', ratings=57), professor(rating=2.4, name='Dorsey, Norris', ratings=53), professor(rating=2.6, name='Zvinakis, Kristina', ratings=51), professor(rating=3.2, name='MacKlin, James', ratings=50), professor(rating=2.8, name='Liu, David', ratings=48), professor(rating=3.2, name='Doron, Michael', ratings=48), professor(rating=2.1, name='Rogoff, Donald', ratings=45), professor(rating=3.1, name='Sangeladji, Mohammad', ratings=43), professor(rating=4.0, name='Fountaine, Howard', ratings=42), professor(rating=4.6, name='Stout, Gary', ratings=41), professor(rating=3.4, name='Gray, Glen', ratings=34), professor(rating=3.0, name='Wilson, Barbara', ratings=31), professor(rating=4.0, name='Yoon, Sung-Wook', ratings=31), professor(rating=4.5, name='Her, Young-Won', ratings=31), professor(rating=3.0, name='Kiddoo, Robert', ratings=30), professor(rating=3.0, name='Chiu, J', ratings=27), professor(rating=3.3, name='Barker, Robert', ratings=25), professor(rating=3.7, name='Qureshi, Mahmood', ratings=23), professor(rating=3.7, name='Primes, David', ratings=22), professor(rating=2.6, name='Chen, Raymond', ratings=20), professor(rating=3.3, name='Jones, Christopher', ratings=20), professor(rating=3.2, name='Zhan, Jun', ratings=20), professor(rating=4.6, name='Bell, Janice', ratings=15), professor(rating=3.8, name='Alhashim, Dhia D', ratings=12), professor(rating=2.9, name='Ansari, Shahid', ratings=11), professor(rating=4.5, name='Rousselet, Robin (rob)', ratings=9), 
professor(rating=2.4, name='Lucero, Terrence', ratings=8), professor(rating=1.0, name='Perez, Marlene', ratings=7), professor(rating=1.3, name='Crespo, Patricia', ratings=7), professor(rating=4.8, name='Knight, Ridgeway', ratings=7), professor(rating=2.5, name='Julius, Ed', ratings=6), professor(rating=2.9, name='Reinstein, Todd', ratings=6)]
You can just use enumerate. Also, if you use n % i, you do not need to reset the counter to 0. For example, 3 % 3 == 6 % 3 == 9 % 3 == 0, and so on.
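A minimal sketch of the enumerate idea, keeping everything in a single loop as the question asks. It uses the first six items of the question's list; the names position and remainder are chosen here for illustration:

```python
temp1 = ['3.8', 'Weiss, Earl', '139 RATINGS',
         '2.3', 'Jeppson, Catherine', '114 RATINGS']

pro_rating, pro_name, pro_amount_rating = [], [], []

# enumerate counts from 0, so the remainder cycles 0, 1, 2, 0, 1, 2, ...
for position, item in enumerate(temp1):
    remainder = position % 3
    if remainder == 0:
        pro_rating.append(item)
    elif remainder == 1:
        pro_name.append(item)
    else:
        pro_amount_rating.append(item)

print(pro_rating, pro_name, pro_amount_rating)
```

Because the remainder is taken against the same divisor in every branch, each item lands in exactly one list and no counter reset is needed.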
You have misidentified the origin of your problem:
counter % 1 == 0 actually asks: is counter a multiple of 1?
And every integer is a multiple of 1.
The checks with 2 and 3 only behave because the reset keeps the counter between 1 and 3. Without the reset, keep in mind that 6 % 3 == 6 % 2 == 6 % 1 == 0.

How to sort a NumPy array of strings by the last column

Is there a way to sort the rows of an array by the last element, in this case the cell IDs? The cell ID is built as follows: "CellID_NumberOfCell".
arr =np.array([['65.0','30.0','20.0','0.0','0_0'],
['2.0','29.0','24.0','0.0','1_0'],
['0.0','18.0','4.0','0.0','2_0'],
['16.0','9.0','0.0','9990.0','7_203'],
['16.0','9.0','0.0','9990.0','0_203'],
['20.0','23.0','31.0','9990.0','8_158'],
['65.0','30.0','20.0','0.0','0_10']])
So after sorting it should look like:
arr =np.array([['65.0','30.0','20.0','0.0','0_0'],
['65.0','30.0','20.0','0.0','0_10'],
['16.0','9.0','0.0','9990.0','0_203'],
['2.0','29.0','24.0','0.0','1_0'],
['0.0','18.0','4.0','0.0','2_0'],
['16.0','9.0','0.0','9990.0','7_203'],
['20.0','23.0','31.0','9990.0','8_158']])
EDIT:
Is it also possible to delete the numbers after the underscore after sorting, so that I just have the ID? Instead of 0_0, just 0.
EDIT2:
After sorting by ID, the rows should also be sorted by time, so that all rows with ID 0, for example, are ordered by time 0, 1, ..., 9999.
np.argsort(arr[:, -1]) will give you the permutation so that elements of the last column of arr are ordered.
Then, arr[np.argsort(arr[:, -1])] reorders the rows of arr according to this permutation.
Beware that lexicographic order is used since your data consists of strings, so 0_10 comes before 0_2. If this is not what you want, you should split the last column, and I advise you to use a pandas.DataFrame:
import pandas as pd
df = pd.DataFrame(arr)
df[['Cell', 'CellIndex']] = df[df.columns[-1]].str.split('_', n=1, expand=True)
df['Cell'] = df['Cell'].astype(int)
df['CellIndex'] = df['CellIndex'].astype(int)
df.sort_values(['Cell', 'CellIndex'])
pandas is really the way to go to manipulate this kind of data.
We need to split the last column by that underscore, lexsort it and then use those indices to sort the input array.
Thus, an implementation would be -
def numpy_app(arr):
    # Extract out the strings on last column split based on '_'.
    # Thus, for given sample we would have the last column would be
    # split further into 3 columns, the middle one being of '_''s.
    a = np.char.partition(arr[:,-1],'_')
    # Lexsort it on the last numeric cols (0,2). We need to flip
    # the order of columns to give precedence to the first string
    sidx = np.lexsort(a[:,2::-2].astype(int).T)
    # Index into input array with lex-sorted indices for final o/p
    return arr[sidx]
Based on the edits in the question, it seems we want to cut out the string after the underscore. To do so, here's a modified version -
def numpy_cut_app(arr):
    a = np.char.partition(arr[:,-1],'_')
    sidx = np.lexsort(a[:,2::-2].astype(int).T)
    out = arr[sidx]
    # Replace the last column with the first string off the last column's split one
    out[:,-1] = a[sidx,0]
    return out
Based on more edits, it seems we want to include the fourth column into lex-sorting and neglect everything after the underscore in the last column. So, a further modified version would be -
def numpy_cut_col3_app(arr):
    a = np.char.partition(arr[:,-1],'_')
    # Lex-sort using first off the split strings from last col(precedence to it)
    # and col-3 of input array. The IDs are cast to int so that 10 sorts after 2.
    sidx = np.lexsort([arr[:,3].astype(float), a[:,0].astype(int)])
    out = arr[sidx]
    out[:,-1] = a[sidx,0]
    return out
Sample runs -
In [567]: arr
Out[567]:
array([['65.0', '30.0', '20.0', '0.0', '9_49'],
['2.0', '29.0', '24.0', '0.0', '1_0'],
['0.0', '18.0', '4.0', '0.0', '2_0'],
['16.0', '9.0', '0.0', '9990.0', '7_203'],
['16.0', '9.0', '0.0', '9990.0', '9_5'],
['20.0', '23.0', '31.0', '9990.0', '8_158'],
['65.0', '30.0', '20.0', '0.0', '9_50']],
dtype='|S6')
In [568]: numpy_app(arr)
Out[568]:
array([['2.0', '29.0', '24.0', '0.0', '1_0'],
['0.0', '18.0', '4.0', '0.0', '2_0'],
['16.0', '9.0', '0.0', '9990.0', '7_203'],
['20.0', '23.0', '31.0', '9990.0', '8_158'],
['16.0', '9.0', '0.0', '9990.0', '9_5'],
['65.0', '30.0', '20.0', '0.0', '9_49'],
['65.0', '30.0', '20.0', '0.0', '9_50']],
dtype='|S6')
In [569]: numpy_cut_app(arr)
Out[569]:
array([['2.0', '29.0', '24.0', '0.0', '1'],
['0.0', '18.0', '4.0', '0.0', '2'],
['16.0', '9.0', '0.0', '9990.0', '7'],
['20.0', '23.0', '31.0', '9990.0', '8'],
['16.0', '9.0', '0.0', '9990.0', '9'],
['65.0', '30.0', '20.0', '0.0', '9'],
['65.0', '30.0', '20.0', '0.0', '9']],
dtype='|S6')
You can do it easily with sorted and a lambda function, wrapping the result in np.array (as suggested by @Divakar) to get a NumPy array back:
np.array(sorted(arr, key=lambda x :x[-1]))
output
[['65.0', '30.0', '20.0', '0.0', '0_0'],
['65.0', '30.0', '20.0', '0.0', '0_10'],
['16.0', '9.0', '0.0', '9990.0', '0_203'],
['2.0', '29.0', '24.0', '0.0', '1_0'],
['0.0', '18.0', '4.0', '0.0', '2_0'],
['16.0', '9.0', '0.0', '9990.0', '7_203'],
['20.0', '23.0', '31.0', '9990.0', '8_158']]
EDIT :
you can do it by using this, not pretty, but does the work
np.array([ np.append(i[:-1],i[-1].split("_")[0]) for i in sorted(list(arr), key=lambda x :x[-1])])
output
array([['65.0', '30.0', '20.0', '0.0', '0'],
['65.0', '30.0', '20.0', '0.0', '0'],
['16.0', '9.0', '0.0', '9990.0', '0'],
['2.0', '29.0', '24.0', '0.0', '1'],
['0.0', '18.0', '4.0', '0.0', '2'],
['16.0', '9.0', '0.0', '9990.0', '7'],
['20.0', '23.0', '31.0', '9990.0', '8']],
dtype='<U6')
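Keep in mind that key=lambda x: x[-1] compares the IDs as strings. A sketch of a numeric sort key on plain Python lists follows; the rows are taken from the question, except for the '0_2' row, which is invented here to show the difference, and the helper name cell_key is also made up:

```python
rows = [['65.0', '30.0', '20.0', '0.0', '0_0'],
        ['2.0', '29.0', '24.0', '0.0', '1_0'],
        ['16.0', '9.0', '0.0', '9990.0', '0_203'],
        ['65.0', '30.0', '20.0', '0.0', '0_10'],
        ['0.0', '18.0', '4.0', '0.0', '0_2']]  # invented row for illustration


def cell_key(row):
    """Split 'CellID_NumberOfCell' and compare both parts as integers."""
    cell_id, number = row[-1].split('_')
    return int(cell_id), int(number)


numeric_order = [row[-1] for row in sorted(rows, key=cell_key)]
string_order = [row[-1] for row in sorted(rows, key=lambda row: row[-1])]

print(numeric_order)
print(string_order)
```

The sorted rows can be passed to np.array(...) afterwards, as in the answer above, if a NumPy array is needed.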

sorting by dictionary value in array python

Okay, so I've been working on processing some annotated text output. What I have so far is a dictionary with the annotation as key and its relations as a list of elements:
{'Adenotonsillectomy': ['0', '18', '1869', '1716'],
'OSAS': ['57', '61'],
'apnea': ['41', '46'],
'can': ['94', '97', '1796', '1746'],
'deleterious': ['103', '114'],
'effects': ['122', '129', '1806', '1752'],
'for': ['19', '22'],
'gain': ['82', '86', '1776', '1734'],
'have': ['98', '102', ['1776 1786 1796 1806 1816'], '1702'],
'health': ['115', '121'],
'lead': ['67', '71', ['1869 1879 1889'], '1695'],
'leading': ['135', '142', ['1842 1852'], '1709'],
'may': ['63', '66', '1879', '1722'],
'obesity': ['146', '153'],
'obstructive': ['23', '34'],
'sleep': ['35', '40'],
'syndrome': ['47', '55'],
'to': ['143', '145', '1852', '1770'],
'weight': ['75', '81'],
'when': ['130', '134', '1842', '1758'],
'which': ['88', '93', '1786', '1740']}
What I want to do is sort this by the first element in the array and reorder the dict as:
'Adenotonsillectomy': ['0', '18', '1869', '1716']
'for': ['19', '22'],
'obstructive': ['23', '34'],
'sleep': ['35', '40'],
'apnea': ['41', '46'],
etc...
right now I've tried to use operator to sort by value:
sorted(dependency_dict.items(), key=lambda x: x[1][0])
However the output I'm getting is still incorrect:
[('Adenotonsillectomy', ['0', '18', '1869', '1716']),
('deleterious', ['103', '114']),
('health', ['115', '121']),
('effects', ['122', '129', '1806', '1752']),
('when', ['130', '134', '1842', '1758']),
('leading', ['135', '142', ['1842 1852'], '1709']),
('to', ['143', '145', '1852', '1770']),
('obesity', ['146', '153']),
('for', ['19', '22']),
('obstructive', ['23', '34']),
('sleep', ['35', '40']),
('apnea', ['41', '46']),
('syndrome', ['47', '55']),
('OSAS', ['57', '61']),
('may', ['63', '66', '1879', '1722']),
('lead', ['67', '71', ['1869 1879 1889'], '1695']),
('weight', ['75', '81']),
('gain', ['82', '86', '1776', '1734']),
('which', ['88', '93', '1786', '1740']),
('can', ['94', '97', '1796', '1746']),
('have', ['98', '102', ['1776 1786 1796 1806 1816'], '1702'])]
I'm not sure what's going wrong. Any help is appreciated.
The first elements are strings, so the entries are sorted lexicographically (character by character), which puts '103' before '19'. If you want to sort them by integer value, convert the value to int first:
sorted(dependency_dict.items(), key=lambda x: int(x[1][0]))
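A short check on a few entries from the question. Wrapping the result in dict() is an addition here; it relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on:

```python
# A few entries from the question's dictionary, deliberately out of order.
dependency_dict = {
    'deleterious': ['103', '114'],
    'for': ['19', '22'],
    'Adenotonsillectomy': ['0', '18', '1869', '1716'],
    'apnea': ['41', '46'],
}

# Sort by the first list element, compared as an integer.
ordered = dict(sorted(dependency_dict.items(), key=lambda item: int(item[1][0])))

print(list(ordered))
```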
