Looks like my last question was closed, but I forgot to mention the update below the first time: only a few of the columns should be modified, not all of them.
What is the best way to modify (sort) a Series of data in a Pandas DataFrame?
For example, after importing some data, the values across the columns should be in ascending order within each row, but I need to reorder them if they are not. The data is being imported from a csv into a pandas DataFrame.
num_1 num_2 num_3
date
2020-02-03 17 22 36
2020-02-06 52 22 14
2020-02-10 5 8 29
2020-02-13 10 14 30
2020-02-17 7 8 19
I would ideally find the second row (pandas Series) in the DataFrame as the record to be fixed:
num_1 num_2 num_3 num_4 num_5
date
2020-02-06 52 22 14 25 27
And modify it to be (only sorting num_1 through num_3 and not touching columns 4 and 5):
num_1 num_2 num_3 num_4 num_5
date
2020-02-06 14 22 52 25 27
I could iterate over the DataFrame and check for indexes that have Series data out of order by comparing each column to the column to its right, then write a custom sorter and write that record back into the DataFrame, but that seems clunky.
I have to imagine there's a more Pythonic (Pandas) way to do this type of thing; I just can't find it in the pandas documentation. I don't want to reorder the rows, just make sure the values are in the appropriate order across the columns.
Update: I forgot to mention one of the most critical aspects. There are other columns in the DataFrame that should not be touched. So in the example above, only sort (num_1, num_2, num_3), not the others. I'm guessing I can use the solutions posed already: split the DataFrame, sort the first part, and re-merge them together. Is there an alternative?
Splitting and reconnecting does not sound bad to me; here is what I got:
cols_to_sort = ['num_1', 'num_2', 'num_3']
pd.concat([pd.DataFrame(np.sort(df[cols_to_sort].values),
                        columns=cols_to_sort, index=df.index),
           df[df.columns[~df.columns.isin(cols_to_sort)]]],
          axis=1)
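One caveat: pd.concat puts the sorted block first, so the column order changes. If the original left-to-right order matters, you can restore it at the end; a small addition, assuming the concat result was saved to a (hypothetical) variable result:
# Put the columns back in their original order
result = result[df.columns]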
The best way is to use the sort_values() method and let it work only on the columns that require sorting.
cols = ['col1', 'col2', 'col3']
for index, row in df.iterrows():
    # Sort this row's values across the chosen columns only
    df.loc[index, cols] = row[cols].sort_values().values
This loops through every row, puts the values in the desired columns in ascending order, and writes them back.
Pandas does not support what you are asking for out of the box (as far as I know). Usually each column is a separate feature, so reordering values across columns may seem a little odd.
Anyway, pandas works extremely well with numpy, and that is your rescue here.
You can convert the relevant columns to a numpy array, sort each row, and then put the result back into the dataframe.
import numpy as np
cols_list = ["num_1","num_2","num_3"]
# Pull the relevant columns out as a numpy array
tmp_arr = np.array(df.loc[:, cols_list])
# Sort each row in place (axis=1 sorts across the columns)
tmp_arr.sort(axis=1)
# Write the sorted values back into the same columns
df.loc[:, cols_list] = tmp_arr
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Day":range(1,5),"num_1":[5,2,7,1], "num_2":[2,7,4,10], "num_3":[7,27,64,10]})
print(df)
cols_list = ["num_1","num_2","num_3"]
tmp_arr = np.array(df.loc[:, cols_list])
tmp_arr.sort(axis=1)
df.loc[:, cols_list] = tmp_arr
print(df)
The first print result:
Day num_1 num_2 num_3
0 1 5 2 7
1 2 2 7 27
2 3 7 4 64
3 4 1 10 10
The second print result:
Day num_1 num_2 num_3
0 1 2 5 7
1 2 2 7 27
2 3 4 7 64
3 4 1 10 10
You can select whatever columns you like (cols_list).
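If you ever need descending order instead, note that np.sort only sorts ascending, so reverse each row with slicing (a small variation on the same idea):
tmp_arr = np.sort(np.array(df.loc[:, cols_list]), axis=1)[:, ::-1]  # descending per row
df.loc[:, cols_list] = tmp_arr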
After I wrote this, I found a similar solution here: Fastest way to sort each row in a pandas dataframe
I have a list of data frames, dataframes, a list of names, keeplist, and a dict, HydroCap.
I am trying to loop through the columns of each data frame based on the column names in keeplist, applying a where function inside the loop to replace any value in a column that is greater than the dictionary value (for its respective key) with that dictionary value. The issue is I run into TypeError: '>=' not supported between instances of 'str' and 'int' and I am not sure how to solve it.
keeplist = ['BOUND','GCOUL','CHIEF','ROCKY','WANAP','PRIRA','LGRAN','LMONU','ICEHA','MCNAR','DALLE']
HydroCap = {'BOUND':55000,'GCOUL':280000,'CHIEF':219000,'ROCKY':220000,'WANAP':161000,'PRIRA':162000,'LGRAN':130000,'LMONU':130000,'ICEHA':106000,'MCNAR':232000,'DALLE':375000}
for i in dataframes:
for c in i[keeplist]:
c = np.where(c >= HydroCap[c], HydroCap[c], c)
Any push in the right direction would be greatly appreciated. I think the issue is that it is expecting an index value, as in HydroCap[1], instead of HydroCap[c], but that is a hunch.
First 7 columns of dataframes[0]:
Week Month Day Year BOUND GCOUL CHIEF \
0 1 8 5 1979 44999.896673 161241.036388 166497.578098
1 2 8 12 1979 15309.259762 58219.122747 63413.204052
2 3 8 19 1979 15316.965781 56072.024363 60606.956215
3 4 8 26 1979 14371.269016 58574.003087 63311.569888
import pandas as pd
import numpy as np
# Since I don't have all of the dataframes, I just use the sample you shared
df = pd.read_csv('dataframe.tsv', sep = "\t")
# Note, I've changed some values so you can see something actually happens
keeplist = ['BOUND','GCOUL','CHIEF']
HydroCap = {'BOUND':5500,'GCOUL':280000,'CHIEF':21900}
# The inside of the loop has been changed to accomplish the actual goal
# First, there are now two variables inside the loop: col, and c
# col is the column
# c represents a single element in that column at a time
# The code operates over a column at a time,
# using a list comprehension to cycle over each element
# and replace the full column with the new values at once
for col in df[keeplist]:
df[col] = [np.where(c >= HydroCap[col], HydroCap[col], c) for c in df[col]]
Which produces:
df
   Week  Month  Day  Year   BOUND          GCOUL    CHIEF
0     1      8    5  1979  5500.0  161241.036388  21900.0
1     2      8   12  1979  5500.0   58219.122747  21900.0
2     3      8   19  1979  5500.0   56072.024363  21900.0
3     4      8   26  1979  5500.0   58574.003087  21900.0
In order to replace elements in a dataframe, you either need to go a whole column at a time, or reassign values to a cell specified by row and column coordinates. Reassigning the c variable in your original code (assuming it represented the cell values you had in mind, and not the column name, as was actually the case) doesn't alter anything in the dataframe.
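For reference, this can also be vectorized with no per-element loop at all: Series.clip can cap each kept column at its dictionary value in one call. A sketch, using the same keeplist and HydroCap values as above and a hypothetical stand-in frame:
import pandas as pd

# Hypothetical stand-in for one frame from your dataframes list
df = pd.DataFrame({'BOUND': [44999.9, 15309.3],
                   'GCOUL': [161241.0, 58219.1],
                   'CHIEF': [166497.6, 63413.2]})
keeplist = ['BOUND', 'GCOUL', 'CHIEF']
HydroCap = {'BOUND': 5500, 'GCOUL': 280000, 'CHIEF': 21900}

# Cap every kept column at its dictionary value in one vectorized call
for col in keeplist:
    df[col] = df[col].clip(upper=HydroCap[col])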
I want to make a subset of my DataFrame object using pandas (or any other Python library) with hierarchical indexing, such that it can be iterated over depending on the number of rows in one of the columns.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv(address)
trajectory frame x y
1 1 447,956 2,219
1 2 447,839 2,327
1 3 449,183 1,795
1 4 450,444 1,833
1 5 448,514 1,708
1 6 451,532 1,832
1 7 448,471 1,759
1 8 450,028 2,097
1 9 448,215 2,203
1 10 449,311 2,063
1 11 451,745 1,76
1 12 450,827 2,264
1 13 448,991 2,208
1 14 452,829 3,106
1 15 448,688 1,77
1 16 449,844 1,951
1 17 450,044 1,991
1 18 449,835 1,901
1 19 450,793 3,49
1 20 449,618 2,354
2 1 445.936 7.219
2 2 442.879 3.327
3 1 441.283 9.795
4 1 447.956 2.219
4 3 447.839 2.327
4 6 449.183 1.795
In this DataFrame, let's say there are 4 columns, named 'trajectory', 'frame', 'x' and 'y'. The number of 'trajectory' values can differ from one dataframe to another. Each 'trajectory' can have multiple frames between 1 and 20; they can be sequential from 1-20 or have some missing frames. Each frame has its own value in the 'x' and 'y' columns.
My aim is to create a new dataframe containing only those 'trajectory' values where the 'frame' value is present for all 20 rows. As the number of rows in the 'trajectory' and 'frame' columns changes, I would like code that can be used in such conditions.
df_1 = df.set_index(['trajectory','frame'], drop=False)
Here, I did hierarchical indexing using 'trajectory' and 'frame', and then I found that 'trajectory' numbers 1 and 6 have 20 frames in them, so I could manually select them using the following code.
df_1_subset = df_1[(df_1['trajectory'] == 1) | (df_1['trajectory'] == 6)]
However, I have multiple csv files, and in each DataFrame the 'trajectory' values that have 20 rows in the 'frame' column will be different, so I would have to do this manually every time. I think there must be a better way, but I just cannot seem to find it. I am very new to coding and I would really appreciate anybody's help. Thank you very much in advance.
If you need to filter the trajectory level for 1 or 6, use Index.get_level_values with Index.isin:
df_1_subset = df_1[df_1.index.get_level_values('trajectory').isin([1, 6])]
If you need to filter the trajectory level for 1 and the frame level for 6, select with DataFrame.loc and a tuple:
df_1_subset = df_1.loc[(1, 6)]
Alternative:
df_1_subset = df_1.loc[(df_1.index.get_level_values('trajectory') == 1) &
                       (df_1.index.get_level_values('frame') == 6)]
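If the goal is to avoid hard-coding 1 and 6 entirely, one option (a sketch, assuming frame numbers are unique within each trajectory) is to let groupby().filter pick the complete trajectories automatically:
# Keep only the trajectories whose 'frame' column contains all 20 values
df_complete = df.groupby('trajectory').filter(lambda g: g['frame'].nunique() == 20)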
import numpy as np
import pandas as pd

my_df = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list('abc'))
my_df.sum(axis="rows")
The output is:
a 22
b 26
c 30
# I expect it to sum each row, thereby giving
0 6
1 15
2 24
3 33
my_df.sum(axis="columns")  # helps achieve this
Why does it work counterintuitively?
In a similar context, the drop method works as it should, i.e. when I write
my_df.drop(['a'], axis="columns")
# This drops column "a".
Am I missing something? Please enlighten.
Short version
It is a naming convention: the axis you pass is the one that gets summed over, so the sum over the columns gives a row-wise sum. You are looking for axis='columns'.
Long version
OK, that was interesting. In pandas, 0 refers to the index (the rows) and 1 to the columns, and looking in the docs we find that the allowed params are:
axis : {index (0), columns (1)}
The axis names the dimension that gets summed over, so this reads: the sum over the index (axis=0) returns one sum per column, and the sum over the columns returns one sum per row. Passing axis='rows' ends up behaving like axis=0 (pandas treats 'rows' as an alias for the index axis), which is why you got column sums. What you want is axis=1 or axis='columns', which gives your desired output:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
print(df.sum(axis=1))
Returns:
0 6
1 15
2 24
3 33
dtype: int64
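The drop comparison from the question resolves the same way once the axis is read as the dimension being operated on: sum collapses the columns when axis='columns', while drop removes labels from the column axis. Side by side, using the df built above:
print(df.drop(['a'], axis='columns'))  # removes column 'a'
print(df.sum(axis='columns'))          # sums across the columns, one value per row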
I'm converting a financial spreadsheet into Pandas, and this is a frequent challenge that comes up.
In Excel, suppose you have a calculation where, for columns 0 to n, the value depends on the previous column [shown in the format Cell(row, column)]: Cell(1,n) = (Cell(1,n-1)^2)*5.
Obviously, for n=2, you could create a calculated column in Pandas:
df[2] = (df[1]**2) * 5
But for a chain of, say, 30 columns, that doesn't work. So currently I am using a for loop.
total_columns_needed = list(range(1, 100))
for i in total_columns_needed:
    df[i] = (df[i-1]**2) * 5
That loop works fine, but I am trying to see how I could use map and apply to make this look cleaner. From reading, apply is a loop underneath, so I'm not sure whether I will gain any speed by switching, but it could shrink the code a lot.
The problem that I've had with:
df.apply()
is that 1) there could be other columns not involved in the calculation (which arguably shouldn't be there if the data is properly normalised), and 2) the columns don't exist yet. Part 2 could possibly be solved by creating the dataframe with all the needed columns, but I'm trying to avoid that for other reasons.
Any help in solving this greatly appreciated!
To automatically generate a bunch of columns, without a loop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Val': [0, 1, 2, 3, 4]})
print(df.Val.apply(lambda x: pd.Series(x + np.arange(0, 25, 5))))
0 1 2 3 4
0 0 5 10 15 20
1 1 6 11 16 21
2 2 7 12 17 22
3 3 8 13 18 23
4 4 9 14 19 24
numpy.arange(0, 25, 5) gives you array([ 0,  5, 10, 15, 20]). For each of the values in Val, we add that value to the array, creating a new Series.
Finally, apply puts the new Series together into a new DataFrame.
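For the strictly recursive chain in the question, where each column is computed from the previous one, the dependency can't be vectorized away, but itertools.accumulate keeps the loop implicit. A sketch, assuming column 0 holds the seed values and a hypothetical chain length n_cols:
from itertools import accumulate, repeat
import pandas as pd

df = pd.DataFrame({0: [1.0, 2.0, 3.0]})  # hypothetical seed column
n_cols = 5

# Each step squares the previous column and multiplies by 5
chain = accumulate(repeat(None, n_cols - 1), lambda s, _: s**2 * 5, initial=df[0])
result = pd.concat(list(chain), axis=1, keys=range(n_cols))
print(result)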
I have a huge DataFrame where some columns have the same names. When I try to pick a column that exists twice (e.g. del df['col name'] or df2 = df['col name']), I get an error. What can I do?
You can address columns by index:
>>> df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','a'])
>>> df
a a
0 1 2
1 3 4
2 5 6
>>> df.iloc[:,0]
0 1
1 3
2 5
Or you can rename columns, like
>>> df.columns = ['a','b']
>>> df
a b
0 1 2
1 3 4
2 5 6
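If only some labels repeat and you don't want to retype every name, the renaming can also be done programmatically (a sketch; the .1 suffix scheme is arbitrary):
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'a'])

# Append a counter to the second and later occurrences of each name
counts = {}
new_cols = []
for name in df.columns:
    n = counts.get(name, 0)
    new_cols.append(name if n == 0 else f'{name}.{n}')
    counts[name] = n + 1
df.columns = new_cols
print(df.columns.tolist())  # ['a', 'a.1']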
This is not a good situation to be in. Best would be to create a hierarchical column labeling scheme (pandas allows multi-level column labels and row index labels). Determine what it is that makes the two different columns that share a name actually different from each other, and leverage that to create a hierarchical column index.
In the meantime, if you know the positional location of the columns in the ordered list of columns (e.g. from dataframe.columns), you can use the positional indexing features, such as .iloc[], to retrieve values from the column positionally.
You can also create copies of the columns with new names, such as:
dataframe["new_name"] = dataframe.iloc[:, column_position].values
where column_position references the positional location of the column you're trying to get (not the name).
These may not work for you if the data is too large, however, so the best approach is to find a way to modify the construction process to get the hierarchical column index.
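A minimal sketch of the hierarchical-columns idea (the level names here are made up for illustration):
import pandas as pd

# Two columns that would otherwise both be named 'price',
# disambiguated by a second level in the column index
columns = pd.MultiIndex.from_tuples([('price', 'bid'), ('price', 'ask')])
df = pd.DataFrame([[100.0, 101.0], [99.5, 100.5]], columns=columns)

print(df[('price', 'bid')])  # unambiguous selection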
Another solution:
def remove_dup_columns(frame):
    # Keep only the first occurrence of each column name
    keep_names = set()
    keep_icols = list()
    for icol, name in enumerate(frame.columns):
        if name not in keep_names:
            keep_names.add(name)
            keep_icols.append(icol)
    return frame.iloc[:, keep_icols]
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randint(0, 50, (5, 4)), columns=['A', 'A', 'B', 'B'])
print(frame)
print(remove_dup_columns(frame))
The output is
A A B B
0 18 44 13 47
1 41 19 35 28
2 49 0 30 16
3 39 29 43 41
4 26 19 48 13
A B
0 18 13
1 41 35
2 49 30
3 39 43
4 26 48
The following function removes columns with duplicate names and keeps only one. It is not exactly what you asked for, but you can use pieces of it to solve your problem. The idea is to work with index numbers so you can address specific column positions directly: the positions are unique while the column names aren't.
from copy import deepcopy

def remove_multiples(df, varname):
    """
    makes a copy of the first column of all columns with the same name,
    deletes all columns with that name and inserts the first column again
    """
    dfout = deepcopy(df)
    if varname in dfout.columns:
        # grab the first column whose name matches varname
        tmp = dfout.iloc[:, min([i for i, x in enumerate(dfout.columns == varname) if x])]
        del dfout[varname]    # deletes every column with that name
        dfout[varname] = tmp  # re-append the saved copy at the end
    return dfout
where
[i for i,x in enumerate(dfout.columns == varname) if x]
is the part you need
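For example, to get the positions of every column named 'A' in the frame from the earlier answer and pick one of them out positionally:
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randint(0, 50, (5, 4)), columns=['A', 'A', 'B', 'B'])

# Positions of all columns whose name matches 'A'
positions = [i for i, x in enumerate(frame.columns == 'A') if x]
print(positions)                    # [0, 1]
print(frame.iloc[:, positions[0]])  # the first 'A' column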