I have the following input:
col1 col2 col3
1 4 0
0 12 2
2 12 4
3 2 1
I want to sort the DataFrame according to the values in the columns, e.g. sorting it primarily for df[df==0].count() and secondarily for df.sum() would produce the output:
col2 col3 col1
4 0 1
12 2 0
12 4 2
2 1 3
pd.DataFrame.sort() takes a column object as its argument, which does not apply here, so how can I achieve this?
Firstly, I think your zero count is increasing from right to left whereas your sum is decreasing, so you may need to clarify that. You can get the zero count per column simply with (df == 0).sum().
To sort by a single aggregate, you can do something like:
col_order = (df == 0).sum().sort_values().index
df[col_order]
This sorts the series of aggregates by its values and the resulting index is the columns of df in the order you want.
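For example, with the sample data from the question (a minimal sketch; ties in the zero counts are left in whatever order the sort happens to produce, which the two-key version below resolves):
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 2, 3],
                   'col2': [4, 12, 12, 2],
                   'col3': [0, 2, 4, 1]})

# col2 has no zeros; col1 and col3 have one each (a tie)
col_order = (df == 0).sum().sort_values().index
df[col_order]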
To sort on two sets of values would be more awkward/tricky but you could do something like
aggs = pd.DataFrame({'zero_count': (df == 0).sum(), 'sum': df.sum()})
col_order = aggs.sort_values(['zero_count', 'sum']).index
df[col_order]
Note that sort_values takes an ascending parameter, which accepts either a single Boolean or a list of Booleans of equal length to the number of columns you are sorting on, e.g.
df.sort_values(['a', 'b'], ascending=[True, False])
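For example, sorting by ascending zero count and breaking ties by descending sum reproduces the output asked for in the question (a sketch building on the aggs frame above):
col_order = aggs.sort_values(['zero_count', 'sum'],
                             ascending=[True, False]).index
df[col_order]  # columns come out as col2, col3, col1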
I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:
import pandas as pd
# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]
df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df
Returns:
col1 col2
1 1 0
2 2 0
3 3 0
4 1 0
5 2 1
6 3 1
17 3 1
18 4 1
91 1 0
104 3 1
Using .loc, I can collect by index:
rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]
Returns:
col1 col2
5 2 1
6 3 1
17 3 1
Or I can select by column values:
df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returning:
col1 col2
5 2 1
6 3 1
17 3 1
104 3 1
However, I cannot do both. When I try the following:
df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returns:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.
Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).
The reason for the IndexingError is that you're calling df.loc with indexers of two different sizes.
rel_index has a length of 3, whereas df['col1'].isin(relc1) has a length of 10.
You need the index selector to also have a length of 10. If you look at the output of df['col1'].isin(relc1), it is an array of booleans, one per row of df.
You can achieve a similar array with the proper length by replacing df.loc[rel_index] with df.index.isin([5,6,17])
so you end up with:
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]
which returns:
col1 col2
5 2 1
6 3 1
17 3 1
That said, I'm not sure why your index would ever look like this. Typically when slicing this way you would use df.iloc and your index would match the 0, 1, 2, ... format.
Alternatively, you could first filter by value, then assign the resulting dataframe to a new variable df2:
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
then df2.loc[rel_index] would work without issue.
As for your overall goal, you can simply do the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
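For reference, here is the whole flow as a runnable sketch, rebuilding the sample data from the question; the assignment on the last line aligns df2 to the masked rows by index label and column name:
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
c1 = [1, 2, 3, 1, 2, 3, 3, 4, 1, 3]
c2 = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
df = pd.DataFrame(list(zip(c1, c2)), columns=['col1', 'col2'], index=cid)

rel_index = [5, 6, 17]
relc1, relc2 = [2, 3], [1]
df2 = pd.DataFrame(list(zip([1, 2, 3], [5, 6, 7])),
                   columns=['col1', 'col2'], index=rel_index)

# each mask has length 10, so they can be combined with &
mask = df.index.isin(rel_index) & df['col1'].isin(relc1) & df['col2'].isin(relc2)
df.loc[mask] = df2
print(df.loc[rel_index])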
@Rexovas explains it quite well. This is an alternative where you can compute the filters on the index before assigning. It is a bit long and involves MultiIndex, but once you get your head around MultiIndex, it should be intuitive:
(df
 # move col1 and col2 into the index
 .set_index(['col1', 'col2'], append=True)
 # filter on all three index levels
 .loc(axis=0)[rel_index, relc1, relc2]
 # move col1 and col2 back out of the index
 .reset_index(level=[-2, -1])
 # assign the new values
 .assign(col1=c3, col2=c4)
)
col1 col2
5 1 5
6 2 6
17 3 7
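The pipeline returns a new frame rather than modifying df. If you also want to overwrite the original dataframe, one option (my suggestion, not part of the answer above) is DataFrame.update, which aligns on index and columns and modifies df in place:
result = (df
          .set_index(['col1', 'col2'], append=True)
          .loc(axis=0)[rel_index, relc1, relc2]
          .reset_index(level=[-2, -1])
          .assign(col1=c3, col2=c4))

# overwrite the matching rows of df with the values from result
df.update(result)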
I have the following dataframe:
ID col1 col2 col3
0 ['a','b'] ['d','c'] ['e','d']
1 ['s','f'] ['f','a'] ['d','aaa']
Given an input string 'a',
I want to receive a dataframe like this:
ID col1 col2 col3
0 1 0 0
1 0 1 0
I see how to do it with a for loop, but that takes forever, and there must be a method I'm missing.
Processing lists in pandas is not vectorized, so performance is worse than with scalars.
The first idea is to reshape the list columns into a Series with DataFrame.stack, unpack the lists into scalars with Series.explode, compare against 'a', test whether any value matches per ID/column pair with a grouped any, and finally reshape back, converting the boolean mask to integers:
df1 = (df.set_index('ID')
         .stack()
         .explode()
         .eq('a')
         .groupby(level=[0, 1])
         .any()
         .unstack()
         .astype(int))
print (df1)
col1 col2 col3
ID
0 1 0 0
1 0 1 0
Or you can use DataFrame.applymap for an elementwise test with a lambda function and the in operator:
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
Or build a DataFrame from each list column, so it is possible to test for 'a' with DataFrame.any:
f = lambda x: pd.DataFrame(x.tolist(), index=x.index).eq('a').any(axis=1)
df1 = df.set_index('ID').apply(f).astype(int)
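All three variants assume the cells hold real Python lists; if the data was read from CSV they may actually be strings, which would need ast.literal_eval first. A minimal sketch reproducing the frame from the question with the applymap variant:
import pandas as pd

df = pd.DataFrame({'ID': [0, 1],
                   'col1': [['a', 'b'], ['s', 'f']],
                   'col2': [['d', 'c'], ['f', 'a']],
                   'col3': [['e', 'd'], ['d', 'aaa']]})

# elementwise membership test, then cast the boolean mask to int
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
print(df1)
#     col1  col2  col3
# ID
# 0      1     0     0
# 1      0     1     0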
I am trying to sort within multiple levels of a groupby multi-index dataframe based on an aggregated value. For an idea of what I'm talking about:
I have a hierarchical data set that is grouped on multiple levels. I then aggregate and sum a certain measure on the groups, and I want to rank them within each other.
At level 0, values should be ranked in descending order based on the sum of said measure. Then within level 1, values should again be ranked in descending order based on the sum of said measure, then level 2, and so on.
After a groupby, how do I sort at every level?
I know not providing an example is tough, but if I could be pointed in the right direction that'd be great, thanks
EDIT:
Original Data:
pd.DataFrame(data=[['a','car',6], ['a','bike',7], ['a','car',8], ['b','bike',9], ['b','car',10], ['b','bike',11]], columns=['a', 'b', 'c'])
Groupby:
df.groupby(['a','b']).agg({'c':'sum'})
Desired Output after resetting index:
pd.DataFrame(data=[['b','bike',20], ['b','car',10], ['a','car',14], ['a','bike',7]], columns=['a', 'b', 'c'])
Updated Answer
I will break this up into multiple steps (note that I changed your column names for clarity purposes, i.e. df.columns=['Col1','Col2','Col3']):
Col1 Col2 Col3
0 a car 6
1 a bike 7
2 a car 8
3 b bike 9
4 b car 10
5 b bike 11
Step 1
We first want to groupby('Col1') and use transform('sum') to replace each value with the sum of the values in Col3 for its group. Sorting that result with sort_values('Col3', ascending=False) puts the rows of the group with the largest total first; its index then tells us how to reorder the original dataframe df.
step1 = df.iloc[df.groupby('Col1').transform('sum').sort_values('Col3', ascending=False).index]
Which gives:
Col1 Col2 Col3
3 b bike 9
4 b car 10
5 b bike 11
0 a car 6
1 a bike 7
2 a car 8
Step 2
Now we can simply group by Col1 and Col2, using sort=False to preserve the sort order from Step 1, and aggregate based on the sum of Col3. Use reset_index() to clean up the index and restore your original columns.
step2 = step1.groupby(['Col1','Col2'], sort=False).agg({'Col3': 'sum'}).reset_index()
Your desired output:
Col1 Col2 Col3
0 b bike 20
1 b car 10
2 a car 14
3 a bike 7
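Both steps as one runnable script (a sketch; I pass 'sum' as a string and add kind='stable' so the original row order within each Col1 group is guaranteed to survive the sort, which the default quicksort does not promise):
import pandas as pd

df = pd.DataFrame(data=[['a', 'car', 6], ['a', 'bike', 7], ['a', 'car', 8],
                        ['b', 'bike', 9], ['b', 'car', 10], ['b', 'bike', 11]],
                  columns=['Col1', 'Col2', 'Col3'])

# Step 1: give every row its group's Col3 total, then order the rows
# so the group with the largest total comes first.
order = (df.groupby('Col1')['Col3'].transform('sum')
           .sort_values(ascending=False, kind='stable').index)
step1 = df.loc[order]

# Step 2: aggregate in the preserved order.
step2 = step1.groupby(['Col1', 'Col2'], sort=False).agg({'Col3': 'sum'}).reset_index()
print(step2)
One caveat: within a Col1 group, the (Col1, Col2) pairs come out in order of first appearance, which happens to coincide with descending sums in this example but is not guaranteed in general.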
I'd like to change the zero values in a dataframe to the value found in the last column of each row. I can solve this with a for loop over the columns or the rows, but that doesn't seem very pythonic to me.
In short, I have a dataframe like this:
col1 col2 col3 nonzero
1 2 0 10
1 0 3 20
and I'd like to do an operation like
df[df==0] = df.nonzero
so I'd get
col1 col2 col3 nonzero
1 2 10 10
1 20 3 20
This however does not work, as df == 0 is itself a DataFrame of True/False values. How can this be done?
One option is to use the apply method to loop through the rows of the data frame and replace zeros with the last element of each row:
df.apply(lambda row: row.where(row != 0, row.iat[-1]), axis=1)
You can also modify the data frame in place:
df[df == 0] = (df == 0).mul(df.nonzero, axis=0)
which yields the same result as above, with df modified in place. Here, (df == 0).mul(df.nonzero, axis=0) creates a data frame holding the values of the nonzero column where df is zero and zeros everywhere else; combined with boolean indexing and assignment, it conditionally overwrites the zero entries of the original data frame:
(df == 0).mul(df.nonzero, axis=0)
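To see both methods end to end on the frame from the question (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1], 'col2': [2, 0],
                   'col3': [0, 3], 'nonzero': [10, 20]})

# method 1: builds a new frame row by row
out = df.apply(lambda row: row.where(row != 0, row.iat[-1]), axis=1)

# method 2: modifies df in place via the mask product
df[df == 0] = (df == 0).mul(df['nonzero'], axis=0)

print(out.equals(df))  # True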
I am trying to do a groupby on the id column so that I can show the number of rows in col1 that are equal to 1.
df:
id col1 col2 col3
a 1 1 1
a 0 1 1
a 1 1 1
b 1 0 1
my code:
df.groupby(['id'])['col1'].count()[1]
The output I got was 2. It didn't show me the values for the other ids like b.
i want:
id col1
a 2
b 1
If possible, can the total number of rows per id also be displayed as a new column?
example:
id col1 total
a 2 3
b 1 1
Assuming you have only 1 and 0 in col1, you can use agg with named aggregation:
df.groupby('id')['col1'].agg(col1='sum', total='count').reset_index()
#   id  col1  total
# 0  a     2      3
# 1  b     1      1
It's because three of your rows have id 'a'. Two of them are identical, which is why they were grouped and counted as one, and then the remaining row, which has 0 in col1, was added. You can't group rows whose values differ.
Yes, you can add the total to your output; just apply the method you used to count all rows to the column section of your code.
If you want to generalize the solution to handle values in col1 other than 0 and 1, you can do the following. This also orders the columns correctly.
df.set_index('id')['col1'].eq(1).groupby(level=0).agg([('col1', 'sum'), ('total', 'count')]).reset_index()
id col1 total
0 a 2.0 3
1 b 1.0 1
Using a tuple in the agg method, where the first value is the column name and the second the aggregating function, is new to me. I was just experimenting and it seemed to work. I don't remember seeing it in the documentation, so use with caution.
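For what it's worth, the same renaming can also be written with named aggregation, which has been the documented spelling since pandas 0.25 (a sketch of the equivalent call):
out = (df.set_index('id')['col1'].eq(1)
         .groupby(level=0)
         .agg(col1='sum', total='count')
         .reset_index())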