I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:
import pandas as pd
# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]
df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df
Returns:
col1 col2
1 1 0
2 2 0
3 3 0
4 1 0
5 2 1
6 3 1
17 3 1
18 4 1
91 1 0
104 3 1
Using .loc, I can collect by index:
rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]
Returns:
col1 col2
5 2 1
6 3 1
17 3 1
Or I can select by column values:
df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returning:
col1 col2
5 2 1
6 3 1
17 3 1
104 3 1
However, I cannot do both. When I try the following:
df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returns:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.
Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).
The reason for the IndexingError is that you're calling df.loc with two indexers of different sizes: rel_index has a length of 3, whereas df['col1'].isin(relc1) has a length of 10 (and, with the comma, that boolean mask is being applied to the columns rather than the rows).
You need the index condition to also be a boolean array of length 10. If you look at the output of df['col1'].isin(relc1), it is an array of booleans, one per row.
You can get a boolean array of the proper length by replacing df.loc[rel_index] with df.index.isin(rel_index), written out below as df.index.isin([5,6,17]),
so you end up with:
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]
which returns:
col1 col2
5 2 1
6 3 1
17 3 1
That said, I'm not sure why your index would ever look like this. Typically, when slicing by integer position you would use df.iloc, with the index in the default 0, 1, 2, ... format.
Alternatively, you could first filter by value, then assign the resulting dataframe to a new variable df2:
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
then df2.loc[rel_index] would work without issue (though note that .loc raises a KeyError if any label in rel_index has been filtered out).
As for your overall goal, you can simply do the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
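Note that this works because .loc aligns the assigned DataFrame on both index and columns, so df2's rows land on labels 5, 6 and 17 regardless of their order in rel_index. Checking the result (a quick sketch continuing the session above):
df.loc[rel_index]
#    col1  col2
#5      1     5
#6      2     6
#17     3     7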
@Rexovas explains it quite well; this is an alternative where you can compute the filters on the index before assigning. It is a bit long and involves MultiIndex, but once you get your head around MultiIndex, it should be intuitive:
(df
 # move col1 and col2 into the index
 .set_index(['col1', 'col2'], append=True)
 # filter rows on all three index levels at once
 .loc(axis=0)[rel_index, relc1, relc2]
 # move col1 and col2 back out into columns
 .reset_index(level=[-2, -1])
 # assign the new values
 .assign(col1=c3, col2=c4)
)
col1 col2
5 1 5
6 2 6
17 3 7
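If the .loc(axis = 0) call looks unfamiliar, the same selection can be written with pd.IndexSlice, the documented equivalent for slicing several MultiIndex levels at once (a sketch on the frame after set_index):
idx = pd.IndexSlice
df.set_index(['col1', 'col2'], append=True).loc[idx[rel_index, relc1, relc2], :]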
I am trying to split a dataframe column into multiple columns as under:
There are three columns overall. Two should rename in the new dataframe, while the third one to be split into new columns.
Split is to be done using a specific character (say ":")
The column that requires split can have varied number of ":" split. So new columns can be different for different rows, leaving some column values as NULL for some rows. That is okay.
Each subsequently formed column has a specific name. Max number of columns that can be formed is known.
There are four dataframes. Each one has this same formatted column that has to be split.
I came across the following solutions, but they don't work for the reasons mentioned below:
Link
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
This creates columns named 0, 1, 2, and so on; I need the new columns to have specific names.
Link
df = df.apply(lambda x:pd.Series(x))
This made no change to the dataframe, and I couldn't understand why.
Link
df['command'], df['value'] = df[0].str.split().str
Here the columns are named properly, but this requires knowing beforehand how many columns will be formed, and in my case that varies per dataframe. Within one dataframe, the split successfully puts NULL values in the extra columns for short rows, but using the same code on another dataframe generates an error saying the number of keys should be the same.
I couldn't post comments on those answers as I am new to this community. I would appreciate it if someone could help me understand how to achieve my objective: dynamically use the same code to split one column into many for different dataframes on multiple occasions, while renaming the newly generated columns to predefined names.
For example:
Dataframe 1:
Col1 Col2 Col3
0 A A:B:C A
1 A A:B:C:D:E A
2 A A:B A
Dataframe 2:
Col1 Col2 Col3
0 A A:B:C A
1 A A:B:C:D A
2 A A:B A
Output should be:
New dataframe 1:
Col1 ColA ColB ColC ColD ColE Col3
0 A A B C NaN NaN A
1 A A B C D E A
2 A A B NaN NaN NaN A
New dataframe 2:
Col1 ColA ColB ColC ColD ColE Col3
0 A A B C NaN NaN A
1 A A B C D NaN A
2 A A B NaN NaN NaN A
(If ColE is not there, then also it is fine.)
After this, I will be concatenating these dataframes into one, where I will need counts of all ColA to ColE for individual dataframes against Col1 and Col3 combinations. So, we need to keep this in mind.
You can do it this way:
columns = df.Col2.max().split(':')
#['A', 'B', 'C', 'D', 'E']
new = df.Col2.str.split(":", expand = True)
new.columns = columns
new = new.add_prefix("Col")
df.join(new).drop(columns="Col2")
# Col1 Col3 ColA ColB ColC ColD ColE
#0 A A A B C None None
#1 A A A B C D E
#2 A A A B None None None
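Note that df.Col2.max() only finds the right header row here because the lexicographically largest string also happens to have the most parts. If that assumption is shaky for your data, you can size the columns from the split itself and apply your predefined names (the names below are illustrative):
new = df.Col2.str.split(":", expand=True)
names = ['ColA', 'ColB', 'ColC', 'ColD', 'ColE']  # known maximum set of names
new.columns = names[:new.shape[1]]                # only as many as actually appeared
df.join(new).drop(columns="Col2")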
I am trying to do a groupby on the id column so that I can show the number of rows in col1 that are equal to 1.
df:
id col1 col2 col3
a 1 1 1
a 0 1 1
a 1 1 1
b 1 0 1
My code:
df.groupby(['id'])['col1'].count()[1]
The output I got was 2. It didn't show me the values from other ids, like b.
I want:
id col1
a 2
b 1
If possible, can the total rows per id also be displayed as a new column?
Example:
id col1 total
a 2 3
b 1 1
Assuming you have only 1 and 0 in col1, you can use agg with named aggregation:
df.groupby('id', as_index=False)['col1'].agg(col1='sum', total='count')
# id col1 total
#0 a 2 3
#1 b 1 1
The 2 you got comes from the id 'a' group alone: it has 3 rows, but only 2 of them have 1 in col1, and indexing the result with [1] picks out a single value instead of showing every id, which is why b never appears.
Yes, you can add the total to your output: just include a method that counts all the rows per group alongside the one that counts the matches.
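A minimal sketch of that idea: count the matches and the total rows per id separately, and let index alignment join them (variable names are illustrative):
out = df['col1'].eq(1).groupby(df['id']).sum().to_frame('col1')
out['total'] = df.groupby('id').size()  # aligns on the id index
out.reset_index()
#   id  col1  total
# 0  a     2      3
# 1  b     1      1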
If you want to generalize the solution to col1 values other than just 1 and 0, you can do the following. This also orders the columns correctly.
df.set_index('id')['col1'].eq(1).groupby(level=0).agg([('col1', 'sum'), ('total', 'count')]).reset_index()
id col1 total
0 a 2.0 3
1 b 1.0 1
Using a tuple in the agg method where the first value is the column name and the second the aggregating function is new to me. I was just experimenting and it seemed to work. I don't remember seeing it in the documentation so use with caution.
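For what it's worth, pandas 0.25 introduced named aggregation as the documented way to name the outputs; a sketch of the same computation in that style:
df.set_index('id')['col1'].eq(1).groupby(level=0).agg(col1='sum', total='count').reset_index()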
I have this dataframe
col1 col2 col3
0 2 A 1
1 1 A 100
2 3 B 12
3 4 B 2
I want to select the highest col1 value from all with A, then the one from all with B, etc, i.e. this is the desired output
col1 col2 col3
0 2 A 1
3 4 B 2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
col2 col1 col3
0 A 2 100
1 B 4 12
Use groupby('col2'), then use idxmax to get the index of the max col1 value within each group. Finally, use these index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
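On the sample frame above, that line keeps rows 0 and 3 with their original labels intact:
df.loc[df.groupby('col2').col1.idxmax()]
#    col1 col2  col3
# 0     2    A     1
# 3     4    B     2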
I have the following input:
col1 col2 col3
1 4 0
0 12 2
2 12 4
3 2 1
I want to sort the DataFrame according to the values in the columns, e.g. sorting it primarily for df[df==0].count() and secondarily for df.sum() would produce the output:
col2 col3 col1
4 0 1
12 2 0
12 4 2
2 1 3
pd.DataFrame.sort() takes a column object as argument, which does not apply here, so how can I achieve this?
Firstly, I think your zero count is increasing from left to right whereas your sum is decreasing, so you need to clarify which direction each key should be sorted in. You can get the number of zeros in each column simply with (df == 0).sum().
To sort by a single aggregate, you can do something like:
col_order = (df == 0).sum().sort_values().index
df[col_order]
This sorts the series of aggregates by its values and the resulting index is the columns of df in the order you want.
Sorting on two sets of values is more awkward, but you could do something like:
aggs = pd.DataFrame({'zero_count': (df == 0).sum(), 'sum': df.sum()})
col_order = aggs.sort_values(['zero_count', 'sum']).index
df[col_order]
Note that sort_values takes an ascending parameter which accepts either a single Boolean or a list of Booleans of the same length as the number of sort keys, e.g.
df.sort_values(['a', 'b'], ascending=[True, False])
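To make the ascending parameter concrete on your sample frame (a sketch; ascending zero count with descending sum reproduces your desired output):
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 2, 3],
                   'col2': [4, 12, 12, 2],
                   'col3': [0, 2, 4, 1]})
# one row per column of df: its zero count and its sum
aggs = pd.DataFrame({'zero_count': (df == 0).sum(), 'sum': df.sum()})
col_order = aggs.sort_values(['zero_count', 'sum'], ascending=[True, False]).index
df[col_order]
#    col2  col3  col1
# 0     4     0     1
# 1    12     2     0
# 2    12     4     2
# 3     2     1     3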