Split pandas series by index names, integer index - python

I have a pandas series of this format, with multiple non-unique indexes (example):
index  value
num    1
0      2
num    3
0      4
and would like to split it into two series:
index  value        index  value
num    1            0      2
num    3            0      4
The order of the values has to be maintained as in the example (i.e. the order in which they appear). The first can just be obtained by
series.num
or
series['num']
Unfortunately this doesn't work for the second one, as those indexes are integers. Does anybody have a solution?

You can use .iloc[] with a boolean mask built from the index:
df1 = df.iloc[df.index == 'num']
df2 = df.iloc[df.index == 0]
This returns two objects, split by index value, with the original order preserved.
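For reference, a minimal self-contained sketch of the same idea on a Series (the toy data here just mirrors the example above; plain boolean indexing behaves the same as .iloc with a mask):
import pandas as pd

# Series mirroring the example: a repeated string label and a repeated
# integer label in the index.
s = pd.Series([1, 2, 3, 4], index=['num', 0, 'num', 0])

# Boolean masks built from the index; order of appearance is preserved.
s_num = s[s.index == 'num']   # num -> 1, 3
s_zero = s[s.index == 0]      # 0   -> 2, 4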

Related

How to compare two columns' values in pandas

I have a dataframe with unique IDs in two of the columns, e.g.
S.no.  Column1  Column2
1      00001x   00002x
2      00003j   00005k
3      00002x   00001x
4      00004d   00008e
The values can be any strings.
I want to compare the two columns in such a way that only one of the rows with S.no. 1 and 3 remains, as these IDs contain the same information, only in a different order.
Basically, if for one row the value in Column1 is X and Column2 is Y, and for another row Column1 is Y and Column2 is X, then only one of those rows should remain.
Is that possible in Python?
You can convert your columns to a frozenset per row.
This gives a common, order-insensitive representation to apply duplicated on.
Finally, slice the rows using the previous output as a mask:
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
df[~mask]
Previous version of this answer, using set:
mask = df.filter(like='Column').apply(lambda x: tuple(set(x)), axis=1).duplicated()
df[~mask]
NB: using set or sorted requires converting to a tuple (lambda x: tuple(sorted(x))), as duplicated hashes the values, which is not possible with mutable objects.
output:
   S.no. Column1 Column2
0      1  00001x  00002x
1      2  00003j  00005k
3      4  00004d  00008e
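For reference, a self-contained reproduction of the approach above, assuming the exact frame from the question:
import pandas as pd

df = pd.DataFrame({
    'S.no.': [1, 2, 3, 4],
    'Column1': ['00001x', '00003j', '00002x', '00004d'],
    'Column2': ['00002x', '00005k', '00001x', '00008e'],
})

# frozenset ignores order, so rows (X, Y) and (Y, X) hash identically;
# duplicated() then flags the later occurrence for removal.
mask = df.filter(like='Column').apply(frozenset, axis=1).duplicated()
print(df[~mask])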

Lambda function with groupby and condition issue

I want to count the unique values of column B for each unique value in column A, but only where the corresponding value in column C is > 0.
df:
A  B   C
1  10  0
1  12  3
2  3   1
I tried this, but it's missing the where clause to filter for C > 0. How do I add it?
df.groupby(['A'])['B'].apply(lambda b : b.astype(int).nunique())
Let's start by creating the dataframe that the OP mentions in the question:
import pandas as pd
df = pd.DataFrame({'A': [1,1,2], 'B': [10,12,3], 'C': [0,3,1]})
Now, in order to achieve what the OP wants, there are various options. One way is to select the rows of df where column C is greater than 0, then use pandas.DataFrame.groupby to group by column A, and finally use nunique to count the unique values of column B. In one line it looks like the following:
count = df[df['C'] > 0].groupby('A')['B'].nunique()
[Out]:
A
1    1
2    1
If one wants to sum the counts of unique items that satisfy the condition, from the series count, one can simply do
count = count.sum()
[Out]:
2
Assuming one wants to do everything in one line, one can use pandas.Series.sum as
count = df[df['C'] > 0].groupby('A')['B'].nunique().sum()
[Out]:
2
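As a side note, a stylistic alternative (a sketch, not part of the original answer) is to filter with DataFrame.query before grouping:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 12, 3], 'C': [0, 3, 1]})

# query filters rows by a boolean expression, equivalent to df[df['C'] > 0]
count = df.query('C > 0').groupby('A')['B'].nunique().sum()
print(count)  # 2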

How to re-index a pandas dataframe as multi-index when an index value repeats

I have an index in a pandas dataframe which repeats its values. I want to re-index it as a multi-index where the repeated indexes are grouped.
The index looks like this (screenshot omitted): the value 112335586 repeats over several rows, and I would like all the 112335586 index values to be grouped under the same index.
I have looked at the question Create pandas dataframe by repeating one row with new multiindex, but there the index values are pre-defined, which is not possible here as my dataframe is far too large to hard-code them.
I also looked at the multi-index documentation, but it also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10    1
10    2
20    3
20    4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10  0    1
    1    2
20  0    3
    1    4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
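A minimal usage sketch of that second approach, assuming a DataFrame built from the same toy data as above:
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4]},
                  index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))

df.reset_index(inplace=True)                         # EVENT_ID becomes a column
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()    # 0, 1 within each group
df.set_index(['EVENT_ID', 'sub_idx'], inplace=True)  # back to a MultiIndex
print(df)
#                   value
# EVENT_ID sub_idx
# 10       0            1
#          1            2
# 20       0            3
#          1            4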

New column with incremental numbers based on another column's value (pandas)

I want to add a column with incremental numbers for rows that have the same value in a defined column;
e.g. if I had this df
df=pd.DataFrame([['a','b'],['a','c'],['c','b']])
and I want incremental numbers based on the first column, it should look like this:
df=pd.DataFrame([['a','b',1],['a','c',2],['c','b',1]])
I found SQL solutions, but I'm working with IPython/pandas. Can someone help me?
Use cumcount; for the name of the new column, use the length of the original columns:
print (len(df.columns))
2
df[len(df.columns)] = df.groupby(0).cumcount() + 1
print (df)
   0  1  2
0  a  b  1
1  a  c  2
2  c  b  1
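Put together, a runnable version of the above:
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['a', 'c'], ['c', 'b']])

# cumcount numbers the rows within each group of equal values in column 0,
# starting at 0, hence the +1
df[len(df.columns)] = df.groupby(0).cumcount() + 1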

Pandas COUNTIF based on column value

I am trying to essentially do a COUNTIF in pandas to count how many items in a row match a number in the first column.
Dataframe:
a  b  c  d
1  2  3  1
2  3  4  2
3  5  6  3
So I want to count the instances in a row (b, c, d) that match a. In row 1, for instance, it should be 1, as only d matches a.
I have searched quite a bit for this, but so far I've only found examples comparing against a constant (like counting all values greater than 0), not against a dataframe column. I'm guessing it's some form of logic that masks based on the column, but df == df.a doesn't seem to work.
You can use eq, which accepts an axis parameter to specify the direction of the comparison; then take a row-wise sum to count the matched values (minus 1, since column a always matches itself):
df.eq(df.a, axis=0).sum(1) - 1
#0    1
#1    1
#2    1
#dtype: int64
An alternative is to apply the comparison per row:
df.apply(lambda x: (x == x[0]).sum() - 1, axis=1)
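A self-contained sketch combining both answers, assuming the frame from the question (the apply variant below uses the label 'a' instead of the positional x[0], which raises a deprecation warning in recent pandas):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 5],
                   'c': [3, 4, 6], 'd': [1, 2, 3]})

# eq with axis=0 compares every column against the Series df.a row by row;
# summing across columns counts the matches, -1 drops a matching itself.
print(df.eq(df.a, axis=0).sum(axis=1) - 1)

# Row-wise alternative with apply (slower, but reads literally).
print(df.apply(lambda row: (row == row['a']).sum() - 1, axis=1))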
