"Is seen before" column for another column - python

Consider the following data frame:
a
0 1
1 1
2 2
3 4
4 5
5 6
6 4
Is there a convenient way (without iterating over rows) to create a column that represents "is seen before" for every value of column a?
For example, the desired output for the data above is (0 means not seen before, 1 means seen before):
0
1
0
0
0
0
1
If this is possible, is there a way to enhance it with counts of previous occurrences and not just binary indicator?

Should just be .duplicated() (see documentation). Then if you want to cast it to an integer for 0's and 1's instead of False and True you can use .astype(int) on the output:
From pd.DataFrame:
df.duplicated(subset="a").astype(int)
0 0
1 1
2 0
3 0
4 0
5 0
6 1
dtype: int32
From pd.Series:
df["a"].duplicated().astype(int)
0 0
1 1
2 0
3 0
4 0
5 0
6 1
Name: a, dtype: int32
This will mark the first time a value is "seen" as False, and all subsequent values that have already been "seen" as True. Coercing it to an int datatype via astype will change False -> 0 and True -> 1

Use assign and duplicated:
df.assign(seenbefore = lambda x: x.a.duplicated().astype(int))
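For the follow-up about counting earlier occurrences rather than just flagging them, one possible sketch (not taken from the answers above) uses groupby with cumcount, which numbers the rows of each group starting at 0. The column names prev_count and seen_before are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 4, 5, 6, 4]})

# cumcount numbers the rows within each group 0, 1, 2, ..., so it equals
# the number of earlier rows that hold the same value of "a".
df["prev_count"] = df.groupby("a").cumcount()

# The binary indicator is then simply "count greater than zero".
df["seen_before"] = (df["prev_count"] > 0).astype(int)

print(df)
```

On this data the binary column matches what duplicated() returns, so the two approaches can be used interchangeably when only the 0/1 flag is needed.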

Related

How to remove columns if the value of one specific row is 0

I have a fairly straightforward question, but I could not find the answer on Stack Overflow.
I have a pandas DataFrame:
Index A B C
0 1 1 0
1 0 0 0
2 1 1 1
3 0 0 1
I simply wish to remove all columns where the fourth row (3) is 0. So only column C would remain. Cheers.
Assuming "Index" is the index, you can use boolean indexing:
df2 = df.loc[:, df.iloc[3].ne(0)]
output:
C
0 0
1 0
2 1
3 1
output of df.iloc[3].ne(0):
A False
B False
C True
Name: 3, dtype: bool
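For reference, here is a self-contained version of the same idea, with the example frame rebuilt by hand. The variable name keep is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 1, 0],
                   "B": [1, 0, 1, 0],
                   "C": [0, 0, 1, 1]})

# Boolean Series indexed by column name: True where the fourth row is non-zero.
keep = df.iloc[3].ne(0)

# Select all rows, and only the columns flagged True.
df2 = df.loc[:, keep]
print(df2)
```

Note that iloc[3] picks the fourth row by position; if the row should be chosen by its index label instead, df.loc[3] would be the label-based equivalent.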

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the character at index 5 of each column name (the sixth character, which is the numeric part of the name), so the output should be,
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However when I replace it with column name, the above code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
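As a runnable sketch of the column-name variant, the factors can also be placed in a Series keyed by column name, so the multiplication aligns on labels rather than on position. The names suffixes, factors and result are illustrative, and this assumes every column name ends in ".<integer>".

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# Take the text after the last '.' in each column name and convert it to int.
suffixes = df.columns.str.split('.').str[-1].astype(int)

# Key each factor by its column name so that mul() aligns on column labels.
factors = pd.Series(suffixes.to_numpy(), index=df.columns)
result = df.mul(factors, axis='columns')

print(result)
```

Because the frame only holds 0 and 1, multiplying leaves the zeros alone and turns each 1 into the number from its column name.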
You could also multiply a boolean mask by the column-name characters (string multiplication) to place the strings wherever the condition holds, then use where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))

Fill a part of a Pandas Series with a value

I need to replace part of a pandas Series with a specific value, and I'm not sure how to go about it.
Here's my series:
(Pdb) alfa = pd.Series(0, index=[1,2,3,4,5,6])
(Pdb) alfa
1 0
2 0
3 0
4 0
5 0
6 0
dtype: int64
I'd like to do something like this:
(Pdb) alfa.fill([2,3,4], 5)
1 0
2 5
3 5
4 5
5 0
6 0
Any clues?
You would do
alfa[[2, 3, 4]] = 5
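or, if what you are dealing with always happens to be a contiguous range, you can slice (cf. the documentation on slicing ranges). A plain integer slice on a Series is positional, so 1:4 covers positions 1 through 3, which here are the labels 2, 3 and 4: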
alfa[1:4] = 5
You can use .loc:
alfa.loc[2:4] = 5
If you don't care about the actual value of the index, you can use .iloc:
alfa.iloc[1:4] = 5
Note: .loc will reference/set elements for indices between 2 and 4 inclusive.
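To make the difference between the two indexers concrete, here is a small sketch that applies each one to a fresh copy of the series from the question. The names by_label and by_position are illustrative.

```python
import pandas as pd

alfa = pd.Series(0, index=[1, 2, 3, 4, 5, 6])

# Label-based: covers labels 2, 3 and 4 (the end label is included).
by_label = alfa.copy()
by_label.loc[2:4] = 5

# Position-based: covers positions 1, 2 and 3 (the end position is excluded).
by_position = alfa.copy()
by_position.iloc[1:4] = 5

print(by_label)
print(by_position)
```

Both assignments touch the same three rows here only because the index starts at 1; with a different index the two slices would select different elements.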
Are you perhaps looking for replace instead of loc, iloc, or slicing?
Take a look at pandas.DataFrame.replace; Series.replace works the same way:
>>> s
0 0
1 1
2 2
3 3
4 4
dtype: int64
>>> s.replace([2,3,4], 5)
0 0
1 1
2 5
3 5
4 5
dtype: int64
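Note: replace matches on values, not on index labels. The result above lines up with positions 2 to 4 only because this series uses the default zero-based index and its values happen to equal its labels.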

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    idx2 = testdf['light'].isin([v])  # rows whose 'light' equals v
    df2 = testdf[idx2]
    print(df2.shape[0])  # number of matching rows
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
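The crosstab above still only lists values that occur in the data. One way to bring in the external list, sketched here as an assumption rather than as part of the answer, is to reindex the result against that list and fill the missing rows with 0. The names wanted and counts are illustrative.

```python
import pandas as pd

testdf = pd.DataFrame(
    {'light': ['b', 'b', 'c', 'a', 'a', 'a', 'a'],
     'injury': [1, 5, 5, 5, 2, 2, 4]},
    index=[1, 2, 3, 4, 5, 6, 9],
)

# External list of categories, including one ('d') absent from the data.
wanted = ['a', 'b', 'c', 'd']

# Cross-tabulate, then force the row index to match the external list;
# rows that do not exist in the data are created and filled with 0.
counts = pd.crosstab(testdf['light'], testdf['injury']).reindex(wanted, fill_value=0)

print(counts)
```

Reindexing also drops any row whose label is not in the list, so the output always has exactly the rows named in wanted, in that order.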

Python - Pandas: select first observation per group

I want to adapt my former SAS code to Python using the dataframe framework.
In SAS I often use this type of code (assume the data are sorted by group_id, which takes values 1 to 10, with multiple observations for each group_id):
data want;set have;
by group_id;
if first.group_id then c=1; else c=0;
run;
What happens here is that I flag the first observation of each id: the new variable c takes the value 1 on that row and 0 on the others. The dataset looks like this:
group_id c
1 1
1 0
1 0
2 1
2 0
2 0
3 1
3 0
3 0
How can I do this in Python using a dataframe? Assume that I start with the group_id vector only.
If you're using pandas 0.13+, you can use the groupby cumcount method:
In [11]: df
Out[11]:
group_id
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
In [12]: df.groupby('group_id').cumcount() == 0
Out[12]:
0 True
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
dtype: bool
You can force the dtype to be int rather than bool:
In [13]: df['c'] = (df.groupby('group_id').cumcount() == 0).astype(int)
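A self-contained version of the session above, for anyone who wants to run it directly. The second variant, based on duplicated(), is an alternative suggested here rather than part of the answer; the name alt is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"group_id": [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# cumcount is 0 on the first row of each group, so comparing with 0
# flags exactly the first observation per group_id.
df["c"] = (df.groupby("group_id").cumcount() == 0).astype(int)

# Alternative: a row is the first of its group when its group_id
# has not appeared earlier, i.e. when it is not a duplicate.
alt = (~df["group_id"].duplicated()).astype(int)

print(df)
```

Neither variant relies on the rows being sorted by group_id, unlike the SAS by-group step.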
