Mark highest documents with true - python

I have a dataframe with two columns: name and version.
I want to add a boolean in an extra column: True if the row has the highest version, otherwise False.
import pandas as pd
data = [['a', 1], ['b', 2], ['a', 2], ['a', 2], ['b', 4]]
df = pd.DataFrame(data, columns = ['name', 'version'])
df
Is it best to use groupby for this? I have tried something like this, but I do not know how to add the extra boolean column:
df.groupby(['name']).max()

Compare the original column against the per-group maxima from GroupBy.transform with 'max', which generates a new Series of maximal values aligned with the original index, so it can be compared directly with the original column:
df['bool_col'] = df['version'] == df.groupby('name')['version'].transform('max')
print(df)
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 True
3 a 2 True
4 b 4 True
Detail:
print(df.groupby('name')['version'].transform('max'))
0 2
1 4
2 2
3 2
4 4
Name: version, dtype: int64
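If you prefer to keep the whole comparison inside the groupby, an equivalent one-liner (a minimal sketch, producing the same bool_col as above) is:
df['bool_col'] = df.groupby('name')['version'].transform(lambda s: s == s.max())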

You can assign your column directly by comparing against the overall maximum (rather than the per-group maximum):
df['bool_col'] = df['version'] == max(df['version'])
Output:
name version bool_col
0 a 1 False
1 b 2 False
2 a 2 False
3 a 2 False
4 b 4 True
Is this what you were looking for?

Related

How to broadcast-map by columns from one dataframe to another?

I'd like to broadcast or expand a dataframe column-wise from a smaller index to a larger index based on a mapping specification. I have the following example (please excuse small mistakes, as it is untested):
import pandas as pd
# my broadcasting mapper spec
mapper = pd.Series(data=['a', 'b', 'c'], index=[1, 2, 2])
# my data
df = pd.DataFrame(data={1: [3, 4], 2: [5, 6]})
print(df)
# 1 2
# --------
# 0 3 5
# 1 4 6
# and I would like to get
df2 = ...
print(df2)
# a b c
# -----------
# 0 3 5 5
# 1 4 6 6
Simply mapping the columns will not work, as there are duplicates; I would instead like to expand to the new values as defined in mapper:
# this will of course not work => raises InvalidIndexError
df.columns = df.columns.as_series().map(mapper)
A naive approach would just iterate the spec ...
df2 = pd.DataFrame(index=df.index)
for i, v in mapper.items():  # iterate the spec: old column -> new column
    df2[v] = df[i]
Use reindex and set_axis:
out = df.reindex(columns=mapper.index).set_axis(mapper, axis=1)
Output:
a b c
0 3 5 5
1 4 6 6
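Putting it together, a small self-contained sketch using mapper and df exactly as defined in the question:
import pandas as pd

mapper = pd.Series(data=['a', 'b', 'c'], index=[1, 2, 2])
df = pd.DataFrame(data={1: [3, 4], 2: [5, 6]})

# duplicate the source columns according to mapper.index, then relabel them with mapper's values
out = df.reindex(columns=mapper.index).set_axis(mapper, axis=1)
print(out)
#    a  b  c
# 0  3  5  5
# 1  4  6  6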
You can use pd.concat + df.get:
pd.concat({v: df.get(k) for k, v in mapper.items()}, axis=1)
a b c
0 3 5 5
1 4 6 6

Count NA and none-NA per group in pandas

I assume this is a simple task for pandas but I don't get it.
I have data like this:
Group Val
0 A 0
1 A 1
2 A <NA>
3 A 3
4 B 4
5 B <NA>
6 B 6
7 B <NA>
And I want to know the frequency of valid and invalid values in Val per group Group. This is the expected result.
A B Total
Valid 3 2 5
NA 1 2 3
Here is code to generate that sample data.
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame({
    'Group': list('AAAABBBB'),
    'Val': range(8)
})
# some values to NA
for idx in [2, 5, 7]:
    df.iloc[idx, 1] = pd.NA
print(df)
What I tried is something with grouping
>>> df.groupby('Group').agg(lambda x: x.isna())
Val
Group
A [False, False, True, False]
B [False, True, False, True]
>>> df.groupby('Group').apply(lambda x: x.isna())
Group Val
0 False False
1 False False
2 False True
3 False False
4 False False
5 False True
6 False False
7 False True
You are close with using groupby and isna
new = df.groupby(['Group', df['Val'].isna().replace({True: 'NA', False: 'Valid'})])['Group'].count().unstack(level=0)
new['Total'] = new.sum(axis=1)
print(new)
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
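A closely related variant (a sketch of the same idea, counting with value_counts instead of count):
# map NA/Valid first, then count per group and pivot the groups into columns
status = df['Val'].isna().map({True: 'NA', False: 'Valid'})
new = status.groupby(df['Group']).value_counts().unstack(level=0)
new['Total'] = new.sum(axis=1)
print(new)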
here is one way to do it
# use a crosstab to summarize
# convert Val to NA or Valid depending on the value
df2 = (pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                   df['Group'])
       .reset_index()
       .rename_axis(columns=None))
df2['Total'] = df2.sum(axis=1, numeric_only=True)  # add Total column
out = df2.set_index('Val')  # set index to match expected output
out
A B Total
Val
NA 1 2 3
Valid 3 2 5
If you need both row and column totals, it's even simpler with crosstab:
df2 = pd.crosstab(df['Val'].isna().map({True: 'NA', False: 'Valid'}),
                  df['Group'],
                  margins=True, margins_name='Total')
Group A B Total
Val
NA 1 2 3
Valid 3 2 5
Total 4 4 8
Another possible solution, based on pandas.pivot_table and on the following ideas:
Add a new column, status, which contains NA or Valid if the corresponding value is or is not NaN, respectively.
Create a pivot table, using len as aggregation function.
Add the Total column, by summing by rows.
import numpy as np

(df.assign(status=np.where(df['Val'].isna(), 'NA', 'Valid'))
   .pivot_table(index='status', columns='Group', values='Val',
                aggfunc=lambda x: len(x))
   .reset_index()
   .rename_axis(None, axis=1)
   .assign(Total=lambda x: x.sum(axis=1, numeric_only=True)))
Output:
status A B Total
0 NA 1 2 3
1 Valid 3 2 5

Finding elements in a pandas dataframe

I have a pandas dataframe which looks like the following:
0 1
0 2
2 3
1 4
What I want to do is the following: if I get 2 as input, my code is supposed to search for 2 in the dataframe, and when it finds it, return the value of the other column. In the above example my code would return 0 and 3. I know that I can simply look at each row and check if any of the elements is equal to 2, but I was wondering if there is a one-liner for such a problem.
UPDATE: None of the columns are index columns.
Thanks
>>> df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1,2,3,4]})
>>> df
A B
0 0 1
1 0 2
2 2 3
3 1 4
The following pandas syntax is equivalent to the SQL SELECT B FROM df WHERE A = 2
>>> df[df['A'] == 2]['B']
2 3
Name: B, dtype: int64
There's also pandas.DataFrame.query:
>>> df.query('A == 2')['B']
2 3
Name: B, dtype: int64
You may need this:
n_input = 2
df[(df == n_input).any(axis=1)].stack()[lambda x: x != n_input].unique()
# array([0, 3])
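A more explicit (if less compact) sketch that checks each column and collects the values from the other one:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1, 2, 3, 4]})
n_input = 2

# rows where A matches contribute their B value, and vice versa
matches = np.concatenate([df.loc[df['A'] == n_input, 'B'].to_numpy(),
                          df.loc[df['B'] == n_input, 'A'].to_numpy()])
print(np.unique(matches))  # [0 3]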
df = pd.DataFrame({'A': [0, 0, 2, 1], 'B': [1, 2, 3, 4]})
t = df.loc[lambda d: d['A'] == 2, 'B']
t

pandas - Going from aggregated format to long format

If I wanted to go from a long format to a grouped, aggregated format, I would simply do:
s = pd.DataFrame(['a','a','a','a','b','b','c'], columns=['value'])
s.groupby('value').size()
value
a 4
b 2
c 1
dtype: int64
Now if I wanted to revert that aggregation and go from a grouped format to a long format, how would I go about doing that? I guess I could loop through the grouped series and repeat 'a' 4 times and 'b' 2 times etc.
Is there a better way to do this in pandas or any other Python package?
Thankful for any hints
Perhaps .transform can help with this:
s.set_index('value', drop=False, inplace=True)
s['size'] = s.groupby(level='value')['value'].transform('size')
s.reset_index(inplace=True, drop=True)
s
yielding:
value size
0 a 4
1 a 4
2 a 4
3 a 4
4 b 2
5 b 2
6 c 1
Another and rather simple approach is to use np.repeat (assuming s2 is the aggregated series):
In [17]: np.repeat(s2.index.values, s2.values)
Out[17]: array(['a', 'a', 'a', 'a', 'b', 'b', 'c'], dtype=object)
In [18]: pd.DataFrame(np.repeat(s2.index.values, s2.values), columns=['value'])
Out[18]:
value
0 a
1 a
2 a
3 a
4 b
5 b
6 c
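A pandas-native variant of the same idea (a sketch, again assuming s2 = s.groupby('value').size()) uses Index.repeat:
s2 = s.groupby('value').size()

# repeat each index label by its count and rebuild the long-format frame
long_df = pd.DataFrame({'value': s2.index.repeat(s2)})
print(long_df)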
There might be something cleaner, but here's an approach. First, store your groupby results in a dataframe and rename the columns.
agg = s.groupby('value').size().reset_index()
agg.columns = ['key', 'count']
Then, build a frame with columns that track the count for each letter.
counts = agg['count'].apply(lambda x: pd.Series([0] * x))
counts['key'] = agg['key']
In [107]: counts
Out[107]:
0 1 2 3 key
0 0 0 0 0 a
1 0 0 NaN NaN b
2 0 NaN NaN NaN c
Finally, this can be melted and nulls dropped to get your desired frame.
In [108]: pd.melt(counts, id_vars='key').dropna()[['key']]
Out[108]:
key
0 a
1 b
2 c
3 a
4 b
6 a
9 a

Find missing data in pandas dataframe and fill with NA

I have a dataframe in pandas with company name and date as multi-index.
companyname date emp1 emp2 emp3..... emp80
Here emp1, emp2, etc. are the counts of phone calls made by each employee on that date. There are dates when no employee made a call, which means there are rows where all the column values are 0. I want to fill these values with NA.
Do I have to manually write the names of all the columns in some function? Any suggestions on how to achieve this?
You can check that the entire row is 0 with all:
In [11]: df = pd.DataFrame([[1, 2], [0, 4], [0, 0], [7, 8]])
In [12]: df
Out[12]:
0 1
0 1 2
1 0 4
2 0 0
3 7 8
In [13]: (df == 0).all(1)
Out[13]:
0 False
1 False
2 True
3 False
dtype: bool
Now you can assign all the entries in these rows to NaN using loc:
In [14]: df.loc[(df == 0).all(1)] = np.nan
In [15]: df
Out[15]:
0 1
0 1 2
1 0 4
2 NaN NaN
3 7 8
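To tie this back to the question's layout, here is a hedged sketch with a hypothetical companyname/date MultiIndex and emp columns; note that no column names need to be spelled out:
import numpy as np
import pandas as pd

# hypothetical data resembling the question's frame
idx = pd.MultiIndex.from_product(
    [['acme'], pd.date_range('2024-01-01', periods=3)],
    names=['companyname', 'date'])
calls = pd.DataFrame({'emp1': [2, 0, 1], 'emp2': [0, 0, 3]}, index=idx)

# set every column to NaN on the dates where no employee made a call
calls.loc[(calls == 0).all(axis=1)] = np.nan
print(calls)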
