Get the minimum number for each name in the data frame - python

I have this data frame:
data = {'name': ['a', 'a', 'b', 'c', 'd', 'b', 'b', 'a', 'c'],
        'number': [32, 25, 9, 43, 8, 5, 11, 21, 0]}
and I want to get the minimum number for each name, ignoring rows where the number column is 0.
For my example, I want this result:
data = {'col1': ['a', 'b', 'c', 'd'],
        'col2': [21, 5, 43, 8]}
I don't want repeated names.
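For reference, a runnable version of that frame (a minimal sketch; it assumes pandas is imported as pd, as the answers below do):
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b', 'c', 'd', 'b', 'b', 'a', 'c'],
                   'number': [32, 25, 9, 43, 8, 5, 11, 21, 0]})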

IIUC, you can try:
df = df.mask(df.number.eq(0)).dropna().groupby('name', as_index=False).min()
OUTPUT:
  name  number
0    a    21.0
1    b     5.0
2    c    43.0
3    d     8.0

Try with sort_values + drop_duplicates
out = df.loc[df.number != 0].sort_values('number').drop_duplicates('name')
Out[24]:
  name  number
5    b       5
4    d       8
7    a      21
3    c      43
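If you want the rows back in name order like the desired output, one small follow-up (a sketch):
out = out.sort_values('name').reset_index(drop=True)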

Try:
df = df.query('number != 0')
df.loc[df.groupby('name')['number'].idxmin().tolist()]
Output:
  name  number
7    a      21
5    b       5
3    c      43
4    d       8

replace with groupby (note this assumes numpy is imported as np):
df.replace({"number": {0: np.nan}}).groupby("name", as_index=False)['number'].min()
  name  number
0    a    21.0
1    b     5.0
2    c    43.0
3    d     8.0
Cast it back to int if you want, using astype.
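For instance (a sketch; out is an assumed name for the result above):
out = df.replace({"number": {0: np.nan}}).groupby("name", as_index=False)['number'].min()
out['number'] = out['number'].astype(int)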

Combine Columns in Pandas

Let's say I have the following Pandas dataframe. It is what it is and the input can't be changed.
df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]
See how the columns have the same name? The output I want is to have columns with the same name combined (not summed or concatenated), meaning the second column 1 is added to the end of the first column 1, like so:
df2 = pd.DataFrame(np.array([['a', 'e'],
                             ['b', 'f'],
                             ['c', 'g'],
                             ['d', 'h'],
                             [1, 5],
                             [2, 6],
                             [3, 7],
                             [4, 8]]))
df2.columns = [1, 2]
How do I do this? I can do it manually, except I actually have like 10 column titles, about 100 iterations of each title, and several thousand rows, so it takes forever and I have to redo it with each new dataset.
EDIT: the columns in actual datasets are unequal in length.
Try with groupby and explode:
output = df1.groupby(level=0, axis=1).agg(lambda x: x.values.tolist()).explode(df1.columns.unique().tolist())
>>> output
   1  2
0  a  e
0  1  5
1  b  f
1  2  6
2  c  g
2  3  7
3  d  h
3  4  8
Edit:
To reorder the rows, you can do:
output = output.assign(order=output.groupby(level=0).cumcount()).sort_values("order",ignore_index=True).drop("order",axis=1)
>>> output
   1  2
0  a  e
1  b  f
2  c  g
3  d  h
4  1  5
5  2  6
6  3  7
7  4  8
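Note that groupby(..., axis=1) is deprecated in recent pandas releases (2.1+). A sketch of the same stacking that avoids it, selecting each set of same-named columns with a boolean mask (stacked is an illustrative helper name, not from the answer):
def stacked(name):
    block = df1.loc[:, df1.columns == name]  # always a DataFrame, even for a single match
    return pd.concat([block.iloc[:, i] for i in range(block.shape[1])],
                     ignore_index=True)

output = pd.concat({name: stacked(name) for name in df1.columns.unique()}, axis=1)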
Depending on the size of your data, you could split the data into a dictionary and then create a new data frame from that. Note that selecting a duplicated label such as df1[1] returns a DataFrame, so the values are transposed below to keep whole columns together instead of interleaving rows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['a', 1, 'e', 5],
                             ['b', 2, 'f', 6],
                             ['c', 3, 'g', 7],
                             ['d', 4, 'h', 8]]))
df1.columns = [1, 1, 2, 2]

dictionary = {}
for column in df1.columns.unique():
    items = []
    # .T.tolist() yields one list per same-named column, so the second
    # column 1 is appended after the first rather than interleaved
    for item in df1[column].values.T.tolist():
        items += item
    dictionary[column] = items
new_df = pd.DataFrame(dictionary)
print(new_df)
You can use a dictionary whose default value is a list and loop through the dataframe columns, using the column name as the dictionary key and extending the dictionary value with the column's values.
from collections import defaultdict

d = defaultdict(list)
for i, col in enumerate(df1.columns):
    d[col].extend(df1.iloc[:, i].values.tolist())
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
   1  2
0  a  e
1  b  f
2  c  g
3  d  h
4  1  5
5  2  6
6  3  7
7  4  8
For df1.columns = [1, 1, 2, 3], the output is
   1     2     3
0  a     e     5
1  b     f     6
2  c     g     7
3  d     h     8
4  1  None  None
5  2  None  None
6  3  None  None
7  4  None  None
If I understand correctly, this seems to work:
pd.concat([s.reset_index(drop=True) for _, s in df1.melt().groupby("variable")["value"]], axis=1)
Output:
  value value
0     a     e
1     b     f
2     c     g
3     d     h
4     1     5
5     2     6
6     3     7
7     4     8
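If you want the original column names back instead of the duplicated value labels, a variation (a sketch) feeds the group keys through a dict, which pd.concat turns into column names:
output = pd.concat({name: s.reset_index(drop=True)
                    for name, s in df1.melt().groupby("variable")["value"]},
                   axis=1)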

How to aggregate and groupby in pandas

I have the following df. I'd like to group by customer and then count and sum at the same time; I'd also like to add conditional aggregations (count and sum over only the rows where product == 'a').
Is there any way to achieve this?
customer product score
A        a       1
A        b       2
A        c       3
B        a       4
B        a       5
B        b       6
My desired result is the following:
customer count sum count(product=a) sum(product=a)
A        3     6   1                1
B        3     15  2                9
My work so far is like this:
grouped = df.groupby('customer')
grouped.agg({"product": "count", "score": "sum"})
Thanks
Let us try crosstab
s = pd.crosstab(df['customer'], df['product'], df['score'], margins=True, aggfunc=['sum', 'count']).drop('All')
Out[76]:
          sum                 count
product     a    b    c  All      a    b    c  All
customer
A         1.0  2.0  3.0    6    1.0  1.0  1.0    3
B         9.0  6.0  NaN   15    2.0  1.0  NaN    3
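To get exactly the four columns the question asks for, you can pull them out of the crosstab by label (a sketch; the MultiIndex labels follow the output above):
result = pd.DataFrame({'count': s[('count', 'All')],
                       'sum': s[('sum', 'All')],
                       'count(product=a)': s[('count', 'a')],
                       'sum(product=a)': s[('sum', 'a')]})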
import pandas as pd
df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'], 'product': ['a', 'b', 'c', 'a', 'a', 'b'], 'score': [1, 2, 3, 4, 5, 6]})
df = df[df['product'] == 'a']
grouped = df.groupby('customer')
grouped = grouped.agg({"product": "count", "score": "sum"}).reset_index()
print(grouped)
Output:
  customer  product  score
0        A        1      1
1        B        2      9
Then merge this dataframe with the unfiltered grouped dataframe.
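A sketch of that merge, lining up the unconditional and the product == 'a' aggregations side by side (overall, only_a and the _a suffix are illustrative names; df here is the original, unfiltered frame):
overall = df.groupby('customer').agg({"product": "count", "score": "sum"}).reset_index()
only_a = (df[df['product'] == 'a']
          .groupby('customer')
          .agg({"product": "count", "score": "sum"})
          .reset_index())
result = overall.merge(only_a, on='customer', suffixes=('', '_a'))
print(result)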

Get last observations from Pandas

Assuming the following dataframe:
  variable  value
0        A     12
1        A     11
2        B      4
3        A      2
4        B      1
5        B      4
I want to extract the last observation for each variable. In this case, it would give me:
  variable  value
3        A      2
5        B      4
How would you do this in the most pandas/pythonic way?
I'm not worried about performance. Clarity and conciseness are important.
The best way I came up with:
df = pd.DataFrame({'variable': ['A', 'A', 'B', 'A', 'B', 'B'], 'value': [12, 11, 4, 2, 1, 4]})
variables = df['variable'].unique()
new_df = df.drop(index=df.index)
for v in variables:
    new_df = new_df.append(df[df['variable'] == v].tail(1))
Use drop_duplicates
new_df = df.drop_duplicates('variable', keep='last')
Out[357]:
  variable  value
3        A      2
5        B      4
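An equivalent groupby-based sketch (not from the answer) that also keeps the original index:
new_df = df.groupby('variable').tail(1)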

Groupby, apply function to each row with shift, and create new column

I want to group by id, apply a function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})

def my_func(data):
    data['diff'] = data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0])
    return data

dat.groupby('id').apply(my_func)
Output
> print(dat)
  id   x  diff
0  a   4     0
1  a   8     4
2  a  12     4
3  b  25     0
4  b  30     5
5  b  50    20
Is there a more efficient way to do this?
You can use groupby(...).diff() for this, and after that fill the NaN values with zero, like the following:
dat['diff'] = dat.groupby('id').x.diff().fillna(0)
print(dat)
  id   x  diff
0  a   4   0.0
1  a   8   4.0
2  a  12   4.0
3  b  25   0.0
4  b  30   5.0
5  b  50  20.0
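If you want integer output like the question's example, a small extension (a sketch) casts after filling:
dat['diff'] = dat.groupby('id').x.diff().fillna(0).astype(int)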

Pandas - aggregate over inconsistent values types (string vs list)

Given the following DataFrame, I am trying to aggregate over columns 'A' and 'C': for 'A', count unique appearances of the strings, and for 'C', sum the values.
The problem arises when some of the entries in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
   ID          A   C
0   1          a   1
1   1          a   2
2   1          a  15
3   1          b   5
4   1  [b, c, d]  13
5   2          a   6
6   2          a   7
7   2  [a, b, c]   1
aggs = {'A': lambda x: x.nunique(dropna=True),
        'C': 'sum'}
# This will result in an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
    A   C
ID
1   4  36
2   3  14
This is because for ID == 1 we had 'a', 'b', 'c' and 'd', and for ID == 2 we had 'a', 'b' and 'c'.
One solution is to split your problem into two parts. First, flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain
import numpy as np

A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
#    A   C  ID
# 0  a   1   1
# 1  a   2   1
# 2  a  15   1
# 3  b   5   1
# 4  b  13   1
# 4  c  13   1
# 4  d  13   1
# 5  a   6   2
# 6  a   7   2
# 7  a   1   2
# 7  b   1   2
# 7  c   1   2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
#     A   C
# ID
# 1   4  36
# 2   3  14
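As an alternative to the manual flattening, DataFrame.explode (available since pandas 0.25) leaves scalar entries alone and unnests list entries, so step 1 can be a one-liner (a sketch, not from the original answer; C is still summed on the unflattened frame to avoid double counting):
exploded = df.explode('A')
agg_df = pd.concat([exploded.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)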
