Create equally divided ID per group with min and max using Pandas

I have the following dataframe (sample):
import pandas as pd
min_id = 1
max_id = 10
data = [['A', 2], ['A', 3], ['A', 1], ['A', 4], ['A', 4], ['A', 2],
['B', 4], ['B', 5], ['B', 7], ['B', 4], ['B', 2],
['C', 1], ['C', 3], ['C', 2], ['C', 1], ['C', 5], ['C', 2] ,['C', 1],
['D', 1], ['D', 1], ['D', 1], ['D', 1]]
df = pd.DataFrame(data = data, columns = ['group', 'val'])
group val
0 A 2
1 A 3
2 A 1
3 A 4
4 A 4
5 A 2
6 B 4
7 B 5
8 B 7
9 B 4
10 B 2
11 C 1
12 C 3
13 C 2
14 C 1
15 C 5
16 C 2
17 C 1
18 D 1
19 D 1
20 D 1
21 D 1
I would like to create a column called "id" whose values run from a minimum of 1 (min_id) to a maximum of 10 (max_id) within each group, so the values between min and max depend on the number of rows in that group. Here is the desired output:
data = [['A', 2, 1], ['A', 3, 2.8], ['A', 1, 4.6], ['A', 4, 6.4], ['A', 4, 8.2], ['A', 2, 10],
['B', 4, 1], ['B', 5, 3.25], ['B', 7, 5.5], ['B', 4, 7.75], ['B', 2, 10],
['C', 1, 1], ['C', 3, 2.5], ['C', 2, 4], ['C', 1, 5.5], ['C', 5, 7], ['C', 2, 8.5] ,['C', 1, 10],
['D', 1, 1], ['D', 1, 4], ['D', 1, 7], ['D', 1, 10]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'val', 'id'])
group val id
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
So I was wondering if anyone knows how to automatically create the "id" column using pandas? Please note that the number of rows could be way more than in the sample dataframe.

Use GroupBy.transform with numpy.linspace:
import numpy as np

# evenly space len(x) values between min_id and max_id within each group
df['ID'] = df.groupby('group')['group'].transform(lambda x: np.linspace(min_id, max_id, len(x)))
print(df)
group val ID
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
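For larger inputs, a fully vectorized variant without the Python-level lambda may be worth trying. This is only a sketch under the same assumptions as above (df, min_id and max_id defined as in the question):
import numpy as np

# rows per group and 0-based position of each row within its group
n = df.groupby('group')['group'].transform('size')
pos = df.groupby('group').cumcount()

# spread positions evenly between min_id and max_id; the clip avoids a
# division by zero for single-row groups (they simply get min_id)
df['ID'] = min_id + (max_id - min_id) * pos / (n - 1).clip(lower=1)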

Related

Shaping a Pandas DataFrame (multiple columns into 2)

I have a dataframe similar to the one below and need it reshaped as per the expected output.
df = pd.DataFrame({
'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
'col2': [1, 3, 5, 7, 9, 11],
'col3': [2, 4, 6, 8, 10, 12]
})
col1 col2 col3
0 A 1 2
1 A 3 4
2 A 5 6
3 B 7 8
4 B 9 10
5 B 11 12
Expected Output
df_expected = pd.DataFrame({
'A': [1, 2, 3, 4, 5, 6],
'B': [7, 8, 9, 10, 11, 12]
})
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
So far I have tried stack, unstack & pivot without getting the desired result.
Thanks for your help!
pd.DataFrame(df.groupby('col1').agg(list).T.sum().to_dict())
Use NumPy to reshape the data, then package it back up into a DataFrame.
import numpy as np

# pair up col2/col3 row-wise, then reshape into one row per group label
cols = (df['col2'], df['col3'])
data = np.stack(cols, axis=1).reshape(len(cols), len(df))
dft = pd.DataFrame(data, index=df['col1'].unique()).T
print(dft)
Result
A B
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
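If you prefer to stay in pandas, a minimal alternative sketch (assuming every group has the same number of rows, as in the sample) is to flatten each group's (col2, col3) rows in order:
out = pd.DataFrame({g: sub[['col2', 'col3']].to_numpy().ravel()
                    for g, sub in df.groupby('col1')})
print(out)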

Pandas: moving data from two dataframes to another with tuple index

I have three dataframes like the following:
final_df
other ref
(2014-12-24 13:20:00-05:00, a) NaN NaN
(2014-12-24 13:40:00-05:00, b) NaN NaN
(2018-07-03 14:00:00-04:00, d) NaN NaN
ref_df
a b c d
2014-12-24 13:20:00-05:00 1 2 3 4
2014-12-24 13:40:00-05:00 2 3 4 5
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 9 10 11 12
2019-07-03 13:10:00-04:00 ..............
other_df
a b c d
2014-12-24 13:20:00-05:00 10 20 30 40
2014-12-24 13:40:00-05:00 20 30 40 50
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:20:00-04:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 90 100 110 120
2019-07-03 13:10:00-04:00 ..............
And I need to replace the NaN values in my final_df with values from the related dataframes, so that it looks like this:
other ref
(2014-12-24 13:20:00-05:00, a) 10 1
(2014-12-24 13:40:00-05:00, b) 30 3
(2018-07-03 14:00:00-04:00, d) 110 11
How can I achieve this?
pandas.DataFrame.lookup
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
map and get
For when you have missing bits
final_df['ref'] = list(map(ref_df.stack().get, final_df.index))
final_df['other'] = list(map(other_df.stack().get, final_df.index))
Demo
Setup
idx = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'd')])
final_df = pd.DataFrame(index=idx, columns=['other', 'ref'])
ref_df = pd.DataFrame([
[ 1, 2, 3, 4],
[ 2, 3, 4, 5],
[ 9, 10, 11, 12]
], [1, 2, 3], ['a', 'b', 'c', 'd'])
other_df = pd.DataFrame([
[ 10, 20, 30, 40],
[ 20, 30, 40, 50],
[ 90, 100, 110, 120]
], [1, 2, 3], ['a', 'b', 'c', 'd'])
print(final_df, ref_df, other_df, sep='\n\n')
other ref
1 a NaN NaN
2 b NaN NaN
3 d NaN NaN
a b c d
1 1 2 3 4
2 2 3 4 5
3 9 10 11 12
a b c d
1 10 20 30 40
2 20 30 40 50
3 90 100 110 120
Result
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
final_df
other ref
1 a 10 1
2 b 30 3
3 d 120 12
Another solution that can work with missing dates in ref_df and other_df:
index = pd.MultiIndex.from_tuples(final_df.index)
ref = ref_df.stack().rename('ref')
other = other_df.stack().rename('other')
result = pd.DataFrame(index=index).join(ref).join(other)
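Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions, a hedged drop-in replacement is a stacked-Series reindex, assuming final_df has a MultiIndex as in the demo setup:
final_df['ref'] = ref_df.stack().reindex(final_df.index).to_numpy()
final_df['other'] = other_df.stack().reindex(final_df.index).to_numpy()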

Performing outer join that merges joined columns

I am performing an outer join on two DataFrames:
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'date': [4, 5, 6, 7, 8],
'str': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 6],
'date': [4, 5, 6, 7, 8],
'str': ['A', 'B', 'C', 'D', 'Q']})
pd.merge(df1, df2, on=["id","date"], how="outer")
This gives the result
date id str_x str_y
0 4 1 a A
1 5 2 b B
2 6 3 c C
3 7 4 d D
4 8 5 e NaN
5 8 6 NaN Q
Is it possible to perform the outer join such that the str columns are concatenated? In other words, how do I perform the join so that we get the DataFrame
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
where all NaN have been set to None.
I think not; a possible solution is to replace the NaNs and join the columns together:
df = (pd.merge(df1, df2, on=["id", "date"], how="outer", suffixes=('', '_'))
        .assign(str=lambda x: x['str'].fillna('') + x['str_'].fillna(''))
        .drop(columns='str_'))
Similar alternative:
df = (pd.merge(df1, df2, on=["id", "date"], how="outer", suffixes=('', '_'))
        .assign(str=lambda x: x.filter(like='str').fillna('').values.sum(axis=1))
        .drop(columns='str_'))
print (df)
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
If the ('id', 'date') pair is unique in each dataframe, then you can set the index and add the dataframes.
icols = ['date', 'id']
df1.set_index(icols).add(df2.set_index(icols), fill_value='').reset_index()
date id str
0 4 1 aA
1 5 2 bB
2 6 3 cC
3 7 4 dD
4 8 5 e
5 8 6 Q
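A hedged variation on the same idea: concatenate the two frames and join the str values per (date, id) pair with a groupby aggregation (within each group, df1's value comes before df2's):
df = (pd.concat([df1, df2])
        .groupby(['date', 'id'])['str']
        .agg(''.join)
        .reset_index())
print(df)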

groupby common values in two columns

I need to extract a common max value from pairs of rows that share values across two columns.
The commonality is between the values in columns A and B: rows 0 and 1 form a pair, rows 2 and 3 form a pair, and row 4 is on its own.
import pandas as pd

f = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
                 columns=['A', 'B', 'Value'])
f
A B Value
0 1 2 30
1 2 1 20
2 2 6 15
3 6 2 70
4 7 10 35
The goal is to extract max values, so the end result is:
f_final = pd.DataFrame([[1, 2, 30, 30], [2, 1, 20, 30], [2, 6, 15, 70], [6, 2, 70, 70], [7, 10, 35, 35]],
                       columns=['A', 'B', 'Value', 'Max'])
f_final
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
I could do this if there is a way to assign a common, non-repeating key:
f_key = pd.DataFrame([[1, 1, 2, 30], [1, 2, 1, 20], [2, 2, 6, 15], [2, 6, 2, 70], [3, 7, 10, 35]],
                     columns=['key', 'A', 'B', 'Value'])
f_key
key A B Value
0 1 1 2 30
1 1 2 1 20
2 2 2 6 15
3 2 6 2 70
4 3 7 10 35
Following up with the groupby and transform:
f_key['Max'] = f_key.groupby('key')['Value'].transform('max')
f_key.drop(columns='key', inplace=True)
f_key
A B Value Max
0 1 2 30 30
1 2 1 20 30
2 2 6 15 70
3 6 2 70 70
4 7 10 35 35
Question 1:
How would one assign this common key?
Question 2:
Is there a better way of doing this, skipping the common key step?
Cheers...
You could sort the values in columns A and B so that for each row the value in A is less than or equal to the value in B. Once the values are ordered, then you could apply groupby-transform-max as usual:
import pandas as pd
df = pd.DataFrame([[1, 2, 30], [2, 1, 20], [2, 6, 15], [6, 2, 70], [7, 10, 35]],
columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
print(df)
yields
A B Value Max
0 1 2 30 30
1 1 2 20 30
2 2 6 15 70
3 2 6 70 70
4 7 10 35 35
The above method will still work even if the values in A and B are strings. For example,
df = pd.DataFrame([['ab', 'ac', 30], ['ac', 'ab', 20],
                   ['cb', 'ca', 15], ['ca', 'cb', 70],
                   ['ff', 'zz', 35]], columns=['A', 'B', 'Value'])
mask = df['A'] > df['B']
df.loc[mask, ['A','B']] = df.loc[mask, ['B','A']].values
df['Max'] = df.groupby(['A', 'B'])['Value'].transform('max')
yields
In [267]: df
Out[267]:
A B Value Max
0 ab ac 30 30
1 ab ac 20 30
2 ca cb 15 70
3 ca cb 70 70
4 ff zz 35 35
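To answer Question 1 directly: a hedged sketch that builds the common key by sorting each (A, B) pair, so columns A and B themselves stay untouched (this works for numbers and, via object comparison, for strings too):
import numpy as np

# order-independent key per row: the sorted (A, B) pair as a tuple
key = pd.Series(list(map(tuple, np.sort(f[['A', 'B']].to_numpy(), axis=1))), index=f.index)
f['Max'] = f.groupby(key)['Value'].transform('max')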

python pandas: pivot_table silently drops indices with nans

Is there an option not to drop the indices with NaN in them? I think silently dropping these rows from the pivot will at some point cause someone serious pain.
import pandas
import numpy
a = [['a', 'b', 12, 12, 12], ['a', numpy.nan, 12.3, 233., 12], ['b', 'a', 123.23, 123, 1], ['a', 'b', 1, 1, 1.]]
df = pandas.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'])
df_pivot = df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
print(df)
print(df_pivot)
Output:
a b c d e
0 a b 12.00 12 12
1 a NaN 12.30 233 12
2 b a 123.23 123 1
3 a b 1.00 1 1
c d e
a b
a b 13.00 13 13
b a 123.23 123 1
This was not supported at the time; see this issue for the enhancement: https://github.com/pydata/pandas/issues/3729.
A workaround is to fill the index with a dummy value, pivot, and then replace:
In [28]: df = df.reset_index()
In [29]: df['b'] = df['b'].fillna('dummy')
In [30]: df['dummy'] = np.nan
In [31]: df
Out[31]:
a b c d e dummy
0 a b 12.00 12 12 NaN
1 a dummy 12.30 233 12 NaN
2 b a 123.23 123 1 NaN
3 a b 1.00 1 1 NaN
In [32]: df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum)
Out[32]:
c d e
a b
a b 13.00 13 13
dummy 12.30 233 12
b a 123.23 123 1
In [33]: df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum).reset_index().replace('dummy',np.nan).set_index(['a','b'])
Out[33]:
c d e
a b
a b 13.00 13 13
NaN 12.30 233 12
b a 123.23 123 1
The option dropna=False is now supported by pivot_table (note that the old rows= keyword has since been renamed to index=):
df.pivot_table(index=['a', 'b'], values=['c', 'd', 'e'], aggfunc=sum, dropna=False)
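On newer pandas (1.1+), groupby also accepts dropna=False and then keeps NaN group keys, so a hedged equivalent of the aggregation above is:
df.groupby(['a', 'b'], dropna=False)[['c', 'd', 'e']].sum()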
