crosstab to fill with data of another column - python

I dont arrive to populate a crosstab with data from another colum: maybe its not the solution...
initial dataframe final waited
id id_m X
0 10 10 a
1 10 11 b id_m 10 11 12
2 10 12 c id
3 11 10 d -> 10 a b c
4 11 11 e 11 d e f
5 11 12 f 12 g h i
6 12 10 g
7 12 11 h
8 12 12 i
my code to help you:
import pandas as pd
df= pd.DataFrame({'id': [10, 11,12]})
df_m = pd.merge(df.assign(key=0), df.assign(key=0), suffixes=('', '_m'), on='key').drop('key', axis=1)
# just a sample to populate the column
df_m['X'] =['a','b' ,'c','d', 'e','f','g' ,'h', 'i']

If your original df is this
id id_m X
0 10 10 a
1 10 11 b
2 10 12 c
3 11 10 d
4 11 11 e
5 11 12 f
6 12 10 g
7 12 11 h
8 12 12 i
And all you want is this
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
You can groupby the id and id_m columns, take the max of the X column, then unstack the id_m column like this.
df.groupby([
'id',
'id_m'
]).X.max().unstack()
If you really want to use pivot_table you can do this too
df.pivot_table(index='id', columns='id_m', values='X', aggfunc='max')
Same results.
Lastly, you can use just pivot since your rows are unique with respect to the indices and columns.
df.pivot(index='id', columns='id_m')
References
groupby
pivot_table
pivot

Yours is a bit more tricky since you have text as values, you have to explicitly tell pandas the aggfunc, you can use a lambda function for that like the following:
df_final = pd.pivot_table(df_m, index='id', columns='id_m', values='X', aggfunc=lambda x: ' '.join(x) )
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i

Related

Find missing numbers in a column dataframe pandas

I have a dataframe with stores and its invoices numbers and I need to find the missing consecutive invoices numbers per Store, for example:
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
Store Invoice
0 A 1
1 A 2
2 A 5
3 A 6
4 A 8
5 B 20
6 B 23
7 B 24
8 B 30
9 C 200
10 C 202
11 C 203
12 D 204
13 D 206
And I want a dataframe like this:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Thanks in advance!
You can use groupby.apply to compute a set difference with the range from the min to max value. Then explode:
(df1.astype({'Invoice': int})
.groupby('Store')['Invoice']
.apply(lambda s: set(range(s.min(), s.max())).difference(s))
.explode().reset_index()
)
NB. if you want to ensure having sorted values, use lambda s: sorted(set(range(s.min(), s.max())).difference(s)).
Output:
Store Invoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Here's an approach:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)
df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store,row in df2.iterrows():
df2.at[store,'MissInvoice'] = np.setdiff1d(np.arange(row['min'],row['max']+1),
df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns = ['min','max']).reset_index()
The resulting dataframe df2:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.

Pandas groupby and pivot

I have the following pandas data frame
id category counts_mean
0 8 a 23
1 8 b 22
2 8 c 23
3 8 d 30
4 9 a 40
5 9 b 22
6 9 c 11
7 9 d 10
....
And I want to group by the id and transpose the category columns to get something like this:
id a b c d
0 8 23 22 23 30
1 9 40 22 11 10
I tried different things with groupby and pivot, but I'm not sure what should be the aggregation argument for the groupby...
Instead using groupby and pivot, you just need to use the pivot function and set the parameters (index , columns, values) to re-shape your DataFrame.
#Creat the DataFrame
data = {
'id': [8,8,8,8,9,9,9,9],
'catergory': ['a','b','c','d','a','b','c','d'],
'counts_mean': [23,22,23,30,40,22,11,10]
}
df = pd.DataFrame(data)
# Using pivot to reshape the DF
df_reshaped = df.pivot(index='id',columns='catergory',values = 'counts_mean')
print(df_reshaped)
output:
catergory a b c d
id
8 23 22 23 30
9 40 22 11 10

Python- Pandas, filter take last element in group, then first element in group

Follow up question from here: drop first and last row from within each group
In pandas, how do you drop the last row in the first groupby then the first row for all subsequent entries in the group?
e.g
X Y
a a 0 1
a 2 3
c 4 5
d 6 7
b e 8 9
f 10 11
g 12 13
c h 14 15
i 16 17
d j 18 19
I want this
X Y
a d 6 7
b e 8 9
c h 14 15
d j 18 19
First check first value of first level by get_level_values and then groupby with apply - first group by tail and all another by head:
first = df.index.get_level_values(0)[0]
df = df.groupby(level=0, sort=False, group_keys=False)
.apply(lambda x: x.tail(1) if x.name == first else x.head(1))
print (df)
X Y
a d 6 7
b e 8 9
c h 14 15
d j 18 19

Multiindex on DataFrames and sum in Pandas

I am currently trying to make use of Pandas MultiIndex attribute. I am trying to group an existing DataFrame-object df_original based on its columns in a smart way, and was therefore thinking of MultiIndex.
print df_original =
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame-object, with A, B and C, and by_portfolio as indices. Looking like
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried making all columns in df_original and the sought after indices into list-objects, and from there create a new DataFrame. This seems a bit cumbersome, and I can't figure out how to add the actual values after.
Perhaps some sort of groupby is better for this purpose? Thing is I will need to be able to add this data to another, similar, DataFrame, so I will need the resulting DataFrame to be able to be added to another one later on.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24

Python: Pandas, Dataframe, Convert 1column data into 2D data format

This is Pandas dataframe
I want to convert 1D data into 2D array form
How do I convert from
'A' 'B' 'C'
1 10 11 a
2 10 12 b
3 10 13 c
4 20 11 d
5 20 12 e
6 20 13 f
to this 2d array as the following
11 12 13
10 a b c
20 d e f
>>> df.pivot('A', 'B', 'C')
B 11 12 13
A
10 a b c
20 d e f
Where df is:
>>> df = DataFrame(dict(A=[10]*3+[20]*3, B=range(11, 14)*2, C=list('abcdef')))
>>> df
A B C
0 10 11 a
1 10 12 b
2 10 13 c
3 20 11 d
4 20 12 e
5 20 13 f
See Reshaping and Pivot Tables
You can also use panels to help you do this pivot. Like this:-
In [86]: panel = df.set_index(['A', 'B']).sortlevel(0).to_panel()
In [87]: panel["C"]
Out[87]:
B 11 12 13
A
10 a b c
20 d e f
Which gives you the same result as Sebastian's answer above.

Categories