Python: Pandas DataFrame, convert 1-column data into 2D data format

This is a pandas DataFrame.
I want to convert the 1D data into a 2D array form.
How do I convert from
'A' 'B' 'C'
1 10 11 a
2 10 12 b
3 10 13 c
4 20 11 d
5 20 12 e
6 20 13 f
to the following 2D array?
11 12 13
10 a b c
20 d e f

>>> df.pivot(index='A', columns='B', values='C')
B 11 12 13
A
10 a b c
20 d e f
Where df is:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(A=[10]*3+[20]*3, B=list(range(11, 14))*2, C=list('abcdef')))
>>> df
A B C
0 10 11 a
1 10 12 b
2 10 13 c
3 20 11 d
4 20 12 e
5 20 13 f
See Reshaping and Pivot Tables
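For readers on a current pandas release, the pivot above can be sketched as a self-contained script. It assumes a pandas version that accepts keyword arguments to pivot; the variable name wide is only illustrative and does not come from the answer.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "A": [10] * 3 + [20] * 3,
    "B": list(range(11, 14)) * 2,
    "C": list("abcdef"),
})

# Each (A, B) pair must be unique, otherwise pivot raises a ValueError.
wide = df.pivot(index="A", columns="B", values="C")
print(wide)
```

If the (A, B) pairs can repeat, pivot_table with an explicit aggfunc is the usual fallback.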

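In older pandas versions you could also use a Panel to do this pivot (Panel and to_panel were removed in pandas 0.25, so this only runs on legacy releases). Like this: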
In [86]: panel = df.set_index(['A', 'B']).sortlevel(0).to_panel()
In [87]: panel["C"]
Out[87]:
B 11 12 13
A
10 a b c
20 d e f
Which gives you the same result as Sebastian's answer above.
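Since Panel is no longer available, a rough modern equivalent of the same idea (index by A and B, then spread B across the columns) is set_index followed by unstack. This is a swapped-in technique, not the original answer's method, and the name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [10] * 3 + [20] * 3,
    "B": list(range(11, 14)) * 2,
    "C": list("abcdef"),
})

# Move A and B into a two-level index, then unstack level B into columns.
wide = df.set_index(["A", "B"])["C"].unstack("B")
print(wide)
```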

Related

Find missing numbers in a column dataframe pandas

I have a dataframe with stores and their invoice numbers, and I need to find the missing consecutive invoice numbers per store. For example:
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
Store Invoice
0 A 1
1 A 2
2 A 5
3 A 6
4 A 8
5 B 20
6 B 23
7 B 24
8 B 30
9 C 200
10 C 202
11 C 203
12 D 204
13 D 206
And I want a dataframe like this:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Thanks in advance!
You can use groupby.apply to compute a set difference with the range from the min to max value. Then explode:
(df1.astype({'Invoice': int})
.groupby('Store')['Invoice']
.apply(lambda s: set(range(s.min(), s.max())).difference(s))
.explode().reset_index()
)
NB. if you want to ensure having sorted values, use lambda s: sorted(set(range(s.min(), s.max())).difference(s)).
Output:
Store Invoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
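A self-contained sketch of the groupby/explode approach, using the sorted variant and renaming the result column to MissInvoice as in the question. The helper name missing_in_range and the dropna step (for stores with no gaps) are additions for illustration, not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Store": list("AAAAABBBBCCCDD"),
    "Invoice": ["1", "2", "5", "6", "8", "20", "23", "24", "30",
                "200", "202", "203", "204", "206"],
})

def missing_in_range(s):
    # Hypothetical helper: numbers between min and max that are absent from s.
    return sorted(set(range(s.min(), s.max() + 1)).difference(s))

missing = (
    df1.astype({"Invoice": int})
       .groupby("Store")["Invoice"]
       .apply(missing_in_range)   # one list of missing numbers per store
       .explode()                 # one row per missing number
       .dropna()                  # stores without gaps explode to NaN
       .astype(int)
       .rename("MissInvoice")
       .reset_index()
)
print(missing)
```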
Here's an approach:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)
df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store, row in df2.iterrows():
    df2.at[store, 'MissInvoice'] = np.setdiff1d(np.arange(row['min'], row['max']+1),
                                                df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns = ['min','max']).reset_index()
The resulting dataframe df2:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.

Creating a new column from the values of a column - Pandas

I want to create a new column D in pandas based on the information in column C. The data that I have has 50k columns, so it is impossible for me to do it manually.
A sample of the data:
A B C
12 12 3:02
13 13 2:02
14 14 3:03
15 15 1:04
16 16 2:05
I need to split the values in column C into two parts at the colon:
if the first value is bigger than the second, as in row 1 (3:02, since 3 > 02), the value in column D will be A
if both values are equal, as in rows 2 and 3 (2:02 and 3:03), the value in column D will be B
if the second value is bigger than the first, as in rows 4 and 5 (1:04 and 2:05), the value in column D will be C
so the new data will look like
A B C D
12 12 3:02 A
13 13 2:02 B
14 14 3:03 B
15 15 1:04 C
16 16 2:05 C
Thanks in advance .
Use numpy.select with a new DataFrame created by Series.str.split with expand=True:
df1 = df['C'].str.split(':', expand=True).astype(int)
print(df1)
0 1
1 3 2
2 2 2
3 3 3
4 1 4
5 2 5
df['D'] = np.select([df1[0] > df1[1], df1[0] == df1[1], df1[0] < df1[1]], ['A','B','C'])
print (df)
A B C D
1 12 12 3:02 A
2 13 13 2:02 B
3 14 14 3:03 B
4 15 15 1:04 C
5 16 16 2:05 C
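The same steps as a runnable script, with the sample data rebuilt from the question. Passing default="" is an addition here: it keeps the fallback value a string like the choices, which avoids a dtype mismatch with the integer default on some NumPy versions. The names parts, first and second are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [12, 13, 14, 15, 16],
    "B": [12, 13, 14, 15, 16],
    "C": ["3:02", "2:02", "3:03", "1:04", "2:05"],
})

# Split "3:02" into two integer columns (0 and 1).
parts = df["C"].str.split(":", expand=True).astype(int)
first, second = parts[0], parts[1]

# Conditions are checked in order; the matching label is used for each row.
df["D"] = np.select(
    [first > second, first == second, first < second],
    ["A", "B", "C"],
    default="",
)
print(df)
```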

creating new dataframe columns by performing operations on existing columns

Is it possible to iterate over a dataframe and create new columns based on operations performed on existing columns?
For instance if my existing dataframe has 4 columns: a, b, c, d.
I want to create new columns adding a and b, then a and c, then a and d, then b and c, then b and d, then c and d.
I know you can manually create a new column but the actual project I am working on has many more columns so I am wondering if it can be done with a for loop.
Thanks.
For summation, yes, you can do this with broadcasting. For a general function, you may want to write a loop.
vals = df.to_numpy()
# new column names
cols = pd.MultiIndex.from_product([df.columns, df.columns])
# output:
pd.DataFrame((vals[:,:,None] + vals[:,None,:]).reshape(len(df), -1),
index=df.index,
columns=cols)
Output:
a b c d
a b c d a b c d a b c d a b c d
0 0 1 2 3 1 2 3 4 2 3 4 5 3 4 5 6
1 8 9 10 11 9 10 11 12 10 11 12 13 11 12 13 14
2 16 17 18 19 17 18 19 20 18 19 20 21 19 20 21 22
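The broadcasting result above contains every ordered pair, including a+a and both a+b and b+a. If only the six unordered pairs from the question are wanted, or the operation is not a plain sum, a loop over itertools.combinations is one possible sketch. The input frame is inferred from the output shown above, and the column naming scheme "a+b" is an assumption.

```python
from itertools import combinations
import operator

import numpy as np
import pandas as pd

# Frame inferred from the output above: rows 0..3, 4..7, 8..11.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))

op = operator.add  # swap in any function of two Series

# One new column per unordered pair of existing columns.
pair_sums = pd.DataFrame(
    {f"{x}+{y}": op(df[x], df[y]) for x, y in combinations(df.columns, 2)}
)
print(pair_sums)
```

To keep the original columns as well, the new frame can be joined back with pd.concat([df, pair_sums], axis=1).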

crosstab to fill with data of another column

I can't manage to populate a crosstab with data from another column; maybe crosstab is not the right solution...
Initial dataframe:
id id_m X
0 10 10 a
1 10 11 b
2 10 12 c
3 11 10 d
4 11 11 e
5 11 12 f
6 12 10 g
7 12 11 h
8 12 12 i
Expected result:
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
My code, to help you:
import pandas as pd
df= pd.DataFrame({'id': [10, 11,12]})
df_m = pd.merge(df.assign(key=0), df.assign(key=0), suffixes=('', '_m'), on='key').drop('key', axis=1)
# just a sample to populate the column
df_m['X'] =['a','b' ,'c','d', 'e','f','g' ,'h', 'i']
If your original df is this
id id_m X
0 10 10 a
1 10 11 b
2 10 12 c
3 11 10 d
4 11 11 e
5 11 12 f
6 12 10 g
7 12 11 h
8 12 12 i
And all you want is this
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
You can groupby the id and id_m columns, take the max of the X column, then unstack the id_m column like this.
df.groupby([
'id',
'id_m'
]).X.max().unstack()
If you really want to use pivot_table you can do this too
df.pivot_table(index='id', columns='id_m', values='X', aggfunc='max')
Same results.
Lastly, you can use just pivot since your rows are unique with respect to the indices and columns.
df.pivot(index='id', columns='id_m', values='X')
Your case is a bit trickier because the values are text, so you have to tell pandas the aggfunc explicitly. You can use a lambda function for that, like the following:
df_final = pd.pivot_table(df_m, index='id', columns='id_m', values='X', aggfunc=lambda x: ' '.join(x) )
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
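Putting the question's setup and the plain pivot together, an end-to-end sketch could look like this. It builds df_m with a cross join (how="cross", which assumes pandas 1.2 or later) instead of the helper key column; the name wide is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [10, 11, 12]})

# Every id paired with every id; the right-hand copy is suffixed to id_m.
df_m = pd.merge(df, df, how="cross", suffixes=("", "_m"))

# Just a sample to populate the column, as in the question.
df_m["X"] = list("abcdefghi")

# Rows are unique per (id, id_m), so no aggregation function is needed.
wide = df_m.pivot(index="id", columns="id_m", values="X")
print(wide)
```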

Multiindex on DataFrames and sum in Pandas

I am currently trying to make use of Pandas MultiIndex attribute. I am trying to group an existing DataFrame-object df_original based on its columns in a smart way, and was therefore thinking of MultiIndex.
df_original looks like this:
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame-object, with A, B and C, and by_portfolio as indices, looking like this:
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried making all columns in df_original and the sought after indices into list-objects, and from there create a new DataFrame. This seems a bit cumbersome, and I can't figure out how to add the actual values after.
Perhaps some sort of groupby is better for this purpose? Thing is I will need to be able to add this data to another, similar, DataFrame, so I will need the resulting DataFrame to be able to be added to another one later on.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
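The stack/unstack steps can also be chained into one expression. The sketch below rebuilds df_original from the question and reorders the columns to CHF, AUD as in the desired output; the name result is illustrative.

```python
import pandas as pd

df_original = pd.DataFrame({
    "by_currency": ["AUD"] * 4 + ["CHF"] * 4,
    "by_portfolio": list("abcd") * 2,
    "A": [1, 4, 7, 10, 13, 16, 19, 22],
    "B": [2, 5, 8, 11, 14, 17, 20, 23],
    "C": [3, 6, 9, 12, 15, 18, 21, 24],
})

result = (
    df_original.set_index(["by_currency", "by_portfolio"])
    .stack()                 # columns A, B, C become the innermost index level
    .unstack("by_currency")  # currencies become the columns
    .swaplevel(0, 1)         # put A/B/C before by_portfolio
    .sort_index()
)
result.columns.name = None
result = result[["CHF", "AUD"]]  # column order from the question
print(result)
```

Because the result is an ordinary DataFrame, arithmetic with a second frame that has the same index and columns aligns on the labels, so result + result doubles every cell.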
