Python: Pandas, Dataframe, Convert 1column data into 2D data format


I have a pandas DataFrame and want to convert its 1D (long-format) data into a 2D array form.
How do I convert from
'A' 'B' 'C'
1 10 11 a
2 10 12 b
3 10 13 c
4 20 11 d
5 20 12 e
6 20 13 f
to the following 2D array?
11 12 13
10 a b c
20 d e f

>>> df.pivot(index='A', columns='B', values='C')
B 11 12 13
A
10 a b c
20 d e f
Where df is:
>>> from pandas import DataFrame
>>> df = DataFrame(dict(A=[10]*3+[20]*3, B=list(range(11, 14))*2, C=list('abcdef')))
>>> df
A B C
0 10 11 a
1 10 12 b
2 10 13 c
3 20 11 d
4 20 12 e
5 20 13 f
See Reshaping and Pivot Tables

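In older pandas versions you could also use panels for this pivot (note that Panel and sortlevel have since been removed from pandas, so this does not run on current releases):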
In [86]: panel = df.set_index(['A', 'B']).sortlevel(0).to_panel()
In [87]: panel["C"]
Out[87]:
B 11 12 13
A
10 a b c
20 d e f
Which gives you the same result as Sebastian's answer above.
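For readers on a current pandas release, a rough equivalent of the panel approach is to move 'A' and 'B' into the index and unstack 'B' into columns. This is only a sketch on the sample data from the question; the variable name wide is my own choice.

```python
import pandas as pd

# Sample data from the question, in long (1D) form.
df = pd.DataFrame({
    "A": [10] * 3 + [20] * 3,
    "B": list(range(11, 14)) * 2,
    "C": list("abcdef"),
})

# Index by (A, B), then spread the B level across the columns.
wide = df.set_index(["A", "B"])["C"].unstack("B")
print(wide)
```

Like pivot, this assumes every (A, B) pair occurs only once; duplicated pairs would raise an error.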

Related

Find missing numbers in a column dataframe pandas

I have a dataframe with stores and their invoice numbers, and I need to find the missing consecutive invoice numbers per store. For example:
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
Store Invoice
0 A 1
1 A 2
2 A 5
3 A 6
4 A 8
5 B 20
6 B 23
7 B 24
8 B 30
9 C 200
10 C 202
11 C 203
12 D 204
13 D 206
And I want a dataframe like this:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Thanks in advance!

You can use groupby.apply to compute a set difference with the range from the min to max value. Then explode:
(df1.astype({'Invoice': int})
    .groupby('Store')['Invoice']
    .apply(lambda s: set(range(s.min(), s.max())).difference(s))
    .explode().reset_index()
)
NB. if you want to ensure having sorted values, use lambda s: sorted(set(range(s.min(), s.max())).difference(s)).
Output:
Store Invoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
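The same idea can be written as a self-contained sketch that always returns sorted values and names the column MissInvoice as in the question. The helper name missing_invoices is hypothetical, and the dropna step is my own addition to guard against stores with no gaps.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Store": list("AAAAABBBBCCCDD"),
    "Invoice": ["1", "2", "5", "6", "8", "20", "23", "24", "30",
                "200", "202", "203", "204", "206"],
})

def missing_invoices(s):
    """Return the sorted numbers between min and max that are absent from s."""
    present = set(s)
    return [n for n in range(s.min(), s.max() + 1) if n not in present]

missing = (
    df1.astype({"Invoice": int})
       .groupby("Store")["Invoice"]
       .apply(missing_invoices)   # one list of gaps per store
       .explode()                 # one row per missing number
       .dropna()                  # stores without gaps explode to NaN
       .astype(int)
       .rename("MissInvoice")
       .reset_index()
)
print(missing)
```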

Here's an approach:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)
df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store, row in df2.iterrows():
    df2.at[store, 'MissInvoice'] = np.setdiff1d(np.arange(row['min'], row['max'] + 1),
                                                df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns = ['min','max']).reset_index()
The resulting dataframe df2:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.

Creating a new column from the values of a column - Pandas

I want to create a new column D in pandas based on the information in column C. The data has 50k columns, so it is impossible for me to do it manually.
A sample of the data is:
A B C
12 12 3:02
13 13 2:02
14 14 3:03
15 15 1:04
16 16 2:05
I need to divide the values in column C into two parts at the colon:
if the first value is bigger than the second, as in row 1 (3 > 02), the value in column D will be A
if both values are equal, as in rows 2 and 3 (2:02 / 3:03), the value in column D will be B
if the second value is bigger than the first, as in rows 4 and 5 (1:04 / 2:05), the value in column D will be C
so the new data will look like
A B C D
12 12 3:02 A
13 13 2:02 B
14 14 3:03 B
15 15 1:04 C
16 16 2:05 C
Thanks in advance .

Use numpy.select with new DataFrame created by Series.str.split and expand=True:
import numpy as np
df1 = df['C'].str.split(':', expand=True).astype(int)
print(df1)
0 1
1 3 2
2 2 2
3 3 3
4 1 4
5 2 5
df['D'] = np.select([df1[0] > df1[1], df1[0] == df1[1], df1[0] < df1[1]], ['A','B','C'], default='')
print (df)
A B C D
1 12 12 3:02 A
2 13 13 2:02 B
3 14 14 3:03 B
4 15 15 1:04 C
5 16 16 2:05 C
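For reference, here is the whole approach as one runnable sketch on the sample data. The names parts, first and second are my own. Passing a string default is an assumption on my side: as far as I know, recent NumPy versions can reject string choices combined with the integer default of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [12, 13, 14, 15, 16],
    "B": [12, 13, 14, 15, 16],
    "C": ["3:02", "2:02", "3:03", "1:04", "2:05"],
})

# Split "3:02" into two integer columns: 3 and 2.
parts = df["C"].str.split(":", expand=True).astype(int)
first, second = parts[0], parts[1]

# Pick a label per row from the first condition that holds.
df["D"] = np.select(
    [first > second, first == second, first < second],
    ["A", "B", "C"],
    default="",
)
print(df)
```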

creating new dataframe columns by performing operations on existing columns

Is it possible to iterate over a dataframe and create new columns based on operations performed on existing columns?
For instance if my existing dataframe has 4 columns: a, b, c, d.
I want to create new columns adding a and b, then a and c, then a and d, then b and c, then b and d, then c and d.
I know you can manually create a new column but the actual project I am working on has many more columns so I am wondering if it can be done with a for loop.
Thanks.

For summation, yes: you can do it with broadcasting. For a general function, you may need to write a loop.
vals = df.to_numpy()
# new column names
cols = pd.MultiIndex.from_product([df.columns, df.columns])
# output:
pd.DataFrame((vals[:,:,None] + vals[:,None,:]).reshape(len(df), -1),
             index=df.index,
             columns=cols)
Output:
a b c d
a b c d a b c d a b c d a b c d
0 0 1 2 3 1 2 3 4 2 3 4 5 3 4 5 6
1 8 9 10 11 9 10 11 12 10 11 12 13 11 12 13 14
2 16 17 18 19 17 18 19 20 18 19 20 21 19 20 21 22
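The broadcasting result contains every ordered pair, including a+a and both a+b and b+a. If only the six unordered pairs from the question are wanted, a loop over itertools.combinations is one option. This is a sketch: the input frame is reconstructed from the output above, and the "a+b" style column names are my own choice.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Input reconstructed from the output above: 3 rows, columns a..d.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))

# One new column per unordered pair of existing columns.
pair_sums = pd.DataFrame(
    {f"{x}+{y}": df[x] + df[y] for x, y in combinations(df.columns, 2)},
    index=df.index,
)
print(pair_sums)
```

Swapping `df[x] + df[y]` for any other two-column function gives the general-purpose loop mentioned above.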

crosstab to fill with data of another column

I can't manage to populate a crosstab with data from another column; maybe crosstab is not the right solution...
Initial dataframe:
id id_m X
0 10 10 a
1 10 11 b
2 10 12 c
3 11 10 d
4 11 11 e
5 11 12 f
6 12 10 g
7 12 11 h
8 12 12 i
Expected result:
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
Here is my code to reproduce the data:
import pandas as pd
df= pd.DataFrame({'id': [10, 11,12]})
df_m = pd.merge(df.assign(key=0), df.assign(key=0), suffixes=('', '_m'), on='key').drop('key', axis=1)
# just a sample to populate the column
df_m['X'] =['a','b' ,'c','d', 'e','f','g' ,'h', 'i']

If your original df is this
id id_m X
0 10 10 a
1 10 11 b
2 10 12 c
3 11 10 d
4 11 11 e
5 11 12 f
6 12 10 g
7 12 11 h
8 12 12 i
And all you want is this
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
You can groupby the id and id_m columns, take the max of the X column, then unstack the id_m column like this.
df.groupby(['id', 'id_m']).X.max().unstack()
If you really want to use pivot_table you can do this too
df.pivot_table(index='id', columns='id_m', values='X', aggfunc='max')
Same results.
Lastly, you can use just pivot since your rows are unique with respect to the indices and columns.
df.pivot(index='id', columns='id_m', values='X')
References
groupby
pivot_table
pivot

Yours is a bit trickier since you have text as values: you have to tell pandas explicitly which aggfunc to use. You can use a lambda function for that, like the following:
df_final = pd.pivot_table(df_m, index='id', columns='id_m', values='X', aggfunc=lambda x: ' '.join(x) )
id_m 10 11 12
id
10 a b c
11 d e f
12 g h i
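Putting the question's setup and the pivot together, a minimal end-to-end sketch could look like this. It assumes pandas 1.2 or later for how="cross", which replaces the helper key column used in the question; the variable names ids and wide are my own.

```python
import pandas as pd

ids = pd.DataFrame({"id": [10, 11, 12]})

# Cross join of the ids with themselves: every (id, id_m) combination.
df_m = ids.merge(ids, how="cross", suffixes=("", "_m"))

# Sample values, one per combination, as in the question.
df_m["X"] = list("abcdefghi")

# Each (id, id_m) pair is unique, so a plain pivot is enough.
wide = df_m.pivot(index="id", columns="id_m", values="X")
print(wide)
```

If (id, id_m) pairs could repeat, pivot would raise an error and pivot_table with an explicit aggfunc, as shown above, would be the safer choice.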

Multiindex on DataFrames and sum in Pandas

I am currently trying to make use of Pandas MultiIndex attribute. I am trying to group an existing DataFrame-object df_original based on its columns in a smart way, and was therefore thinking of MultiIndex.
print df_original =
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame-object, with A, B and C, and by_portfolio as indices. Looking like
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried making all columns in df_original and the sought after indices into list-objects, and from there create a new DataFrame. This seems a bit cumbersome, and I can't figure out how to add the actual values after.
Perhaps some sort of groupby is better for this purpose? Thing is I will need to be able to add this data to another, similar, DataFrame, so I will need the resulting DataFrame to be able to be added to another one later on.
Thanks

You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24
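The steps above can also be chained into one expression. The sketch below rebuilds df_original from the question so it runs on its own; the final add call with fill_value=0 is my assumption about how the result might later be combined with a similar DataFrame.

```python
import pandas as pd

df_original = pd.DataFrame({
    "by_currency": ["AUD"] * 4 + ["CHF"] * 4,
    "by_portfolio": list("abcd") * 2,
    "A": [1, 4, 7, 10, 13, 16, 19, 22],
    "B": [2, 5, 8, 11, 14, 17, 20, 23],
    "C": [3, 6, 9, 12, 15, 18, 21, 24],
})

result = (
    df_original.set_index(["by_currency", "by_portfolio"])
    .stack()                   # A/B/C column labels become an index level
    .unstack("by_currency")    # currencies become the columns
    .swaplevel(0, 1)           # put the A/B/C level first
    .sort_index()
)
result.columns.name = None

# Adding a second frame with the same layout aligns on the MultiIndex.
total = result.add(result, fill_value=0)
print(result)
```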

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
