I want to create a new column on pandas related to the info that I have on column C and want to create a column D..The data that I have has 50k columns so it is impossible for me to do it manually..
A sample of the data is ;
A B C
12 12 3:02
13 13 2:02
14 14 3:03
15 15 1:04
16 16 2:05
I need to dive the values into 2 parts at column C from the colon symbol ;
if the first value is bigger than the second like in row 1 == 3>02 the value on D column value will be A
if both values are equal like in rows 2 and 3 (2:02/3:03) the value on D column value will be B
if second value is bigger than the first value like in rows 4 and 5 (1:04 /2:05 ) D column value will be C
so the new data will look like
A B C D
2 12 3:02 A
13 13 2:02 B
14 14 3:03 B
15 15 1:04 C
16 16 2:05 C
Thanks in advance .
Use numpy.select with new DataFrame created by Series.str.split and expand=True:
df1 = df['C'].str.split(':', expand=True).astype(int)
print(df1)
0 1
1 3 2
2 2 2
3 3 3
4 1 4
5 2 5
df['D'] = np.select([df1[0] > df1[1], df1[0] == df1[1], df1[0] < df1[1]], ['A','B','C'])
print (df)
A B C D
1 12 12 3:02 A
2 13 13 2:02 B
3 14 14 3:03 B
4 15 15 1:04 C
5 16 16 2:05 C
Related
I have a dataframe with stores and its invoices numbers and I need to find the missing consecutive invoices numbers per Store, for example:
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C','D','D']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203','204','206']
Store Invoice
0 A 1
1 A 2
2 A 5
3 A 6
4 A 8
5 B 20
6 B 23
7 B 24
8 B 30
9 C 200
10 C 202
11 C 203
12 D 204
13 D 206
And I want a dataframe like this:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Thanks in advance!
You can use groupby.apply to compute a set difference with the range from the min to max value. Then explode:
(df1.astype({'Invoice': int})
.groupby('Store')['Invoice']
.apply(lambda s: set(range(s.min(), s.max())).difference(s))
.explode().reset_index()
)
NB. if you want to ensure having sorted values, use lambda s: sorted(set(range(s.min(), s.max())).difference(s)).
Output:
Store Invoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
11 D 205
Here's an approach:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['Store'] = ['A','A','A','A','A','B','B','B','B','C','C','C']
df1['Invoice'] = ['1','2','5','6','8','20','23','24','30','200','202','203']
df1['Invoice'] = df1['Invoice'].astype(int)
df2 = df1.groupby('Store')['Invoice'].agg(['min','max'])
df2['MissInvoice'] = [[]]*len(df2)
for store,row in df2.iterrows():
df2.at[store,'MissInvoice'] = np.setdiff1d(np.arange(row['min'],row['max']+1),
df1.loc[df1['Store'] == store, 'Invoice'])
df2 = df2.explode('MissInvoice').drop(columns = ['min','max']).reset_index()
The resulting dataframe df2:
Store MissInvoice
0 A 3
1 A 4
2 A 7
3 B 21
4 B 22
5 B 25
6 B 26
7 B 27
8 B 28
9 B 29
10 C 201
Note: Store D is absent from the dataframe in my code because it is omitted from the lines in the question defining df1.
I have a data frame. I want to rank each column based on its row value
Ex:
xdf = pd.DataFrame({'A':[10,20,30],'B':[5,30,20],'C':[15,3,8]})
xdf =
A B C
0 10 5 15
1 20 30 3
2 30 20 8
Expected result:
xdf =
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
OR
xdf =
A B C A_Rk B_Rk C_Rk
0 10 5 15 2 3 1
1 20 30 3 2 1 2
2 30 20 8 1 2 3
Why I need this:
I want to track the trend of each column and how it is changing. I would like to show this by the plot. Maybe a bar plot showing how many times A got Rank1, 2, 3, etc.
My approach:
xdf[['Rk_1','Rk_2','Rk_3']] = ""
for i in range(len(xdf)):
xdf.loc[i,['Rk_1','Rk_2','Rk_3']] = dict(sorted(dict(xdf[['A','B','C']].loc[i]).items(),reverse=True,key=lambda item:item[1])).keys()
Present output:
A B C Rk_1 Rk_2 Rk_3
0 10 5 15 C A B
1 20 30 3 B A C
2 30 20 8 A B C
I am iterating through each row, converting each row, column into a dictionary, sorting the values, and then extracting the keys (columns). Is there a better approach? My actual data frame has 10000 rows, 12 columns to be ranked. I just executed and it took around 2 minutes.
You should be able to get your desired dataframe by using:
ranked = xdf.join(xdf.rank(ascending=False, method='first', axis=1), rsuffix='_rank')
This'll give you:
A B C A_rank B_rank C_rank
0 10 5 15 2.0 3.0 1.0
1 20 30 3 2.0 1.0 3.0
2 30 20 8 1.0 2.0 3.0
Then do whatever you need to do plotting wise.
When I have a below df, I want to get a column 'C' which has max value between specific value '15' and column 'A' within the condition "B == 't'"
testdf = pd.DataFrame({"A":[20, 16, 7, 3, 8],"B":['t','t','t','t','f']})
testdf
A B
0 20 t
1 16 t
2 7 t
3 3 t
4 8 f
I tried this:
testdf.loc[testdf['B']=='t', 'C'] = max(15,(testdf.loc[testdf['B']=='t','A']))
And desired output is:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
Could you help me to get the output? Thank you!
Use np.where with clip:
testdf['C'] = np.where(testdf['B'].eq('t'),
testdf['A'].clip(15), df['A'])
Or similarly with series.where:
testdf['C'] = (testdf['A'].clip(15)
.where(testdf['B'].eq('t'), testdf['A'])
)
output:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
You could also use the update method:
testdf['C'] = testdf['A']
A B C
0 20 t 20
1 16 t 16
2 7 t 7
3 3 t 3
4 8 f 8
values = testdf.A[testdf.B.eq('t')].clip(15)
values
Out[16]:
0 20
1 16
2 15
3 15
Name: A, dtype: int64
testdf.update(values.rename('C'))
A B C
0 20 t 20.0
1 16 t 16.0
2 7 t 15.0
3 3 t 15.0
4 8 f 8.0
To apply any formula to individual values in a dataframe you can use
df['column'] =df['column'].apply(lambda x: anyFunc(x))
x here will catch individual values of column one by one and pass it to the function where you can manipulate it and return back.
Is it possible to iterate over a dataframe and create new columns based on operations performed on existing columns?
For instance if my existing dataframe has 4 columns: a, b, c, d.
I want to create new columns adding a and b, then a and c, then a and d, then b and c, then b and d, then c and d.
I know you can manually create a new column but the actual project I am working on has many more columns so I am wondering if it can be done with a for loop.
Thanks.
For summation, yes, you can do with broadcasting. For general function, you may want to write a loop.
vals = df.to_numpy()
# new column names
cols = pd.MultiIndex.from_product([df.columns, df.columns])
# output:
pd.DataFrame((vals[:,:,None] + vals[:,None,:]).reshape(len(df), -1),
index=df.index,
columns=cols)
Output:
a b c d
a b c d a b c d a b c d a b c d
0 0 1 2 3 1 2 3 4 2 3 4 5 3 4 5 6
1 8 9 10 11 9 10 11 12 10 11 12 13 11 12 13 14
2 16 17 18 19 17 18 19 20 18 19 20 21 19 20 21 22
I am currently trying to make use of Pandas MultiIndex attribute. I am trying to group an existing DataFrame-object df_original based on its columns in a smart way, and was therefore thinking of MultiIndex.
print df_original =
by_currency by_portfolio A B C
1 AUD a 1 2 3
2 AUD b 4 5 6
3 AUD c 7 8 9
4 AUD d 10 11 12
5 CHF a 13 14 15
6 CHF b 16 17 18
7 CHF c 19 20 21
8 CHF d 22 23 24
Now, what I would like to have is a MultiIndex DataFrame-object, with A, B and C, and by_portfolio as indices. Looking like
CHF AUD
A a 13 1
b 16 4
c 19 7
d 22 10
B a 14 2
b 17 5
c 20 8
d 23 11
C a 15 3
b 18 6
c 21 9
d 24 12
I have tried making all columns in df_original and the sought after indices into list-objects, and from there create a new DataFrame. This seems a bit cumbersome, and I can't figure out how to add the actual values after.
Perhaps some sort of groupby is better for this purpose? Thing is I will need to be able to add this data to another, similar, DataFrame, so I will need the resulting DataFrame to be able to be added to another one later on.
Thanks
You can use a combination of stack and unstack:
In [50]: df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
Out[50]:
by_currency AUD CHF
by_portfolio
a A 1 13
B 2 14
C 3 15
b A 4 16
B 5 17
C 6 18
c A 7 19
B 8 20
C 9 21
d A 10 22
B 11 23
C 12 24
To obtain your desired result, we only need to swap the levels of the index:
In [51]: df2 = df.set_index(['by_currency', 'by_portfolio']).stack().unstack(0)
In [52]: df2.columns.name = None
In [53]: df2.index = df2.index.swaplevel(0,1)
In [55]: df2 = df2.sort_index()
In [56]: df2
Out[56]:
AUD CHF
by_portfolio
A a 1 13
b 4 16
c 7 19
d 10 22
B a 2 14
b 5 17
c 8 20
d 11 23
C a 3 15
b 6 18
c 9 21
d 12 24