Assign values to dataframe based on index - python

I have a dataframe:
col
1 1
2 2
3 3
I need to fill a second dataframe with values calculated from the first one. The val column is the product of the index value i and the col value j:
i j val
1 1 1
1 2 2
1 3 3
2 1 2
2 2 4
2 3 6
3 1 3
3 2 6
3 3 9
I've tried calculating it with a loop, but I don't think that is the fastest approach. How can I accomplish this more efficiently?

IIUC, you can build every (index, col) pair with MultiIndex.from_product:
df2 = pd.DataFrame(index=pd.MultiIndex.from_product([df.index, df.col])).reset_index()
df2.columns = ['i', 'j']
df2['val'] = df2.i * df2.j
df2
Out[45]:
i j val
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 2
4 2 2 4
5 2 3 6
6 3 1 3
7 3 2 6
8 3 3 9

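I would suggest the following, which multiplies each index value by the col value in the same row (one row per original row, not the full cross product):
df2 = pd.DataFrame({'i': df.index, 'j': df['col'].to_numpy()})
df2['val'] = df2['j'] * df2['i']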
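
For reference, here is a minimal self-contained sketch of the full cross product without a loop, using np.repeat and np.tile. The input frame is reconstructed from the question's example, so treat its index and column values as assumptions.

```python
import numpy as np
import pandas as pd

# Input frame reconstructed from the question: index 1..3, col 1..3
df = pd.DataFrame({"col": [1, 2, 3]}, index=[1, 2, 3])

n = len(df)
i = np.repeat(df.index.to_numpy(), n)   # 1,1,1,2,2,2,3,3,3
j = np.tile(df["col"].to_numpy(), n)    # 1,2,3,1,2,3,1,2,3

# One row per (i, j) pair; val is the element-wise product
df2 = pd.DataFrame({"i": i, "j": j, "val": i * j})
print(df2)
```

Both helpers work on plain arrays, so the multiplication is vectorized and no Python-level loop is needed.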

Related

How to loop through columns of a dataframe which have integers as column names

I have a dataframe with column names 1, 2, 3, 4, ..., 10. I have a subset of columns:
sub_cols = ['1','2','3']
I want to loop through these sub_cols:
for col in sub_cols:
    print('column: '+str(col))
    data[col]
    len(data[col])
I get the output
column: 1
column: 2
column: 3
but neither the column data nor its length is printed. I don't see any error either. Where am I going wrong?
Your code, corrected. Inside a loop, a bare expression is not displayed, so wrap it in print:
for col in sub_cols:
    print('column: '+str(col))
    print(data[col])
    print(len(data[col]))
It looks like your fundamental issue is that your list of integers is actually a list of strings.
The following code shows how to select the columns whose names are integers, using a simple list comprehension with isinstance(c, int):
cols = [i for i in range(5)]+list("abcd")
df = pd.DataFrame(np.random.randint(1,5,5*len(cols)).reshape(5,len(cols)), columns=cols)
df.loc[:,[c for c in df.columns if isinstance(c, int)]]
df
   0  1  2  3  4  a  b  c  d
0  3  4  1  2  4  4  4  1  4
1  1  4  3  2  1  2  4  4  1
2  2  3  4  1  1  4  4  4  2
3  3  1  3  1  2  4  2  4  3
4  3  4  1  4  1  4  1  1  1
df.loc[:,[c for c in df.columns if isinstance(c, int)]]
   0  1  2  3  4
0  3  4  1  2  4
1  1  4  3  2  1
2  2  3  4  1  1
3  3  1  3  1  2
4  3  4  1  4  1
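
A minimal sketch of the string-versus-integer mismatch mentioned in this answer, assuming the frame's column labels really are integers while sub_cols holds strings. The frame `data` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose column labels are integers, not strings
data = pd.DataFrame({1: [10, 20], 2: [30, 40], 3: [50, 60], 4: [70, 80]})
sub_cols = ['1', '2', '3']

lengths = {}
for col in sub_cols:
    key = int(col)            # '1' -> 1 so the label matches the integer column
    print('column:', key)
    print(data[key])          # print is required; a bare expression shows nothing in a loop
    lengths[key] = len(data[key])

print(lengths)
```

If the labels in your frame are strings after all, the int() conversion is unnecessary and data[col] works directly.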

How to use two columns to distinguish data points in a pandas dataframe

I have a dataframe that looks like the following:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe on columns b and c. I know that for a single column we can do
df.explode('column_name')
However, I can't find a way to do it with two columns. Here is the desired output:
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try exploding each list column separately and concatenating the results:
df=pd.concat([df[x].explode() for x in ['b','c']],axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
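You can use itertools.chain along with zip to get your result:
from itertools import chain, repeat
# repeat(a) pairs the scalar from column a with every element of the b and c lists
pd.DataFrame(chain.from_iterable(zip(repeat(a), b, c)
                                 for a, b, c in df.to_numpy()))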
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
The list comprehension from @Ben is the fastest. However, if you are not too concerned about speed, you can use apply with pd.Series.explode:
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply use apply on the whole frame. On non-list columns, explode returns the original values:
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
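
On pandas 1.3 or later, DataFrame.explode also accepts a list of columns, provided the lists in each row have matching lengths. A short sketch using the frame from the question (the name `out` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
                   'c': [[4, 5, 6], [4, 5, 6], [4, 5, 6]]})

# Multi-column explode requires pandas >= 1.3;
# ignore_index=True renumbers the rows 0..8
out = df.explode(['b', 'c'], ignore_index=True)
print(out)
```

The exploded columns keep object dtype, so an astype(int) may be wanted afterwards.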

How to calculate count within the same group based on ID

My DataFrame looks like:
df = pd.DataFrame({"ID":['A','B','A','A','B','B','C','D','D','C'],
'count':[1,1,2,2,2,2,1,1,1,2]})
print(df)
ID count
0 A 1
1 B 1
2 A 2
3 A 2
4 B 2
5 B 2
6 C 1
7 D 1
8 D 1
9 C 2
I will only have the ID column, and I want to calculate the count column. The logic: cumulatively count the occurrences of each ID, but when an ID is repeated immediately, as at index 2 and 3, both rows should get the same count. How can I achieve this?
My attempt, which does not give accurate results:
df['x'] = df['ID'].eq(df['ID'].shift(-1)).astype(int)
df.groupby('ID')['x'].transform('cumsum')+1
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 2
8 2
9 1
Name: x, dtype: int32
This is not a plain groupby cumulative count, because consecutive repeats must share a count.
We can filter down to the first row of each run, count those, then reindex back with a forward fill:
(df[df.ID.ne(df.ID.shift())].groupby('ID').cumcount().add(1)
.reindex(df.index,method='ffill'))
Out[10]:
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 1
8 1
9 2
dtype: int64
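You could also use groupby() with sort=False:
df['count2'] = df[(df.ID.ne(df.ID.shift()))].groupby('ID', sort=False).cumcount().add(1)
# rows inside a run are NaN after the assignment; forward-fill them and restore the integer dtype
df['count2'] = df['count2'].ffill().astype(int)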
Output:
ID count count2
0 A 1 1
1 B 1 1
2 A 2 2
3 A 2 2
4 B 2 2
5 B 2 2
6 C 1 1
7 D 1 1
8 D 1 1
9 C 2 2
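
Another way to sketch the same idea, assuming only the ID column is available: flag the first row of each consecutive run, then take a cumulative sum of those flags within each ID. Rows that continue a run add 0, so they keep the count of the run's first row.

```python
import pandas as pd

df = pd.DataFrame({"ID": ['A', 'B', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'C']})

# 1 where a new run of the same ID starts, 0 where the previous row has the same ID
run_start = df["ID"].ne(df["ID"].shift()).astype(int)

# Cumulative number of runs seen so far for each ID
df["count"] = run_start.groupby(df["ID"]).cumsum()
print(df)
```

This avoids the filter and reindex steps, at the cost of one helper series.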

Replace all column values with value from single column - Pandas

I'm hoping to replace the values in all columns of a df with the integers from one specified column. Using the df below, I want to take the values in Code and write them into all other columns.
df = pd.DataFrame({
'Place' : ['X','Y','X','Y','X','Y','X','Y'],
'Number' : ['A','B','C','D','F','G','H','I'],
'Code' : [1,2,3,0,1,2,5,4],
'Value' : ['','','','','','','','']
})
df[:] = df['Code'].apply(lambda x: x if np.isreal(x) else 0).astype(int)
print(df)
Intended Output:
Place Number Code Value
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 0 0 0 0
4 1 1 1 1
5 2 2 2 2
6 5 5 5 5
7 4 4 4 4
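Use reindex, ffill and bfill. Reindexing leaves every column except Code empty, and filling along axis=1 copies Code across each row (axis must be passed by keyword in recent pandas):
df[['Code']].reindex(columns=df.columns).ffill(axis=1).bfill(axis=1).astype(int)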
Out[256]:
Place Number Code Value
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 0 0 0 0
4 1 1 1 1
5 2 2 2 2
6 5 5 5 5
7 4 4 4 4
Numpy solution
df[:] = np.transpose([df.Code] * df.shape[1])
Out[314]:
Place Number Code Value
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 0 0 0 0
4 1 1 1 1
5 2 2 2 2
6 5 5 5 5
7 4 4 4 4
Try this:
df[df.columns] = df[['Code', 'Code', 'Code', 'Code']]
or:
df[df.columns] = df[['Code']*len(df.columns)]
Hope it helps you.
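
As one more hedged alternative, a new frame can be built from a dict comprehension that maps every column name to the Code series. This sketch returns a new frame (the name `out` is hypothetical) instead of assigning in place:

```python
import pandas as pd

df = pd.DataFrame({
    'Place': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Number': ['A', 'B', 'C', 'D', 'F', 'G', 'H', 'I'],
    'Code': [1, 2, 3, 0, 1, 2, 5, 4],
    'Value': ['', '', '', '', '', '', '', ''],
})

# Every column name maps to the same Code series, so all columns
# end up with Code's values and its integer dtype
out = pd.DataFrame({name: df['Code'] for name in df.columns})
print(out)
```

Building a fresh frame sidesteps dtype surprises that can occur when integers are written into existing string columns.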

tracking maximum value in dataframe column

I want to produce a column B in a dataframe that tracks the maximum value reached in column A since row Index 0.
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
I want to avoid iterating. Is there a vectorized solution, and if so, what would it look like?
You're looking for cummax:
In [257]:
df['B'] = df['A'].cummax()
df
Out[257]:
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
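
A self-contained sketch of the same idea, with the frame rebuilt from the question's table. np.maximum.accumulate is the NumPy equivalent and is shown only for comparison:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 2, 1, 3, 4, 2]})
df.index.name = 'Index'

# Running maximum of A from the first row down to each row
df['B'] = df['A'].cummax()

# Same computation on the underlying NumPy array
running_max = np.maximum.accumulate(df['A'].to_numpy())
print(df)
```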
