Split rows into multiple rows with pandas - python

I have a dataset in the following format. It has 48 columns and about 200,000 rows.
slot1,slot2,slot3,slot4,slot5,slot6...,slot45,slot46,slot47,slot48
1,2,3,4,5,6,7,......,45,46,47,48
3.5,5.2,2,5.6,...............
I want to reshape this dataset into something like the one below, where N is less than 48 (maybe 24 or 12, etc.); the column headers don't matter.
when N = 4
slotNew1,slotNew2,slotNew3,slotNew4
1,2,3,4
5,6,7,8
......
45,46,47,48
3.5,5.2,2,5.6
............
I could read it row by row, split each row, and append the pieces to a new DataFrame, but that is very inefficient. Is there a faster, more efficient way to do this?

You may try this:
N = 4
df_new = pd.DataFrame(df_original.values.reshape(-1, N))
df_new.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
The code extracts the data as a numpy.ndarray, reshapes it, and creates a new DataFrame of the desired shape.
Example:
import numpy as np
import pandas as pd
df0 = pd.DataFrame(np.arange(48 * 3).reshape(-1, 48))
df0.columns = ['slot{:}'.format(i + 1) for i in range(48)]
print(df0)
# slot1 slot2 slot3 slot4 ... slot45 slot46 slot47 slot48
# 0 0 1 2 3 ... 44 45 46 47
# 1 48 49 50 51 ... 92 93 94 95
# 2 96 97 98 99 ... 140 141 142 143
#
# [3 rows x 48 columns]
N = 4
df = pd.DataFrame(df0.values.reshape(-1, N))
df.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
print(df.head())
# slotNew1 slotNew2 slotNew3 slotNew4
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 3 12 13 14 15
# 4 16 17 18 19
Another approach
N = 4
df1 = df0.stack().reset_index()
df1['i'] = df1['level_1'].str.replace('slot', '').astype(int) // N
df1['j'] = df1['level_1'].str.replace('slot', '').astype(int) % N
df1['i'] -= (df1['j'] == 0) - df1['level_0'] * 48 / N
df1['j'] += (df1['j'] == 0) * N
df1['j'] = 'slotNew' + df1['j'].astype(str)
df1 = df1[['i', 'j', 0]]
df = df1.pivot(index='i', columns='j', values=0)

Use pandas.DataFrame.explode after splitting each row into chunks. Given df:
import numpy as np
import pandas as pd
df = pd.DataFrame([np.arange(1, 49)], columns=['slot%s' % i for i in range(1, 49)])
print(df)
slot1 slot2 slot3 slot4 slot5 slot6 slot7 slot8 slot9 slot10 ... \
0 1 2 3 4 5 6 7 8 9 10 ...
slot39 slot40 slot41 slot42 slot43 slot44 slot45 slot46 slot47 \
0 39 40 41 42 43 44 45 46 47
slot48
0 48
Using chunks to divide:
def chunks(l, n):
    """Yield successive n-sized chunks from l.
    Source: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n_items = len(l)
    if n_items % n:
        n_pads = n - n_items % n
    else:
        n_pads = 0
    l = l + [np.nan for _ in range(n_pads)]
    for i in range(0, len(l), n):
        yield l[i:i + n]
N = 4
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)
Output:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
...
The advantage of this approach over numpy.reshape is that it can handle the case where N does not evenly divide the number of columns:
N = 7
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)
Output:
0 1 2 3 4 5 6
0 1 2 3 4 5 6 7.0
1 8 9 10 11 12 13 14.0
2 15 16 17 18 19 20 21.0
3 22 23 24 25 26 27 28.0
4 29 30 31 32 33 34 35.0
5 36 37 38 39 40 41 42.0
6 43 44 45 46 47 48 NaN
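If you prefer to stay in NumPy, the same pad-then-reshape idea can be done without a Python-level generator. This sketch (with a hypothetical helper name, reshape_pad) pads each row with NaN up to a multiple of N, then reshapes:

```python
import numpy as np
import pandas as pd

def reshape_pad(df, n):
    # Pad each row with NaN so its width is a multiple of n, then reshape.
    arr = df.to_numpy(dtype=float)
    pad = (-arr.shape[1]) % n
    arr = np.pad(arr, ((0, 0), (0, pad)), constant_values=np.nan)
    return pd.DataFrame(arr.reshape(-1, n))

df = pd.DataFrame([np.arange(1, 49)])
print(reshape_pad(df, 7))  # 7 rows of width 7; last cell is NaN
```

Because the padding is applied per input row, each original row still starts a fresh chunk, matching the chunks-and-explode behavior.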


How do you look up in a range

I have two data frames. One contains Ids that I would like to look up; the other contains Ids and values. For each looked-up Id, I want to return its row together with the rows immediately before and after it (-1, 0, +1). For example, looking up 99, 55, and 117 should return 100 99 98, 56 55 54, and 118 117 116. There is a better example below.
df = pd.DataFrame([[99],[55],[117]],columns = ['Id'])
df2 = pd.DataFrame([[100,1,2,4,5,6,8],
[87,1,6,20,22,23,34],
[99,1,12,13,34,45,46],
[64,1,10,14,29,32,33],
[55,1,22,13,23,33,35],
[66,1,6,7,8,9,10],
[77,1,2,3,5,6,8],
[811,1,2,5,6,8,10],
[118,1,7,8,22,44,56],
[117,1,66,44,47,87,91]],
columns = ['Id', 'Num1','Num2','Num3','Num4','Num5','Num6'])
I would like my result to look something like this:
results = pd.DataFrame([[87,1,6,20,22,23,34],
[99,1,12,13,34,45,46],
[64,1,10,14,29,32,33],
[64,1,10,14,29,32,33],
[55,1,22,13,23,33,35],
[66,1,6,7,8,9,10],
[118,1,7,8,22,44,56],
[117,1,66,44,47,87,91]],
columns = ['Id', 'Num1','Num2','Num3','Num4','Num5','Num6'])
import pandas as pd
import numpy as np
ind = df2[df2['Id'].isin(df['Id'])].index
aaa = np.array([[ind[i]-1,ind[i],ind[i]+1] for i in range(len(ind))]).ravel()
aaa = aaa[(aaa <= df2.index.values[-1]) & (aaa >= 0)]
df_test = df2.loc[aaa, :].reset_index().drop(['index'], axis=1)
print(df_test)
Output
Id Num1 Num2 Num3 Num4 Num5 Num6
0 87 1 6 20 22 23 34
1 99 1 12 13 34 45 46
2 64 1 10 14 29 32 33
3 64 1 10 14 29 32 33
4 55 1 22 13 23 33 35
5 66 1 6 7 8 9 10
6 118 1 7 8 22 44 56
7 117 1 66 44 47 87 91
Here, ind holds the positions in df2 where the required Ids occur.
aaa builds the (-1, 0, +1) range around each of those positions; wrapping the lists in np.array and calling ravel() flattens them into a single array. aaa is then filtered to drop positions outside df2's valid index range (negative or past the last row).
The rows are then selected with loc.
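The list comprehension that builds the index ranges can be condensed into one vectorized call with np.add.outer. The following sketch reproduces the same result (it assumes df2 keeps its default RangeIndex):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[99], [55], [117]], columns=['Id'])
df2 = pd.DataFrame([[100, 1, 2, 4, 5, 6, 8],
                    [87, 1, 6, 20, 22, 23, 34],
                    [99, 1, 12, 13, 34, 45, 46],
                    [64, 1, 10, 14, 29, 32, 33],
                    [55, 1, 22, 13, 23, 33, 35],
                    [66, 1, 6, 7, 8, 9, 10],
                    [77, 1, 2, 3, 5, 6, 8],
                    [811, 1, 2, 5, 6, 8, 10],
                    [118, 1, 7, 8, 22, 44, 56],
                    [117, 1, 66, 44, 47, 87, 91]],
                   columns=['Id', 'Num1', 'Num2', 'Num3', 'Num4', 'Num5', 'Num6'])

pos = df2.index[df2['Id'].isin(df['Id'])].to_numpy()
window = np.add.outer(pos, np.array([-1, 0, 1])).ravel()  # each hit and its neighbours
window = window[(window >= 0) & (window <= df2.index[-1])]  # clip out-of-range positions
result = df2.loc[window].reset_index(drop=True)
print(result['Id'].tolist())  # [87, 99, 64, 64, 55, 66, 118, 117]
```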
Update 17.12.2022: if you need duplicate rows (the same Id looked up more than once):
df = pd.DataFrame([[99], [55], [117], [117]], columns=['Id'])
lim_ind = df2.index[-1]
def my_func(i):
    a = df2[df2['Id'].isin([i])].index.values
    a = np.array([a - 1, a, a + 1]).ravel()
    a = a[(a >= 0) & (a <= lim_ind)]
    return a
qqq = [my_func(i) for i in df['Id']]
fff = np.array([df2.loc[qqq[i]].values for i in range(len(qqq))], dtype=object)
fff = np.vstack(fff)
result = pd.DataFrame(fff, columns=df2.columns)
print(result)
Output
Id Num1 Num2 Num3 Num4 Num5 Num6
0 87 1 6 20 22 23 34
1 99 1 12 13 34 45 46
2 64 1 10 14 29 32 33
3 64 1 10 14 29 32 33
4 55 1 22 13 23 33 35
5 66 1 6 7 8 9 10
6 118 1 7 8 22 44 56
7 117 1 66 44 47 87 91
8 118 1 7 8 22 44 56
9 117 1 66 44 47 87 91

Multiply columns in Dataframe where columns are pd.MultiIndex

I want to multiply 2 columns (A*B) in a DataFrame where columns are pd.MultiIndex.
I want to perform this multiplication for each DataX (Data1, Data2, ...) column in columns level=0.
df = pd.DataFrame(data= np.arange(32).reshape(8,4),
columns = pd.MultiIndex.from_product(iterables = [["Data1","Data2"],["A","B"]]))
Data1 Data2
A B A B
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
5 20 21 22 23
6 24 25 26 27
7 28 29 30 31
The result of multiplication should be also a DataFrame with columns=pd.MultiIndex (see below).
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
I managed to perform this multiplication by iterating over the columns at level=0, but I am looking for a better way to do it.
for _ in df.columns.get_level_values(level=0).unique().tolist():
    df[(_, "A*B")] = df[(_, "A")] * df[(_, "B")]
Any suggestions or hints much appreciated!
Thanks
Here is another alternative using df.prod and df.join:
u = df.prod(axis=1, level=0)
u.columns = pd.MultiIndex.from_product((u.columns, ['*'.join(df.columns.levels[1])]))
out = df.join(u)
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
Slice out the 'A' and 'B' columns along the first level of the columns Index. Then you can multiply, which will align on the 0th level ('Data1', 'Data2'). We'll then re-create the MultiIndex on the columns and join back.
df1 = df.xs('A', axis=1, level=1).multiply(df.xs('B', axis=1, level=1))
df1.columns = pd.MultiIndex.from_product([df1.columns, ['A*B']])
df = pd.concat([df, df1], axis=1)
Here are some timings assuming you have 2 groups (Data1, Data2) and your DataFrame just gets longer. Turns out, the simple loop might be the fastest of them all. (I added some sorting and needed to copy them all so the output is the same).
import perfplot
import pandas as pd
import numpy as np
##Tom
def simple_loop(df):
    for _ in df.columns.get_level_values(level=0).unique().tolist():
        df[(_, "A*B")] = df[(_, "A")] * df[(_, "B")]
    return df.sort_index(axis=1)

##Roy2012
def mul_with_stack(df):
    df = df.stack(level=0)
    df["A*B"] = df.A * df.B
    return df.stack().swaplevel().unstack(level=[2, 1]).sort_index(axis=1)

##Alollz
def xs_concat(df):
    df1 = df.xs('A', axis=1, level=1).multiply(df.xs('B', axis=1, level=1))
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['A*B']])
    return pd.concat([df, df1], axis=1).sort_index(axis=1)

##anky
def prod_join(df):
    u = df.prod(axis=1, level=0)
    u.columns = pd.MultiIndex.from_product((u.columns, ['*'.join(df.columns.levels[1])]))
    return df.join(u).sort_index(axis=1)
perfplot.show(
    setup=lambda n: pd.DataFrame(data=np.arange(4 * n).reshape(n, 4),
                                 columns=pd.MultiIndex.from_product(iterables=[["Data1", "Data2"], ["A", "B"]])),
    kernels=[
        lambda df: simple_loop(df.copy()),
        lambda df: mul_with_stack(df.copy()),
        lambda df: xs_concat(df.copy()),
        lambda df: prod_join(df.copy())
    ],
    labels=['simple_loop', 'stack_and_multiply', 'xs_concat', 'prod_join'],
    n_range=[2 ** k for k in range(3, 20)],
    equality_check=np.allclose,
    xlabel="len(df)"
)
Here's a way to do it with stack and unstack. The advantage: fully vectorized, no loops, no join operations.
t = df.stack(level=0)
t["A*B"] = t.A * t.B
t = t.stack().swaplevel().unstack(level=[2,1])
The output is:
Data1 Data2
A B A*B A B A*B
0 0 1 0 2 3 6
1 4 5 20 6 7 42
2 8 9 72 10 11 110
3 12 13 156 14 15 210
4 16 17 272 18 19 342
Another alternative here, using prod:
df[("Data1", "A*B")] = df.loc(axis=1)["Data1"].prod(axis=1)
df[("Data2", "A*B")] = df.loc(axis=1)["Data2"].prod(axis=1)
df
Data1 Data2 Data1 Data2
A B A B A*B A*B
0 0 1 2 3 0 6
1 4 5 6 7 20 42
2 8 9 10 11 72 110
3 12 13 14 15 156 210
4 16 17 18 19 272 342
5 20 21 22 23 420 506
6 24 25 26 27 600 702
7 28 29 30 31 812 930
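As a further sketch (not from the answers above), the per-group product can be computed without naming "Data1" and "Data2" explicitly, by grouping on the first column level. The transpose dance is used because grouping along the column axis is deprecated in recent pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(8, 4),
                  columns=pd.MultiIndex.from_product([["Data1", "Data2"], ["A", "B"]]))

# Product of the columns within each level-0 group (Data1, Data2).
prod = df.T.groupby(level=0).prod().T
prod.columns = pd.MultiIndex.from_product([prod.columns, ["A*B"]])
out = df.join(prod)
print(out[("Data1", "A*B")].tolist())  # [0, 20, 72, 156, 272, 420, 600, 812]
```

This scales to any number of DataX groups without touching the code.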

Rate based on a few conditions

I want to add a rate based on conditions in a few columns: +1 if A > 30, +1 if B > 50, and +1 if C > 80; D doesn't matter.
For example, I have a matrix (DataFrame):
A B C D
0 21 32 84 43 # 0 + 0 + 1
1 79 29 42 63 # 1 + 0 + 0
2 31 38 6 52 # 1 + 0 + 0
3 92 54 79 75 # 1 + 1 + 0
4 9 14 87 85 # 0 + 0 + 1
What I tried:
In [1]: import numpy as np
In [2]: import pandas as pd
In [36]: df = pd.DataFrame(
np.random.randint(0,100,size=(5, 4)),
columns=list('ABCD')
)
In [36]: df
Out[36]:
A B C D
0 21 32 84 43
1 79 29 42 63
2 31 38 6 52
3 92 54 79 75
4 9 14 87 8
Create a Series (df['A'] > 30) for each condition, concatenate them into a frame, and sum across the rows:
In [37]: df['R'] = pd.concat(
[(df['A'] > 30), (df['B'] > 50), (df['C'] > 80)], axis=1
).sum(axis=1)
In [38]: df
Out[38]:
A B C D R
0 21 32 84 43 1
1 79 29 42 63 1
2 31 38 6 52 1
3 92 54 79 75 2
4 9 14 87 85 1
The result is as I expected, but maybe there is a simpler way?
You can just do this:
df['R'] = (df.iloc[:,:3]>[30, 50, 80]).sum(axis=1)
the same solution using column names
df['R'] = (df[['A','B','C']]>[30, 50, 80]).sum(axis=1)
How about
df["R"] = (
(df["A"] > 30).astype(int) +
(df["B"] > 50).astype(int) +
(df["C"] > 80).astype(int)
)
You can also try this. Not sure if it is any better.
>>> df
A B C D
0 8 47 95 52
1 90 84 39 80
2 15 52 37 79
3 99 24 76 5
4 93 4 97 0
>>> df.apply(lambda x: int(x[0] > 30) + int(x[1] > 50) + int(x[2] > 80) , axis=1)
0 1
1 2
2 1
3 1
4 2
dtype: int64
>>> df.agg(lambda x: int(x[0] > 30) + int(x[1] > 50) + int(x[2] > 80) , axis=1)
0 1
1 2
2 1
3 1
4 2
dtype: int64
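A variant worth sketching: keeping the thresholds in a Series makes the rule data-driven. pandas.DataFrame.gt aligns the Series index with the column labels, so a column without a threshold (like D) compares against NaN and never counts:

```python
import pandas as pd

df = pd.DataFrame({'A': [21, 79, 31, 92, 9],
                   'B': [32, 29, 38, 54, 14],
                   'C': [84, 42, 6, 79, 87],
                   'D': [43, 63, 52, 75, 85]})

thresholds = pd.Series({'A': 30, 'B': 50, 'C': 80})
# gt aligns on column labels; D has no threshold, compares against NaN, and is always False.
df['R'] = df.gt(thresholds).sum(axis=1)
print(df['R'].tolist())  # [1, 1, 1, 2, 1]
```

Changing a threshold or adding a new rated column then only touches the Series, not the expression.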

Why was my dataframe column changed?

My code
import pandas as pd
import numpy as np
series = pd.read_csv('o1.csv', header=0)
s1 = series
s2 = series
s1['userID'] = series['userID'] + 5
s1['adID'] = series['adID'] + 3
s2['userID'] = s1['userID'] + 5
s2['adID'] = series['adID'] + 4
r1=series.append(s1)
r2=r1.append(s2)
print(r2)
Something went wrong; now the columns are all exactly the same.
Output
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
3 12 f 107 50
4 12 f 108 100
5 13 m 109 62
6 13 m 114 28
7 13 m 108 36
8 12 f 109 74
9 12 f 114 100
10 14 m 108 62
11 14 m 109 28
12 15 f 116 50
13 15 f 117 100
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
I didn't want my series columns to be changed.
Why did it happen?
How can I fix this?
Do I need to use iloc?
IIUC, you need copy if you want a new DataFrame object:
s1 = series.copy()
s2 = series.copy()
Sample:
print (df)
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
s1 = df.copy()
s2 = df.copy()
s1['userID'] = df['userID'] + 5
s1['adID'] = df['adID'] + 3
s2['userID'] = s1['userID'] + 5
s2['adID'] = df['adID'] + 4
r1=df.append(s1)
r2=r1.append(s2)
print(r2)
userID gender adID rating
0 11 m 107 50
1 11 m 108 100
2 11 m 109 0
0 16 m 110 50
1 16 m 111 100
2 16 m 112 0
0 21 m 111 50
1 21 m 112 100
2 21 m 113 0
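To see why the original code misbehaved: s1 = series binds a second name to the same object rather than copying the data. A minimal sketch of the difference between an alias and a copy:

```python
import pandas as pd

df = pd.DataFrame({'userID': [11, 12], 'adID': [107, 108]})
alias = df          # same object; no data is copied
copied = df.copy()  # independent object with its own data

alias['userID'] = alias['userID'] + 5
print(df['userID'].tolist())      # [16, 17] -- df changed through the alias
print(copied['userID'].tolist())  # [11, 12] -- the copy is untouched
print(alias is df, copied is df)  # True False
```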

Create a table where each value is the sum of all numbers between the row and column indices, using functions

Prompt the user for two values, A and B. Produce an AxB table where each cell holds the sum of all numbers between its row index and its column index (counting from the smaller to the larger). Create a function to generate each line and a function to print all lines.
I solved this problem using the "if" operator, but I don't know how to solve it using functions. My way:
a = int(input("Enter A:"))
b = int(input("Enter B:"))
k = 0
for i in range(1, a + 1):
    for j in range(1, b + 1):
        if i != j:
            k = 0
            if j < i:
                m = j
                ma = i
            else:
                m = i
                ma = j
            for m in range(m, ma + 1):
                k += m
            print(k, end=' ')
        else:
            print(i, end=' ')
    print()
Simple sample:
Enter A: 3
Enter B: 4
Result should be:
1 3 6 10
3 2 5 9
6 5 3 7
This is probably what you would want
from itertools import product
from math import log10

def foo(row, col):
    table = [[0] * col for _ in range(row)]
    for i, j in product(range(row), range(col)):
        table[i][j] = sum(range(i + 1, j + 2)) if i < j else sum(range(j + 1, i + 2))
    # Width of the largest possible cell value, plus padding for centering.
    _max = max(row, col)
    _max = int(log10(_max * (_max + 1) / 2)) + 3
    formatstr = ("{{:^{}}}".format(_max)) * col
    for row in table:
        print(formatstr.format(*row))
>>> foo(3,4)
1 3 6 10
3 2 5 9
6 5 3 7
>>> foo(10,10)
1 3 6 10 15 21 28 36 45 55
3 2 5 9 14 20 27 35 44 54
6 5 3 7 12 18 25 33 42 52
10 9 7 4 9 15 22 30 39 49
15 14 12 9 5 11 18 26 35 45
21 20 18 15 11 6 13 21 30 40
28 27 25 22 18 13 7 15 24 34
36 35 33 30 26 21 15 8 17 27
45 44 42 39 35 30 24 17 9 19
55 54 52 49 45 40 34 27 19 10
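Since the cell at (i, j) is just a difference of two triangular numbers, T(max) - T(min - 1) with T(n) = n(n + 1)/2, the whole table can also be built in closed form with NumPy broadcasting. A sketch (sum_table is a hypothetical name):

```python
import numpy as np

def sum_table(a, b):
    i = np.arange(1, a + 1)[:, None]  # row indices as a column vector
    j = np.arange(1, b + 1)[None, :]  # column indices as a row vector
    hi = np.maximum(i, j)
    lo = np.minimum(i, j)
    # Sum of lo..hi inclusive = T(hi) - T(lo - 1), where T(n) = n(n+1)/2
    return hi * (hi + 1) // 2 - lo * (lo - 1) // 2

print(sum_table(3, 4))
```

On the diagonal hi == lo, so the formula reduces to the index itself, matching the expected output above.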
