Basically I want to just flatten the list columns (maybe "flatten" is not the right term).
For example, given this dataframe:
A B C
0 1 [1,2] [1, 10]
1 2 [2, 14] [2, 18]
I want to get the output of:
A B1 B2 B3 B4
0 1 1 2 1 10
1 2 2 14 2 18
I've tried:
print(pd.DataFrame(df.values.flatten().tolist(), columns=['%sG'%i for i in range(6)], index=df.index))
But that didn't give the right result.
Hope you get what I mean :)
A general solution that also works if the lists have different lengths:
df1 = pd.DataFrame(df['B'].values.tolist())
df2 = pd.DataFrame(df['C'].values.tolist())
df = pd.concat([df[['A']], df1, df2], axis=1)
df.columns = [df.columns[0]] + [f'B{i+1}' for i in range(len(df.columns)-1)]
print (df)
A B1 B2 B3 B4
0 1 1 2 1 10
1 2 2 14 2 18
If same size:
df1 = pd.DataFrame(np.array(df[['B','C']].values.tolist()).reshape(len(df),-1))
df1.columns = [f'B{i+1}' for i in range(len(df1.columns))]
df1.insert(0, 'A', df['A'])
print (df1)
A B1 B2 B3 B4
0 1 1 2 1 10
1 2 2 14 2 18
In more recent pandas versions you can use explode:
>>> x = df.select_dtypes(exclude=list).join(df.select_dtypes(list).apply(pd.Series.explode, axis=1))
>>> x.columns = x.columns + x.columns.to_series().groupby(level=0).cumcount().add(1).astype(str)
>>> x
A1 B1 B2 C1 C2
0 1 1 2 1 10
1 2 2 14 2 18
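Putting the pieces above together, a self-contained sketch of the general approach (each list column expanded into its own frame, then concatenated back next to A):

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [2, 14]], 'C': [[1, 10], [2, 18]]})

# Expand each list column into a frame of scalar columns
parts = [pd.DataFrame(df[c].tolist(), index=df.index) for c in ['B', 'C']]

# Concatenate the scalar column A with the expanded parts
out = pd.concat([df[['A']]] + parts, axis=1)
out.columns = ['A'] + [f'B{i+1}' for i in range(len(out.columns) - 1)]
print(out)
```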
I have the following dataframe with two columns c1 and c2. I want to add a new column c3 based on the logic below. What I have works but is slow; can anyone suggest a way to vectorize this?
The rows must be grouped by c1 and c2. For each group, c3 is populated sequentially from values, keyed by the value of c1, with each "sub group" taking the next element in order; in other words, values[value_of_c1][idx], where idx is the position of the "sub group". Example below:
For the first group (1, 'a'): c1 is 1 and the "sub group" "a" has index 0 (first sub group of 1), so c3 for all rows in this group is values[1][0].
For the second group (1, 'b'): c1 is still 1 but the "sub group" is "b", so the index is 1 (second sub group of 1), and c3 for all rows in this group is values[1][1].
For the third group (2, 'y'): c1 is now 2, the "sub group" is "y", and the index is 0 (first sub group of 2), so c3 for all rows in this group is values[2][0].
And so on.
values will have the necessary elements to satisfy this logic.
Code
import pandas as pd
df = pd.DataFrame(
    {
        "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
        "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
    }
)
new_df = pd.DataFrame()
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}
for i, j in df.groupby("c1"):
    for idx, (k, l) in enumerate(j.groupby("c2")):
        l["c3"] = values[i][idx]
        new_df = new_df.append(l)  # note: DataFrame.append was removed in pandas 2.x
Output (works but my code is slow)
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
If you don't mind using another library, you basically need to label encode within your groups:
from sklearn.preprocessing import LabelEncoder
def le(x):
    return pd.DataFrame(LabelEncoder().fit_transform(x), index=x.index)
df['idx'] = df.groupby('c1')['c2'].apply(le)
df['c3'] = df.apply(lambda x:values[x['c1']][x['idx']],axis=1)
c1 c2 idx c3
0 1 a 0 a1
1 1 a 0 a1
2 1 a 0 a1
3 1 b 1 a2
4 1 b 1 a2
5 1 b 1 a2
6 2 y 0 b1
7 2 y 0 b1
8 2 y 0 b1
9 2 z 1 b2
10 2 z 1 b2
11 2 z 1 b2
Otherwise it's a matter of using pd.Categorical: same concept as above, except that within each group you convert the column to a category and pull out the codes:
def le(x):
    return pd.DataFrame(pd.Categorical(x).codes, index=x.index)
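A sketch of the same idea without any extra library, using pd.factorize inside a groupby transform to number the c2 sub-groups per c1 group (an assumed-equivalent alternative, not the answer's exact code):

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
})
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}

# Number each distinct c2 value within its c1 group (0, 1, ...)
idx = df.groupby("c1")["c2"].transform(lambda s: pd.factorize(s)[0])

# Look up c3 from values using (c1, sub-group index)
df["c3"] = [values[c1][i] for c1, i in zip(df["c1"], idx)]
print(df)
```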
In [203]: a = pd.DataFrame([[k, value, idx] for k,v in values.items() for idx,value in enumerate(v)], columns=['c1', 'c3', 'gr'])
...: b = df.assign(gr=df.groupby(['c1']).transform(lambda x: (x.ne(x.shift()).cumsum())- 1))
...: print(b)
...: b.merge(a).drop(columns='gr')
...:
# b
c1 c2 gr
0 1 a 0
1 1 a 0
2 1 a 0
3 1 b 1
4 1 b 1
5 1 b 1
6 2 y 0
7 2 y 0
8 2 y 0
9 2 z 1
10 2 z 1
11 2 z 1
Out[203]:
c1 c2 c3
0 1 a a1
1 1 a a1
2 1 a a1
3 1 b a2
4 1 b a2
5 1 b a2
6 2 y b1
7 2 y b1
8 2 y b1
9 2 z b2
10 2 z b2
11 2 z b2
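The console snippet above can be written as a self-contained script. This sketch builds the (c1, sub-group index) → c3 lookup table and merges it back, assuming each c2 sub-group appears as a contiguous run, as in the example:

```python
import pandas as pd

df = pd.DataFrame({
    "c1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "c2": ["a", "a", "a", "b", "b", "b", "y", "y", "y", "z", "z", "z"],
})
values = {1: ["a1", "a2"], 2: ["b1", "b2"]}

# Lookup table: one row per (c1, sub-group index, c3 value)
lookup = pd.DataFrame(
    [(k, idx, v) for k, vs in values.items() for idx, v in enumerate(vs)],
    columns=["c1", "gr", "c3"],
)

# Number each contiguous c2 run within its c1 group, then merge the lookup in
b = df.assign(gr=df.groupby("c1")["c2"].transform(lambda s: s.ne(s.shift()).cumsum() - 1))
out = b.merge(lookup, on=["c1", "gr"]).drop(columns="gr")
print(out)
```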
I want to create a data frame as below.
C1 C2 C3 C4
1 1 1 1
1 2 2 2
1 2 2 3
1 2 3 4
1 2 3 5
2 3 4 6
3 4 5 7
The C4 column should contain unique values. Each C4 value should belong to exactly one C3 value, each C3 value to exactly one C2 value, and each C2 value to exactly one C1 value. The column values should be C1 < C2 < C3 < C4. The values may be random.
I used the sample Python code below.
import pandas as pd
import numpy as np
C1 = [1, 2]
C2 = [1, 2, 3, 4]
C3 = [1, 2, 3, 4, 5]
C4 = [1, 2, 3, 4, 5, 6, 7]
Z = [C1,C2,C3,C4]
n = max(len(x) for x in Z)
a = [np.hstack((np.random.choice(x, n - len(x)), x)) for x in Z]
df = pd.DataFrame(a, index=['C1', 'C2', 'C3','C4']).T.sample(frac=1)
print (df)
Below is my output.
C1 C2 C3 C4
1 4 2 2
1 3 4 6
2 1 2 4
1 2 5 1
2 4 5 7
1 2 3 5
2 2 1 3
But I couldn't get the output to follow my logic. For example, the value 2 in the C3 column belongs to both 1 and 4 of the C2 column, and the value 2 in the C2 column belongs to both 1 and 2 of the C1 column. Please guide me to get output matching my logic. Thanks.
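One way to read the requirement is that every value in a lower-level column has exactly one parent in the column to its left. A hedged sketch along those lines (the parent_c3/parent_c2/parent_c1 mapping names are illustrative, and the numeric-ordering constraint is not enforced here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

C1 = [1, 2]
C2 = [1, 2, 3, 4]
C3 = [1, 2, 3, 4, 5]
C4 = [1, 2, 3, 4, 5, 6, 7]

# Each value gets exactly one random parent in the next column up,
# so membership is consistent across the whole frame.
parent_c3 = {c4: rng.choice(C3) for c4 in C4}   # C4 -> C3
parent_c2 = {c3: rng.choice(C2) for c3 in C3}   # C3 -> C2
parent_c1 = {c2: rng.choice(C1) for c2 in C2}   # C2 -> C1

df = pd.DataFrame({"C4": C4})
df["C3"] = df["C4"].map(parent_c3)
df["C2"] = df["C3"].map(parent_c2)
df["C1"] = df["C2"].map(parent_c1)
df = df[["C1", "C2", "C3", "C4"]]
print(df)
```

Because each column is derived by mapping the column to its right through a single dictionary, a value like 2 in C3 can no longer appear under two different C2 parents.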
I am new to Python and am developing a script.
I want to search for a word in a column and, if a match is found, insert an empty row below that row.
My code so far is below:
if df.columnname == 'total':
    df.insert
Could someone please help me?
Do give the following a try:
>>> df
id Label
0 1 A
1 2 B
2 3 B
3 4 B
4 5 A
5 6 B
6 7 A
7 8 A
8 9 C
9 10 C
10 11 C
# Create a separate dataframe with the id of the rows to be duplicated
df1 = df.loc[df['Label']=='B', 'id']
# Join it back, sort and reset the index
df = pd.concat([df, df1.to_frame()]).sort_index(kind='stable').reset_index(drop=True)
>>> df
id Label
0 1 A
1 2 B
2 2 NaN
3 3 B
4 3 NaN
5 4 B
6 4 NaN
7 5 A
8 6 B
9 6 NaN
10 7 A
11 8 A
12 9 C
13 10 C
14 11 C
Use the code below:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Column1': ['A0', 'total', 'total', 'A3'],
                    'Column2': ['B0', 'B1', 'B2', 'B3'],
                    'Column3': ['C0', 'C1', 'C2', 'C3'],
                    'Column4': ['D0', 'D1', 'D2', 'D3']}, index=[0, 1, 2, 3])
count = 0
for index, row in df1.iterrows():
    if row["Column1"] == 'total':
        # insert a blank row right after each 'total' row
        df1 = pd.DataFrame(np.insert(df1.values, index + 1 + count,
                                     values=[" "] * len(df1.columns), axis=0),
                           columns=df1.columns)
        count += 1
print(df1)
Input:
Column1 Column2 Column3 Column4
0 A0 B0 C0 D0
1 total B1 C1 D1
2 total B2 C2 D2
3 A3 B3 C3 D3
Output:
Column1 Column2 Column3 Column4
0 A0 B0 C0 D0
1 total B1 C1 D1
2
3 total B2 C2 D2
4
5 A3 B3 C3 D3
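An alternative sketch that avoids the iterrows loop: give the blank rows fractional index labels so they sort directly below their matching 'total' rows (the 0.5 offset is just an implementation trick):

```python
import pandas as pd

df1 = pd.DataFrame({'Column1': ['A0', 'total', 'total', 'A3'],
                    'Column2': ['B0', 'B1', 'B2', 'B3'],
                    'Column3': ['C0', 'C1', 'C2', 'C3'],
                    'Column4': ['D0', 'D1', 'D2', 'D3']})

# Copy the matching rows, blank them out, and shift their index by 0.5
mask = df1['Column1'].eq('total')
blanks = df1[mask].copy()
blanks.loc[:, :] = ' '
blanks.index = blanks.index + 0.5

# Sorting interleaves each blank row just below its source row
out = pd.concat([df1, blanks]).sort_index().reset_index(drop=True)
print(out)
```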
Related to the question below, I would like to count the number of following rows.
Thanks to that answer, I could handle the data.
But I ran into some trouble and an exception.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to cut df at each row where B starts with "a":
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
I would like to count each df's rows
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How could this be done?
I would be happy if someone could tell me how to handle this kind of problem.
You can use aggregation by a custom Series created with cumsum:
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print (df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1
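The steps above can be combined into one self-contained script, assuming the example data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7],
                   'B': ['a0', 'a1', 'b1', 'a0', 'b2', 'a2', 'a2']})

# Each row whose B starts with "a" opens a new block; cumsum labels the blocks
block = df['B'].str.startswith('a').cumsum()

# For each block, keep its first value and its size
out = df['B'].groupby(block).agg(['first', 'size'])
out.columns = ['"A"', 'number']
out.index.name = None
print(out)
```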
Given these two pandas data frames:
>>> df1 = pd.DataFrame({'c1':['a','b','c','d'], 'c2':['x','y','y','x']})
c1 c2
0 a x
1 b y
2 c y
3 d x
>>> df2 = pd.DataFrame({'c1':['d','c','a','b'], 'val1':[12,31,14,34], 'val2':[4,3,1,2]})
c1 val1 val2
0 d 12 4
1 c 31 3
2 a 14 1
3 b 34 2
I'd like to append the columns val1 and val2 of df2 to the data frame df1, taking into account the elements in c1. The updated df1 would then look like:
>>> df1
c1 c2 val1 val2
0 a x 14 1
1 b y 34 2
2 c y 31 3
3 d x 12 4
I thought of using a combination of set_index and update:
df1.set_index('c1').update(df2.set_index('c1')), but it didn't work...
You could use pd.merge:
import pandas as pd
df1 = pd.DataFrame({'c1':['a','b','c','d'], 'c2':['x','y','y','x']})
df2 = pd.DataFrame({'c1':['d','c','a','b'], 'val1':[12,31,14,34], 'val2':[4,3,1,2]})
df1 = pd.merge(df1, df2, on=['c1'])
print(df1)
yields
c1 c2 val1 val2
0 a x 14 1
1 b y 34 2
2 c y 31 3
3 d x 12 4
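If you want to be certain the result keeps df1's row order, an index-aligned join is a possible alternative sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'c1': ['a', 'b', 'c', 'd'], 'c2': ['x', 'y', 'y', 'x']})
df2 = pd.DataFrame({'c1': ['d', 'c', 'a', 'b'], 'val1': [12, 31, 14, 34], 'val2': [4, 3, 1, 2]})

# Align df2 on c1 via its index, then join onto df1's c1 column;
# df1's row order is preserved by construction
out = df1.join(df2.set_index('c1'), on='c1')
print(out)
```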