I am trying to prepare data for some time-series modeling with Python Pandas (first timer). My DataFrame looks like this:
df = pd.DataFrame({
'time': [0, 1, 2, 3, 4],
'colA': ['a', 'b', 'c', 'd', 'e'],
'colB': ['v', 'w', 'x', 'y', 'z'],
'value' : [10, 11, 12, 13, 14]
})
# time colA colB value
# 0 0 a v 10
# 1 1 b w 11
# 2 2 c x 12
# 3 3 d y 13
# 4 4 e z 14
Is there a combination of functions that could transform it into the following format?
# colA-2 colA-1 colA colB-2 colB-1 colB value
# _ _ a _ _ v 10
# _ a b _ v w 11
# a b c v w x 12
# b c d w x y 13
# c d e x y z 14
I am very new to Python/Pandas and I do not have any concrete code/results that got me even close to what I need...
You can use the shift function; here '-' is used as the placeholder (substitute '_' to match your expected output exactly):
df['colA-2'] = df['colA'].shift(2, fill_value='-')
df['colA-1'] = df['colA'].shift(1, fill_value='-')
...
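If you need more lags, a small loop avoids repeating the shift calls. A minimal sketch, assuming the same '-' placeholder and a configurable n_lags (my name, not from the question):
import pandas as pd

df = pd.DataFrame({
    'time': [0, 1, 2, 3, 4],
    'colA': ['a', 'b', 'c', 'd', 'e'],
    'colB': ['v', 'w', 'x', 'y', 'z'],
    'value': [10, 11, 12, 13, 14],
})

n_lags = 2  # assumed parameter: how many lagged copies of each column to add
for col in ['colA', 'colB']:
    for i in range(n_lags, 0, -1):
        # shift(i) moves values down i rows; fill_value pads the new top rows
        df[f'{col}-{i}'] = df[col].shift(i, fill_value='-')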
I'd use pd.concat
pd.concat([
df[['colA', 'colB']].shift(i).add_suffix(f'-{i}')
for i in range(1, 3)], axis=1
).fillna('-').join(df)
colA-1 colB-1 colA-2 colB-2 time colA colB value
0 - - - - 0 a v 10
1 a v - - 1 b w 11
2 b w a v 2 c x 12
3 c x b w 3 d y 13
4 d y c x 4 e z 14
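If you also need the exact column order from the question (which drops 'time'), one sketch is to reorder after the join; the list below is just the layout you asked for:
out = pd.concat([
    df[['colA', 'colB']].shift(i).add_suffix(f'-{i}')
    for i in range(1, 3)], axis=1
).fillna('-').join(df)
# select columns in the requested order
out = out[['colA-2', 'colA-1', 'colA', 'colB-2', 'colB-1', 'colB', 'value']]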
My dataframe:
df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 10 f
6 20 g
7 20 h
I don't want consecutive rows with col_1 = 10. Instead, the row below a repeating 10 should move up by one (in this case, index 6 should become index 5 and vice versa), so the order is always 10, 20, 10, 20, ...
My current solution:
for idx, row in df.iterrows():
    if row['col_1'] == 10 and df.iloc[idx + 1]['col_1'] != 20:
        df = df.rename({idx + 1: idx + 2, idx + 2: idx + 1})
        df = df.sort_index()
df
gives me:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 20 g
6 10 f
7 20 h
which is what I want but it is very slow (2.34s for a dataframe with just over 8000 rows).
Is there a way to avoid loop here?
Thanks
You can use a custom key in sort_values with groupby.cumcount:
df.sort_values(by='col_1', kind='stable', key=lambda s: df.groupby(s).cumcount())
Output:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
6 20 g
5 10 f
7 20 h
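To see why this works: the key replaces each col_1 value by its occurrence number within its group, and the stable sort then interleaves the groups. A quick check of what the key computes (not part of the answer above):
# occurrence number of each row within its col_1 group
print(df.groupby(df['col_1']).cumcount().tolist())
# [0, 0, 1, 1, 2, 3, 2, 3]
# stable-sorting by these ranks yields 10, 20, 10, 20, ... as required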
I have the following pandas data:
df = {'ID_1': [1,1,1,2,2,3,4,4,4,4],
'ID_2': ['a', 'b', 'c', 'f', 'g', 'd', 'v', 'x', 'y', 'z']
}
df = pd.DataFrame(df)
display(df)
ID_1 ID_2
1 a
1 b
1 c
2 f
2 g
3 d
4 v
4 x
4 y
4 z
For each ID_1, I need to find the combinations (order doesn't matter) of the ID_2 values. For example:
When ID_1 = 1, the combinations are ab, ac, bc.
When ID_1 = 2, the combination is fg.
Note: if the frequency of ID_1 is less than 2, there is no combination (see ID_1 = 3, for example).
Finally, I need to store the combination results in df2, with one pair per row.
One way using itertools.combinations:
from itertools import combinations
def comb_df(ser):
    # combinations(ser, 2) yields every unordered pair; groups with fewer
    # than two rows (e.g. ID_1 = 3) yield nothing and drop out automatically
    return pd.DataFrame(list(combinations(ser, 2)), columns=["from", "to"])
new_df = df.groupby("ID_1")["ID_2"].apply(comb_df).reset_index(drop=True)
Output:
from to
0 a b
1 a c
2 b c
3 f g
4 v x
5 v y
6 v z
7 x y
8 x z
9 y z
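If you also want to keep track of which ID_1 each pair came from, a variant (my sketch, not part of the answer above) resets only the outer level of the group index instead of dropping it:
new_df = (
    df.groupby("ID_1")["ID_2"]
      .apply(comb_df)
      .reset_index(level=0)   # ID_1 becomes a regular column
      .reset_index(drop=True)
)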
For each row, I am computing values and storing them in a dictionary. I want to be able to take the dictionary and add it to the row where the keys are columns.
For example:
Dataframe
A B C
1 2 3
Dictionary:
{
'D': 4,
'E': 5
}
Result:
A B C D E
1 2 3 4 5
There will be more than one row in the dataframe, and for each row I'm computing a dictionary that might not necessarily have the same exact keys.
I ended up doing this to get it to work:
applied_df = df.apply(lambda row: func(row['a']), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')

def func(value):
    ...
    return pd.Series(dictionary)
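For reference, a self-contained sketch of that pattern; func, the column name 'a', and the computed keys are placeholders, but it shows that rows whose dictionaries lack a key simply get NaN:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

def func(value):
    # hypothetical per-row computation; keys may differ between rows
    d = {'D': value * 10, 'E': value * 100}
    if value > 1:
        d['F'] = value  # a key only some rows produce
    return pd.Series(d)

applied_df = df.apply(lambda row: func(row['a']), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
print(df)
#    a   D    E    F
# 0  1  10  100  NaN
# 1  2  20  200  2.0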
If you want the dict values to appear in each row of the original dataframe, use:
d = {
'D': 4,
'E': 5
}
df_result = df.join(df.apply(lambda x: pd.Series(d), axis=1))
Demo
Data Input:
df
A B C
0 1 2 3
1 11 12 13
Output:
df_result = df.join(df.apply(lambda x: pd.Series(d), axis=1))
A B C D E
0 1 2 3 4 5
1 11 12 13 4 5
If you just want the dict to appear in the first row of the original dataframe, use:
d = {
'D': 4,
'E': 5
}
df_result = df.join(pd.Series(d).to_frame().T)
A B C D E
0 1 2 3 4.0 5.0
1 11 12 13 NaN NaN
Simply use a for loop over your dictionary and assign the values.
df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3]])
# You can test with df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3], [8,0,33]]), too.
d = {
'D': 4,
'E': 5
}
for k, v in d.items():
    df[k] = v
print(df)
Output:
   A  B  C  D  E
0  1  2  3  4  5
I have this Pandas DataFrame df:
column1 column2
0 x a
1 x b
2 x c
3 y d
4 y e
5 y f
6 y g
7 z h
8 z i
9 z j
How do I group the values in column2 according to the value in column1?
Expected output:
x y z
0 a d h
1 b e i
2 c f j
3 g
I'm new to Pandas, I'd really appreciate your help.
This is a pivot problem with some preprocessing work:
(df.assign(index=df.groupby('column1').cumcount())
   .pivot(index='index', columns='column1', values='column2'))
column1 x y z
index
0 a d h
1 b e i
2 c f j
3 NaN g NaN
We're pivoting using "column1" as the header and "column2" as the values. To make pivoting possible, we need a 3rd column which identifies the uniqueness of the values being pivoted, so we build that with groupby and cumcount.
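To make that intermediate step concrete, here is what the helper column looks like before the pivot (a quick check, not part of the original answer):
print(df.assign(index=df.groupby('column1').cumcount()))
#   column1 column2  index
# 0       x       a      0
# 1       x       b      1
# 2       x       c      2
# 3       y       d      0
# 4       y       e      1
# 5       y       f      2
# 6       y       g      3
# 7       z       h      0
# 8       z       i      1
# 9       z       j      2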
Some extra code is needed here because each column in the solution (res) dataframe has a different length.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame(data = {'column1' : ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z'], 'column2' : ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
print(df)
new_columns = df['column1'].unique().tolist()  # ['x', 'y', 'z']
res = pd.DataFrame(columns=new_columns)
res[new_columns[0]] = df[df['column1'] == new_columns[0]]['column2']  # adding first column 'x'
for new_column in new_columns[1:]:
    new_col_ser = df[df['column1'] == new_column]['column2']
    no_of_rows_to_add = len(new_col_ser) - len(res)
    for i in range(no_of_rows_to_add):
        res.loc[len(res) + 1, :] = np.nan  # pad res with a NaN row
    # assign positionally to the first len(new_col_ser) rows; .to_numpy()
    # avoids index alignment and the chained-assignment warning
    res.loc[res.index[:len(new_col_ser)], new_column] = new_col_ser.to_numpy()
print(res)
Output:
column1 column2
0 x a
1 x b
2 x c
3 y d
4 y e
5 y f
6 y g
7 z h
8 z i
9 z j
x y z
0 a d h
1 b e i
2 c f j
4 NaN g NaN
I have a dataframe which looks like this:
pd.DataFrame({'a':['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
'b':['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
'c':[20, 5, 12, 8, 15, 10, 25, 13]})
a b c
0 A Y 20
1 B Y 5
2 B N 12
3 C Y 8
4 C Y 15
5 D N 10
6 D N 25
7 E N 13
I would like to group by column 'a', check whether any value of column 'b' is 'Y' (i.e. True) and keep that value, and then just sum on 'c'.
The resulting dataframe should look like this:
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
I tried the below but get an error:
df.groupby('a')['b'].max()['c'].sum()
You can use agg with max and sum. Taking the max of column 'b' works because the string comparison 'Y' > 'N' evaluates to True:
print(df.groupby('a', as_index=False).agg({'b': 'max', 'c': 'sum'}))
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
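If you'd rather not rely on lexicographic ordering of the strings, an equivalent variant (my sketch, not from the answer above) tests for 'Y' explicitly with named aggregation:
out = df.groupby('a', as_index=False).agg(
    b=('b', lambda s: 'Y' if s.eq('Y').any() else 'N'),  # any 'Y' wins
    c=('c', 'sum'),
)
print(out)  # same result as above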