Example:

import pandas as pd

test = {
    't': [0, 1, 2, 3, 4, 5],
    'A': [1, 1, 1, 2, 2, 2],
    'B': [9, 9, 9, 9, 8, 8],
    'C': [1, 2, 3, 4, 5, 6]
}
df = pd.DataFrame(test)
df
I tried using a sliding window and concat:

window_size = 2
for row_idx in range(df.shape[0] - window_size):
    print(
        pd.concat(
            [df.iloc[[row_idx]],
             df.loc[:, df.columns != 't'].iloc[[row_idx + window_size - 1]],
             df.loc[:, df.columns != 't'].iloc[[row_idx + window_size]]],
            axis=1
        )
    )
But I get the wrong dataframe (the pieces end up misaligned instead of combined into single rows).
Is it possible to use a sliding window to concat data?
pd.concat aligns on the indices, so you have to make sure they match. You could try the following:
window_size = 2
dfs = []
for n in range(window_size + 1):
    sdf = df.iloc[n:df.shape[0] - window_size + n]
    if n > 0:
        sdf = (
            sdf.drop(columns="t").rename(columns=lambda c: f"{c}_{n}")
            .reset_index(drop=True)
        )
    dfs.append(sdf)
res = pd.concat(dfs, axis=1)
Result for the sample:
t A B C A_1 B_1 C_1 A_2 B_2 C_2
0 0 1 9 1 1 9 2 1 9 3
1 1 1 9 2 1 9 3 2 9 4
2 2 1 9 3 2 9 4 2 8 5
3 3 2 9 4 2 8 5 2 8 6
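The same table can also be built without an explicit Python loop by shifting the frame. A sketch using the sample df from the question (shift introduces NaNs in the last rows, so the result is trimmed and cast back to int):

```python
import pandas as pd

df = pd.DataFrame({
    't': [0, 1, 2, 3, 4, 5],
    'A': [1, 1, 1, 2, 2, 2],
    'B': [9, 9, 9, 9, 8, 8],
    'C': [1, 2, 3, 4, 5, 6],
})

window_size = 2
parts = [df]
for n in range(1, window_size + 1):
    # shift(-n) pulls row i+n up to row i; drop 't' and suffix the columns
    parts.append(df.drop(columns='t').shift(-n).add_suffix(f'_{n}'))

# the last `window_size` rows contain NaNs from the shift, so trim and cast back
res = pd.concat(parts, axis=1).iloc[:len(df) - window_size].astype(int)
```

Because every part keeps the original index, concat aligns the rows correctly without any reset_index.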
Have a look at this example below:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df4 = pd.DataFrame([['bird', 'polly'], ['monkey','george']],
columns=['animal', 'name'])
pd.concat([df1, df4], axis=1)
# Returns the following output
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
It was taken from the following pandas doc.
I want to shift some columns in the middle of the dataframe to the rightmost position.
I can do this with an individual column using:
cols=list(df.columns.values)
cols.pop(cols.index('one_column'))
df=df[cols +['one_column']]
df
But it's inefficient to do this individually when there are 100 such columns in two series, i.e. series1_1 ... series1_50 and series2_1 ... series2_50, in the middle of the dataframe.
How can I do it by assigning the 2 series as lists, popping them and putting them back? Maybe something like
cols=list(df.columns.values)
series1 = list(df.loc['series1_1':'series1_50'])
series2 = list(df.loc['series2_1':'series2_50'])
cols.pop('series1', 'series2')
df=df[cols +['series1', 'series2']]
but this didn't work. Thanks
If you just want to shift the columns, you could call concat like this:
cols_to_shift = ['colA', 'colB']
pd.concat([
    df[df.columns.difference(cols_to_shift)],
    df[cols_to_shift]
], axis=1)
Or, you could do a little list manipulation on the columns.
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
Minimal Example
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 10, (3, 5)), columns=list('ABCDE'))
df
A B C D E
0 6 1 4 4 8
1 4 6 3 5 8
2 7 9 9 2 7
cols_to_shift = ['B', 'C']
pd.concat([
    df[df.columns.difference(cols_to_shift)],
    df[cols_to_shift]
], axis=1)
A D E B C
0 6 4 8 1 4
1 4 5 8 6 3
2 7 2 7 9 9
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
A D E B C
0 6 4 8 1 4
1 4 5 8 6 3
2 7 2 7 9 9
I think list.pop only takes indices of the elements in the list.
You should use list.remove instead.
cols = df.columns.tolist()
for s in ('series1', 'series2'):
    cols.remove(s)
df = df[cols + ['series1', 'series2']]
I have a "sample.txt" like this.
idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z
With this dataset, I want to get sums across rows and across columns.
Across rows it is not a big deal.
I got a result like this.
### MY CODE ###
import pandas as pd
df = pd.read_csv('sample.txt',sep="\t",index_col='idx')
df.info()
df2 = df.groupby('cat').sum()
print( df2 )
The result is like this.
A B C D
cat
x 5 7 9 3
y 8 10 12 7
z 11 13 15 11
But I don't know how to write code to get a result like this
(simply adding the values in columns A and B, and in columns C and D):
AB CD
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Could anybody help me write this code?
By the way, I don't want to do it like this
(it looks too clumsy, but if it is the only way, I'll accept it):
df2 = df['A'] + df['B']
df3 = df['C'] + df['D']
df = pd.DataFrame([df2,df3],index=['AB','CD']).transpose()
print( df )
When you pass a dictionary or callable to groupby, it gets applied along an axis. I specified axis=1, which is the columns.
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).sum()
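Note that grouping along axis=1 has been deprecated in recent pandas versions (2.x); if that matters, the same mapping can be applied through a transpose. A sketch, assuming the question's numeric columns A-D plus a cat column (the frame here is a small stand-in):

```python
import pandas as pd

# stand-in for the question's frame
df = pd.DataFrame({
    'A': [1, 4], 'B': [2, 5], 'C': [3, 6], 'D': [1, 2], 'cat': ['x', 'x']
}, index=['J', 'K'])

d = dict(A='AB', B='AB', C='CD', D='CD')
# transpose, group the (former) column labels with the mapping, sum, transpose back
out = df.drop(columns='cat').T.groupby(d).sum().T
```

This avoids the deprecated axis=1 path at the cost of two transposes, which is usually negligible for frames of this shape.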
Use concat with sum:
df = df.set_index('idx')
df = pd.concat([df[['A', 'B']].sum(axis=1), df[['C', 'D']].sum(axis=1)], axis=1, keys=['AB', 'CD'])
print(df)
AB CD
idx
J 3 4
K 9 8
L 15 12
M 3 7
N 9 11
O 15 15
Does this do what you need? By using axis=1 with DataFrame.apply, you can use the data that you want in a row to construct a new column. Then you can drop the columns that you don't want anymore.
In [1]: import pandas as pd
In [5]: df = pd.DataFrame(columns=['A', 'B', 'C', 'D'], data=[[1, 2, 3, 4], [1, 2, 3, 4]])
In [6]: df
Out[6]:
A B C D
0 1 2 3 4
1 1 2 3 4
In [7]: df['CD'] = df.apply(lambda x: x['C'] + x['D'], axis=1)
In [8]: df
Out[8]:
A B C D CD
0 1 2 3 4 7
1 1 2 3 4 7
In [13]: df.drop(['C', 'D'], axis=1)
Out[13]:
A B CD
0 1 2 7
1 1 2 7
My query returns a dataframe with a two-level column index (screenshot omitted).
Is there a way to easily drop the upper level column index and a have a single level with labels such as points_prev_amax, points_prev_amin, gf_prev_amax, gf_prev_amin and so on?
Use a list comprehension, or map, to set the new column names:
df.columns = df.columns.map('_'.join)
Or:
df.columns = ['_'.join(col) for col in df.columns]
Sample:
df = pd.DataFrame({'A':[1,2,2,1],
'B':[4,5,6,4],
'C':[7,8,9,1],
'D':[1,3,5,9]})
print (df)
A B C D
0 1 4 7 1
1 2 5 8 3
2 2 6 9 5
3 1 4 1 9
df = df.groupby('A').agg([max, min])
df.columns = df.columns.map('_'.join)
print (df)
B_max B_min C_max C_min D_max D_min
A
1 4 4 7 1 9 1
2 6 5 9 8 5 3
print (['_'.join(col) for col in df.columns])
['B_max', 'B_min', 'C_max', 'C_min', 'D_max', 'D_min']
df.columns = ['_'.join(col) for col in df.columns]
print (df)
B_max B_min C_max C_min D_max D_min
A
1 4 4 7 1 9 1
2 6 5 9 8 5 3
If you need the aggregation name as a prefix, simply swap the items of the tuples:
df.columns = ['_'.join((col[1], col[0])) for col in df.columns]
print (df)
max_B min_B max_C min_C max_D min_D
A
1 4 4 7 1 9 1
2 6 5 9 8 5 3
Another solution:
df.columns = ['{}_{}'.format(i[1], i[0]) for i in df.columns]
print (df)
max_B min_B max_C min_C max_D min_D
A
1 4 4 7 1 9 1
2 6 5 9 8 5 3
If the number of columns is large (~10^6), it is faster to use to_series and str.join:
df.columns = df.columns.to_series().str.join('_')
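For illustration, a small sketch of what to_series().str.join does on a MultiIndex: each entry is a tuple of level values, and str.join concatenates the tuple elements into one string.

```python
import pandas as pd

# a small two-level column index like the one produced by agg([max, min])
midx = pd.MultiIndex.from_product([['B', 'C'], ['max', 'min']])

# to_series() yields a Series of tuples; str.join fuses each tuple with '_'
flat = midx.to_series().str.join('_')
list(flat)  # ['B_max', 'B_min', 'C_max', 'C_min']
```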
Using @jezrael's setup:
df = pd.DataFrame({'A':[1,2,2,1],
'B':[4,5,6,4],
'C':[7,8,9,1],
'D':[1,3,5,9]})
df = df.groupby('A').agg([max, min])
Assign new columns with
from itertools import starmap

def flat(midx, sep=''):
    fstr = sep.join(['{}'] * midx.nlevels)
    return pd.Index(starmap(fstr.format, midx))

df.columns = flat(df.columns, '_')
df
I wanted to apply one-hot encoding (it isn't important to understand the question) to my dataframe this way:
train = pd.concat([train, pd.get_dummies(train['Canal_ID'])], axis=1, join_axes=[train.index])
train.drop([11,'Canal_ID'],axis=1, inplace = True)
train = pd.concat([train, pd.get_dummies(train['Agencia_ID'])], axis=1, join_axes=[train.index])
train.drop([1382,'Agencia_ID'],axis=1, inplace = True)
Unfortunately, the original dataframe had numbers as values, which is why after getting the dummy variables there are a lot of columns with the same name. How can I make them unique?
Try this: get_dummies has a prefix parameter
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1
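Applied to the question's frame, this would look roughly like the following. Canal_ID is the asker's column; the train frame and the other column here are stand-ins:

```python
import pandas as pd

# stand-in for the asker's train frame
train = pd.DataFrame({'Canal_ID': [1, 7, 1], 'other': [10, 20, 30]})

# prefix the dummy columns with the source column name so they stay unique
dummies = pd.get_dummies(train['Canal_ID'], prefix='Canal_ID')
train = pd.concat([train.drop(columns='Canal_ID'), dummies], axis=1)
```

The dummy columns come out as Canal_ID_1, Canal_ID_7, and so on, so dummies derived from different source columns can no longer collide.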
You can set the column names to a range using shape:
df.columns = range(df.shape[1])
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.shape)
(3, 6)
df.columns = range(df.shape[1])
print (df)
0 1 2 3 4 5
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
I would append a random number to the original id of the columns.
from random import randint

new_cols = train.columns
new_cols = new_cols.map(lambda x: "{}-{}".format(x, randint(0, 100)))
train.columns = new_cols
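Random suffixes can still collide, though. A deterministic variant (a sketch; dedup is a made-up helper, not a pandas function) numbers only the repeated labels:

```python
from collections import Counter

def dedup(columns):
    """Append '-1', '-2', ... to the second and later occurrences of a label."""
    seen = Counter()
    out = []
    for c in columns:
        out.append(f"{c}-{seen[c]}" if seen[c] else f"{c}")
        seen[c] += 1
    return out

dedup(['a', 'a', 'b', 'a'])  # ['a', 'a-1', 'b', 'a-2']
```

First occurrences keep their original name, so columns that were already unique are untouched.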
Let's say I have a Pandas DataFrame of the following form:
a b c
a_1 1 4 2
a_2 3 3 5
a_3 4 7 2
b_1 2 9 8
b_2 7 2 6
b_3 5 4 1
c_1 3 1 3
c_2 8 6 6
c_3 9 3 7
Is there a way I could select only rows that have similar names? In the case of the DataFrame above that would mean selecting only the rows that start with a, or the rows that start with b, etc.
Using @Akavall's setup code:
In [1]: my_data = np.arange(8).reshape(4,2)
In [2]: my_data[0,0] = 4
In [3]: df = pd.DataFrame(data = my_data, index=['a_1', 'a_2', 'b_1', 'b_2'], columns=['a', 'b'])
In [5]: df.filter(regex='a',axis=0)
Out[5]:
a b
a_1 4 1
a_2 2 3
[2 rows x 2 columns]
Note that in general this is better posed as a multi-index
In [6]: df.index = pd.MultiIndex.from_product([['a','b'],[1,2]])
In [7]: df
Out[7]:
a b
a 1 4 1
2 2 3
b 1 4 5
2 6 7
[4 rows x 2 columns]
In [8]: df.loc['a']
Out[8]:
a b
1 4 1
2 2 3
[2 rows x 2 columns]
In [9]: df.loc[['a']]
Out[9]:
a b
a 1 4 1
2 2 3
[2 rows x 2 columns]
I don't think that there is a built-in pandas way to do it, but here is one way:
import numpy as np
import pandas as pd
my_data = np.arange(8).reshape(4,2)
my_data[0,0] = 4
df = pd.DataFrame(data = my_data, index=['a_1', 'a_2', 'b_1', 'b_2'], columns=['a', 'b'])
Result:
>>> df
a b
a_1 4 1
a_2 2 3
b_1 4 5
b_2 6 7
>>> start_with_a = [ind for ind, ele in enumerate(df.index) if ele[0] == 'a']
>>> start_with_a
[0, 1]
>>> df.iloc[start_with_a]
a b
a_1 4 1
a_2 2 3
In general, you can access the row index and the columns with the .index and .columns attributes,
so you can easily get the rows that start with 'a' programmatically:
needed_rows = [row for row in df.index if row.startswith('a')]
Then you can select those rows with:
df.loc[needed_rows]
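The index also exposes vectorized string methods, so the comprehension can be replaced with a boolean mask. A sketch using the same sample frame as above:

```python
import numpy as np
import pandas as pd

my_data = np.arange(8).reshape(4, 2)
my_data[0, 0] = 4
df = pd.DataFrame(data=my_data, index=['a_1', 'a_2', 'b_1', 'b_2'],
                  columns=['a', 'b'])

# boolean mask over the index labels, no Python-level loop needed
df[df.index.str.startswith('a')]
```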