How to make column names in a dataframe unique? - python

I wanted to apply one-hot encoding (it isn't important to understand the question) to my dataframe this way:
train = pd.concat([train, pd.get_dummies(train['Canal_ID'])], axis=1, join_axes=[train.index])
train.drop([11,'Canal_ID'],axis=1, inplace = True)
train = pd.concat([train, pd.get_dummies(train['Agencia_ID'])], axis=1, join_axes=[train.index])
train.drop([1382,'Agencia_ID'],axis=1, inplace = True)
Unfortunately, the original dataframe had numbers as values, so after creating the dummy variables there are a lot of columns with the same name. How can I make them unique?

Try this: get_dummies has a "prefix" parameter:
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
'C': [1, 2, 3]})
pd.get_dummies(df, prefix=['col1', 'col2'])
C col1_a col1_b col2_a col2_b col2_c
0 1 1 0 0 1 0
1 2 0 1 1 0 0
2 3 1 0 0 0 1
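Applied to the question's situation, the same idea might look like the sketch below. The sample values and the column "Demanda" are made up for illustration; only the column names Canal_ID and Agencia_ID come from the question. Passing columns= lets get_dummies encode and drop the original columns in one step, and the prefixes keep numeric category values from producing identical column names.

```python
import pandas as pd

# Hypothetical sample data; only the two *_ID column names come from the question.
train = pd.DataFrame({
    "Canal_ID": [1, 11, 1],
    "Agencia_ID": [1110, 1382, 1110],
    "Demanda": [3, 4, 5],
})

# Encode both ID columns at once; each dummy column is named <prefix>_<value>.
encoded = pd.get_dummies(
    train,
    columns=["Canal_ID", "Agencia_ID"],
    prefix=["Canal", "Agencia"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Because the prefix differs per source column, a value that appears in both columns still ends up under two distinct names.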

You can set new column names by range with shape:
df.columns = range(df.shape[1])
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.shape)
(3, 6)
df.columns = range(df.shape[1])
print (df)
0 1 2 3 4 5
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3

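I would append a random number to the original name of each column. Note that random suffixes can still collide, so uniqueness is not guaranteed.
from random import randint
new_cols = train.columns
new_cols = new_cols.map(lambda x: "{}-{}".format(x, randint(0, 100)))
train.columns = new_cols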
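A deterministic alternative is to number the repeated names instead of drawing random suffixes. The helper make_unique below is a hypothetical name, not a pandas function, and the sample frame is invented for illustration. It is only a sketch: it could still clash if a label such as "1_1" already exists in the frame.

```python
import pandas as pd

def make_unique(columns):
    """Return labels as strings, suffixing repeats with _1, _2, ... (hypothetical helper)."""
    seen = {}
    unique = []
    for col in columns:
        count = seen.get(col, 0)   # how many times this label was seen before
        seen[col] = count + 1
        unique.append(str(col) if count == 0 else f"{col}_{count}")
    return unique

# Invented example with a duplicated numeric column label.
train = pd.DataFrame([[1, 2, 3, 4]], columns=[1, 11, 1, "Canal_ID"])
train.columns = make_unique(train.columns)
print(train.columns.tolist())
```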

Related

How to concat rows(axis=1) with stride?

example:
import pandas as pd
test = {
't':[0,1,2,3,4,5],
'A':[1,1,1,2,2,2],
'B':[9,9,9,9,8,8],
'C':[1,2,3,4,5,6]
}
df = pd.DataFrame(test)
df
I tried to use a window and concat:
window_size = 2
for row_idx in range(df.shape[0] - window_size):
    print(
        pd.concat(
            [df.iloc[[row_idx]],
             df.loc[:, df.columns!='t'].iloc[[row_idx+window_size-1]],
             df.loc[:, df.columns!='t'].iloc[[row_idx+window_size]]],
            axis=1
        )
    )
But this gives the wrong dataframe.
Is it possible to use a sliding window to concat data?
pd.concat aligns on indices, so you have to make sure that they match. You could try the following:
window_size = 2
dfs = []
for n in range(window_size + 1):
    sdf = df.iloc[n:df.shape[0] - window_size + n]
    if n > 0:
        sdf = (
            sdf.drop(columns="t").rename(columns=lambda c: f"{c}_{n}")
            .reset_index(drop=True)
        )
    dfs.append(sdf)
res = pd.concat(dfs, axis=1)
Result for the sample:
t A B C A_1 B_1 C_1 A_2 B_2 C_2
0 0 1 9 1 1 9 2 1 9 3
1 1 1 9 2 1 9 3 2 9 4
2 2 1 9 3 2 9 4 2 8 5
3 3 2 9 4 2 8 5 2 8 6
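The same table can probably also be built with shift, which keeps the original index so that concat has nothing to misalign. This is a sketch of an alternative technique, not the code from the answer above. It assumes every column holds integers: shift introduces NaN in the trailing rows, which turns the shifted columns into floats until those rows are trimmed and the frame is cast back.

```python
import pandas as pd

df = pd.DataFrame({
    "t": [0, 1, 2, 3, 4, 5],
    "A": [1, 1, 1, 2, 2, 2],
    "B": [9, 9, 9, 9, 8, 8],
    "C": [1, 2, 3, 4, 5, 6],
})
window_size = 2
value_cols = df.columns.drop("t")

# One block per look-ahead step: row i receives the values of row i + n.
parts = [df]
for n in range(1, window_size + 1):
    parts.append(df[value_cols].shift(-n).add_suffix(f"_{n}"))

# Drop the last window_size rows (they contain NaN) and restore integer dtype.
# Assumption: all columns are integer-valued, as in the sample.
res = pd.concat(parts, axis=1).iloc[: len(df) - window_size].astype("int64")
print(res)
```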
Have a look at this example below:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df4 = pd.DataFrame([['bird', 'polly'], ['monkey','george']],
columns=['animal', 'name'])
pd.concat([df1, df4], axis=1)
# Returns the following output
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
It was taken from the pandas documentation.

Pandas values transfer from one df to another OverflowError

Is there a pandas way to copy values into the column 'column_to_fill' from another df without iteration? The row and column indexes I need are stored in columns of df_1. I need to fill df_1['column_to_fill'] with values from df_2.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(columns=['row_df2', 'column_df2'])
df1['row_df2'] = [1, 3, 5]
df1['column_df2'] = ['a', 'c', 'd']
index=np.arange(6)
columns=['a', 'b', 'c', 'd']
df2 = pd.DataFrame(data=np.random.randint(10, size=(len(index), len(columns))), index=index, columns=columns)
df1['column_to_fill'] = 0
for idx in df1.index:
    df1.loc[idx, 'column_to_fill'] = df2.loc[df1.loc[idx, 'row_df2'],
                                             df1.loc[idx, 'column_df2']].sum()
df1
row_df2 column_df2
0 1 a
1 3 c
2 5 d
df2
a b c d
0 2 3 5 2
1 8 3 9 3
2 4 6 0 1
3 3 8 0 8
4 3 4 5 0
5 2 5 4 0
df1
row_df2 column_df2 column_to_fill
0 1 a 8
1 3 c 0
2 5 d 0
I think you want to pick values from df_2 based on the (row, column) combinations in df_1 and assign them to a df_1 column. If that is the case, check the code below.
df_1 = pd.DataFrame({'values_type_rows_df2':[0,1,0,1], 1:[4,5,6,7]})
df_2 = pd.DataFrame({0:['a','b','c','d'], 1:['e','a','b','c']})
df_1['column_to_fill'] = [df_2.loc[i,i] for i in df_1['values_type_rows_df2']]
Based on your modification of the question, here is the modified code:
df1['column_to_fill'] = [df2.loc[j["row_df2"], j["column_df2"]] for i,j in df1.loc[:,["row_df2", "column_df2"]].iterrows()]
Screenshot attached for the time it took
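For a lookup with no Python-level loop at all, one option is to translate the labels into integer positions and index the underlying NumPy array. The sketch below uses the df2 values printed in the question; it is an alternative to the list comprehension above, not a rewrite of it. Note that Index.get_indexer returns -1 for labels it cannot find, so unknown labels should be checked for before indexing.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"row_df2": [1, 3, 5], "column_df2": ["a", "c", "d"]})
# Values copied from the df2 printed in the question.
df2 = pd.DataFrame(
    [[2, 3, 5, 2], [8, 3, 9, 3], [4, 6, 0, 1],
     [3, 8, 0, 8], [3, 4, 5, 0], [2, 5, 4, 0]],
    columns=list("abcd"),
)

# Convert row and column labels to integer positions (-1 means "not found").
rows = df2.index.get_indexer(df1["row_df2"])
cols = df2.columns.get_indexer(df1["column_df2"])
assert (rows >= 0).all() and (cols >= 0).all(), "unknown label in df1"

# Fancy indexing picks one element per (row, column) pair.
df1["column_to_fill"] = df2.to_numpy()[rows, cols]
print(df1)
```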

how to set the index as character for pandas

I am trying to create a pandas df like this post.
df = pd.DataFrame(np.arange(9).reshape(3,3) , columns=list('123'))
df
this piece of code gives
describe() gives
Is there a way to set the name of each row (i.e. the index) in df to 'A', 'B', 'C' instead of 0, 1, 2?
Use df.index:
df.index=['A', 'B', 'C']
print(df)
1 2 3
A 0 1 2
B 3 4 5
C 6 7 8
A more scalable and general solution would be using list-comprehension
df.index = [chr(ord('a') + x).upper() for x in df.index]
print(df)
1 2 3
A 0 1 2
B 3 4 5
C 6 7 8
Add index parameter in DataFrame constructor:
df = pd.DataFrame(np.arange(9).reshape(3,3) ,
index=list('ABC'),
columns=list('123'))
print (df)
1 2 3
A 0 1 2
B 3 4 5
C 6 7 8
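If the number of rows is not known in advance, the labels can be derived from the frame's length rather than typed out. This is a minimal sketch that assumes at most 26 rows, since it slices string.ascii_uppercase; unlike the chr/ord version it does not depend on the current index being 0, 1, 2, ...

```python
import string

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("123"))

# Take as many uppercase letters as there are rows (assumes len(df) <= 26).
df.index = list(string.ascii_uppercase[: len(df)])
print(df)
```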

map DataFrame index and forward fill nan values

I have a DataFrame with an integer index that is missing some values (i.e. it is not equally spaced). I want to create a new DataFrame with equally spaced index values and forward-filled column values. Below is a simple example.
I have:
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
and I want to use the above to create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
df = df.reindex(np.arange(0, df.index.max()+1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
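The two spellings used in the answers above give the same result for this sample, but they are not equivalent in general. reindex(..., method='ffill') only fills the rows that reindexing introduces, whereas calling .ffill() afterwards also fills NaN values that were already present in the data. The sketch below illustrates this with an invented frame that has a missing value at index 2; the column name "val" is made up.

```python
import numpy as np
import pandas as pd

# Invented sample: the row at index 2 already holds a missing value.
df = pd.DataFrame({"val": ["A", np.nan, "C"]}, index=[0, 2, 4])
full_index = np.arange(df.index.min(), df.index.max() + 1)

# Fills only the newly created rows, copying from the previous existing label.
via_method = df.reindex(full_index, method="ffill")

# Creates the new rows first, then forward-fills every NaN, old or new.
via_ffill = df.reindex(full_index).ffill()

print(via_method)
print(via_ffill)
```

Which of the two is appropriate depends on whether missing values in the original rows should be kept or filled.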

Duplicating each row in a dataframe with counts

For each row in a dataframe, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g. the original dataframe is:
A | A
B | B
I wish to make duplicates of each row, with an additional column to identify each duplicate, resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2
You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Similarly, using reindex and assign with np.tile:
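# tile the per-row counter 0..1 once per original row, so this also works for more than two rows
df = df.reindex(df.index.repeat(2))\
       .assign(count=np.tile(np.arange(2), len(df)) + 1)\
       .reset_index(drop=True)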
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Use Index.repeat with loc, for count groupby with cumcount:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile(np.arange(2), len(df.index) // 2) + 1  # counter 1..2, repeated once per original row
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Setup
Borrowed from @jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
[df.index, [1, 2]],
names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by @cᴏʟᴅsᴘᴇᴇᴅ and @jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2
Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3
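The answers above can be wrapped into one reusable function. The sketch below is one possible generalisation, with repeat_rows and counter_col as hypothetical names. It selects rows by position, so it should also behave when the original index contains duplicate labels, and the counter is tiled once per original row so it works for any number of rows and repeats.

```python
import numpy as np
import pandas as pd

def repeat_rows(df, n, counter_col="count"):
    """Repeat every row n times and number the copies 1..n (hypothetical helper)."""
    # Positions 0,0,...,1,1,...: each row position repeated n times.
    positions = np.repeat(np.arange(len(df)), n)
    out = df.iloc[positions].reset_index(drop=True)
    # Counter 1..n, restarted for each original row.
    out[counter_col] = np.tile(np.arange(1, n + 1), len(df))
    return out

df = pd.DataFrame({"a": ["A", "B", "C"], "b": ["A", "B", "C"]})
result = repeat_rows(df, 2)
print(result)
```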
