Add an index column in csv file - python

I have the following sample to transform. After concatenating several CSV files, each row keeps the index it had in its original file (0 up to the last row of that file), as shown below.
Column_1 column2
0 m 4
1 n 3
2 4 6
3 t 8
0 h 8
1 4 7
2 kl 8
3 m 4
4 bv 5
5 n 8
Now I want to add another column at the beginning that indexes the whole file.
Column_1 column2
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8

The simplest way is MultiIndex.from_arrays with numpy.arange or range:
print (np.arange(len(df.index)))
[0 1 2 3 4 5 6 7 8 9]
n = ['a','b']
df.index = pd.MultiIndex.from_arrays([np.arange(len(df.index)), df.index], names= n)
print (df)
Column_1 column2
a b
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
n = ['a','b']
df.index = pd.MultiIndex.from_arrays([range(len(df.index)), df.index], names= n)
print (df)
Column_1 column2
a b
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
If index names are not necessary, simply assign:
df.index = [np.arange(len(df.index)), df.index]
print (df)
Column_1 column2
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
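For reference, a minimal end-to-end sketch of the approach above. The two small frames are hypothetical stand-ins for the concatenated CSV files, and the level names 'a' and 'b' are simply the ones used in this answer:

```python
import numpy as np
import pandas as pd

# Hypothetical pieces standing in for the individual CSV files
part1 = pd.DataFrame({"Column_1": ["m", "n", "4", "t"], "column2": [4, 3, 6, 8]})
part2 = pd.DataFrame({"Column_1": ["h", "4", "kl", "m", "bv", "n"],
                      "column2": [8, 7, 8, 4, 5, 8]})

# concat keeps each file's own 0..n-1 index
df = pd.concat([part1, part2])

# Outer level: running counter over the whole frame; inner level: per-file index
df.index = pd.MultiIndex.from_arrays(
    [np.arange(len(df)), df.index], names=["a", "b"]
)

# Both index levels are written as the leading columns of the CSV text
csv_text = df.to_csv()
```

Writing the frame back with to_csv keeps both levels, so the new running index ends up as the first column of the file.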

Related

pandas dataframe groupby rank generates unexpected order of ranking

I am using the following code to generate the rank column,
df["rank"] = df.groupby(['group1','userId'])[['rank_level1','rank_level2']].rank(method='first', ascending=True).astype(int)
but, as the following example data shows, it generates the wrong ranking order with respect to the rank_level2 column.
expected_Rank is the ranking order I expect.
group1  userId  geoId  rank_level1  rank_level2    rank  expected_Rank
a       1       q      3            3.506102795    1     8
a       1       w      3            -9.359613563   2     2
a       1       e      3            -2.368458072   3     3
a       1       r      3            13.75731938    4     9
a       1       t      3            0.229777761    5     5
a       1       y      3            -10.25124866   6     1
a       1       u      3            2.82822285     7     7
a       1       i      3            0              8     4
a       1       o      3            1.120593402    9     6
a       1       p      4            1.98           10    10
a       1       z      4            5.110299374    11    11
b       1       p      2            -9.552317622   1     1
b       1       r      3            1.175083121    2     6
b       1       t      3            0              3     5
b       1       o      3            9.383253146    4     8
b       1       w      3            5.782528196    5     7
b       1       i      3            -0.680999413   6     4
b       1       y      3            -0.990387248   7     3
b       1       e      3            -11.18793533   8     2
b       1       z      3            12.33791512    9     9
b       1       u      4            -4.799979138   10    11
b       1       q      4            -25.92         11    10
Create tuples by both columns and then use GroupBy.transform with Series.rank and method='dense':
df["rank"] = (df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1))
.groupby(['group1','userId'])['new']
.transform(lambda x: x.rank(method='dense', ascending=True))
.astype(int))
print (df)
group1 userId geoId rank_level1 rank_level2 rank expected_Rank
0 a 1 q 3 3.506103 8 8
1 a 1 w 3 -9.359614 2 2
2 a 1 e 3 -2.368458 3 3
3 a 1 r 3 13.757319 9 9
4 a 1 t 3 0.229778 5 5
5 a 1 y 3 -10.251249 1 1
6 a 1 u 3 2.828223 7 7
7 a 1 i 3 0.000000 4 4
8 a 1 o 3 1.120593 6 6
9 a 1 p 4 1.980000 10 10
10 a 1 z 4 5.110299 11 11
11 b 1 p 2 -9.552318 1 1
12 b 1 r 3 1.175083 6 6
13 b 1 t 3 0.000000 5 5
14 b 1 o 3 9.383253 8 8
15 b 1 w 3 5.782528 7 7
16 b 1 i 3 -0.680999 4 4
17 b 1 y 3 -0.990387 3 3
18 b 1 e 3 -11.187935 2 2
19 b 1 z 3 12.337915 9 9
20 b 1 u 4 -4.799979 11 11
21 b 1 q 4 -25.920000 10 10
transform with a lambda is needed because calling GroupBy.rank directly on the tuple column fails:
df["rank"] = df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1)).groupby(['group1','userId'])['new'].rank(method='first', ascending=True).astype(int)
DataError: No numeric types to aggregate
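As an alternative sketch (not the method from the answer above), the same ordering can be obtained without tuples by sorting on both ranking columns and numbering the rows inside each group with cumcount. The sample below is a hypothetical subset of the question's data, and ties would get distinct ranks here (like method='first') rather than a shared dense rank:

```python
import pandas as pd

# Hypothetical subset of the question's data
df = pd.DataFrame({
    "group1": list("aaaabbb"),
    "userId": [1] * 7,
    "geoId": list("qwpypqr"),
    "rank_level1": [3, 3, 4, 3, 2, 4, 3],
    "rank_level2": [3.506102795, -9.359613563, 1.98, -10.25124866,
                    -9.552317622, -25.92, 1.175083121],
})

# Sort by the group keys, then by rank_level1 and rank_level2
ordered = df.sort_values(["group1", "userId", "rank_level1", "rank_level2"])

# Number rows within each group; assignment aligns on the original index
df["rank"] = ordered.groupby(["group1", "userId"]).cumcount() + 1
```

Because the assignment aligns on the index, the original row order of df is left unchanged.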

Python dataframe add columns in groups of 3

I have a data-frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column that holds the same value for each group of 3 rows.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove the trailing rows with DataFrame.iloc if the length is not divisible by 3, then create a unique group number per block of 3 rows by integer division of the row position by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
N = 3
num = len(df) // N * N
df = df.iloc[:num].copy()  # copy avoids SettingWithCopyWarning on the next assignment
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
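The question asked for letter labels (A, B, ...) rather than integers. A small sketch of one way to map the integer groups to letters follows; the column name 'label' is an assumption, and the lookup only works for up to 26 groups:

```python
import string

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [4, 2, 3], [3, 1, 9], [6, 7, 0], [9, 2, 5]],
    columns=list("abd"),
)

N = 3
groups = np.arange(len(df)) // N            # 0, 0, 0, 1, 1, 1

# Look up one uppercase letter per group number (at most 26 groups)
letters = np.array(list(string.ascii_uppercase))
df["label"] = letters[groups]
```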
IIUC, groupby:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
Or you can also use pd.merge on the index (not as clean as join, though). Recent pandas versions raise a MergeError when on= is combined with left_index/right_index, so the original temporary 'tmp' key column is omitted:
In [1309]: pd.merge(df, df1, left_index=True, right_index=True)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
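A self-contained sketch of the options, using the frames from the question. The last variant is an assumption about how one might keep using combine_first: it restores the column order afterwards by selecting the columns in the desired order.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4], "F": [2, 5], "Y": [3, 6]})
df1 = pd.DataFrame({"B": [7, 10], "C": [8, 11], "T": [9, 12]})

# join and index-based merge both keep left columns first, then right columns
joined = df.join(df1)
merged = pd.merge(df, df1, left_index=True, right_index=True)

# If combine_first is required, reorder its result explicitly afterwards
ordered = list(df.columns) + [c for c in df1.columns if c not in df.columns]
combined = df.combine_first(df1)[ordered]
```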

add_suffix to column name based on position

I have a dataset where I want to add a suffix to column names based on their positions. For example, the 1st to 4th columns should be named 'abc_1', the 5th to 8th columns 'abc_2', and so on.
I have tried using dataframe.rename, but it is a time-consuming process. What would be the most efficient way to achieve this?
I think a good choice here is to create a MultiIndex to avoid duplicated column names: create the first level by floor division of the column position by 4 and add the prefix with f-strings:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))
df.columns = [[f'abc_{i+1}' for i in df.columns // 4], df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
A more general solution, for when the column names are not a RangeIndex:
cols = [f'abc_{i+1}' for i in np.arange(len(df.columns)) // 4]
df.columns = [cols, df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
It is also possible to specify the MultiIndex level names with MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([cols, df.columns], names=('level0','level1'))
print (df)
level0 abc_1 abc_2 abc_3
level1 0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
Then it is possible to select each first-level group with xs:
print (df.xs('abc_2', axis=1))
4 5 6 7
0 3 9 6 1
1 3 4 0 0
2 7 2 4 8
3 1 5 6 2
4 6 2 4 4
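If a flat column index is preferred over a MultiIndex, one possible variant is sketched below. The naming scheme 'abc_<block>_<original label>' is an assumption chosen only to keep the names unique; it is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 10 columns labelled 0..9
df = pd.DataFrame(np.zeros((2, 10), dtype=int))

# Block number per column position: 1 for columns 1-4, 2 for 5-8, and so on
block = np.arange(len(df.columns)) // 4 + 1

# Combine block number and original label so every name stays unique
df.columns = [f"abc_{b}_{c}" for b, c in zip(block, df.columns)]
```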

sort dataframe by position in group then by that group

consider the dataframe df
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe so that, if I grouped by column 'A', I'd pull the first position from each group, then cycle back and take the second position from each group if any remain, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to number the rows within each group first, then sort_values (with a stable sort, so ties keep their original order) and reindex by the index of the Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values(kind='mergesort')
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
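A closely related sketch sorts on two keys instead, the within-group position first and the group label second. The helper column name 'pos' is an assumption. Unlike the reindex approach, this orders the groups by their label in A rather than by their order of appearance, which gives the same result for this sample:

```python
import pandas as pd

df = pd.DataFrame(dict(A=list("aaaaabbbbccc"), B=range(12)))

out = (
    df.assign(pos=df.groupby("A").cumcount())  # position of each row in its group
      .sort_values(["pos", "A"])               # first by position, then by group
      .drop(columns="pos")
)
```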
Here's a NumPy approach -
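def approach1(g, v):
    # Inputs: 1D arrays of the groupby and value columns (g must be grouped contiguously, with at least two groups)
    id_arr2 = np.ones(v.size, dtype=int)
    # Positions where a new group starts
    sf = np.flatnonzero(g[1:] != g[:-1]) + 1
    # Negative steps reset the running counter at each group start
    id_arr2[sf[0]] = -sf[0] + 1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:] + 1
    # Stable argsort of the within-group counter gives the interleaved order
    return id_arr2.cumsum().argsort(kind='mergesort')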
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or using df.reindex from @jezrael's post (this works here because the index is the default 0..n-1 range, so positions and labels coincide):
df.reindex(approach1(df.A.values, df.B.values))
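To see how the NumPy approach builds its within-group counter, here is a worked sketch on a tiny hypothetical label array. The variable names differ from the function above and are only for illustration:

```python
import numpy as np

g = np.array(list("aaabbc"))                 # group labels, already contiguous

steps = np.ones(g.size, dtype=int)           # +1 step for every row
starts = np.flatnonzero(g[1:] != g[:-1]) + 1 # rows where a new group begins: [3, 5]

# Negative steps make the running sum restart at 1 on each group start
steps[starts[0]] = -starts[0] + 1
steps[starts[1:]] = starts[:-1] - starts[1:] + 1

counter = steps.cumsum()                     # [1, 2, 3, 1, 2, 1]
order = counter.argsort(kind="mergesort")    # [0, 3, 5, 1, 4, 2]
```

The stable sort keeps rows with the same counter value in their original order, which is what interleaves the groups.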
