Python dataframe add column in groups of 3

I have a DataFrame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column with the same value repeated in groups of 3.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?

First remove the last rows with DataFrame.iloc if the length is not divisible by 3, then create unique group numbers by integer division of the row positions by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
import numpy as np

N = 3
num = len(df) // N * N
df = df.iloc[:num]
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
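The question asked for letter labels rather than numbers; the group numbers can be mapped through string.ascii_uppercase. A minimal sketch, assuming there are fewer than 26 groups so single letters suffice:
import string

# i // N is the group number; index into the alphabet for A, B, ...
df['groups'] = [string.ascii_uppercase[i // N] for i in range(len(df))]
print (df)
a b d groups
0 1 2 3 A
1 4 5 6 A
2 4 2 3 A
3 3 1 9 B
4 6 7 0 B
5 9 2 5 B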

IIUC, groupby:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42
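Note that this variant labels each group of 3 with the group's total rather than a letter; the grouping key np.arange(len(df)) // 3 is the same idea as above. A self-contained sketch of it, assuming the six-row frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [4, 2, 3],
                   [3, 1, 9], [6, 7, 0], [9, 2, 5]])
key = np.arange(len(df)) // 3  # 0,0,0,1,1,1
df['new_col'] = df.sum(axis=1).groupby(key).transform('sum')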


Grid-like dataframe to list

I have an Excel dataset which contains 100 rows and 100 columns with order frequencies at locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-coordinate | y-coordinate | value
The "value" column contains only positive integers. The x and y columns contain float data (geographical coordinates).
The order does not matter, as it can easily be sorted afterwards.
So basically a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location... which is key for my project.
What is the best way to accomplish this?
Assuming this input:
import pandas as pd

l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
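Since the question noted the order can easily be fixed afterwards, the melt result can be brought into the same order as the stack version with a sort; for instance:
df2 = df2.sort_values(['x', 'y']).reset_index(drop=True)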
You should be able to get the results with a melt operation -
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8
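For the original Excel grid, the real coordinates can be attached as index and column labels before reshaping, so stack keeps them. A sketch assuming a hypothetical file orders.xlsx whose first column holds the x coordinates and whose header row holds the y coordinates:
import pandas as pd

# hypothetical file name and layout: first column = x coordinates,
# header row = y coordinates, cells = order frequencies
df = pd.read_excel('orders.xlsx', index_col=0)
out = (df.rename_axis(index='x', columns='y')
         .stack()
         .reset_index(name='value'))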

counting consecutive duplicate elements in a dataframe and storing them in a new column

I am trying to count the consecutive elements in a data frame and store them in a new column. I don't want to count the total number of times an element appears overall in the list, but how many times it appeared consecutively. I used this:
import pandas as pd

a = [1,1,3,3,3,5,6,3,3,0,0,0,2,2,2,0]
df = pd.DataFrame(list(zip(a)), columns=['Patch'])
df['count'] = df.groupby('Patch').Patch.transform('size')
print(df)
this gave me a result like this:
Patch count
0 1 2
1 1 2
2 3 5
3 3 5
4 3 5
5 5 1
6 6 1
7 3 5
8 3 5
9 0 4
10 0 4
11 0 4
12 2 3
13 2 3
14 2 3
15 0 4
However, I want the result to be like this:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
df = (
    df.groupby((df.Patch != df.Patch.shift(1)).cumsum())
      .agg({"Patch": ("first", "count")})
      .reset_index(drop=True)
      .droplevel(level=0, axis=1)
      .rename(columns={"first": "Patch"})
)
print(df)
Prints:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
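The same shift/cumsum trick reads a bit cleaner with named aggregation (available since pandas 0.25), which avoids the droplevel/rename cleanup; a sketch under that assumption:
df = (df.groupby((df.Patch != df.Patch.shift()).cumsum())
        .agg(Patch=('Patch', 'first'), count=('Patch', 'size'))
        .reset_index(drop=True))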

add_suffix to column name based on position

I have a dataset where I want to add a suffix to column names based on their positions. For example, the 1st to 4th columns should be named 'abc_1', the 5th to 8th 'abc_2', and so on.
I have tried using dataframe.rename, but it is a time-consuming process. What would be the most efficient way to achieve this?
I think a MultiIndex is a good choice here to avoid duplicated column names - create the first level by floor division of the column positions by 4 and add the prefix with f-strings:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(5, 10)))
df.columns = [[f'abc_{i+1}' for i in df.columns // 4], df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
A more general solution if the column names are not a RangeIndex:
cols = [f'abc_{i+1}' for i in np.arange(len(df.columns)) // 4]
df.columns = [cols, df.columns]
print (df)
abc_1 abc_2 abc_3
0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
It is also possible to specify the MultiIndex level names with MultiIndex.from_arrays:
df.columns = pd.MultiIndex.from_arrays([cols, df.columns], names=('level0','level1'))
print (df)
level0 abc_1 abc_2 abc_3
level1 0 1 2 3 4 5 6 7 8 9
0 2 2 6 1 3 9 6 1 0 1
1 9 0 0 9 3 4 0 0 4 1
2 7 3 2 4 7 2 4 8 0 7
3 9 3 4 6 1 5 6 2 1 8
4 3 5 0 2 6 2 4 4 6 3
Then it is possible to select each group with xs:
print (df.xs('abc_2', axis=1))
4 5 6 7
0 3 9 6 1
1 3 4 0 0
2 7 2 4 8
3 1 5 6 2
4 6 2 4 4
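If flat column names are needed downstream, the two levels can be joined back into unique strings; a minimal sketch:
# join group label and original position into one unique name per column
df.columns = [f'{a}_{b}' for a, b in df.columns]
print (df.columns.tolist())
['abc_1_0', 'abc_1_1', 'abc_1_2', 'abc_1_3', 'abc_2_4', 'abc_2_5', 'abc_2_6', 'abc_2_7', 'abc_3_8', 'abc_3_9']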

need to filter rows of one dataframe that are not present in another

I have two data frames in pandas, and I need to get the rows of the second whose values across all columns are not in the first.
Example:
df A
A B C D
6 4 1 6
7 6 6 3
1 6 2 9
8 0 4 9
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
3 2 8 8
5 2 8 8
df B
A B C D
1 0 2 3
8 4 7 5
4 7 1 1
1 0 2 3
8 4 7 5
4 7 1 1
3 7 3 4
5 2 8 8
1 1 1 1
2 2 2 2
1 1 1 1
req
A B C D
1 1 1 1
2 2 2 2
1 1 1 1
I tried using pd.merge (inner/left on all columns), but it takes a lot of computation time and memory when there are more rows and columns. Is there any other way to work around it, like iterating through each row of dfA against dfB on all columns and then picking the ones that are only in dfB?
You can use merge with the indicator parameter.
df_b.merge(df_a, on=['A','B','C','D'],
           how='left', indicator='ind')\
    .query('ind == "left_only"')\
    .drop('ind', axis=1)
Output:
A B C D
9 1 1 1 1
10 2 2 2 2
11 1 1 1 1
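If the merge is too heavy, an alternative is to hash the rows of the first frame into a set of tuples and keep only the rows of the second that are not in it; a sketch of that idea:
# a set of row-tuples from df_a gives O(1) membership tests per row of df_b
keys_a = set(map(tuple, df_a.to_numpy()))
out = df_b[[tuple(row) not in keys_a for row in df_b.to_numpy()]]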

Add an index column in csv file

I have the following sample to transform. After concatenating several CSV files, each file keeps its own row index, running from 0 to the last row of that file, as depicted below.
Column_1 column2
0 m 4
1 n 3
2 4 6
3 t 8
0 h 8
1 4 7
2 kl 8
3 m 4
4 bv 5
5 n 8
Now I want to add another column at the beginning that indexes the rows across the whole concatenated frame.
Column_1 column2
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
Simplest is MultiIndex.from_arrays with numpy.arange or range:
print (np.arange(len(df.index)))
[0 1 2 3 4 5 6 7 8 9]
n = ['a','b']
df.index = pd.MultiIndex.from_arrays([np.arange(len(df.index)), df.index], names=n)
print (df)
Column_1 column2
a b
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
n = ['a','b']
df.index = pd.MultiIndex.from_arrays([range(len(df.index)), df.index], names=n)
print (df)
Column_1 column2
a b
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
If index names are not necessary, simply assign:
df.index = [np.arange(len(df.index)), df.index]
print (df)
Column_1 column2
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8
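If an actual column at the front is wanted rather than an index level, reset_index on the original frame (before the MultiIndex assignments above) moves the per-file index into a column and leaves a fresh running index. A minimal sketch; row_in_file is just an illustrative name:
# the old repeating index becomes a regular column named 'index'
df = df.reset_index().rename(columns={'index': 'row_in_file'})
print (df)
row_in_file Column_1 column2
0 0 m 4
1 1 n 3
2 2 4 6
3 3 t 8
4 0 h 8
5 1 4 7
6 2 kl 8
7 3 m 4
8 4 bv 5
9 5 n 8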
