Pandas sum integers separated by commas in a string column - python

I have a pandas data frame with a column as type string, looking like:
1 1
2 3,1
3 1
4 1
5 2,1,2
6 1
7 1
8 1
9 1
10 4,3,1
I want to sum all integers separated by the commas, obtaining as a result:
1 1
2 4
3 1
4 1
5 5
6 1
7 1
8 1
9 1
10 8
My attempt so far has been:
qty = []
for i in df['Qty']:
    i = i.split(",")
    i = sum(i)
    qty.append(i)
df['Qty'] = qty
Although, I get the error:
TypeError: cannot perform reduce with flexible type

Use apply on the column: df['B'].apply(lambda x: sum(map(int, x.split(','))))
In [81]: df
Out[81]:
A B
0 1 1
1 2 3,1
2 3 1
3 4 1
4 5 2,1,2
5 6 1
6 7 1
7 8 1
8 9 1
9 10 4,3,1
In [82]: df['B'].apply(lambda x: sum(map(int, x.split(','))))
Out[82]:
0 1
1 4
2 1
3 1
4 5
5 1
6 1
7 1
8 1
9 8
Name: B, dtype: int64
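To write the result back onto the original frame from the question (a small sketch, assuming the column is called Qty there as in the question's loop), the same expression can simply be assigned in place:
# replace the string column with the per-row sums
df['Qty'] = df['Qty'].apply(lambda x: sum(map(int, x.split(','))))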

Related

Grid-like dataframe to list

I have an Excel dataset which contains 100 rows and 100 columns with order frequencies in locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-coordinate | y-coordinate | value
The "value" column only contains positive integers. The x and y columns contain float type data (geographical coordinates).
The order does not matter, as it can easily be sorted afterwards.
So, basically a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location... which is key for my project.
What is the best way to accomplish this?
Assuming this input:
l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
You should be able to get the results with a melt operation -
df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]
df.loc[:, 'x'] = [3, 4, 5]
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8
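If the real geographical coordinates are available, one option is to attach them before stacking, so the coordinates survive the reshape. A sketch, assuming the x values label the rows and the y values label the columns (the coordinate values below are made up for illustration):
import pandas as pd

# hypothetical coordinate labels; replace with the real ones from the Excel sheet
x_coords = [51.1, 51.2, 51.3]
y_coords = [4.1, 4.2, 4.3, 4.4]

l = [[1, 5, 3, 5], [4, 2, 5, 6], [2, 3, 1, 5]]
df = pd.DataFrame(l, index=x_coords, columns=y_coords)

# stack turns the column labels into a second index level; reset_index then
# exposes both coordinate levels as ordinary columns next to the value
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')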

How to renumber a dataframe according to a periodic and successive column?

The original dataframe df is:
type month
0 a 1
1 b 1
2 c 1
3 e 5
4 a 5
5 c 5
6 b 9
7 e 9
8 a 9
9 e 9
10 a 1
11 a 1
Notice that the month values are arranged in successive segments and repeated periodically. The size of the segments is not always the same. I would like to add a column num that, within each successive month segment, is renumbered from 0 again. The order of the original sequence should not be changed. The expected output should be:
type month num
0 a 1 0
1 b 1 1
2 c 1 2
3 e 5 0
4 a 5 1
5 c 5 2
6 b 9 0
7 e 9 1
8 a 9 2
9 e 9 3
10 a 1 0
11 a 1 1
I can't use groupby since the values of month are repeated but separated.
First we create the groups by checking whether each row's month differs from the previous row with Series.shift, and then take the cumulative sum of the booleans.
Then we groupby on those groups and use cumcount:
grps = df['month'].ne(df['month'].shift()).cumsum()
df['num'] = df.groupby(grps).cumcount()
type month num
0 a 1 0
1 b 1 1
2 c 1 2
3 e 5 0
4 a 5 1
5 c 5 2
6 b 9 0
7 e 9 1
8 a 9 2
9 e 9 3
10 a 1 0
11 a 1 1
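A minimal self-contained sketch of the same idea, reproducing the data from the question so the intermediate grouping key is visible:
import pandas as pd

df = pd.DataFrame({'type': list('abceacbeaeaa'),
                   'month': [1, 1, 1, 5, 5, 5, 9, 9, 9, 9, 1, 1]})

# a new group starts wherever month differs from the previous row;
# grps comes out as 1,1,1,2,2,2,3,3,3,3,4,4 for this data
grps = df['month'].ne(df['month'].shift()).cumsum()
df['num'] = df.groupby(grps).cumcount()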

Python dataframe add columns in groups of 3

I have a data-frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column with the same value in groups of 3.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove trailing rows if the length is not divisible by 3 with DataFrame.iloc, and then create unique groups by integer division by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
N = 3
num = len(df) // N * N
df = df.iloc[:num]
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
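If letter labels like A and B are wanted instead of 0 and 1 (as in the question's expected output), the integer groups can be mapped to letters afterwards. A small sketch, assuming there are no more than 26 groups:
import string
import numpy as np

N = 3
groups = np.arange(len(df)) // N
# map 0 -> 'A', 1 -> 'B', ... (only valid for up to 26 groups)
df['groups'] = [string.ascii_uppercase[g] for g in groups]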
IIUC, groupby on row position // 3; note that this variant fills the new column with the per-group sum of all values rather than a label:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42

Operations with different parts of the same dataframe

Assume there is a dataframe:
kind value
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 2 6
6 2 7
7 2 8
8 2 9
9 2 10
We can do something with a filtered part of a dataframe:
df.loc[df['kind']==1, 'value'] = df.loc[df['kind']==1, 'value'] * 2
How to perform a calculation involving two or more parts of the same dataframe, assuming their size is equal? Something like this:
df.loc[df['kind']==1, 'value'] = (
    df.loc[df['kind']==1, 'value'] * df.loc[df['kind']==2, 'value'])
(this doesn't work)
Try this:
In [107]: df.loc[df['kind']==1, 'value'] *= df.loc[df['kind']==2, 'value'].values
In [108]: df
Out[108]:
kind value
0 1 6
1 1 14
2 1 24
3 1 36
4 1 50
5 2 6
6 2 7
7 2 8
8 2 9
9 2 10
Use:
m = df['kind']==1
df.loc[m, 'value'] = df.loc[m, 'value'].values * df.loc[df['kind']==2, 'value'].values
print (df)
kind value
0 1 6
1 1 14
2 1 24
3 1 36
4 1 50
5 2 6
6 2 7
7 2 8
8 2 9
9 2 10
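The reason .values appears on the right-hand side in both answers is that pandas aligns Series on their index during arithmetic and assignment: the kind == 1 rows live at index 0-4 and the kind == 2 rows at index 5-9, so without stripping the index the two slices share no labels. A quick sketch of the failure mode:
# the two slices have disjoint indexes (0-4 vs 5-9), so element-wise
# multiplication aligns on the index and every result is NaN
df.loc[df['kind']==1, 'value'] * df.loc[df['kind']==2, 'value']
# 0   NaN
# 1   NaN
# ...
# 9   NaN
# Name: value, dtype: float64
Calling .values converts the right-hand side to a plain NumPy array, so the assignment is done positionally instead of by label.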

pandas python sorting according to a pattern

I have a pandas data frame that consists of 5 columns. The second column has the numbers 1 to 500 repeated 5 times. As a shorter example, the second column is something like this: (1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3), and I want to sort it to look like this: (1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4). The code I am using to sort is df = res.sort([2], ascending=True), but this code sorts it as (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4).
Any help will be much appreciated. Thanks
How about this: sort by the cumcount and then the value itself:
In [11]: df = pd.DataFrame({"s": [1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3]})
In [12]: df.groupby("s").cumcount()
Out[12]:
0 0
1 0
2 0
3 1
4 0
5 1
6 2
7 1
8 2
9 1
10 2
11 3
12 3
13 2
14 3
15 3
dtype: int64
In [13]: df["s_cumcounts"] = df.groupby("s").cumcount()
In [14]: df.sort_values(["s_cumcounts", "s"])
Out[14]:
s s_cumcounts
0 1 0
2 2 0
4 3 0
1 4 0
5 1 1
7 2 1
9 3 1
3 4 1
6 1 2
10 2 2
13 3 2
8 4 2
11 1 3
14 2 3
15 3 3
12 4 3
In [15]: df = df.sort_values(["s_cumcounts", "s"])
In [16]: del df["s_cumcounts"]
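A compact variant of the same approach that avoids keeping the helper column around (a sketch; _cc is just a throwaway column name chosen here):
out = (df.assign(_cc=df.groupby("s").cumcount())
         .sort_values(["_cc", "s"])
         .drop(columns="_cc"))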
