I have a pandas DataFrame with a column of type string that looks like:
1 1
2 3,1
3 1
4 1
5 2,1,2
6 1
7 1
8 1
9 1
10 4,3,1
I want to sum the comma-separated integers in each row, obtaining as a result:
1 1
2 4
3 1
4 1
5 5
6 1
7 1
8 1
9 1
10 8
My attempt so far has been:
qty = []
for i in df['Qty']:
    i = i.split(",")
    i = sum(i)
    qty.append(i)
df['Qty'] = qty
However, I get the error:
TypeError: cannot perform reduce with flexible type
Use apply on the column; split returns a list of strings, so convert each piece to int before summing:
df['B'].apply(lambda x: sum(map(int, x.split(','))))
In [81]: df
Out[81]:
A B
0 1 1
1 2 3,1
2 3 1
3 4 1
4 5 2,1,2
5 6 1
6 7 1
7 8 1
8 9 1
9 10 4,3,1
In [82]: df['B'].apply(lambda x: sum(map(int, x.split(','))))
Out[82]:
0 1
1 4
2 1
3 1
4 5
5 1
6 1
7 1
8 1
9 8
Name: B, dtype: int64
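A vectorized alternative (a sketch; assumes pandas 0.25+ for explode and no missing values in the column) splits each string, explodes the lists onto their own rows, and sums back per original index:
df['B'] = df['B'].str.split(',').explode().astype(int).groupby(level=0).sum()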
I have an Excel dataset which contains 100 rows and 100 columns with order frequencies in locations described by x and y (a grid-like structure).
I'd like to convert it to the following structure with 3 columns:
x-coordinates | y-coordinates | value
The "value" column only contains positive integers. The x and y columns contain float-type data (geographical coordinates).
The order does not matter, as it can easily be sorted afterwards.
So, basically, a merge of lists could work, e.g.:
[[1,5,3,5], [4,2,5,6], [2,3,1,5]] ==> [1,5,3,5,4,2,5,6,2,3,1,5]
But then I would lose the location... which is key for my project.
What is the best way to accomplish this?
Assuming this input:
import pandas as pd

l = [[1,5,3,5],[4,2,5,6],[2,3,1,5]]
df = pd.DataFrame(l)
you can use stack:
df2 = df.rename_axis(index='x', columns='y').stack().reset_index(name='value')
output:
x y value
0 0 0 1
1 0 1 5
2 0 2 3
3 0 3 5
4 1 0 4
5 1 1 2
6 1 2 5
7 1 3 6
8 2 0 2
9 2 1 3
10 2 2 1
11 2 3 5
or melt for a different order:
df2 = df.rename_axis('x').reset_index().melt('x', var_name='y', value_name='value')
output:
x y value
0 0 0 1
1 1 0 4
2 2 0 2
3 0 1 5
4 1 1 2
5 2 1 3
6 0 2 3
7 1 2 5
8 2 2 1
9 0 3 5
10 1 3 6
11 2 3 5
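If the real coordinate values are known, they can be attached as the index and column labels before stacking, so that x and y come out as the actual coordinates instead of positions (a sketch; the coordinate lists below are hypothetical placeholders):
xs = [51.10, 51.11, 51.12]     # hypothetical x-coordinate of each row
ys = [4.20, 4.21, 4.22, 4.23]  # hypothetical y-coordinate of each column
df.index = pd.Index(xs, name='x')
df.columns = pd.Index(ys, name='y')
df2 = df.stack().reset_index(name='value')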
You should be able to get the results with a melt operation -
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [2, 3, 4]      # y values as the column labels
df.loc[:, 'x'] = [3, 4, 5]  # x values as a regular column
This is what df looks like
2 3 4 x
0 0 1 2 3
1 3 4 5 4
2 6 7 8 5
The melt operation -
df.melt(id_vars='x', var_name='y')
output -
x y value
0 3 2 0
1 4 2 3
2 5 2 6
3 3 3 1
4 4 3 4
5 5 3 7
6 3 4 2
7 4 4 5
8 5 4 8
The original dataframe df is:
type month
0 a 1
1 b 1
2 c 1
3 e 5
4 a 5
5 c 5
6 b 9
7 e 9
8 a 9
9 e 9
10 a 1
11 a 1
Notice that the month values are arranged in consecutive segments that repeat periodically, and the segments are not always the same size. I would like to add a column num that, within each consecutive segment of equal months, counts up from 0 again. The order of the original rows should not be changed. The expected output is:
type month num
0 a 1 0
1 b 1 1
2 c 1 2
3 e 5 0
4 a 5 1
5 c 5 2
6 b 9 0
7 e 9 1
8 a 9 2
9 e 9 3
10 a 1 0
11 a 1 1
I can't simply use groupby on month, since the same month values reappear in separate segments.
First we create group labels by checking whether each row differs from the previous one with Series.shift, then take the cumulative sum of the resulting booleans. Then we group by those labels and use cumcount:
# start a new group whenever month differs from the previous row
grps = df['month'].ne(df['month'].shift()).cumsum()
# running counter (0, 1, 2, ...) within each group
df['num'] = df.groupby(grps).cumcount()
type month num
0 a 1 0
1 b 1 1
2 c 1 2
3 e 5 0
4 a 5 1
5 c 5 2
6 b 9 0
7 e 9 1
8 a 9 2
9 e 9 3
10 a 1 0
11 a 1 1
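For the sample data, the intermediate grps series gives every consecutive run of equal months its own label, which is what makes the grouped cumcount restart:
print(grps.tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4]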
I have a data-frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a column with the same value for each group of 3 consecutive rows.
n (the number of rows) is guaranteed to be divisible by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove the trailing rows if the length is not divisible by 3 with DataFrame.iloc, then create unique group labels by integer division of the row positions by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
import numpy as np

N = 3
num = len(df) // N * N
df = df.iloc[:num]
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
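To get letter labels like A/B as in the expected output, one option (a sketch, assuming at most 26 groups) is to index into string.ascii_uppercase with the integer group label, which turns groups 0 and 1 into A and B:
import string

df['groups'] = df['groups'].map(lambda g: string.ascii_uppercase[g])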
IIUC, use groupby; here the shared value is the total of each 3-row block:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42
Assume there is a dataframe:
kind value
0 1 1
1 1 2
2 1 3
3 1 4
4 1 5
5 2 6
6 2 7
7 2 8
8 2 9
9 2 10
We can do something with a filtered part of a dataframe:
df.loc[df['kind']==1, 'value'] = df.loc[df['kind']==1, 'value'] * 2
How to perform a calculation involving two or more parts of the same dataframe, assuming their size is equal? Something like this:
df.loc[df['kind']==1, 'value'] =
df.loc[df['kind']==1, 'value'] * df.loc[df['kind']==2, 'value']
(this doesn't work)
Try this:
In [107]: df.loc[df['kind']==1, 'value'] *= df.loc[df['kind']==2, 'value'].values
In [108]: df
Out[108]:
kind value
0 1 6
1 1 14
2 1 24
3 1 36
4 1 50
5 2 6
6 2 7
7 2 8
8 2 9
9 2 10
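The .values on the right-hand side is what makes this work: it turns the Series into a plain NumPy array, so the multiplication and assignment happen positionally. Without it, pandas would align the two sides on their index labels (0-4 on the left versus 5-9 on the right) and every result would come out as NaN.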
Use:
m = df['kind']==1
df.loc[m, 'value'] = df.loc[m, 'value'].values * df.loc[df['kind']==2, 'value'].values
print (df)
kind value
0 1 6
1 1 14
2 1 24
3 1 36
4 1 50
5 2 6
6 2 7
7 2 8
8 2 9
9 2 10
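Note that both variants rely on the two filtered slices having the same length and matching row order, as the question assumes; with different lengths, the positional assignment would raise a length-mismatch error.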
I have a pandas DataFrame with 5 columns. The second column has the numbers 1 to 500 repeated 5 times. As a shorter example, the second column is something like (1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3), and I want to sort it to look like (1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4). The code I am using, df = res.sort([2], ascending=True), instead sorts it to (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4).
Any help will be much appreciated. Thanks.
How about this: sort by the cumulative count within each value, and then by the value itself:
In [11]: df = pd.DataFrame({"s": [1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3]})
In [12]: df.groupby("s").cumcount()
Out[12]:
0 0
1 0
2 0
3 1
4 0
5 1
6 2
7 1
8 2
9 1
10 2
11 3
12 3
13 2
14 3
15 3
dtype: int64
In [13]: df["s_cumcounts"] = df.groupby("s").cumcount()
In [14]: df.sort_values(["s_cumcounts", "s"])
Out[14]:
s s_cumcounts
0 1 0
2 2 0
4 3 0
1 4 0
5 1 1
7 2 1
9 3 1
3 4 1
6 1 2
10 2 2
13 3 2
8 4 2
11 1 3
14 2 3
15 3 3
12 4 3
In [15]: df = df.sort_values(["s_cumcounts", "s"])
In [16]: del df["s_cumcounts"]
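Starting from the original frame, a one-step alternative that avoids the helper column is numpy.lexsort, whose last key is the primary sort key (a sketch):
import numpy as np

df = df.iloc[np.lexsort([df["s"], df.groupby("s").cumcount()])]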