Question about conditional calculation in pandas - python

I have a formula that I want to turn into a pandas calculation. The formula is very simple:
NEW = A(where v=1) + A(where v=3) + A(where v=5)
I have a data frame like this:
Type  subType  value  A  NEW
X     a        1      3  =3+9+9=21
X     a        3      9
X     a        5      9
X     b        1      4  =4+5+0=9
X     b        3      5
X     b        5      0
Y     a        1      1  =1+2+3=6
Y     a        3      2
Y     a        5      3
Y     b        1      4  =4+5+2=11
Y     b        3      5
Y     b        5      2
Two questions:
1. I know I can just write out the calculation with specific cell references, but I want the code to look nicer. Is there another way to get the values?
2. Because there will be only two results for X & Y, how can I add them into my original dataframe for further calculation? (My thought is not to add them to the dataframe and just to use the values whenever they are needed for future calculations.)
I'm quite new to coding, so any answer will be appreciated!

Try this:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','Y','Y','Y'], 'value':[1,3,5,1,3,5], 'A':[3,9,4,0,2,2]})
>>> df
  Type  value  A
0    X      1  3
1    X      3  9
2    X      5  4
3    Y      1  0
4    Y      3  2
5    Y      5  2
>>> df.groupby('Type')['A'].sum()
Type
X    16
Y     4
Name: A, dtype: int64
>>> ur_dict = df.groupby('Type')['A'].sum().to_dict()
>>> df['NEW'] = df['Type'].map(ur_dict)
>>> df
  Type  value  A  NEW
0    X      1  3   16
1    X      3  9   16
2    X      5  4   16
3    Y      1  0    4
4    Y      3  2    4
5    Y      5  2    4
Hope this helps.
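As an aside, if you don't need the intermediate dict, groupby().transform('sum') broadcasts each group's sum back onto its rows in a single step (a shorter sketch of the same idea):
>>> df['NEW'] = df.groupby('Type')['A'].transform('sum')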
Edit to answer additional inquiry:
You are mapping tuple keys onto a single column, which will give you an error. You should first move the columns you need into the index, then map the dictionary onto that index.
See below:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','X','X','X','Y','Y','Y','Y','Y','Y'], 'subType':['a','a','a','b','b','b','a','a','a','b','b','b'],'value':[1,3,5,1,3,5,1,3,5,1,3,5],'A':[3,9,9,4,5,0,1,2,3,4,5,2]})
>>> df
   Type subType  value  A
0     X       a      1  3
1     X       a      3  9
2     X       a      5  9
3     X       b      1  4
4     X       b      3  5
5     X       b      5  0
6     Y       a      1  1
7     Y       a      3  2
8     Y       a      5  3
9     Y       b      1  4
10    Y       b      3  5
11    Y       b      5  2
>>> df.groupby(['Type', 'subType'])['A'].sum()
Type  subType
X     a          21
      b           9
Y     a           6
      b          11
Name: A, dtype: int64
>>> ur_dict = df.groupby(['Type', 'subType'])['A'].sum().to_dict()
>>> ur_dict
{('X', 'a'): 21, ('X', 'b'): 9, ('Y', 'a'): 6, ('Y', 'b'): 11}
>>> df['NEW'] = df.set_index(['Type', 'subType']).index.map(ur_dict)
>>> df
   Type subType  value  A  NEW
0     X       a      1  3   21
1     X       a      3  9   21
2     X       a      5  9   21
3     X       b      1  4    9
4     X       b      3  5    9
5     X       b      5  0    9
6     Y       a      1  1    6
7     Y       a      3  2    6
8     Y       a      5  3    6
9     Y       b      1  4   11
10    Y       b      3  5   11
11    Y       b      5  2   11
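The transform shortcut from above also sidesteps the tuple-key issue entirely, since no dict is built (a sketch):
>>> df['NEW'] = df.groupby(['Type', 'subType'])['A'].transform('sum')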

Related

Pandas - how can I replace rows in a dataframe

I am new to Python and am trying to replace rows.
I have a dataframe such as:
X  Y
1  a
2  d
3  c
4  a
5  b
6  e
7  a
8  b
I have two questions:
1. How can I swap the 2nd row with the 5th, such as:
X  Y
1  a
5  b
3  c
4  a
2  d
6  e
7  a
8  b
2. How can I put the 6th row above the 3rd row, such as:
X  Y
1  a
2  d
6  e
3  c
4  a
5  b
7  a
8  b
First use DataFrame.iloc; Python counts from 0, so to select the second row use 1 and for the fifth use 4. Assign the underlying values rather than the DataFrame so index alignment doesn't undo the swap:
df.iloc[[1, 4]] = df.iloc[[4, 1]].to_numpy()
print(df)
   X  Y
0  1  a
1  5  b
2  3  c
3  4  a
4  2  d
5  6  e
6  7  a
7  8  b
Then, to move a row above another, rename the moved row's index to that of the row it should sit below (here 1) and sort the index with mergesort, the only stable sort, so rows with equal indices keep their original relative order:
df = df.rename({5:1}).sort_index(kind='mergesort', ignore_index=True)
print(df)
   X  Y
0  1  a
1  2  d
2  6  e
3  3  c
4  4  a
5  5  b
6  7  a
7  8  b
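If you need the move as a reusable operation, the same result can come from a plain positional reorder; move_row_above below is a hypothetical helper, not a pandas API:
def move_row_above(df, src, dst):
    # Move positional row `src` so it sits just above positional row `dst`.
    order = list(range(len(df)))
    order.remove(src)
    order.insert(order.index(dst), src)
    return df.iloc[order].reset_index(drop=True)

move_row_above(df, 5, 2)  # applied to the original frame, reproduces the second example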

Sort a subset of columns for rows matching condition

My DataFrame looks like this:
   a  b  c  d  e  f  g
0  x  y  1  3  4  5  6
1  x  y -1  7  8  5  6
2  x  y -1  7  8  3  4
For rows where df.c == -1 I would like to sort all the columns between df.d and df.g in ascending order.
The result would be:
   a  b  c  d  e  f  g
0  x  y  1  3  4  5  6
1  x  y -1  5  6  7  8
2  x  y -1  3  4  7  8
I tried several things but none seemed to work:
for row in df.itertuples():
    if row.c == -1:
        subset = row[4:]
        sorted = sorted(subset)
        df.replace(to_replace=subset, value=sorted)
and also
df.loc[df.c == -1, df[4:]] = sorted(df[4:])
You can use numpy.sort on the region of interest (note the numpy import):
import numpy as np

mask = df.c.eq(-1), slice('d', 'g')
df.loc[mask] = np.sort(df.loc[mask].values)
df
#    a  b  c  d  e  f  g
# 0  x  y  1  3  4  5  6
# 1  x  y -1  5  6  7  8
# 2  x  y -1  3  4  7  8
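np.sort sorts along the last axis by default, so each selected row is sorted independently of the others. A self-contained sketch, with the frame rebuilt from the question's values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x'] * 3, 'b': ['y'] * 3, 'c': [1, -1, -1],
                   'd': [3, 7, 7], 'e': [4, 8, 8], 'f': [5, 5, 3], 'g': [6, 6, 4]})
mask = df.c.eq(-1), slice('d', 'g')          # (row indexer, column slice) tuple for .loc
df.loc[mask] = np.sort(df.loc[mask].values)  # row-wise sort of the selected block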
Probably not the fastest, but this works:
rmask = df.c == -1
cmask = ['d', 'e', 'f', 'g']
df.loc[rmask, cmask] = df.loc[rmask, cmask].apply(lambda row: sorted(row), axis=1)
df
   a  b  c  d  e  f  g
0  x  y  1  3  4  5  6
1  x  y -1  5  6  7  8
2  x  y -1  3  4  7  8

Pandas inner merge/join returning all rows

I'm trying to merge two data frames based on a column present in both, keeping only the intersection of the two sets.
The desired result is:
foo        bar        foobar
x  y  z    x  j  i    x  y  z  j  i
a  1  2    a  9  0    a  1  2  9  0
b  3  4    b  9  0    b  3  4  9  0
c  5  6    c  9  0    c  5  6  9  0
d  7  8    e  9  0
           f  9  0
My code that does not produce the desired result is:
pd.merge(foo, bar, how='inner', on='x')
Instead, the code seems to return:
foo        bar        foobar
x  y  z    x  j  i    x  y  z  j  i
a  1  2    a  9  0    a  1  2  9  0
b  3  4    b  9  0    b  3  4  9  0
c  5  6    c  9  0    c  5  6  9  0
d  7  8    e  9  0    e  *  *  9  0
           f  9  0    f  *  *  9  0
(where * represents a NaN)
Where am I going wrong? I've already reached the third page of Google results trying to fix this and nothing works. Whatever I do, I get an outer join with all rows from both sets.
Usually it means that you have duplicates in the column(s) used for joining, resulting in a Cartesian product.
Demo:
In [35]: foo
Out[35]:
   x  y  z
0  a  1  2
1  b  3  4
2  c  5  6
3  d  7  8

In [36]: bar
Out[36]:
   x  j  i
0  a  9  0
1  b  9  0
2  a  9  0
3  a  9  0
4  b  9  0

In [37]: pd.merge(foo, bar)
Out[37]:
   x  y  z  j  i
0  a  1  2  9  0
1  a  1  2  9  0
2  a  1  2  9  0
3  b  3  4  9  0
4  b  3  4  9  0
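If the duplicates are unexpected, merge's validate argument will raise instead of silently multiplying rows; if they are expected, drop them from the key column first (a sketch on the demo frames):
# raises pandas.errors.MergeError because 'x' is not unique in bar
pd.merge(foo, bar, on='x', validate='one_to_one')

# or deduplicate the key before merging
pd.merge(foo, bar.drop_duplicates('x'), on='x', how='inner')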

How to Pandas read_csv multiple records per line

(I'm a pandas n00b) I have some oddly formatted CSV data that resembles this:
i   A         B         C
    x  y  z   x  y  z   x  y  z
-------------------------------
1   1  2  3   4  5  6   7  8  9
2   1  2  3   3  2  1   2  1  3
3   9  8  7   6  5  4   3  2  1
where A, B, C are categorical and the properties x, y, z are present for each. What I think I want to do (part of a larger split-apply-combine step) is to read data with Pandas such that I have dimensionally homogenous observations like this:
i   id  GRP  x  y  z
--------------------
1   1   A    1  2  3
2   1   B    4  5  6
3   1   C    7  8  9
4   2   A    1  2  3
5   2   B    3  2  1
6   2   C    2  1  3
7   3   A    9  8  7
8   3   B    6  5  4
9   3   C    3  2  1
So how best to accomplish this?
#1: I thought about reading the file using basic read_csv() options, then iterating/slicing/transposing to create another dataframe with the structure I want. But in my case the number of categories (A, B, C) and properties (x, y, z) is large and not known ahead of time. I'm also worried about memory issues when scaling to large datasets.
#2: I like the idea of setting the iterator param in read_csv() and then yielding multiple observations per line (any reason why not to set chunksize=1?). At least I wouldn't be creating multiple dataframes this way.
What's the smarter way to do this?
First I constructed a sample dataframe like yours:
import numpy as np
import pandas as pd

column = pd.MultiIndex(levels=[['A', 'B', 'C'], ['x', 'y', 'z']],
                       codes=[[i for i in range(3) for _ in range(3)], [0, 1, 2] * 3])
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 9)),
                  columns=column, index=[1, 2, 3])
print(df)
#    A        B        C
#    x  y  z  x  y  z  x  y  z
# 1  5  7  4  7  7  8  9  1  9
# 2  8  5  1  8  5  9  4  4  2
# 3  4  9  6  2  1  4  6  1  6
To get your desired output, reshape the dataframe using df.stack() and then reset the index:
df = df.stack(0).reset_index()
df.index += 1  # make the index begin from 1
print(df)
#    level_0 level_1  x  y  z
# 1        1       A  5  7  4
# 2        1       B  7  7  8
# 3        1       C  9  1  9
# 4        2       A  8  5  1
# 5        2       B  8  5  9
# 6        2       C  4  4  2
# 7        3       A  4  9  6
# 8        3       B  2  1  4
# 9        3       C  6  1  6
Then you can just rename the columns as you want. Hope it helps.
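For the original CSV itself (rather than a constructed frame), read_csv can build the two-level columns directly from the two header rows. A sketch assuming the first column holds the observation id, the dashed rule is not in the actual file, and 'data.csv' is a stand-in path:
import pandas as pd

df = pd.read_csv('data.csv', header=[0, 1], index_col=0)  # two header rows -> MultiIndex columns
tidy = df.stack(0).rename_axis(['i', 'GRP']).reset_index()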

Opposite of melt in python pandas

I cannot figure out how to do "reverse melt" using Pandas in python.
This is my starting data
import pandas as pd
from io import StringIO

origin = pd.read_table(StringIO('''label type value
x a 1
x b 2
x c 3
y a 4
y b 5
y c 6
z a 7
z b 8
z c 9'''), sep=r'\s+')
origin
Out[5]:
  label type  value
0     x    a      1
1     x    b      2
2     x    c      3
3     y    a      4
4     y    b      5
5     y    c      6
6     z    a      7
7     z    b      8
8     z    c      9
This is the output I would like to have:
label a b c
x 1 2 3
y 4 5 6
z 7 8 9
I'm sure there is an easy way to do this, but I don't know how.
There are a few ways:
using .pivot:
>>> origin.pivot(index='label', columns='type')['value']
type   a  b  c
label
x      1  2  3
y      4  5  6
z      7  8  9

[3 rows x 3 columns]
using pivot_table:
>>> origin.pivot_table(values='value', index='label', columns='type')
       value
type       a  b  c
label
x          1  2  3
y          4  5  6
z          7  8  9

[3 rows x 3 columns]
or .groupby followed by .unstack:
>>> origin.groupby(['label', 'type'])['value'].aggregate('mean').unstack()
type   a  b  c
label
x      1  2  3
y      4  5  6
z      7  8  9

[3 rows x 3 columns]
DataFrame.set_index + DataFrame.unstack
df.set_index(['label','type'])['value'].unstack()
type   a  b  c
label
x      1  2  3
y      4  5  6
z      7  8  9
simplifying the passing of pivot arguments
df.pivot(*df)
type   a  b  c
label
x      1  2  3
y      4  5  6
z      7  8  9

[*df]
# ['label', 'type', 'value']
For the expected output we need DataFrame.rename_axis and DataFrame.reset_index:
df.pivot(*df).rename_axis(columns=None).reset_index()

  label  a  b  c
0     x  1  2  3
1     y  4  5  6
2     z  7  8  9
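One caveat: as of pandas 2.0, DataFrame.pivot accepts keyword arguments only, so the star-unpacking trick no longer works there; spell the arguments out instead:
df.pivot(index='label', columns='type', values='value')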
If there are duplicates in the label and type columns we could lose information, so we need GroupBy.cumcount:
print(df)
  label type  value
0     x    a      1
1     x    b      2
2     x    c      3
3     y    a      4
4     y    b      5
5     y    c      6
6     z    a      7
7     z    b      8
8     z    c      9
0     x    a      1
1     x    b      2
2     x    c      3
3     y    a      4
4     y    b      5
5     y    c      6
6     z    a      7
7     z    b      8
8     z    c      9
df.pivot_table(index=['label',
                      df.groupby(['label', 'type']).cumcount()],
               columns='type',
               values='value')

type     a  b  c
label
x     0  1  2  3
      1  1  2  3
y     0  4  5  6
      1  4  5  6
z     0  7  8  9
      1  7  8  9
Or:
(df.assign(type_2=df.groupby(['label', 'type']).cumcount())
   .set_index(['label', 'type', 'type_2'])['value']
   .unstack('type'))
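A quick way to check whether the cumcount step is needed at all: plain pivot raises on duplicate label/type pairs, so test for them first (a sketch):
df.duplicated(subset=['label', 'type']).any()  # True means plain pivot would raise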
