Grouping a dataframe and applying tuple - python

I have a DataFrame of the following form:
df = pd.DataFrame({('a','A'): [3,4,5,6],
                   ('a','B'): [1,1,3,5],
                   ('b','A'): [9,7,0,3],
                   ('b','B'): [2,0,1,6]})
which looks like this:
   a     b
   A  B  A  B
0  3  1  9  2
1  4  1  7  0
2  5  3  0  1
3  6  5  3  6
I group it by the second level using the following command:
grouped = df.groupby(level=1,axis=1)
And get:
Group A
________
   a  b
   A  A
0  3  9
1  4  7
2  5  0
3  6  3

Group B
________
   a  b
   B  B
0  1  2
1  1  0
2  3  1
3  5  6
How can I take each group's two columns, zip them into a tuple row-wise, and collect the result into a new DataFrame? Basically I'm trying to get at this:
       A      B
0  (3,9)  (1,2)
1  (4,7)  (1,0)
2  (5,0)  (3,1)
3  (6,3)  (5,6)
I've been trying
grouped.apply(lambda x : tuple(x))
But it doesn't do the job and instead gives me tuples of column names. Is there a simple way to do this without resorting to for loops?

Try
grouped.apply(lambda x: pd.Series([tuple(i) for i in x.values]))

This seems to do the trick:
grouped.apply(lambda x: pd.Series(list(x.itertuples(index=False))))
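Putting it together as a self-contained sketch (note that axis=1 in groupby is deprecated in pandas 2.x, so this assumes a version where it is still available):
import pandas as pd

df = pd.DataFrame({('a','A'): [3,4,5,6],
                   ('a','B'): [1,1,3,5],
                   ('b','A'): [9,7,0,3],
                   ('b','B'): [2,0,1,6]})

# Group the columns by their second level, then turn each group's rows
# into plain tuples and collect them into one column per group key.
grouped = df.groupby(level=1, axis=1)
result = grouped.apply(lambda g: pd.Series(list(g.itertuples(index=False, name=None))))
print(result)  # columns A and B holding the row-wise tuples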

Related

Using head/tail function on pandas dataframe with different number of n for each group

I want to use the head / tail functions, but take a different number of rows for each group, according to an input dictionary.
The function should take two inputs. The first input is a pandas dataframe:
df = pd.DataFrame({"group": ["A","A","A","B","B","B","B"], "value": [0,1,2,3,4,5,6]})
print(df)
  group  value
0     A      0
1     A      1
2     A      2
3     B      3
4     B      4
5     B      5
6     B      6
The second input is a dict:
slice_per_group = {"A":1,"B":3}
Expected output :
df.groupby('group').head(slice_per_group) #Obviously this doesn't work
  group  value
0     A      0
3     B      3
4     B      4
5     B      5
Use head on each group separately:
df.groupby('group', group_keys=False).apply(lambda g: g.head(slice_per_group.get(g.name)))
  group  value
0     A      0
3     B      3
4     B      4
5     B      5
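An alternative sketch that avoids apply, numbering the rows inside each group with cumcount and comparing against the per-group limit:
import pandas as pd

df = pd.DataFrame({"group": ["A","A","A","B","B","B","B"],
                   "value": [0,1,2,3,4,5,6]})
slice_per_group = {"A": 1, "B": 3}

# Keep a row if its position within its group is below that group's limit.
out = df[df.groupby('group').cumcount() < df['group'].map(slice_per_group)]
print(out)
#   group  value
# 0     A      0
# 3     B      3
# 4     B      4
# 5     B      5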

How to filter groupby object in pandas

I have a df like this:
A  B
1  1
1  2
1  3
2  2
2  1
3  2
3  3
3  4
I would like to extract the rows of the groups whose col B is not ascending, like:
A  B
2  2
2  1
I tried
df.groupby("A").filter()...
but I got stuck on how to extract the rows.
If you have any solution, please let me know.
One way is to use pandas.Series.is_monotonic:
df[df.groupby('A')['B'].transform(lambda x:not x.is_monotonic)]
Output:
   A  B
3  2  2
4  2  1
Use GroupBy.transform with Series.diff, compare with Series.lt to flag groups that have at least one negative difference via Series.any, and filter by boolean indexing:
df1 = df[df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())]
print (df1)
   A  B
3  2  2
4  2  1
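For reference, a self-contained version of the diff-based filter (note that Series.is_monotonic was removed in pandas 2.0; use is_monotonic_increasing there, while the diff approach works unchanged):
import pandas as pd

df = pd.DataFrame({'A': [1,1,1,2,2,3,3,3],
                   'B': [1,2,3,2,1,2,3,4]})

# Flag groups where B ever decreases, i.e. some consecutive difference is negative.
mask = df.groupby('A')['B'].transform(lambda x: x.diff().lt(0).any())
print(df[mask])
#    A  B
# 3  2  2
# 4  2  1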

Assign same random value to A-B, B-A pairs in python Dataframe

I have a Dataframe like
Sou  Des
1    3
1    4
2    3
2    4
3    1
3    2
4    1
4    2
I need to assign a random value between 0 and 1 to each pair, but mirrored pairs like "1-3" and "3-1" have to get the same random value. I'm expecting a result dataframe like:
Sou  Des  Val
1    3    0.1
1    4    0.6
2    3    0.9
2    4    0.5
3    1    0.1
3    2    0.9
4    1    0.6
4    2    0.5
How can I assign the same random value to mirrored pairs like "A-B" and "B-A" in pandas?
Let's first create a helper DF, sorted along axis=1:
In [304]: x = pd.DataFrame(np.sort(df, axis=1), df.index, df.columns)
In [305]: x
Out[305]:
   Sou  Des
0    1    3
1    1    4
2    2    3
3    2    4
4    1    3
5    2    3
6    1    4
7    2    4
Now we can group by its columns (the dummy column c gives transform something to fill with the per-group random value):
In [306]: df['Val'] = (x.assign(c=1)
                        .groupby(x.columns.tolist())
                        .transform(lambda x: np.random.rand(1)))
In [307]: df
Out[307]:
   Sou  Des       Val
0    1    3  0.989035
1    1    4  0.918397
2    2    3  0.463653
3    2    4  0.313669
4    3    1  0.989035
5    3    2  0.463653
6    4    1  0.918397
7    4    2  0.313669
This is a new way:
s = pd.crosstab(df.Sou, df.Des)
b = np.random.randint(-2000, 2001, size=(len(s), len(s)))  # randint replaces the deprecated random_integers
sy = (b + b.T) / 2
s.mul(sy).replace(0, np.nan).stack().reset_index()
Out[292]:
   Sou  Des       0
0    1    3   -60.0
1    1    4  -867.0
2    2    3   269.0
3    2    4  1152.0
4    3    1   -60.0
5    3    2   269.0
6    4    1  -867.0
7    4    2  1152.0
The trick here is to do a bit of work away from the dataframe. You can break this down into three steps:
assemble a list of all tuples (a,b)
assign a random value to each pair so that (a,b) and (b,a) have the same value
fill in the new column
Assuming your dataframe is called df, we can make a list of all the pairs ordered so that a <= b. I think this will be easier than trying to keep track of both (a,b) and (b,a).
pairs = {(a, b) if a <= b else (b, a)
         for a, b in df.itertuples(index=False, name=None)}
It's simple enough to assign a random number to each of these pairs and store it in a dictionary, so I'll leave that to you. Call it pair_dict.
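For instance, a minimal sketch using NumPy's random generator (any source of random floats would do):
import numpy as np

rng = np.random.default_rng()
pair_dict = {pair: rng.random() for pair in pairs}  # one value in [0, 1) per unordered pair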
Now, we just have to lookup the values. We'll ultimately want to write
df['Val'] = df.apply(<some function>, axis=1)
where our function looks up the appropriate value in pair_dict.
Rather than try to cram it into a lambda (though we could), let's write it separately.
def func(row):
    if row['Sou'] <= row['Des']:
        key = (row['Sou'], row['Des'])
    else:
        key = (row['Des'], row['Sou'])
    return pair_dict[key]
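Then fill the column as planned:
df['Val'] = df.apply(func, axis=1)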
If you are OK with the "random" value coming from the hash() method, you can achieve this with frozenset():
df = pd.DataFrame([[1,1,2,2,3,3,4,4],[3,4,3,4,1,2,1,2]]).T
df.columns = ['Sou','Des']
df['Val'] = df.apply(lambda x: hash(frozenset([x["Sou"], x["Des"]])), axis=1)
print(df)
which gives:
   Sou  Des         Val
0    1    3  1580307032
1    1    4 -1736016661
2    2    3   741508915
3    2    4 -1930135584
4    3    1  1580307032
5    3    2   741508915
6    4    1 -1736016661
7    4    2 -1930135584
Reference:
Why aren't Python sets hashable?

Pandas combine 2 Dataframes and overwrite values

I've looked into pandas join, merge, and concat with different parameter values (how to join, indexing, axis=1, etc.), but nothing solves it!
I have two dataframes:
x = pd.DataFrame(np.random.randn(4,4))
y = pd.DataFrame(np.random.randn(4,4),columns=list(range(2,6)))
x
Out[67]:
          0         1         2         3
0 -0.036327 -0.594224  0.469633 -0.649221
1  1.891510  0.164184 -0.010760 -0.848515
2 -0.383299  1.416787  0.719434  0.025509
3  0.097420 -0.868072 -0.591106 -0.672628
y
Out[68]:
          2         3         4         5
0 -0.328402 -0.001436 -1.339613 -0.721508
1  0.408685  1.986148  0.176883  0.146694
2 -0.638341  0.018629 -0.319985 -1.832628
3  0.125003  1.134909  0.500017  0.319324
I'd like to combine them into one dataframe, where the values from y in columns 2 and 3 overwrite those of x, and then columns 4 and 5 are appended at the end:
new
Out[100]:
          0         1         2         3         4         5
0 -0.036327 -0.594224 -0.328402 -0.001436 -1.339613 -0.721508
1  1.891510  0.164184  0.408685  1.986148  0.176883  0.146694
2 -0.383299  1.416787 -0.638341  0.018629 -0.319985 -1.832628
3  0.097420 -0.868072  0.125003  1.134909  0.500017  0.319324
You can try combine_first:
df = y.combine_first(x)
You can use update (which modifies x in place) followed by combine_first:
x.update(y)
x.combine_first(y)
Out[1417]:
          0         1         2         3         4         5
0 -1.075266  1.044069 -0.423888  0.247130  0.008867  2.058995
1  0.122782 -0.444159  1.528181  0.595939  0.155170  1.693578
2 -0.825819  0.395140 -0.171900 -0.161182 -2.016067  0.223774
3 -0.009081 -0.148430 -0.028605  0.092074  1.355105 -0.003027
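As a self-contained sketch of that two-step approach:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(4, 4))
y = pd.DataFrame(np.random.randn(4, 4), columns=list(range(2, 6)))

x.update(y)               # overwrite the overlapping columns 2 and 3 in place
new = x.combine_first(y)  # append y's extra columns 4 and 5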
Or use pd.concat, after dropping the intersecting columns from x:
pd.concat([x.drop(columns=x.columns.intersection(y.columns)), y], axis=1)
Out[1432]:
          0         1         2         3         4         5
0 -1.075266  1.044069 -0.423888  0.247130  0.008867  2.058995
1  0.122782 -0.444159  1.528181  0.595939  0.155170  1.693578
2 -0.825819  0.395140 -0.171900 -0.161182 -2.016067  0.223774
3 -0.009081 -0.148430 -0.028605  0.092074  1.355105 -0.003027

Applying operations on groups without aggregating

I want to apply an operation to multiple groups of a data frame and then fill all values of that group with the result. Let's take mean and np.cumsum as an example, with the following dataframe:
df=pd.DataFrame({"a":[1,3,2,4],"b":[1,1,2,2]})
which looks like this
   a  b
0  1  1
1  3  1
2  2  2
3  4  2
Now I want to group the dataframe by b, then take the mean of a in each group, then apply np.cumsum to the means, and then replace all values of a by the (group dependent) result.
For the first three steps, I would start like this
df.groupby("b").mean().apply(np.cumsum)
which gives
   a
b
1  2
2  5
But what I want to get is
   a  b
0  2  1
1  2  1
2  5  2
3  5  2
Any ideas how this can be solved in a nice way?
You can map the result back with Series.map:
df1 = df.groupby("b").mean().cumsum()
print (df1)
   a
b
1  2
2  5
df['a'] = df['b'].map(df1['a'])
print (df)
   a  b
0  2  1
1  2  1
2  5  2
3  5  2
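The same idea as a single chained expression, for reference:
df['a'] = df['b'].map(df.groupby('b')['a'].mean().cumsum())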
