I have a pandas dataframe that looks like:
c1 c2 c3 c4 result
a b c d 1
b c d a 1
a e d b 1
g a f c 1
but I want to randomly select 50% of the rows, swap (c1, c2) with (c3, c4) in those rows, and flip their result column from 1 to 0 (as shown below):
c1 c2 c3 c4 result
a b c d 1
d a b c 0 (we swapped c3 and c4 with c1 and c2)
a e d b 1
f c g a 0 (we swapped c3 and c4 with c1 and c2)
What's the idiomatic way to accomplish this?
You had the general idea. Shuffle the DataFrame and split it in half. Then modify one half and join back.
import pandas as pd
import numpy as np
np.random.seed(410112)
dfs = np.array_split(df.sample(frac=1), 2) # Shuffle then split in 1/2
# On one half, set result to 0 and swap (c1, c2) with (c3, c4)
dfs[1] = dfs[1].assign(result=0)  # .assign avoids a SettingWithCopyWarning on the slice
dfs[1] = dfs[1].rename(columns={'c1': 'c3', 'c2': 'c4', 'c3': 'c1', 'c4': 'c2'})
# Join back; concat aligns on column names, so the renamed values land in the right columns
df = pd.concat(dfs).sort_index()
  c1 c2 c3 c4  result
0  a  b  c  d       1
1  d  a  b  c       0
2  d  b  a  e       0
3  g  a  f  c       1
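An alternative sketch that avoids the split/concat round-trip: sample half of the index with df.sample, then assign the swapped columns in place with .loc (variable names here are illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'c1': list('abag'),
                   'c2': list('bcea'),
                   'c3': list('cddf'),
                   'c4': list('dabc'),
                   'result': [1, 1, 1, 1]})

orig = df.copy()  # kept only to compare before/after

# Pick 50% of the rows at random
idx = df.sample(frac=0.5, random_state=410112).index

# .values strips the column labels so pandas doesn't re-align
# the swapped columns back to their original positions
df.loc[idx, ['c1', 'c2', 'c3', 'c4']] = df.loc[idx, ['c3', 'c4', 'c1', 'c2']].values
df.loc[idx, 'result'] = 0
```

The .values call is the important detail: without it, .loc assignment aligns on column names and the swap becomes a no-op.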
Related
I have a DataFrame like so:
C1 C2 C3 C4
1 A B C E
2 C D E F
3 A C A B
4 A A B G
5 B nan C E
And a list:
filt = ['A', 'B', 'C']
What I need is a filter that keeps only the rows that have all the values from filt, in any order or position. So output here would be:
C1 C2 C3 C4
1 A B C E
3 A C A B
I've looked at previous questions like Check multiple columns for multiple values and return a dataframe. In that case, however, the OP is only partially matching. In my case, all values must be present, in any order, for the row to be matched.
One solution
Use:
fs_filt = frozenset(filt)
mask = df.apply(frozenset, axis=1) >= fs_filt
res = df[mask]
print(res)
Output
  C1 C2 C3 C4
1  A  B  C  E
3  A  C  A  B
The idea is to convert each row to a frozenset and then check whether a frozenset of filt is a subset of the row's elements (frozenset's >= operator tests "left is a superset of right").
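A self-contained sketch of the approach, reconstructing the example frame (including the nan row) to show the subset test end-to-end:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': ['A', 'C', 'A', 'A', 'B'],
                   'C2': ['B', 'D', 'C', 'A', np.nan],
                   'C3': ['C', 'E', 'A', 'B', 'C'],
                   'C4': ['E', 'F', 'B', 'G', 'E']})
filt = ['A', 'B', 'C']

# row >= filt is True when every element of filt appears somewhere in the row
mask = df.apply(frozenset, axis=1) >= frozenset(filt)
res = df[mask]
```

Note that nan simply becomes one more element of the row's frozenset, so it never causes a false positive: the row only matches if all of filt is present.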
I have a dataframe that looks like this:
v1 v2
0 a A1
1 b A2,A3
2 c B4
3 d A5, B6, B7
I want to modify this dataframe such that any row which has more than one value in the v2 column gets replicated for each value in v2. For example for the above dataframe, the result is as follows:
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
I was able to do this with the following code:
new_df = pd.DataFrame()
for index, row in df.iterrows():
    if len(row["v2"].split(',')) > 1:
        row_base = row
        for r in row["v2"].split(','):
            row_base["v2"] = r
            new_df = new_df.append(row_base, ignore_index=True)
    else:
        new_df = new_df.append(row)
However, it is extremely inefficient on a large dataframe, and I would like to learn how to do it more efficiently.
Pandas solution for version 0.25+ with Series.str.split and DataFrame.explode:
df = df.assign(v2 = df.v2.str.split(',')).explode('v2').reset_index(drop=True)
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
For older versions, and also for better performance, use numpy:
from itertools import chain
s = df.v2.str.split(',')
lens = s.str.len()
df = pd.DataFrame({
    'v1': df['v1'].values.repeat(lens),
    'v2': list(chain.from_iterable(s.values.tolist()))
})
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
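One detail worth flagging: the sample data has spaces after some commas ('A5, B6, B7'), so a plain split(',') leaves a leading space on ' B6' and ' B7'. A sketch of the explode approach with a strip added:

```python
import pandas as pd

df = pd.DataFrame({'v1': list('abcd'),
                   'v2': ['A1', 'A2,A3', 'B4', 'A5, B6, B7']})

out = (df.assign(v2=df['v2'].str.split(','))
         .explode('v2')
         .assign(v2=lambda d: d['v2'].str.strip())  # drop the stray spaces
         .reset_index(drop=True))
```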
I have a DataFrame with a column that divides the data set into a set of categories. I would like to remove those categories that have a small number of observations.
Example
df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'], 'v': [5, 2, 7, 1, 2, 8, 3]})
c v
0 c1 5
1 c2 2
2 c1 7
3 c3 1
4 c4 2
5 c5 8
6 c2 3
For column c and n = 2, remove all rows whose value in column c occurs fewer than n times, resulting in:
c v
0 c1 5
1 c2 2
2 c1 7
3 c2 3
Using pd.Series.value_counts followed by Boolean indexing via pd.Series.isin:
counts = df['c'].value_counts() # create series of counts
idx = counts[counts < 2].index # filter for indices with < 2 counts
res = df[~df['c'].isin(idx)] # filter dataframe
print(res)
c v
0 c1 5
1 c2 2
2 c1 7
6 c2 3
This can also be achieved by using groupby:
mask = df.groupby('c').count().reset_index()
mask = mask.loc[mask['v'] < 2]
res = df[~df.c.isin(mask.c.values)]
print(res)
output:
c v
0 c1 5
1 c2 2
2 c1 7
6 c2 3
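A third option (a sketch on the same data) is groupby + transform, which keeps everything in a single boolean-indexing step:

```python
import pandas as pd

df = pd.DataFrame({'c': ['c1', 'c2', 'c1', 'c3', 'c4', 'c5', 'c2'],
                   'v': [5, 2, 7, 1, 2, 8, 3]})

# transform('size') broadcasts each group's row count back to the original shape,
# so the comparison lines up row-for-row with df
res = df[df.groupby('c')['c'].transform('size') >= 2]
```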
Consider the following data frames:
import pandas as pd
df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
A id
0 A0 f
1 A1 g
2 A2 h
3 A3 i
4 A4 j
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
B id
0 B0 f
1 B1 g
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})
B id
0 B3 i
1 B4 j
I want to merge them to get
A id B
0 A0 f B0
1 A1 g B1
2 A2 h NaN
3 A3 i B3
4 A4 j B4
Inspired by this answer I tried

from functools import reduce
final = reduce(lambda l, r: pd.merge(l, r, how='outer', on='id'), [df1, df2, df3])
but unfortunately it yields
A id B_x B_y
0 A0 f B0 NaN
1 A1 g B1 NaN
2 A2 h NaN NaN
3 A3 i NaN B3
4 A4 j NaN B4
Additionally, I checked out this question but I can't adapt the solution to my problem. Also, I didn't find any options in the docs for pandas.merge to make this happen.
In my real world problem the list of data frames might be much longer and the size of the data frames might be much larger.
Is there any "pythonic" way to do this directly and without "postprocessing"? It would be perfect to have a solution that raises an exception if column B of df2 and df3 would overlap (so if there might be multiple candidates for some value in column B of the final data frame).
Consider pd.concat + groupby: stack the frames, then take the first non-null value per id (GroupBy.first skips NaN):
pd.concat([df1, df2, df3], axis=0).groupby('id').first().reset_index()
id A B
0 f A0 B0
1 g A1 B1
2 h A2 NaN
3 i A3 B3
4 j A4 B4
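Note that groupby('id').first() silently resolves overlapping B values by keeping the first non-null one, whereas the question asked for an exception in that case. A sketch (merge_unique is a hypothetical helper, not a pandas function) that adds the check:

```python
import pandas as pd

df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A%d' % i for i in range(5)]})
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B%d' % i for i in range(2)]})
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B%d' % i for i in range(3, 5)]})

def merge_unique(frames, key='id'):
    stacked = pd.concat(frames, sort=False)
    # count() ignores NaN, so more than one non-null value per id
    # in any column means two frames supplied conflicting candidates
    counts = stacked.groupby(key).count()
    if (counts > 1).any().any():
        raise ValueError('overlapping values for some %s' % key)
    return stacked.groupby(key, as_index=False).first()

final = merge_unique([df1, df2, df3])
```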
I have a dataset like:
a b c
1 x1 c1
2 x2 c2
3 x3 c3
and I want to apply a function f only to the column b.
I did something like:
d2 = d['b'].apply(f)
But I get a result like:
a b
1 xt
2 xt
3 xt
But I want to keep column c, i.e. a result like:
a b c
1 xt c1
2 xt c2
3 xt c3
How can I do that without merging with the first dataset?
I suggest you avoid apply, because it is slower; it is better to use pandas API functions.
e.g. if you need to replace the column with a new constant value:
df['b'] = 'xt'
print (df)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
But if apply is necessary:
def f(x):
    return 'xt'
df['b'] = df.b.apply(f)
print (df)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
If you need new DataFrame, first use copy:
d = df.copy()
def f(x):
    return 'xt'
d['b'] = d.b.apply(f)
print (d)
a b c
0 1 xt c1
1 2 xt c2
2 3 xt c3
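The same point holds for a less trivial f: when a vectorized equivalent exists, prefer it over apply. A sketch with made-up data, uppercasing column b both ways:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['x1', 'x2', 'x3'],
                   'c': ['c1', 'c2', 'c3']})

# apply version: calls a Python function once per element
slow = df['b'].apply(lambda x: x.upper())

# vectorized version: one call into pandas' string machinery
df['b'] = df['b'].str.upper()
```

Both produce the same values, but the .str accessor avoids the per-element Python call and leaves the other columns untouched, which is exactly what the question asked for.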