Consider a dataframe df with N columns and M rows:
>>> df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
>>> df
   a  b  c  d  e
0  4  4  5  5  7
1  9  3  8  8  1
2  2  8  1  8  5
3  9  5  1  2  7
4  3  5  8  2  3
5  2  8  8  2  8
6  3  1  7  2  6
7  4  1  5  6  3
8  5  4  4  9  5
9  3  7  5  6  6
I want to randomly choose two columns and then randomly choose one particular row (this would give me two values of the same row). I can achieve this using
>>> df.sample(2, axis=1).sample(1, axis=0)
   e  a
1  3  5
I want to perform this K times, like below:
>>> for i in range(5):
...     df.sample(2, axis=1).sample(1, axis=0)
...
   e  a
1  3  5
   d  b
2  1  9
   e  b
4  8  9
   c  b
0  6  5
   e  c
1  3  5
I want to ensure that I do not choose the same two values (i.e. the same two columns and the same row) in any of the trials. How would I achieve this?
I then want to perform a bitwise XOR operation on the two chosen values in each trial, for example 3 ^ 5, 1 ^ 9, ..., and count all the bit differences in the chosen values.
You can create a list of all (index, column-pair) tuples, and then take random selections from it without replacement.
Sample Data
import pandas as pd
import numpy as np
from itertools import combinations, product
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
#df = df.reset_index() #if index contains duplicates
Code
K = 5
choices = np.array(list(product(df.index, combinations(df.columns, 2))), dtype=object)  # dtype=object keeps the column-pair tuples intact
idx = choices[np.random.choice(len(choices), K, replace=False)]
#array([[9, ('a', 'e')],
# [2, ('a', 'e')],
# [1, ('a', 'c')],
# [3, ('b', 'e')],
# [8, ('d', 'e')]], dtype=object)
Then you can decide how exactly you want your output, but something like this is close to what you show:
pd.concat([df.loc[myid[0], list(myid[1])].reset_index().T for myid in idx])
#        0  1
# index  a  e
# 9      4  8
# index  a  e
# 2      1  1
# index  a  c
# 1      7  1
# index  b  e
# 3      2  3
# index  d  e
# 8      5  7
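The XOR part of the question can then be handled by iterating over the sampled pairs. A minimal sketch, assuming the values fit in Python ints and using bin(x).count('1') as a portable popcount:
# Minimal sketch: XOR each sampled (row, column-pair) and count differing bits.
# Assumes `idx` from above.
total_bit_diff = 0
for row, (c1, c2) in idx:
    v1, v2 = int(df.at[row, c1]), int(df.at[row, c2])
    total_bit_diff += bin(v1 ^ v2).count('1')
print(total_bit_diff)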
Related
I have a dataframe that has a varying number of columns depending on my dataset. I want a function that will add up the combinations of these columns and append these new 'summed columns' to the existing dataframe.
For example, if I have 3 columns, I want 3 more columns: 1 summed with 2, 1 summed with 3, and 2 summed with 3.
Much obliged.
IIUC, you can use itertools.combinations combined with pandas.concat:
from itertools import combinations
out = pd.concat({f'{a}+{b}': df[a]+df[b] for a,b in combinations(df, 2)}, axis=1)
Example:
import numpy as np
np.random.seed(0)
df = pd.DataFrame({k: np.random.randint(0, 10, 5) for k in list('ABC')})
from itertools import combinations
out = pd.concat({f'{a}+{b}': df[a]+df[b] for a,b in combinations(df, 2)}, axis=1)
print(df.join(out))
output:
   A  B  C  A+B  A+C  B+C
0  5  9  7   14   12   16
1  0  3  6    3    6    9
2  3  5  8    8   11   13
3  3  2  8    5   11   10
4  7  4  1   11    8    5
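If you ever need sums over larger combinations, the same pattern generalizes. A sketch, where r (the combination size) is a hypothetical parameter:
from itertools import combinations
r = 3  # hypothetical: sum every 3-column combination
out3 = pd.concat({'+'.join(cols): df[list(cols)].sum(axis=1)
                  for cols in combinations(df, r)}, axis=1)
print(df.join(out3))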
I want to shift some columns in the middle of the dataframe to the rightmost.
I could do this with individual column using code:
cols=list(df.columns.values)
cols.pop(cols.index('one_column'))
df=df[cols +['one_column']]
df
But it's inefficient to do this individually when there are 100 such columns from 2 series, i.e. series1_1 ... series1_50 and series2_1 ... series2_50, in the middle of the dataframe.
How can I do it by assigning the 2 series as lists, popping them and putting them back? Maybe something like
cols=list(df.columns.values)
series1 = list(df.loc['series1_1':'series1_50'])
series2 = list(df.loc['series2_1':'series2_50'])
cols.pop('series1', 'series2')
df=df[cols +['series1', 'series2']]
but this didn't work. Thanks
If you just want to shift the columns, you could call concat like this:
cols_to_shift = ['colA', 'colB']
pd.concat([
    df[df.columns.difference(cols_to_shift)],
    df[cols_to_shift]
], axis=1)
Or, you could do a little list manipulation on the columns.
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
Minimal Example
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 10, (3, 5)), columns=list('ABCDE'))
df
   A  B  C  D  E
0  6  1  4  4  8
1  4  6  3  5  8
2  7  9  9  2  7
cols_to_shift = ['B', 'C']
pd.concat([
    df[df.columns.difference(cols_to_shift)],
    df[cols_to_shift]
], axis=1)
   A  D  E  B  C
0  6  4  8  1  4
1  4  5  8  6  3
2  7  2  7  9  9
cols_to_keep = [c for c in df.columns if c not in cols_to_shift]
df[cols_to_keep + cols_to_shift]
   A  D  E  B  C
0  6  4  8  1  4
1  4  5  8  6  3
2  7  2  7  9  9
I think list.pop only takes indices of elements in the list.
You should use list.remove instead.
cols = df.columns.tolist()
for s in ('series1', 'series2'):
    cols.remove(s)
df = df[cols + ['series1', 'series2']]
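For the 100-column case from the question, the list of columns to move can be built programmatically rather than removed one by one. A sketch, assuming the hypothetical names series1_1 ... series2_50:
# Hypothetical column names from the question: series1_1..series1_50, series2_1..series2_50.
cols_to_shift = [f'series{s}_{i}' for s in (1, 2) for i in range(1, 51)]
cols_to_keep = [c for c in df.columns if c not in set(cols_to_shift)]
df = df[cols_to_keep + cols_to_shift]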
How do I find the nth smallest number in a row of a DataFrame, and add that value as an entry in a new column (because I would ultimately like to export the data)?
Example Data
Setup
np.random.seed([3,14159])
df = pd.DataFrame(np.random.randint(10, size=(4, 5)), columns=list('ABCDE'))
   A  B  C  D  E
0  4  8  1  1  9
1  2  8  1  4  2
2  8  2  8  4  9
3  4  3  4  1  5
In all of the following solutions, I assume n = 3
Solution 1
function prt below
Use np.partition to move the n smallest values to the left of a partition and the rest to the right. Then take everything left of the partition and find its max.
df.assign(nth=np.partition(df.values, 3, axis=1)[:, :3].max(1))
   A  B  C  D  E  nth
0  4  8  1  1  9    4
1  2  8  1  4  2    2
2  8  2  8  4  9    8
3  4  3  4  1  5    4
Solution 2
function srt below
More intuitive, but with a more costly time complexity, using np.sort:
df.assign(nth=np.sort(df.values, axis=1)[:, 2])
   A  B  C  D  E  nth
0  4  8  1  1  9    4
1  2  8  1  4  2    2
2  8  2  8  4  9    8
3  4  3  4  1  5    4
Solution 3
function rnk below
Using pd.DataFrame.rank
A concise version that upcasts to float:
df.assign(nth=df.where(df.rank(1, method='first').eq(3)).stack().values)
   A  B  C  D  E  nth
0  4  8  1  1  9  4.0
1  2  8  1  4  2  2.0
2  8  2  8  4  9  8.0
3  4  3  4  1  5  4.0
Solution 4
function whr below
Using np.where and pd.DataFrame.rank
i, j = np.where(df.rank(1, method='first') == 3)
df.assign(nth=df.values[i, j])
   A  B  C  D  E  nth
0  4  8  1  1  9    4
1  2  8  1  4  2    2
2  8  2  8  4  9    8
3  4  3  4  1  5    4
Timing
Notice that srt is quickest for small inputs and comparable to prt for a while, but as the number of columns grows, the more efficient partitioning algorithm of prt takes over.
from timeit import timeit
prt = lambda df, n: df.assign(nth=np.partition(df.values, n, axis=1)[:, :n].max(1))
srt = lambda df, n: df.assign(nth=np.sort(df.values, axis=1)[:, n - 1])
rnk = lambda df, n: df.assign(nth=df.where(df.rank(1, method='first').eq(n)).stack().values)
def whr(df, n):
    i, j = np.where(df.rank(1, method='first').values == n)
    return df.assign(nth=df.values[i, j])
res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000],
    columns='prt srt rnk whr'.split(),
    dtype=float
)
for i in res.index:
    num_rows = int(np.log(i))
    d = pd.DataFrame(np.random.rand(num_rows, i))
    for j in res.columns:
        stmt = '{}(d, 3)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
res.plot(loglog=True)
You can do this as follows (note that nth here is zero-based, so np.partition(x, 1)[1] returns the second smallest):
df.assign(nth=df.apply(lambda x: np.partition(x, nth)[nth], axis='columns'))
Example:
In [72]: df = pd.DataFrame(np.random.rand(3, 3), index=list('abc'), columns=[1, 2, 3])
In [73]: df
Out[73]:
          1         2         3
a  0.436730  0.653242  0.843014
b  0.643496  0.854859  0.531652
c  0.831672  0.575336  0.517944
In [74]: df.assign(nth=df.apply(lambda x: np.partition(x, 1)[1], axis='columns'))
Out[74]:
          1         2         3       nth
a  0.436730  0.653242  0.843014  0.653242
b  0.643496  0.854859  0.531652  0.643496
c  0.831672  0.575336  0.517944  0.575336
Here is a method that finds the nth smallest item in a list:
def find_nth_in_list(lst, n):
    return sorted(lst)[n - 1]
The usage:
lst = [10, 5, 7, 9, 8, 4, 6, 2, 1, 3]
print(find_nth_in_list(lst, 2))
Output:
2
You can give the row items as a list to this function.
EDIT
You can find rows with this function:
# Returns all rows as a list
def find_rows(df):
    rows = []
    for row in df.iterrows():
        index, data = row
        rows.append(data.tolist())
    return rows
Example usage:
rows = find_rows(df)  # all rows as a list
third_smallest = find_nth_in_list(rows[2], 3)  # 3rd row, 3rd smallest item
Generate some random data:
dd = pd.DataFrame(data=np.random.rand(7, 3))
Find the minimum value per row using numpy:
dd['minPerRow'] = dd.apply(np.min, axis=1)
Export the results:
dd['minPerRow'].to_csv('file.csv')
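The same apply pattern extends from the minimum to the nth smallest per row. A sketch, assuming a 1-based n and excluding the helper column just added:
n = 3  # 1-based: third smallest value in each row
dd['nthPerRow'] = dd.drop(columns='minPerRow').apply(
    lambda r: np.sort(r.values)[n - 1], axis=1)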
I am trying to select multiple columns in a Pandas dataframe using two different approaches:
1) via the column numbers, for example columns 1-3 and columns 6 onwards,
and
2) via a list of column names, for instance:
years = list(range(2000, 2017))
months = list(range(1, 13))
years_month = ["A", "B", "C"]
for y in years:
    for m in months:
        y_m = str(y) + "-" + str(m)
        years_month.append(y_m)
Then, years_month would contain the following (truncated here after 2001; the full list runs to '2016-12'):
['A',
'B',
'C',
'2000-1',
'2000-2',
'2000-3',
'2000-4',
'2000-5',
'2000-6',
'2000-7',
'2000-8',
'2000-9',
'2000-10',
'2000-11',
'2000-12',
'2001-1',
'2001-2',
'2001-3',
'2001-4',
'2001-5',
'2001-6',
'2001-7',
'2001-8',
'2001-9',
'2001-10',
'2001-11',
'2001-12',
 ...]
That said, what is the best (or correct) way to select only the columns whose names are in the list years_month, in each of the two approaches?
I think you need numpy.r_ to concatenate the column positions, then use iloc for selecting:
print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])
and for second approach subset by list:
print (df[years_month])
Sample:
df = pd.DataFrame({'2000-1':[1,3,5],
                   '2000-2':[5,3,6],
                   '2000-3':[7,8,9],
                   '2000-4':[1,3,5],
                   '2000-5':[5,3,6],
                   '2000-6':[7,8,9],
                   '2000-7':[1,3,5],
                   '2000-8':[5,3,6],
                   '2000-9':[7,4,3],
                   'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
print (df)
   2000-1  2000-2  2000-3  2000-4  2000-5  2000-6  2000-7  2000-8  2000-9  A  B  C
0       1       5       7       1       5       7       1       5       7  1  4  7
1       3       3       8       3       3       8       3       3       4  2  5  8
2       5       6       9       5       6       9       5       6       3  3  6  9
print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])
   2000-2  2000-3  2000-7  2000-8  2000-9  A  B  C
0       5       7       1       5       7  1  4  7
1       3       8       3       3       4  2  5  8
2       6       9       5       6       3  3  6  9
You can also add ranges together (casting them to list is necessary in Python 3):
rng = list(range(1,3)) + list(range(6, len(df.columns)))
print (rng)
[1, 2, 6, 7, 8, 9, 10, 11]
print (df.iloc[:, rng])
   2000-2  2000-3  2000-7  2000-8  2000-9  A  B  C
0       5       7       1       5       7  1  4  7
1       3       8       3       3       4  2  5  8
2       6       9       5       6       3  3  6  9
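One caveat for the label-based approach: in modern pandas, selecting with a list that contains missing labels raises a KeyError. A defensive sketch that keeps only the labels actually present in df:
# Keep only the labels from years_month that exist in df.columns.
present = [c for c in years_month if c in df.columns]
print(df[present])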
I'm not sure what exactly you are asking, but in general DataFrame.loc lets you select by label and DataFrame.iloc by integer position.
For example, selecting columns 0, 1 and 4:
dataframe.iloc[:, [0, 1, 4]]
and selecting columns labelled 'A', 'B' and 'C':
dataframe.loc[:, ['A', 'B', 'C']]
I want to replicate rows in a Pandas DataFrame. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column, then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8
which uses the positions, and not the index labels.
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8
Details
In [1058]: df
Out[1058]:
  id  n   v
0  A  1  10
1  B  2  13
2  C  3   8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)
Output:
  id   v
0  A  10
1  B  13
1  B  13
2  C   8
2  C   8
2  C   8
Not the best solution, but I want to share this: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
  id   v
0  A  10
1  B  13
1  B  13
2  C   8
2  C   8
2  C   8
You can further append .reset_index(drop=True) to reset the index.
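Putting those pieces together, the full chain might look like this (a sketch of the same approach):
out = (df.reindex(df.index.repeat(df.n))
         .drop('n', axis=1)
         .reset_index(drop=True))
print(out)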