I have the following dataframe:
import pandas as pd
df = pd.DataFrame({"A":['a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l'], "M":[11,4,9,2,2,5,5,6,6]})
My goal is to remove every row whose value in column M is not equal to the value in either of its neighbouring rows.
Therefore rows 0, 1 and 2 should be removed, because their M values differ from both neighbours (11 != 4, 4 != 9 and 9 != 2). However, if 2 consecutive rows share the same value they must be kept: rows 3 and 4 must be kept because they both have value 2, and the same reasoning applies to rows 5 and 6, which have value 5.
I was able to reach my goal by using the following lines of code:
l = []
for i, row in df.iterrows():
    try:
        if df["M"].iloc[i] != df["M"].iloc[i + 1] and df["M"].iloc[i] != df["M"].iloc[i - 1]:
            l.append(i)
    except IndexError:
        pass
df = df.drop(df.index[l]).reset_index(drop=True)
Can you suggest a smarter and better way to achieve my goal, maybe by using some built-in pandas function?
Here is what the dataframe should look like:
Before:
A M
0 a 11 <----Must be removed
1 s 4 <----Must be removed
2 d 9 <----Must be removed
3 f 2
4 g 2
5 h 5
6 j 5
7 k 6
8 l 6
After
A M
0 f 2
1 g 2
2 h 5
3 j 5
4 k 6
5 l 6
Use boolean indexing with masks created by shift:
m = (df["M"].eq(df["M"].shift()) | df["M"].eq(df["M"].shift(-1)))
#alternative
#m = ~(df["M"].ne(df["M"].shift()) & df["M"].ne(df["M"].shift(-1)))
print (df[m])
A M
3 f 2
4 g 2
5 h 5
6 j 5
7 k 6
8 l 6
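To see how the mask works, here is a small decomposition (a sketch reusing df from the question): shift() aligns each value with the previous row and shift(-1) with the next one, so a row is kept whenever its M value equals either neighbour.
prev_equal = df["M"].eq(df["M"].shift())    # equal to the row above (first row compares against NaN -> False)
next_equal = df["M"].eq(df["M"].shift(-1))  # equal to the row below (last row compares against NaN -> False)
m = prev_equal | next_equal                 # keep rows matching at least one neighbour
print(df[m])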
Using diff:
df.loc[df.M.isin(df[df.M.diff()==0].M),:]
Out[140]:
A M
3 f 2
4 g 2
5 h 5
6 j 5
7 k 6
8 l 6
Note that the previous approach may fail when a value also appears in a non-consecutive position (e.g. M = 1, 1, 2, 1, 3, 4), because isin matches by value rather than by position. A position-based fix:
m = df[df.M.diff() == 0].index.values.tolist()  # indices of the second row of each equal pair
m.extend([x - 1 for x in m])                    # add the first row of each pair
df.loc[list(set(m))].sort_index()
Another nice answer, from MaxU:
df.loc[df.M.diff().eq(0) | df.M.diff(-1).eq(0)]
I want to create a pandas DataFrame from multiple lists of different lengths. Below is my Python code.
import random
import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]
lenA = len(A)
lenB = len(B)
lenC = len(C)
df = pd.DataFrame(columns=['A', 'B', 'C'])
for i, v1 in enumerate(A):
    for j, v2 in enumerate(B):
        for k, v3 in enumerate(C):
            if i < random.randint(0, lenA):
                if j < random.randint(0, lenB):
                    if k < random.randint(0, lenC):
                        df = df.append({'A': v1, 'B': v2, 'C': v3}, ignore_index=True)
print(df)
My lists are as below:
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6,7]
Each run produces a different output, which is fine, but the output does not cover all list items in every run. In one run I got the output below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
In the above output all of list 'A''s items (1, 2) are present. But list 'B' only has items (1, 2); item 3 is missing. Likewise list 'C' only has items (1, 2, 3, 5); items (4, 6, 7) are missing. My expectation is that every item of each list appears in the data frame at least once, and that each item of list 'C' appears only once. My expected sample output is below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
4 2 3 4
5 1 1 7
6 2 3 6
Guide me to get my expected output. Thanks in advance.
You can pad each list with random values drawn from itself up to the maximum length and then use DataFrame.sample:
import numpy as np
import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]
L = [A, B, C]
m = max(len(x) for x in L)
print (m)
6
a = [np.hstack((np.random.choice(x, m - len(x)), x)) for x in L]
df = pd.DataFrame(a, index=['A', 'B', 'C']).T.sample(frac=1)
print (df)
A B C
2 2 2 3
0 2 1 1
3 1 1 4
4 1 2 5
5 2 3 6
1 2 2 2
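If you also want a clean 0..n index after the shuffle, a small follow-up (not part of the original answer):
df = pd.DataFrame(a, index=['A', 'B', 'C']).T.sample(frac=1).reset_index(drop=True)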
You can use transpose to achieve the same.
EDIT: Used random to randomize the output as requested.
import pandas as pd
from random import shuffle, choice
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
shuffle(A)
shuffle(B)
shuffle(C)
data = [A,B,C]
df = pd.DataFrame(data)
df = df.transpose()
df.columns = ['A', 'B', 'C']
df.loc[:,'A'].fillna(choice(A), inplace=True)
df.loc[:,'B'].fillna(choice(B), inplace=True)
This should give the output below:
A B C
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 NaN NaN 5.0
5 NaN NaN 6.0
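Note that df.loc[:,'A'].fillna(..., inplace=True) operates on a slice that may be a copy, so depending on the pandas version the fill may not propagate back to df, which is why NaN still appears above. A safer sketch is to assign the filled column back:
df['A'] = df['A'].fillna(choice(A))  # assign back instead of relying on inplace on a slice
df['B'] = df['B'].fillna(choice(B))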
Consider a dataframe df with N columns and M rows:
>>> df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
>>> df
a b c d e
0 4 4 5 5 7
1 9 3 8 8 1
2 2 8 1 8 5
3 9 5 1 2 7
4 3 5 8 2 3
5 2 8 8 2 8
6 3 1 7 2 6
7 4 1 5 6 3
8 5 4 4 9 5
9 3 7 5 6 6
I want to randomly choose two columns and then randomly choose one particular row (this would give me two values of the same row). I can achieve this using
>>> df.sample(2, axis=1).sample(1,axis=0)
e a
1 3 5
I want to perform this K times, like below:
>>> for i in xrange(5):
... df.sample(2, axis=1).sample(1,axis=0)
...
e a
1 3 5
d b
2 1 9
e b
4 8 9
c b
0 6 5
e c
1 3 5
I want to ensure that I do not choose the same two values (by choosing the same two columns and same row) in any of the trials. How would I achieve this?
I want to then perform a bitwise XOR operation on the two chosen values in each trial as well. For example, 3 ^ 5, 1 ^ 9 , .. and count all the bit differences in the chosen values.
You can create a list of all (row index, column pair) combinations, and then take random selections from that list without replacement.
Sample Data
import pandas as pd
import numpy as np
from itertools import combinations, product
np.random.seed(123)
df = pd.DataFrame(np.random.randint(1, 10, (10, 5)), columns=list('abcde'))
#df = df.reset_index() #if index contains duplicates
Code
K = 5
choices = np.array(list(product(df.index, combinations(df.columns, 2))))
idx = choices[np.r_[np.random.choice(len(choices), K, replace=False)]]
#array([[9, ('a', 'e')],
# [2, ('a', 'e')],
# [1, ('a', 'c')],
# [3, ('b', 'e')],
# [8, ('d', 'e')]], dtype=object)
Then you can decide how exactly you want your output, but something like this is close to what you show:
pd.concat([df.loc[myid[0], list(myid[1])].reset_index().T for myid in idx])
# 0 1
#index a e
#9 4 8
#index a e
#2 1 1
#index a c
#1 7 1
#index b e
#3 2 3
#index d e
#8 5 7
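For the XOR / bit-difference part of the question, a hedged sketch that reuses the idx sample built above: XOR the two values chosen in each trial and count the differing bits with bin(...).count('1').
total_bit_diff = 0
for row_id, (c1, c2) in idx:
    v1, v2 = df.at[row_id, c1], df.at[row_id, c2]         # the two values picked in this trial
    total_bit_diff += bin(int(v1) ^ int(v2)).count('1')   # number of differing bits
print(total_bit_diff)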
How do I find the nth smallest number in each row of a DataFrame and add that value as an entry in a new column (because I would ultimately like to export the data)?
Example Data
Setup
import numpy as np
import pandas as pd

np.random.seed([3,14159])
df = pd.DataFrame(np.random.randint(10, size=(4, 5)), columns=list('ABCDE'))
A B C D E
0 4 8 1 1 9
1 2 8 1 4 2
2 8 2 8 4 9
3 4 3 4 1 5
In all of the following solutions, I assume n = 3
Solution 1
function prt below
Use np.partition to move the smallest values to the left of the partition point and the largest to the right. Then take everything to the left and find its max.
df.assign(nth=np.partition(df.values, 3, axis=1)[:, :3].max(1))
A B C D E nth
0 4 8 1 1 9 4
1 2 8 1 4 2 2
2 8 2 8 4 9 8
3 4 3 4 1 5 4
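To make the mechanics of np.partition concrete, a small sketch on row 0 of the data above: with kth=3, the three smallest values end up (in no particular order) to the left of position 3, so the max of that slice is the 3rd smallest.
row = np.array([4, 8, 1, 1, 9])   # row 0 of df
part = np.partition(row, 3)       # the 3 smallest values land in part[:3]
print(part[:3].max())             # 4, the 3rd smallest value in the row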
Solution 2
function srt below
More intuitive, but with a more costly time complexity, using np.sort:
df.assign(nth=np.sort(df.values, axis=1)[:, 2])
A B C D E nth
0 4 8 1 1 9 4
1 2 8 1 4 2 2
2 8 2 8 4 9 8
3 4 3 4 1 5 4
Solution 3
function rnk below
Using pd.DataFrame.rank
A concise version that upcasts to float:
df.assign(nth=df.where(df.rank(1, method='first').eq(3)).stack().values)
A B C D E nth
0 4 8 1 1 9 4.0
1 2 8 1 4 2 2.0
2 8 2 8 4 9 8.0
3 4 3 4 1 5 4.0
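To see how this mask is built, a short sketch of the intermediate steps (reusing df and n = 3): rank each row with method='first' so ties get distinct ranks, keep only the cell whose rank equals 3, and let stack drop the NaNs, leaving one value per row.
ranks = df.rank(1, method='first')   # per-row ranks, ties broken left to right
mask = ranks.eq(3)                   # True only at the 3rd-smallest cell of each row
nth = df.where(mask).stack()         # one surviving value per row (upcast to float)
print(nth.values)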
Solution 4
function whr below
Using np.where and pd.DataFrame.rank
i, j = np.where(df.rank(1, method='first') == 3)
df.assign(nth=df.values[i, j])
A B C D E nth
0 4 8 1 1 9 4
1 2 8 1 4 2 2
2 8 2 8 4 9 8
3 4 3 4 1 5 4
Timing
Notice that srt is quickest for small inputs and comparable to prt for a while; for a larger number of columns, the more efficient algorithm behind prt kicks in.
res.plot(loglog=True)
prt = lambda df, n: df.assign(nth=np.partition(df.values, n, axis=1)[:, :n].max(1))
srt = lambda df, n: df.assign(nth=np.sort(df.values, axis=1)[:, n - 1])
rnk = lambda df, n: df.assign(nth=df.where(df.rank(1, method='first').eq(n)).stack().values)
def whr(df, n):
    i, j = np.where(df.rank(1, method='first').values == n)
    return df.assign(nth=df.values[i, j])
from timeit import timeit

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000],
    columns='prt srt rnk whr'.split(),
    dtype=float
)

for i in res.index:
    num_rows = int(np.log(i))
    d = pd.DataFrame(np.random.rand(num_rows, i))
    for j in res.columns:
        stmt = '{}(d, 3)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)
You can do this as follows:
df.assign(nth=df.apply(lambda x: np.partition(x, nth)[nth], axis='columns'))  # note: nth is a zero-based position here
Example:
In[72]: df = pd.DataFrame(np.random.rand(3, 3), index=list('abc'), columns=[1, 2, 3])
In[73]: df
Out[73]:
1 2 3
a 0.436730 0.653242 0.843014
b 0.643496 0.854859 0.531652
c 0.831672 0.575336 0.517944
In[74]: df.assign(nth=df.apply(lambda x: np.partition(x, 1)[1], axis='columns'))
Out[74]:
1 2 3 nth
a 0.436730 0.653242 0.843014 0.653242
b 0.643496 0.854859 0.531652 0.643496
c 0.831672 0.575336 0.517944 0.575336
Here is a method that finds the nth smallest item in a list:
def find_nth_in_list(lst, n):
    return sorted(lst)[n - 1]
The usage:
lst = [10, 5, 7, 9, 8, 4, 6, 2, 1, 3]
print(find_nth_in_list(lst, 2))
Output:
2
You can give the row items as a list to this function.
EDIT
You can find rows with this function:
# Returns all rows as a list
def find_rows(df):
    rows = []
    for row in df.iterrows():
        index, data = row
        rows.append(data.tolist())
    return rows
Example usage:
rows = find_rows(df) #all rows as a list
third_smallest = find_nth_in_list(rows[2], 3)  # 3rd row, 3rd smallest item
Generate some random data:
dd=pd.DataFrame(data=np.random.rand(7,3))
Find the minimum value per row using numpy:
dd['minPerRow']=dd.apply(np.min,axis=1)
Export the results:
dd['minPerRow'].to_csv('file.csv')
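This computes the per-row minimum rather than the nth smallest; a hedged tweak for the nth smallest (here n = 3, reusing dd, with a hypothetical output filename):
n = 3
dd['nthPerRow'] = dd[[0, 1, 2]].apply(lambda r: np.sort(r.values)[n - 1], axis=1)  # nth smallest per row
dd['nthPerRow'].to_csv('file_nth.csv')  # hypothetical filename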
In a pandas dataframe I have a column that looks like:
0 M
1 E
2 L
3 M.1
4 M.2
5 M.3
6 E.1
7 E.2
8 E.3
9 E.4
10 L.1
11 L.2
12 M.1.a
13 M.1.b
14 M.1.c
15 M.2.a
16 M.3.a
17 E.1.a
18 E.1.b
19 E.1.c
20 E.2.a
21 E.3.a
22 E.3.b
23 E.4.a
I need to group all the values whose first element is E, M, or L and then, for each group, create a subgroup where the index is 1, 2, or 3, which will contain a record for each lowercase letter (a, b, c, ...).
Ideally the solution should work for any number of concatenated levels (in this case the number of levels is 3, e.g. A.1.a).
0 1 2
E 1 a
b
c
2 a
3 a
b
4 a
L 1
2
M 1 a
b
c
2 a
3 a
I tried with:
df.groupby([0,1,2]).count()
But the result is missing the L level because it doesn't have records at the last sub-level
A workaround is to add a dummy variable and then remove it ... like:
df[2][(df[0]=='L') & (df[2].isnull()) & (df[1].notnull())]='x'
df = df.replace(np.nan,' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0,1,2]).count()
which gives:
0 1 2
E 1 a
b
c
2 a
3 a
b
4 a
L 1 x
2 x
M 1 a
b
c
2 a
3 a
I then deal with the dummy entry x later in my code ...
How can I avoid this hackish way of using groupby?
Assuming the column under consideration is represented by s, we can:
Split on the "." delimiter with expand=True to produce an expanded DF.
fnc: checks whether all elements of the grouped frame are None; if so, they are replaced by a dummy "" entry via a list comprehension. A Series constructor is then called on the filtered list, and any remaining None values are removed with dropna.
Perform a groupby w.r.t. the 0 & 1 column names and apply fnc to column 2.
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)
produces:
0 1
E 1 1 a
2 b
3 c
2 1 a
3 1 a
2 b
4 1 a
L 1 0
2 0
M 1 1 a
2 b
3 c
2 1 a
3 1 a
Name: 2, dtype: object
To obtain a flattened DF, reset the indices same as the levels used to group the DF before:
split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)
produces:
0 1 2
0 E 1 a
1 E 1 b
2 E 1 c
3 E 2 a
4 E 3 a
5 E 3 b
6 E 4 a
7 L 1
8 L 2
9 M 1 a
10 M 1 b
11 M 1 c
12 M 2 a
13 M 3 a
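For reference, here is roughly what the intermediate split_str frame looks like (first rows, derived from the column in the question): splitting on '.' with expand=True yields one column per level, with missing levels left as None (or NaN, depending on the pandas version).
print(split_str.head())
#    0     1     2
# 0  M  None  None
# 1  E  None  None
# 2  L  None  None
# 3  M     1  None
# 4  M     2  None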
Maybe you can find a way with regex (note that this pattern assumes at most three levels).
import pandas as pd

df = pd.read_clipboard(header=None).iloc[:, 1]
df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
print(df2.set_index([0, 1]))
and the result is,
2
0 1
M
E
L
M 1
2
3
E 1
2
3
4
L 1
2
M 1 a
1 b
1 c
2 a
3 a
E 1 a
1 b
1 c
2 a
3 a
3 b
4 a
I want to append 2 dataframes:
data1:
a
1 a
2 b
3 c
4 d
5 e
data2:
b
1 f
2 g
3 h
4 i
5 j
output:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
Currently I am using:
all_data= data1.append(data2, ignore_index=True)
this gives me result as:
a b
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
i.e. in different columns.
How can I get them in the same column?
I also tried converting the dataframes into lists and then trying to append them, but I got the error:
TypeError: append() takes no keyword arguments
Also, is there any other function to remove duplicates from a dataframe of strings? The drop_duplicates() function does not work in my case; the data still has duplicates.
You need to change one column name so append can detect what you want to do:
data2.columns = ["a"]
or
data1.columns = ["b"]
And then, after using data2.columns = ["a"]:
all_data = data1.append(data2, ignore_index=True)
all_data
a
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
And here your column is named after data1's column, which you can rename if you want:
all_data.columns = ["Foo"]
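Note that DataFrame.append has since been deprecated (and removed in pandas 2.0); an equivalent sketch with concat, renaming the column instead of overwriting columns:
all_data = pd.concat([data1, data2.rename(columns={"b": "a"})], ignore_index=True)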
merge and concat work on keys, and in this case there are no common columns. However, why not use numpy's append and create the dataframe? (pd.np is deprecated in recent pandas; importing numpy directly is preferable.)
In [68]: pd.DataFrame(pd.np.append(data1.values, data2.values), columns=['A'])
Out[68]:
A
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
df1.columns = ['b']
Out[78]:
b
0 a
1 b
2 c
3 d
4 e
pd.concat([df1 , df2] , ignore_index=True)
Out[80]:
b
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
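On the drop_duplicates part of the question: a common reason it appears not to work on string data is stray whitespace or case differences, so the values are not truly identical. A hedged sketch that normalizes before dropping (column name 'a' assumed):
all_data['a'] = all_data['a'].str.strip()   # remove leading/trailing whitespace
all_data = all_data.drop_duplicates()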