from itertools import product
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
# c1 c2
# 0 0 0
# 1 0 1
# 2 0 2
# 3 0 3
# 4 0 4
# .. .. ..
# 85 9 4
# 86 9 5
# 87 9 7
# 88 9 8
# 89 9 9
#
# [90 rows x 2 columns]
How do I quickly find and remove the last duplicate of every symmetric pair in this data frame?
An example of a symmetric pair is that (0, 1) is considered equal to (1, 0); the latter should be removed.
The algorithm must be fast, so NumPy is recommended. Converting to Python objects is not allowed.
You can sort the values, then groupby:
import numpy as np
a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:, 0], a[:, 1]], as_index=False, sort=False).first()
Option 2: if there are many distinct (c1, c2) pairs, groupby can be slow. In that case, assign the sorted values as new columns and filter with drop_duplicates:
a = np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:, 0], two=a[:, 1])   # helper columns; the names "one" and "two" are arbitrary
   .drop_duplicates(['one', 'two'])    # keep the first occurrence of each sorted pair
   .reindex(df.columns, axis=1)        # drop the helper columns again
)
One way is to use np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
c1 c2
0 0 0
1 0 1
20 2 0
3 0 3
40 4 0
50 5 0
6 0 6
70 7 0
8 0 8
9 0 9
11 1 1
21 2 1
13 1 3
41 4 1
51 5 1
16 1 6
71 7 1
...
Using frozenset, which makes each pair order-insensitive:
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
I would do
df[~pd.DataFrame(np.sort(df.values, 1)).duplicated().values]
Using pandas crosstab and numpy.triu:
s = pd.crosstab(df.c1, df.c2)
s = s.mask((np.triu(np.ones(s.shape)).astype(bool) & s) == 0).stack().reset_index()
Here's one NumPy-based one for integers -
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = np.ravel_multi_index(b.T, b.max(0) + 1)  # one scalar id per sorted pair
    sidx = idx.argsort(kind='mergesort')           # stable sort keeps first occurrences first
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]               # True at the first of each run of equal ids
    a_out = a[np.sort(sidx[m])]                    # restore the original row order
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
For generic numbers (ints/floats, etc.), we will use a view-based one -
# https://stackoverflow.com/a/44999009/ (Divakar)
def view1D(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.
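For reference, here is a minimal sketch of the combined generic version (same logic as remove_symm_pairs above, with the view-based id swapped in):
def remove_symm_pairs_generic(df):
    # sketch: works for ints/floats, since equal rows produce equal byte views
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = view1D(b)                       # one opaque scalar per sorted pair
    sidx = idx.argsort(kind='mergesort')  # stable, so first occurrences stay first
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]
    return df.iloc[np.sort(sidx[m])]      # keeps the original index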
If this needs to be fast, and if your variables are integers, then the following trick may help: let v, w be the two columns; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back via [(x+y)/2, (x-y)/2] (which recovers each pair with its larger value first).
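A minimal sketch of that trick (my illustration, not the author's code; it assumes integer columns, and uses np.unique to do the lexicographic sort and de-duplication in one step):
def remove_symm_pairs_sum_diff(df):
    v = df.iloc[:, 0].to_numpy()
    w = df.iloc[:, 1].to_numpy()
    key = np.stack([v + w, np.abs(v - w)], axis=1)       # identical for (v, w) and (w, v)
    _, keep = np.unique(key, axis=0, return_index=True)  # first occurrence of each key
    return df.iloc[np.sort(keep)]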
I have a pandas DataFrame containing data and another DataFrame where each row can be interpreted as a filter for the data:
data_df = pd.DataFrame([{'a':i%10, 'b':i%15} for i in range(30)])
filter_df = pd.DataFrame({'a':[3,4,5], 'b0':[5,6,8], 'b1':[15,10,11]})
filter_df
a b0 b1
0 3 5 15
1 4 6 10
2 5 8 11
Would mean
pd.concat([
data_df[(data_df.a==3) & data_df.b.between(5,15)],
data_df[(data_df.a==4) & data_df.b.between(6,10)],
data_df[(data_df.a==5) & data_df.b.between(8,11)]
])
Now what I need is a way of applying all these filters to the data_df and have the resulting DataFrame. One way to do this is with apply:
res = filter_df.apply(lambda x: data_df[(data_df.a==x['a']) & data_df.b.between(x['b0'], x['b1'])], axis=1)
res = pd.concat([x for x in res])
Notice that for this to work, I have to concat a list of the results, because the result is a Series containing the return value for each row, which might be None, a pd.Series, or a pd.DataFrame.
Is there a "better" way to do this? I'm hoping for something like .reset_index(), but I haven't been able to find the right approach.
Also, if there is a more elegant alternative to apply, I'd be happy to hear it. In reality, data_df will have hundreds of thousands or even millions of rows, while filter_df should stay below 1000 rows (though usually more than 10), if that makes a difference for performance.
You can merge and query:
data_df.merge(filter_df, on='a', how='right').query('b0 <= b <= b1')
Or equivalently, merge and loc filter:
(data_df.merge(filter_df, on='a', how='right')
.loc[lambda x: x['b'].between(x['b0'], x['b1'])]
)
Output:
a b b0 b1
1 3 13 5 15
2 3 8 5 15
5 4 9 6 10
8 5 10 8 11
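If you only want the original data_df columns back, you can select them after filtering (a small follow-up; the merge otherwise keeps the helper columns b0 and b1):
(data_df.merge(filter_df, on='a', how='right')
        .query('b0 <= b <= b1')[data_df.columns])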
You can use boolean indexing:
d = filter_df.set_index('a')
# is "a" in filter_df's a?
m1 = data_df['a'].isin(filter_df['a'])
# is b ≥ the matching b0 value in filter_df?
m2 = data_df['b'].ge(data_df['a'].map(d['b0']))
# is b ≤ the matching b1 value in filter_df?
m3 = data_df['b'].le(data_df['a'].map(d['b1']))
# keep if all conditions are True
data_df[m1&m2&m3]
output:
a b
13 3 13
23 3 8
24 4 9
25 5 10
Hi, I am trying to sort an existing column in pandas by another one.
Here's the output of my pandas dataset
d_id a b c d sort_d_id
3 1 1 0 0 53
21 0 0 0 1 32
32 0 1 0 1 32
32 0 1 0 1 3
53 0 0 1 0 21
Basically, I want to sort d_id by sort_d_id.
The outcome I want is
d_id a b c d sort_d_id
53 0 0 1 0 53
32 0 1 0 1 32
32 0 1 0 1 32
3 1 1 0 0 3
21 0 0 0 1 21
Can you please tell me how to do this?
Here's how to do it:
import numpy as np
# Get the d_id and sort_d_id columns for easy handling and rearranging
sort_d_id = df["sort_d_id"].tolist()
d_id = df["d_id"].tolist()
# Also get the row indices (i.e. 0 to 4) in order.
indices = list(range(len(d_id)))
# The 1st column of arr will be indices, 2nd column is sort_d_id, 3rd column is d_id.
arr = np.array([indices, sort_d_id, d_id]).T
# Sort columns according to sort_d_id.
arr = arr[np.argsort(arr[:, 1])]
# The order is now the d_id indices mapped to the sort_id_indices
# e.g. d_id[0] will become d_id[3], d_id[1] will become d_id[4], etc.
new_order = arr[:, 0]
# But we don't want that; we want sort_d_id mapped to the d_id indices.
# So we do another reordering and get the new indices.
new_order_zipped = sorted(zip(new_order, indices))
final_order = list(map(list, zip(*new_order_zipped)))[1]
# Reorder by the final indices.
df = df.reindex(final_order)
# Replace the sort_d_id column which got messed up during reordering.
df["sort_d_id"] = sort_d_id
# Reset indices if you want
df = df.reset_index(drop=True)
The resulting df now looks how it should. It's rather convoluted, I know, but it works.
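For what it's worth, a more compact equivalent (my sketch, not part of the answer above; it assumes d_id and sort_d_id contain the same multiset of values):
import numpy as np
# argsort-of-argsort gives the rank of each sort_d_id value,
# which picks the matching positions out of d_id's sorted order
order = np.argsort(df["d_id"].to_numpy())[np.argsort(np.argsort(df["sort_d_id"].to_numpy()))]
result = (df.iloc[order]
            .assign(sort_d_id=df["sort_d_id"].to_numpy())  # keep sort_d_id in its original order
            .reset_index(drop=True))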
The only way I can figure out to get your desired output is:
df_1 = df['sort_d_id'].sort_values(ascending=False, key=lambda x: x.astype('string'))
df_2 = df.drop('sort_d_id', axis=1).sort_values('d_id', ascending=False, key=lambda x: x.astype('string')).reset_index(drop=True)
df = pd.concat([df_1, df_2], axis=1)[df.columns]
print(df)
Output:
d_id a b c d sort_d_id
0 53 0 0 1 0 53
1 32 0 1 0 1 32
2 32 0 1 0 1 32
3 3 1 1 0 0 3
4 21 0 0 0 1 21
I am trying to find the min of values across columns in a pandas DataFrame where the columns fall into separate ranges. For example, I have the DataFrame shown in the image.
I am iterating over the DataFrame for further logic and would like to get the min of the values in columns T3:T6 and T11:T14 in separate variables.
Tried print(df.iloc[2,2:,2:4].min(axis=1))
I expect 9 and 13 for Row1 when I iterate.
Create a simple DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
A B C D
0 2 0 5 1
1 9 7 5 5
2 5 5 3 0
3 0 6 3 8
4 4 4 4 0
5 8 2 1 4
6 4 1 1 8
7 6 5 2 9
8 2 4 3 0
9 4 7 1 8
use the min() function:
df.min()
result:
A 0
B 0
C 1
D 0
and if you wish to select specific columns, use loc:
df.loc[:,'B':'C'].min()
B 0
C 1
Bonus: take pandas to another level - paint the minimum:
df.style.apply(lambda x: ['background-color: red; font-size: 16px' if v == x.min() else 'font-size: 16px' for v in x], axis=0)
print(df[['T' + str(x) for x in range(3, 7)]].min(axis=1))
print(df[['T' + str(x) for x in range(11, 15)]].min(axis=1))
should print the mins across all rows for T3, T4, T5, T6 and for T11, T12, T13, T14 separately.
For test dataframe:
df = pd.DataFrame({'A': list(range(100)), 'B': list(range(10, 110)), 'C': list(range(20, 120))})
Create a function that can be applied to each row to find the minimum:
def test(row):
    print(row[['A', 'B']].min())
Then use apply to run the function on each row:
df.apply(test, axis=1)
This will print the minimum of whichever columns you put in the "test function"
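If you want the values back instead of just printing them, a small variation returns from the helper, so apply yields a Series:
def test(row):
    return row[['A', 'B']].min()

mins = df.apply(test, axis=1)  # Series of per-row minima over columns A and B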
I'm new to pandas so apologies for what I think is a trivial question, but I can't quite find the relevant function for this:
I've got a file which consists of essentially 12 different data series, with the nth element of each series grouped together; i.e.
series_A_data0
series_B_data0
series_C_data0
...
series_L_data0
series_A_data1
series_B_data1
series_C_data1
...
I can import this into pandas as a single-column DataFrame, but how can I get it into a 12-column DataFrame?
For reference, currently I'm doing:
data = pd.read_csv(file)
data.head(14)
0 17655029760
1 1529585664
2 1598763008
3 4936196096
4 2192232448
5 2119827456
6 2143997952
7 1549099008
8 1593683968
9 1361498112
10 1514512384
11 1346588672
12 17939451904
13 1544957952
Do you know that the series will always be in the same order? If so, I'd create a MultiIndex and then unstack from that. Just read in the Series like you've done. I'll work with this data frame:
In [31]: df = pd.DataFrame(np.random.randn(24))
In [32]: df
Out[32]:
0
0 -1.642765
1 1.369409
2 -0.732588
3 0.357242
4 -1.259126
5 0.851803
6 -1.582394
7 -0.508507
8 0.123032
9 0.421857
10 -0.524147
11 0.381085
12 1.286025
13 -0.983004
14 0.813764
15 -0.203370
16 -1.107230
17 1.855278
18 -2.041401
19 1.352107
20 -1.630252
21 -0.326678
22 -0.080991
23 0.438606
In [33]: import itertools as it
In [34]: series_id = it.cycle(list('abcdefghijkl'))  # cycles through the first 12 letters
In [60]: idx = pd.MultiIndex.from_tuples(list(zip(series_id, df.index.repeat(12)[:len(df)])))
We need to repeat the index so that the first observation for each Series is at index 0. Now set that as the index and unstack.
In [61]: df.index = idx
In [62]: df
Out[62]:
0
a 0 -1.642765
b 0 1.369409
c 0 -0.732588
d 0 0.357242
e 0 -1.259126
f 0 0.851803
g 0 -1.582394
h 0 -0.508507
i 0 0.123032
j 0 0.421857
k 0 -0.524147
l 0 0.381085
a 1 1.286025
b 1 -0.983004
c 1 0.813764
d 1 -0.203370
e 1 -1.107230
f 1 1.855278
g 1 -2.041401
h 1 1.352107
i 1 -1.630252
j 1 -0.326678
k 1 -0.080991
l 1 0.438606
[24 rows x 1 columns]
In [74]: df.unstack(0)[0]
Out[74]:
a b c d e f g \
0 -1.642765 1.369409 -0.732588 0.357242 -1.259126 0.851803 -1.582394
1 1.286025 -0.983004 0.813764 -0.203370 -1.107230 1.855278 -2.041401
h i j k l
0 -0.508507 0.123032 0.421857 -0.524147 0.381085
1 1.352107 -1.630252 -0.326678 -0.080991 0.438606
[2 rows x 12 columns]
The unstack(0) says to move the outer index level to the columns.
I don't know if there is a simpler method, but if you can construct a comparable series with the desired column names and index values, you can use pd.pivot:
Suppose you have 3 times the 12 values, creating a dummy example:
data = pd.Series(np.random.randn(12*3))
Now you can construct the desired columns and indices as follows:
col = pd.Series(np.tile(list('ABCDEFGHIJKL'),3))
idx = pd.Series(np.repeat(np.arange(3), 12))
And now:
In [18]: pd.pivot(index=idx, columns=col, values=data.values)
Out[18]:
A B C D E F G \
0 1.296702 0.270532 -0.645502 0.213300 -0.224421 -0.634656 -2.362567
1 -1.986403 1.006665 -1.167412 -0.697443 -1.394925 -0.365205 -1.468349
2 0.689492 -0.410681 0.378916 1.552068 0.144651 -0.419082 -0.433970
H I J K L
0 2.102229 0.538711 -0.839540 -0.066535 1.154742
1 -1.090374 -1.344588 0.515923 -0.050190 -0.163259
2 -0.235364 0.296751 0.456884 0.237697 1.089476
PS: for some reason just using data instead of data.values does not work.
You can also do it with unstack as #TomAugspurger explained:
midx = pd.MultiIndex.from_tuples(zip(idx, col))
data.index = midx
data.unstack()
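Since the layout is strictly periodic, a plain NumPy reshape is a compact alternative (my sketch, not from the answers above; it assumes the length is an exact multiple of 12 and the series order never varies):
import numpy as np
import pandas as pd
data = pd.Series(np.random.randn(12 * 3))             # stand-in for the single-column input
wide = pd.DataFrame(data.to_numpy().reshape(-1, 12),  # one row per observation number
                    columns=list('ABCDEFGHIJKL'))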
Say I have two series in pandas, series A and series B. How do I create a dataframe in which all of those values are multiplied together, i.e. with series A down the left hand side and series B along the top. Basically the same concept as this, where series A would be the yellow on the left and series B the yellow along the top, and all the values in between would be filled in by multiplication:
http://www.google.co.uk/imgres?imgurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables/times-table-12x12.gif&imgrefurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables.htm&h=533&w=720&sz=58&tbnid=9B8R_kpUloA4NM:&tbnh=90&tbnw=122&zoom=1&usg=__meqZT9kIAMJ5b8BenRzF0l-CUqY=&docid=j9BT8tUCNtg--M&sa=X&ei=bkBpUpOWOI2p0AWYnIHwBQ&ved=0CE0Q9QEwBg
Sorry, should probably have added that my two series are not the same length. I'm getting an error now that 'matrices are not aligned' so I assume that's the problem.
You can use matrix multiplication with dot, but first you have to convert the Series to DataFrames (because the dot method on a Series computes the scalar dot product):
>>> B = pd.Series(range(1, 5))
>>> A = pd.Series(range(1, 5))
>>> dfA = pd.DataFrame(A)
>>> dfB = pd.DataFrame(B)
>>> dfA.dot(dfB.T)
0 1 2 3
0 1 2 3 4
1 2 4 6 8
2 3 6 9 12
3 4 8 12 16
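Equivalently, NumPy's outer product sidesteps the reshaping and also works for series of different lengths (a small addition to the above; the labels are rebuilt from the series' own indexes):
import numpy as np
pd.DataFrame(np.outer(A, B), index=A.index, columns=B.index)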
You can create a DataFrame from multiplying two series of unequal length by broadcasting each value of the row (or column) with the other series. For example:
> row = pd.Series(np.arange(1, 6), index=np.arange(1, 6))
> col = pd.Series(np.arange(1, 4), index=np.arange(1, 4))
> row.apply(lambda r: r * col)
1 2 3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
First create a DataFrame of 1's. Then broadcast multiply along each axis in turn.
>>> s1 = pd.Series([1, 2, 3, 4, 5])
>>> s2 = pd.Series([10, 20, 30])
>>> df = pd.DataFrame(1, index=s1.index, columns=s2.index)
>>> df
0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
>>> df.multiply(s1, axis='index') * s2
0 1 2
0 10 20 30
1 20 40 60
2 30 60 90
3 40 80 120
4 50 100 150
You need to use df.multiply in order to specify that the series will line up with the row index. You can use the normal multiplication operator * with s2 because matching on columns is the default way of doing multiplication between a DataFrame and a Series.
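To see why the axis argument matters: plain df * s1 would try to line s1's index up with the columns instead, leaving NaN for the unmatched labels (a quick illustration using the same objects as above):
>>> df * s1  # aligns s1 with columns 0..2, so labels 3 and 4 become all-NaN columns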
So I think this may get you most of the way there if you have two series of different lengths. This seems like a very manual process but I cannot think of another way using pandas or NumPy functions.
>>> a = pd.Series([1, 3, 3, 5, 5])
>>> b = pd.Series([5, 10])
First convert your row values a to a DataFrame, and make copies of this Series in the form of new columns, as many as there are values in your column series b.
>>> result = pd.DataFrame(a)
>>> for i in range(len(b)):
...     result[i] = a
0 1
0 1 1
1 3 3
2 3 3
3 5 5
4 5 5
You can then broadcast your Series b over your DataFrame result:
>>> result = result.mul(b)
0 1
0 5 10
1 15 30
2 15 30
3 25 50
4 25 50
In the example I have chosen, you will end up with duplicate indexes due to the initial Series. I would recommend keeping the indexes as unique identifiers; otherwise, selecting an index that has more than one row assigned to it returns more than one row. If you must, you can relabel the rows and columns like this:
>>> result.columns = b
>>> result = result.set_index(a)
>>> result
5 10
1 5 10
3 15 30
3 15 30
5 25 50
5 25 50
Example of duplicate indexing:
>>> result.loc[3]
5 10
3 15 30
3 15 30
In order to use the DataFrame.dot method, you need to transpose one of the series:
>>> a = pd.Series([1, 2, 3, 4])
>>> b = pd.Series([10, 20, 30])
>>> a.to_frame().dot(b.to_frame().transpose())
0 1 2
0 10 20 30
1 20 40 60
2 30 60 90
3 40 80 120
Also make sure the series have the same name (or are both unnamed), since dot aligns the left frame's columns with the right frame's index.