How can I speed up this iteration? - python

I have a dataframe with over ten million rows containing two columns, 'left_index' and 'right_index'.
'left_index' is the index of a value and 'right_index' contains the indices of rows that have a possible match.
The problem is that this contains duplicate matches (e.g. (0, 1) and (1, 0)).
I want to filter this dataframe and only keep one combination of each match.
I'm using a list here for an example.
In: [(0,1), (1,0), (3,567)]
Out: [(0,1), (3, 567)]
The code below produces what I want, but it is very slow. Is there a faster way to solve this?
lst2 = []
for i in lst1:
    if i in lst2:
        lst1.remove(i)
    else:
        lst2.append((i[1], i[0]))

Using numpy to keep the first occurrence of a non-unique array:
import numpy as np
lst1 = [(1,0), (0,1), (2, 5), (3,567), (5,2)]
arr = np.array(lst1)
result = arr[np.unique(np.sort(arr, axis=1), return_index=True, axis=0)[1]]
>>> result
array([[  1,   0],
       [  2,   5],
       [  3, 567]])
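Applied to the dataframe from the question (assuming it is called df and has two integer columns named 'left_index' and 'right_index', which the question describes but does not show), a rough sketch would be:
import numpy as np
arr = df[['left_index', 'right_index']].to_numpy()
# indices of the first occurrence of each unordered pair
keep = np.unique(np.sort(arr, axis=1), return_index=True, axis=0)[1]
deduped = df.iloc[np.sort(keep)]  # np.sort(keep) restores the original row order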

I believe pandas saves you from using a loop.
import pandas as pd
df = pd.DataFrame([
[(0, 0), (0, 0), 123],
[(0, 0), (0, 1), 234],
[(1, 0), (0, 1), 345],
[(1, 1), (0, 1), 456],
], columns=['left_index', 'right_index', 'value'])
print(df)
left_index right_index value
0 (0, 0) (0, 0) 123
1 (0, 0) (0, 1) 234
2 (1, 0) (0, 1) 345
3 (1, 1) (0, 1) 456
df['left_index_set'] = df['left_index'].apply(set)
df['right_index_set'] = df['right_index'].apply(set)
I am not sure what you need after this point. If you want to filter out duplicates, do the following.
df = df[df['left_index_set'] != df['right_index_set']]
df_final1= df[['left_index', 'right_index', 'value']]
print(df_final1)
left_index right_index value
1 (0, 0) (0, 1) 234
3 (1, 1) (0, 1) 456
However, if you do not want to filter the dataframe but to modify it:
df.loc[df['left_index_set'] != df['right_index_set'], 'right_index'] = None # None, '' or what you want. It's up to you
df_final2 = df[['left_index', 'right_index', 'value']]
print(df_final2)
left_index right_index value
0 (0, 0) (0, 0) 123
1 (0, 0) None 234
2 (1, 0) (0, 1) 345
3 (1, 1) None 456

You mention the data is in a dataframe and the question is tagged pandas, so we can use numpy's vectorization to do this work for us.
First, since you did not provide a way to create the data, I generated a dataframe per your description using:
import numpy as np
import pandas

def build_dataframe():
    def rand_series():
        """Create series of 1 million random integers in range [0, 9999]."""
        return (np.random.rand(1000000) * 10000).astype('int')
    data = pandas.DataFrame({
        'left_index': rand_series(),
        'right_index': rand_series()
    })
    return data.set_index('left_index')
data = build_dataframe()
Since (0,1) is the same as (1,0) per your requirements, let's just create an index that has the values sorted for us. First create two new columns with the minimum and maximum values of the left and right indices:
data['min_index'] = np.minimum(data.index, data.right_index)
data['max_index'] = np.maximum(data.index, data.right_index)
print(data)
right_index min_index max_index
left_index
4270 438 438 4270
1277 9378 1277 9378
20 7080 20 7080
4646 6623 4646 6623
3280 4481 3280 4481
... ... ... ...
3656 2492 2492 3656
2345 210 210 2345
9241 1934 1934 9241
369 8362 369 8362
5251 6047 5251 6047
[1000000 rows x 3 columns]
Then we can reset the index to these two new columns (really we just want a multi-index, and this is one way of getting it for us).
data = data.reset_index().set_index(keys=['min_index', 'max_index'])
print(data)
left_index right_index
min_index max_index
438 4270 4270 438
1277 9378 1277 9378
20 7080 20 7080
4646 6623 4646 6623
3280 4481 3280 4481
... ... ...
2492 3656 3656 2492
210 2345 2345 210
1934 9241 9241 1934
369 8362 369 8362
5251 6047 5251 6047
[1000000 rows x 2 columns]
Then we just want the unique values of the index. This is the operation that takes the most time, but should still be significantly faster than the naive implementation using lists.
unique = data.index.unique()
print (unique)
MultiIndex([( 438, 4270),
(1277, 9378),
( 20, 7080),
(4646, 6623),
(3280, 4481),
(4410, 9367),
(1864, 7881),
( 516, 3287),
(1678, 6946),
(1253, 7890),
...
(6669, 9527),
(1095, 8866),
( 455, 7800),
(2862, 8587),
(8221, 9808),
(2492, 3656),
( 210, 2345),
(1934, 9241),
( 369, 8362),
(5251, 6047)],
names=['min_index', 'max_index'], length=990197)
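If you also want to keep one row per pair, rather than just the unique index values, one small follow-up sketch (building on the data frame above) is to drop the duplicated index entries, keeping the first occurrence of each (min_index, max_index) pair:
deduped = data[~data.index.duplicated(keep='first')]
print(deduped)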

Related

Python Dataframe categorize values

I have data coming from the field and I want to categorize it into bins of a specific width.
I want to categorize it in ranges of 100, that is: 0-100, 100-200, 200-300, and so on.
My code:
df = pd.DataFrame([112, 341, 234, 78, 154], columns=['value'])
value
0 112
1 341
2 234
3 78
4 154
Expected answer:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
My attempt:
df['value_range'] = df['value'].apply(lambda x:[a,b] if x>a and x<b for a,b in zip([0,100,200,300,400],[100,200,300,400,500]))
Current result:
SyntaxError: invalid syntax
You can use pd.cut:
df["value_range"] = pd.cut(df["value"], [0, 100, 200, 300, 400], labels=['0-100', '100-200', '200-300', '300-400'])
print(df)
Prints:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
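Note that pd.cut uses right-closed bins by default, so a value of exactly 0 (or anything above 400) would come out as NaN with the bins above. If that matters for your data, include_lowest=True covers the lower edge; this is just a small addition to the call above:
df["value_range"] = pd.cut(df["value"], [0, 100, 200, 300, 400],
                           labels=['0-100', '100-200', '200-300', '300-400'],
                           include_lowest=True)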
You can also use IntervalIndex.from_tuples. Just set the tuple values to the bin edges that fit your data and you should be good to go.
df = pd.DataFrame([112,341,234,78,154],columns=['value'])
bins = pd.IntervalIndex.from_tuples([(0, 100), (100, 200), (200, 300), (300, 400)])
df['value_range'] = pd.cut(df['value'], bins)
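pd.cut with an IntervalIndex produces Interval labels such as (100, 200]; if you want the plain '100-200' strings from the expected output, one small sketch (assuming every value falls inside the bins, since a NaN interval has no .left/.right) is:
df['value_range'] = pd.cut(df['value'], bins).apply(lambda iv: f"{iv.left}-{iv.right}")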

creating a dataframe using lists

I am trying to create a dataframe that looks like this excel sheet but I can't figure out how to do so. Here is the code I am attempting to use
import pandas as pd
ls_super = ['Supernatant', total_volume_super_ml, Total_activity_super, total_protein_super_mg, specific_activity_sup, 100, 1]
df3 = pd.DataFrame(ls_super, columns=['Sample', 'Total Volume', 'Total activity', 'Total protein',
                                      'Specific Activity', 'percent yield,', 'purification'])
Here is the error message: ValueError Traceback (most recent call last)
/tmp/ipykernel_125580/4224246098.py in <module>
20
21 # list of strings
---> 22 df3 = pd.DataFrame(ls_super, columns =['Sample','Total Volume','Total activity','Total protein','Specific Activity','percent yield,','purification'
23 ])
24
~/.local/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
709 )
710 else:
--> 711 mgr = ndarray_to_mgr(
712 data,
713 index,
~/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
322 )
323
--> 324 _check_values_indices_shape_match(values, index, columns)
325
326 if typ == "array":
~/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index, columns)
391 passed = values.shape
392 implied = (len(index), len(columns))
--> 393 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
394
395
ValueError: Shape of passed values is (7, 1), indices imply (7, 7)
Problem: the DataFrame() constructor insists on interpreting the one-dimensional python list ls_super as 1 column and 7 rows ("Shape of passed values is (7, 1)"), as opposed to 1 row with 7 columns (which would have shape=(1, 7)).
Solution: add a second set of brackets ([]) around your definition of the ls_super list. In other words, make ls_super a two-dimensional list. The DataFrame constructor then sees a single value in the first dimension, and seven values in the second dimension, producing the desired shape of (1, 7).
ls_super = [['Supernatant', 1, 2, 3, 4, 100, 1]]
df3 = pd.DataFrame(ls_super,
                   columns=['Sample', 'Total Volume', 'Total activity', 'Total protein',
                            'Specific Activity', 'percent yield', 'purification'])
import pandas as pd
first_row_list = [
    "Supernatant",
    total_volume_super_ml,
    Total_activity_super,
    total_protein_super_mg,
    specific_activity_sup,
    100,
    1
]
columns = [
    "Sample",
    "Total Volume",
    "Total Activity",
    "Total protein",
    "Specific Activity",
    "percent yield",
    "purification"
]
d = dict(zip(columns, [[f] for f in first_row_list]))
df = pd.DataFrame(d)
or
d = {'Sample': ["Supernatant"],
     'Total Volume': [total_volume_super_ml],
     'Total Activity': [Total_activity_super],
     'Total protein': [total_protein_super_mg],
     'Specific Activity': [specific_activity_sup],
     'percent yield': [100],
     'purification': [1]}
df = pd.DataFrame(d)
For a pandas DataFrame, your input can also be a list of lists.
import pandas as pd
from random import uniform
total_volume_super_ml = uniform(0,1)
Total_activity_super = uniform(0,1)
total_protein_super_mg = uniform(0,1)
specific_activity_sup = uniform(0,1)
ls_super = ['Supernatant',total_volume_super_ml, Total_activity_super, total_protein_super_mg, specific_activity_sup, 100, 1]
df3 = pd.DataFrame([ls_super], columns =['Sample','Total Volume','Total activity','Total protein','Specific Activity','percent yield','purification'])
df3

Conditional Probability in Python

I have a dataframe where I have six columns that are coded 1 for yes and 0 for no. There is also a column for year. The output I need is finding the conditional probability between each column being coded 1 according to year. I tried incorporating some suggestions from this post: Pandas - Conditional Probability of a given specific b but with no luck. Other things I came up with are inefficient. I am really struggling to find the best way to go about this.
Current dataframe:
Output I am seeking:
To get your wide-formatted data into the long format of the linked post, consider running melt and then a self-merge by year for all pairwise combinations (avoiding same keys and reverse duplicates). Then calculate as the linked post shows:
long_df = current_df.melt(
    id_vars="Year",
    var_name="Key",
    value_name="Value"
)

pairwise_df = (
    long_df.merge(
        long_df,
        on="Year",
        suffixes=["1", "2"]
    ).query("Key1 < Key2")
     .assign(
         Both_Occur=lambda x: np.where(
             (x["Value1"] == 1) & (x["Value2"] == 1),
             1,
             0
         )
     )
)

prob_df = (
    (pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].value_counts() /
     pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].count()
    ).to_frame(name="Prob")
     .reset_index()
     .query("Both_Occur == 1")
     .drop(["Both_Occur"], axis="columns")
)
To demonstrate with reproducible data:
import numpy as np
import pandas as pd

np.random.seed(112621)

random_df = pd.DataFrame({
    'At least one tree': np.random.randint(0, 2, 100),
    'At least two trees': np.random.randint(0, 2, 100),
    'Clouds': np.random.randint(0, 2, 100),
    'Grass': np.random.randint(0, 2, 100),
    'At least one mounain': np.random.randint(0, 2, 100),
    'Lake': np.random.randint(0, 2, 100),
    'Year': np.random.randint(1983, 1995, 100)
})
# ...same code as above...
prob_df
Year Key1 Key2 Prob
0 1983 At least one mounain At least one tree 0.555556
2 1983 At least one mounain At least two trees 0.555556
5 1983 At least one mounain Clouds 0.416667
6 1983 At least one mounain Grass 0.555556
8 1983 At least one mounain Lake 0.555556
.. ... ... ... ...
351 1994 At least two trees Grass 0.490000
353 1994 At least two trees Lake 0.420000
355 1994 Clouds Grass 0.280000
357 1994 Clouds Lake 0.240000
359 1994 Grass Lake 0.420000
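If a matrix-style view per year is easier to read, the long result can be reshaped with a pivot table; this is just one presentation option on top of the prob_df built above:
prob_matrix = prob_df.pivot_table(index=["Year", "Key1"], columns="Key2", values="Prob")
print(prob_matrix)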

How do I join two columns into another separate column in Pandas?

Any help would be greatly appreciated. This is probably easy, but I'm new to Python.
I want to add two columns which are Latitude and Longitude and put it into a column called Location.
For example:
First row in Latitude will have a value of 41.864073 and the first row of Longitude will have a value of -87.706819.
I would like the 'Location' column to display 41.864073, -87.706819.
please and thank you.
Setup
df = pd.DataFrame(dict(lat=range(10, 20), lon=range(100, 110)))
zip
This should be better than using apply
df.assign(location=[*zip(df.lat, df.lon)])
lat lon location
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
list variant
Though I'd still suggest tuple
df.assign(location=df[['lat', 'lon']].values.tolist())
lat lon location
0 10 100 [10, 100]
1 11 101 [11, 101]
2 12 102 [12, 102]
3 13 103 [13, 103]
4 14 104 [14, 104]
5 15 105 [15, 105]
6 16 106 [16, 106]
7 17 107 [17, 107]
8 18 108 [18, 108]
9 19 109 [19, 109]
I question the usefulness of this column, but you can generate it by applying the tuple callable over the columns.
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['lon', 'lat'])
>>> df
   lon  lat
0    1    2
1    3    4
>>> df['Location'] = df.apply(tuple, axis=1)
>>> df
   lon  lat Location
0    1    2   (1, 2)
1    3    4   (3, 4)
If there are other columns than 'lon' and 'lat' in your dataframe, use
df['Location'] = df[['lon', 'lat']].apply(tuple, axis=1)
Data from Pir
df['New']=tuple(zip(*df[['lat','lon']].values.T))
df
Out[106]:
lat lon New
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
I definitely learned something from W-B and timgeb. My idea was to just convert to strings and concatenate. I posted my answer in case you wanted the result as a string. Otherwise it looks like the answers above are the way to go.
import pandas as pd

Dic = {'Latitude': [41.864073], 'Longitude': [-87.706819]}
DF = pd.DataFrame.from_dict(Dic)
DF['Location'] = DF['Latitude'].astype(str) + ', ' + DF['Longitude'].astype(str)

compute the difference of all possible rows

Based on a selection ds of a dataframe d built with:
ds = pd.DataFrame({'x': d.x, 'y': d.y, 'a': d.a, 'b': d.b, 'c': d.c, 'n': d.n})
Having n rows, x ranges from 0 to n-1. The column n is needed since ds is a selection and the original row indices need to be kept for a later query.
How do you efficiently compute the difference between each pair of rows (e.g. a_0, a_1, etc.) for each column (a, b, c) without losing the row information (e.g. a new column with the indices of the rows that were used)?
MWE
Sample selection ds:
x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291
Desired output:
- dist: the Euclidean distance, math.hypot(x2 - x1, y2 - y1)
- da, db, dc: the per-column differences, e.g. for da: np.abs(a1 - a2)
- ns: a string with both n values of the rows used
the result would look like:
dist da db dc ns
42.61365102824963 993 340 241 146-225
293.82347069813255 8181 2132 4740 146-291
.. .. .. .. 225-291
You can use itertools.combinations() to generate the pairs:
Read data first:
import pandas as pd
from io import StringIO
import numpy as np
text = """ x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291"""
df = pd.read_csv(StringIO(text), delim_whitespace=True)
Create the index and calculate the results:
from itertools import combinations

index = np.array(list(combinations(range(df.shape[0]), 2)))
df1, df2 = [df.iloc[idx].reset_index(drop=True) for idx in index.T]

res = pd.concat([
    np.hypot(df1.x - df2.x, df1.y - df2.y),
    df1[["a", "b", "c"]] - df2[["a", "b", "c"]],
    df1.n.astype(str) + "-" + df2.n.astype(str)
], axis=1)
res.columns = ["dist", "da", "db", "dc", "ns"]
res
The output:
dist da db dc ns
0 42.613651 993 340 241 146-225
1 293.823471 8181 2132 4740 146-291
2 294.702805 7188 1792 4499 225-291
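The question defines da, db, dc with np.abs; the differences above happen to come out positive for this sample, but if you need the absolute values regardless of row order, a one-line follow-up is:
res[["da", "db", "dc"]] = res[["da", "db", "dc"]].abs()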
This approach makes good use of pandas and the underlying numpy capabilities, but the matrix manipulations are a little hard to keep track of. Note that it relies on pd.Panel, which was deprecated and removed in pandas 1.0, so it only runs on older pandas versions:
import pandas as pd, numpy as np
ds = pd.DataFrame(
    [
        [554.607085, 400.971878, 9789, 4151, 6837, 146],
        [512.231450, 405.469524, 8796, 3811, 6596, 225],
        [570.427284, 694.369140, 1608, 2019, 2097, 291]
    ],
    columns=['x', 'y', 'a', 'b', 'c', 'n']
)
def concat_str(*arrays):
    result = arrays[0]
    for arr in arrays[1:]:
        result = np.core.defchararray.add(result, arr)
    return result
# Make a panel with one item for each column, with a square data frame for
# each item, showing the differences between all row pairs.
# This creates perpendicular matrices of values based on the underlying numpy arrays;
# then numpy broadcasts them along the missing axis when calculating the differences
p = pd.Panel(
    (ds.values[np.newaxis, :, :] - ds.values[:, np.newaxis, :]).transpose(),
    items=['d' + c for c in ds.columns], major_axis=ds.index, minor_axis=ds.index
)
# calculate euclidian distance
p['dist'] = np.hypot(p['dx'], p['dy'])
# create strings showing row relationships
p['ns'] = concat_str(ds['n'].values.astype(str)[:,np.newaxis], '-', ds['n'].values.astype(str)[np.newaxis,:])
# remove unneeded items
del p['dx'], p['dy'], p['dn']
# convert to frame
diffs = p.to_frame().reindex_axis(['dist', 'da', 'db', 'dc', 'ns'], axis=1)
diffs
This gives:
dist da db dc ns
major minor
0 0 0.000000 0 0 0 146-146
1 42.613651 993 340 241 146-225
2 293.823471 8181 2132 4740 146-291
1 0 42.613651 -993 -340 -241 225-146
1 0.000000 0 0 0 225-225
2 294.702805 7188 1792 4499 225-291
2 0 293.823471 -8181 -2132 -4740 291-146
1 294.702805 -7188 -1792 -4499 291-225
2 0.000000 0 0 0 291-291
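On pandas versions without Panel, the same broadcasting idea can be expressed with a MultiIndex over the row pairs. This is only a rough sketch following the column names used above, not the original answer's code:
import numpy as np
import pandas as pd

vals = ds[['x', 'y', 'a', 'b', 'c']].values
diff = vals[:, np.newaxis, :] - vals[np.newaxis, :, :]   # shape (n, n, 5): row_i - row_j
idx = pd.MultiIndex.from_product([ds.index, ds.index], names=['major', 'minor'])
diffs = pd.DataFrame(diff.reshape(-1, 5), index=idx, columns=['dx', 'dy', 'da', 'db', 'dc'])
diffs['dist'] = np.hypot(diffs['dx'], diffs['dy'])
ns = ds['n'].astype(str).values
diffs['ns'] = [f"{a}-{b}" for a in ns for b in ns]        # row-pair labels, same order as the index
diffs = diffs[['dist', 'da', 'db', 'dc', 'ns']]
diffs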
