Fraction of values in (x, y) space - python

I have a data frame that looks like this, but with several hundred thousand rows:
df
D x y
0 y 5.887672 6.284714
1 y 9.038657 10.972742
2 n 2.820448 6.954992
3 y 5.319575 15.475197
4 n 1.647302 7.941926
5 n 5.825357 13.747091
6 n 5.937630 6.435687
7 y 7.789661 11.868023
8 n 2.669362 11.300062
9 y 1.153347 17.625158
I want to know what proportion of values ("D") in each x:y grid space is "n".
I can do it by brute force, by stepping through x and y and calculating the percentage:
zonexy = {}
for x in np.arange(0, 10, 2.5):
    dfx = df[(df['x'] >= x) & (df['x'] < x + 2.5)]
    zonexy[x] = {}
    for y in np.arange(0, 24, 6):
        dfy = dfx[(dfx['y'] >= y) & (dfx['y'] < y + 6)]
        try:
            pctn = len(dfy[dfy['D'] == 'n']) / len(dfy) * 100.0
        except ZeroDivisionError:
            pctn = 0
        zonexy[x][y] = pctn
Output:
pd.DataFrame(zonexy)
0.0 2.5 5.0 7.5
0 0 0 0 0
6 100 100 50 0
12 0 0 50 0
18 0 0 0 0
But this, and all the variations on this theme that I've tried, are very slow. It seems like there should be a much more efficient way (probably via numpy), but I'm blanking on it.

One way would be to use the 2D histogram function of numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram2d.html
Then:
1. Run it once on the data where the criterion is matched (here, where "D" is "n").
2. Run it again on all of the data.
3. Divide the first result, element by element, by the second result.
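A minimal sketch of those steps, assuming the question's column names and the same 2.5 × 6 grid (the random data here is only a stand-in for the real dataframe):

```python
import numpy as np
import pandas as pd

# stand-in data in the question's shape
rng = np.random.default_rng(0)
df = pd.DataFrame({'D': rng.choice(['y', 'n'], 100),
                   'x': rng.uniform(0, 10, 100),
                   'y': rng.uniform(0, 24, 100)})

xbins = np.arange(0, 12.5, 2.5)   # x bin edges: 0, 2.5, 5, 7.5, 10
ybins = np.arange(0, 30, 6)       # y bin edges: 0, 6, 12, 18, 24

# histogram of the matching rows, then of all rows, with identical bins
sub = df[df['D'] == 'n']
n_counts, _, _ = np.histogram2d(sub['x'], sub['y'], bins=[xbins, ybins])
all_counts, _, _ = np.histogram2d(df['x'], df['y'], bins=[xbins, ybins])

# element-wise ratio; empty cells become nan instead of raising
with np.errstate(invalid='ignore'):
    pct_n = 100.0 * n_counts / all_counts
```

Since both histograms use identical bin edges, each cell of `pct_n` is the percentage of "n" rows in that grid square, computed in one vectorized pass instead of a Python double loop.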


Is there a python library for representing conditionals of two values as a matrix/table?

We're trying to figure out a way to easily pull values from what I guess I would describe as a grid of conditional statements. We've got two variables, x and y, and depending on those values, we want to pull one of (something1, ..., another1, ... again1...). We could definitely do this using if statements, but we were wondering if there was a better way. Some caveats: we would like to be able to easily change the bounds on the x and y conditionals. The problem with a bunch of if statements is that it's not very easy to compare the values of those bounds with the values in the example table below.
Example:
So if x = 4% and y = 30%, we would get back another1. Whereas if x = 50% and y = 10%, we would get something3.
Overall two questions:
Is there a general name for this kind of problem?
Is there an easy framework or library that could do this for us without if statements?
Even though pandas is not really made for this kind of usage, function aggregation and boolean indexing allow for an elegant-ish solution to your problem. Alternatively, constraint-based programming might be an option (see python-constraint on PyPI).
Define the constraints as functions.
x_constraints = [lambda x: 0 <= x < 5,
                 lambda x: 5 <= x < 10,
                 lambda x: 10 <= x < 15,
                 lambda x: x >= 15]

y_constraints = [lambda y: 0 <= y < 20,
                 lambda y: 20 <= y < 50,
                 lambda y: y >= 50]

x = 15
y = 30
Now we want to make two dataframes: One that only holds the x-values,
and another that only holds the y-values where number of columns = number of x-constraints and number of rows = number of y-constraints.
import pandas as pd

def make_dataframe(value):
    return pd.DataFrame(data=value,
                        index=range(len(y_constraints)),
                        columns=range(len(x_constraints)))

x_df = make_dataframe(x)
y_df = make_dataframe(y)
The dataframes look like this:
>>> x_df
0 1 2 3
0 15 15 15 15
1 15 15 15 15
2 15 15 15 15
>>> y_df
0 1 2 3
0 30 30 30 30
1 30 30 30 30
2 30 30 30 30
Next, we need the dataframe label_df that holds the possible outcomes. The shape must match the dimension of x_df and y_df above. (What's cool about this is that you can store the data in a
CSV-file and directly read it into a dataframe with pd.read_csv if you wish.)
label_df = pd.DataFrame([[f"{w}{i+1}" for i in range(len(x_constraints))]
                         for w in "something another again".split()])
>>> label_df
0 1 2 3
0 something1 something2 something3 something4
1 another1 another2 another3 another4
2 again1 again2 again3 again4
Next, we want to apply the x_constraints to the columns of x_df, and the y_constraints to the rows of y_df. .aggregate takes
a dictionary that maps column or row names to functions {colname: func},
which we construct inline using dict(zip(...)). axis=1 means "apply the functions row-wise".
x_mask = x_df.aggregate(dict(zip(x_df.columns, x_constraints)))
y_mask = y_df.aggregate(dict(zip(y_df.columns, y_constraints)), axis=1)
The result is two dataframes holding boolean values, and ideally there should be exactly one column in x_mask and one row in y_mask that is all True, e.g.
>>> x_mask
0 1 2 3
0 False False False True
1 False False False True
2 False False False True
>>> y_mask
0 1 2 3
0 False False False False
1 True True True True
2 False False False False
If we combine them with bit-wise and &, we get a boolean mask with exactly
one True value.
>>> m = x_mask & y_mask
>>> m
0 1 2 3
0 False False False False
1 False False False True
2 False False False False
Use m to select the target value from label_df. The result df is all NaN except one value, which we extract with df.stack().iloc[0]:
>>> df = label_df[m]
>>> df
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN another4
2 NaN NaN NaN NaN
>>> df.stack().iloc[0]
'another4'
And that's it! It should be very easy to maintain, by just changing the list of constraints and adapting the possible outcomes in label_df.
I haven't heard of a general name for this kind of problem.
If (ha-ha) it feels conceptually closer to you, I might suggest creating two mapper functions that map the x and y values to the categories of your contingency table.
map_x = lambda x: 0 if x < 0.05 else 1 if x < 0.1 else 2
map_y = lambda y: 0 if y < 0.2 else 1 if y < 0.5 else 2
df.iloc[map_x(x), map_y(y)]
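For example, with a hypothetical outcome table laid out as x categories × y categories and the percentage thresholds from the question, this reproduces both of the question's examples:

```python
import pandas as pd

# hypothetical contingency table: rows = x categories, columns = y categories
table = pd.DataFrame([['something1', 'another1', 'again1'],
                      ['something2', 'another2', 'again2'],
                      ['something3', 'another3', 'again3']])

map_x = lambda x: 0 if x < 0.05 else 1 if x < 0.1 else 2
map_y = lambda y: 0 if y < 0.2 else 1 if y < 0.5 else 2

print(table.iloc[map_x(0.04), map_y(0.3)])   # another1
print(table.iloc[map_x(0.5), map_y(0.1)])    # something3
```

Changing the bounds then only means editing the two mapper functions, which keeps them easy to compare against the table.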
If you have just a handful of conditionals, you may define two lists with the upper bounds and use a simple linear search:
x_bounds = [0.05, 0.1, 1.0]
y_bounds = [0.2, 0.5, 1.0]

def linear(x_bounds, y_bounds, x, y):
    for i, xb in enumerate(x_bounds):
        if x <= xb:
            break
    for j, yb in enumerate(y_bounds):
        if y <= yb:
            break
    return i, j

linear(x_bounds, y_bounds, 0.4, 3.0)  # (2, 2)
If there are many conditionals, a binary search will be better:
def binary(x_bounds, y_bounds, x, y):
    def search(bounds, v):
        # classic binary search for the first upper bound >= v;
        # values beyond the last bound fall into the last cell
        lo, hi = 0, len(bounds) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if bounds[mid] < v:
                lo = mid + 1
            else:
                hi = mid
        return lo
    return search(x_bounds, x), search(y_bounds, y)

binary(x_bounds, y_bounds, 0.4, 3.0)  # (2, 2)
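For many bounds, a hand-rolled search can also be replaced by the standard library's bisect module, which implements the same "first bound >= value" lookup (the `cell` name here is just illustrative):

```python
import bisect

x_bounds = [0.05, 0.1, 1.0]
y_bounds = [0.2, 0.5, 1.0]

def cell(x_bounds, y_bounds, x, y):
    # bisect_left returns the first index whose bound is >= the value;
    # min() clamps values beyond the last bound into the last cell
    i = min(bisect.bisect_left(x_bounds, x), len(x_bounds) - 1)
    j = min(bisect.bisect_left(y_bounds, y), len(y_bounds) - 1)
    return i, j

print(cell(x_bounds, y_bounds, 0.4, 3.0))   # (2, 2)
```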

Creating matrix with for loop in python

I have a list with 4 elements. Each element is a correct score that I am pulling from a form. For example:
scoreFixed_1 = 1
scoreFixed_2 = 2
scoreFixed_3 = 3
scoreFixed_4 = 4
scoreFixed = [scoreFixed_1, scoreFixed_2, scoreFixed_3, scoreFixed_4]
Then, I need to add:
scoreFixed_1 to fixture[0][0]
scoreFixed_2 to fixture[0][1]
scoreFixed_3 to fixture[1][0]
scoreFixed_4 to fixture[1][1]
Hence, I need to create a triple for loop that outputs the following sequence so I can index to achieve the result above:
0 0 0
1 0 1
2 1 0
3 1 1
I have tried to use this to create this matrix, however I am only able to get the first column correct. Can anyone help?
for x in range(1):
    for y in range(1):
        for z in range(4):
            print(z, x, y)
which outputs:
0 0 0
1 0 0
2 0 0
3 0 0
Your logic does not generate the table; you want something like:
rownum = 0
for x in range(2):
    for y in range(2):
        print(rownum, x, y)
        rownum += 1
(Edit: The question has been changed; to accomplish the new goal, you want something like this:)
scoreIndex = 0
for x in range(2):
    for y in range(2):
        fixture[x][y] += scoreFixed[scoreIndex]
        scoreIndex += 1
After your edit, it seems like we can split the 'sequence' into:
First column, regular ascending variable ( n += 1)
Second and third column, binary counter (00, 01, 10, 11)
0 0 0
1 0 1
2 1 0
3 1 1
^ ^------- These seem like a binary counter
(00, 01, 10, 11)
^------ A regular ascending variable
( n += 1 )
Using that 'logic' we can write code that looks like this:
import itertools
scoreFixed = 0
for i in itertools.product([0, 1], repeat=2):
    print(scoreFixed, ' '.join(map(str, i)))
    scoreFixed += 1
which will output:
0 0 0
1 0 1
2 1 0
3 1 1
for x in range(4):
    z = int(bin(x)[-1])
    y = bin(x)[-2]
    y = int(y) if y.isdigit() else 0
    print(x, y, z)

Extracting data from a pandas Series as per the index of another pandas Series

There are three pandas Series:
x = pd.Series([220,340,500,600,700,900,540,60])
y = pd.Series([2,1,2,2,1])
z = pd.Series([], dtype='int64')
Each element of y denotes how many elements of x to add together and put into z.
Example: since y starts with 2, I add the first two elements of x (220 and 340) to get 560 and put it in z as its first element. Next, y has 1, which means I take 500 from x (its third element) and put it in z as its second element, and so on.
Here is what I have tried
j = 0
for i in y:
par = y[i]
z[i] = x[j:par + j].sum()
j = j+par
Groupby y's index repeated:
x.groupby(y.index.repeat(y)).sum()
0 560
1 500
2 1300
3 1440
4 60
dtype: int64
If the length mismatches, this will lead to a ValueError. In that case, a safer alternative is to groupby the cumsum, repeated, and reset the index:
x.groupby(y.cumsum().repeat(y).reset_index(drop=True)).sum()
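A quick way to see that alignment behaviour, using the question's x but a deliberately short y (one group label missing):

```python
import pandas as pd

x = pd.Series([220, 340, 500, 600, 700, 900, 540, 60])
y = pd.Series([2, 1, 2, 2])   # sums to 7, one short of len(x)

# passing a Series (rather than an Index) as the grouper aligns on index,
# so x's unmatched last value is silently dropped instead of raising
z = x.groupby(y.cumsum().repeat(y).reset_index(drop=True)).sum()
print(z.tolist())   # [560, 500, 1300, 1440]
```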
Here's my take:
df = x.to_frame(name='x').reset_index(drop=True)
df['cat'] = pd.cut(df.index+1, y.cumsum(), labels=False)
df['cat'] = df['cat'].fillna(-1).add(1)
z = df.groupby('cat').x.sum()
Out:
cat
0.0 560
1.0 500
2.0 1300
3.0 1440
4.0 60
Name: x, dtype: int64
It is an index conflict issue; just update your loop to use a range instead:
j = 0
for i in range(len(y)):
    par = y[i]
    z[i] = x[j:par + j].sum()
    j = j + par
>>> z
0 560
1 500
2 1300
3 1440
4 60

Discard points with X, Y coordinates close to each other in a DataFrame

I have the following dataframe (it is actually several hundred MB long):
X Y Size
0 10 20 5
1 11 21 2
2 9 35 1
3 8 7 7
4 9 19 2
I want to discard any X, Y point that has a Euclidean distance of less than delta=3 from any other X, Y point in the dataframe. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
X Y Size
0 10 20 5
2 9 35 1
3 8 7 7
As the question is stated, it is not clear how the desired algorithm should deal with chaining of distances.
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighborhood radius eps to delta and the min_samples parameter to 1 to allow isolated points to form their own clusters. Then you can find, in each group, the point with the maximum Size.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
df_new = df.loc[df.groupby('grp')['Size'].idxmax()]
print(df_new)
>>>
X Y Size grp
0 10 20 5 0
2 9 35 1 1
3 8 7 7 2
You can use the script below, and also try improving it:
#get all euclidean distances using sklearn;
#it will create an array of euc distances;
#then get index from df whose euclidean distance is less than 3
from sklearn.metrics.pairwise import euclidean_distances
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]
# collect all index of df that has euc dist < 3 and get the max value
# then collect all index in df NOT in euc and add the row with max size
# create a new called df_new by combining the rest in df and row with max size
from itertools import chain
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
X Y Size
2 9 35 1
3 8 7 7
0 10 20 5

COUNTIF in pandas python over multiple columns with multiple conditions

I have a dataset wherein I am trying to determine the number of risk factors per person. So I have the following data:
Person_ID Age Smoker Diabetes
001 30 Y N
002 45 N N
003 27 N Y
004 18 Y Y
005 55 Y Y
Each attribute (Age, Smoker, Diabetes) has its own condition to determine whether it is a risk factor. So if Age >= 45, it's a risk factor. Smoker and Diabetes are risk factors if they are "Y". What I would like is to add a column that adds up the number of risk factors for each person based on those conditions. So the data would look like this:
Person_ID Age Smoker Diabetes Risk_Factors
001 30 Y N 1
002 25 N N 0
003 27 N Y 1
004 18 Y Y 2
005 55 Y Y 3
I have a sample dataset that I was fooling around with in Excel, and the way I did it there was to use the COUNTIF formula like so:
=COUNTIF(B2,">45") + COUNTIF(C2,"=Y") + COUNTIF(D2,"=Y")
However, the actual dataset that I will be using is way too large for Excel, so I'm learning pandas for python. I wish I could provide examples of what I've already tried, but frankly I don't even know where to start. I looked at this question, but it doesn't really address what to do about applying it to an entire new column using different conditions from multiple columns. Any suggestions?
I would do this the following way.
For each column, create a new boolean series using the column's condition
Add those series row-wise
(Note that this is simpler if your Smoker and Diabetes columns are already boolean (True/False) instead of strings.)
It might look like this:
df = pd.DataFrame({'Age': [30, 45, 27, 18, 55],
                   'Smoker': ['Y', 'N', 'N', 'Y', 'Y'],
                   'Diabetes': ['N', 'N', 'Y', 'Y', 'Y']})
Age Diabetes Smoker
0 30 N Y
1 45 N N
2 27 Y N
3 18 Y Y
4 55 Y Y
#Step 1
risk1 = df.Age > 45
risk2 = df.Smoker == "Y"
risk3 = df.Diabetes == "Y"
risk_df = pd.concat([risk1,risk2,risk3],axis=1)
Age Smoker Diabetes
0 False True False
1 False False False
2 False False True
3 False True True
4 True True True
df['Risk_Factors'] = risk_df.sum(axis=1)
Age Diabetes Smoker Risk_Factors
0 30 N Y 1
1 45 N N 0
2 27 Y N 1
3 18 Y Y 2
4 55 Y Y 3
If you want to stick with pandas, you can use the following.
Solution
isY = lambda x:int(x=='Y')
countRiskFactors = lambda row: isY(row['Smoker']) + isY(row['Diabetes']) + int(row["Age"]>45)
df['Risk_Factors'] = df.apply(countRiskFactors,axis=1)
How it works:
isY - a stored lambda function that returns 1 if the value of a cell is 'Y', otherwise 0.
countRiskFactors - adds up the risk factors.
The final line uses the apply method with the parameter axis set to 1, which applies countRiskFactors row-wise along the DataFrame and returns a Series that is appended to the DataFrame.
Output of print(df):
Person_ID Age Smoker Diabetes Risk_Factors
0 1 30 Y N 1
1 2 45 N N 0
2 3 27 N Y 1
3 4 18 Y Y 2
4 5 55 Y Y 3
If you are starting from Excel and want to go to the next evolution, then I would recommend MS Access. It will be a lot easier than learning pandas for Python. You would just replace the COUNTIF() with:
Risk Factor: IIF(Age>45, 1, 0) + IIF(Smoker="Y", 1, 0) + IIF(Diabetes="Y", 1, 0)
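For comparison, the same IIF-style sum translates almost word for word into a vectorized pandas expression (a sketch using the question's column names), so staying in Python is also straightforward:

```python
import pandas as pd

df = pd.DataFrame({'Age': [30, 45, 27, 18, 55],
                   'Smoker': ['Y', 'N', 'N', 'Y', 'Y'],
                   'Diabetes': ['N', 'N', 'Y', 'Y', 'Y']})

# each comparison yields a boolean Series; casting to int turns True into 1
df['Risk_Factors'] = ((df['Age'] > 45).astype(int)
                      + (df['Smoker'] == 'Y').astype(int)
                      + (df['Diabetes'] == 'Y').astype(int))
print(df['Risk_Factors'].tolist())   # [1, 0, 1, 2, 3]
```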
