I have two tables, Table 1 and Table 2. Table 1 has one column and Table 2 has two columns. Below is an example of my two tables to further explain what I am trying to do.
TABLE 1      TABLE 2
A            B        C
0.015        0.000    14.0    # The BINS are 0.00-0.01 = 14.0
0.033        0.025    14.5    #              0.01-0.02 = 14.5
0.042        0.050    15.0    #              0.02-0.03 = 15.0
0.501        0.075    15.5    #              0.03-0.04 = 15.5, AND SO ON
0.505        0.100    16.0
0.520        0.125    16.5
0.350        0.150    17.0
Here we take the BINS from column B, i.e. 0.0 to 0.01, 0.01 to 0.02, and so on.
I would like to select column A of Table 1, take the first value (0.015), and find the range (BIN) in which it lies (we can see that it lies between 0.000 and 0.025). Then I would like to add a second column to Table 1 and give it the value 14.5 (the second BIN from Table 2).
I would like to repeat the same for the second value of Table 1, i.e. 0.033: we can see it lies between 0.025 and 0.050, so we give it the value 15.5 (from Table 2), and so on.
The problem is, the only way I know to iterate is using for loops:
for a in A:  # takes the values of column A in table 1
But here I don't know how to proceed further, i.e. how do I check which BIN of column B my column A value lies in, so that I can give it the corresponding value from column C?
You can iterate through a list using for i, x in enumerate(X), which gives you both the element of the list and the index of that element. But you could also use for i in range(len(X)), and since in your case you need to do a look-ahead, that is the better fit here. Maybe this will work as a solution with arbitrary bin sizes:
A2 = []
for a in A:
    for i in range(len(B)-1):
        if a < B[i+1]:
            A2.append(C[i])
            break
    else:  # We never broke out
        A2.append(C[-1])
We compare each element in A to progressively greater elements in B. If the element a is less than a given element of B, then it belongs in the previous bin (e.g. 0.015 from A is less than 0.025 in B, and thus belongs in the previous bin). A breakdown, since you asked:
A2 = []  # Make a new list
for a in A:  # Do the below once for every element in A
    for i in range(len(B)-1):
Instead of iterating directly over B, we're looping through the possible indexes, which start at 0. If you use range(10), you end up with 0...9, so to iterate over all of B you could use range(len(B)). But here we want to stop one short of the full length of B, because in the next step we're looking ahead by one.
        if a < B[i+1]:
Here we're looking one list index ahead, to see if a is less than the B element at index i+1. If it is, then we want to find the element of C that corresponds to the previous index, i.e. index i. For example, given 0.015 from list A, we look at 0.025 from B. 0.015 < 0.025, so that means 0.015 belongs in the previous bin. That's why we're looking ahead by one.
            A2.append(C[i])
            break
Grab the element of C that corresponds to index i (no longer looking ahead, since we know i is the correct bin as i+1 is too large) and toss it into A2. Then break out of the inner for loop and start again with the next element of A.
    else:  # We never broke out
        A2.append(C[-1])
This else clause executes only if we never break out of the for loop. In that case, a can only possibly be in the final bin, so we just grab the last element of C (which the [-1] index gives us).
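For reference, this linear bin search is exactly what the standard library's bisect module does with a binary search. A minimal sketch (not from the original answer, just an alternative) using the example data, assuming B is sorted ascending:

import bisect

A = [0.015, 0.033, 0.042, 0.501, 0.505, 0.520, 0.350]
B = [0.000, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150]
C = [14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.0]

# bisect_right(B, a) - 1 is the index of the bin whose lower edge is <= a;
# min() clamps values beyond the last edge into the final bin, mirroring
# the else branch above.
A2 = [C[min(bisect.bisect_right(B, a) - 1, len(C) - 1)] for a in A]

For large inputs this is O(n log m) instead of O(n*m).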
You can just multiply a by 40, convert it to int, and use that as the index into Table 2.
For example, take the first value (0.015), multiply it by 40 (giving 0.6), and convert it to int (giving 0): that is the index into Table 2 that you want.
D = list()
for a in A:
    index = int(a*40)
    try:
        corresponding_value_from_c = C[index]
    except IndexError:
        corresponding_value_from_c = C[-1]
    D.append(corresponding_value_from_c)
At the end, D will be the column containing all the values that you need.
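Note that 40 here is just 1 divided by the bin width (1 / 0.025 = 40). If the bin width ever changes, a slightly more general sketch of the same idea (bin_width is an illustrative name, not from the question) would be:

bin_width = 0.025  # spacing of column B in Table 2
index = min(int(a / bin_width), len(C) - 1)  # clamp instead of try/except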
I have the following table coming from SQL:
b_num loc max_neg_m
10 0.89 -480050
10 0.93 -144107
10 0.96 -364381
10 1 -699219
100 0 -904713
100 0.2 -772981
100 0.41 -645617
100 0.5 -563448
100 0.67 -520747
100 1 -218334
1000 0 -112367
1000 0.05 15976
With Python code, I'm looking to store a 2D array of (loc, max_neg_m) pairs for each unique b_num. My attempt:
a=dict()
a.keys(b_num)
a.values=[loc,max_neg_m]
I appreciate any help.
I'm reluctant to say much because I'm a bit confused by the three lines of code you've given. If they're meant to be an idea of the end result, then here is the approach, schematically:
Create your dictionary 'a' as an empty dictionary.
Then cycle through your b_num, testing the dictionary to see if the dictionary has the current b_num as a key.
If it does, append the associated [loc, max_neg_m] to that dictionary value.
If it does not, assign the key-value pair of the key and an empty array, then pretend the dictionary does have the key (because now it does!).
Yes, storing 2D arrays as dictionary values is absolutely possible. Do this:
dictionary = {}
# Replace ``sql_query_result`` with the appropriate value.
for b_num, loc, max_neg_m in sql_query_result:
    if b_num not in dictionary:
        dictionary[b_num] = []
    dictionary[b_num].append([loc, max_neg_m])
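As an aside, collections.defaultdict removes the membership test entirely. A sketch under the same assumption about the shape of sql_query_result:

from collections import defaultdict

# Any missing key is created with an empty list on first access.
dictionary = defaultdict(list)
for b_num, loc, max_neg_m in sql_query_result:
    dictionary[b_num].append([loc, max_neg_m])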
I need to perform the following steps on a data-frame:
Assign a starting value to the "balance" attribute of the first row.
Calculate the "balance" values for the subsequent rows based on value of the previous row using the formula for eg : (previous row balance + 1)
I have tried the following steps:
Created the data-frame:
df = pd.DataFrame(pd.date_range(start = '2019-01-01', end = '2019-12-31'),columns = ['dt_id'])
Created attribute called 'balance':
df["balance"] = 0
Tried to conditionally update the data-frame:
df["balance"] = np.where(df.index == 0, 100, df["balance"].shift(1) + 1)
Results:
From what I can observe, the value for the subsequent rows is being read before it has been updated in the original data-frame.
The desired output for the "balance" attribute:
Row 0: 100
Row 1: 101
Row 2: 102
And so on
If I understand correctly, then if you add this line of code after yours, you are done:
df["balance"].cumsum()
0 100.0
1 101.0
2 102.0
3 103.0
4 104.0
...
360 460.0
361 461.0
362 462.0
363 463.0
364 464.0
It is a cumulative sum: each element is the sum of its own value and all the values before it. Since your column holds the starting value followed by ones, the running total produces exactly what you want.
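Putting the pieces together, a minimal sketch of the whole thing (note that cumsum returns a new Series, so you have to assign it back to the column):

import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range(start='2019-01-01', end='2019-12-31'), columns=['dt_id'])
df['balance'] = 0
# Row 0 gets the starting value 100; every other row gets 0 + 1 = 1.
df['balance'] = np.where(df.index == 0, 100, df['balance'].shift(1) + 1)
# The running total turns [100, 1, 1, ...] into [100, 101, 102, ...].
df['balance'] = df['balance'].cumsum()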
The problem you have is that you want to calculate an array whose elements depend on each other: element 2 depends on element 1, element 3 depends on element 2, and so on.
Whether there is a simple solution depends on the formula you use, i.e., on whether you can vectorize it. Here is a good explanation of that topic: Is it possible to vectorize recursive calculation of a NumPy array where each element depends on the previous one?
In your case a simple loop should do it:
import numpy as np

balance = np.empty(len(df.index))
balance[0] = 100
for i in range(1, len(df.index)):
    balance[i] = balance[i-1] + 1  # or whatever formula you want to use
df["balance"] = balance
Please note that the above is the general solution. Your particular formula can be vectorized, so the column can also be generated using:
balance = 100 + np.arange(0, len(df.index))
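Applied to the DataFrame, that is a one-liner:

df["balance"] = 100 + np.arange(len(df.index))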
I have sorted a roughly 1-million-row dataframe by a certain column. I would like to assign groups to each observation based on equal sums of another column, but I'm not sure how to do this.
Example below:
import pandas as pd
value1 = [25,27,20,22,28,20]
value2 = [.34,.43,.54,.43,.5,.7]
df = pd.DataFrame({'value1':value1,'value2':value2})
df.sort_values('value1', ascending = False)
df['wanted_result'] = [1,1,1,2,2,2]
Like this example, I want to sum my column (here, value1) and assign groups so that the groups have value1 sums as close to equal as possible. Is there a built-in function for this?
Greedy Loop
Using Numba's JIT to speed it up.
from numba import njit

@njit
def partition(c, n):
    delta = c[-1] / n
    group = 1
    indices = [group]
    total = delta
    for left, right in zip(c, c[1:]):
        left_diff = total - left
        right_diff = total - right
        if right > total and abs(right_diff) > abs(left_diff):
            group += 1
            total += delta
        indices.append(group)
    return indices

df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))
   value1  value2  result
4      28    0.50       1
1      27    0.43       1
0      25    0.34       1
3      22    0.43       2
2      20    0.54       2
5      20    0.70       2
This is NOT optimal. This is a greedy heuristic. It goes through the list and finds where we step over to the next group. At that point it decides whether it's better to include the current point in the current group or the next group.
This should behave pretty well except in cases with a huge disparity in values, with the larger values coming towards the end. That is because the algorithm is greedy and only looks at what it knows at the moment, not at everything at once.
But like I said, it should be good enough.
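One caveat worth spelling out: the cumulative sum only makes sense if df really is sorted descending, and neither sort_values nor assign modifies df in place, so reassign both:

df = df.sort_values('value1', ascending=False)
df = df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))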
I think this is a kind of optimization problem (non-linear), and Pandas is definitely not a good candidate for solving it.
The basic idea of a solution can be as follows:
Definitions:
n - the number of elements,
groupNo - the number of groups to divide into.
Start by generating an initial solution, e.g. put consecutive runs of n / groupNo elements into each bin.
Define the goal function, e.g. the sum of squared differences between the sum of each group and (the sum of all elements) / groupNo.
Then perform an iteration: for each pair of elements a and b from different bins, calculate the new goal function value if these elements were swapped between bins. Select the pair that gives the greatest improvement of the goal function and perform the swap (move a from its present bin to the bin where b is, and vice versa). If no such pair can be found, we have the final result.
Maybe someone will propose a better solution, but at least this is some concept to start with.
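A rough sketch of that idea in plain NumPy (function and variable names are mine, and the O(n^2) pair scan per iteration makes this illustrative for small inputs only, not for a million rows):

import numpy as np

def balance_groups(values, group_no):
    """Assign each element a group label so that group sums are near equal."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Initial solution: consecutive chunks of about n / group_no elements.
    chunk = int(np.ceil(n / group_no))
    groups = np.repeat(np.arange(group_no), chunk)[:n]
    target = values.sum() / group_no

    def goal(g):
        # Sum of squared differences between each group sum and the target.
        sums = np.array([values[g == k].sum() for k in range(group_no)])
        return ((sums - target) ** 2).sum()

    best = goal(groups)
    improved = True
    while improved:
        improved = False
        best_pair = None
        for i in range(n):
            for j in range(i + 1, n):
                if groups[i] == groups[j]:
                    continue  # only swaps across different bins matter
                groups[i], groups[j] = groups[j], groups[i]  # trial swap
                score = goal(groups)
                groups[i], groups[j] = groups[j], groups[i]  # undo
                if score < best:
                    best, best_pair = score, (i, j)
        if best_pair is not None:  # perform the best improving swap
            i, j = best_pair
            groups[i], groups[j] = groups[j], groups[i]
            improved = True
    return groups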
I have a dataframe that looks like the table below.
I want to write a function that returns the last valid entry of each column: 60.35, 76.06, 1.53.
I can do this for a single column, but not for the entire dataframe:
DataFrame.loc[DataFrame[[('Price', 'A')]].last_valid_index()][[('Price', 'A')]][0]
Additionally, I want to take the difference of the last two entries per column, and the average of the last two entries per column. The unevenness of the dataframe is killing me. Also, I'm brand new to Python.
Price
Security A B C
Date
12/31/2016 60.5 76.0351 0.83
1/31/2017 59.5 75.7433 -0.01
2/28/2017 63.15 75.7181 0.25
3/31/2017 61.7 76.0605 1.53
4/30/2017 60.35 NaN NaN
To return the last entry of each row of a plain list of lists, you can use list indexing as follows:

def last_entries(my_list):
    result = []
    for row in my_list:
        result.append(row[-1])  # the last element of this row
    return result

You can append the entries of interest to other lists and use list indexing again to compare them.
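For the DataFrame in the question specifically, Series.last_valid_index does this per column. A sketch, where the MultiIndex frame below is my reconstruction of the pictured data:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {("Price", "A"): [60.5, 59.5, 63.15, 61.7, 60.35],
     ("Price", "B"): [76.0351, 75.7433, 75.7181, 76.0605, np.nan],
     ("Price", "C"): [0.83, -0.01, 0.25, 1.53, np.nan]},
    index=["12/31/2016", "1/31/2017", "2/28/2017", "3/31/2017", "4/30/2017"],
)

# Last valid (non-NaN) entry of every column.
last = {col: df[col].loc[df[col].last_valid_index()] for col in df.columns}

# Difference and average of the last two valid entries per column.
last_two = {col: df[col].dropna().iloc[-2:] for col in df.columns}
diff = {col: s.iloc[-1] - s.iloc[-2] for col, s in last_two.items()}
avg = {col: s.mean() for col, s in last_two.items()}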
I have a dataframe that looks like this (but longer):
OnsetTime OffsetTime OnSec OffSec RTsec TrialDur
36163 38165 36.163 38.165 0.000 2.002
39157 41152 39.157 41.152 0.605 1.995
42152 44155 42.152 44.155 0.509 2.003
45164 47153 45.164 47.153 0.503 1.989
48159 50161 48.159 50.161 0.558 2.002
I want to make a new column that, for each row, sums the values in the TrialDur column above it (but not including it). It also needs to add on 0.001 of a second per trial, since TrialDur is the trial duration and I want my new column to indicate the time when a new stimulus came on the screen. So it would look like this:
NewVar
0
2.003
3.999
6.003
7.993
9.996
The first row would be 0 since the first stimulus started at timepoint 0. The second would be right after the first trial ended (based on the TrialDur variable), at 2.003 seconds, and so on.
How do I make a variable that adds the values above it in each row?
You can use cumsum to compute the cumulative sum (adding the 0.001 to each value first), then shift that column down by 1, and finally set the first row to 0.
df['NewVar'] = (df.TrialDur + 0.001).cumsum()
df.loc[df.index[-1]+1, 'NewVar'] = 0
df['NewVar'] = df.NewVar.shift(1)
df.loc[0, 'NewVar'] = 0
Because the desired NewVar has one more row than the original DataFrame, I first add one empty row at the end; I also assume that the index is in numerical order.
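As an aside, if your pandas is 0.24 or newer, shift accepts a fill_value, which avoids the extra row entirely and keeps NewVar the same length as the frame:

df['NewVar'] = (df.TrialDur + 0.001).cumsum().shift(1, fill_value=0)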