I want to create a dataframe which has 3 columns:
cols = ('ID', 'Y_Start','X_Start')
I got this far with the help of Prune's answer:
import numpy as np
import pandas as pd

stepsminus = -0.0009009009
steps = 0.0009009009

List_Y = []  # 35 values
for i in np.arange(48.34, 48.309, stepsminus):
    List_Y.append(i)

List_X = []  # 100 values
for i in np.arange(16.0108, 16.1, steps):
    List_X.append(i)

df = pd.DataFrame(columns=cols)
df['ID'] = list(range(1, 3501))
Now I want to enter the Y_Start and X_Start values accordingly. Conceptually the data is a 35 × 100 grid: every grid row holds the same 100 X values and every grid column holds the same 35 Y values, so the values repeat from row to row and from column to column. Flattened, that gives the 3500 DataFrame rows. I wanted to solve this with 2 for-loops,
however THIS is where I am stuck. THIS is where I need some help:
df = pd.DataFrame(columns=cols)
df['ID'] = list(range(0, 3500))

y = -1
for pos_y in range(0, 35):  # 35
    x = 0
    y = y + 1
    for pos_x in range(0, 100):  # 100
        df['Y_Start'].iloc[y] = List_Y[pos_y]
        df['X_Start'].iloc[x] = List_X[pos_x]
        x = x + 1
df.head(102)
Output:
ID Y_Start X_Start
0 0 48.34 16.0108
1 1 48.339099 16.011701
2 2 48.338198 16.012602
3 3 48.337297 16.013503
4 4 48.336396 16.014404
... ... ... ...
97 97 NaN 16.098187
98 98 NaN 16.099088
99 99 NaN 16.099989
100 100 NaN NaN
101 101 NaN NaN
102 rows × 3 columns
I want something like this:
ID Y_Start X_Start
0 1 48.34 16.0108
1 2 48.34 16.011701
2 3 48.34 16.012602
3 4 48.34 16.013503
4 5 48.34 16.014404
This is much easier than you make it. You're simply counting:
df['ID'] = list(range(1, 3501))
Apply the same kind of range iteration for each of the other two columns. There may also be cases where you'll want to use NumPy's range generation to build your list.
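For example, a minimal sketch that builds both lists directly with np.arange (reusing the step values from the question):

import numpy as np

List_Y = list(np.arange(48.34, 48.309, -0.0009009009))  # 35 values
List_X = list(np.arange(16.0108, 16.1, 0.0009009009))   # 100 values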
Second part of problem, after OP update:
The long-term problem is that you're trying to apply iteration skills you haven't yet developed. Please return to your basic materials on loops and work on those until you learn to think in terms of a loop as a single control concept, rather than a series of disconnected operations.
That said, the central problem here is that, although you want 3500 rows of results from your nested loops, there is no attempt to do anything with an index that runs to 3500 values.
The auxiliary problem is that you've added "shadow" variables x and y, which do nothing except maintain the same values as your loop indices. As given, you should dump those variables and simply use pos_x and pos_y.
Now, for the actual solution. First, we'll repair the loop. For a given DF row k, you have to extract the x and y coordinates from your 2D grid. You have already done this in the opposite direction, in your original post. Use the well-traveled arithmetic to get those:
for row in range(3500):
    pos_x = row % 100
    pos_y = row // 100
    df['X_Start'].iloc[row] = List_X[pos_x]
    df['Y_Start'].iloc[row] = List_Y[pos_y]
However, I recommend that you do this with a single assignment from a constructed list of 3500 values: just what I recommended in the top part of this post. Replicating elements and replicating an entire list are techniques for you to look up, or simply derive from elementary list operations; a sketch follows.
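A minimal sketch of that single-assignment idea using NumPy's repeat and tile (assuming List_Y and List_X from the question):

import numpy as np

df['Y_Start'] = np.repeat(List_Y, 100)  # each of the 35 Y values repeated 100 times
df['X_Start'] = np.tile(List_X, 35)     # the whole list of 100 X values repeated 35 times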
Related
I have a dataframe (df) like below (there are more rows actually):
   number
0      21
1      35
2     467
3     965
4    2754
5     34r
6    5743
7     841
8    8934
9     275
I want to insert 6 rows between each pair of existing rows. For example, I want to get 6 random values within the range of the values at index 0 and index 1, and add these 6 rows between index 0 and 1.
The same goes for index 1 and 2, 2 and 3, and so forth until the end. For a single pair I have this:
np.linspace(df["number"][0], df["number"][1], 8)
Is there a function or any other method to generate the 6 additional rows between all existing rows, so that the final number of rows will be not 10 but 64 (after adding 9 × 6 = 54 rows)?
You could try the following:
from random import uniform

def rng_numbers(row):
    # row holds the interval boundaries: the original value and the next row's value
    left, right = row.iat[0], row.iat[1]
    n = left
    if pd.isna(right):  # last row: no successor, so no random numbers
        return [n]
    if right < left:
        left, right = right, left
    return [n] + [uniform(left, right) for _ in range(6)]
df["number"] = (
pd.concat([df["number"], df["number"].shift(-1)], axis=1)
.apply(rng_numbers, axis=1)
)
df = df.explode("number", ignore_index=True)
First create a dataframe with 2 columns that form the interval boundaries: the number column and the number column shifted by -1 (each row paired with the next row's value).
Then .apply the function rng_numbers to the rows of the new dataframe: rng_numbers first sorts the interval boundaries and then returns a list that starts with the respective item from column number, followed by 6 random numbers in the interval. In the last row the right boundary is NaN (due to the .shift(-1)): in this case the function returns the list without the random numbers.
Then .explode df on the new column number.
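A quick sanity check on the question's data (a sketch reusing rng_numbers from above; the non-numeric '34r' is swapped for a placeholder number here, since uniform() needs numeric boundaries):

import pandas as pd

# '34r' from the question replaced by a placeholder numeric value
df = pd.DataFrame({"number": [21, 35, 467, 965, 2754, 347, 5743, 841, 8934, 275]})
df["number"] = (
    pd.concat([df["number"], df["number"].shift(-1)], axis=1)
    .apply(rng_numbers, axis=1)
)
df = df.explode("number", ignore_index=True)
assert len(df) == 10 + 9 * 6  # 64 rows: 6 new rows in each of the 9 gaps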
You could do something similar with NumPy, which is probably faster:
import numpy as np

rng = np.random.default_rng()

limits = pd.concat([df["number"], df["number"].shift(-1)], axis=1)
left = limits.min(axis=1).values.reshape(-1, 1)
right = limits.max(axis=1).values.reshape(-1, 1)

df["number"] = (
    pd.Series(df["number"].values.reshape(len(df), 1).tolist())
    + pd.Series(rng.uniform(left, right, size=(len(df), 6)).tolist())
)
df["number"].iat[-1] = df["number"].iat[-1][:1]  # last row: drop the 6 extra numbers
df = df.explode("number", ignore_index=True)
I am doing some computing on a dataset using loops. Then, based on a random event, I compute some float numbers (this means that I don't know in advance how many floats I am going to retrieve). I want to save these numbers (results) in some kind of a list and then store them in a dataframe column. Each iteration of my loop produces a "list" of results that should be registered in its own df column, so I can compare the iterations.
example:
for y in range(1, 10):
    for x in range(1, 100):
        if x > random_number and x < y:  # random_number drawn elsewhere
            result = 2 * x
I want to save all the results in dataframe columns, one column per (x, y) combination: for example, the results for x=1, y=2 in one column, then x=2, y=2 in the next, etc. The result lists are not all of the same size, so I guess I'll use fillna.
Now I know that I can create an empty dataframe with max index and then fill it result by result, but I think there's a better way to do it!
Thanks in advance.
You want to take advantage of the efficiency that numpy and pandas give you. If you use numpy.where, you can set the value to nan when the if statement is False, and otherwise you can execute your formula:
import numpy as np
import pandas as pd
np.random.seed(0) # so you can reproduce my result, you can remove this in practice
x = list(range(10))
y = list(range(1, 11))
random_nums = 10 * np.random.random(10)
df = pd.DataFrame({'x' : x, 'y': y})
# the first argument is your if condition
df['new_col'] = np.where((df['x'] > random_nums) & (df['x'] < df['y']), 2*df['x'], np.nan)
print(df)
Here, random_nums generates an entire np.ndarray of random numbers to compare with. This gives
x y new_col
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 5 6 NaN
6 6 7 12.0
7 7 8 NaN
8 8 9 NaN
9 9 10 18.0
This is especially faster if your formula (here, 2*x) is relatively quick to compute.
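If you later need several conditions with different values rather than a single if/else, np.select generalizes this pattern (a sketch reusing the frame from above; the second condition is purely hypothetical):

conditions = [
    (df['x'] > random_nums) & (df['x'] < df['y']),
    df['x'] < 2,  # a hypothetical extra condition
]
choices = [2 * df['x'], -df['x']]  # value used where each condition first holds
df['new_col'] = np.select(conditions, choices, default=np.nan)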
I have a dataframe with the following columns: X and Y are Cartesian coordinates and Value is the value of the element at these coordinates. What I want to achieve is to select only one coordinate out of every group of n that are close to each other; let's say coordinates are close if their distance is lower than some value m. The initial DF looks like this (example):
data = {'X':[0,0,0,1,1,5,6,7,8],'Y':[0,1,4,2,6,5,6,4,8],'Value':[6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)
X Y Value
0 0 0 6
1 0 1 7
2 0 4 4
3 1 2 5
4 1 6 6
5 5 5 5
6 6 6 6
7 7 4 4
8 8 8 8
Distance is computed with the following function:
import numpy as np

def countDistance(lat1, lon1, lat2, lon2):
    # use basic knowledge about triangles - values are in meters
    # (np.sqrt so this also works on whole columns, as in recModif below)
    distance = np.sqrt(pow(lat1 - lat2, 2) + pow(lon1 - lon2, 2))
    return distance
Let's say we take m = 3; the output dataframe would then look like this:
X Y Value
1 0 1 7
4 1 6 6
8 8 8 8
What is to be done:
rows 0,1,3 are close, highest value is in row 1, continue
rows 2 and 4 (from original df) are close, keep row 4
rows 5,6,7 are close, keep row 6
the leftover row 6 is close to row 8; keep row 8, which has the higher value
So I need to go through the dataframe row by row, check the rest, select the best match and then continue. I can't think of any simple method to achieve this. It can't be a use case for drop_duplicates, since the rows are not duplicates, but looping over the whole DF will be very inefficient. One method I could think of was to loop just once: for each row find the close ones (probably by applying countDistance()), select the best-fitting row and replace the rest with its values, and at the end use drop_duplicates. The other idea was a recursive function that creates a new DF: while the original DF has rows, take the first one, find the rows close to it, append the best match to the new DF, remove the first row and all close rows from the original DF, and continue until it is empty; then call the same function on the new DF to remove possible uncaught close points.
These ideas all feel inefficient. Is there a nice and efficient pythonic way to achieve this?
For now, I have created simple code with recursion, the code works but is most likely not optimal.
def recModif(self, df):
    # columns = ['', 'X', 'Y', 'Value']
    new_df = df.copy()
    new_df = new_df[new_df['Value'] < 0]  # empty frame with the same columns, to collect results
    changed = False
    while not df.empty:  # for all the data
        df = df.reset_index(drop=True)  # reset so row 0 is always accessible
        x = df.loc[0, 'X']  # first row's x and y
        y = df.loc[0, 'Y']
        df['dist'] = self.countDistance(x, y, df['X'], df['Y'])  # add column with distances
        select = df[df['dist'] < 10]  # threshold in meters below which two elements are "close"
        if len(select.index) > 1:  # if there is more than one element close
            changed = True
            # print(select, select['Value'].idxmax())
            select = select.loc[[select['Value'].idxmax()]]  # keep the one with the highest value
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        new_df = pd.concat([new_df, select.iloc[:, :3]], ignore_index=True)  # add it to the new df
        df = df[df['dist'] >= 10]  # drop the processed elements
    if changed:
        return self.recModif(new_df)  # recurse in case of uncaught overlaps
    else:
        return new_df  # return the new df if all was OK
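One way to take most of the repeated work out of this is to compute all pairwise distances once with NumPy broadcasting, instead of calling countDistance on every recursion level (a sketch, assuming the whole frame fits in memory):

import numpy as np

xy = df[['X', 'Y']].to_numpy(dtype=float)     # shape (n, 2)
diff = xy[:, None, :] - xy[None, :, :]        # shape (n, n, 2)
dist = np.sqrt((diff ** 2).sum(axis=2))       # dist[i, j] = distance between rows i and j

m = 10  # same threshold as in recModif above
close_to_first = np.flatnonzero(dist[0] < m)  # indices of rows within m of row 0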
I have a very large data file (tens of thousands of rows and columns) formatted similarly to this.
name x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1
gene1 x y 2 3 2 1
gene2 x y 5 7 6 2
My goal for each gene is to find the mean of each set of repetitions.
At the end I would like to only have columns of mean values titled something like "00hr_bio" and delete all the individual repetitions.
My thinking right now is to use something like this:
for row in df:
    df[avg] = df.iloc[3:].rolling(window=3, axis=1).mean()
But I have no idea how to actually make this work.
The df.iloc[3:] is my way of trying to start from the 3rd column, but I am fairly certain doing it this way does not work.
I don't even know where to begin in terms of "merging" the 3 columns into only 1.
Any suggestions you have will be greatly appreciated as I obviously have no idea what I am doing.
I would first build a Series of final names indexed by the original columns:
names = pd.Series(['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
                  index=df.columns[3:])
I would then use it to take the mean of a groupby on axis 1:
tmp = df.iloc[:, 3:].groupby(names, axis=1).agg('mean')
It gives a new dataframe indexed like the original one and having the averaged columns:
gh_00hr_bio gh_06hr_bio
0 2.333333 1.0
1 6.000000 2.0
You can then horizontally concat it to the first dataframe or to its first 3 columns:
result = pd.concat([df.iloc[:, :3], tmp], axis=1)
to get:
name x y gh_00hr_bio gh_06hr_bio
0 gene1 x y 2.333333 1.0
1 gene2 x y 6.000000 2.0
You're pretty close.
df['avg'] = df.iloc[:, 2:].mean(axis=1)
will get you this:
x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1 avg
gene1 x y 2 3 2 1 2.0
gene2 x y 5 7 6 2 5.0
If you wish to get the mean from different sets of columns, you could do something like this:

for col in range(10):
    df['avg%i' % col] = df.iloc[:, 2 + col * 3:5 + col * 3].mean(axis=1)

This works if the sets all have the same number of columns (here, 3 reps per set, starting at column 2). Otherwise you'd probably want to use the names of the rep columns, depending on what your data looks like; a sketch follows.
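A minimal sketch of that name-based grouping, assuming rep columns named like gh_00hr_bio_rep1 and the gene name as index, as in the table above:

# strip the trailing '_repN' to get one group label per timepoint
prefixes = [c.rsplit('_', 1)[0] for c in df.columns[2:]]
means = df.iloc[:, 2:].groupby(prefixes, axis=1).mean()
df = pd.concat([df.iloc[:, :2], means], axis=1)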
I'm trying to avoid for loops applying a function on a per-row basis to a pandas df. I have looked at many vectorization examples but have not come across anything that works completely. Ultimately I am trying to add an additional df column containing, for each row, the sum of the point values awarded by each condition that succeeds.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points
points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)

df['points'] = points_list
This is obviously not efficient, but I am not sure how I can vectorize my code, since it needs the per-row values of each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
UPDATE2:
I updated the function as follows and have substantially decreased the run time, with a 10x speed increase over df.apply with the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc. for the remaining conditions
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
You can try it in the following way:
import numpy as np
import pandas as pd

# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10, 4)), columns=list('ABCD'))
It looks like that:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like that:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want. The condition within the brackets selects the rows where it is true, so += and -= are only applied in those rows:
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the series as numpy array if you want (optional):
point_list = points.values
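For the question's actual frame (columns A through T), here is a sketch mapping the original point_calc conditions onto this pattern, addressing columns by position to match the row[...] indexing above:

c = df.columns  # positional access, mirroring row[2], row[13], ... in the question

points = pd.Series(0, index=df.index)
points.loc[df[c[2]] >= df[c[13]]] += 1
points.loc[df[c[2]] < 0] -= 3
points.loc[df[c[4]] >= df[c[8]]] += 2
points.loc[df[c[4]] < df[c[12]]] += 1
points.loc[df[c[16]] == df[c[18]]] += 4
# ... the remaining conditions in the same style ...
df['points'] = points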
Does this solve your problem?