iterate over a single column of a numpy array, row by row - python

I want to iterate over a single column of a numpy array (interpol_values_array) to find the place where a value (mole_percentage) should be inserted.
column = interpol_values_array[:][-2]
for column in interpol_values_array:
    place = []
    place_array = np.array(place)
    place_array = np.searchsorted([interpol_values_array], mole_percentage,
                                  side="right")
place_array should give me the index at which to insert my value (mole_percentage).
Is that a workable approach? Further, is there a way using np.nditer, which might be far more elegant?
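For reference, a minimal sketch of the idea (hypothetical data; assuming the column of interest is the second-to-last one and that it is already sorted, as np.searchsorted requires):
import numpy as np

# made-up example values standing in for the real interpolation table
interpol_values_array = np.array([[10.0, 1.1],
                                  [20.0, 1.2],
                                  [30.0, 1.3],
                                  [40.0, 1.4]])
mole_percentage = 25.0

# [:, -2] selects the second-to-last column; [:][-2] would select a row instead
column = interpol_values_array[:, -2]

# insertion index for mole_percentage within that column
place = np.searchsorted(column, mole_percentage, side="right")   # -> 2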

Related

Taking the column names from the first row that has fewer than x NaNs

I have data as follows:
import pandas as pd
url_cities="https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
print(df_cities.iloc[0:20,])
The column names can be found in row 15, but I would like this row number to be determined automatically. I thought the best way would be to take the first row for which fewer than 10 of the values are NaN.
I combined this answer with this answer to do the following:
amount_Nan = df_cities.shape[1] - df_cities.count(axis=1)
# OR df_cities.isnull().sum(axis=1).tolist()
print(amount_Nan)
col_names_index = next(i for i in amount_Nan if i < 3)
print(col_names_index)
df_cities.columns = df_cities.iloc[col_names_index]
The problem is that col_names_index keeps returning 0, while it should be 15. I think it is because amount_Nan holds both row labels and values, which makes next(i for i in amount_Nan if i < 3) work differently than expected.
The thing is that I do not really understand why. Can anyone help?
IIUC, you can get the first index of a non-missing value in the second column with DataFrame.iloc together with Series.notna and Series.idxmax, set the column names from that row, and filter out the rows before it by index:
i = df_cities.iloc[:, 1].notna().idxmax()
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]
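As a side note on why the original attempt returned 0: iterating over a Series yields its values, not its row positions, so next(i for i in amount_Nan if i < 3) returns the first NaN count below 3 rather than the index of that row. A small variant that stays close to the original idea (one possible fix, not the only one) would be:
col_names_index = next(i for i, n in amount_Nan.items() if n < 3)
# or equivalently, the label of the first row where the count drops below 3
col_names_index = amount_Nan.lt(3).idxmax()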

In Python, how to ensure that the seed used by randint keeps changing when I am trying to pick a random number?

from random import randint

def claims(dataframe):
    dataframe.loc[(dataframe.severity == 1), 'claims_made'] = randint(200, 20000)
    return dataframe
Here 'severity' is an existing column and 'claims_made' is a new column. I want randint to keep picking different values to assign to the 'claims_made' column; at the moment it picks just one random value from the specified range and assigns that same value to every row that satisfies the condition.
Your code gets a single randint and applies that one value to the column you create. It's the same as if you had done
val = randint(20, 20000)
dataframe.loc[(dataframe.severity == 1), 'claims_made'] = val
Instead you could get an index of the rows you want to assign. Use it to create a Series of random integers, and when you assign that back to the DataFrame, non-indexed rows become NaN.
import pandas as pd
import numpy as np

def claims(dataframe):
    # index of the rows that satisfy the condition
    wanted_index = dataframe[dataframe.severity == 1].index
    dataframe["claims_made"] = pd.Series(
        np.random.randint(20, 20000, size=len(wanted_index)),
        index=wanted_index)
    return dataframe
df = pd.DataFrame({"severity":[1, 1, 0, 8, -1, 99, 1]})
print(claims(df))
If you want to stick with your existing approach, you could do something like this:
def claims2(df):
    # count the matching rows, then draw one random value per row
    n_rows = len(df.loc[df.severity == 1])
    vals = [randint(200, 20000) for _ in range(n_rows)]
    df.loc[(df.severity == 1), 'claims_made'] = vals
    return df
p.s. I'd recommend accessing columns via df['severity'] instead of df.severity -- you can get into trouble using the . syntax if you have a dataset with spaces etc. in the column names.
I'll give you a broad hint; coding is up to you.
Form a series (a temporary column object) of random numbers in the desired range. Assign that series to your data frame column. You can find examples of this technique in any tutorial on data frames.
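A rough sketch of that hint, reusing the severity == 1 condition from the question (the names here are illustrative, not the only way to write it):
import numpy as np
import pandas as pd

df = pd.DataFrame({"severity": [1, 1, 0, 8, -1, 99, 1]})

# form a full-length Series of random numbers, then keep it only where the condition holds
random_vals = pd.Series(np.random.randint(200, 20000, size=len(df)), index=df.index)
df['claims_made'] = random_vals.where(df['severity'] == 1)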

apply max to varying-dimension subsets of pandas dataframe

For a dataframe with an index column that contains repeated values, I'm trying to get the maximum value found in a different column, per index value, and assign it to a third column, so that for any given row we can see the maximum value found in any row with the same index.
I'm doing this over a very large data set and would like it to be vectorized if possible. For now, I can't get it to work at all.
import pandas as pd

multiindexDF = pd.DataFrame([[1, 2, 3, 3, 4, 4, 4, 4], [5, 6, 7, 10, 15, 11, 25, 89]]).transpose()
multiindexDF.columns = ['theIndex', 'theValue']
multiindexDF['maxValuePerIndex'] = 0
uniqueIndices = multiindexDF['theIndex'].unique()
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF[matchingIndices == i]['theValue'].max()
    multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
This fails, telling me I should use .loc, when I'm already using it. I'm not sure what the error means, and not sure how to fix this so that I don't have to loop through everything and can vectorize it instead.
I'm looking for this:
targetDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,10,7,15,11,25,89],[5,6,10,10,89,89,89,89]]).transpose()
targetDF
Looks like this is a good case for groupby transform: it gets the maximum value per index group and broadcasts it back onto the original index (rather than the grouped index):
multiindexDF['maxValuePerIndex'] = multiindexDF.groupby("theIndex")["theValue"].transform("max")
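For reference, on the sample frame above the group maxima are 1 -> 5, 2 -> 6, 3 -> 10 and 4 -> 89, so the transformed column comes out as (worked by hand, illustrative only):
multiindexDF['maxValuePerIndex']
# 0     5
# 1     6
# 2    10
# 3    10
# 4    89
# 5    89
# 6    89
# 7    89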
The reason you're getting the SettingWithCopyWarning is that in your .loc call you're taking a slice of a slice and setting the value there; see the two pairs of square brackets in:
multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue
So it tries to assign the value to the slice rather than the original DataFrame: you're doing a .loc and then another [] after it in a chain.
So using your original approach:
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF.loc[matchingIndices, 'theValue'].max()
    multiindexDF.loc[matchingIndices, 'maxValuePerIndex'] = maxValue
(Notice I've also changed the first .loc where you were incorrectly using the boolean index)

Efficient way of finding the minimum/maximum row index for each column in a numpy.array with nonzero value?

I have a 2D numpy array iarr coming from a single color of a picture.
I want to find the minimum/maximum row index in each column with a nonzero value. If there are no nonzero values in a column this column doesn't need to be considered.
I have a working solution, but it is very slow. My current solution is this:
from PIL import Image
import numpy

img = Image.open('nameofimage.jpg')
iarr = numpy.array(img)[:, :, 0]
nonz = numpy.nonzero(iarr)
colinds = numpy.unique(nonz[1])
minrowinds = numpy.array([numpy.min(nonz[0][nonz[1] == cind]) for cind in colinds])
Thanks to yatu's pointer, I can now answer this myself.
colinds = numpy.unique(nonz[1])
minrowinds = numpy.argmax((iarr>0),axis=0)[colinds]
For the maximum indices I had to flip the array first, as np.argmax always gives the first occurrence of the maximum value.
maxrowinds = numpy.argmax(numpy.flip((iarr > 0), 0), axis=0)[colinds]
maxrowinds = iarr.shape[0] - 1 - maxrowinds  # convert indices in the flipped array back to original row indices
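As a quick sanity check of this approach on a tiny made-up array (not the image data):
import numpy

iarr = numpy.array([[0, 0, 3],
                    [5, 0, 0],
                    [7, 0, 2]])

nonz = numpy.nonzero(iarr)
colinds = numpy.unique(nonz[1])                               # columns containing a nonzero: [0, 2]
minrowinds = numpy.argmax((iarr > 0), axis=0)[colinds]        # first nonzero row per column: [1, 0]
maxrowinds = numpy.argmax(numpy.flip((iarr > 0), 0), axis=0)[colinds]
maxrowinds = iarr.shape[0] - 1 - maxrowinds                   # last nonzero row per column: [2, 2]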

Array reclassification with numpy

I have a large (50000 x 50000) 64-bit integer NumPy array containing 10-digit numbers. There are about 250,000 unique numbers in the array.
I have a second reclassification table which maps each unique value from the first array to an integer between 1 and 100. My hope would be to reclassify the values from the first array to the corresponding values in the second.
I've tried two methods of doing this, and while they work, they are quite slow. In both methods I create a blank (zeros) array of the same dimensions.
new_array = np.zeros(old_array.shape)
First method:
for old_value, new_value in lookup_array:
    new_array[old_array == old_value] = new_value
Second method, where the lookup table is in a pandas dataframe (lookup_table) with the headings "Old" and "New":
for new_value, old_values in lookup_table.groupby("New"):
    new_array[np.in1d(old_array, old_values)] = new_value
Is there a faster way to reclassify the values?
Store the lookup table as a 250,000 element array where for each index you have the mapped value. For example, if you have something like:
lookups = [(old_value_1, new_value_1), (old_value_2, new_value_2), ...]
Then you can do:
idx, val = np.asarray(lookups).T
lookup_array = np.zeros(idx.max() + 1)
lookup_array[idx] = val
When you get that, you can get your transformed array simply as:
new_array = lookup_array[old_array]
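A toy usage example of that technique (made-up small values so it runs instantly; the real arrays are of course far larger):
import numpy as np

lookups = [(11, 1), (23, 2), (42, 3)]        # (old_value, new_value) pairs
old_array = np.array([[11, 42],
                      [23, 11]])

idx, val = np.asarray(lookups).T
lookup_array = np.zeros(idx.max() + 1)
lookup_array[idx] = val

new_array = lookup_array[old_array]
# -> [[1., 3.],
#     [2., 1.]]   (float, because np.zeros defaults to float64)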
