I would like to ask a question about the numpy array below.
I have a dataset with 50 rows and 15 columns, from which I created a numpy array.
I want to compare each row with every other row (excluding itself), and find the rows that satisfy the following condition:
there is no other row for which
-both values are smaller
-or one value is equal and the other is smaller
Money Weight
10 80
20 70
30 90
25 50
35 10
40 60
50 10
for instance, for row 1 (10, 80): there is no other row that is smaller on both columns, and whenever another row is smaller on one column, row 1 is smaller on the other. Satisfies the condition
for row 5 (35, 10): there is no other row that is smaller on both columns; it is equal on weight with row 7, but its money value is smaller. Satisfies the condition
for row 7 (50, 10): there is no other row that is smaller on both columns, but it is equal on weight with row 5 and its money value is greater. Does not satisfy the condition
The following code works perfectly for finding such rows within the table:
# for each row, count how many rows are <= it in every column;
# only the row itself should qualify, hence the "< 2"
mask = (arr <= arr[:, None]).all(2).sum(1) < 2
res = df[mask]
print(res)
But I also need to compare a row from outside the table against the table and get back True or False.
For instance:
compare([40,10], table) = False
compare([10,70], table) = True
I have tried a bunch of approaches, but could not find a proper way.
I appreciate any suggestions!
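For what it's worth, here is a minimal sketch (assuming the 7x2 example above is stored in a numpy array called table). An outside point fails exactly when some row of the table is <= it in both columns, which is the same dominance test the in-table mask applies:

import numpy as np

table = np.array([[10, 80],
                  [20, 70],
                  [30, 90],
                  [25, 50],
                  [35, 10],
                  [40, 60],
                  [50, 10]])

def compare(point, table):
    # True if no row of the table is <= the point in both columns
    # (like the in-table mask, a row identical to the point also disqualifies it)
    return not (table <= np.asarray(point)).all(axis=1).any()

print(compare([40, 10], table))   # False: row (35, 10) is <= in both columns
print(compare([10, 70], table))   # True: no row is <= in both columns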
I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
#The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line works to subtract 1 in the correct locations, however, the remaining cells become NaN. The second line of code does not work – the error is:
ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting them, subtract 1 from the entire column and assign back only rows 0 through remainder (note that .loc slicing is end-inclusive, which is why row 6 changes below):
df.loc[:remainder, 'Pop2'] = df['Pop2'] - 1
Output:
>>> df
Pop Pop2
0 728375 728374
1 733355 733354
2 695395 695394
3 734658 734657
4 732811 732810
5 789396 789395
6 727761 727760
7 751967 751967
I want to find all the combinations of a binary matrix (ones and zeros) of size 18 x 9 where each row sums to 5 and each column sums to 10.
Also, each block of rows must have a 1 in each column.
The total number of combinations of that grid size is... well, too much to iterate over:
2 ** (18 x 9) combinations = 5,846,006,549,323,611,672,814,739,330,865,132,078,623,730,171,904
There are only 9!/(5!4!) = 126 combinations that make a single row sum to 5, but with 18 rows that's still a lot: 126^18 = 64,072,225,938,746,379,480,587,511,979,135,205,376.
However, each block must have at least one 1 in each column, which limits the number of combinations.
I wonder if I can break it down into block combinations, so it's potentially 6 blocks of 9 columns... which is then only 18,014,398,509,481,984 (obviously not factoring in the work to work out the blocks first).
I figure numpy has the power to do this, but I can't work it out.
I have done a couple of examples in Excel by hand.
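Not a full answer, but a sketch of one way to make the counting tractable. Every legal row is one of the 126 five-of-nine patterns, and those patterns are symmetric across columns, so the number of ways to finish depends only on how many 1s each column still needs and how many rows are left. Memoising on the sorted tuple of those deficits collapses the search (the per-block constraint from the question is not encoded here and would need extra state; pure Python may still take a while):

from functools import lru_cache
from itertools import combinations

N_ROWS, N_COLS, ROW_SUM, COL_SUM = 18, 9, 5, 10

@lru_cache(maxsize=None)
def count(deficits, rows_left):
    # deficits: sorted tuple of how many 1s each column still needs
    if rows_left == 0:
        return 1 if all(d == 0 for d in deficits) else 0
    total = 0
    for cols in combinations(range(N_COLS), ROW_SUM):   # the 126 row patterns
        new = list(deficits)
        if all(new[c] > 0 for c in cols):               # prune overfilled columns
            for c in cols:
                new[c] -= 1
            # sorting is valid because the row patterns are column-symmetric
            total += count(tuple(sorted(new)), rows_left - 1)
    return total

print(count((COL_SUM,) * N_COLS, N_ROWS))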
I am trying to compare all rows within a group to check whether a condition is fulfilled. If the condition is not fulfilled, I set the new column to True, else False. The issue I am having is finding a neat way to compare all rows within each group. I have something that works, but it breaks down when there are many rows in a group.
for i in range(8):
    n = -i-1
    cond = (((df['age']-df['age'].shift(n))*(df['weight']-df['weight'].shift(n))) < 0) & (df['ref']==df['ref'].shift(n)) & (df['age']<7) & (df['age'].shift(n)<7)
    df['x'+str(i)] = cond.groupby(df['ref']).transform('any')

df.loc[:,'WFA'] = 0
df.loc[(df['x0']==False)&(df['x1']==False)&(df['x2']==False)&(df['x3']==False)&(df['x4']==False)&(df['x5']==False)&(df['x6']==False)&(df['x7']==False),'WFA'] = 1
To iterate through each row, I have created a loop that compares adjacent rows (using shift). Each loop represents the next adjacent row. In effect, I am able to compare all rows within a group where the number of rows within a group is 8 or less. As you can imagine, it becomes pretty cumbersome as the number of rows grows large.
Instead of creating a column for each shift period, I want to check whether any row matches the condition with any other row in its group, then set the new column 'WFA' to True or False.
If anyone is interested, I am posting the answer to my own question here (although it is very slow):
df.loc[:,'WFA'] = 0
for ref, gref in df.groupby('ref'):
    count = 0
    for r_idx, row in gref.iterrows():
        cond = ((((row['age']-gref.loc[gref['age']<7, 'age'])*(row['weight']-gref.loc[gref['age']<7, 'weight']))<0).any()) & (row['age']<7)
        if cond == False:
            count += 1
    if count == len(gref):
        df.loc[df['ref']==ref, 'WFA'] = 1
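For reference, a broadcast-based sketch of the same check (assuming the column names above) that compares every pair within a group at once; it is quadratic in group size in memory, but avoids the Python-level inner loop:

import numpy as np

def group_has_inversion(g):
    # consider only rows under age 7, as in the original condition
    sub = g[g['age'] < 7]
    a = sub['age'].to_numpy()
    w = sub['weight'].to_numpy()
    # all-pairs test: True if any pair moves in opposite
    # directions on age and weight
    d = (a[:, None] - a) * (w[:, None] - w)
    return bool((d < 0).any())

flags = df.groupby('ref').apply(group_has_inversion)
df['WFA'] = (~df['ref'].map(flags)).astype(int)   # 1 where the group has no such pair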
Let's say I have the following dataframe:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
Now I want to multiply the Shots variable by a random value (multiplier in the code) and recalculate the StG variable, which is simply Shots/Goals. The code I used is:
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Shots'] *= multiplier
    row['StG'] = float(row['Shots'])/float(row['Goals'])
Then I saved the .csv, and it was identical to the original one; after the loop I simply used print(df) to obtain:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
If I print the values row by row during the iteration I see them change, but it's as if they are never saved back into the df.
I think it is because I'm only accessing the values, not the actual dataframe.
I tried something like df.row[], but that fails: DataFrame has no row property.
Thanks for the help.
____EDIT____
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Impresions'] *= multiplier
    row['Clicks'] *= np.random.randint(1, multiplier+1)
    row['Ctr'] = float(row['Clicks'])/float(row['Impresions'])
    row['Mult'] = multiplier
    #print(row['Clicks'], row['Impresions'], row['Ctr'], row['Mult'])
The main condition is that the number of Clicks can never be higher than the number of Impressions.
Then I recalculate the Clicks/Impressions ratio in Ctr.
I am not sure multiplying the entire column is the best way to maintain the per-row condition Impressions >= Clicks, hence I went row by row.
From the pandas docs about iterrows() (pandas.DataFrame.iterrows):
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
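The same column-wise idea extends to the edited Clicks/Impresions example. A sketch, assuming the column names from the edit: drawing each row's click multiplier between 1 and that row's impression multiplier preserves Clicks <= Impresions without any row loop (Generator.integers accepts an array as the upper bound):

import numpy as np

rng = np.random.default_rng()
mult = rng.integers(1, 6, size=len(df))     # impression multipliers in 1..5
click_mult = rng.integers(1, mult + 1)      # per-row bound keeps Clicks <= Impresions
df['Impresions'] *= mult
df['Clicks'] *= click_mult
df['Ctr'] = df['Clicks'] / df['Impresions']
df['Mult'] = mult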
Define a function that returns a series:
def f(x):
    m = np.random.randint(1, 5+1)
    return pd.Series([x.Shots * m, x.Shots/x.Goals * m])
Apply the function to the data frame row-wise; it returns another data frame, which can be used to replace some columns in the existing data frame or to create new ones:
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.
I have a pandas dataframe and I can select a column I want to look at with:
column_x = data_frame[4]
If I print column_x, I get:
0 AF1000g=0.09
1 AF1000g=0.00
2 AF1000g=0.14
3 AF1000g=0.02
4 AF1000g=0.02
5 AF1000g=0.00
6 AF1000g=0.54
7 AF1000g=0.01
8 AF1000g=0.00
9 AF1000g=0.04
10 AF1000g=0.00
11 AF1000g=0.03
12 AF1000g=0.00
13 AF1000g=0.02
14 AF1000g=0.00
...
I want to count how many rows contain a value of AF1000g=0.05 or less, as well as how many contain AF1000g=0.06 or greater.
Less_than_0.05 = count number of rows with AF1000g=0.05 and less
Greater_than_0.05 = count number of rows with AF1000g=0.06 and greater
How can I count these from this column when each value is a string mixing text and numeric content?
Thank you.
Rodrigo
You can use apply to extract the numerical values, and do the counting there:
vals = column_x.apply(lambda x: float(x.split('=')[1]))
print(sum(vals <= 0.05))  # number of rows with AF1000g=0.05 and less
print(sum(vals >= 0.06))  # number of rows with AF1000g=0.06 and greater
The comment above makes a good point. Generally you should focus on parsing before analyzing.
That said, this isn't too hard. Use pd.Series.str.extract with a regex, then force to a float, then do operations on that.
floats = column_x.str.extract(r"^AF1000g=(.*)$", expand=False).astype(float)
num_less = (floats <= 0.05).sum()
num_greater = (floats > 0.05).sum()
This takes advantage of the fact that the boolean array returned by the comparison with floats can be coerced to 0s and 1s.