Applying Operation to Pandas column if other column meets criteria - python

I'm relatively new to Python and totally new to Pandas, so my apologies if this is really simple. I have a dataframe, and I want to operate over all elements in a particular column, but only if a different column with the same index meets a certain criteria.
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a
For example, if the value in float_col is greater than 5, I want to multiply the value in int_col (in the same row) by 2. I'm guessing I'm supposed to use one of the map, apply, or applymap functions, but I'm not sure which, or how.

There might be more elegant ways to do this, but once you understand how to use things like loc to get at a particular subset of your dataset, you can do it like this:
df.loc[df['float_col'] > 5, 'int_col'] = df.loc[df['float_col'] > 5, 'int_col'] * 2
You can also do it a bit more succinctly like this, since pandas is smart enough to match up the results based on the index of your dataframe and only use the relevant data from the df['int_col'] * 2 expression:
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
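For reference, here is a quick end-to-end check on the sample frame above (a minimal sketch; the frame is rebuilt from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'float_col': [0.1, 0.2, 0.2, 10.1, np.nan],
                   'int_col': [1, 2, 6, 8, -1],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Double int_col wherever float_col exceeds 5.
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
print(df)
# Only row 3 (float_col == 10.1) changes: int_col goes from 8 to 16.
# The NaN in row 4 compares as False, so that row is left alone.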

Related

Remake dataframe based on fuzzywuzzy matches

I have a dataframe; right now it has 5 rows (in the future it will have more). In the column names there are 5 values; if those 5 names are the same (their fuzz.ratio scores are close to each other), then it's fine and no changes are needed.
But there are cases where:
4 values are good (their fuzz.ratio scores are close) and 1 value is different, bad.
3 values good, 2 bad.
3 values good, 1 bad and 1 bad (two different bad values).
2 values the same, another 2 the same, and 1 different, bad.
2 values the same, and the other 1, 1, and 1 values all bad.
So I need dataframes where at least 2 rows are the same; 3 is better, 4 is good, 5 is the best.
Here is a simple example. Of course the series will have a row index; based on that it will be easier to select the needed rows.
from fuzzywuzzy import fuzz, process

fruits_4_1 = ['banana', 'bananas', 'bananos', 'banandos', 'cherry']
fruits_3_2 = ['tomato', 'tamato', 'tomatos', 'apple', 'apples']
fruits_3_1_1 = ['orange', 'orangad', 'orandges', 'ham', 'beef']
fruits_2_2_1 = ['kiwi', 'kiwiss', 'mango', 'mangas', 'grapes']
fruits_2_1_1_1 = ['kiwi', 'kiwiss', 'mango', 'apples', 'beefs']

for f in fruits_4_1:
    score_1 = process.extract(f, fruits_2_1_1_1, limit=10, scorer=fuzz.ratio)
    print(score_1)
I need to implement logic that will check the dataframe's series and determine what type it is (4+1, 3+2, etc.), and based on that create new dataframes with only the similar rows. How do I do that?
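One possible sketch of that logic (my own approach, not from the thread; the 80% similarity threshold is an arbitrary assumption): greedily cluster the values by their pairwise fuzz.ratio, then read off the cluster sizes to classify the pattern:
from fuzzywuzzy import fuzz

def group_similar(names, threshold=80):
    # Greedy clustering: a name joins the first group whose
    # representative (first member) scores >= threshold against it.
    groups = []
    for name in names:
        for group in groups:
            if fuzz.ratio(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

fruits_4_1 = ['banana', 'bananas', 'bananos', 'banandos', 'cherry']
groups = group_similar(fruits_4_1)
sizes = sorted((len(g) for g in groups), reverse=True)
print(groups)  # e.g. [['banana', 'bananas', 'bananos', 'banandos'], ['cherry']]
print(sizes)   # e.g. [4, 1] -> the "4+1" case
The largest group then gives you the row labels to keep when building the new dataframe.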

How to find the number of an element in a column of a dataframe

For example, I have a dataframe A like the one below:
   a  b  c
x  0  2  1
y  1  3  2
z  0  2  4
I want to get the number of 0s in column 'a', which should return 2 (the values at rows x and z).
Is there a simple way, or a function, to easily do this?
I've Googled for it, but there are only articles like this one:
count the frequency that a value occurs in a dataframe column
which makes a new dataframe and is more complicated than what I need.
Use sum with a boolean mask - True values are treated as 1, so the output is the count of 0 values:
out = A.a.eq(0).sum()
print (out)
2
Try value_counts from pandas (here):
A['a'].value_counts()[0]
Note the integer key: column 'a' holds integers, so indexing with the string "0" would raise a KeyError. If the values are changeable, do it with df[column_name].value_counts()[searched_value].
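Putting both answers together on the frame from the question (a minimal sketch; the frame is rebuilt from the example above):
import pandas as pd

A = pd.DataFrame({'a': [0, 1, 0], 'b': [2, 3, 2], 'c': [1, 2, 4]},
                 index=['x', 'y', 'z'])

print(A['a'].eq(0).sum())        # 2 -- boolean mask, each True counts as 1
print(A['a'].value_counts()[0])  # 2 -- value_counts lookup with an integer key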

Removing a decimal from a float, then comparing to another value

I have a dataframe that I'm working with in pandas. I have two columns that I want to determine if they're not equal. An example of the data is as follows:
         A   B    Compare
1002   3.1  31  Not Equal
1003     5   5
1004     1   3  Not Equal
I want rows like the first one (1002) to show as equal because they contain the same numbers. Both columns A and B are float64 data types.
I have tried the following:
df['column_a'].replace('.','')
And I've also attempted to find a way to multiply a number by 10 on the condition that the value is not an integer (3.1, 2.2, 1.4, etc).
I believe I could also accomplish the same desired end result by taking all values that are greater than 5 in column B and divide them by 10. I only care about values 0 through 5. The only values I'm going to see above 5 can be divided by 10.
This is what I tried doing to accomplish that but I get an error (TypeError: invalid type comparison):
df['column_b'] = np.where(df['column_b'] > 5, /10,'')
What would be the best way to make the values equal in column A and B for row 1002?
This is worth a try:
df['Compare'] = df['A'].astype(str).str.replace('.', '', regex=False).astype(float).eq(df['B'])
You were going in the right direction; since column A is float64, cast it to string first with astype(str), then use .eq(). One caveat: a whole number like 5 stringifies as '5.0', which becomes '50' after the replace, so this only works cleanly when the values in A always carry a decimal digit.
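Alternatively, a sketch following the question's own divide-by-10 idea (fixing the np.where syntax; the frame below is a reconstruction of the example, and the column names A and B are assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3.1, 5.0, 1.0], 'B': [31.0, 5.0, 3.0]},
                  index=[1002, 1003, 1004])

# Scale B down by 10 wherever it exceeds 5, per the stated assumption
# that any value above 5 is the same number with the decimal dropped.
b_scaled = np.where(df['B'] > 5, df['B'] / 10, df['B'])
df['Compare'] = np.where(df['A'].ne(b_scaled), 'Not Equal', '')
print(df)
#          A     B    Compare
# 1002   3.1  31.0
# 1003   5.0   5.0
# 1004   1.0   3.0  Not Equal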

Slice a column in a pandas dataframe and average the results

If I have a pandas dataframe such as:
timestamp  label  value  new
etc.       a          1  3.5
           b          2    5
           a          5  ...
           b          6  ...
           a          2  ...
           b          4  ...
I want the new column to be the average of the last two a's and the last two b's; so for the first row it would be the average of 5 and 2, which is 3.5. The data will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get the average of just the last two. I'm fairly new to Python and coding, so this might not even be possible.
Edit: I should also mention that this is not for a class or anything; it's just something I'm doing on my own, and it will run on a very large dataset; I'm only using this as an example. Also, I want each a and each b to have its own value for the last-two average, so the new column will have the same length as the others. For the third line, it would be the average of 2 and whatever the next a is in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
  label  value
0     a    3.5
1     b    5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)

grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    # Mean of the last two 'value' entries for this label
    subset_mean = grp['value'].tail(2).mean()
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>>> res_df
  label  tail_mean
0     a        3.5
1     b        5.0
>>> df
  label  value  tail_mean
0     a      1        3.5
1     b      2        5.0
2     a      5        3.5
3     b      6        5.0
4     a      2        3.5
5     b      4        5.0
Now you have a dataframe of just the results, if you need them, plus a column with them merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; there's probably no reason to do it the longer way shown here unless you also need to perform more operations like this inside the same loop.
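For the record, a more compact way to get the same per-row column (a sketch, equivalent under the same assumptions: groupby/apply computes one last-two mean per label, and map broadcasts it back onto the rows):
tail_means = df.groupby('label')['value'].apply(lambda s: s.tail(2).mean())
df['tail_mean'] = df['label'].map(tail_means)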

Pandas: conditionally matching rows in a smaller data frame

(update: added desired data frame)
Let me start by saying that I'm reasonably confident that I found a solution to this problem several years ago, but I have not been able to re-find that solution.
Questions that address similar problems, but don't solve my particular problem include:
Efficiently select rows that match one of several values in Pandas DataFrame
Efficiently adding calculated rows based on index values to a pandas DataFrame
Compare Python Pandas DataFrames for matching rows
The Question
Let's say I have a data frame with many columns that I am working on:
import pandas as pd

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
big
   match_1
0       11
1       12
2       51
3       52
I also have smaller data frame that, in theory, maps some conditional statement to a desired value:
# A smaller dataframe that we use to map values into the larger dataframe
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})
small
   is_even  score
0     True     10
1    False    200
The goal here would be to use a conditional statement to match each row in big to a single row in small. Assume that small is constructed such that there will always be one, and only one, match for each row in big. (If multiple rows in small happen to match, just pick the first one.)
The desired output would be something like:
desired = pd.DataFrame({'match_1': [11, 12, 51, 52], 'metric': [200, 10, 200, 10]})
desired
   match_1  metric
0       11     200
1       12      10
2       51     200
3       52      10
I'm pretty sure that the syntax would look similar to:
big['score'] = small.loc[small['is_even'] == ((big['match_1'] % 2) == 0), 'score']
This won't work because small['is_even'] is a Series of length 2, while ((big['match_1'] % 2) == 0) is a Series of length 4. What I'm looking to do is, for each row in big, find the one row in small that matches based on a conditional.
If I can get a sequence that contains the correct row in small that matches each row in big, then I could do something like:
big['score'] = small.loc[matching_rows, 'score']
The question I have is: how do I generate the sequence matching_rows?
Things that (I think) aren't quite what I want:
If the columns in big and small were to match simply on constant values, this would be a straightforward use of either big.merge() or big.groupby(); however, in my case, the mapping can be an arbitrarily complex boolean conditional, for example:
(big['val1'] > small['threshold']) & (big['val2'] == small['val2']) & (big['val3'] > small['min_val']) & (big['val3'] < small['max_val'])
Solutions that rely on isin(), any(), etc., don't work, because the conditional check can be arbitrarily complex.
I could certainly create a function to apply() to the bigger DataFrame, but again, I'm pretty sure there was a simpler solution.
The answer may come down to 'calculate some intermediate columns until you can do a simple merge' or 'just use apply()', but I could swear that there was a way to do what I've described above.
One approach is to use a merge in which left_on is not a column, but a vector of keys. It's made simpler by setting the index of small to be is_even:
>>> small.set_index('is_even', inplace=True)
>>> condition = big['match_1'] % 2 == 0
>>> pd.merge(big, small, left_on=condition, right_index=True, how='left')
   match_1  score
0       11    200
1       12     10
2       51    200
3       52     10
You can index small with True and False and do a straight label lookup on it. (The original answer used the .ix indexer, which has since been removed from pandas; Series.map performs the equivalent lookup by label.) Not sure it's all that much tidier than the intermediate column/merge:
In [127]: big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
In [128]: small = pd.DataFrame({'score': [10, 200]}, index=[True, False])
In [129]: big['score'] = (big.match_1 % 2 == 0).map(small['score'])
In [130]: big
Out[130]:
   match_1  score
0       11    200
1       12     10
2       51    200
3       52     10
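For the arbitrarily complex conditionals the question mentions, one more general sketch (my own, not from the original answers; it assumes pandas >= 1.2 for how='cross') is to cross-join big and small, filter with the full boolean condition, and keep the first match per row of big:
import pandas as pd

big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})

# Every (big row, small row) pair, with big's row label preserved.
cross = big.reset_index().merge(small, how='cross')

# Any boolean expression over the paired columns works here.
matched = cross[cross['is_even'] == (cross['match_1'] % 2 == 0)]

# First match per big row, aligned back onto big by its original index.
big['score'] = big.index.map(matched.groupby('index')['score'].first())
print(big)
#    match_1  score
# 0       11    200
# 1       12     10
# 2       51    200
# 3       52     10
This is O(len(big) x len(small)) in memory, so it only makes sense when small really is small.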
