Pandas individual item using index and column - python

I have a CSV file, test.csv. I am trying to use pandas to select items depending on whether the second value is above a certain value, e.g.
index   A    B
    0  44    1
    1  45    2
    2  46   57
    3  47  598
    4  48    5
So what I would like is: if B is larger than 50, give me the values in A as an integer that I could assign to a variable.
Edit 1:
Sorry for the poor explanation. The final purpose of this is that I want to look in table 1:
index   A    B
    0  44    1
    1  45    2
    2  46   57
    3  47  598
    4  48    5
for any values above 50 in column B, get the column A value, and then look in table 2:
index   A   B
    5  44  12
    6  45  13
    7  46  14
    8  47  15
    9  48  16
so that in the end I end up with the value in column B of table two, which I can print out as an integer and not as a Series. If this is not possible using pandas then OK, but is there a way to do it in any case?

You can use DataFrame slicing to get the values you want:
import pandas as pd
f = pd.read_csv('yourfile.csv')
f[f['B'] > 50].A
In this code, f['B'] > 50 is the condition: it returns a boolean array that is True where the condition is met and False where it is not, and the corresponding A values are then selected.
This would be the output:
2 46
3 47
Name: A, dtype: int64
Is this what you wanted?
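To finish the job from Edit 1 and end up with a plain integer rather than a Series, you can feed those filtered A values into an isin lookup on the second table. A minimal sketch, assuming the two tables live in hypothetical files table1.csv and table2.csv:

import pandas as pd

t1 = pd.read_csv('table1.csv')  # hypothetical file holding table 1
t2 = pd.read_csv('table2.csv')  # hypothetical file holding table 2

# column A values from table 1 where B is above 50
a_vals = t1.loc[t1['B'] > 50, 'A']

# rows of table 2 whose A value matched, then their B column
matches = t2.loc[t2['A'].isin(a_vals), 'B']

if len(matches) == 1:
    print(matches.item())    # .item() turns a one-element Series into an int
else:
    print(matches.tolist())  # several matches: a plain list of ints

With the sample tables this prints [14, 15], since both 46 and 47 qualify; .item() is the piece that converts a single-element Series into an ordinary Python integer.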

Related

Compare two columns in the same dataframe and find which row of the first column matches which row from the 2nd column

I've been trying to figure out how to compare two columns that share some values between them, but at different rows.
For example
col_index  col_1  col_2
        1     12     34
        2     16     42
        3     58     35
        4     99     60
        5      2     12
       12     35     99
In the above example, col_1 and col_2 match on several occasions: e.g. values '12' and '99'.
I need to be able to find which rows these matches occur at so that I can get the corresponding col_index.
What would be the best way to do that?
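(The answers below assume the table is already loaded as a dataframe named df; a minimal reconstruction for reference:)

import pandas as pd

df = pd.DataFrame({'col_index': [1, 2, 3, 4, 5, 12],
                   'col_1':     [12, 16, 58, 99, 2, 35],
                   'col_2':     [34, 42, 35, 60, 12, 99]})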
If I understand correctly, only row 2 should be removed from col_index.
You can use np.intersect1d to find the common values between the two columns and then check if these values are in your columns using isin:
import numpy as np

# values that appear in both columns
common_values = np.intersect1d(df.col_1, df.col_2)
res = df[(df.col_1.isin(common_values)) | (df.col_2.isin(common_values))]
res
   col_index  col_1  col_2
0          1     12     34   # 12
2          3     58     35   # 35
3          4     99     60   # 99
4          5      2     12   # 12
5         12     35     99   # 99
res[['col_index']]

   col_index
0          1
2          3
3          4
4          5
5         12
You could use the isin method to get a mask, and then use it to filter the matches. Finally, you get the col_index column and that's all. So, using your dataframe:
mask = df.col_1.isin(df.col_2)
print(df[mask].col_index.to_list())  # to_list is only to get a Python list from a Series.
Result: [1, 4, 12]
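Note that this mask only checks col_1 against col_2, which is why col_index values 3 and 5 from the first answer are missing here. If you want matches in either direction, a sketch of the two-sided mask:

# True where the row's col_1 value appears in col_2, or vice versa
mask = df.col_1.isin(df.col_2) | df.col_2.isin(df.col_1)
print(df[mask].col_index.to_list())  # [1, 3, 4, 5, 12]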
Simply loop over the values that are present in both columns, using the Series.isin method:
import pandas as pd

# test data:
a = 12, 16, 58, 99
b = 34, 99, 35, 12
c = 1, 2, 3, 5
d = pd.DataFrame({"col_1": a, "col_2": b, 'col_idx': c})
#    col_1  col_2  col_idx
# 0     12     34        1
# 1     16     99        2
# 2     58     35        3
# 3     99     12        5

for _, row in d.loc[d.col_1.isin(d.col_2)].iterrows():
    val = row.col_1
    idx1 = row.col_idx
    print(val, idx1, d.query("col_2==%d" % val).col_idx.values)
# 12 1 [5]
# 99 5 [2]
If your values are strings (instead of integers as in this example), change the query argument accordingly: query("col_2=='%s'" % val).
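If what you are after is the row-to-row pairing itself (which row of col_1 matches which row of col_2), a self-merge gives it directly. A sketch, using the df reconstructed above:

# pair every row whose col_1 value equals some row's col_2 value
pairs = df.merge(df, left_on='col_1', right_on='col_2',
                 suffixes=('_left', '_right'))
print(pairs[['col_index_left', 'col_1_left', 'col_index_right']])
# e.g. col_1 = 12 at col_index 1 matches col_2 = 12 at col_index 5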

I am trying to remove duplicate consecutive elements and keep the last value in a data frame using pandas

There are two columns in the data frame, and I am trying to remove the consecutive duplicate elements from column "a" and their corresponding elements from column "b", keeping only the last element of each run.
import pandas as pd

a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
df = pd.DataFrame(list(zip(a, b)), columns=['Patch', 'Reward'])
df = df.drop_duplicates(subset='Patch', keep="last")
df = df.set_index('Patch')
print(df)
When I run this I get:
       Reward
Patch
6          54
7          12
5          75
4          55
1          87
8          46
9          98
however, what I want is:
Patch  Reward
    5      45
    6      54
    7      12
    5      75
    4      55
    1      87
    8      46
    9      98
PS: I don't want duplicate elements that reappear later in the series, after other elements, to be removed; remove only consecutive duplicates, keeping the last one of each consecutive run.
I also don't want the result to be sorted; the rows should appear in the same sequence as in the list.
You can create a new column assigning an id to each group of consecutive elements, and then do a groupby operation followed by a last aggregation.
import pandas as pd

a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
df = pd.DataFrame(list(zip(a, b)), columns=['Patch', 'Reward'])
# a new id starts wherever Patch differs from the previous row
df["group_id"] = (df.Patch != df.Patch.shift()).cumsum()
df = df.groupby("group_id").last()
Output:
Patch  Reward
    5      45
    6      54
    7      12
    5      75
    4      55
    1      87
    8      46
    9      98
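A boolean-mask variant that skips the helper column entirely: a row is the last of its run exactly when the next row's Patch differs, which shift(-1) can test. A minimal sketch on the same data:

import pandas as pd

a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
df = pd.DataFrame({'Patch': a, 'Reward': b})

# compare each Patch with the row below it; keep rows where they differ
last_of_run = df[df.Patch != df.Patch.shift(-1)].reset_index(drop=True)
print(last_of_run)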

Pandas: While adding new rows, it's replacing my existing dataframe values? [duplicate]

This question already has answers here:
Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
(4 answers)
Closed 2 years ago.
import pandas as pd
data = {'term':[2, 7,10,11,13],'pay':[22,30,50,60,70]}
df = pd.DataFrame(data)
   pay  term
0   22     2
1   30     7
2   50    10
3   60    11
4   70    13
df.loc[2] = [49,9]
print(df)
   pay  term
0   22     2
1   30     7
2   49     9
3   60    11
4   70    13
Expected output:
   pay  term
0   22     2
1   30     7
2   49     9
3   50    10
4   60    11
5   70    13
If we run the above code, it replaces the values at index 2. I want to add a new row with the desired values to my existing dataframe without replacing the existing values. Please suggest.
You cannot insert a new row directly by assigning values to df.loc[2], as that will overwrite the existing values. But you can slice the dataframe into two parts and then concat the two parts along with the third row to insert.
Try this:
new_df = pd.DataFrame({"pay": 49, "term": 9}, index=[2])
df = pd.concat([df.loc[:1], new_df, df.loc[2:]]).reset_index(drop=True)
print(df)
Output:
   term  pay
0     2   22
1     7   30
2     9   49
3    10   50
4    11   60
5    13   70
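The same split can be written generically in terms of the insertion position i; a sketch using iloc, whose slicing is end-exclusive, so no off-by-one adjustment is needed:

i = 2  # position at which the new row should land
new_df = pd.DataFrame({"pay": 49, "term": 9}, index=[i])
df = pd.concat([df.iloc[:i], new_df, df.iloc[i:]]).reset_index(drop=True)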
A possible way is to prepare an empty slot in the index, add the row and sort according to the index:
df.index = list(range(2)) + list(range(3, len(df) + 1))
df.loc[2] = [49, 9]
It gives:
   term  pay
0     2   22
1     7   30
3    10   50
4    11   60
5    13   70
2    49    9
Time to sort it:
df = df.sort_index()
   term  pay
0     2   22
1     7   30
2    49    9
3    10   50
4    11   60
5    13   70
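A related trick that avoids rebuilding the index by hand: give the new row a fractional label that sorts between its neighbours, then sort, a sketch:

df.loc[1.5] = [49, 9]  # 1.5 sorts between the existing labels 1 and 2
df = df.sort_index().reset_index(drop=True)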
That is because the loc and iloc methods address an already existing row of the dataframe; what you would normally do is insert by appending a value as the last row.
To address this situation, first you need to split the dataframe, append the value you want, concatenate with the second split, and finally reset the index (in case you want to keep using integers):
# location where you want to insert
i = 2
# data to insert (term=9, pay=49, matching the expected output above)
data_to_insert = pd.DataFrame({'term': 9, 'pay': 49}, index=[i])
# split before i (loc slicing is inclusive, hence the i - 1), append the
# data to insert, then append the rest of the original
df = df.loc[:i-1].append(data_to_insert).append(df.loc[i:]).reset_index(drop=True)
Keep in mind that the slice operator works here because the index consists of integers.
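One caveat worth knowing: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same split-and-insert has to be spelled with pd.concat, as in the first answer:

import pandas as pd

df = pd.DataFrame({'term': [2, 7, 10, 11, 13], 'pay': [22, 30, 50, 60, 70]})

i = 2
data_to_insert = pd.DataFrame({'term': 9, 'pay': 49}, index=[i])

# loc slicing is inclusive, hence the i - 1 on the left-hand part
df = pd.concat([df.loc[:i-1], data_to_insert, df.loc[i:]]).reset_index(drop=True)
print(df)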

Multiply each element of a column by each element of a different dataframe

I have two data frames, both having the same number of columns, but the first data frame has multiple rows while the second one has only one row. I need to multiply the entries of the first data frame with those of the second, matched by column name.
DF:1
    A   B   C
0  34  54  56
1  12  87  78
2  78  35   0
3  84  25  14
4  26  82  13

DF:2
   A  B  C
0  2  3  1

Result
    A    B   C
   68  162  56
   24  261  78
  156  105   0
  168   75  14
   52  246  13
This will work. Here we are manipulating the NumPy arrays inside the DataFrames:
pd.DataFrame(df1.values * df2.values, columns=df1.columns, index=df1.index)
for col in df1.columns:
    # multiply each column of df1 by the single value in that column of df2
    df1[col] = df1[col] * df2[col].iloc[0]
There's probably an even simpler solution; play around with apply, map, and transform.
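There is indeed a simpler, vectorized form: take the single row of df2 as a Series and let pandas broadcast it across df1 by column label. A minimal sketch on the data above:

import pandas as pd

df1 = pd.DataFrame({'A': [34, 12, 78, 84, 26],
                    'B': [54, 87, 35, 25, 82],
                    'C': [56, 78, 0, 14, 13]})
df2 = pd.DataFrame({'A': [2], 'B': [3], 'C': [1]})

# df2.iloc[0] is a Series indexed by column name; mul aligns on those labels
result = df1.mul(df2.iloc[0], axis=1)
print(result)

Unlike the .values approach, this aligns by column name rather than position, so it still works if the two frames list their columns in different orders.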

Python Pandas: Select Multiple Cell Values of one column based on the Value of another Column

So my data, in Pandas, looks like this:
values  variables
   134          1
    12          2
    43          1
    54          3
    16          2
And I want to create a new column which is the sum of values over all rows where variables does not equal the variables of the current row. For example, for the first row, I would want to sum all the rows of values where variables != 1. The result would look like this:
values  variables  result
   134          1      82
    12          2     231
    43          1      82
    54          3     205
    16          2     231
I've tried a couple of things like enumerate, but I can't seem to get a good handle on this. Thanks!
Instead of finding the sum of all values whose variables entry differs from the current row's, you can equivalently subtract the sum of all values that do share the current row's variables from the unfiltered total sum:
df['result'] = df['values'].sum()
df['result'] -= df.groupby('variables')['values'].transform('sum')
Or in a single line if you want to be terse:
df['result'] = df['values'].sum() - df.groupby('variables')['values'].transform('sum')
The resulting output:
   values  variables  result
0     134          1      82
1      12          2     231
2      43          1      82
3      54          3     205
4      16          2     231
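For completeness, a self-contained run on the example data (names as in the question):

import pandas as pd

df = pd.DataFrame({'values': [134, 12, 43, 54, 16],
                   'variables': [1, 2, 1, 3, 2]})

# total sum minus each row's own group sum, broadcast back per row
df['result'] = df['values'].sum() - df.groupby('variables')['values'].transform('sum')
print(df)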
