I want to set entire rows to values from a vector if a condition on one column is met.
import pandas as pd
df = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 1, 1]], columns=('one', 'two', 'three'))
vector = pd.Series([2,3,4])
print(df)
  one  two  three
0   a    1      1
1   a    1      1
2   b    1      1
I want the result to be like this:
df_wanted = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 4, 4]], columns=('one', 'two', 'three'))
print(df_wanted)
  one  two  three
0   a    1      1
1   a    1      1
2   b    4      4
I tried this, but it gives me an error:
df.loc[df['one']=='b'] = vector[df['one']=='b']
ValueError: Must have equal len keys and value when setting with an iterable
You can specify the columns to set in a list:
df.loc[df['one']=='b', ['two', 'three']] = vector[df['one']=='b']
print(df)
  one  two  three
0   a    1      1
1   a    1      1
2   b    4      4
Or, if you need a more dynamic solution, select all numeric columns (this requires `import numpy as np`):
df.loc[df['one']=='b', df.select_dtypes(np.number).columns] = vector[df['one']=='b']
Or compare only once and assign the mask to a variable:
m = df['one']=='b'
df.loc[m, df.select_dtypes(np.number).columns] = vector[m]
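Putting the pieces above together as a self-contained sketch (note that select_dtypes needs numpy imported as np):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 1, 1]],
                  columns=('one', 'two', 'three'))
vector = pd.Series([2, 3, 4])

# Build the boolean mask once, then assign the index-aligned vector
# values to every numeric column of the matching rows.
m = df['one'] == 'b'
df.loc[m, df.select_dtypes(np.number).columns] = vector[m]
print(df)
```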
1. Input: we have a DataFrame:
ID  name
1   a
1   b
2   a
2   c
3   d
2. I kept the first occurrence of each duplicated 'name' (here the duplicate is 'a', with ID '2') and removed the rest. Output:
ID  name
1   a
1   b
2   c
3   d
Code I used:
df.loc[~df.duplicated(keep='first', subset=['name'])]
3. Now I also want to remove all rows sharing the same 'ID' as a removed duplicate (here the removed 'a' had ID '2', so we remove all rows with ID '2', i.e. we also remove [2, c]). Final expected output:
ID  name
1   a
1   b
3   d
Code I tried (but it is not working):
dt = df.name.duplicated(keep='first')
df.loc[~df.groupby(['ID','dt']).size().reset_index().drop(columns={0})]
You can use a kind of blacklist for the IDs:
Sample data:
import pandas as pd
d = {'ID':[1, 1, 2, 2, 3], 'name':['a', 'b', 'a', 'c', 'd']}
df = pd.DataFrame(d)
Code:
df[~df['ID'].isin(df[df['name'].duplicated()]['ID'])]
Output:
   ID name
0   1    a
1   1    b
4   3    d
Code simplified:
blacklist = df[df['name'].duplicated()]['ID']
mask = ~df['ID'].isin(blacklist)
df[mask]
If the DataFrame is ordered by ID, these two approaches should work:
df = pd.DataFrame(data={'ID': [1, 1, 1, 2, 3], 'name': ['a', 'b', 'a', 'c', 'd']})
df1 = df.loc[~df.duplicated(keep='first', subset=['ID'])]
df2 = df1.loc[~df1.duplicated(keep='first', subset=['name'])]
print(df2)
print(df.drop_duplicates(keep='first', subset=['ID']).drop_duplicates(keep='first', subset=['name']))
   ID name
0   1    a
3   2    c
4   3    d
If it is ordered by name, apply subset=['name'] first and then subset=['ID'].
I have a dataframe containing numerical values. I want to replace all values in the dataframe by comparing individual cell values to the respective elements of the list. The length of the list and the length of the columns are the same. Here's an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[101, 2, 3], [4, 500, 6], [712, 8, 9]]),
                  columns=['a', 'b', 'c'])
Output
     a    b  c
0  101    2  3
1    4  500  6
2  712    8  9
list_numbers = [100,100,100]
I want to compare individual cell values to the respective elements of the list. So column 'a' will be compared to 100: if a value is greater than 100, it should be replaced with one number, otherwise with another.
Here is my code so far:
df = pd.DataFrame(np.array([[101, 2, 3], [4, 500, 6], [712, 8, 9]]),
                  columns=['a', 'b', 'c'])
df_columns = df.columns
df_index = df.index
# Creating a new dataframe to store the values.
df1 = pd.DataFrame(index=df_index, columns=df_columns)
df1 = df1.fillna(0)
for index, value in enumerate(df.columns):
    # df.where replaces values where the condition is false
    df1[[value]] = df[[value]].where(df[[value]] > list_numbers[index], -1)
    df1[[value]] = df[[value]].where(df[[value]] < list_numbers[index], 1)
# I get NaN for column 'a' and an error for the other columns.
# The output should look something like this:
Output
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
Iterating over a DataFrame iterates over its column names. So you could simply do:
df1 = pd.DataFrame()
for i, c in enumerate(df):
    df1[c] = np.where(df[c] >= list_numbers[i], 1, -1)
You can avoid iterating over the columns and use numpy broadcasting instead, which is more efficient:
df1 = pd.DataFrame(
    np.where(df.values > np.array(list_numbers), 1, -1),
    columns=df.columns)
df1
Output:
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
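As a further pandas-native variant (an aside, not from the answers above): DataFrame.gt accepts a per-column threshold list with axis='columns', which avoids the explicit numpy conversion:

```python
import pandas as pd

df = pd.DataFrame([[101, 2, 3], [4, 500, 6], [712, 8, 9]],
                  columns=['a', 'b', 'c'])
list_numbers = [100, 100, 100]

# gt(..., axis='columns') compares each column against its own threshold;
# multiplying the boolean frame by 2 and subtracting 1 maps
# True -> 1 and False -> -1.
df1 = df.gt(list_numbers, axis='columns') * 2 - 1
print(df1)
```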
I have read the docs on advanced indexing with a hierarchical index, where using .loc with a MultiIndex is explained, and also this thread: Using .loc with a MultiIndex in pandas?
Still I don't see how to select rows where (first index == some value) or (second index == some value).
Example:
import pandas as pd
index = pd.MultiIndex.from_arrays([['a', 'a', 'a', 'b', 'b', 'b'],
                                   ['a', 'b', 'c', 'a', 'b', 'c']],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1,2,3,4,5,6], 'y': [6,5,4,3,2,1]}, index=index)
which gives this DataFrame:
       x  y
i0 i1
a  a   1  6
   b   2  5
   c   3  4
b  a   4  3
   b   5  2
   c   6  1
How can I get rows where i0 == 'b' or i1 == 'b'?
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
I think the easiest answer is to use the DataFrame.query function, which allows you to query the MultiIndex by level name as follows:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_arrays([list("aaabbb"),
                                   list("abcabc")],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}, index=index)
df.query('i0 == "b" | i1 == "b"')
returns:
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
Use get_level_values()
>>> mask = (df.index.get_level_values(0)=='b') | (df.index.get_level_values(1)=='b')
>>> df[mask] # same as df.loc[mask]
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
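If you later need to match several labels per level, Index.isin generalizes the == comparison (a sketch, assuming the same df as in the question):

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([list("aaabbb"), list("abcabc")],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]},
                  index=index)

# isin accepts a list of labels per level; with single-element lists this
# reproduces the == masks, but it scales to multiple labels.
mask = (df.index.get_level_values('i0').isin(['b'])
        | df.index.get_level_values('i1').isin(['b']))
print(df[mask])
```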
This might be possible with some logical condition on the index columns i0 and i1 using .loc. However, to me, using .iloc seems easier:
You can get the iloc index via pd.MultiIndex.get_locs.
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_arrays([list("aaabbb"),
                                   list("abcabc")],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}, index=index)
idx0 = index.get_locs(['b', slice(None)]) # i0 == 'b' => [3, 4, 5]
idx1 = index.get_locs([slice(None), 'b']) # i1 == 'b' => [1, 4]
idx = np.union1d(idx0, idx1)
print(df.iloc[idx])
will yield
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
Note:
slice(None) means the same as [:] in index-slicing.
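As a related convenience, pd.IndexSlice lets you write that [:]-style syntax directly inside .loc (an aside; it selects on one level at a time, so it covers each half of the "or", not the union):

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([list("aaabbb"), list("abcabc")],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]},
                  index=index)

idx = pd.IndexSlice
# all values of i0, with i1 == 'b' -- the same rows as
# index.get_locs([slice(None), 'b'])
sub = df.loc[idx[:, 'b'], :]
print(sub)
```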
Quite some time has passed since this question was asked. Having read the available answers, I still see the benefit of adding a response that shows, with minimal coding, how to select rows by giving a value for every index level.
To select by both indices as in your question, you can do:
df.loc[('b','b')]
Please note that the critical point here is to use parentheses () around the index values (a tuple). This will give the output:
x 5
y 2
Name: (b, b), dtype: int64
You can further select a column ('x' in my case) if needed:
df.loc[('b','b'),'x']
This will give output:
5
I'm using Python pandas and I want to reshape multiple columns that share the same index into a single column. Where possible, I also want to delete the zero values.
I have this data frame
index  A  B  C
a      8  0  1
b      2  3  0
c      0  4  0
d      3  2  7
I'd like my output to look like this
index  data  value
a      A     8
b      A     2
d      A     3
b      B     3
c      B     4
d      B     2
a      C     1
d      C     7
===
I solved this task as below. My original data has two indexes, and the zeros in the DataFrame were actually NaN values.
At first I tried to apply the melt function while removing NaN values, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I re-organized the DataFrame in two steps:
Firstly, I turned the multiple columns into one column with the melt function,
Then I removed the NaN values in each row with the dropna function.
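The two steps described above can be sketched as follows (a minimal sketch using the single-index sample from the question, with 0 standing in for the NaN values):

```python
import pandas as pd

df = pd.DataFrame({'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]},
                  index=list('abcd'))

# Step 1: melt the columns into one, keeping the index as a column.
melted = df.rename_axis('index').reset_index().melt(
    id_vars='index', var_name='data', value_name='value')

# Step 2: drop the zero rows (with real NaN values this would be dropna()).
melted = melted[melted['value'] != 0]
print(melted)
```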
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
With a little manipulation, you could solve the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on = 'A')
b = molten.merge(df, left_on='value', right_on = 'B')
c = molten.merge(df, left_on='value', right_on = 'C')
merge = pd.concat([a,b,c])
try this:
array = [['a', 8, 0, 1], ['b', 2, 3, 0] ... ]
cols = ['A', 'B', 'C']
result = [[[array[i][0], cols[j], array[i][j + 1]]
           for i in range(len(array))] for j in range(len(cols))]
output:
[[['a', 'A', 8], ['b', 'A', 2]], [['a', 'B', 0], ['b', 'B', 3]] ... ]
Basically I am trying to do the opposite of How to generate a list from a pandas DataFrame with the column name and column values?
To borrow that example, I want to go from the form:
data = [['Name','Rank','Complete'],
['one', 1, 1],
['two', 2, 1],
['three', 3, 1],
['four', 4, 1],
['five', 5, 1]]
which should output:
Name   Rank  Complete
One       1         1
Two       2         1
Three     3         1
Four      4         1
Five      5         1
However when I do something like:
pd.DataFrame(data)
I get a DataFrame with default column names and row index, whereas the first list should be my column names and the first element of each list should be the row name.
EDIT:
To clarify, I want the first element of each list to be the row name. I am scraping data, so it is formatted this way...
One way to do this would be to take the first list as the column names and pass only the rows from index 1 onward to pd.DataFrame:
In [8]: data = [['Name','Rank','Complete'],
...: ['one', 1, 1],
...: ['two', 2, 1],
...: ['three', 3, 1],
...: ['four', 4, 1],
...: ['five', 5, 1]]
In [10]: df = pd.DataFrame(data[1:],columns=data[0])
In [11]: df
Out[11]:
    Name  Rank  Complete
0    one     1         1
1    two     2         1
2  three     3         1
3   four     4         1
4   five     5         1
If you want to set the first column, Name, as the index, use the .set_index() method and pass in the column to use as the index. Example:
In [16]: df = pd.DataFrame(data[1:],columns=data[0]).set_index('Name')
In [17]: df
Out[17]:
       Rank  Complete
Name
one       1         1
two       2         1
three     3         1
four      4         1
five      5         1