Match Value and Get its Column Header in Python

I have a 1000-by-6 dataframe in which columns A, B, C, and D were rated by people on a scale of 1-10.
The SELECT column holds a value which, in every row, is the same as the value in one of A/B/C/D.
I want to change the value in SELECT to the name of the column it matches. For example, for ID 1, SELECT = 1 and D = 1, so the value of SELECT should change to D.
import pandas as pd
df = pd.read_excel("u.xlsx", sheet_name="Sheet2", header=0)
But I am lost as to how to proceed.
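For reference, here's a small frame that reproduces the printed examples in the answers below (the values are copied from those outputs, since the real data lives in the Excel file):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'A': [4, 5, 7],
                   'B': [9, 7, 4],
                   'C': [7, 2, 8],
                   'D': [1, 8, 6],
                   'SELECT': [1, 2, 8]})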

Gwenersl's solution compares all columns except ID and SELECT (selected via Index.difference) with DataFrame.eq (==), takes the first True value per row with idxmax, and, where no value matches, falls back to 'no match' via numpy.where:
import numpy as np

cols = df.columns.difference(['ID', 'SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
print (df)
   ID  A  B  C  D SELECT
0   1  4  9  7  1      D
1   2  5  7  2  8      C
2   3  7  4  8  6      C
Detail:
print (mask)
       A      B      C      D
0  False  False  False   True
1  False  False   True  False
2  False  False   True  False
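Note that the numpy.where guard is not optional: Series.idxmax returns the first label on ties, so an all-False row (no match anywhere) would silently report 'A'. A minimal sketch of that failure mode:
import pandas as pd

row = pd.Series({'A': False, 'B': False, 'C': False, 'D': False})
print(row.idxmax())  # prints 'A' even though nothing matched - hence the any(axis=1) guard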

Assuming the values in A, B, C, D are unique in each row with respect to SELECT, I'd do it like this:
>>> df
   ID  A  B  C  D  SELECT
0   1  4  9  7  1       1
1   2  5  7  2  8       2
2   3  7  4  8  6       8
>>>
>>> df_abcd = df.loc[:, 'A':'D']
>>> df['SELECT'] = df_abcd.apply(lambda row: row.eq(df.at[row.name, 'SELECT']).idxmax(), axis=1)
>>> df
   ID  A  B  C  D SELECT
0   1  4  9  7  1      D
1   2  5  7  2  8      C
2   3  7  4  8  6      C

Use:
# stack one boolean Series per candidate column, transpose so rows align with df,
# and take the first True per row; the +1 offsets past the ID column (position 0)
df['SELECT2'] = df.columns[pd.DataFrame([df['SELECT'] == df['A'],
                                         df['SELECT'] == df['B'],
                                         df['SELECT'] == df['C'],
                                         df['SELECT'] == df['D']]).transpose().idxmax(1) + 1]
Output
   ID  A  B  C  D  SELECT SELECT2
0   1  4  9  7  1       1       D
1   2  5  7  2  8       2       C
2   3  7  4  8  6       8       C
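A related idiom worth knowing, in case a row could ever match more than one column: a matrix product of the boolean mask with the column names concatenates every matching name per row. A sketch reusing the comparison from the first answer (not part of the original answers):
cols = df.columns.difference(['ID', 'SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
# True * 'name' is 'name' and False * 'name' is '', so the row-wise dot
# product concatenates the names of all matching columns
df['SELECT2'] = mask.dot(cols + ',').str.rstrip(',')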

Related

How to add occurrence of each entry to pandas data frame?

Let k be a pandas data frame with a column of letters and a column of integers:
>>> import numpy
>>> import pandas as pd
>>> k = pd.DataFrame({
...     "a": numpy.random.choice([i for i in "abcde"], 10),
...     "b": numpy.random.choice(range(5), 10)
... })
>>> k
   a  b
0  a  1
1  c  2
2  e  1
3  b  3
4  c  2
5  d  2
6  e  2
7  c  3
8  b  0
9  a  3
Using value_counts(), the counts of the letters are found:
>>> counts = k["a"].value_counts()
>>> counts
c    3
e    2
b    2
a    2
d    1
Name: a, dtype: int64
How do I add each occurrence count to the respective row? It should result in
>>> k
   a  b  count
0  a  1      2
1  c  2      3
2  e  1      2
[...]
9  a  3      2
Here's an alternative to using transform:
First, you can extract the value_counts() into a dataframe:
mycounts = k['a'].value_counts().rename_axis('a').reset_index(name='counts')
The step above is useful in many different scenarios (and good to know in general).
Then, a left-join will put the value counts into the original dataframe:
k = k.merge(mycounts, left_on = 'a', right_on = 'a', how = 'left')
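With the sample k above, the merge should produce (row order is preserved by the left join):
>>> k
   a  b  counts
0  a  1       2
1  c  2       3
2  e  1       2
3  b  3       2
4  c  2       3
5  d  2       1
6  e  2       2
7  c  3       3
8  b  0       2
9  a  3       2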
You can try transform:
k['count'] = k.groupby('a').a.transform('count')
k
Out[330]:
   a  b  count
0  d  1      2
1  e  3      3
2  e  3      3
3  d  3      2
4  b  4      4
5  b  1      4
6  b  0      4
7  a  2      1
8  b  0      4
9  e  4      3
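A map-based equivalent, if you already have the value_counts Series at hand (a sketch; map aligns each letter with its count via the Series index):
k['count'] = k['a'].map(k['a'].value_counts())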

How to update the last column value in all the rows in csv file using python(pandas)

I am trying to update the last column value for all the rows in a csv file using pandas, but while updating that value the other column values are being dropped.
import pandas as pd

file = r'Test.csv'
# Read the file
df = pd.read_csv(file, error_bad_lines=False)
# df.at[3, "ingestion"] = '20'
df.set_value(1, "ingestion", '30')
df.to_csv("Test.csv", index=False, sep='|')
Use DataFrame.iloc with -1 to select the last column and : to select all rows:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  f  4  3  0  4  b
df.iloc[:, -1] = '20'
print (df)
   A  B  C  D  E   F
0  a  4  7  1  5  20
1  b  5  8  3  3  20
2  c  4  9  5  6  20
3  d  5  4  7  9  20
4  e  5  2  1  2  20
5  f  4  3  0  4  20
EDIT:
To instead overwrite the whole last row with its final value, swap the -1 and :, fetching that value with DataFrame.iat:
df.iloc[-1, :] = df.iat[-1, -1]
print (df)
   A  B  C  D  E  F
0  a  4  7  1  5  a
1  b  5  8  3  3  a
2  c  4  9  5  6  a
3  d  5  4  7  9  b
4  e  5  2  1  2  b
5  b  b  b  b  b  b
pd.DataFrame.set_value is not appropriate for setting all the values in a column. As per the docs, it is used for setting a scalar at a specific row and column label combination.
Moreover, since v0.21, it has been deprecated in favour of .at / .iat accessors.
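For reference, a one-for-one replacement of the deprecated single-cell call in the question would look like this (a sketch assuming the question's 'ingestion' column):
# label-based single-cell assignment (replaces set_value)
df.at[1, "ingestion"] = '30'
# positional equivalent
df.iat[1, df.columns.get_loc("ingestion")] = '30'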
Instead, you can set the value directly by extracting the final column label, assuming you have no duplicate column names:
df[df.columns[-1]] = '20'
Or, more directly, you can use the iloc accessor:
df.iloc[:, -1] = '20'

python - sum list of columns, even if not all there

I have a dataframe that looks like this
   A  B  C  D  G
0  9  5  7  6  1
1  1  4  7  3  1
2  8  4  1  3  1
generated by this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
x = np.array([[1, 2]])
df['G'] = np.repeat(x, 5)
Suppose there are times when a certain column 'E' exists, and sometimes it doesn't depending on the time frame of the data.
So sometimes we have
   A  B  C  D  E  G
0  9  5  7  6  2  1
1  1  4  7  3  3  1
2  8  4  1  3  4  1
Either way, I'd like to sum the rows from columns A, C, and E, grouped by column G. When column E exists, I just use
df.groupby('G')[['A', 'C', 'E']].sum()
but when E doesn't exist, as in the first dataframe, it doesn't work.
What do I need to do in order to sum even if a column is missing?
You could store the columns you wish to sum in a list, sum_cols = list('ACE'), and then take the intersection of that list with the columns of whatever DataFrame you're working with.
df.groupby('G')[df.columns.intersection(sum_cols)].sum()
Demo
>>> df = pd.DataFrame(np.random.randint(0, 10, (2, 5)),
...                   columns=list('ABCDG'))
>>> df
   A  B  C  D  G
0  9  5  9  2  6
1  3  1  1  1  3
>>> sum_cols = list('ACE')
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
   A  C
G
3  3  1
6  9  9
>>> df['E'] = [100, 200]
>>> df.groupby('G')[df.columns.intersection(sum_cols)].sum()
   A  C    E
G
3  3  1  200
6  9  9  100
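If you'd rather not build an Index object, a plain list comprehension does the same filtering (a sketch):
present = [c for c in sum_cols if c in df.columns]
df.groupby('G')[present].sum()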

Split one column into two columns depending on the content in a pandas dataframe

I have a pandas DataFrame like this:
df = pd.DataFrame(['A',1,2,3,'B',4,5,'C',6,7,8,9])
    0
0   A
1   1
2   2
3   3
4   B
5   4
6   5
7   C
8   6
9   7
10  8
11  9
It's a mix of strings and numbers. I want to split this DF into two columns like this:
  name value
0    A     1
1    A     2
2    A     3
3    B     4
4    B     5
5    C     6
6    C     7
7    C     8
8    C     9
What's an efficient way to do this?
You can use:
df = pd.DataFrame({0: ['A',1,2,3,'B',4,5,'C',6,7,8,9]})
# mark rows that hold strings
mask = df[0].astype(str).str.isalpha()
# alternative check for mixed values - numeric with strings
# mask = df[0].apply(lambda x: isinstance(x, str))
# insert a 'name' column at the first position: letters stay,
# other rows become NaN and are forward filled
df.insert(0, 'name', df[0].where(mask).ffill())
# drop the rows holding the names themselves, rename the value column
df = df[df['name'] != df[0]].rename(columns={0: 'value'}).reset_index(drop=True)
print (df)
  name value
0    A     1
1    A     2
2    A     3
3    B     4
4    B     5
5    C     6
6    C     7
7    C     8
8    C     9
Or:
out = []
acc = None
for x in df[0]:
    # check if string
    if isinstance(x, str):
        # remember the current name for the tuples
        acc = x
    else:
        # append tuple to out
        out.append((acc, x))
print (out)
df = pd.DataFrame(out, columns=['name','value'])
print (df)
  name value
0    A     1
1    A     2
2    A     3
3    B     4
4    B     5
5    C     6
6    C     7
7    C     8
8    C     9
IIUC
df['New'] = df[df.your.str.isalpha().fillna(False)]
df.ffill().loc[df.your != df.New, :]
Out[217]:
   your New
1     1   A
2     2   A
3     3   A
5     4   B
6     5   B
8     6   C
9     7   C
10    8   C
11    9   C
Data input
df = pd.DataFrame({'your':['A',1,2,3,'B',4,5,'C',6,7,8,9]})
This will give you the data structure to get what you want:
input = ['A',1,2,3,'B',4,5,'C',6,7,8,9]
letter = None
output = []
for i in input:
    if type(i) is type(''):
        letter = i
    elif type(i) is type(0) and letter is not None:
        output.append((letter, i))
print(output)
Output now has a sequence of tuples, paired as you wish. I don't use pandas.
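If you do want a DataFrame at the end, the conversion from the earlier answer applies unchanged (a one-liner sketch):
import pandas as pd
df = pd.DataFrame(output, columns=['name', 'value'])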

How to exclude values from pandas dataframe?

I have two dataframes:
1) customer_id, gender
2) customer_id, ...[other fields]
The first dataset is an answer dataset (gender is the answer). I want to take from the second dataset those customer_id values which are in the first dataset (whose gender we know) and call that part 'train'. The remaining records should become a 'test' dataset.
I think you need boolean indexing with an isin condition; the boolean Series is inverted with ~:
df1 = pd.DataFrame({'customer_id':[1,2,3],
                    'gender':['m','f','m']})
print (df1)
   customer_id gender
0            1      m
1            2      f
2            3      m
df2 = pd.DataFrame({'customer_id':[1,7,5],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
   B  C  D  E  F  customer_id
0  4  7  1  5  7            1
1  5  8  3  3  4            7
2  6  9  5  6  3            5
mask = df2.customer_id.isin(df1.customer_id)
print (mask)
0     True
1    False
2    False
Name: customer_id, dtype: bool
print (~mask)
0    False
1     True
2     True
Name: customer_id, dtype: bool
train = df2[mask]
print (train)
   B  C  D  E  F  customer_id
0  4  7  1  5  7            1
test = df2[~mask]
print (test)
   B  C  D  E  F  customer_id
1  5  8  3  3  4            7
2  6  9  5  6  3            5
