Totalling the matching values of a dataframe column with Series values - python

I have a Series:
350 0
254 1
490 0
688 0
393 1
30 1
and a dataframe:
0 outcome
0 350 1
1 254 1
2 490 0
3 688 0
4 393 0
5 30 1
The code below counts the total number of matches between the Series and the outcome column of the dataframe, which is what was intended.
Is there a better way to do this?
i = 0
match = 0
for pred in results['outcome']:
    if test.values[i] == pred:
        match += 1
    i += 1
print(match)
I tried using results['Survived'].eq(labels_test).sum(), but it gives the wrong answer.
I also tried a lambda, but I couldn't get the syntax right.

You can compare by mapping the Series onto the dataframe's first column, i.e.
(df['0'].map(s) == df['outcome']).sum()
4

First, align the dataframe and series using align.
df, s = df.set_index('0').align(s, axis=0)
Next, compare the outcome column with the values in s and count the number of True values:
df.outcome.eq(s).sum()
4
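Both approaches can be checked end to end on the sample data from the question; a minimal, self-contained sketch (the variable names here are illustrative):

```python
import pandas as pd

# Predictions, indexed by id (the Series from the question)
s = pd.Series([0, 1, 0, 0, 1, 1], index=[350, 254, 490, 688, 393, 30])

# Dataframe with the same ids in column '0' and the actual outcomes
df = pd.DataFrame({'0': [350, 254, 490, 688, 393, 30],
                   'outcome': [1, 1, 0, 0, 0, 1]})

# Map each id to its predicted value, then count matches with the outcome
matches = (df['0'].map(s) == df['outcome']).sum()
print(matches)  # 4
```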

Related

Length of index suggests different number of rows compared to looping through and printing index in Pandas DF

I have a Pandas data frame. The number of rows is:
len(df.index)
529
Also, shape shows:
df.shape
(529, 5)
But if I loop through:
for i in df.index:
    print(i)
It prints:
0
1
2
...
728
729
732
Suggesting 732 rows. Same with iterrows():
for ind, col in df.iterrows():
    print(ind)
It prints:
0
1
2
...
728
729
732
It looks like your indices are not necessarily sequential -- e.g., you have
729
732
I'm guessing you filtered a larger dataframe?
As you can see, the index can jump, e.g. from 729 to 732. Your data looks something like this:
df = pd.DataFrame(np.arange(10).reshape(5,2), index=[0,1,4,5,10])
which is:
0 1
0 0 1
1 2 3
4 4 5
5 6 7
10 8 9
The index labels do not form a contiguous integer range, so the largest label can exceed the number of rows.
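The row count and the largest index label are different things; a short sketch of the distinction, with reset_index as a fix if a contiguous 0..n-1 index is wanted:

```python
import numpy as np
import pandas as pd

# Non-contiguous index, as after filtering a larger dataframe
df = pd.DataFrame(np.arange(10).reshape(5, 2), index=[0, 1, 4, 5, 10])

print(len(df.index))   # 5 -- number of rows
print(df.index.max())  # 10 -- largest label, not the row count

# reset_index gives a fresh 0..n-1 RangeIndex if you need one
df = df.reset_index(drop=True)
print(list(df.index))  # [0, 1, 2, 3, 4]
```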

Calculation within Pandas dataframe group

I have a Pandas DataFrame as shown below. What I'm trying to do is partition (or groupby) by BlockID, LineID, WordID, and then within each group use the current WordStartX minus the previous (WordStartX + WordWidth) to derive another column, e.g. WordDistance, indicating the distance between this word and the previous word.
This post Row operations within a group of a pandas dataframe is very helpful but in my case multiple columns involved (WordStartX and WordWidth).
   BlockID  LineID  WordID  WordStartX  WordWidth      WordDistance
0        0       0       0         275        150                 0
1        0       0       1         431         96   431-(275+150)=6
2        0       0       2         642         90  642-(431+96)=115
3        0       0       3         746        104   746-(642+90)=14
4        1       0       0         273         69               ...
5        1       0       1         352        151               ...
6        1       0       2         510         92
7        1       0       3         647         90
8        1       0       4         752        105
The diff() and shift() functions are usually helpful for calculation referring to previous or next rows:
df['WordDistance'] = (df.groupby(['BlockID', 'LineID'])
                        .apply(lambda g: g['WordStartX'].diff() - g['WordWidth'].shift())
                        .fillna(0).values)
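An equivalent formulation (my variant, not part of the original answer) uses two grouped shift calls instead of apply, which keeps everything vectorised; on the sample rows above it reproduces the expected distances:

```python
import pandas as pd

df = pd.DataFrame({
    'BlockID':    [0, 0, 0, 0, 1, 1, 1, 1, 1],
    'LineID':     [0, 0, 0, 0, 0, 0, 0, 0, 0],
    'WordID':     [0, 1, 2, 3, 0, 1, 2, 3, 4],
    'WordStartX': [275, 431, 642, 746, 273, 352, 510, 647, 752],
    'WordWidth':  [150, 96, 90, 104, 69, 151, 92, 90, 105],
})

# Previous word's start and width, taken within each (BlockID, LineID) group
g = df.groupby(['BlockID', 'LineID'])
df['WordDistance'] = (df['WordStartX']
                      - (g['WordStartX'].shift() + g['WordWidth'].shift())).fillna(0)
print(df['WordDistance'].tolist())
# [0.0, 6.0, 115.0, 14.0, 0.0, 10.0, 7.0, 45.0, 15.0]
```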

python pandas df: adding to one column depending on the value in that row in another column

I have a df that looks like:
a b c d
0 0 0 0 0
1 0 0 0 0
2 1 292 0 0
3 0 500 1 406
4 1 335 0 0
I would like to find the sum of column b where a=1 for that row. So in my example I would want rows 2 and 4 added (just column b), but not row 3. If it makes any difference, there are only 0s and 1s. Thanks for any help!
You need to use .loc
>>> df.loc[df.a==1, 'b'].sum()
627
You can review the docs here for indexing and selecting data.
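Reproducing the answer on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 0, 1],
                   'b': [0, 0, 292, 500, 335],
                   'c': [0, 0, 0, 1, 0],
                   'd': [0, 0, 0, 406, 0]})

# The boolean mask df.a == 1 selects rows 2 and 4; sum their b values
total = df.loc[df.a == 1, 'b'].sum()
print(total)  # 627 (292 + 335)
```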

MultiIndex pivot table pandas python

import pandas as pd
data = pd.read_excel('.../data.xlsx')
The content looks like this:
Out[57]:
Block Concentration Name Replicate value
0 1 100 GlcNAc2 1 321
1 1 100 GlcNAc2 2 139
2 1 100 GlcNAc2 3 202
3 1 33 GlcNAc2 1 86
4 1 33 GlcNAc2 2 194
5 1 33 GlcNAc2 3 452
6 1 10 GlcNAc2 1 140
7 1 10 GlcNAc2 2 285
... ... ... ... ... ...
1742 24 0 Print buffer 1 -9968
1743 24 0 Print buffer 2 -4526
1744 24 0 Print buffer 3 14246
[1752 rows x 5 columns]
Pivot table looks like this (only a part of the large table):
newdata = data.pivot_table(index=["Block", "Concentration"],columns=["Name","Replicate"], values="value")
My question:
How do I fill the '0' concentration of 'GlcNAc2' and 'Man5GIcNAc2' with the 'Print buffer' values?
desired output:
I have been searching online and haven't found anything similar. I have not even found a way to select the 'Print buffer' values from the 'Name' column.
The MultiIndex/advanced indexing chapter says to use
df.xs('one', level='second')
but it doesn't work with my pivot table, and I'm not sure why; I'm confused. Is a pivot table a MultiIndex?
If I understand correctly, you want to duplicate the values with Name == Print buffer to columns with Name == 'GlcNAc2' and 'Man5GIcNAc2' and concentration = 0.
A way of doing this is to duplicate the rows in the original dataset:
selection = data[data["Name"] == "Print buffer"].copy()
selection["Name"] = "GlcNAc2"
data = pd.concat([data, selection])
selection = selection.copy()
selection["Name"] = "Man5GIcNAc2"
data = pd.concat([data, selection])
And then apply the pivot_table.
Remark: I am not sure that I fully understand your question. I am confused by the fact that in your pictures, the values for Block == 1 change from the first picture to the second. Is that just a mistake, or is it the core of your problem?
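A runnable sketch of the duplication idea on a tiny made-up subset (the values here are illustrative, not the real spreadsheet):

```python
import pandas as pd

# Tiny sample modeled on the question's columns
data = pd.DataFrame({
    'Block': [1, 1, 1],
    'Concentration': [100, 100, 0],
    'Name': ['GlcNAc2', 'Man5GIcNAc2', 'Print buffer'],
    'Replicate': [1, 1, 1],
    'value': [321, 150, -9968],
})

# Duplicate the Print buffer rows under each target name
for name in ['GlcNAc2', 'Man5GIcNAc2']:
    selection = data[data['Name'] == 'Print buffer'].copy()
    selection['Name'] = name
    data = pd.concat([data, selection], ignore_index=True)

newdata = data.pivot_table(index=['Block', 'Concentration'],
                           columns=['Name', 'Replicate'], values='value')
# The 0-concentration cells of both names now carry the Print buffer value
print(newdata.loc[(1, 0), ('GlcNAc2', 1)])  # -9968.0
```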

retaining order of columns after pivot

I have a N x 3 DataFrame called A that looks like this:
_Segment _Article Binaire
0 550 5568226 1
1 550 5612047 1
2 550 5909228 1
3 550 5924375 1
4 550 5924456 1
5 550 6096557 1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing values (NaN) with zeros:
B[np.isnan(B)] = 0
and get:
Binaire \
_Article 2332299 2332329 2332337 2932377 2968223 3195643 3346080
_Segment
550 0 0 0 0 0 0 0
551 0 0 0 0 0 0 0
552 0 0 0 0 0 0 0
553 1 1 1 0 0 0 1
554 0 0 0 1 0 1 0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the column _Article
order_art = A['_Article']
In the pivot, add the "values" argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which would prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
then, as before, replace nan's with zeros
B[np.isnan(B)] = 0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
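For reference, the whole recipe on a three-row sample (data made up for illustration):

```python
import pandas as pd

A = pd.DataFrame({'_Segment': [550, 550, 551],
                  '_Article': [5909228, 5568226, 5612047],
                  'Binaire':  [1, 1, 1]})

# Remember the original (unsorted) column order before pivoting
order_art = A['_Article']

# values= avoids hierarchical columns, so reindex can work directly
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
B = B.fillna(0)

# Restore the original _Article order across the columns
B = B.reindex(columns=order_art)
print(list(B.columns))  # [5909228, 5568226, 5612047]
```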
