Iterate over columns using conditional in python? - python

When working with a DataFrame, is there a way to change the value of a cell based on a value in a column?
For example, I have a DataFrame of exam results that looks like this:
answer_is_a answer_is_c
0 a a
1 b b
2 c c
I want to code them as correct (1) and incorrect (0), so the result would look like this:
answer_is_a answer_is_c
0 1 0
1 0 0
2 0 1
So I need to iterate over the entire DataFrame, compare what is already in the cell with the last character of the column header, and then change the cell value.
Any thoughts?

By default, DataFrame.apply iterates through the columns, passing each as a series to the function you feed it. Series have a name attribute that is a string we'll use to extract the answer.
So you could do this:
from io import StringIO
import pandas
data = StringIO("""\
answer_is_a answer_is_c
a a
b b
c c
""")
x = (
pandas.read_table(data, sep=r'\s+')
.apply(lambda col: col == col.name.split('_')[-1])
.astype(int)
)
And x prints out as:
answer_is_a answer_is_c
0 1 0
1 0 0
2 0 1
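The same per-column comparison can also be written without apply. This is only a sketch, assuming every column name ends in "_<answer>" as in the question; the names df and coded are illustrative and not part of the original answer.

```python
import pandas as pd

# Toy data matching the question's layout.
df = pd.DataFrame({
    "answer_is_a": ["a", "b", "c"],
    "answer_is_c": ["a", "b", "c"],
})

# For each column, compare its values with the text after the last
# underscore in the column name, then convert True/False to 1/0.
coded = pd.DataFrame({
    name: (df[name] == name.rsplit("_", 1)[-1]).astype(int)
    for name in df.columns
})

print(coded)
```

Building the result column by column makes the link between the header suffix and the comparison explicit, at the cost of being slightly longer than the apply version.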

Related

Defining an aggregation function with groupby in pandas

I would like to collapse my dataset using groupby and agg, however after collapsing, I want the new column to show a string value only for the grouped rows.
For example, the initial data is:
df = pd.DataFrame([["a",1],["a",3],["b",2]], columns=['category','value'])
category value
0 a 1
1 a 3
2 b 2
Desired output:
category value
0 a grouped
1 b 2
How should I modify my code (to show "grouped" instead of 3):
df=df.groupby(['category'], as_index=False).agg({'value':'max'})
You can use a lambda with a conditional expression (return the single value with x.iloc[0] rather than the whole Series):
(df.groupby("category", as_index=False)
.agg({"value": lambda x: "grouped" if len(x) > 1 else x.iloc[0]}))
This outputs:
category value
0 a grouped
1 b 2
Another possible solution:
(df.assign(value = np.where(
df.duplicated(subset=['category'], keep=False), 'grouped', df['value']))
.drop_duplicates())
Output:
category value
0 a grouped
2 b 2
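A further variant, given as a sketch rather than a definitive solution: aggregate the maximum and the group size together, then overwrite the value wherever a group had more than one row. The helper column n and the name result are made up for this example.

```python
import pandas as pd

df = pd.DataFrame([["a", 1], ["a", 3], ["b", 2]],
                  columns=["category", "value"])

# Aggregate the max and the group size in one pass (named aggregation).
agg = df.groupby("category", as_index=False).agg(
    value=("value", "max"),
    n=("value", "size"),
)

# Cast to object so the column can hold both numbers and the marker string,
# then replace the value for every group that had more than one row.
agg["value"] = agg["value"].astype(object).mask(agg["n"] > 1, "grouped")

result = agg.drop(columns="n")
print(result)
```

Untouched groups keep their numeric value here; they are not converted to strings.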

How to count the number of occurrences that a certain value occurs in a DataFrame according to another column?

I have a Pandas DataFrame that has two columns as such:
item1 label
0 a 0
1 a 1
2 b 0
3 c 0
4 a 1
5 a 0
6 b 0
In total, there are three kinds of items in the column item1: a, b, and c. Each entry of the label column is either 0 or 1.
What I want to do is receive a DataFrame where I have a count of how many entries in item1 have label value 1. Using the toy example above, the desired DataFrame would be something like:
item1 label
0 a 2
1 b 0
2 c 0
How might I achieve something like that?
I've tried using the following line of code:
df[['item1', 'label']].groupby('item1').sum()['label']
but the result is a Pandas Series and also displays some behaviors and properties that aren't desired.
IIUC, you can use pd.crosstab:
count_1=pd.crosstab(df['item1'],df['label'])[1]
print(count_1)
item1
a 2
b 0
c 0
Name: 1, dtype: int64
To get a DataFrame:
count_1=pd.crosstab(df['item1'],df['label'])[1].rename('label').reset_index()
print(count_1)
item1 label
0 a 2
1 b 0
2 c 0
The good thing about this method is that it also lets you get the number of 0s easily (select column 0 instead of 1), which you don't get if you use sum.
Filtering columns before groupby is not necessary; you can instead select the column to aggregate after groupby. To get a two-column DataFrame, add the as_index=False parameter:
df = df.groupby('item1', as_index=False)['label'].sum()
An alternative is to use Series.reset_index:
df = df.groupby('item1')['label'].sum().reset_index()
print (df)
item1 label
0 a 2
1 b 0
2 c 0
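Summing only works as a count because the labels are 0 and 1. If the column could hold other values, you could count matches explicitly instead. This is a minimal sketch on the toy data from the question; the name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "item1": ["a", "a", "b", "c", "a", "a", "b"],
    "label": [0, 1, 0, 0, 1, 0, 0],
})

# Mark rows whose label equals 1, group the marks by item1 and add them up.
counts = (
    df["label"].eq(1)
    .groupby(df["item1"])
    .sum()
    .astype(int)
    .rename("label")
    .reset_index()
)

print(counts)
```

Changing eq(1) to eq(0) counts the zeros per item in the same way.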

Append row to the end of a dataframe with loc function

I have a dataframe that is a result of some multiple step processing. I am adding one row to this dataframe like so:
df.loc['newindex'] = 0
Where 'newindex' is unique to the dataframe. I expect the new row to show up as the last row in the dataframe, but it shows up somewhere near the middle.
What could be the reason for this behavior? I need to add the row exactly at the last position, with its index name preserved.
* update *
I was wrong about uniqueness of the df index. The value has already been there.
I think the value newindex is already in the index, so loc selects and overwrites the existing row instead of appending a new one:
df = pd.DataFrame({'a':range(5)}, index=['a','s','newindex','d','f'])
print (df)
a
a 0
s 1
newindex 2
d 3
f 4
df.loc['newindex'] = 0
df.loc['newindex1'] = 0
print (df)
a
a 0
s 1
newindex 0
d 3
f 4
newindex1 0
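To avoid overwriting a row by accident, you can check the index before assigning. The function append_unique_row below is a hypothetical helper written for this example, not a pandas API.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)}, index=["a", "s", "newindex", "d", "f"])

def append_unique_row(frame, label, value):
    """Append a row with loc, refusing to overwrite an existing label."""
    if label in frame.index:
        raise KeyError(f"index label {label!r} already exists")
    frame.loc[label] = value  # label is new, so loc adds it as the last row
    return frame

append_unique_row(df, "newindex1", 0)
print(df)
```

Calling append_unique_row(df, "newindex", 0) on this frame raises KeyError instead of silently changing the existing row.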

I want to pick out one column of the DataFrame but the result is automatically ordered by values

I just need one column of my dataframe, but in the original order. When I extract it, it is sorted by the values, and I can't understand why. I tried different ways to pick out one column, but every time it was sorted by the values.
this is my code:
import pandas
data = pandas.read_csv('/data.csv', sep=';')
longti = data.iloc[:,4]
Your code should work for returning a column in its original order. For example, to return the first column:
import pandas as pd
df = pd.DataFrame(dict(A=[1,2,3,4,5,6], B=['A','B','C','D','E','F']))
first = df.iloc[:,0]
Out:
0 1
1 2
2 3
3 4
4 5
5 6
If you want to return the second column, you can use the following (note that df itself must not be overwritten by the first selection):
second = df.iloc[:,1]
Out:
0 A
1 B
2 C
3 D
4 E
5 F
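A quick way to check that positional selection itself does not reorder anything is to use data that is not already sorted. This is a sketch with made-up values; the real file from the question is not available, so the cause there remains unknown.

```python
import pandas as pd

# Deliberately unsorted values so any reordering would be visible.
df = pd.DataFrame({"A": [3, 1, 2], "B": ["x", "y", "z"]})

col = df.iloc[:, 0]             # keeps the row order of df
sorted_col = col.sort_values()  # only an explicit sort changes the order

print(col.tolist())
print(sorted_col.tolist())

# The original order can be restored from the index labels.
restored = sorted_col.sort_index()
```

If the selected column still looks sorted, it is worth checking whether the source data is already ordered by that column, or whether a sort is applied somewhere earlier in the code.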

Pandas automatically converts row to column

I have a very simple dataframe like so:
In [8]: df
Out[8]:
A B C
0 2 a a
1 3 s 3
2 4 c !
3 1 f 1
My goal is to extract the first row in such a way that looks like this:
A B C
0 2 a a
As you can see the dataframe shape (1x3) is preserved and the first row still has 3 columns.
However when I type the following command df.loc[0] the output result is this:
df.loc[0]
Out[9]:
A 2
B a
C a
Name: 0, dtype: object
As you can see, the row has turned into a column with 3 rows (3x1 instead of 1x3). How is this possible? How can I simply extract the row and preserve its shape as described in my goal? Could you provide a smart and elegant way to do it?
I tried to use the transpose command .T but without success. I know I could create another dataframe whose columns are extracted from the original dataframe, but this is quite tedious and not elegant, I would say (pd.DataFrame({'A':[2], 'B':'a', 'C':'a'})).
Here is the dataframe if you need it:
import pandas as pd
df = pd.DataFrame({'A':[2,3,4,1], 'B':['a','s','c','f'], 'C':['a', 3, '!', 1]})
You need to add [] (pass a list of labels) to get a DataFrame:
#select by index value
print (df.loc[[0]])
A B C
0 2 a a
Or:
print (df.iloc[[0]])
A B C
0 2 a a
If you need to transpose the Series, first convert it to a DataFrame with to_frame:
print (df.loc[0].to_frame())
0
A 2
B a
C a
print (df.loc[0].to_frame().T)
A B C
0 2 a a
Using a range (slice) selector will also preserve the DataFrame format.
df.iloc[0:1]
Out[221]:
A B C
0 2 a a
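The difference between the selectors is easiest to see by comparing shapes. A short sketch using the dataframe from the question; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 4, 1],
                   "B": ["a", "s", "c", "f"],
                   "C": ["a", 3, "!", 1]})

as_series = df.loc[0]    # scalar label: one-dimensional Series
as_frame = df.loc[[0]]   # list of labels: one-row DataFrame
as_slice = df.iloc[0:1]  # positional slice: one-row DataFrame

print(as_series.shape)
print(as_frame.shape)
print(as_slice.shape)
```

A Series has no row/column orientation, which is why transposing df.loc[0] directly appears to do nothing.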
