iterating over a dataframe - python

Take this dataframe as an example:
import pandas as pd
df = pd.DataFrame({'country': ['china', 'canada', 'usa'], 'value': [1000, 850, 1100], 'fact': [1000, 200, 850]})
df.index = df['country']
df = df.drop('country', axis=1)
I want to iterate over the value (GDP) of each country and, within this iteration, create a new column filled with 1 or 0 depending on a condition:
for x in df['value']:
    if x > 900:
        df['answer'] = 1
    else:
        df['answer'] = 0
I would expect a column with the following values:
[1,0,1]
Because Canada has a value lower than 900.
But instead I have a column full of ones.
What is wrong?

Use np.where:
import numpy as np

df["answer"] = np.where(df["value"] > 900, 1, 0)
Or:
df["answer"] = (df["value"] > 900).astype(int)
Output:
         value  fact  answer
country
china     1000  1000       1
canada     850   200       0
usa       1100   850       1
What's wrong with your code?
When you do df['answer'] = 1, the expression assigns 1 to all the rows in the answer column, so whichever value is evaluated last is what the whole column ends up holding.
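If you want to keep an explicit loop, a minimal fix (a sketch) is to collect one result per row in a list and assign the list as the new column once, after the loop:
# build one 0/1 entry per row, then assign the list as a whole column
answers = []
for x in df['value']:
    answers.append(1 if x > 900 else 0)
df['answer'] = answers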

It can even be done without iterating over each row using:
df['answer'] = df['value'].apply(lambda value: 1 if value > 900 else 0)
EDIT: You are assigning df['answer'] to a single value on each pass through the loop. The last value evaluated is 1, which is why the entire answer column ends up as 1 instead of just a particular row.

Related

What is the pythonic way to do a conditional count across pandas dataframe rows with apply?

I'm trying to do a conditional count across records in a pandas dataframe. I'm new to Python and have a working solution using a for loop, but running it on a large dataframe with ~200k rows takes a long time. I believe there is a better way by defining a function and using apply, but I'm having trouble figuring it out.
Here's a simple example.
Create a pandas dataframe with two columns:
import pandas as pd
data = {'color': ['blue', 'green', 'yellow', 'blue', 'green', 'yellow', 'orange', 'purple', 'red', 'red'],
        'weight': [4, 5, 6, 4, 1, 3, 9, 8, 4, 1]}
df = pd.DataFrame(data)
# for each row, count the number of other rows with the same color and a lesser weight
counts = []
for i in df.index:
    c = df.loc[i, 'color']
    w = df.loc[i, 'weight']
    ct = len(df.loc[(df['color'] == c) & (df['weight'] < w)])
    counts.append(ct)
df['counts, same color & less weight'] = counts
For each record, the 'counts, same color & less weight' column is intended to hold a count of the other records in the df with the same color and a lesser weight. For example, the result for row 0 (blue, 4) is zero because no other record with color=='blue' has a lesser weight. The result for row 1 (green, 5) is 1 because row 4 is also color=='green' but has weight==1.
How do I define a function that can be applied to the dataframe to achieve the same?
I'm familiar with apply, for example to square the weight column I'd use:
df['weight squared'] = df['weight'].apply(lambda x: x**2)
... but I'm unclear how to use apply to do a conditional calculation that refers to the entire df.
Thanks in advance for any help.
We can do this with groupby and transform('min'):
df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
0 0
1 1
2 1
3 0
4 0
5 0
6 0
7 0
8 1
9 0
Name: weight, dtype: int64
#df['c...]=df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
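Note that this yields a 0/1 indicator (is the row's weight above its color group's minimum?), which matches the expected counts here only because no color has more than two rows. For a true count via apply, as the question asked, one sketch (my own, not from the answer above):
# for each row, count the other rows sharing its color with strictly smaller weight
df['counts, same color & less weight'] = df.apply(
    lambda r: ((df['color'] == r['color']) & (df['weight'] < r['weight'])).sum(),
    axis=1)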

How to get pandas dataframe series name given a column value?

I have a python pandas dataframe with a bunch of names and series, and I create a final column where I sum up the series. I want to get just the row names where the sum of the series equals 0, so I can later delete those rows. My dataframe is as follows (the last column is one I create just to sum up the series):
1 2 3 4 total
Ash 1 0 1 1 3
Bel 0 0 0 0 0
Cay 1 0 0 0 1
Jeg 0 1 1 1 3
Jut 1 1 1 1 4
Based on the last column, the series "Bel" is 0, so I want to be able to print out that name only, and then later I can delete that row or keep a record of these rows.
This is my code so far:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for values in df['total']:
        if values == 0:
            print(df.index[values])
But this is obviously wrong because I am passing 0 as the index, which will always print the name of the first row. What method can I implement here instead?
There are great solutions below, and I also found a way using a simpler Python skill, enumerate (because I still find list comprehensions hard to write):
def check_empty(df):
    df['total'] = df.sum(axis=1)
    for name, values in enumerate(df['total']):
        if values == 0:
            print(df.index[name])
One possible way is to filter df using the value in total:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    index = df[df['total'] == 0].index.values.tolist()
    print(index)
If you would like to iterate through rows, then df.iterrows() is another way as well:
def check_empty(df):
    df['total'] = df.sum(axis=1)  # create the 'total' column to find zeroes
    for index, row in df.iterrows():
        if row['total'] == 0:
            print(index)
Another option is np.where.
import numpy as np
df.iloc[np.where(df.loc[:, 'total'] == 0)]
Output:
1 2 3 4 total
Bel 0 0 0 0 0
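Since the goal stated in the question is to delete those rows afterwards, a minimal sketch of that step (assuming the total column has already been added):
# keep only rows whose total is non-zero
df = df[df['total'] != 0]
# or equivalently, drop by the collected index labels
df = df.drop(df[df['total'] == 0].index)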

drop rows after sum condition reached

I want to drop rows from my data frame after I hit some value.
example data set:
num value
1 2000
2 3000
3 2000
x = 5000 # my limiter
y = 0 # my bucket for values
# I want to do something like...
for row in df:
    if y <= x:
        y =+ df["Values"]
    elif y > x:
        df.drop(row)
        continue
The elif might not make sense, but it expresses the idea; it is the parsing I am more concerned with. I can't seem to use df["Values"] in my embedded if statement.
I get the error:
ValueError: The truth value of a Series is ambiguous.
which is odd because I can run this line by itself outside of the if statement.
Use boolean indexing with cumsum:
x = 5000
df = df[df['value'].cumsum() <= x]
print (df)
num value
0 1 2000
1 2 3000
Detail:
print (df['value'].cumsum())
0 2000
1 5000
2 7000
Name: value, dtype: int64
print (df['value'].cumsum() <= x)
0 True
1 True
2 False
Name: value, dtype: bool
You get this error message because you assign the whole column to your variable y. Instead, you want to assign only the value from the value column and add it to your variable.
#print(df)
#num value
#1 2000
#2 3000
#3 2000
#4 4000
#5 1000
x = 5000
y = 0
#iterate over rows
for index, row in df.iterrows():
    if y < x:
        #add the value to y
        y += row["value"]
    elif y >= x:
        #drop rest of the dataframe
        df = df.drop(df.index[index:])
        break
#output from print(df)
# num value
#0 1 2000
#1 2 3000
But it would be faster if you just used pandas' built-in cumsum function (see jezrael's answer for details).

Pandas DataFrame: How to print single row horizontally?

A single row of a DataFrame prints vertically, i.e. column_name then column_value on one line, with the next line containing the next column_name and column_value. For example, the code below
import pandas as pd
df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    # other operations go here....
    print(row)
Output for the first row comes out as
0 100
1 200
2 300
Name: 0, dtype: int64
Is there a way to have each row printed horizontally, ignoring the dtype and Name? Example for the first row:
0 1 2
100 200 300
Use the to_frame method, then transpose with T:
df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    print(row.to_frame().T)
0 1 2
0 100 200 300
0 1 2
1 400 500 600
Note:
This is similar to @JohnE's answer in that the method to_frame is syntactic sugar around pd.DataFrame. In fact, if we follow the code
def to_frame(self, name=None):
    """
    Convert Series to DataFrame

    Parameters
    ----------
    name : object, default None
        The passed name should substitute for the series name (if it has
        one).

    Returns
    -------
    data_frame : DataFrame
    """
    if name is None:
        df = self._constructor_expanddim(self)
    else:
        df = self._constructor_expanddim({name: self})
    return df
it points to _constructor_expanddim:
@property
def _constructor_expanddim(self):
    from pandas.core.frame import DataFrame
    return DataFrame
As you can see, it simply returns the DataFrame callable.
Use the transpose property, T, on the single-row frame:
df.T
0 1 2
0 100 200 300
It seems like there should be a simpler answer to this, but try turning it into another DataFrame with one row.
data = {x: y for x, y in zip(df.columns, df.iloc[0])}
sf = pd.DataFrame(data, index=[0])
print(sf.to_string())
Sorta combining the two previous answers, you could do:
for index, ser in df.iterrows():
    print(pd.DataFrame(ser).T)
0 1 2
0 100 200 300
0 1 2
1 400 500 600
Basically what happens is that if you extract a row or column from a dataframe, you get a Series, which displays as a column. And it doesn't matter if you do ser or ser.T, it "looks" like a column. I mean, Series are one-dimensional, not two, but you get the point...
So anyway, you can convert the series to a dataframe with one row. (I changed the name from "row" to "ser" to emphasize what is happening above.) The key is that you have to convert to a dataframe first (which will be a column by default) and then transpose it.
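One more option (my own sketch, not from the answers above): selecting with a list of positions via iloc keeps the result a one-row DataFrame, so it already prints horizontally:
# df.iloc[[i]] returns a one-row DataFrame instead of a Series
for i in range(len(df)):
    print(df.iloc[[i]])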

Drop rows if value in a specific column is not an integer in pandas dataframe

If I have a dataframe and want to drop any rows where the value in one column is not an integer, how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them, I was hoping someone else might be.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.NaN, 'asdas',2]})
df
Out[212]:
entrytype
0 0
1 1
2 NaN
3 asdas
4 2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
entrytype
0 0
1 1
4 2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
entrytype
0 0
1 1
4 2
str("-1").isdigit() is False
str("-1").lstrip("-").isdigit() works but is not nice.
df.loc[df['Feature'].str.match('^[+-]?\d+$')]
for your question the reverse set
df.loc[ ~(df['Feature'].str.match('^[+-]?\d+$')) ]
We have multiple ways to do this, but I found the following methods easy and efficient.
Quick Examples
# Using drop() to delete rows where Fee >= 24000
df.drop(df[df['Fee'] >= 24000].index, inplace=True)
# Boolean indexing: keep only the rows where Fee >= 24000
df2 = df[df.Fee >= 24000]
# If the column name contains a space,
# use bracket notation with quotes
df2 = df[df['column name'] >= 24000]
# Using loc
df2 = df.loc[df["Fee"] >= 24000]
# Select rows matching multiple column conditions
df2 = df[(df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows where Discount is None/NaN
df2 = df[df.Discount.notnull()]
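Back to the original question, a sketch (my own, not from the answers above) using pd.to_numeric to handle the mixed-type column: coerce everything to numbers, then keep only the allowed integers 0-2:
# non-numeric entries (e.g. 'asdas') become NaN under errors='coerce',
# and isin() rejects NaN, so only the integers 0, 1, 2 survive
s = pd.to_numeric(df['entrytype'], errors='coerce')
df2 = df[s.isin([0, 1, 2])]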
