I have a dataframe df_my that looks like this:
   id      name  age    major
0   1      Mark   34  English
1   2       Tom   55      Art
2   3     Peter   31  Science
3   4  Mohammad   23     Math
4   5      Mike   47      Art
...
I am trying to get the value of major (and only that value).
I used this, and it works fine when I know the id of the record:
df_my["major"][3]
returns
"Math"
Great, but I want to get the major for a variable record.
I used
i = 3
df_my.loc[df_my["id"]==i]["major"]
and also used
i = 3
df_my[df_my["id"]==i]["major"]
but they both return
3    Math
which includes the record index too.
How can I get just the major and nothing else?
You could use squeeze:
i = 3
out = df_my.loc[df_my['id']==i, 'major'].squeeze()
Another option is iat:
out = df_my.loc[df_my['id']==i, 'major'].iat[0]
Output:
'Science'
I also stumbled over this problem, from a slightly different angle:
df = pd.DataFrame({'First Name': ['Kumar'],
                   'Last Name': ['Ram'],
                   'Country': ['India'],
                   'num_var': 1})
>>> df.loc[(df['First Name'] == 'Kumar'), "num_var"]
0 1
Name: num_var, dtype: int64
>>> type(df.loc[(df['First Name'] == 'Kumar'), "num_var"])
<class 'pandas.core.series.Series'>
So it returns a Series (albeit one with only a single element). If you access through the index instead, you receive the plain integer:
df.loc[0, "num_var"]
1
type(df.loc[0, "num_var"])
<class 'numpy.int64'>
The answer on how to select the single value was already given above. Still, it is worth noting that accessing through an index always gives a single value, whereas accessing through a condition returns a Series. This is because an index lookup can only ever match one row, while a condition can match several.
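For completeness, a small sketch on the same toy dataframe: Series.item() collapses a one-element selection to a plain scalar (and raises if the selection has any other length), while .squeeze() quietly hands back the Series when several rows match:

import pandas as pd

df = pd.DataFrame({'First Name': ['Kumar'], 'num_var': [1]})

sel = df.loc[df['First Name'] == 'Kumar', 'num_var']
print(sel.item())     # 1  (raises ValueError unless exactly one element matched)
print(sel.squeeze())  # 1  (would return the Series unchanged if several rows matched)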
If one of the columns of your dataframe is the natural primary index for those data, then it's usually a good idea to make pandas aware of it by setting the index accordingly:
df_my.set_index('id', inplace=True)
Now you can easily get just the major value for any id value i:
df_my.loc[i, 'major']
Note that for i = 3, the output is 'Science', which is expected, as noted in the comments to your question above.
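Put together, a minimal end-to-end sketch with the sample data typed in by hand:

import pandas as pd

df_my = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'name': ['Mark', 'Tom', 'Peter', 'Mohammad', 'Mike'],
                      'age': [34, 55, 31, 23, 47],
                      'major': ['English', 'Art', 'Science', 'Math', 'Art']})

df_my.set_index('id', inplace=True)  # make 'id' the index

i = 3
print(df_my.loc[i, 'major'])  # 'Science'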
Related
I was going through the book, and there is a piece of code that I'm able neither to run nor debug. It's on page 180, if anyone is interested.
I'm trying to replace each of the many categories in an object-type column with a single value. This is how the column looks:
Company_size
Just me
More than 5,000
More than 5,000
Not sure
2-10
11-50
51-500
...
And the value 'Just me' should be replaced by 1, 'More than 5,000' by 5000, 'Not sure' by NaN, '2-10' by 2, '11-50' by 11 and so on.
The code to replace it is:
jb2 = (jb[uniq_cols]
       .rename(columns=lambda c: c.replace('.', '_'))
       .assign(company_size=lambda df_: df_.company_size.replace(
           {'Just me': 1, 'Not sure': np.nan, 'More than 5,000': 5000,
            '2-10': 2, '11-50': 11, '51-500': 51,
            '501-1,000': 501, '1,001-5,000': 1001})))
This code involves two preceding steps:
keeping only the relevant columns via the list called uniq_cols, and
converting dots into underscores in the column headers.
The code converts the first three values ('Just me', 'Not sure' and 'More than 5,000') properly; the rest remain unchanged.
I have tried to use .replace() at a fundamental level to see how it works. The code:
>>> df = pd.DataFrame.from_dict({'total revenue':['0-10', '11-100', '101-500', '501-1000']})
>>> df
total revenue
0 0-10
1 11-100
2 101-500
3 501-1000
>>> df['total revenue'] = df['total revenue'].replace({'0-10':0,'11-100':11,'101-500':101,'501-1000':501})
>>> df['total revenue']
0 0
1 11
2 101
3 501
Name: total revenue, dtype: int64
>>>
This works perfectly. I have tried converting the original column from object to string before using .replace(), and also tried adding inplace=True to the original code, but neither works. I don't know whether the code is inherently broken or whether there is a gap in my understanding of .replace().
The link to the dataset.
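Without seeing the dataset this is only a guess, but .replace() only substitutes values that match the mapping keys exactly, so it is worth checking whether the surviving categories carry stray whitespace or slightly different punctuation. A small sketch on made-up data:

import pandas as pd

df = pd.DataFrame({'company_size': ['Just me', '11-50 ', '51-500']})   # note the trailing space

out = df['company_size'].replace({'Just me': 1, '11-50': 11, '51-500': 51})
print(out.tolist())   # [1, '11-50 ', 51] -- the non-matching string is left untouched

# stripping whitespace first is one possible fix, if that turns out to be the cause
out2 = df['company_size'].str.strip().replace({'Just me': 1, '11-50': 11, '51-500': 51})
print(out2.tolist())  # [1, 11, 51]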
I don't understand why I am getting the dreaded warning when I am doing exactly as instructed by the official documentation.
We have a dataframe called a
a = pd.DataFrame(data=[['Tom', 1],
                       ['Tom', 1],
                       ['Dick', 1],
                       ['Dick', 1],
                       ['Harry', 1],
                       ['Harry', 1]], columns=['Col1', 'Col2'])
a
Out[377]:
Col1 Col2
0 Tom 1
1 Tom 1
2 Dick 1
3 Dick 1
4 Harry 1
5 Harry 1
First we create a "holder" dataframe:
holder = a
Then we create a subset of a:
c = a.loc[a['Col1'] == 'Tom',:]
c
Out[379]:
Col1 Col2
0 Tom 1
1 Tom 1
We create another subset d, which will be added to (a slice of) the previous subset c, but once we try to add d to c we get the warning:
d = a.loc[a['Col1'] == 'Tom','Col2']
d
Out[389]:
0 1
1 1
c.loc[:,'Col2'] += d
C:\Users\~\anaconda3\lib\site-packages\pandas\core\indexing.py:494: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[item] = s
I would like to understand what I am doing wrong, because I use this logic very often (coming from R, where not everything is a darn object).
After noticing a different issue, I found a solution.
Whenever you say
dataframe_A = dataframe_B
you need to proceed with caution, because Python does not create a new dataframe: it joins these two names at the hip, so to speak, making them refer to the same object. If you make changes to dataframe_B, your dataframe_A will also change!
I understand just enough to fix the problem by using .copy(deep=True), which makes Python create a full, independent copy so that you can make changes to one without affecting the other.
On further investigation, and for those interested, this comes down to assignment binding another name to the same object, what other languages would call references or "pointers", a concept whose full scope goes beyond this specific question.
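A minimal sketch of the fix on the example dataframe from the question: taking an explicit copy up front makes the later assignment safe and warning-free:

import pandas as pd

a = pd.DataFrame(data=[['Tom', 1], ['Tom', 1], ['Dick', 1]],
                 columns=['Col1', 'Col2'])

c = a.loc[a['Col1'] == 'Tom', :].copy()   # an independent copy, not a view of a
d = a.loc[a['Col1'] == 'Tom', 'Col2']

c.loc[:, 'Col2'] += d   # modifies c only; a is untouched and no warning is raised
print(c)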
I am trying to replace all the Country ISO codes to Full Country Names to keep everything consistent as part of cleaning some data. I managed to find the pycountry package, which helps a ton! There are some fields on the CSV file that are empty, which I believe is causing some issues when running my code below.
Also, an additional question: not sure if it's just me, but sometimes read_csv leaves empty cells as null/NaN and sometimes as simply empty. I don't really know what went wrong there, but if possible I would like to turn all those empty cells into one consistent "thing" or type, for ease of filtering or dropping them.
df = pd.read_csv("file.csv")
#use pycountry to match the Nationalities as actual country names
import pycountry
list_alpha_2 = [i.alpha_2 for i in list(pycountry.countries)]
list_alpha_3 = [i.alpha_3 for i in list(pycountry.countries)]
def country_flag(df):
    if len(df['Nationality']) == 2 and df['Nationality'] in list_alpha_2:
        return pycountry.countries.get(alpha_2=df['Nationality']).name
    elif len(df['Nationality']) == 3 and df['Nationality'] in list_alpha_3:
        return pycountry.countries.get(alpha_3=df['Nationality']).name
    elif len(df['Nationality']) > 3:
        return df['Nationality']
    else:
        return '#N/A'

df['Nationality'] = df.apply(country_flag, axis=1)
df
I was expecting the result to be something like:
0 AF 100 Afghanistan
1 #N/A
2 AUS 140 Australia
3 Germany 400 Germany
The error message I am getting is
TypeError: ("object of type 'float' has no len()", 'occurred at index 0')
Yet, there shouldn't be any float type values in the 'Nationality' column I am working on. I am guessing this is simply the empty/null/NaN values being considered a float type?
One idea is to remove missing values first with Series.dropna and then use Series.apply:
print (df)
Nationality
0 AF
1 NaN
2 AUS
3 Germany
import numpy as np
import pycountry

list_alpha_2 = [i.alpha_2 for i in list(pycountry.countries)]
list_alpha_3 = [i.alpha_3 for i in list(pycountry.countries)]

def country_flag(x):
    if len(x) == 2 and x in list_alpha_2:
        return pycountry.countries.get(alpha_2=x).name
    elif len(x) == 3 and x in list_alpha_3:
        return pycountry.countries.get(alpha_3=x).name
    elif len(x) >= 3:
        return x
    else:
        return np.nan

df['Nationality'] = df['Nationality'].dropna().astype(str).apply(country_flag)
print (df)
Nationality
0 Afghanistan
1 NaN
2 Australia
3 Germany
One thing to watch out for: when pandas reads from a data source and tries to assign a data type to each column automatically, it will sometimes pick a different type than you would expect, depending on whether there are empty values in the data source.
A classic example is integer values being converted to floats.
If you have a CSV file with this exact content (note missing value in row 2 of column A):
ColA,ColB
0,2
,1
5,4
then reading the file with
res_df = pd.read_csv(filename)
will create a dataframe with floats in column A and integers in column B.
This is because there is no canonical way to represent an "empty" value as an integer, whereas a float can simply be set to NaN (not a number).
If that value were present, you would get two columns of integers.
Just something to be aware of: it is easily forgotten, and then you suddenly get floats instead of the integers you expected and wonder why.
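If you would rather keep integers despite the gap, one option (a sketch assuming a reasonably recent pandas version) is the nullable Int64 dtype, which stores the missing entry as <NA>:

import pandas as pd
from io import StringIO

csv = "ColA,ColB\n0,2\n,1\n5,4"

print(pd.read_csv(StringIO(csv)).dtypes)                           # ColA float64, ColB int64
print(pd.read_csv(StringIO(csv), dtype={'ColA': 'Int64'}).dtypes)  # ColA Int64,   ColB int64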
I am currently doing a project for school and I ran into a small problem. I have an Airbnb dataset, and I am trying to fill some NaN values in a column called property_type with the most common property type for each category of the column accommodates (which says how many people that particular listing can host).
Here's a sample of the columns
property_type accommodates
Townhouse 2
Apartment 3
Townhouse 4
Townhouse 2
NaN 3
Townhouse 2
House 3
... ...
In this case, what I want to do is find the most frequent property type among listings that accommodate 3 people, and fill the NaN values with that property type.
My problem is in getting that most common value (I know what to do afterwards, but this step is not working).
I tried to find the most common values with this code
property_type_mode = airbnb[['property_type','accommodates']].groupby(['accommodates']).agg(lambda x:x.value_counts().index[0])
This returns the error:
IndexError: index 0 is out of bounds for axis 0 with size 0
I don't get why, because I've done similar things for other columns and they work.
Does anyone know what I can do to solve this?
Thank you for your time!
I think an empty index array is returned for some group (one reason being missing values), so selecting its first element raises the error. One solution is to use next with iter, which lets you supply a fallback value if there is no match:
f = lambda x: next(iter(x.value_counts().index), 'no match')
s = airbnb.groupby('accommodates')['property_type'].agg(f)
airbnb['property_type'] = airbnb['property_type'].fillna(airbnb['accommodates'].map(s))
Another solution is to drop the missing values first with dropna:
f = lambda x: x.value_counts().index[0]
s = airbnb.dropna(subset=['accommodates']).groupby('accommodates')['property_type'].agg(f)
airbnb['property_type'] = airbnb['property_type'].fillna(airbnb['accommodates'].map(s))
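For reference, a self-contained run of the first approach on a made-up slice of the sample data:

import pandas as pd

airbnb = pd.DataFrame({'property_type': ['Townhouse', 'Apartment', 'Townhouse', None, 'Apartment'],
                       'accommodates': [2, 3, 2, 3, 3]})

f = lambda x: next(iter(x.value_counts().index), 'no match')
s = airbnb.groupby('accommodates')['property_type'].agg(f)   # 2 -> Townhouse, 3 -> Apartment

airbnb['property_type'] = airbnb['property_type'].fillna(airbnb['accommodates'].map(s))
print(airbnb)   # the NaN row becomes 'Apartment'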
This is the pandas dataframe I am referring to.
Basically, I would like to display a count of 'crime type' based on 'council'. So, for example, where council == 'Fermanagh and omagh', count the different values for each 'crime type', if that makes sense? So 'Burglary' might be equal to 1, whereas 'Anti-social behaviour' would be 3 for another council. I would then like to plot these values on a bar graph.
Hope this makes some sense. Any help would be great. Thank you.
I believe you need groupby with size:
df1 = df.groupby(['crime type', 'council']).size().reset_index(name='Count')
EDIT:
df = pd.DataFrame({'crime type': ['Anti-social behaviour', 'Anti-social behaviour',
                                  'Burglary', 'Burglary'],
                   'council': ['Fermanagh and omagh', 'Belfast', 'Belfast', 'Belfast']})
df1 = df.groupby(['council', 'crime type']).size().unstack(fill_value=0)
print (df1)
crime type Anti-social behaviour Burglary
council
Belfast 1 2
Fermanagh and omagh 1 0
df1.plot.bar()
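And if you only want the counts for a single council, a quick follow-up using the same df:

# counts for one council only, using the df defined above
print(df.loc[df['council'] == 'Fermanagh and omagh', 'crime type'].value_counts())
# Anti-social behaviour    1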