I'm working on an ML project to predict answer times on Stack Overflow based on tags. Sample data:
Unnamed: 0 qid i qs qt tags qvc qac aid j as at
0 1 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563372 67183.0 2 1235000501
1 2 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563374 66554.0 0 1235000551
2 3 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563358 15842.0 3 1235000177
3 4 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563413 893.0 18 1235001545
4 5 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563454 11649.0 4 1235002457
I'm stuck at the data cleaning process. I intend to create a new column named 'time_taken' which stores the difference between the at and qt columns.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))
for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
File "<ipython-input-39-9384be9e5531>", line 1, in <module>
val = df.at[0]
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
return super().__getitem__(key)
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem here lies in the indexing of df.at.
The types of df.qt and df.at are
<class 'pandas.core.series.Series'>
<class 'pandas.core.indexing._AtIndexer'> respectively.
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.
There is, to put it mildly, an easier way to do this.
df['time_taken'] = df['at'] - df.qt
The AtIndexer issue comes up because .at is a pandas indexer (the label-based scalar accessor), so attribute access like df.at resolves to the indexer rather than your column. You want to avoid naming columns after existing pandas attributes and methods for this reason. You can get around it just by indexing with df['at'] instead of df.at.
Besides that, this operation (if I'm understanding it correctly) can be done in one short line instead of a long for loop.
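Here is a minimal sketch of the name collision (toy data, not your dataset): attribute access df.at always resolves to pandas' scalar indexer, while bracket indexing reaches the actual column.

import pandas as pd

# Hypothetical toy frame with a column named 'at', for illustration only
df = pd.DataFrame({'qt': [100, 200], 'at': [150, 260]})

print(type(df.at))     # <class 'pandas.core.indexing._AtIndexer'>, the accessor wins
print(type(df['at']))  # <class 'pandas.core.series.Series'>, the actual column

# The vectorized subtraction, no loop needed
df['time_taken'] = df['at'] - df['qt']
print(df)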
I'm having trouble creating a new column based on the columns 'language_1' and 'language_2' in a pandas DataFrame. I want to create a 'bilingual' column where a 1 represents a user who speaks both English and Spanish (bilingual) and a 0 a non-bilingual speaker. Ultimately I want to compare their average ratings to each other, but I want to categorize them first. I tried using if statements, but I'm not sure how to write one that combines multiple conditions to produce a single value. Thank you for any help.
===============================================================================================
name language_1 language_2 rating bilingual
Kevin English Null 4.25
Miguel English Spanish 4.56
Carlos English Spanish 4.61
Aaron Null Spanish 4.33
===============================================================================================
Here is the code I've tried to use to append the new column to my dataframe.
def label_bilingual(row):
    if row('language_english') == row['English'] and row('language_spanish') == 'Spanish':
        val = 1
    else:
        val = 0

df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
Here is the error I'm getting.
----> 1 df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
'Series' object is not callable
You have a few issues with your function, one of which is causing your error and a few more that will cause problems afterwards.
1 - You have tried to call the column with row('name') which is not callable.
df('row')
Traceback (most recent call last):
File "<pyshell#30>", line 1, in <module>
df('row')
TypeError: 'DataFrame' object is not callable
2 - You have tried to compare row['column'] to row['English'] which will not work, as a column named English does not exist
KeyError: 'English'
3 - You assign values to val but never return them.
val = 1
val = 0
You need to modify your function as below to resolve these errors.
def label_bilingual(row):
    if row['language_1'] == 'English' and row['language_2'] == 'Spanish':
        return 1
    else:
        return 0
Output
>>> df['bilingual'] = df.apply(label_bilingual, axis=1)
>>> df
name language_1 language_2 rating bilingual
0 Kevin English Null 4.25 0
1 Miguel English Spanish 4.56 1
2 Carlos English Spanish 4.61 1
3 Aaron Null Spanish 4.33 0
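For what it's worth, the same flag can also be computed without apply via a vectorized comparison (a sketch, assuming the literal values 'English' and 'Spanish' as in the sample data):

import pandas as pd

# Hypothetical reconstruction of the sample frame
df = pd.DataFrame({
    'name': ['Kevin', 'Miguel', 'Carlos', 'Aaron'],
    'language_1': ['English', 'English', 'English', None],
    'language_2': [None, 'Spanish', 'Spanish', 'Spanish'],
    'rating': [4.25, 4.56, 4.61, 4.33],
})

# Both comparisons yield boolean Series; & combines them row-wise
df['bilingual'] = ((df['language_1'] == 'English') &
                   (df['language_2'] == 'Spanish')).astype(int)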
To make it simpler I'd suggest having missing values in either column recorded as numpy.nan. For example, if missing values were recorded as np.nan:
bilingual = np.where(df[['language_1', 'language_2']].isna().any(axis=1), 0, 1)
df['bilingual'] = bilingual
Here np.where checks the condition inside, which in turn checks whether a value in either of the language columns is missing. If true, then the person is not bilingual and gets a 0, otherwise a 1.
def stack_plot(data, xtick, col2='project_is_approved', col3='total'):
    ind = np.arange(data.shape[0])

    plt.figure(figsize=(20,5))
    p1 = plt.bar(ind, data[col3].values)
    p2 = plt.bar(ind, data[col2].values)

    plt.ylabel('Projects')
    plt.title('Number of projects approved vs rejected')
    plt.xticks(ind, list(data[xtick].values))
    plt.legend((p1[0], p2[0]), ('total', 'accepted'))
    plt.show()

def univariate_barplots(data, col1, col2='project_is_approved', top=False):
    # Count number of zeros in dataframe python: https://stackoverflow.com/a/51540521/4084039
    temp = pd.DataFrame(project_data.groupby(col1)[col2].agg(lambda x: x.eq(1).sum())).reset_index()

    # Pandas dataframe groupby count: https://stackoverflow.com/a/19385591/4084039
    temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
    temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']

    temp.sort_values(by=['total'], inplace=True, ascending=False)

    if top:
        temp = temp[0:top]

    stack_plot(temp, xtick=col1, col2=col2, col3='total')
    print(temp.head(5))
    print("="*50)
    print(temp.tail(5))

univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
Error:
SpecificationError Traceback (most recent call last)
<ipython-input-21-2cace8f16608> in <module>()
----> 1 univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
<ipython-input-20-856fcc83737b> in univariate_barplots(data, col1, col2, top)
4
5 # Pandas dataframe grouby count: https://stackoverflow.com/a/19385591/4084039
----> 6 temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
7 print (temp['total'].head(2))
8 temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
251 # but not the class list / tuple itself.
252 func = _maybe_mangle_lambdas(func)
--> 253 ret = self._aggregate_multiple_funcs(func)
254 if relabeling:
255 ret.columns = columns
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in _aggregate_multiple_funcs(self, arg)
292 # GH 15931
293 if isinstance(self._selected_obj, Series):
--> 294 raise SpecificationError("nested renamer is not supported")
295
296 columns = list(arg.keys())
SpecificationError: nested renamer is not supported
change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
to
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(total='count')).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(Avg='mean')).reset_index()['Avg']
reason: in newer pandas versions, named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (see "Deprecate groupby.agg() with a dictionary when renaming").
source: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html
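Here is a minimal, self-contained illustration of the change (toy data, not the original dataset):

import pandas as pd

df = pd.DataFrame({'state': ['CA', 'CA', 'NY'],
                   'approved': [1, 0, 1]})

# Old dict style, raises SpecificationError on pandas >= 0.25:
# df.groupby('state')['approved'].agg({'total': 'count', 'Avg': 'mean'})

# Named aggregation, the supported replacement:
out = df.groupby('state')['approved'].agg(total='count', Avg='mean').reset_index()
print(out)
#   state  total  Avg
# 0    CA      2  0.5
# 1    NY      1  1.0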
This error also happens if a column specified in the aggregation function dict does not exist in the dataframe:
In [190]: group = pd.DataFrame([[1, 2]], columns=['A', 'B']).groupby('A')
In [195]: group.agg({'B': 'mean'})
Out[195]:
B
A
1 2
In [196]: group.agg({'B': 'mean', 'non-existing-column': 'mean'})
...
SpecificationError: nested renamer is not supported
I found a way: instead of going like
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{"maxQ":np.max,"minQ":np.min,"meanQ":np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ","meanQ"]
do as follows:
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{np.max,np.min,np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ","meanQ"]
I had the same error and this is how I resolved it! Be aware, though, that a Python set is unordered, so check the resulting column order before renaming positionally like this.
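If you want the output names to be deterministic instead, named aggregation on the grouped frame avoids the positional rename entirely (a sketch, assuming the same df and Quantity column as above):

g2 = (df.groupby(["Description", "CustomerID"], as_index=False)
        .agg(maxQ=("Quantity", "max"),
             minQ=("Quantity", "min"),
             meanQ=("Quantity", "mean")))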
Do you get the same error if you change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
to
temp['total'] = project_data.groupby(col1)[col2].agg(total='count').reset_index()['total']
Instead of using .agg({'total':'count'}), you can pass the name with the function as a list of tuples, like .agg([('total', 'count')]), and use the same for Avg as well. Hope it works.
I got the same issue as #akshay jindal, but I checked the documentation as suggested by #artikay Khanna and the problem was solved: some functions have been adjusted, and the old ones are deprecated. Here is the warning provided on the last execution:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version. Use named aggregation instead.
>>> grouper.agg(name_1=func_1, name_2=func_2)
"""Entry point for launching an IPython kernel.
Therefore, I suggest trying
grouper.agg(name_1=func_1, name_2=func_2)
Hope this helps.
Not a very elegant solution, but this one works. Renaming the columns the way you are doing is deprecated, but there is a workaround: create a temporary variable 'approved' and store col2 in it, because when you apply the agg function the original column values are replaced under the new column names. You can preserve the column name, but then the values in that column will change. So in order to preserve the original dataframe and still get two new columns with the desired names, you can use the following code.
approved = temp[col2]
temp = pd.DataFrame(project_data.groupby(col1)[col2].agg([('Avg','mean'),('total','count')]).reset_index())
temp[col2] = approved
P.S.: Seems like an assignment from AAIC; I am working on the same :)
Sometimes it's convenient to keep an aggdict of how each column should be transformed under aggregation that will work with different column sets and different group by columns. You can do this with the new syntax fairly easily by unpacking the dict with **. Here's a minimal working example for simple data.
dfx=pd.DataFrame(columns=["A","B","C"],data=np.random.randint(0,5,size=(10,3)))
#dfx
#
# A B C
#0 4 4 1
#1 2 4 4
#2 1 3 3
#3 2 4 3
#4 1 2 1
#5 0 4 2
#6 2 3 4
#7 1 0 2
#8 2 1 4
#9 3 0 3
Maybe when you agg you want the first "A", the last "B", the mean of "C", and sometimes your pipeline has a "D" (but not this time) that you also want the mean of.
aggdict = {"A": lambda x: x.iloc[0], "B": lambda x: x.iloc[-1], "C": "mean", "D": "mean"}
You can build a simple dict like the old days and then unpack it with ** filtering on the relevant keys:
gb_col="C"
gbc = dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
# A B
#C
#1 4 2
#2 0 0
#3 1 4
#4 2 3
And then you can slice and dice how you want with the same syntax:
mygb = lambda gb_col: dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
allgb = [mygb(c) for c in dfx.columns]
I have tried all the solutions, and it turned out to be an error with the name. If your column name contains reserved keywords such as "in" or "is", it throws this error. In my case, my column name was "Points in Polygon" and I resolved the issue by renaming the column to "Points".
#Rishi's solution worked for me. The original name of the column in my dataframe was net_value_budgeted_rate, which was essentially the dollar value of the sale. I changed it to dollars and it worked.
Info = pd.DataFrame(df.groupby("school_state").agg(Approved=("project_is_approved",lambda x: x.eq(1).sum()),Total=("project_is_approved","count"),Avg=("project_is_approved","mean"))).reset_index().sort_values(by=["Total"],ascending=False).head()
You can break this into individual commands for better readability.
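For example, one possible split of the same logic (a sketch, assuming df holds the project data):

grouped = df.groupby("school_state")

summary = grouped.agg(
    Approved=("project_is_approved", lambda x: x.eq(1).sum()),
    Total=("project_is_approved", "count"),
    Avg=("project_is_approved", "mean"),
)

Info = (summary.reset_index()
               .sort_values(by=["Total"], ascending=False)
               .head())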
I have these two columns in my csv (Address of New Home and Can, short for Cancelled). If any address is cancelled, True has to be written under Can, but sometimes the end user forgets to write True and the same address appears twice. I want Python to tell me (not remove) the addresses that appear twice without the first occurrence being cancelled out.
Example:
Date_Booked Address of New Home Can
01/07/2017 1234 SO Drive True
02/14/2017 4321 Python Court
03/17/2017 1234 SO Drive
03/23/2017 4321 Python Court
As you can see from the example above, 1234 SO Drive was cancelled and True was written; this is what we want. 4321 Python Court was also cancelled, which is why it was written twice, but since it does not say True under Can, it will show up twice in our csv and cause all sorts of issues.
import pandas as pd

first = pd.read_csv('Z:PCR.csv')
df = pd.DataFrame(first)

non_cancelled = df['Can'].apply(lambda x: x != 'True')
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len(x) > 1)

if not dup_addresses.empty:
    raise Exception('Same address written twice without cancellation')
I am getting the following error:
Traceback (most recent call last):
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)
TypeError: an integer is required
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len (x) > 1)
KeyError: 'Address of New Home'
Any assistance would be greatly appreciated.
This should update your Can column, keeping the True values that are already there and adding the ones that were missed.
can = df.duplicated(subset=['Address of New Home'], keep='last')
df['Can'] = df.Can.combine_first(can.where(can, ''))
print(df)
Date_Booked Address of New Home Can
0 01/07/2017 1234 SO Drive True
1 02/14/2017 4321 Python Court True
2 03/17/2017 1234 SO Drive
3 03/23/2017 4321 Python Court
Per request
can = df.duplicated(subset=['Address of New Home'], keep='last')
df['Can'] = df.Can.combine_first(pd.Series(np.where(can, 'Missed', ''), df.index))
print(df)
Date_Booked Address of New Home Can
0 01/07/2017 1234 SO Drive True
1 02/14/2017 4321 Python Court Missed
2 03/17/2017 1234 SO Drive
3 03/23/2017 4321 Python Court
Your column is Address_of_New_Home, not Address of New Home. Just forgot the underscores
The problem is in this statement:
non_cancelled = df['Can'].apply(lambda x: x != 'True')
When you apply this, you are applying it to the series df['Can'], so the method returns a series, not the full DataFrame. To illustrate, here is some code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.arange(0,5), 'b': np.arange(5,10), 'c': np.arange(10,15)})
print(df)
The output is this
a b c
0 0 5 10
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
But when I do this:
a = df['a'].apply(lambda x: x*20)
print(a)
I get:
0 0
1 20
2 40
3 60
4 80
To do what you would like to do, try this instead:
non_cancelled = df[df['Can'] != 'True']
This gives us all rows in the DataFrame where the condition (df['Can'] != 'True') evaluates to True. Note the quotes: in your csv the flag is stored as the string 'True', so compare against the string rather than the boolean.
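Putting it together for the original goal (a sketch, assuming the Can flag is stored as the string 'True' as in the sample):

import pandas as pd

# Hypothetical reconstruction of the sample csv
df = pd.DataFrame({
    'Date_Booked': ['01/07/2017', '02/14/2017', '03/17/2017', '03/23/2017'],
    'Address of New Home': ['1234 SO Drive', '4321 Python Court',
                            '1234 SO Drive', '4321 Python Court'],
    'Can': ['True', '', '', ''],
})

# A boolean mask on the DataFrame keeps whole rows, not just the Can series
non_cancelled = df[df['Can'] != 'True']

# Addresses still appearing more than once were booked twice with no cancellation
dup_addresses = non_cancelled.groupby('Address of New Home').filter(lambda x: len(x) > 1)

if not dup_addresses.empty:
    print(dup_addresses)  # reports 4321 Python Court, as desired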
I notice that many DataFrame methods, if used without parentheses, seem to behave like 'properties', e.g.
In [200]: df = DataFrame (np.random.randn (7,2))
In [201]: df.head ()
Out[201]:
0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
In [202]: df.head
Out[202]:
<bound method DataFrame.head of 0 1
0 -1.325883 0.878198
1 0.588264 -2.033421
2 -0.554993 -0.217938
3 -0.777936 2.217457
4 0.875371 1.918693
5 0.940440 -2.279781
6 1.152370 -2.733546>
How is this done, and is it good practice?
This is with pandas 0.15.1 on Linux.
They are different, and the second form is not recommended: one output clearly shows that it's a method (and happens to include the results in its repr), whilst the other shows the expected output.
Here's why you should not do this:
In [23]:
t = df.head
In [24]:
t.iloc[0]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-24-b523e5ce509d> in <module>()
----> 1 t.iloc[0]
AttributeError: 'function' object has no attribute 'iloc'
In [25]:
t = df.head()
t.iloc[0]
Out[25]:
0 0.712635
1 0.363903
Name: 0, dtype: float64
So OK, you don't use parentheses to call the method correctly and you see an output that appears valid, but if you take a reference to this and try to use it, you are operating on the method rather than on the slice of the df, which is not what you intended.
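A quick way to catch this class of mistake is to check whether you are holding the method or its result (a minimal sketch):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(7, 2))

t = df.head          # bound method, not data
print(callable(t))   # True, a red flag if you expected a DataFrame

t = df.head()        # calling it returns the actual DataFrame
print(isinstance(t, pd.DataFrame))  # True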