import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def stack_plot(data, xtick, col2='project_is_approved', col3='total'):
    ind = np.arange(data.shape[0])
    plt.figure(figsize=(20,5))
    p1 = plt.bar(ind, data[col3].values)
    p2 = plt.bar(ind, data[col2].values)
    plt.ylabel('Projects')
    plt.title('Number of projects approved vs rejected')
    plt.xticks(ind, list(data[xtick].values))
    plt.legend((p1[0], p2[0]), ('total', 'accepted'))
    plt.show()
def univariate_barplots(data, col1, col2='project_is_approved', top=False):
    # Count number of zeros in dataframe python: https://stackoverflow.com/a/51540521/4084039
    # (note: this uses the global project_data rather than the data argument)
    temp = pd.DataFrame(project_data.groupby(col1)[col2].agg(lambda x: x.eq(1).sum())).reset_index()

    # Pandas dataframe groupby count: https://stackoverflow.com/a/19385591/4084039
    temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
    temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']

    temp.sort_values(by=['total'], inplace=True, ascending=False)
    if top:
        temp = temp[0:top]

    stack_plot(temp, xtick=col1, col2=col2, col3='total')
    print(temp.head(5))
    print("="*50)
    print(temp.tail(5))
univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
Error:
SpecificationError Traceback (most recent call last)
<ipython-input-21-2cace8f16608> in <module>()
----> 1 univariate_barplots(project_data, 'school_state', 'project_is_approved', False)
<ipython-input-20-856fcc83737b> in univariate_barplots(data, col1, col2, top)
4
5 # Pandas dataframe grouby count: https://stackoverflow.com/a/19385591/4084039
----> 6 temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
7 print (temp['total'].head(2))
8 temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
251 # but not the class list / tuple itself.
252 func = _maybe_mangle_lambdas(func)
--> 253 ret = self._aggregate_multiple_funcs(func)
254 if relabeling:
255 ret.columns = columns
~\AppData\Roaming\Python\Python36\site-packages\pandas\core\groupby\generic.py in _aggregate_multiple_funcs(self, arg)
292 # GH 15931
293 if isinstance(self._selected_obj, Series):
--> 294 raise SpecificationError("nested renamer is not supported")
295
296 columns = list(arg.keys())
SpecificationError: nested renamer is not supported
Change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'Avg':'mean'})).reset_index()['Avg']
to
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(total='count')).reset_index()['total']
temp['Avg'] = pd.DataFrame(project_data.groupby(col1)[col2].agg(Avg='mean')).reset_index()['Avg']
Reason: in newer pandas versions, named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the outputs of column-specific aggregations (see "Deprecate groupby.agg() with a dictionary when renaming").
Source: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html
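For reference, here is a minimal, self-contained sketch of the fix on toy data (the column names mirror the question, but the data itself is made up):

import pandas as pd

# Toy stand-in for project_data; values are illustrative only
project_data = pd.DataFrame({
    'school_state': ['CA', 'CA', 'TX', 'TX', 'TX'],
    'project_is_approved': [1, 0, 1, 1, 0],
})

grouped = project_data.groupby('school_state')['project_is_approved']

# Old dict form raises SpecificationError on recent pandas:
# grouped.agg({'total': 'count'})

# Named aggregation names the outputs directly
temp = grouped.agg(total='count', Avg='mean').reset_index()
print(temp)
#   school_state  total       Avg
# 0           CA      2  0.500000
# 1           TX      3  0.666667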
This error also happens if a column specified in the aggregation function dict does not exist in the dataframe:
In [190]: group = pd.DataFrame([[1, 2]], columns=['A', 'B']).groupby('A')
In [195]: group.agg({'B': 'mean'})
Out[195]:
B
A
1 2
In [196]: group.agg({'B': 'mean', 'non-existing-column': 'mean'})
...
SpecificationError: nested renamer is not supported
I found a way: instead of going like
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{"maxQ":np.max,"minQ":np.min,"meanQ":np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ",'meanQ']
Do as follows:
g2 = df.groupby(["Description","CustomerID"],as_index=False).agg({'Quantity':{np.max,np.min,np.mean}})
g2.columns = ["Description","CustomerID","maxQ","minQ",'meanQ']
I had the same error and this is how I resolved it!
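Note that a plain set of functions leaves the resulting column order up to pandas, so hardcoding ["maxQ", "minQ", "meanQ"] afterwards is fragile. A sketch of an alternative (assuming the same df with Description, CustomerID, and Quantity columns), using named aggregation so the output names are deterministic and no manual rename is needed:

g2 = df.groupby(["Description", "CustomerID"], as_index=False).agg(
    maxQ=("Quantity", "max"),
    minQ=("Quantity", "min"),
    meanQ=("Quantity", "mean"),
)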
Do you get the same error if you change
temp['total'] = pd.DataFrame(project_data.groupby(col1)[col2].agg({'total':'count'})).reset_index()['total']
to
temp['total'] = project_data.groupby(col1)[col2].agg(total=('total','count')).reset_index()['total']
Instead of using .agg({'total':'count'}), you can pass the name together with the function as a list of tuples, like .agg([('total', 'count')]), and do the same for Avg. Hope it works.
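Concretely, that list-of-tuples form could look like this (a sketch assuming the question's col1 and col2):

temp = project_data.groupby(col1)[col2].agg([('total', 'count'), ('Avg', 'mean')]).reset_index()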
I got a similar issue to #akshay jindal, but I checked the documentation as suggested by #artikay Khanna and the problem was solved: some functions have been adjusted and the old ones are deprecated. Here is the warning provided on the last execution:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version. Use named aggregation instead.
>>> grouper.agg(name_1=func_1, name_2=func_2)
"""Entry point for launching an IPython kernel.
Therefore, I suggest trying
grouper.agg(name_1=func_1, name_2=func_2)
Hope this helps!
Not a very elegant solution, but it works. Renaming the column the way you are doing it is deprecated, but there is a workaround: create a temporary variable, approved, and store col2 in it, because when you apply the agg function the original column values are replaced along with the column name. You can preserve the column name, but the values in that column will change. So, to preserve the original dataframe and get two new columns with the desired names, you can use the following code.
approved = temp[col2]
temp = pd.DataFrame(project_data.groupby(col1)[col2].agg([('Avg','mean'),('total','count')]).reset_index())
temp[col2] = approved
P.S.: Seems like an AAIC assignment; I am working on the same one :)
Sometimes it's convenient to keep an aggdict describing how each column should be transformed under aggregation, one that works with different column sets and different group-by columns. You can do this with the new syntax fairly easily by unpacking the dict with **. Here's a minimal working example for simple data.
dfx=pd.DataFrame(columns=["A","B","C"],data=np.random.randint(0,5,size=(10,3)))
#dfx
#
# A B C
#0 4 4 1
#1 2 4 4
#2 1 3 3
#3 2 4 3
#4 1 2 1
#5 0 4 2
#6 2 3 4
#7 1 0 2
#8 2 1 4
#9 3 0 3
Maybe when you agg you want the first "A", the last "B", the mean "C" and sometimes your pipeline has a "D" (but not this time) that you also want the mean of.
aggdict = {"A":lambda x: x.iloc[0], "B": lambda x: x.iloc[-1], "C" : "mean" , "D":lambda x: "mean"}
You can build a simple dict like the old days and then unpack it with ** filtering on the relevant keys:
gb_col="C"
gbc = dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
# A B
#C
#1 4 2
#2 0 0
#3 1 4
#4 2 3
And then you can slice and dice how you want with the same syntax:
mygb = lambda gb_col: dfx.groupby(gb_col).agg(**{k:(k,v) for k,v in aggdict.items() if k in dfx.columns and k != gb_col})
allgb = [mygb(c) for c in dfx.columns]
I have tried all the solutions, and it turned out to be an error with the column name. If your column name contains built-in keywords such as "in" or "is", it can throw this error. In my case, my column name was "Points in Polygon", and I resolved the issue by renaming the column to "Points".
#Rishi's solution worked for me. The original name of the column in my dataframe was net_value_budgeted_rate, which was essentially the dollar value of the sale. I changed it to dollars and it worked.
Info = pd.DataFrame(df.groupby("school_state").agg(Approved=("project_is_approved",lambda x: x.eq(1).sum()),Total=("project_is_approved","count"),Avg=("project_is_approved","mean"))).reset_index().sort_values(by=["Total"],ascending=False).head()
You can break this into individual commands for better readability.
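For example, a sketch of the same computation split into steps (same logic, assuming the same df and columns):

grouped = df.groupby("school_state").agg(
    Approved=("project_is_approved", lambda x: x.eq(1).sum()),
    Total=("project_is_approved", "count"),
    Avg=("project_is_approved", "mean"),
)
Info = grouped.reset_index().sort_values(by=["Total"], ascending=False).head()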
I have many different tables that all have different column names and each refer to an outcome, like glucose, insulin, leptin etc (except keep in mind that the tables are all gigantic and messy with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The below code works, but I would like to, instead of copy + pasting final_report["outcome"] = over and over again, just run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" to the final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
insulin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
leptin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id':ids,
'start':start,
'end':end})
def find_result(subject, start, end, df):
df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by = "timepoint")
return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
final_report[f"{name}_result"] = final_report.apply(
lambda x: find_result(x['id'], x['start'], x['end'], df),
axis=1
)
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
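If you prefer to avoid the explicit loop, the same idea can be written as a single assign call by unpacking a dictionary comprehension (a sketch; the df=df default argument simply pins each dataframe to its lambda):

final_report = final_report.assign(**{
    f"{name}_result": final_report.apply(
        lambda x, df=df: find_result(x['id'], x['start'], x['end'], df),
        axis=1,
    )
    for name, df in input_dfs.items()
})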
I have a CSV file that looks like below, this is same like my last question but this is by using Pandas.
Group Sam Dan Bori Son John Mave
A 0.00258844 0.983322 1.61479 1.2785 1.96963 10.6945
B 0.0026034 0.983305 1.61198 1.26239 1.9742 10.6838
C 0.0026174 0.983294 1.60913 1.24543 1.97877 10.6729
D 0.00263062 0.983289 1.60624 1.22758 1.98334 10.6618
E 0.00264304 0.98329 1.60332 1.20885 1.98791 10.6505
I have a function like below
def getnewno(value):
value = value + 30
if value > 40 :
value = value - 20
else:
value = value
return value
I want to send all these values to the getnewno function and get a newvalue and update the CSV file. How can this be accomplished in Pandas.
Expected output:
Group Sam Dan Bori Son John Mave
A 30.00258844 30.983322 31.61479 31.2785 31.96963 20.6945
B 30.0026034 30.983305 31.61198 31.26239 31.9742 20.6838
C 30.0026174 30.983294 31.60913 31.24543 31.97877 20.6729
D 30.00263062 30.983289 31.60624 31.22758 31.98334 20.6618
E 30.00264304 30.98329 31.60332 31.20885 31.98791 20.6505
The following should give you what you desire.
Applying a function
Your function can be simplified and here expressed as a lambda function.
It's then a matter of applying your function to all of the columns. There are a number of ways to do so. The first idea that comes to mind is to loop over df.columns. However, we can do better than this by using the applymap or transform methods:
import pandas as pd
# Read in the data from file
df = pd.read_csv('data.csv',
sep='\s+',
index_col=0)
# Simplified function with which to transform data
getnewno = lambda value: value + 10 if value > 10 else value + 30
# Looping over columns
#for col in df.columns:
# df[col] = df[col].apply(getnewno)
# Apply to all columns without loop
df = df.applymap(getnewno)
# Write out updated data
df.to_csv('data_updated.csv')
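Note: in pandas 2.1 and later, DataFrame.applymap is deprecated in favour of the equivalent elementwise DataFrame.map, so on a recent install the call would instead be:

# pandas >= 2.1
df = df.map(getnewno)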
Using broadcasting
You can achieve your result using broadcasting and a little boolean logic. This avoids looping over any columns, and should ultimately prove faster and less memory intensive (although if your dataset is small any speed-up would be negligible):
import pandas as pd
df = pd.read_csv('data.csv',
sep='\s+',
index_col=0)
df += 30
make_smaller = df > 40
df[make_smaller] -= 20
First of all, your getnewno function looks too complicated... it can be simplified to e.g.:
def getnewno(value):
if value + 30 > 40:
return value - 20
else:
return value
you can even change value + 30 > 40 to value > 10.
Or even a oneliner if you want:
getnewno = lambda value: value-20 if value > 10 else value
Having the function, you can apply it to specific values/columns. For example, if you want to create a column Mark_updated based on a Mark column, it would look like this (I assume your pandas DataFrame is called df):
df['Mark_updated'] = df['Mark'].apply(getnewno)
Use the mask function for an if-else solution before writing the data to csv:
res = (df
.select_dtypes('number')
.add(30)
#the if-else comes in here
#if any entry in the dataframe is greater than 40, subtract 20 from it
#else leave as is
.mask(lambda x: x>40, lambda x: x.sub(20))
)
#insert the group column back
res.insert(0,'Group',df.Group.array)
Then write to csv:
res.to_csv(filename)
Group Sam Dan Bori Son John Mave
0 A 30.002588 30.983322 31.61479 31.27850 31.96963 20.6945
1 B 30.002603 30.983305 31.61198 31.26239 31.97420 20.6838
2 C 30.002617 30.983294 31.60913 31.24543 31.97877 20.6729
3 D 30.002631 30.983289 31.60624 31.22758 31.98334 20.6618
4 E 30.002643 30.983290 31.60332 31.20885 31.98791 20.6505
This has been puzzling me for a while. I have the following data set, denoted under raw_data, and have run two checks: #1 to identify a sample duplicate, and #2 to remove duplicates with drop_duplicates. Check #1 does identify duplicates, yet check #2 does not seem to remove any.
raw_data = {'link':
['https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLJ.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLH.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLj.html#cda8700ef5',
'https://www.otodom.pl/oferta/mieszkanie-w-spokojnej-okolicy-gdansk-lostowice-ID43FLh.html#cda8700ef5',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWU.html#9dca9667c3',
'https://www.otodom.pl/oferta/zielony-widok-mieszkanie-3m04-ID43EWu.html#9dca9667c3',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQM.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQJ.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQm.html#af24036d28',
'https://www.otodom.pl/oferta/nowoczesne-osiedle-gotowe-do-konca-roku-bazantow-ID43vQj.html#af24036d28',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWY.html#2d0084b7ea',
'https://www.otodom.pl/oferta/mieszkanie-56-m-warszawa-ID43sWy.html#2d0084b7ea',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152',
'https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152']}
df = pd.DataFrame(raw_data, columns = ["link"])
#duplicate check #1
a = df.iloc[12][0]
b = df.iloc[13][0]
if a == b:
print("equal")
#duplicate check #2
df.drop_duplicates(['link'], keep='first')
Output:
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
equal
link
0 https://www.otodom.pl/oferta/mieszkanie-w-spok...
1 https://www.otodom.pl/oferta/mieszkanie-w-spok...
2 https://www.otodom.pl/oferta/mieszkanie-w-spok...
3 https://www.otodom.pl/oferta/mieszkanie-w-spok...
4 https://www.otodom.pl/oferta/zielony-widok-mie...
5 https://www.otodom.pl/oferta/zielony-widok-mie...
6 https://www.otodom.pl/oferta/nowoczesne-osiedl...
7 https://www.otodom.pl/oferta/nowoczesne-osiedl...
8 https://www.otodom.pl/oferta/nowoczesne-osiedl...
9 https://www.otodom.pl/oferta/nowoczesne-osiedl...
10 https://www.otodom.pl/oferta/mieszkanie-56-m-w...
11 https://www.otodom.pl/oferta/mieszkanie-56-m-w...
12 https://www.otodom.pl/oferta/idealny-2pok-apar...
13 https://www.otodom.pl/oferta/idealny-2pok-apar...
Help would be appreciated, along with reasoning as to why the duplicates do not drop. Thanks!
You have to reassign the output of drop_duplicates either to df or to a new variable. It does not happen in-place.
df2 = df.drop_duplicates(['link'], keep='first')
The links provided are not the same.
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4X.html#64f19d3152
https://www.otodom.pl/oferta/idealny-2pok-apartament-0-pcc-widok-na-park-ID43q4x.html#64f19d3152
In one link it is X, and in the other it is x. Also, note that if a and b had been assigned from print(...) calls, they would both be None and compare equal, which would explain the "equal" output.
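If the case-only differences should in fact count as duplicates, one option is to deduplicate on a lowercased key (a sketch; whether case-insensitive matching is correct for these URLs is an assumption):

# Keep the first row for each case-insensitive link
df2 = df[~df['link'].str.lower().duplicated(keep='first')]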
I was using Python and pandas to do some statistical analysis on data, and at some point I needed to add some new columns with the assign function:
df_res = (
df
.assign(col1 = lambda x: np.where(x['event'].str.contains('regex1'),1,0))
.assign(col2 = lambda x: np.where(x['event'].str.contains('regex2'),1,0))
.assign(mycol = lambda x: np.where(x['event'].str.contains('regex3'),1,0))
.assign(newcol = lambda x: np.where(x['event'].str.contains('regex4'),1,0))
)
I wanted to know if there is any way to add the column names and my regexes to a dictionary and use a for loop or another lambda expression to assign these columns automatically:
Dic = {'col1':'regex1','col2':'regex2','mycol':'regex3','newcol':'regex4'}
df_res = (
df
.assign(...using Dic here...)
)
I need to add more columns later, and I think this would make that easier.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
Assigning multiple columns within the same assign is possible. For Python 3.6 and above, later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order. For Python 3.5 and below, the order of keyword arguments is not specified, you cannot refer to newly created or modified columns. All items are computed first, and then assigned in alphabetical order.
Changed in version 0.23.0: Keyword argument order is maintained for Python 3.6 and later.
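As a quick toy illustration of that ordering guarantee, a later keyword argument can refer to a column created by an earlier one in the same call:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
# On Python 3.6+, 'c' can use 'b', which is created in the same assign call
out = df.assign(b=lambda x: x['a'] * 2, c=lambda x: x['b'] + 1)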
If you map all your regex so that each dictionary value holds a lambda instead of just the regex, you can simply unpack the dic into assign:
lambda_dict = {
col:
lambda x, regex=regex: (
x['event'].
str.contains(regex)
.astype(int)
)
for col, regex in Dic.items()
}
res = df.assign(**lambda_dict)
EDIT
Here's an example:
import pandas as pd
import random
random.seed(0)
events = ['apple_one', 'chicken_one', 'chicken_two', 'apple_two']
data = [random.choice(events) for __ in range(10)]
df = pd.DataFrame(data, columns=['event'])
regex_dict = {
'apples': 'apple',
'chickens': 'chicken',
'ones': 'one',
'twos': 'two',
}
lambda_dict = {
col:
lambda x, regex=regex: (
x['event']
.str.contains(regex)
.astype(int)
)
for col, regex in regex_dict.items()
}
res = df.assign(**lambda_dict)
print(res)
# Output
event apples chickens ones twos
0 apple_two 1 0 0 1
1 apple_two 1 0 0 1
2 apple_one 1 0 1 0
3 chicken_two 0 1 0 1
4 apple_two 1 0 0 1
5 apple_two 1 0 0 1
6 chicken_two 0 1 0 1
7 apple_two 1 0 0 1
8 chicken_two 0 1 0 1
9 chicken_one 0 1 1 0
The problem with the prior code was that, because of Python's late-binding closures, the regex was only evaluated during the last loop iteration. Adding it as a default argument fixes this.
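A minimal demonstration of that late-binding pitfall, independent of pandas:

funcs = [lambda: i for i in range(3)]
print([f() for f in funcs])        # [2, 2, 2] -- every closure sees the final i

funcs = [lambda i=i: i for i in range(3)]
print([f() for f in funcs])        # [0, 1, 2] -- the default argument freezes each value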
This can do what you want:
pd.concat([df,pd.DataFrame({a:list(df["event"].str.contains(b)) for a,b in Dic.items()})],axis=1)
Actually, using a for loop would do the same.
If I understand your question correctly, you're trying to rename the columns, in which case I think you could just use Pandas' rename function. This would look like
df_res = df_res.rename(columns=Dic)
-Ben