DataFrame isn't dropping duplicates - python

I've seen other people ask this question, but I have been through the solutions and nothing has worked so far.
My DataFrame isn't dropping duplicates, and I don't know how to fix it.
from datetime import date, timedelta
import numpy as np
import pandas as pd
combined_input_dat = pd.concat([input_dat, subcatch_dat_big], axis=1)
combined_dat = pd.concat([combined_input_dat, output_nolid_dat], axis=1)
date_first = date(2010,1,1)
date_last = date(2020,12,31)
date_delta = date_last-date_first
row_names_date = [(date_first + timedelta(days=i)).strftime('%m/%d/%Y') for i in range(date_delta.days + 1) ]
n_subcatchments = 140
row_names_date_long = np.repeat(row_names_date, n_subcatchments)
combined_input_dat['date']= row_names_date_long
I'm able to drop the duplicates before this code, but not after. Any suggestions would be greatly appreciated.
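The question does not show the drop_duplicates call itself, so the cause cannot be pinned down from the code above. As a minimal sketch of two things worth checking, on a small made-up frame (the column names and values are hypothetical): drop_duplicates returns a new DataFrame unless the result is assigned back, and once a column such as date is added, rows that differ only in that column are no longer duplicates unless subset is passed.

```python
import pandas as pd

# Hypothetical data, not the asker's: the first two rows repeat a measurement.
df = pd.DataFrame({
    "subcatchment": [1, 1, 2],
    "rain": [0.5, 0.5, 0.7],
})

# drop_duplicates returns a new frame; df itself is left unchanged.
df.drop_duplicates()
rows_without_assignment = len(df)

# Assigning the result is what actually keeps the de-duplicated rows.
deduped = df.drop_duplicates()

# After adding a column whose values differ, full rows are no longer equal,
# so duplicates have to be judged on a subset of columns.
df["date"] = ["01/01/2010", "01/02/2010", "01/01/2010"]
deduped_all_columns = df.drop_duplicates()
deduped_subset = df.drop_duplicates(subset=["subcatchment", "rain"])
```

If the duplicates disappear only with subset, the added column is what makes the rows distinct.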


Problems with pd.merge

Hope you all are having an excellent week.
So, I was finishing a script that worked really well for a specific use case. The base is as follows:
Function cosine_similarity_join:
def cosine_similarity_join(a: pd.DataFrame, b: pd.DataFrame, col_name):
    a_len = len(a[col_name])
    # all of the "documents" in a 1D array
    corpus = np.concatenate([a[col_name].to_numpy(), b[col_name].to_numpy()])
    # vectorize the array
    tfidf, vectorizer = fit_vectorizer(corpus, 3)
    # in this matrix each row represents the str in a and the col is the str from b, value is the cosine similarity
    res = cosine_similarity(tfidf[:a_len], tfidf[a_len:])
    res_series = pd.DataFrame(res).stack().rename("score")
    res_series.index.set_names(['a', 'b'], inplace=True)
    # join scores to b
    b_scored = pd.merge(left=b, right=res_series, left_index=True, right_on='b').droplevel('b')
    # find the indices on which to match, (highest score in each row)
    best_match = np.argmax(res, axis=1)
    # Join the rest of
    res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
    df = res.reset_index()
    df = df.iloc[df.groupby(by="RefCol")["score"].idxmax()].reset_index(drop=True)
    return df
This works like a charm when I do something like:
resulting_df = cosine_similarity_join(df1,df2,'My_col')
But in my case, I need something along the lines of:
big_df = pd.read_csv('some_really_big_df.csv')
some_other_df = pd.read_csv('some_other_small_df.csv')
counter = 0
size = 10000
total_size = len(big_df)
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    resulting_df = cosine_similarity_join(small_df,some_other_df,'My_col')
    counter += size
I have already traced the problem to one specific line in the function:
res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
Basically, this res DataFrame comes out empty and I cannot understand why, since it works fine when I replicate the values outside the loop.
I have been looking at the problem for hours and would gladly welcome a fresh perspective.
Thank you all in advance!
Found the problem!
I just needed to reset the index before the join. When I create a new small DataFrame from the big one, it keeps the index labels of the slice it came from, and those labels no longer match the other DataFrame in the index-based join.
So basically all I needed to do was:
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    small_df = small_df.reset_index()
    resulting_df = cosine_similarity_join(small_df,some_other_df,'My_col')
    counter += size
I'll leave it here in case it helps someone :)
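The effect described above can be reproduced in isolation. This is a minimal sketch on made-up data (the scores frame stands in for the positionally indexed b_scored; its values are hypothetical), and it uses reset_index(drop=True), which discards the old labels rather than keeping them in an extra column as plain reset_index() does.

```python
import pandas as pd

big_df = pd.DataFrame({"My_col": ["a", "b", "c", "d"]})

# Stand-in for a result indexed by position (0, 1), like the scores in the function.
scores = pd.DataFrame({"score": [0.9, 0.8]})

# The second chunk keeps the labels 2 and 3 from big_df.
second_chunk = big_df[2:4]
empty = pd.merge(second_chunk, scores, left_index=True, right_index=True)

# After resetting, the labels are 0 and 1 and line up with the scores.
fixed_chunk = second_chunk.reset_index(drop=True)
joined = pd.merge(fixed_chunk, scores, left_index=True, right_index=True)
```

The first chunk works without the reset only because its labels happen to start at 0.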

Iterate over loop with append data continuous

I need to iterate over the rows of a pandas DataFrame. However, the loop takes far too long on millions of rows, and I don't think my code is optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)
loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
The altitude should be shifted in multiples of 400 ft: the first row by 400 ft, the second by 800 ft, and so on.
For example:
row[1] = 27800+400
row[2] = 27775+800
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use Pandas .add option:
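A minimal sketch of both forms on made-up altitudes (the first two values come from the question, the third is hypothetical). It builds the offsets from row positions with NumPy rather than from df.index, which is an assumption that keeps the result correct even when the index is not the default 0, 1, 2, ...

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"alt": [27800, 27775, 27750]})

# Offsets of 400, 800, 1200, ... by row position.
offsets = (np.arange(len(df)) + 1) * 400

# Operator form: one vectorized addition instead of a Python loop.
df["alt_anomaly"] = df["alt"] + offsets

# Method form: Series.add performs the same element-wise addition.
alt_anomaly_via_add = df["alt"].add(offsets)
```

Both forms give the same values; the method form is mainly useful when options such as fill_value are needed.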

Apply multiple agg functions on groupby index

I currently have the following Wikipedia scraper:
import wikipedia as wp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Wikipedia __scraper__
wiki_page = 'Climate_of_Italy'
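html = wp.page(wiki_page).html().replace('\u2212', '-')  # assumed reconstruction; the start of this call was garbled in the post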
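def dataframe_cleaning(table_number: int):
    global html
    df = pd.read_html(html, encoding='utf-8')[table_number]
    # keep only the first five rows of the table
    df.drop(np.arange(5, len(df.index)), inplace=True)
    df.columns = df.columns.droplevel()
    df.drop('Year', axis=1, inplace=True)
    # regex capturing the value inside parentheses
    find = r'\((.*?)\)'
    for i, column in enumerate(df.columns):
        if i > 0:
            # assumed step (missing from the post): extract the parenthesised value,
            # then convert it from Fahrenheit to Celsius
            df[column] = (df[column]
                          .str.findall(find)
                          .map(lambda x: np.round((float(x[0]) - 32) * (5 / 9), 2)))
    return df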
potenza_df = dataframe_cleaning(3)
milan_df = dataframe_cleaning(4)
florence_df = dataframe_cleaning(6)
italy_df = pd.concat((potenza_df, milan_df, florence_df))
Produces the following DataFrame:
As you can see, I have concatenated the DataFrames, which results in a number of repeated rows. Using groupby, I want to collapse these into a single DataFrame, and using the .agg method I want to apply min, max, and mean. The issue I am facing is that I cannot apply .agg row by row. I know it is a very simple question, but I've been looking through the documentation and sadly cannot figure it out.
Thank you for your help in advance.
P.S. Sorry if this is a repeated question, but I was unable to find a similar solution.
Added desired output (NOTE: produced in Excel)
Just a quick update: I was able to achieve my goal, but I have not found a clean way to do it.
concat_df = pd.concat((potenza_df, milan_df, florence_df))
italy_df = pd.DataFrame()
for i, index in enumerate(list(set(concat_df['Month']))):
    if i == 0:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.max)
    if i in range(1, 4):
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.mean)
    if i == 4:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.min)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    italy_df = pd.concat([italy_df, temp_df])
italy_df = italy_df.apply(lambda x: np.round(x, 2))
The code above achieves the desired result; however, it depends heavily on manual configuration by the user.
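One way to avoid relying on the position of each label is to map every row label to its aggregation explicitly. This is only a sketch under assumptions: the row labels ("Record high", "Average high", "Record low"), the month columns, and the numbers are made up for illustration, and agg_by_label is a hypothetical name, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated city tables.
concat_df = pd.DataFrame({
    "Month": ["Record high", "Average high", "Record low"] * 2,
    "Jan": [20.0, 10.0, -5.0, 18.0, 12.0, -8.0],
    "Feb": [22.0, 11.0, -4.0, 25.0, 13.0, -6.0],
})

# Which aggregation applies to which row label (an assumption for this sketch).
agg_by_label = {"Record high": "max", "Average high": "mean", "Record low": "min"}

# Aggregate each group with its own function, then stack the results as rows.
italy_df = pd.concat(
    [
        group.drop(columns="Month").agg(agg_by_label[label]).rename(label)
        for label, group in concat_df.groupby("Month", sort=False)
    ],
    axis=1,
).T.round(2)
```

Because the mapping is keyed by label, the result no longer depends on the arbitrary order in which set() returns the labels.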

Even after adding a column to a dataframe my shape remains the same which means I am not able to add a column to my dataframe

This is my piece of code:
import re

def segregate_files(self, list_of_csv, each_sub_folder):
    new_list_of_csv = []
    for each_csv in list_of_csv:
        # the path segment right after the sub-folder names the data centre
        pattern = f"{each_sub_folder}/(.*?)/"
        # re.search is assumed here; the call name was garbled in the post
        self.data_centre = re.search(pattern, each_csv).group(1)
        if "org_dashboards/" in each_csv:
            each_csv = each_csv.replace("org_dashboards/", f"{self.file_path}/")
        each_csv = each_csv.replace("dashboards/", f"{self.file_path}/")
        df = pd.read_csv(each_csv)
        df["Data Centre"] = self.data_centre
        # self.list_of_sub_folder.append(f"files/{blob_name}")
    self.aggregate_csv_path = f"{self.file_path}/{each_sub_folder}"
    return new_list_of_csv, self.aggregate_csv_path
My DataFrame reads the CSV properly, and df["Data Centre"] = self.data_centre raises no error; only the shape remains the same.
FYI, the value of self.data_centre is also correct.
Sorry my bad. It was a file write issue. Now it has been resolved. Thank you.
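The post does not say what the file write issue was. One common way to see an unchanged shape is to re-read a file that was never rewritten after the column was added, so the following is only a sketch of that situation, using in-memory buffers and made-up data (the column values and "dc-east" are hypothetical).

```python
import io
import pandas as pd

csv_text = "host,cpu\nweb-1,0.5\nweb-2,0.7\n"

df = pd.read_csv(io.StringIO(csv_text))
shape_before = df.shape

# Adding a column changes the in-memory frame immediately.
df["Data Centre"] = "dc-east"
shape_after = df.shape

# Re-reading the original source still gives the old shape...
stale = pd.read_csv(io.StringIO(csv_text))

# ...until the modified frame is written out and that output is read back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
```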

python dataframe - iterate through dataframes to find future date, considering previous iterations

I have been using Python for about half a year, along with its incredible libraries such as pandas DataFrames.
I am struggling to implement the iteration logic (see attached image) in my code. The logic is clear to me, but unfortunately I am not able to get the snippet coded!
Is there someone out there who can give me the right hint?
Thank you very much in advance!
Transparent iteration logic
df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'])
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020'])
import pandas as pd
df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'], format="%d.%m.%Y")
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020', '03.11.2021'], format="%d.%m.%Y")
lst_df1 = list(df1.sort_values())
lst_df2 = list(df2.sort_values())
dict_df3 = {}
window_start = lst_df2[0]
window_stop = lst_df2[1]
for date in lst_df1:
    while date > window_stop:
        window_start = lst_df2[0]
        lst_df2 = lst_df2[1:]
        window_stop = lst_df2[0]
    dict_df3[date] = window_start
df3 = pd.DataFrame.from_dict(dict_df3, orient='index').reset_index()
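If the intent matches the loop above (for each date in df1, find the latest date in df2 that does not come after it), pd.merge_asof can express the same lookup without an explicit loop. This is a sketch of an alternative technique, not the method from the post; the column names date and window_start are chosen here for illustration.

```python
import pandas as pd

df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020',
                      '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020',
                      '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'],
                     format="%d.%m.%Y")
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020',
                      '17.07.2020', '19.09.2020', '03.11.2021'],
                     format="%d.%m.%Y")

# merge_asof requires both sides to be sorted on their keys.
left = pd.DataFrame({"date": df1}).sort_values("date")
right = pd.DataFrame({"window_start": df2}).sort_values("window_start")

# direction="backward" picks, for each date, the last window_start <= date.
df3 = pd.merge_asof(left, right, left_on="date", right_on="window_start",
                    direction="backward")
```

By default merge_asof accepts exact matches, whereas the loop keeps the earlier window when a date equals window_stop; on this sample data the two give the same result.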
