DataFrame isn't dropping duplicates - python

I've seen other people pose this question, but I have been through the solutions and nothing is working so far.
My DataFrame isn't dropping duplicates. I don't know how to fix it.
import numpy as np
import pandas as pd
from datetime import date, timedelta

# input_dat, subcatch_dat_big and output_nolid_dat are loaded earlier
combined_input_dat = pd.concat([input_dat, subcatch_dat_big], axis=1)
combined_dat = pd.concat([combined_input_dat, output_nolid_dat], axis=1)

date_first = date(2010, 1, 1)
date_last = date(2020, 12, 31)
date_delta = date_last - date_first
row_names_date = [(date_first + timedelta(days=i)).strftime('%m/%d/%Y') for i in range(date_delta.days + 1)]

n_subcatchments = 140
row_names_date_long = np.repeat(row_names_date, n_subcatchments)
combined_input_dat['date'] = row_names_date_long
combined_input_dat.drop_duplicates(keep=False, inplace=True)
I'm able to drop the duplicates before this code, but not after. Any suggestions would be greatly appreciated.
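One thing worth double-checking: drop_duplicates only treats rows as duplicates when they match across all columns (or the columns passed via subset), and keep=False removes every member of a duplicate group rather than keeping one copy. Since the new 'date' column makes many previously identical rows distinct, a sketch along these lines may help with diagnosing it (column names assumed from the code above):

# how many full-row duplicates remain after adding 'date'?
print(combined_input_dat.duplicated().sum())

# duplicates when the new 'date' column is ignored
cols_wo_date = [c for c in combined_input_dat.columns if c != 'date']
print(combined_input_dat.duplicated(subset=cols_wo_date).sum())

# keep one copy of each duplicate group instead of dropping all copies
deduped = combined_input_dat.drop_duplicates(subset=cols_wo_date, keep='first')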

Problems with pd.merge

Hope you all are having an excellent week.
So, I was finishing a script that worked really well for a specific use case. The base is as follows:
Function cosine_similarity_join:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_join(a: pd.DataFrame, b: pd.DataFrame, col_name):
    a_len = len(a[col_name])
    # all of the "documents" in a 1D array
    corpus = np.concatenate([a[col_name].to_numpy(), b[col_name].to_numpy()])
    # vectorize the array (fit_vectorizer is a helper defined elsewhere)
    tfidf, vectorizer = fit_vectorizer(corpus, 3)
    # in this matrix each row represents the str in a and the col is the str from b, value is the cosine similarity
    res = cosine_similarity(tfidf[:a_len], tfidf[a_len:])
    res_series = pd.DataFrame(res).stack().rename("score")
    res_series.index.set_names(['a', 'b'], inplace=True)
    # join scores to b
    b_scored = pd.merge(left=b, right=res_series, left_index=True, right_on='b').droplevel('b')
    # find the indices on which to match (highest score in each row)
    best_match = np.argmax(res, axis=1)
    # join a to the scored b on the index
    res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
    print(res)
    df = res.reset_index()
    df = df.iloc[df.groupby(by="RefCol")["score"].idxmax()].reset_index(drop=True)
    return df
This works like a charm when I do something like:
resulting_df = cosine_similarity_join(df1,df2,'My_col')
But in my case, I need something along the lines of:
big_df = pd.read_csv('some_really_big_df.csv')
some_other_df = pd.read_csv('some_other_small_df.csv')
counter = 0
size = 10000
total_size = len(big_df)
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    resulting_df = cosine_similarity_join(small_df, some_other_df, 'My_col')
    counter += size
I have already traced the problem to one specific line in the function:
res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
Basically, this res DataFrame comes out empty and I just cannot understand why (when I replicate the values outside of the loop it works just fine)...
I looked at the problem for hours now and would gladly accept a new light over the question.
Thank you all in advance!
Found the problem!
I just needed to reset the indexes before the join: once I create a new small df from the big df, the indexes remain those of the slice of the big one, which breaks the join with the other df!
So basically all I needed to do was:
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    small_df = small_df.reset_index()
    resulting_df = cosine_similarity_join(small_df, some_other_df, 'My_col')
    counter += size
I'll leave it here in case it helps someone :)
Cheers!
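For anyone wondering why the merge came back empty: pd.merge with left_index=True and right_index=True only matches rows whose index labels are equal, and a slice of the big DataFrame keeps the original labels of the big one while the score frame built inside the function uses a fresh 0-based index. A minimal sketch with made-up values:

import pandas as pd

# the slice keeps the original labels 10000..10002
a = pd.DataFrame({'My_col': ['x', 'y', 'z']}, index=[10000, 10001, 10002])
# the scores are built on a fresh 0-based index, as inside the function
scores = pd.DataFrame({'score': [0.9, 0.8, 0.7]})

print(pd.merge(left=a, right=scores, left_index=True, right_index=True))  # empty
print(pd.merge(left=a.reset_index(drop=True), right=scores,
               left_index=True, right_index=True))                       # 3 rows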

Iterate over rows and append data continuously

I need to iterate over row data in a pandas DataFrame. However, I am stuck with looping because it takes too much time on millions of rows. I think my code is still not optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)
loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
print(df_new)
Use 400 ft as the increment so the first row gradually changes by 400, the second by 800 ft, and so on in multiples of 400.
It's like:
row[1] = 27800 + 400
row[2] = 27775 + 800
etc.
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use the pandas .add method:
df['alt'].add((df.index+1)*400)
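A quick sanity check of the vectorised version on a tiny made-up frame (values assumed for illustration):

import pandas as pd

df = pd.DataFrame({'alt': [27800, 27775, 27760]})
df['alt_anomaly'] = df['alt'] + (df.index + 1) * 400
print(df)
#      alt  alt_anomaly
# 0  27800        28200
# 1  27775        28575
# 2  27760        28960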

Apply multiple agg functions on groupby index

I currently have the following wikipedia scraper:
import wikipedia as wp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Wikipedia scraper
wiki_page = 'Climate_of_Italy'
html = wp.page(wiki_page).html().replace(u'\u2212', '-')
def dataframe_cleaning(table_number: int):
    global html
    df = pd.read_html(html, encoding='utf-8')[table_number]
    df.drop(np.arange(5, len(df.index)), inplace=True)
    df.columns = df.columns.droplevel()
    df.drop('Year', axis=1, inplace=True)
    find = r'\((.*?)\)'
    for i, column in enumerate(df.columns):
        if i > 0:
            df[column] = (df[column]
                          .str.findall(find)
                          .map(lambda x: np.round((float(x[0]) - 32) * (5 / 9), 2)))
    return df
potenza_df = dataframe_cleaning(3)
milan_df = dataframe_cleaning(4)
florence_df = dataframe_cleaning(6)
italy_df = pd.concat((potenza_df, milan_df, florence_df))
Produces the following DataFrame:
As you may see, I have concatenated the DataFrames, which results in a number of repeating rows. Using groupby I want to collapse these into a single DataFrame, and with the .agg method I want to apply min, max and mean. The issue I am facing is that I am unable to apply the .agg method row by row. I know it is a very simple question, but I've been looking through the documentation and sadly cannot figure it out.
Thank you for your help in advance.
P.S. Sorry if this is a duplicate post, but I was unable to find a similar solution.
EDIT:
Added desired output (NOTE: it was done in Excel).
Just a quick update: I was able to achieve my desired goal, however I was not able to find a clean way to do it.
concat_df = pd.concat((potenza_df, milan_df, florence_df))
italy_df = pd.DataFrame()
for i, index in enumerate(list(set(concat_df['Month']))):
    if i == 0:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.max)
    if i in range(1, 4):
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.mean)
    if i == 4:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.min)
    italy_df = italy_df.append(temp_df)
italy_df = italy_df.apply(lambda x: np.round(x, 2))
italy_df
This code achieves the desired result; however, it is highly dependent on manual configuration.
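If the goal is simply min, mean and max per month for every numeric column, all three aggregations can also be applied in a single call without splitting by month manually; a minimal sketch, assuming concat_df has a 'Month' column and numeric temperature columns:

# one row per month, with min/mean/max computed for every numeric column
summary = (concat_df
           .groupby('Month')
           .agg(['min', 'mean', 'max'])
           .round(2))
print(summary)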

Even after adding a column to a DataFrame, the shape remains the same, which means I am not able to add a column to my DataFrame

This is my piece of code:
def segregate_files(self, list_of_csv, each_sub_folder):
    new_list_of_csv = []
    for each_csv in list_of_csv:
        pattern = f"{each_sub_folder}/(.*?)/"
        self.data_centre = re.search(pattern, each_csv).group(1)
        if "org_dashboards/" in each_csv:
            each_csv = each_csv.replace("org_dashboards/", f"{self.file_path}/")
        else:
            each_csv = each_csv.replace("dashboards/", f"{self.file_path}/")
        df = pd.read_csv(each_csv)
        print(df.shape)
        df["Data Centre"] = self.data_centre
        print(df.shape)
        df.to_csv(each_csv)
        new_list_of_csv.append(each_csv)
        # self.list_of_sub_folder.append(f"files/{blob_name}")
    print(new_list_of_csv)
    self.aggregate_csv_path = f"{self.file_path}/{each_sub_folder}"
    return new_list_of_csv, self.aggregate_csv_path
My DataFrame is able to read the CSV properly, and there is no error on df["Data Centre"] = self.data_centre; only the shape remains the same. FYI, the value of self.data_centre is also correct.
Sorry my bad. It was a file write issue. Now it has been resolved. Thank you.
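For completeness, one common file-write pitfall with this read/modify/write pattern (not necessarily the exact issue here) is that to_csv writes the index as an extra unnamed column unless index=False is passed, which shifts the shape on the next read; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df['Data Centre'] = 'dc1'                 # shape becomes (2, 2)
print(df.shape)

df.to_csv('out.csv')                      # index written as an extra column
print(pd.read_csv('out.csv').shape)       # (2, 3)

df.to_csv('out.csv', index=False)         # no extra column on re-read
print(pd.read_csv('out.csv').shape)       # (2, 2)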

python dataframe - iterate through dataframes to find future date, considering previous iterations

For half a year now I have been into Python and all its incredible libraries, such as pandas DataFrames.
I am struggling to get the iteration logic (see attached image) implemented in my code. The logic is pretty clear to me, but unfortunately I am not able to get the snippet coded!
I was wondering if there is someone out there who can give me the right hint?
Thank you very much in advance!
Transparent iteration logic
df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'])
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020'])
import pandas as pd
df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'], format="%d.%m.%Y")
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020', '03.11.2021'], format="%d.%m.%Y")
lst_df1 = list(df1.sort_values())
lst_df2 = list(df2.sort_values())
dict_df3 = {}
window_start = lst_df2[0]
window_stop = lst_df2[1]
for date in lst_df1:
    while date > window_stop:
        window_start = lst_df2[0]
        lst_df2 = lst_df2[1:]
        window_stop = lst_df2[0]
    dict_df3[date] = window_start
df3 = pd.DataFrame.from_dict(dict_df3, orient='index').reset_index()
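If I read the loop correctly, it maps each date in df1 to the most recent df2 date on or before it. For reference, pandas ships merge_asof for exactly this kind of lookup; a minimal sketch reusing the same two date lists:

left = pd.DataFrame({'date': df1.sort_values()})
right = pd.DataFrame({'window_start': df2.sort_values()})

# for each 'date', take the closest 'window_start' that is <= 'date'
df3_alt = pd.merge_asof(left, right, left_on='date', right_on='window_start',
                        direction='backward')
print(df3_alt)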
