Removing doubles while iterating a dataframe

Removing doubles while iterating a dataframe - python

I'm trying to remove doubles from a dataframe.
Basically, the dataframe contains two (or more) occurence of a document.
The doubles can be found by comparing the description of the document.
In my logic, I had to find who the duplicates are, copy the data and drop them from both the dataframe and the iterated dataframe.
But it appears there are still doubles, I do think it is because of the drop but don't know how to fix it.
So what is in green is the description, I need to drop one of the two, and fuse all that there is in black.
For example:
URL1 + URL2|Explorimmo + Bien_ici|Apartment|Description
Unfortunately, I can't link the dataset.
file = pd.ExcelFile(mc.file_path)
df = pd.read_excel(file)
description_duplicate = df.loc[df.duplicated(['DESCRIPTION']) == True]
for idx1, clean in description_duplicate.iterrows():
for idx2, dirty in description_duplicate.iterrows():
if idx1 != idx2:
if clean['DESCRIPTION'] == dirty['DESCRIPTION']:
clean['CRAWL_SOURCE'] = clean['CRAWL_SOURCE'] + " / " +dirty['CRAWL_SOURCE']
clean['URL'] = clean['URL'] + " / " + dirty['URL']
description_duplicate = description_duplicate.drop(idx2)
df = df.drop(idx2)
df[idx1] = clean

You only need to remove duplicates with the pandas.DataFrame.drop_duplicates() function:
df.drop_duplicates(subset='DESCRIPTION', inplace=True)

Related

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using python. I am currently using the import csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if(category!=label):
difference+=1
totalChecked+=1
else:
correct+=1
totalChecked+=1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?

convert both to pandas dataframes and compare similarly as this example. Whatever dataset your working on using the Pandas module, alongside any other necessary relevant modules, and transforming the data into lists and dataframes, would be first step to working with it imo.
I've taken the liberty and time/ effort to delve into this myself as it will be useful to me going forward. Columns don't have to have the same lengths at all in his example, so that's good. I've tested the below code (Python 3.8) and it works successfully.
With only a slight adaptations can be used for your specific data columns, objects and purposes.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S fom _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a') #in his eg it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
for species, seq1 in Ref_dict.items():
m = re.search(seq, seq1)
if m:
match = m.group()
pos = m.start() + 1
f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
And I did it myself too, assuming your columns contained integers, and according to your specifications (As best at the moment I can). Its my first try [Its my first attempt without webscraping, so go easy]. You could use my code below for a benchmark of how to move forward on your question.
Basically it does what you want (give you the skeleton) and does this : "imports csv in python using pandas module, converts to dataframes, works on specific columns only in those df's, make new columns (results), prints results alongside the original data in the terminal, and saves to new csv. It's as as messy as my python is , but it works! personally (& professionally) speaking is a milestone for me and I Will hopefully be working on it at a later date to improve it readability, scope, functionality and abilities [as the days go by (from next weekend).]
# This is work in progress, (although it does work and does a job), and its doing that for you. there are redundant lines of code in it, even the lines not hashed out (because im a self teaching newbie on my weekends). I was just finishing up on getting the results printed to a new csv file (done too). You can see how you could convert your columns & rows into lists with pandas dataframes, and start to do calculations with them in Python, and get your results back out to a new CSV. It a start on how you can answer your question going forward
#ITS FOR HER TO DO MUCH MORE & BETTER ON!! BUT IT DOES IN BASIC TERMS WHAT SHE ASKED FOR.
import pandas as pd
from pandas import DataFrame
import csv
import itertools #redundant now'?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace = True)
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats) #good to compare whole columns as block, Enumerate ZIP 19:06 01/06/2020 FORGET THIS FOR NOW, WAS PART OF A LATTER ATTEMPT TO COMPARE TEXT & MISSED TEXT WITH INTERGER FIELDS. DOESNT EFFECT PROGRAM
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re #this is for string matching & comparison. redundant in my example here but youll need it to compare tables if strings.
#filename = 'CATS&LABS64.csv' # when i got to exporting part, this is redundant now
#csvfile = open(filename, 'a') #when i tried to export results/output it first time - redundant
print("Given Dataframe :\n", A)
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
#YOU CAN DO OTHER MATCHES, COMPARISONS AND CALCULTAIONS YOURSELF HERE AND ADD THEM TO THE OUTPUT
result = (print("\nDifference of score1 and score2 :\n", A))
result2 = print(A) and print(result)
def result22(result2):
for aSentence in result2:
df = pd.DataFrame(result2)
print(str())
return df
print(result2)
print(result22) # printing out the function itself 'produces nothing but its name of course
output_df = DataFrame((result2),A)
output_df.to_csv('some_name5523.csv')
Yes, i know, its by no means perfect At all, but wanted to give you the heads up about panda's and dataframes for doing what you want moving forward.

Randomization of a list with conditions using Pandas

I'm new to any kind of programming as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential code and that's actually my problem. My goal is to create a somewhat-automated script - probably including for-loop (I've unsuccessfully tried).
The main aim is to create a randomization loop which takes original dataset looking like this:
dataset
From this data set picking randomly row by row and saving it one by one to another excel list. The point is that the row from columns called position01 and position02 should be always selected so it does not match with the previous pick in either of those two column values. That should eventually create an excel sheet with randomized rows that are followed always by a row that does not include values from the previous pick. So row02 should not include any of those values in columns position01 and position02 of the row01, row3 should not contain values of the row2, etc. It should also iterate in the range of the list length, which is 0-11. Important is also the excel output since I need the rest of the columns, I just need to shuffle the order.
I hope my aim and description are clear enough, if not, happy to answer any questions. I would appreciate any hint or help, that helps me 'unstuck'. Thank you. Code below. (PS: I'm aware of the fact that there is probably much more neat solution to it than this)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row 'position01'/02 are columns headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)

So at the end I've used a solution provided by David Bridges (post from Sep 19 2019) on psychopy websites. In case anyone is interested, here is a link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in for loop to my case like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer! and hopefully I did not spam it over here too much.

import itertools as it
import random
import pandas as pd
# list of pair of numbers
tmp1 = [x for x in it.permutations(list(range(6)),2)]
df = pd.DataFrame(tmp1, columns=["position01","position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i],ignore_index = True)
df = df.drop(index = i)
while not df.empty:
val = list(df1.iloc[-1])
tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
if tmp.empty: #looped for 10000 times, was never empty
print("here")
break
i = random.choice(tmp.index)
df1 = df1.append(df.loc[i],ignore_index = True)
df = df.drop(index=i)

Avoid having to repeat the same dataframe column names when modifying them

I have a dataframe with over 30 columns. I am doing various modifications on specific columns and would like to find a way to avoid having to always list the specifc columns. Is there a shortcut?
For example:
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]] = matrix_bus_filled[matrix_bus_filled['FNR'] == 'AB1120'][["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]].values
Could I simply once define the term "SpecificColumns" and then paste it here?
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["SpecificColumns"]] = matrix_bus_filled[matrix_bus_filled['Flight Number'] == 'AB1120'][["SpecificColumns]].values
And here
matrix_bus_filled [["SpecificColumns"]] = matrix_bus_filled [["SpecificColumns"]].apply(scale, axis=1)

Just define a list and use that to call the columns.
specific_columns = ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]
matrix_bus_filled[specific_columns] = matrix_bus_filled[specific_columns].apply(scale, axis=1)

How to access a cell in a new dataframe?

I created a sub dataframe (drama_df) based on a criteria in the original dataframe (df). However, I can't access a cell using the typical drama_df['summary'][0] . Instead I get a KeyError: 0. I'm confused since type(drama_df) is a DataFrame. What do I do? Note that df['summary'][0] does indeed return a string.
drama_df = df[df['drama'] > 0]
#Now we generate a lump of text from the summaries
drama_txt = ""
i = 0
while (i < len(drama_df)):
drama_txt = drama_txt + " " + drama_df['summary'][i]
i += 1
EDIT
Here is an example of df:
Here is an example of drama_df:

This will solve it for you:
drama_df['summary'].iloc[0]
When you created the subDataFrame you probably left the index 0 behind. So you need to use iloc to get the element by position and not by index name (0).
You can also use .iterrows() or .itertuples() to do this routine:
Itertuples is a lot faster, but it is a bit more work to handle if you have a lot of columns
for row in drama_df.iterrows():
drama_txt = drama_txt + " " + row['summary']
To go faster:
for index, summary in drama_df[['summary']].itertuples():
drama_txt = drama_txt + " " + summary

Wait a moment here. You are looking for the str.join() operation.
Simply do this:
drama_txt = ' '.join(drama_df['summary'])
Or:
drama_txt = drama_df['summary'].str.cat(sep=' ')

Pandas - Working on multiple columns seems slow

I have some trouble processing a big csv with Pandas. Csv consists of an index and about other 450 columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to "analyze B column (it's a sort of "CONTROL field" and depending on its value I should then return a value by processing col A and C.
Finally I need to return a concatenation of all resulting columns starting from 150 to 1.
I already tried with apply but it seems too slow (10 min to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola,colb,colc - preprocessing the list - and applying map to generate results and it speeds up a little:
for i in range(1,151):
df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]
concats = []
for i in range(1,151):
concats.append('Concat' + str(i))
ret = df[concats].values.ravel()
uniq = list(set(ret))
list = {}
for member in ret:
list[member] = getPath2(member)
for i in range(1,MAX_COLS + 1):
df['Res' + str(i)] = df['Concat' + str(i)].map(list)
df['Path'] = df.apply(getFullPath2,axis=1)
function getPath and getFullPath2 are defined as example here:
https://pastebin.com/zpFF2wXD
But it seems still a little bit slow (6 min for processing everything)
Do you have any suggestion on how I could speed up csv processing?
I don't even know if the way I using to "concatenate" columns could be better :), tried with Series.cat but I didn't get how to chain only some columns and not the full df
Thanks very much!
Mic

Amended answer: I see from your criteria, you actually have multiple controls on each column. I think what works is to split these into 3 dataframes, applying your mapping as follows:
import pandas as pd
series = {
'cola1': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb1': pd.Series(['ret1','ret1','ret2'],index=[1,2,3]),
'colc1': pd.Series(['B_1','C_2','B_3'],index=[1,2,3]),
'cola2': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb2': pd.Series(['ret3','ret1','ret2'],index=[1,2,3]),
'colc2': pd.Series(['B_2','A_1','A_3'],index=[1,2,3]),
'cola3': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb3': pd.Series(['ret2','ret2','ret1'],index=[1,2,3]),
'colc3': pd.Series(['A_1','B_2','C_3'],index=[1,2,3]),
}
your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])
# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
df.columns = ['col1','col2','col3']
# Mapping criteria
def map_colb(c):
if c == 'ret1':
return 'A'
elif c == 'ret2':
return None
else:
return 'F'
def map_cola(a):
if a.startswith('D_'):
return 'D'
else:
return 'E'
def map_colc(c):
if c.startswith('B_'):
return 'B'
elif c.startswith('C_'):
return 'C'
elif c.startswith('A_'):
return None
else:
return 'F'
# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)
# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# df.values.tolist() turns a row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In[70]: pathlist
Out[71]: ['A|F|D', 'A|A|B', 'B|E|A']

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing doubles while iterating a dataframe - python

You only need to remove duplicates with the pandas.DataFrame.drop_duplicates() function: df.drop_duplicates(subset='DESCRIPTION', inplace=True)

Related

How to work with Rows/Columns from CSV files?

Randomization of a list with conditions using Pandas

Avoid having to repeat the same dataframe column names when modifying them

How to access a cell in a new dataframe?

Pandas - Working on multiple columns seems slow

Categories

Resources