Get duplicated rows in larger-than-memory dataset with pandas [duplicate] - python

This question already has an answer here:
How to drop duplicated rows using pandas in a big data file?
(1 answer)
Closed 6 years ago.
pandas.DataFrame.duplicated is great for finding duplicate rows across specified columns within a DataFrame.
However, my dataset is larger than what fits in memory (and even larger than what I could fit in after extending it within reasonable budget limits).
This is fine for most of the analyses I have to run, since I can loop over my dataset (CSV and DBF files), load each file into memory on its own, and do everything in sequence. For duplicate analysis, however, this approach only finds duplicates within single files, not across the whole dataset.
Is there any algorithm or approach for finding duplicates across multiple dataframes while not having to load them all into memory at the same time?

You can hash the values of the "key" columns and maintain a set of hash codes you already encountered:
import hashlib

import pandas as pd

hash_set = set()  # this will contain the hash codes of all rows seen so far

def is_duplicate(row):
    m = hashlib.md5()
    for c in ["column1", "column2", "column3"]:
        m.update(str(row[c]).encode("utf-8"))  # hashlib needs bytes, so encode the value
    hash_code = m.digest()
    if hash_code in hash_set:
        return 1
    hash_set.add(hash_code)
    return 0

for df_path in [df1_path, df2_path, df3_path]:  # iterate over the dataframes one by one
    df = pd.read_csv(df_path)  # load the dataframe
    df["duplicate"] = df.apply(is_duplicate, axis=1)
    unique_df = df[df["duplicate"] == 0]  # a "globally" unique dataframe
    unique_df.pop("duplicate")  # you don't need this column anymore
    # YOUR CODE...

I would suggest two things.
First, load the data frames into an RDBMS if possible. Then you can find duplicates by grouping on the key columns.
Second, extract only the key columns from the big files and compare those with each other. Try to sort the rows on the key columns so you can detect a duplicate by comparing each row only with the next one; a sketch of this idea follows below.
Hope that helps.
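As a rough illustration of the second suggestion, here is a minimal sketch; the file names and the key column names (key1, key2) are assumptions, not part of the original question. It pulls only the key columns from each file, sorts them, and flags rows whose keys were already seen:
import pandas as pd

files = ["part1.csv", "part2.csv", "part3.csv"]  # placeholder file names
key_cols = ["key1", "key2"]                      # placeholder key columns

# Read only the key columns from each file; this is far smaller than the full data.
keys = pd.concat(
    (pd.read_csv(f, usecols=key_cols) for f in files),
    ignore_index=True,
)

# Sort so identical keys end up next to each other, then flag repeats.
keys_sorted = keys.sort_values(key_cols)
is_dup = keys_sorted.duplicated(subset=key_cols, keep="first")
print(keys_sorted[is_dup])  # rows whose key combination appeared before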

Related

For a requirement I need to transform a DataFrame by creating rows out of values from lists that are in a column of that DataFrame [duplicate]

This question already has answers here:
Pandas column of lists, create a row for each list element
(10 answers)
Closed 1 year ago.
I need to transform the below DataFrame into the required format without using a loop (or any other inefficient logic), as the DataFrame is huge: 950 thousand rows, and the value in the Points column is a list with more than 1000 elements. I'm getting this data after de-serializing blob data from the database and will need to use it to create some ML models.
input:
output:
for index, val in df.iterrows():
    tempDF = pd.DataFrame(
        [[
            df['I'][index], df['x'][index],
            df['y'][index], df['points'][index],
        ]] * int(df['points'][index]))
    tempDF["Data"] = df['data'][index]
    tempDF["index"] = list(range(1, int(df['k'][index]) + 1))
    FinalDF = FinalDF.append(tempDF, ignore_index=True)
I have tried using a for loop, but for 950 thousand rows it takes so much time that this approach is just not feasible. Please help me find a pandas-based solution or, failing that, some other method to do this.
*I had to post screenshots because I was unable to post the DataFrame as a table. Sorry, I'm new to Stack Overflow.
Use explode:
df.explode('points')
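For illustration, a minimal sketch with a made-up frame standing in for the screenshots (the column names are assumptions):
import pandas as pd

# Toy frame; 'points' holds a list in each row.
df = pd.DataFrame({
    "I": [1, 2],
    "points": [[10, 20, 30], [40, 50]],
})

# explode() turns each list element into its own row, repeating the other columns.
out = df.explode("points")
print(out)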

How to assign the same array of columns to multiple dataframes in Pandas?

I have 9 data sets. Between any 2 given data sets, they will share about 60-80% of the same columns. I want to concatenate these data sets into one data set. Due to some memory limitations, I can't load these datasets into data frames and use the concatenate function in pandas (but I can load each individual data set into a data frame). Instead, I am looking at an alternative solution.
I have created an ordered list of all columns that exist across these data sets, and I want to apply this column list to each of the 9 individual data sets so that they all have the same columns in the same order. Once that is done I will concatenate the flat files in the terminal, which will essentially append the data sets together, hopefully solving my issue and creating one single dataset out of the 9.
The problem I am having is applying the ordered list to 9 data sets. I keep getting a KeyError "[[list of columns]] not in index" whenever I try to change the columns in the single data sets.
This is what I have been trying:
df = df[clist]
I have also tried
df = df.reindex(columns=clist)
but this doesn't create the extra columns in the data frame; it just orders them in the order that clist is in.
I expect the result to be 9 datasets that line up on the same axis for an append or concat operation outside pandas.
I just solved it.
The reindex function does work. I was applying it outside of the list of dataframes I had created.
I loaded the first rows of each of these 9 datasets into a list:
li = []
for filename in all_files:
    df = pd.read_csv(filename, nrows=10)
    li.append(df)
And from that list I used reindex as such:
for i in range(0, 9):
    li[i] = li[i].reindex(columns=clist)
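To produce the aligned files for the terminal-side concatenation described above, a minimal sketch (the file names and clist below are placeholders) could reindex each full dataset one at a time and write it back out:
import pandas as pd

clist = ["col_a", "col_b", "col_c"]  # replace with the real ordered column list
all_files = ["dataset_1.csv", "dataset_2.csv", "dataset_3.csv"]  # ...up to the 9 files

for filename in all_files:
    df = pd.read_csv(filename)
    # reindex adds any missing columns (filled with NaN) and puts the columns
    # in the same order, so every output file lines up for the later concat.
    df = df.reindex(columns=clist)
    df.to_csv(filename.replace(".csv", "_aligned.csv"), index=False)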

Most efficient way to compare two near identical CSV's in Python?

I have two CSVs, each with about 1M lines and n columns; the columns are identical between the two files. I want the most efficient way to compare the two files and find where any differences lie. I would prefer to parse this data with Python rather than use any Excel-related tools.
Are you using pandas?
import pandas as pd
df = pd.concat([pd.read_csv('file1.csv'), pd.read_csv('file2.csv')], ignore_index=True)
# boolean Series indicating which rows are duplicated
df.duplicated()
# dataframe with only unique rows
df[~df.duplicated()]
# dataframe with only duplicate rows
df[df.duplicated()]
# number of duplicate rows present
df.duplicated().sum()
An efficient way would be to read each line from the first file (the one with fewer lines) and save it in an object like a set or dictionary, which gives O(1) lookups.
Then read the lines of the second file and check whether each one exists in the set.
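A minimal sketch of that idea, assuming two plain-text files named file1.csv and file2.csv for which a line-by-line comparison makes sense:
# Build a set of the lines of the smaller file, then scan the other one.
with open("file1.csv") as f1:
    seen = set(line.rstrip("\n") for line in f1)

with open("file2.csv") as f2:
    for lineno, line in enumerate(f2, start=1):
        if line.rstrip("\n") not in seen:
            print(f"line {lineno} of file2.csv has no match in file1.csv")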

Pandas - Drop function error (label not contained in axis) [duplicate]

This question already has answers here:
Delete a column from a Pandas DataFrame
(20 answers)
Closed 5 years ago.
I have a CSV file that is as the following:
index,Avg,Min,Max
Build1,56.19,39.123,60.1039
Build2,57.11,40.102,60.2
Build3,55.1134,35.129404123,60.20121
Based on my question here, I am able to add some relevant information to this CSV via this short script:
import pandas as pd
df = pd.read_csv('newdata.csv')
print(df)
df_out = (
    pd.concat([df.set_index('index'),
               df.set_index('index').agg(['max', 'min', 'mean'])])
    .rename(index={'max': 'Max', 'min': 'Min', 'mean': 'Average'})
    .reset_index()
)
with open('newdata.csv', 'w') as f:
    df_out.to_csv(f, index=False)
This results in this CSV:
index,Avg,Min,Max
Build1,56.19,39.123,60.1039
Build2,57.11,40.102,60.2
Build3,55.1134,35.129404123,60.20121
Max,57.11,40.102,60.20121
Min,55.1134,35.129404123,60.1039
Average,56.1378,38.1181347077,60.16837
I would now like to be able to update this CSV. For example, if I ran a new build (Build4, for instance) I could add it in and then redo the Max, Min, and Average rows. My idea is therefore to delete the rows labelled Max, Min, and Average, add my new row, and redo the stats. I believe the code I need is as simple as (shown just for Max, but there would be lines for Min and Average as well):
df = pd.read_csv('newdata.csv')
df = df.drop('Max')
However, this always results in a ValueError: labels ['Max'] not contained in axis.
I created the CSV files in Sublime Text; could this be part of the issue? I have read other SO posts about this and none seem to help with my issue.
I am unsure if this is allowed, but here is a download link to my csv just in case something is wrong with the file itself.
I would be okay with two possible answers:
How to fix this drop issue
How to add more builds and update the statistics (a method without drop)
You must specify the axis argument. The default is axis=0, which refers to rows; columns are axis=1.
So this should be your code:
df = df.drop('Max', axis=1)
edit:
looking at this piece of code:
df = pd.read_csv('newdata.csv')
df = df.drop('Max')
The code you used does not specify that the first column of the CSV file contains the index for the dataframe, so pandas creates a purely numerical index on the fly. That is why your index does not contain "Max".
Try the following:
df = pd.read_csv("newdata.csv", index_col=0)
df = df.drop("Max", axis=0)
This forces pandas to use the first column of the CSV file as the index, and the drop should now work.
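Putting the pieces together, here is a minimal sketch of the update workflow described in the question (drop the summary rows, append the new build, recompute the stats); the Build4 numbers are made up for illustration:
import pandas as pd

df = pd.read_csv("newdata.csv", index_col=0)

# Drop the old summary rows (axis=0 drops rows by index label).
df = df.drop(["Max", "Min", "Average"], axis=0)

# Append the new build (these values are invented).
df.loc["Build4"] = [58.0, 41.0, 61.0]

# Recompute the summary rows and write the file back out.
summary = df.agg(["max", "min", "mean"]).rename(
    index={"max": "Max", "min": "Min", "mean": "Average"})
pd.concat([df, summary]).to_csv("newdata.csv", index_label="index")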
To delete a particular column in pandas, simply do:
del df['Max']

Using Pandas how do I deduplicate a file being read in chunks?

I have a large fixed width file being read into pandas in chunks of 10000 lines. This works great for everything except removing duplicates from the data because the duplicates can obviously be in different chunks. The file is being read in chunks because it is too large to fit into memory in its entirety.
My first attempt at deduplicating the file was to bring in just the two columns needed to deduplicate it and make a list of rows to not read. Reading in just those two columns (out of about 500) easily fits in memory and I was able to use the id column to find duplicates and an eligibility column to decide which of the two or three with the same id to keep. I then used the skiprows flag of the read_fwf() command to skip those rows.
The problem I ran into is that the Pandas fixed width file reader doesn't work with skiprows = [list] and iterator = True at the same time.
So, how do I deduplicate a file being processed in chunks?
My solution was to bring in just the columns needed to find the duplicates I want to drop and build a bitmask based on that information. Then, knowing the chunksize and which chunk I'm on, I reindex the current chunk so that it matches the position it represents in the bitmask. I then pass it through the bitmask and the duplicate rows are dropped.
Bring in the entire column to deduplicate on, in this case 'id'.
Then create a bitmask of the rows that AREN'T duplicates. DataFrame.duplicated()
returns the rows that are duplicates and the ~ inverts that. Now we have our 'dupemask'.
dupemask = ~df.duplicated(subset = ['id'])
Then create an iterator to bring the file in in chunks. Once that is done loop over the iterator and create a new index for each chunk. This new index matches the small chunk dataframe with its position in the 'dupemask' bitmask, which we can then use to only keep the lines that aren't duplicates.
for i, df in enumerate(chunked_data_iterator):
    df.index = range(i*chunksize, i*chunksize + len(df.index))
    df = df[dupemask]
This approach only works in this case because the data is large mostly because it is so wide; it still has to read one column in its entirety in order to work.
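Putting the steps together, a minimal end-to-end sketch; the file name, colspecs, and chunksize are placeholders and would need to match the real fixed width file:
import pandas as pd

chunksize = 10000

# Pass 1: read only the 'id' column and build the bitmask of rows to keep.
ids = pd.read_fwf("data.fwf", colspecs=[(0, 10)], names=["id"])  # placeholder layout
dupemask = ~ids.duplicated(subset=["id"])

# Pass 2: read the full file in chunks and keep only the non-duplicate rows.
kept_chunks = []
for i, chunk in enumerate(pd.read_fwf("data.fwf", chunksize=chunksize)):
    # Take the slice of the bitmask that corresponds to this chunk's rows.
    start = i * chunksize
    mask = dupemask.iloc[start:start + len(chunk)].to_numpy()
    kept_chunks.append(chunk[mask])

deduped = pd.concat(kept_chunks)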
