Most efficient way to compare two near-identical CSVs in Python? - python

I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files and find where any differences lie. I would prefer to parse this data with Python rather than use any Excel-related tools.

Are you using pandas?
import pandas as pd

df = pd.read_csv('file1.csv')
# DataFrame.append is deprecated; concatenate the two files instead
df = pd.concat([df, pd.read_csv('file2.csv')], ignore_index=True)
# boolean Series indicating which rows are duplicates of an earlier row
df.duplicated()
# dataframe with duplicate rows removed (first occurrence kept)
df[~df.duplicated()]
# dataframe with only the duplicate rows
df[df.duplicated()]
# number of duplicate rows present
df.duplicated().sum()

An efficient way would be to read each line from the first file (the one with fewer lines) and save the lines in a set or dictionary, which gives O(1) lookups.
Then read lines from the second file and check whether each one exists in the set or not.
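A minimal sketch of that idea, treating each raw line as the comparison key; the file names are placeholders:
# Build a set from the smaller file; membership tests are O(1) on average.
with open('file1.csv') as f1:
    seen = set(f1)

# Any line of the second file that is not in the set is a difference.
with open('file2.csv') as f2:
    differences = [line for line in f2 if line not in seen]

print(f'{len(differences)} lines in file2.csv have no exact match in file1.csv')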

Related

Pandas: how to keep data that has all the needed columns

I have this big csv file that has data from an experiment. The first part of each person's response is a trial part that doesn't have the time they took for each response and I don't need that. After that part, the data adds another column which is the time, and those are the rows I need. So, basically, the csv has a lot of unusable data that has 9 columns instead of 10 and I need only the data with the 10 columns. How can I manage to grab that data instead of all of it?
As an example, the first row below shows the data without the time column (second to last), and the second row shows the data I need, with the time column added. Basically, I only need the rows like the second one, and there are thousands of them. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL
Read the CSV using pandas. Then filter with df[~df.time.isna()] to select all rows with non-NaN values in the "time" column.
You can change this to filter based on the presence of data in any column. Think of it as a mask (i.e. mask = ~df.time.isna() flags rows as True/False depending on the condition).
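A minimal sketch of that filter, assuming the file has a header row and a column literally named "time" (both the file name and the column name are placeholders):
import pandas as pd

df = pd.read_csv('experiment.csv')   # hypothetical file name
mask = ~df['time'].isna()            # True for rows where the "time" column has a value
df_with_time = df[mask]              # keep only the rows that include a response time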
One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:,-1].isnull() # Find rows where the last column is missing (not valid)
df = df[~invalid_rows] # Select only valid rows
If your columns are named, you can use df['column_name'] instead of df.iloc[:,-1].
Of course this means you first load the full dataset, but in many cases that is not a problem.

Number of unique values in each Dask Dataframe column

I have a Dask DataFrame called train which is loaded from a large CSV file, and I would like to count the number of unique values in each column. I can clearly do it for each column separately:
for col in categorical_cols:
    num = train[col].nunique().compute()
    line = f'{col}\t{num}'
    print(line)
However, the above code will go through the huge CSV file once for each column instead of going through it only once. It takes plenty of time, and I want it to be faster. If I were writing it 'by hand', I would certainly do it with a single scan of the file.
Can Dask compute the number of unique values in each column efficiently? Something like the DataFrame.nunique() function in pandas.
You can get the number of unique values in each non-numeric column using .describe():
df.describe(include=['object', 'category']).compute()
If you have category-like columns stored with an int/float dtype, you have to convert those columns to the category dtype before applying .describe() to get unique-count statistics. And obviously, getting the unique count of numeric data is not supported.
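As a rough sketch of how that reads in practice (the file pattern and the cast shown in the comment are placeholders):
import dask.dataframe as dd

ddf = dd.read_csv('train*.csv')  # hypothetical file pattern
# Numeric columns that are really categorical would need a cast first,
# e.g. ddf = ddf.astype({'some_col': 'category'}), otherwise describe() skips them.
stats = ddf.describe(include=['object', 'category']).compute()
print(stats.loc['unique'])  # one unique count per described column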
Have you tried the drop_duplicates() method? Something like this:
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=n)
ddf.drop_duplicates().compute()

How to read a dataset into pandas and leave out rows with uneven column count

I am trying to read a dataset which has a few rows with an uneven column count ('ragged' rows). I want to leave out those rows and read the rest. Is this possible in pandas, instead of breaking the dataset into separate data frames and combining them?
If I understand your question, you have uneven columns but want to drop any rows that don't have every column. If so, simply read the entire data set (read_csv) and then call dropna() on the dataframe. dropna() has a keyword argument called 'how' which defaults to 'any', i.e. drop the row if any of the items in it are NA. (Consider also passing inplace=True.) See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
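A minimal sketch of that suggestion, with a hypothetical file name, assuming the ragged rows read in with NaN in their missing columns:
import pandas as pd

df = pd.read_csv('dataset.csv')      # short rows come in with NaN in the missing columns
df.dropna(how='any', inplace=True)   # drop every row that has at least one NA value
print(df.shape)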

Get duplicated rows in larger-than-memory dataset with pandas [duplicate]

This question already has an answer here:
How to drop duplicated rows using pandas in a big data file?
(1 answer)
Closed 6 years ago.
pandas.DataFrame.duplicated is great for finding duplicate rows across specified columns within a dataframe.
However, my dataset is larger than what fits in memory (and even larger than what I could fit after extending it within reasonable budget limits).
This is fine for most of the analyses that I have to execute, since I can loop over my dataset (csv and dbf files), loading each file into memory on its own and doing everything in sequence. However, for duplicate analysis this is apparently not suitable, because it only finds duplicates within single files, not across the whole dataset.
Is there any algorithm or approach for finding duplicates across multiple dataframes while not having to load them all into memory at the same time?
You can hash the values of the "key" columns and maintain a set of hash codes you already encountered:
import hashlib
import pandas as pd

hash_set = set()  # this will contain the hash codes of all rows seen so far

def is_duplicate(row):
    m = hashlib.md5()
    for c in ["column1", "column2", "column3"]:
        m.update(str(row[c]).encode())  # hashlib expects bytes
    hash_code = m.digest()
    if hash_code in hash_set:
        return 1
    hash_set.add(hash_code)
    return 0

for df_path in [df1_path, df2_path, df3_path]:  # iterate over the dataframes one by one
    df = pd.read_csv(df_path)  # load the dataframe
    df["duplicate"] = df.apply(is_duplicate, axis=1)
    unique_df = df[df["duplicate"] == 0]  # a "globally" unique dataframe
    unique_df.pop("duplicate")  # you don't need this column anymore
    # YOUR CODE...
I would suggest two things.
First, load the data frames into an RDBMS if possible.
Then you can find duplicates by grouping on the key columns.
Second, extract only the key columns from the big files and compare these with each other.
Try to sort the rows over the key columns in the files, so that you can detect a duplicate by only comparing one row with the next (see the sketch below).
Hope that helps.
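A rough sketch of that second suggestion, assuming the key columns alone fit in memory; the file paths and column names are placeholders:
import pandas as pd

# Extract only the key columns from each large file and stack them.
keys = pd.concat(
    [pd.read_csv(path, usecols=['key1', 'key2']) for path in ['part1.csv', 'part2.csv']],
    ignore_index=True,
)

# After sorting, a duplicate is simply a row that equals its neighbour.
keys = keys.sort_values(['key1', 'key2'])
dupes = keys[keys.eq(keys.shift()).all(axis=1)]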

Using Pandas how do I deduplicate a file being read in chunks?

I have a large fixed width file being read into pandas in chunks of 10000 lines. This works great for everything except removing duplicates from the data because the duplicates can obviously be in different chunks. The file is being read in chunks because it is too large to fit into memory in its entirety.
My first attempt at deduplicating the file was to bring in just the two columns needed to deduplicate it and make a list of rows to not read. Reading in just those two columns (out of about 500) easily fits in memory and I was able to use the id column to find duplicates and an eligibility column to decide which of the two or three with the same id to keep. I then used the skiprows flag of the read_fwf() command to skip those rows.
The problem I ran into is that the Pandas fixed width file reader doesn't work with skiprows = [list] and iterator = True at the same time.
So, how do I deduplicate a file being processed in chunks?
My solution was to bring in just the columns needed to find the duplicates I want to drop and make a bitmask based on that information. Then, by knowing the chunksize and which chunk I'm on, I reindex the chunk I'm on so that it matches the correct position it represents in the bitmask. Then I just pass it through the bitmask and the duplicate rows are dropped.
Bring in the entire column to deduplicate on, in this case 'id'.
Then create a bitmask of the rows that AREN'T duplicates. DataFrame.duplicated()
returns a boolean Series marking the rows that are duplicates, and the ~ inverts that. Now we have our 'dupemask'.
dupemask = ~df.duplicated(subset = ['id'])
Then create an iterator to bring the file in in chunks. Once that is done, loop over the iterator and create a new index for each chunk. This new index matches the small chunk dataframe with its position in the 'dupemask' bitmask, which we can then use to keep only the lines that aren't duplicates.
for i, df in enumerate(chunked_data_iterator):
    df.index = range(i*chunksize, i*chunksize + len(df.index))
    df = df[dupemask]
This approach only works in this case because the data is large only because it is so wide; it still has to read in one column in its entirety in order to work.
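Putting the pieces together, a rough sketch of the two-pass loop; the file name, fixed-width layout, chunk size, and output path are all assumptions:
import pandas as pd

chunksize = 10000
colspecs = [(0, 10), (10, 20), (20, 30)]   # hypothetical fixed-width layout
names = ['id', 'eligibility', 'value']     # hypothetical column names

# Pass 1: read only the 'id' column and build the bitmask of non-duplicate rows.
ids = pd.read_fwf('data.txt', colspecs=colspecs[:1], names=names[:1])
dupemask = ~ids.duplicated(subset=['id'])

# Pass 2: read the full file in chunks and keep only the rows flagged in the mask.
chunked_data_iterator = pd.read_fwf('data.txt', colspecs=colspecs, names=names,
                                    chunksize=chunksize)
for i, df in enumerate(chunked_data_iterator):
    df.index = range(i * chunksize, i * chunksize + len(df.index))
    df = df[dupemask]
    df.to_csv('deduplicated.csv', mode='a', header=(i == 0), index=False)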
