I have a Dask DataFrame called train, loaded from a large CSV file, and I would like to count the number of unique values in each column. I can clearly do it for each column separately:
for col in categorical_cols:
    num = train[col].nunique().compute()
    line = f'{col}\t{num}'
    print(line)
However, the above code goes through the huge CSV file once per column instead of only once overall. It takes a lot of time, and I want it to be faster. If I wrote it 'by hand', I would certainly do it with a single scan of the file.
Can Dask compute the number of unique values in each column efficiently? Something like DataFrame.nunique() function in Pandas.
You can get the unique number of values in each non-numeric column using .describe()
df.describe(include=['object', 'category']).compute()
If you have categorical columns stored with an int/float dtype, you have to convert those columns to the category dtype before applying .describe() to get unique-count statistics. .describe() does not report a unique count for columns left as numeric.
Have you tried the drop_duplicates() method? Something like this:
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=n)
ddf.drop_duplicates().compute()
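The "one scan of the file" idea from the question can be sketched by hand with plain pandas chunks. This is only a minimal sketch, not a Dask feature: the helper name nunique_one_pass and the sample data are made up for illustration, and it assumes the distinct values of each column fit in memory.

```python
import io
import pandas as pd

def nunique_one_pass(csv_source, columns, chunksize=100_000):
    """Count distinct non-null values per column, reading the CSV only once."""
    seen = {col: set() for col in columns}
    # Each chunk is an ordinary DataFrame; only the requested columns are parsed.
    for chunk in pd.read_csv(csv_source, usecols=columns, chunksize=chunksize):
        for col in columns:
            # dropna() mirrors the default behaviour of Series.nunique()
            seen[col].update(chunk[col].dropna().unique())
    return {col: len(values) for col, values in seen.items()}

# Hypothetical in-memory stand-in for the large CSV file
sample = io.StringIO("a,b,c\nx,1,u\ny,1,u\nx,2,u\n")
counts = nunique_one_pass(sample, ["a", "b"], chunksize=2)
print(counts)
```

Within Dask itself, another thing to try is building all the lazy nunique() results first and passing them to a single dask.compute(...) call, since tasks shared between the results only need to run once.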
I have this big csv file that has data from an experiment. The first part of each person's response is a trial part that doesn't have the time they took for each response and I don't need that. After that part, the data adds another column which is the time, and those are the rows I need. So, basically, the csv has a lot of unusable data that has 9 columns instead of 10 and I need only the data with the 10 columns. How can I manage to grab that data instead of all of it?
As an example of it, the first row shows the data without the time column (second to last) and the second row the data I need with the time column added. I only need all the second rows basically, which is thousands of them. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL
Read the CSV using pandas. Then filter by using df[~df.time.isna()] to select all rows with non NaN values in the "time" column.
You can change this to filter based on the presence of data in any column. Think of it as a mask, i.e. mask = ~df.time.isna() flags rows as True/False depending on the condition.
One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:,-1].isnull() # Find rows, where last column is not valid (missing)
df = df[~invalid_rows] # Select only valid rows
If you have columns named, then you can use df['column_name'] instead of df.iloc[:,-1].
Of course it means you first load the full dataset, but in many cases this is not a problem.
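If the short and long rows make the file awkward to parse in one go, another option is to filter on the number of fields before building the dataframe. The sketch below uses the standard csv module. The helper name rows_with_field_count is made up, and the expected count of 17 is simply what the second sample row in the question contains, so adjust it to your file.

```python
import csv
import io
import pandas as pd

def rows_with_field_count(lines, expected_fields):
    """Keep only the parsed CSV rows that have exactly `expected_fields` fields."""
    return [row for row in csv.reader(lines) if len(row) == expected_fields]

# The two sample rows from the question, standing in for the real file
raw = io.StringIO(
    "1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,"
    "experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,"
    "FLOR, red, r,NULL\n"
    "1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,"
    "experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,"
    "CASA, red, r,1230,NULL\n"
)

rows = rows_with_field_count(raw, expected_fields=17)
df = pd.DataFrame(rows)  # columns are unnamed here; assign names as needed
print(df.shape)
```

With a real file, you would pass an open file object (opened with newline="") instead of the StringIO stand-in.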
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR, they both are just similar (not equal) in this ID column, the rest of the columns are different as well as the number of rows.
How can I get the rows from dataframe MR['ID'] that are equal to the dataframe DT['ID']? Knowing that values in 'ID' can appear several times in the same column.
DT has 1538 rows and MR has 2060 rows.
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
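The difference between the two approaches shows up when IDs repeat. Here is a small sketch with made-up data (the column names mr_value and dt_value are hypothetical): isin() keeps each matching MR row once, while merge() produces one row per matching MR/DT pair.

```python
import pandas as pd

# Toy frames; only the 'ID' column name comes from the question
MR = pd.DataFrame({"ID": [1, 1, 2, 3], "mr_value": ["a", "b", "c", "d"]})
DT = pd.DataFrame({"ID": [1, 3, 3], "dt_value": ["x", "y", "z"]})

# Filter: rows of MR whose ID appears anywhere in DT (MR's columns only)
filtered = MR.loc[MR["ID"].isin(DT["ID"])]

# Inner join: every MR row paired with every DT row sharing the same ID
merged = pd.merge(MR, DT, on="ID")

print(len(filtered), len(merged))
```

So if you only want to select rows of MR, isin() is usually the closer fit. merge() is for when you also need DT's columns.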
I'm trying to organize the columns in the dataframe based on datatype. I thought I'd do this by using .loc to isolate the columns of each datatype and then append them to each other to get one large organized dataset.
import numpy as np
import pandas as pd
control = pd.read_csv(loan_path, chunksize=1000)
control = pd.concat(control, ignore_index=True)
int_columns= control.loc[:, control.dtypes==int]
I expect a new dataset with every row and only the columns that have integer datatypes. Instead I get the index of every row but 0 columns.
I know there are columns with integer datatypes. I've also tried looking for categories and floats and always get the same wrong result
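One thing worth trying here is DataFrame.select_dtypes, which matches whole families of dtypes rather than one exact dtype. A possible cause of the empty result (an assumption, since the data isn't shown) is that the builtin int does not compare equal to the exact integer dtype of the columns on your platform. The column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data
control = pd.DataFrame({
    "loan_id": [1, 2, 3],
    "rate": [0.1, 0.2, 0.3],
    "grade": ["A", "B", "C"],
})

# np.integer / np.floating match any integer / float width (int32, int64, ...)
int_columns = control.select_dtypes(include=[np.integer])
float_columns = control.select_dtypes(include=[np.floating])

# Reassemble with the column groups ordered by dtype
organized = pd.concat([int_columns, float_columns], axis=1)
print(list(organized.columns))
```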
I have two CSVs, each with about 1M lines and the same n columns. I want the most efficient way to compare the two files to find where any differences may lie. I would prefer to parse this data with Python rather than use any Excel-related tools.
Are you using pandas?
import pandas as pd
df = pd.read_csv('file1.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, pd.read_csv('file2.csv')], ignore_index=True)
# boolean Series indicating which rows are duplicated
df.duplicated()
# dataframe with only unique rows
df[~df.duplicated()]
# dataframe with only duplicate rows
df[df.duplicated()]
# number of duplicate rows present
df.duplicated().sum()
An efficient way would be to read each line from the first file (the one with fewer lines) and save it in an object like a set or dictionary, which gives O(1) average-time lookups.
And then read lines from the second file and check if it exists in the Set or not.
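A minimal sketch of that set-based approach follows. The function name and the sample lines are made up, and it compares whole lines as text, so rows that differ only in formatting (quoting, whitespace) count as different.

```python
import io

def lines_only_in_second(first_lines, second_lines):
    """Return the lines of the second input that never occur in the first."""
    seen = {line.rstrip("\n") for line in first_lines}  # O(1) average lookups
    missing = []
    for line in second_lines:
        stripped = line.rstrip("\n")
        if stripped not in seen:
            missing.append(stripped)
    return missing

# In-memory stand-ins for the two CSV files
file1 = io.StringIO("id,name\n1,alice\n2,bob\n")
file2 = io.StringIO("id,name\n1,alice\n2,bobby\n3,carol\n")

diff = lines_only_in_second(file1, file2)
print(diff)
```

With real files, pass the two open file objects instead. Running the function a second time with the arguments swapped gives the lines that exist only in the first file.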
I have a large amount of data in an HDFStore (as a table), on the order of 80M rows with 1500 columns. Column A has integer values ranging between 1 and 40M or so. The values in column A are not unique and there may be between 1 and 30 rows with the same column A value. In addition, all rows which share a common value in column A will also have a common value in column B (not the same value as column A though).
I would like to do a select against the table to get a list of column A values and their corresponding column B values. The equivalent SQL statement would be something like SELECT DISTINCT ColA, ColB FROM someTable. What are some ways to achieve this? Can it be done such that the results of the query are stored directly into another table in the HDFStore?
Blocked Algorithms
One solution would be to look at dask.dataframe which implements a subset of the Pandas API with blocked algorithms.
import dask.dataframe as dd
df = dd.read_hdf('myfile.hdf5', '/my/data', columns=['A', 'B'])
result = df.drop_duplicates().compute()
In this particular case dd.DataFrame.drop_duplicates would pull out a medium-sized block of rows, perform the pd.DataFrame.drop_duplicates call and store the (hopefully smaller) result. It would do this for all blocks, concatenate them, and then perform a final pd.DataFrame.drop_duplicates on the concatenated intermediate result. You could also do this with just a for loop. Your case is a bit odd in that you also have a large number of unique elements. This might still be a challenge to compute even with blocked algorithms. Worth a shot though.
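The "just a for loop" variant could look roughly like the sketch below. The helper name distinct_pairs is made up, and the chunks are toy in-memory frames. With a real HDFStore table, the chunks might instead come from something like store.select('tablename', columns=['A', 'B'], chunksize=...), which is an assumption about your setup.

```python
import pandas as pd

def distinct_pairs(chunks, columns=("A", "B")):
    """Blocked drop_duplicates: dedupe each chunk, then dedupe the combined result."""
    cols = list(columns)
    partial = []
    for chunk in chunks:
        # Each block shrinks to its own distinct (A, B) pairs
        partial.append(chunk[cols].drop_duplicates())
    # Pairs can still repeat across blocks, so one final pass is needed
    combined = pd.concat(partial, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)

# Toy chunks standing in for blocks read from disk
chunks = [
    pd.DataFrame({"A": [1, 1, 2], "B": [10, 10, 20], "C": [0.1, 0.2, 0.3]}),
    pd.DataFrame({"A": [2, 3], "B": [20, 30], "C": [0.4, 0.5]}),
]
result = distinct_pairs(chunks)
print(result)
```

The intermediate results still have to fit in memory, so with tens of millions of distinct values this may remain heavy, as noted above.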
Column Store
Alternatively you should consider looking into a storage format that can store your data as individual columns. This would let you collect just the two columns that you need, A and B, rather than having to wade through all of your data on disk. Arguably you should be able to fit 80 million rows into a single Pandas dataframe in memory. You could consider bcolz for this.
To be clear, you tried something like this and it didn't work?
import pandas
import tables
import pandasql
Check that your store is the type you think it is:
in: type(store)
out: <class 'pandas.io.pytables.HDFStore'>
You can select a table from a store like this:
df = store.select('tablename')
Check that it worked:
in: type(df)
out: pandas.core.frame.DataFrame
Then you can do something like this:
q = """SELECT DISTINCT region, segment FROM df"""
distinct_df = (pandasql.sqldf(q, locals()))
(note that you will get deprecation warnings doing it this way, but it works)