How to SELECT DISTINCT from a pandas HDFStore? - python

I have a large amount of data in an HDFStore (as a table), on the order of 80M rows with 1500 columns. Column A has integer values ranging between 1 and 40M or so. The values in column A are not unique and there may be between 1 and 30 rows with the same column A value. In addition, all rows which share a common value in column A will also have a common value in column B (not the same value as column A though).
I would like to do a select against the table to get a list of column A values and their corresponding column B values. The equivalent SQL statement would be something like SELECT DISTINCT ColA, ColB FROM someTable. What are some ways to achieve this? Can it be done such that the results of the query are stored directly into another table in the HDF5Store?

Blocked Algorithms
One solution would be to look at dask.dataframe which implements a subset of the Pandas API with blocked algorithms.
import dask.dataframe as dd
df = dd.read_hdf('myfile.hdf5', '/my/data', columns=['A', 'B'])
result = df.drop_duplicates().compute()
In this particular case dd.DataFrame.drop_duplicates would pull out a medium-sized block of rows, perform the pd.DataFrame.drop_duplicates call and store the (hopefully smaller) result. It would do this for all blocks, concatenate them, and then perform a final pd.DataFrame.drop_duplicates on the concatenated intermediate result. You could also do this with just a for loop. Your case is a bit odd in that you also have a large number of unique elements. This might still be a challenge to compute even with blocked algorithms. Worth a shot though.
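If you prefer to stay with plain Pandas, a minimal sketch of that for-loop approach might look like this (assuming the table node is at '/my/data' as in the dask snippet above, and that the columns really are named A and B):
import pandas as pd

partials = []
with pd.HDFStore('myfile.hdf5', mode='r') as store:
    # chunksize turns select() into an iterator over blocks of rows
    for chunk in store.select('/my/data', columns=['A', 'B'], chunksize=1_000_000):
        partials.append(chunk.drop_duplicates())

# de-duplicate once more across the concatenated per-chunk results
result = pd.concat(partials, ignore_index=True).drop_duplicates()

# optionally write the result straight back into the store as a new table
result.to_hdf('myfile.hdf5', key='distinct_ab', format='table')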
Column Store
Alternatively, you should consider looking into a storage format that can store your data as individual columns. This would let you collect just the two columns that you need, A and B, rather than having to wade through all of your data on disk. Arguably you should be able to fit two columns of 80 million rows into a single Pandas dataframe in memory. You could consider bcolz for this.

To be clear, you tried something like this and it didn't work?
import pandas
import tables
import pandasql
Check that your store is the type you think it is:
in: store
out: <class 'pandas.io.pytables.HDFStore'>
You can select a table from a store like this:
df = store.select('tablename')
Check that it worked:
in: type(df)
out: pandas.core.frame.DataFrame
Then you can do something like this:
q = """SELECT DISTINCT region, segment FROM df"""
distinct_df = pandasql.sqldf(q, locals())
(note that you will get deprecation warnings doing it this way, but it works)
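Putting that together for the original question, a rough sketch might look like the following (assuming the store key really is 'someTable' and the columns are named ColA and ColB; adjust the names to your data):
import pandas as pd
import pandasql

store = pd.HDFStore('myfile.hdf5', mode='r')
df = store.select('someTable', columns=['ColA', 'ColB'])  # pull just the two columns
store.close()

q = """SELECT DISTINCT ColA, ColB FROM df"""
distinct_df = pandasql.sqldf(q, locals())

# write the distinct pairs into another table in the same file
distinct_df.to_hdf('myfile.hdf5', key='distinct_pairs', format='table')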

Related

Pandas: how to keep data that has all the needed columns

I have this big csv file that has data from an experiment. The first part of each person's responses is a trial part that doesn't record the time they took for each response, and I don't need that. After that part, the data adds another column, which is the time, and those are the rows I need. So, basically, the csv has a lot of unusable data that has 9 columns instead of 10, and I only need the data with the 10 columns. How can I manage to grab that data instead of all of it?
As an example, the first row below shows the data without the time column (second to last) and the second row shows the data I need, with the time column added. Basically I only need rows like the second one, and there are thousands of them. Any tips would be appreciated.
1619922425,5fe43773223070f515613ba23f3b770c,PennController,7,0,experimental-trial2,NULL,PennController,9,_Trial_,End,1619922289638,FLOR, red, r,NULL
1619922425,5fe43773223070f515613ba23f3b770c,PennController,55,0,experimental-trial,NULL,PennController,56,_Trial_,Start,1619922296066,CASA, red, r,1230,NULL
Read the CSV using pandas, then filter by using df[~df.time.isna()] to select all rows with non-NaN values in the "time" column.
You can change this to filter based on the presence of data in any column. Think of it as a mask: mask = ~df.time.isna() flags each row as True/False depending on whether the condition is met.
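A minimal sketch of that idea, assuming the file parses into a DataFrame with a column actually named "time" (swap in your real column name):
import pandas as pd

df = pd.read_csv("your_file.csv")

mask = ~df["time"].isna()   # True for rows that do have a reaction time
df_with_time = df[mask]     # keep only those rows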
One option is to load the whole file and then keep only valid data:
import pandas as pd
df = pd.read_csv("your_file.csv")
invalid_rows = df.iloc[:, -1].isnull()  # find rows where the last column is missing
df = df[~invalid_rows]  # keep only the valid rows
If your columns are named, you can use df['column_name'] instead of df.iloc[:, -1].
Of course it means you first load the full dataset, but in many cases this is not a problem.

Filter nan values out of rows in pandas

I am working on a calculator to determine what to feed your fish as a fun project to learn python, pandas, and numpy.
My data is organized like this:
As you can see, my fish are the rows, and the different foods are the columns.
What I am hoping to do, is have the user (me) input a food, and have the program output to me all those values which are not nan.
The reason why I would prefer to leave them as nan rather than 0, is that I use different numbers in different spots to indicate preference. 1 is natural diet, 2 is ok but not ideal, 3 is live only.
Is there any way to do this using pandas? Everywhere I look online helps me filter rows out of columns, but it is quite difficult to find info on filtering columns out of rows.
Currently, my code looks like this:
import pandas as pd
import numpy as np
df = pd.read_excel(r'C:\Users\Daniel\OneDrive\Documents\AquariumAiMVP.xlsx')
clownfish = df[0:1]
angelfish = df[1:2]
damselfish = df[2:3]
So, as you can see, I haven't really gotten anywhere yet. I tried filtering out the nulls using the following idea:
clownfish_wild_diet = pd.isnull(df.clownfish)
But it results in an error, saying:
AttributeError: 'DataFrame' object has no attribute 'clownfish'
Thanks for the help guys. I'm a total pandas noob so it is much appreciated.
You can use masks in pandas:
food = 'Amphipods'
mask = df[food].notnull()
result_set = df[mask]
df[food].notnull() returns a mask (a Series of boolean values indicating if the condition is met for each row), and you can use that mask to filter the real DF using df[mask].
Usually you can combine these two lines into more Pythonic code, but that's up to you:
result_set = df[df[food].notnull()]
This returns a new DF with the subset of rows that meet the condition (including all columns from the original DF), so you can use other operations on this new DF (e.g. selecting a subset of columns, dropping other missing values, etc.).
See more about .notnull(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html
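As a small follow-up sketch of those extra operations, assuming the fish names live in a column called "fish" (adjust to your sheet's real layout):
food = 'Amphipods'
result_set = df[df[food].notnull()]

# keep just the fish name and the preference score for the chosen food
preferences = result_set[['fish', food]].sort_values(food)
print(preferences)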

pyspark Drop rows in dataframe to only have X distinct values in one column

So I have a dataframe with a column "Category" that has over 12k distinct values. For sampling purposes I would like to get a small sample where there are only 1000 different values of this category column.
Before I was doing:
small_distinct = df.select("category").distinct().limit(1000).rdd.flatMap(lambda x: x).collect()
df = df.where(col("category").isin(small_distinct))
I know this is extremely inefficient as I'm doing a distinct of the category column and then collecting it into a normal Python list so I can use the isin() filter.
Is there any "spark" way of doing this? I thought maybe something with window functions could do the job, but I can't get it to work.
Thanks!
You can improve your code using a left_semi join:
small_distinct = df.select("category").distinct().limit(1000)
df = df.join(small_distinct, "category", "left_semi")
Using a left_semi join is a good way to filter one table using another table, keeping the same schema, in an efficient way.
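A self-contained sketch of that approach, assuming an active SparkSession and a DataFrame df that has a "category" column (the input path below is just a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("path/to/your/data")  # placeholder input

# take (at most) 1000 distinct category values
small_distinct = df.select("category").distinct().limit(1000)

# left_semi keeps only rows of df whose category appears in small_distinct,
# without adding any columns from the right-hand side
sampled = df.join(small_distinct, on="category", how="left_semi")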

Number of unique values in each Dask Dataframe column

I have a Dask Dataframe called train which is loaded from a large CSV file, and I would like to count the number of unique values in each column. I can clearly do it for each column separately:
for col in categorical_cols:
    num = train[col].nunique().compute()
    line = f'{col}\t{num}'
    print(line)
However, the above code will go through the huge CSV file once for each column, instead of going through the file only once. It takes plenty of time, and I want it to be faster. If I were to write it 'by hand', I would certainly do it with one scan of the file.
Can Dask compute the number of unique values in each column efficiently? Something like the DataFrame.nunique() function in Pandas.
You can get the number of unique values in each non-numeric column using .describe():
df.describe(include=['object', 'category']).compute()
If you have category columns with dtype int/float, you would have to convert those columns to categories before applying .describe() to get unique-count statistics. And obviously, getting the unique count of numeric data is not supported.
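A sketch of that conversion step, assuming categorical_cols holds the names of the integer-coded columns (as in the question) and the data comes from a CSV:
import dask.dataframe as dd

train = dd.read_csv("train.csv")

# cast the integer-coded columns to 'category' so describe() treats them as categorical
train = train.astype({col: "category" for col in categorical_cols})

stats = train.describe(include=["category"]).compute()
print(stats)  # the 'unique' row holds the per-column unique counts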
Have you tried the drop_duplicates() method? Something like this:
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=n)  # df is your pandas DataFrame, n the number of partitions
ddf.drop_duplicates().compute()

Consolidating a ton of mostly blank columns into one column (Postgres and Python)

I have a raw data table with over 500 columns that I am importing to a different database. Most of these columns are null (for example: session1, session2, session3 ~ session120). I didn't design this table, but there are 3 column types with over 100 columns each. Most would not need to be used unless it was for some very specific analysis or investigation (if ever).
Is there a nice way to combine these columns into a consolidated column which can be 'unpacked' later? I don't want to lose the information in case there is something important.
Here is my naive approach (using pandas to modify the raw data before inserting it into postgres):
column_list = []
for val in range(10, 120):
    column_list.append('session' + str(val))
df['session_10_to_120'] = df[column_list].astype(str).sum(axis=1).replace('', ',', regex=True)
for col in column_list:
    df.drop(col, axis=1, inplace=True)
I don't want to mess up my COPY statements to postgres (where it might think that the commas are separate columns).
Any recommendations? What is the best practice here?
It depends on what you want to do with these columns, but options include:
arrays
non-relational storage: hstore, json, xml (see the sketch after this list for the json option)
turning the columns into rows in another table
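For the json option, a rough sketch on the pandas side might look like this (assuming the sparse columns really are session10 through session119, as in the naive approach above; the resulting string can then be loaded into a json/jsonb column):
import json
import pandas as pd

session_cols = ['session' + str(i) for i in range(10, 120)]

def pack_sessions(row):
    # keep only the sessions that actually have a value
    present = {col: row[col] for col in session_cols if pd.notna(row[col])}
    # default=str keeps numpy scalars and other odd types serializable
    return json.dumps(present, default=str)

df['sessions_json'] = df[session_cols].apply(pack_sessions, axis=1)
df = df.drop(columns=session_cols)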
