Grouping by and performing count on columns WITHOUT pandas in Python

I have a dataframe with multiple columns, including family_ID and user_ID, for a streaming platform. What I'm trying to do is find which family IDs have the most unique users associated with them within this dataframe.
The SQL code for this would be:
SELECT TOP 5 family_id,
Count(distinct user_id) AS user_count
FROM log_edit
WHERE family_id <> ''
GROUP BY family_id
ORDER BY user_count DESC;
Using pandas I can get the same result using:
df.groupby('family_id')['user_id'].nunique().nlargest(5)
My question is, how can I get the same result without using Pandas or SQL at all? I can import the .csv using Pandas but have to do the analysis without it. What's the best way to approach this case?
If the result is an array, I assume it would look something like [1,2,3,4,5] and [9,7,7,7,5], where 1 to 5 are family ids and the second array holds the number of unique user ids registered to each of them (sorted in descending order and limited to 5 results).
Thanks!

Since you put numpy among the tags, I am assuming you might want to use that. In that case, you can use np.unique:
import numpy as np

family_id = [9, 7, 7, 7, 5, 5]
top_k = 2
unique_family_ids, counts = np.unique(family_id, return_counts=True)
# Use - to sort from largest to smallest
sort_idx = np.argsort(-counts)
for idx in sort_idx[:top_k]:
    print(unique_family_ids[idx], 'has', counts[idx], 'unique user_id')
If you want to handle missing ids like in the SQL query you'll have to know how these are encoded exactly...
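Note that the snippet above counts how many rows each family_id has. If you need the exact count(distinct user_id) per family_id without any third-party library at all, here is a minimal standard-library sketch; the file name and column names are assumptions, adjust them to your .csv:
import csv
from collections import defaultdict

# Map each family_id to the set of distinct user_ids seen for it.
users_per_family = defaultdict(set)

with open('log_edit.csv', newline='') as f:          # hypothetical file name
    for row in csv.DictReader(f):
        if row['family_id']:                         # skip empty family_id, like the WHERE clause
            users_per_family[row['family_id']].add(row['user_id'])

# Sort by number of distinct users, descending, and keep the top 5.
top5 = sorted(((fam, len(users)) for fam, users in users_per_family.items()),
              key=lambda pair: pair[1], reverse=True)[:5]
print(top5)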

Related

What is the best way to query a pytable column with many values?

I have an 11-column x 13,470,621-row pytable. The first column of the table contains a unique identifier for each row (this identifier is always present only once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations

# Loop through table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624' ) | (gene_id == b'gene_id_14701' ) | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
Now this works fine with small datasets, but I will need to routinely perform queries in which I have many thousands of unique identifiers to match in the table's gene_id column. For these larger queries, the query string quickly gets very long and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but was not satisfactory.
I come from an R background where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]) and was wondering if there is a comparable solution that I could use with pytables.
Thanks very much,
Another approach to consider is combining 2 functions: Table.get_where_list() with Table.read_coordinates()
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): Gets a set of rows given their coordinates (in a list), and returns as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
gene_name_list = ['gene_id_36624', 'gene_id_14701', 'gene_id_14702']

# Loop through gene names and get rows that match gene identifiers (column labeled gene_id)
gene_row_list = []
for gene_name in gene_name_list:
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)

# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to make some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html)
Create table. Make sure to specify the expectedrows=<int> arg as it has the potential to increase the query speed.
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
# tb comes from import tables as tb ...
I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345).
Once the table is populated with its 13,470,621 entries (i.e. rows),
I created a complete sorted index based on the gene_id column (Column.create_csindex()) and sorted it.
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty list if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd - The table is optimized, but I still can't query thousands of gene_ids at a time. So I simply split them into chunks of 31 gene_ids (yes, 31 was the absolute maximum; 32 was apparently too much).
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
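For reference, a rough sketch of what the chunked querying described above could look like; this is not the original code, and the helper name is made up, but the chunk size of 31 and the integer gene_ids follow the text:
def query_gene_ids(table, gene_ids, chunk_size=31):
    """Query the table in chunks, OR-ing together equality tests on gene_id."""
    results = []
    for start in range(0, len(gene_ids), chunk_size):
        chunk = gene_ids[start:start + chunk_size]
        # Build a condition like "(gene_id == 123) | (gene_id == 456) | ..."
        condition = ' | '.join('(gene_id == %d)' % g for g in chunk)
        results.extend(table.read_where(condition))
    return results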

Count how many times a value occurs in an sqlite3 database table column

I have been performing a query to count how many times in my sqlite3 database table (Users), within the column "country", the value "Australia" occurs.
australia = db.session.query(Users.country).filter_by(country="Australia").count()
I need to do this in a more dynamic way for any country value that may be within this column.
I have tried the following but unfortunately I only get a count of 0 for all values that are passed in the loop variable (each).
country = list(db.session.query(Users.country))
country_dict = list(set(country))
for each in country_dict:
    print(db.session.query(Users.country).filter_by(country=(str(each))).count())
Any assistance would be greatly appreciated.
The issue is that country is a list of result tuples, not a list of strings. The end result is that the value of str(each) is something along the lines of ('Australia',), which should make it obvious why you are getting counts of 0 as results.
For when you want to extract a list of single column values, see here. When you want distinct results, use DISTINCT in SQL.
But you should not first query distinct countries and then fire a query to count the occurrence of each one. Instead use GROUP BY:
country_counts = db.session.query(Users.country, db.func.count()).\
    group_by(Users.country).\
    all()

for country, count in country_counts:
    print(country, count)
The main thing to note is that SQLAlchemy does not hide the SQL when using the ORM, but works with it.
If you can use the sqlite3 module with direct SQL it is a simple query:
curs = con.execute("SELECT COUNT(*) FROM users WHERE country=?", ("Australia",))
nb = curs.fetchone()[0]
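And if counts for every country are wanted at once, the GROUP BY form of the same idea in plain sqlite3 would look roughly like this, reusing the con connection from above:
# One query returns (country, count) pairs for all countries, most frequent first.
for country, nb in con.execute(
        "SELECT country, COUNT(*) AS nb FROM users GROUP BY country ORDER BY nb DESC"):
    print(country, nb)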

Getting distinct count of column on a SQLAlchemy query object?

Given a sqlalchemy.orm.query.Query object, is it possible to count distinct column on it? I am asking because .count() returns dupes due to the join conditions.
For instance:
from sqlalchemy import func, distinct

channels = db.session.query(Channel).join(ChannelUsers).filter(
    ChannelUsers.user_id == USER_ID,
    Message.channel_id.isnot(None)
).outerjoin(Message)

# this gives us a number with duplicate channels
# and .count() does not take extra parameters to target a column
channels.count()
...
# later on I need to access all these channels via channels.all()
To get a distinct channel count, I can duplicate the filter condition above and query the distinct column. Something like this:
distinct_count = db.session.query(
    func.count(distinct(Channel.id))
).join(ChannelUsers).filter(
    ChannelUsers.user_id == USER_ID,
    Message.channel_id.isnot(None)
).outerjoin(Message)
But that's not ideal as I need to access some or all channels after getting the distinct count.
Found this looking for the answer myself. After some more research, I was able to get the expected result using a combination of load_only and distinct in order to count only distinct values of an ID field. Let's say for simplicity that Channel has a unique field named id.
distinct_count = channels.options(load_only(Channel.id)).distinct().count()
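For completeness, a rough sketch of how that fits with the channels query from the question, assuming load_only is imported from sqlalchemy.orm:
from sqlalchemy.orm import load_only

# Count distinct channels without repeating the filter conditions...
distinct_count = channels.options(load_only(Channel.id)).distinct().count()

# ...while the original query object stays available for fetching the rows.
all_channels = channels.all()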

SQLAlchemy: filter rows based on the values contained in cells of other columns

I am new to Python and SQLAlchemy, and I came across this doubt: can we filter rows of a table based on the cell values of a column of the same table?
example:
Sbranch = value
result = (Transaction.query.filter(Transaction.branch == Sbranch)
          .order_by(desc(Transaction.id)).limit(50).all())
If the value of Sbranch is 0, I want to read all the rows regardless of branch; otherwise I want only the rows where Transaction.branch == Sbranch.
I know that it can be achieved by comparing the value of Sbranch (if-else conditions), but it gets complicated as the number of such columns increases.
Example:
Sbranch = value1
trans_by = value2
trans_to = value3
.
.
result = (Transaction.query.filter(Transaction.branch == Sbranch,
                                   Transaction.trans_by == trans_by,
                                   Transaction.trans_to == trans_to)
          .order_by(desc(Transaction.id)).limit(50).all())
I want to apply a similar filter with all 3 columns.
I want to know if there is any built-in function in SQLAlchemy for this problem.
You can optionally add the filter based on the value of SBranch
query = Transaction.query
if SBranch != 0:
    query = query.filter(Transaction.branch == SBranch)
result = query.order_by(Transaction.id.desc()).limit(50).all()
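The same idea extends to the other optional columns in the question. A rough sketch, assuming 0 is the "no filter" sentinel for each variable (the column and variable names follow the question's example):
query = Transaction.query

# Apply each filter only when its value is not the "match everything" sentinel.
optional_filters = [
    (Transaction.branch, Sbranch),
    (Transaction.trans_by, trans_by),
    (Transaction.trans_to, trans_to),
]
for column, value in optional_filters:
    if value != 0:
        query = query.filter(column == value)

result = query.order_by(Transaction.id.desc()).limit(50).all()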
I think I found a solution; it's not the best, but it will reduce the work for the developers (not the processor).
Sbranch = value
branches = []
if Sbranch == 0:
    # Append all the values into the array for which the rows are filtered
    # for example:
    branches = [1, 2, 4, 7, 3, 8]
else:
    branches.append(Sbranch)
result = (Transaction.query.filter(Transaction.branch.in_(branches))
          .order_by(desc(Transaction.id)).limit(50).all())

pandas update multiple fields

I am trying to add and update multiple columns in a pandas dataframe using a second dataframe. The problem is that when the number of columns I want to add doesn't match the number of columns in the base dataframe, I get the following error: "Shape of passed values is (2, 3), indices imply (2, 2)"
A simplified version of the problem is below
from pandas import DataFrame

tst = DataFrame({"One": [1, 2], "Two": [2, 4]})

def square(row):
    """
    for each row in the table return multiple calculated values
    """
    a = row["One"]
    b = row["Two"]
    return a ** 2, b ** 2, b ** 3

# create three new fields from the data
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
If the number of fields being added matches the number already in the table, the operation works as expected.
tst = DataFrame({"One": [1, 2], "Two": [2, 4]})

def square(row):
    """
    for each row in the table return multiple calculated values
    """
    a = row["One"]
    b = row["Two"]
    return a ** 2, b ** 2

# create two new fields from the data
tst[["One^2", "Two^2"]] = tst.apply(square, axis=1)
I realise I could do each field separately, but in the actual problem I am trying to solve I perform a join between the table being updated and an external table within the "updater" (i.e. square) and want to be able to grab all the required information at once.
Below is how I would do it in SQL. Unfortunately the two dataframes contain data from different database technologies, hence why I have to perform the operation in pandas.
update tu
set tu.a_field = upd.the_field_i_want,
    tu.another_field = upd.the_second_required_field
from to_update tu
inner join the_updater upd
    on tu.item_id = upd.item_id
    and tu.date between upd.date_from and upd.date_to
Here you can see the exact details of what I am trying to do. I have a table "to_update" that contains point-in-time information against an item_id. The other table "the_updater" contains date range information against the item_id. For example a particular item_id may sit with customer_1 from DateA to DateB and with customer_2 between DateB and DateC etc. I want to be able to align information from the table containing the date ranges against the point-in-time table.
Please note a merge won't work due to problems with the data (this is actually being written as part of a data quality test). I really need to be able to replicate the functionality of the update statement above.
I could obviously do it as a loop but I was hoping to use the pandas framework where possible.
Declare an empty column in the dataframe and assign it to zero:
tst["Two^3"] = 0
Then do the respective operations for that column, along with the other columns:
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
Try printing it:
print(tst.head(5))
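Not part of the original answer, but on pandas versions that support it (0.23 and later), passing result_type='expand' to apply sidesteps the shape mismatch without pre-declaring the extra column:
# Tuple results from square are expanded into the three target columns.
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1, result_type="expand")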
