I'm trying to join two tables in SQLite on unique case ID numbers.
If I try running this:
SELECT *
FROM collisions c
INNER JOIN parties p ON c.case_id = p.case_id
WHERE date(collision_date) BETWEEN date('2020-01-01') and date('2021-12-31')
I get an error saying the database/disk is full.
I managed to get around it by creating two dataframes from queries filtered on ranges of unique ID numbers, like this:
query_two = """
SELECT *
FROM parties
WHERE date(case_id) BETWEEN '3982887' and '3984887'
"""
query = """
SELECT *
FROM collisions
WHERE date(case_id) BETWEEN '3982887' and '3984887'
"""
and merging them together like this
concat = pd.merge(df_one, df_two, on = 'case_id', how = 'inner')
But this gives me what amounts to a random sample, and it so happens that these case IDs include collisions from 2007.
I want to be more specific and join only cases with a specific date range of 2020-01-01 to 2021-12-31.
Note: The parties table doesn't have collision_date - so the only way to join both tables is on case_id.
Is there a workaround to this?
Thanks!
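One possible workaround to sketch here (assuming the column names from the question and a hypothetical database file name): keep the date filter in the SQL join itself, select only the columns you actually need instead of *, and stream the result in chunks so the whole joined result never has to be materialized at once.
import sqlite3
import pandas as pd

conn = sqlite3.connect("collisions.db")  # hypothetical file name

query = """
SELECT p.*, c.collision_date
FROM collisions AS c
INNER JOIN parties AS p ON p.case_id = c.case_id
WHERE date(c.collision_date) BETWEEN date('2020-01-01') AND date('2021-12-31')
"""

# Read the joined result in manageable chunks rather than all at once.
chunks = pd.read_sql_query(query, conn, chunksize=50000)
df = pd.concat(chunks, ignore_index=True)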
I have two querysets:
A = Bids.objects.filter(*args, **kwargs).annotate(
    highest_priority=Case(
        *[
            When(data_source=data_source, then=Value(i))
            for i, data_source in enumerate(data_source_order_list)
        ]
    )
).order_by(
    "date",
    "highest_priority"
)
B = A.values("date").annotate(Min("highest_priority")).order_by("date")
The first query gives me all objects within the selected time range, with the proper data sources and values. Through highest_priority I set which item should be selected. All items have additional data.
The second query gives me grouped-by information about the items on every date. In the second query I do not have important values like price etc., so I assume I have to join these two querysets and filter on a.highest_priority = b.highest_priority, because in that case I will get a queryset with full objects and only one item per date.
I have tried using distinct - it does not work with .first()/.last(). Annotate gives me dicts from the group-by, and grouping by only the date cuts out a lot of important data, but I have to group by only the date...
Tables A and B look like this (contents omitted here).
How do I join them? Once they are joined I can easily filter a.highest_priority against b.highest_priority and get my data with only one database hit. I want to use the ORM; otherwise I could just use distinct and put the results in a list, but I do not want to hammer the database by chaining multiple queries per date.
See if this suggestion works:
SELECT *, (to_char(a.date, 'YYYYMMDD')::integer) * a.highest_priority AS prioritycalc
FROM a
JOIN b
  ON (to_char(a.date, 'YYYYMMDD')::integer) * a.highest_priority
   = (to_char(b.date, 'YYYYMMDD')::integer) * b.highest_priority
ORDER BY prioritycalc DESC;
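If staying inside the ORM is preferred, a rough sketch (assuming Django 2.0+ and that A is the annotated queryset from the question) keeps only the row with the lowest highest_priority per date via a correlated subquery:
from django.db.models import OuterRef, Subquery

# For each outer row, find the pk of the best (lowest highest_priority) row on the same date.
best_per_date = (
    A.filter(date=OuterRef("date"))
    .order_by("highest_priority")
    .values("pk")[:1]
)

# Keep only the rows that win for their date; all model fields remain available.
one_per_date = A.filter(pk=Subquery(best_per_date))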
I have a dataframe with multiple columns, including family_ID and user_ID, for a streaming platform. What I'm trying to do is find which family IDs have the most unique users associated with them within this dataframe.
The SQL code for this would be:
SELECT TOP 5 family_id,
Count(distinct user_id) AS user_count
FROM log_edit
WHERE family_id <> ''
GROUP BY family_id
ORDER BY user_count DESC;
Using pandas I can get the same result using:
df.groupby('family_id')['user_id'].nunique().nlargest(5)
My question is, how can I get the same result without using Pandas or SQL at all? I can import the .csv using Pandas but have to do the analysis without it. What's the best way to approach this case?
If it's an array, I assume the result would be something like [1,2,3,4,5] [9,7,7,7,5], where 1-5 are family IDs and the other array holds the number of user IDs registered to them (sorted in descending order and limited to 5 results).
Thanks!
Since you put numpy among the tags, I am assuming you might want to use that. In that case, you can use np.unique:
import numpy as np
family_id = [9,7,7,7,5,5]
top_k = 2
unique_family_ids, counts = np.unique(family_id, return_counts=True)
# Use - to sort from largest to smallest
sort_idx = np.argsort(-counts)
for idx in sort_idx[:top_k]:
    print(unique_family_ids[idx], 'has', counts[idx], 'unique user_id')
If you want to handle missing ids like in the SQL query you'll have to know how these are encoded exactly...
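One caveat: the count above is over rows of family_id, so it matches COUNT(DISTINCT user_id) only if each (family_id, user_id) pair appears once. If the raw data has repeats, a rough sketch (with made-up example arrays) is to deduplicate the pairs first and then count per family:
import numpy as np

# Hypothetical (family_id, user_id) pairs; in practice these would come from the CSV.
family_id = np.array([1, 1, 1, 2, 2, 3])
user_id = np.array([10, 10, 11, 12, 13, 14])

# Drop duplicate (family_id, user_id) pairs, then count distinct users per family.
pairs = np.unique(np.stack([family_id, user_id], axis=1), axis=0)
families, user_counts = np.unique(pairs[:, 0], return_counts=True)

top_k = 5
order = np.argsort(-user_counts)[:top_k]
print(families[order], user_counts[order])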
Imagine one has two SQL tables
objects_stock
id | number
and
objects_prop
id | obj_id | color | weight
that should be joined on objects_stock.id=objects_prop.obj_id, hence the plain SQL-query reads
select * from objects_prop join objects_stock on objects_stock.id = objects_prop.obj_id;
How can this query be performed with SQLAlchemy such that all returned columns of this join are accessible?
When I execute
query = session.query(ObjectsStock).join(ObjectsProp, ObjectsStock.id == ObjectsProp.obj_id)
results = query.all()
with ObjectsStock and ObjectsProp being the appropriate mapped classes, the list results contains only objects of type ObjectsStock - why is that? What would be the correct SQLAlchemy query to get access to all fields corresponding to the columns of both tables?
Just in case someone encounters a similar problem: the best way I have found so far is listing the columns to fetch explicitly,
query = session.query(ObjectsStock.id, ObjectsStock.number, ObjectsProp.color, ObjectsProp.weight).\
select_from(ObjectsStock).join(ObjectsProp, ObjectsStock.id == ObjectsProp.obj_id)
results = query.all()
Then one can iterate over the results and access the properties by their original column names, e.g.
for r in results:
    print(r.id, r.color, r.number)
A shorter way of achieving the result of #ctenar's answer is by unpacking the columns using the star operator:
query = (
session
.query(*ObjectsStock.__table__.columns, *ObjectsProp.__table__.columns)
.select_from(ObjectsStock)
.join(ObjectsProp, ObjectsStock.id == ObjectsProp.obj_id)
)
results = query.all()
This is useful if your tables have many columns.
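Another option worth noting (a sketch using the same mapped classes as above): pass both classes to query(), and each result row comes back as an (ObjectsStock, ObjectsProp) tuple, so every mapped attribute of both sides stays accessible:
query = (
    session
    .query(ObjectsStock, ObjectsProp)
    .join(ObjectsProp, ObjectsStock.id == ObjectsProp.obj_id)
)
for stock, prop in query.all():
    print(stock.id, stock.number, prop.color, prop.weight)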
I have two tables: one contains SCHEDULE_DATE (over 300,000 records) and WORK_WEEK_CODE, and the second contains WORK_WEEK_CODE, START_DATE, and END_DATE. The first table has duplicate schedule dates, and the second table has 3,200 unique values. I need to populate the WORK_WEEK_CODE in table one with the WORK_WEEK_CODE from table two, based on the range in which each schedule date falls. Samples of the two tables are below.
I was able to accomplish the task using arcpy.da.UpdateCursor with a nested arcpy.da.SearchCursor, but with the volume of records, it takes a long time. Any suggestions on a better (and less time consuming) method would be greatly appreciated.
Note: The date fields are formatted as strings.
Table 1
SCHEDULE_DATE,WORK_WEEK_CODE
20160219
20160126
20160219
20160118
20160221
20160108
20160129
20160201
20160214
20160127
Table 2
WORK_WEEK_CODE,START_DATE,END_DATE
1601,20160104,20160110
1602,20160111,20160117
1603,20160118,20160124
1604,20160125,20160131
1605,20160201,20160207
1606,20160208,20160214
1607,20160215,20160221
You can use pandas dataframes as a more efficient method. Here is the approach using pandas; hope this helps:
import pandas as pd
# First you need to convert your data to pandas DataFrames; here I read them from CSV files
Table1 = pd.read_csv('Table1.csv')
Table2 = pd.read_csv('Table2.csv')
# Then you need to add a shared key for the join
Table1['key'] = 1
Table2['key'] = 1
# The following line joins the two tables (a cross join via the shared key)
mergeddf = pd.merge(Table1,Table2,how='left',on='key')
# The following lines convert the string dates to actual dates
mergeddf['SCHEDULE_DATE'] = pd.to_datetime(mergeddf['SCHEDULE_DATE'],format='%Y%m%d')
mergeddf['START_DATE'] = pd.to_datetime(mergeddf['START_DATE'],format='%Y%m%d')
mergeddf['END_DATE'] = pd.to_datetime(mergeddf['END_DATE'],format='%Y%m%d')
# The following line filters and keeps only the rows you need
result = mergeddf[(mergeddf['SCHEDULE_DATE'] >= mergeddf['START_DATE']) & (mergeddf['SCHEDULE_DATE'] <= mergeddf['END_DATE'])]
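A caveat about the cross join above: with roughly 300,000 schedule rows and 3,200 work weeks it builds close to a billion intermediate rows before filtering. A lighter-weight sketch (under the same assumptions about the CSV files) uses pandas.merge_asof to match each schedule date to the most recent start date, then checks the end date:
import pandas as pd

t1 = pd.read_csv('Table1.csv', dtype=str).drop(columns=['WORK_WEEK_CODE'])
t2 = pd.read_csv('Table2.csv', dtype=str)

t1['SCHEDULE_DATE'] = pd.to_datetime(t1['SCHEDULE_DATE'], format='%Y%m%d')
t2['START_DATE'] = pd.to_datetime(t2['START_DATE'], format='%Y%m%d')
t2['END_DATE'] = pd.to_datetime(t2['END_DATE'], format='%Y%m%d')

# merge_asof requires both sides to be sorted on their join keys.
t1 = t1.sort_values('SCHEDULE_DATE')
t2 = t2.sort_values('START_DATE')

matched = pd.merge_asof(t1, t2, left_on='SCHEDULE_DATE', right_on='START_DATE')
result = matched[matched['SCHEDULE_DATE'] <= matched['END_DATE']]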
I am trying to add and update multiple columns in a pandas dataframe using a second dataframe. The problem is that when the number of columns I want to add doesn't match the number of columns in the base dataframe, I get the following error: "Shape of passed values is (2, 3), indices imply (2, 2)"
A simplified version of the problem is below
from pandas import DataFrame

tst = DataFrame({"One":[1,2],"Two":[2,4]})
def square(row):
"""
for each row in the table return multiple calculated values
"""
a = row["One"]
b = row["Two"]
return a ** 2, b ** 2, b ** 3
#create three new fields from the data
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
If the number of fields being added matches the number already in the table, the operation works as expected.
tst = DataFrame({"One":[1,2],"Two":[2,4]})
def square(row):
"""
for each row in the table return multiple calculated values
"""
a = row["One"]
b = row["Two"]
return a ** 2, b ** 2
#create two new fields from the data
tst[["One^2", "Two^2"]] = tst.apply(square, axis=1)
I realise I could do each field separately, but in the actual problem I am trying to solve I perform a join between the table being updated and an external table within the "updater" (i.e. square) and want to be able to grab all the required information at once.
Below is how I would do it in SQL. Unfortunately the two dataframes contain data from different database technologies, hence why I have to perform the operation in pandas.
update tu
set tu.a_field = upd.the_field_i_want,
    tu.another_field = upd.the_second_required_field
from to_update tu
inner join the_updater upd
    on tu.item_id = upd.item_id
    and tu.date between upd.date_from and upd.date_to
Here you can see the exact details of what I am trying to do. I have a table "to_update" that contains point-in-time information against an item_id. The other table "the_updater" contains date range information against the item_id. For example a particular item_id may sit with customer_1 from DateA to DateB and with customer_2 between DateB and DateC etc. I want to be able to align information from the table containing the date ranges against the point-in-time table.
Please note a merge won't work due to problems with the data (this is actually being written as part of a data quality test). I really need to be able to replicate the functionality of the update statement above.
I could obviously do it as a loop but I was hoping to use the pandas framework where possible.
Declare an empty column in the dataframe and assign zero to it:
tst["Two^3"] = 0
Then do the respective operations for that column, along with the other columns:
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
Try printing it:
print(tst.head(5))
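On recent pandas versions (0.23+), another option worth trying is to let apply expand the returned tuple into separate columns directly; a sketch:
# result_type="expand" turns each returned tuple into a row of separate columns.
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1, result_type="expand")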