Interpolate in SQL based on subgroup in django models - python

I have a Sheetinfo model containing the following data:
| Trav      | Group  | Subgroup | Sheet_num | T_val |
| --------- | ------ | -------- | --------- | ----- |
| SAT123A01 | SAT123 | A        | 1         | 400   |
| SAT123A02 | SAT123 | A        | 2         | 0     |
| SAT123A03 | SAT123 | A        | 3         | 0     |
| SAT123A04 | SAT123 | A        | 4         | 0     |
| SAT123A05 | SAT123 | A        | 5         | 500   |
| SAT123B05 | SAT123 | B        | 5         | 400   |
| SAT123B04 | SAT123 | B        | 4         | 0     |
| SAT123B03 | SAT123 | B        | 3         | 0     |
| SAT123B02 | SAT123 | B        | 2         | 500   |
| SAT124A01 | SAT124 | A        | 1         | 400   |
| SAT124A02 | SAT124 | A        | 2         | 0     |
| SAT124A03 | SAT124 | A        | 3         | 0     |
| SAT124A04 | SAT124 | A        | 4         | 475   |
I would like to interpolate and update the table with the correct T_val.
The formula is:
new_t_val = delta / (cnt - 1) * (sheet_num - 1) + min_tvc_of_subgroup
(Sheet_num is 1-based, so the first sheet of a subgroup keeps the subgroup's minimum T_val, as in the example below.)
For instance, the first five rows would become:
| Trav      | Group  | Subgroup | Sheet_num | T_val |
| --------- | ------ | -------- | --------- | ----- |
| SAT123A01 | SAT123 | A        | 1         | 400   |
| SAT123A02 | SAT123 | A        | 2         | 425   |
| SAT123A03 | SAT123 | A        | 3         | 450   |
| SAT123A04 | SAT123 | A        | 4         | 475   |
| SAT123A05 | SAT123 | A        | 5         | 500   |
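As a quick check, plugging subgroup A of SAT123 into the formula (min = 400, max = 500, cnt = 5, so delta = 100) reproduces the values above:
delta, cnt, min_tval = 100, 5, 400
print([delta / (cnt - 1) * (n - 1) + min_tval for n in range(1, 6)])
# [400.0, 425.0, 450.0, 475.0, 500.0]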
I have a Django query that works to update the data; however, it is SLOW and stops after a while (due to type errors, etc.).
My question: is there a way to accomplish this in SQL?

The ability to do this in one database call doesn't exist in stock Django, but third-party packages do: https://github.com/aykut/django-bulk-update
Example of how that package works:
rows = Model.objects.all()
for row in rows:
    # modify each row as appropriate
    row.T_val = delta / (cnt - 1) * (row.sheet_num - 1) + min_tvc_of_subgroup
Model.objects.bulk_update(rows)
For datasets up to the 1,000,000-row range, this should have reasonable performance. Most of the cost of iterating through and .save()-ing each object is the overhead of one database call per row; the Python part is reasonably fast. The example above makes only two database calls, so it should be perhaps an order of magnitude or two faster.
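For completeness, here is a sketch of how the per-subgroup values could be computed so that the loop above has everything it needs. This is an illustration only: it assumes the model is named Sheetinfo with fields group, subgroup, sheet_num and T_val, treats a T_val of 0 as "not yet filled in" (as in the sample data), and relies on Django 2.2+, where bulk_update is built into the stock ORM (with the third-party package the final call has the same shape).
from django.db.models import Count, Max, Min, Q

# One aggregate query: sheet count plus min/max of the non-zero T_val
# endpoints for every (group, subgroup) pair.
stats = {
    (s["group"], s["subgroup"]): s
    for s in Sheetinfo.objects.values("group", "subgroup").annotate(
        cnt=Count("pk"),
        min_t=Min("T_val", filter=Q(T_val__gt=0)),
        max_t=Max("T_val", filter=Q(T_val__gt=0)),
    )
}

rows = list(Sheetinfo.objects.all())
for row in rows:
    s = stats[(row.group, row.subgroup)]
    delta = s["max_t"] - s["min_t"]
    row.T_val = delta / (s["cnt"] - 1) * (row.sheet_num - 1) + s["min_t"]

# A single batched UPDATE instead of one query per row.
Sheetinfo.objects.bulk_update(rows, ["T_val"])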

Related

Python Pivot Table based on multiple criteria

I asked a question in this link: SUMIFS in python jupyter.
However, I just realized that the solution didn't work, because clients can switch in and switch out on different dates. Basically, they have to switch out first before they can switch in.
Here is the dataframe (sorted based on the date):
+---------------+--------+---------+-----------+--------+
| Switch In/Out | Client | Quality | Date      | Amount |
+---------------+--------+---------+-----------+--------+
| Out           | 1      | B       | 15-Aug-19 | 360    |
| In            | 1      | A       | 16-Aug-19 | 180    |
| In            | 1      | B       | 17-Aug-19 | 180    |
| Out           | 1      | A       | 18-Aug-19 | 140    |
| In            | 1      | B       | 18-Aug-19 | 80     |
| In            | 1      | A       | 19-Aug-19 | 60     |
| Out           | 2      | B       | 14-Aug-19 | 45     |
| Out           | 2      | C       | 15-Aug-20 | 85     |
| In            | 2      | C       | 15-Aug-20 | 130    |
| Out           | 2      | A       | 20-Aug-19 | 100    |
| In            | 2      | A       | 22-Aug-19 | 30     |
| In            | 2      | B       | 23-Aug-19 | 30     |
| In            | 2      | C       | 23-Aug-19 | 40     |
+---------------+--------+---------+-----------+--------+
I would then create a new column and divide them into different transactions.
+---------------+--------+---------+-----------+--------+------+
| Switch In/Out | Client | Quality | Date      | Amount | Rows |
+---------------+--------+---------+-----------+--------+------+
| Out           | 1      | B       | 15-Aug-19 | 360    | 1    |
| In            | 1      | A       | 16-Aug-19 | 180    | 1    |
| In            | 1      | B       | 17-Aug-19 | 180    | 1    |
| Out           | 1      | A       | 18-Aug-19 | 140    | 2    |
| In            | 1      | B       | 18-Aug-19 | 80     | 2    |
| In            | 1      | A       | 19-Aug-19 | 60     | 2    |
| Out           | 2      | B       | 14-Aug-19 | 45     | 3    |
| Out           | 2      | C       | 15-Aug-20 | 85     | 3    |
| In            | 2      | C       | 15-Aug-20 | 130    | 3    |
| Out           | 2      | A       | 20-Aug-19 | 100    | 4    |
| In            | 2      | A       | 22-Aug-19 | 30     | 4    |
| In            | 2      | B       | 23-Aug-19 | 30     | 4    |
| In            | 2      | C       | 23-Aug-19 | 40     | 4    |
+---------------+--------+---------+-----------+--------+------+
With this, I can apply the pivot formula and take it from there.
But how do I do this in Python? In Excel, I can just use multiple SUMIFS and compare the ins and outs, but I don't see how to do the same in pandas.
Thank you!
One simple solution is to iterate and apply a check (a function) to each element, with the results forming a new column: in other words, map.
Using df.index.map, each item's index is passed as the argument, so we can look up and compare values. In your case, the aim is to detect a change to "Out" while keeping a counter.
import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In",
               "Out", "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=['Switch In/Out'])

counter = 1

def changeToOut(i):
    # Increment the counter whenever the value switches from "In" back to "Out".
    global counter
    if df["Switch In/Out"].get(i) == "Out" and df["Switch In/Out"].get(i - 1) == "In":
        counter += 1
    return counter

rows = df.index.map(changeToOut)
df["Rows"] = rows
df
Result:
+----+-----------------+--------+
|    | Switch In/Out   | Rows   |
|----+-----------------+--------|
|  0 | Out             | 1      |
|  1 | In              | 1      |
|  2 | In              | 1      |
|  3 | Out             | 2      |
|  4 | In              | 2      |
|  5 | In              | 2      |
|  6 | Out             | 3      |
|  7 | Out             | 3      |
|  8 | In              | 3      |
|  9 | Out             | 4      |
| 10 | In              | 4      |
| 11 | In              | 4      |
| 12 | In              | 4      |
+----+-----------------+--------+
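A vectorized variant of the same idea (my addition, not part of the original answer) avoids the Python-level loop and the global counter: flag every row where the value switches from "In" to "Out", then take a cumulative sum.
import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In",
               "Out", "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=["Switch In/Out"])

is_out = df["Switch In/Out"].eq("Out")
prev_in = df["Switch In/Out"].shift().eq("In")   # previous row was "In"
df["Rows"] = (is_out & prev_in).cumsum() + 1     # counter starts at 1
This produces the same Rows column as the table above.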

How to create a table resulting from joining two or more tables with this structure?

Let's say I have two tables with the following structure and the same values:
+-----------+-----------+---------+-------+--------+---------+--------+---------+
| TEACHER   | STUDENT   | CLASS   | SEC   | HB_a   | VHB_b   | HG_c   | VHG_d   |
|-----------+-----------+---------+-------+--------+---------+--------+---------|
| 1         | -         | -       | -     | 1      | 1       | 1      | 1       |
| -         | 1         | 10      | D     | 1      | 1       | 1      | 1       |
| -         | 1         | 9       | D     | 1      | 1       | 1      | 1       |
+-----------+-----------+---------+-------+--------+---------+--------+---------+
CLASS can go from 6-12 and SEC from A-Z.
There is nothing in STUDENT, CLASS, and SEC when there is a value in TEACHER, and vice versa.
Now I want to create a table by combining two tables with the exact structure and data given above, i.e. I want the result to be something like below:
+-----------+-----------+---------+-------+--------+---------+--------+---------+
| TEACHER   | STUDENT   | CLASS   | SEC   | HB_a   | VHB_b   | HG_c   | VHG_d   |
|-----------+-----------+---------+-------+--------+---------+--------+---------|
| 2         | -         | -       | -     | 2      | 2       | 2      | 2       |
| -         | 2         | 10      | D     | 2      | 2       | 2      | 2       |
| -         | 2         | 9       | D     | 2      | 2       | 2      | 2       |
+-----------+-----------+---------+-------+--------+---------+--------+---------+
I tried something like this, but it doesn't work well; the output isn't what I want:
__tbl_sy = f"""
CREATE TABLE <tbl>
AS SELECT CLASS, SEC, SUM(TEACHER), SUM(STUDENT), SUM(HB_a), SUM(VHB_b), SUM(HG_c), SUM(VHG_d)
FROM <tbl1>
UNION
SELECT CLASS, SEC, SUM(TEACHER), SUM(STUDENT), SUM(HB_a), SUM(VHB_b), SUM(HG_c), SUM(VHG_d)
FROM <tbl2>
GROUP BY CLASS, SEC
"""
Cursor.execute(__tbl_sy)
For the sample data that you posted this will work:
select
  sum(teacher) teacher, sum(student) student,
  class, sec,
  sum(hb_a) hb_a, sum(vhb_b) vhb_b, sum(hg_c) hg_c, sum(vhg_d) vhg_d
from (
  select * from tbl1
  union all
  select * from tbl2
)
group by class, sec
See the demo.
Results:
| teacher | student | CLASS | SEC | hb_a | vhb_b | hg_c | vhg_d |
| ------- | ------- | ----- | --- | ---- | ----- | ---- | ----- |
| 2       |         |       |     | 2    | 2     | 2    | 2     |
|         | 2       | 10    | D   | 2    | 2     | 2    | 2     |
|         | 2       | 9     | D   | 2    | 2     | 2    | 2     |
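If the goal is to materialize that result as a new table from Python, a sketch along these lines should work. It assumes the sqlite3 module and the table names tbl1/tbl2 used above; adjust the connection and names for your actual driver and schema.
import sqlite3

create_sql = """
CREATE TABLE combined AS
SELECT
  SUM(teacher) AS teacher, SUM(student) AS student,
  class, sec,
  SUM(hb_a) AS hb_a, SUM(vhb_b) AS vhb_b, SUM(hg_c) AS hg_c, SUM(vhg_d) AS vhg_d
FROM (
  SELECT * FROM tbl1
  UNION ALL
  SELECT * FROM tbl2
)
GROUP BY class, sec
"""

conn = sqlite3.connect("school.db")  # hypothetical database file
with conn:                           # commits on success, rolls back on error
    conn.execute(create_sql)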

Manipulate pandas columns with datetime

Please see this SO post Manipulating pandas columns
I shared this dataframe:
+----------+------------+-------+-----+------+
| Location | Date       | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2       | 2019-03-02 | 1     | a   |      |
| i2       | 2019-03-02 | 1     | a   |      |
| i2       | 2019-03-02 | 1     | a   |      |
| i2       | 2019-03-04 | 1     | a   | 2    |
| i2       | 2019-03-15 | 2     | b   | 0    |
| i9       | 2019-02-22 | 2     | c   | 0    |
| i9       | 2019-03-10 | 3     | d   |      |
| i9       | 2019-03-10 | 3     | d   | 0    |
| s8       | 2019-04-22 | 1     | e   |      |
| s8       | 2019-04-25 | 1     | e   |      |
| s8       | 2019-04-28 | 1     | e   | 6    |
| t14      | 2019-05-13 | 3     | f   |      |
+----------+------------+-------+-----+------+
This is a follow-up question. Consider two more columns after Date as shown below.
+-----------------------+----------------------+
| Start Time (hh:mm:ss) | Stop Time (hh:mm:ss) |
+-----------------------+----------------------+
| 13:24:38              | 14:17:39             |
| 03:48:36              | 04:17:20             |
| 04:55:05              | 05:23:48             |
| 08:44:34              | 09:13:15             |
| 19:21:05              | 20:18:57             |
| 21:05:06              | 22:01:50             |
| 14:24:43              | 14:59:37             |
| 07:57:32              | 09:46:21             |
| 19:21:05              | 20:18:57             |
| 21:05:06              | 22:01:50             |
| 14:24:43              | 14:59:37             |
| 07:57:32              | 09:46:21             |
+-----------------------+----------------------+
The task remains the same: to get the time difference, but now in hours, between the Stop Time of the first row and the Start Time of the last row for each Key.
Based on the answer, I was trying something like this:
df['Time']=df.groupby(['Location','Event']).Date.\
transform(lambda x : (x.iloc[-1]-x.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')]
df['Time_h']=df.groupby(['Location','Event'])['Start Time (hh:mm:ss)','Stop Time (hh:mm:ss)'].\
transform(lambda x,y : (x.iloc[-1]-y.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')] # This gives an error on transform
to get the difference in days and hours separately and then combine them. Is there a better way?
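For what it's worth, here is a sketch of one way to avoid handling days and hours separately (my suggestion, not an accepted answer): build full datetimes from Date plus the two time columns, then take last-start minus first-stop per group and express it in hours.
import pandas as pd

# Combine the date with each time-of-day column into full timestamps.
df['Start'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Start Time (hh:mm:ss)'])
df['Stop'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Stop Time (hh:mm:ss)'])

grp = df.groupby(['Location', 'Event'])
hours = (grp['Start'].transform('last') - grp['Stop'].transform('first')) / pd.Timedelta(hours=1)

# Keep the value only on the last row of each group, as in the original answer.
df['Time_h'] = hours[~df.duplicated(['Location', 'Event'], keep='last')]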

Using pandas as a distance matrix and then getting a sub-dataframe of relevant distances

I have created a pandas df that has the distances between location i and location j. Beginning with a start point P1 and an end point P2, I want to find the sub-dataframe (distance matrix) that has P1 and P2 on one axis and the rest of the indices on the other axis.
I'm using a pandas DataFrame because I think it's the most efficient way.
dm_dict = ...  # distance matrix in dict form: dm_dict[i][j] gives the distance from i to j
dm_df = pd.DataFrame.from_dict(dm_dict)
P1 = dm_df.max(axis=0).idxmax()
P2 = dm_df[P1].idxmax()
route = [P1, P2]
remaining_locs = dm_df[dm_df[~dm_df.isin(route)].isin(route)]
while not_done:
    # go through remaining_locs until all the locations have been added
There are no error messages, but the remaining_locs df is full of NaNs rather than distances.
Using dm_df[~dm_df.isin(route)].isin(route) does seem to give me a boolean df that is accurate.
Sample data (it's technically the haversine distance, but Euclidean should be fine for filling up the matrix):
import numpy

def dist(i, j):
    # Euclidean distance between the (lat, lon) pairs of two location tuples
    a = numpy.array((i[1], i[2]))
    b = numpy.array((j[1], j[2]))
    return numpy.linalg.norm(a - b)
locations = [
("Ottawa", 45.424722,-75.695),
("Edmonton", 53.533333,-113.5),
("Victoria", 48.428611,-123.365556),
("Winnipeg", 49.899444,-97.139167),
("Fredericton", 49.899444,-97.139167),
("StJohns", 47.561389, -52.7125),
("Halifax", 44.647778, -63.571389),
("Toronto", 43.741667, -79.373333),
("Charlottetown",46.238889, -63.129167),
("QuebecCity",46.816667, -71.216667 ),
("Regina", 50.454722, -104.606667),
("Yellowknife", 62.442222, -114.3975),
("Iqaluit", 63.748611, -68.519722)
]
dm_dict = {i: {j: dist(i, j) for j in locations if j != i} for i in locations}
It looks like you want scipy's distance_matrix:
import pandas as pd
from scipy.spatial import distance_matrix

df = pd.DataFrame(locations)
x = df[[1, 2]]
dm = pd.DataFrame(distance_matrix(x, x),
                  index=df[0],
                  columns=df[0])
Output:
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| | Ottawa | Edmonton | Victoria | Winnipeg | Fredericton | StJohns | Halifax | Toronto | Charlottetown | QuebecCity | Regina | Yellowknife | Iqaluit |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| 0 | | | | | | | | | | | | | |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
| Ottawa | 0.000000 | 38.664811 | 47.765105 | 21.906059 | 21.906059 | 23.081609 | 12.148481 | 4.045097 | 12.592181 | 4.689667 | 29.345960 | 42.278586 | 19.678657 |
| Edmonton | 38.664811 | 0.000000 | 11.107987 | 16.759535 | 16.759535 | 61.080146 | 50.713108 | 35.503607 | 50.896264 | 42.813477 | 9.411122 | 8.953983 | 46.125669 |
| Victoria | 47.765105 | 11.107987 | 0.000000 | 26.267600 | 26.267600 | 70.658378 | 59.913580 | 44.241193 | 60.276176 | 52.173796 | 18.867990 | 16.637528 | 56.945306 |
| Winnipeg | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| Fredericton | 21.906059 | 16.759535 | 26.267600 | 0.000000 | 0.000000 | 44.488147 | 33.976105 | 18.802741 | 34.206429 | 26.105163 | 7.488117 | 21.334745 | 31.794214 |
| StJohns | 23.081609 | 61.080146 | 70.658378 | 44.488147 | 44.488147 | 0.000000 | 11.242980 | 26.933071 | 10.500284 | 18.519147 | 51.974763 | 63.454538 | 22.625084 |
| Halifax | 12.148481 | 50.713108 | 59.913580 | 33.976105 | 33.976105 | 11.242980 | 0.000000 | 15.827902 | 1.651422 | 7.946971 | 41.444115 | 53.851052 | 19.731392 |
| Toronto | 4.045097 | 35.503607 | 44.241193 | 18.802741 | 18.802741 | 26.933071 | 15.827902 | 0.000000 | 16.434995 | 8.717042 | 26.111037 | 39.703942 | 22.761342 |
| Charlottetown | 12.592181 | 50.896264 | 60.276176 | 34.206429 | 34.206429 | 10.500284 | 1.651422 | 16.434995 | 0.000000 | 8.108112 | 41.691201 | 53.767927 | 18.320711 |
| QuebecCity | 4.689667 | 42.813477 | 52.173796 | 26.105163 | 26.105163 | 18.519147 | 7.946971 | 8.717042 | 8.108112 | 0.000000 | 33.587610 | 45.921044 | 17.145385 |
| Regina | 29.345960 | 9.411122 | 18.867990 | 7.488117 | 7.488117 | 51.974763 | 41.444115 | 26.111037 | 41.691201 | 33.587610 | 0.000000 | 15.477744 | 38.457705 |
| Yellowknife | 42.278586 | 8.953983 | 16.637528 | 21.334745 | 21.334745 | 63.454538 | 53.851052 | 39.703942 | 53.767927 | 45.921044 | 15.477744 | 0.000000 | 45.896374 |
| Iqaluit | 19.678657 | 46.125669 | 56.945306 | 31.794214 | 31.794214 | 22.625084 | 19.731392 | 22.761342 | 18.320711 | 17.145385 | 38.457705 | 45.896374 | 0.000000 |
+----------------+------------+------------+------------+------------+--------------+------------+------------+------------+----------------+-------------+------------+--------------+-----------+
I am pretty sure this is what I wanted:
filtered = dm_df.filter(items=route, axis=1).filter(items=set(locations).difference(set(route)), axis=0)
filtered is a df with [2 rows x 10 columns], and I can then find the minimum value from there.
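With the scipy-based matrix above, the same selection can also be written with .loc (a sketch, assuming P1 and P2 hold labels that actually exist in dm's index, i.e. city names):
route = [P1, P2]

# Rows: every location not on the route; columns: just the two route endpoints.
sub = dm.loc[~dm.index.isin(route), route]

# For example, the remaining location closest to either endpoint:
closest = sub.min(axis=1).idxmin()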

Saving Graphlab LDA model turns topics into gibberish?

OK, this is just plain wacky. I think the problem may have been introduced by a recent GraphLab update (because I've never seen the issue before), but I'm not sure. Anyway, check this out:
import graphlab as gl
corpus = gl.SArray('path/to/corpus_data')
lda_model = gl.topic_model.create(dataset=corpus, num_topics=10, num_iterations=50, alpha=1.0, beta=0.1)
lda_model.get_topics(num_words=3).print_rows(30)
+-------+---------------+------------------+
| topic | word | score |
+-------+---------------+------------------+
| 0 | Music | 0.0195325651638 |
| 0 | Love | 0.0120906781994 |
| 0 | Photography | 0.00936914065591 |
| 1 | Recipe | 0.0205673829742 |
| 1 | Food | 0.0202932111556 |
| 1 | Sugar | 0.0162560126511 |
| 2 | Business | 0.0223993672813 |
| 2 | Science | 0.0164027313084 |
| 2 | Education | 0.0139221301443 |
| 3 | Science | 0.0134658216431 |
| 3 | Video_game | 0.0113924173881 |
| 3 | NASA | 0.0112188654905 |
| 4 | United_States | 0.0127908290673 |
| 4 | Automobile | 0.00888669047383 |
| 4 | Australia | 0.00854809547772 |
| 5 | Disease | 0.00704245203928 |
| 5 | Earth | 0.00693360028027 |
| 5 | Species | 0.00648700544757 |
| 6 | Religion | 0.0142311765509 |
| 6 | God | 0.0139990904439 |
| 6 | Human | 0.00765681454222 |
| 7 | Google | 0.0198547267697 |
| 7 | Internet | 0.0191105480317 |
| 7 | Computer | 0.0179914269911 |
| 8 | Art | 0.0378733245262 |
| 8 | Design | 0.0223646138082 |
| 8 | Artist | 0.0142755732766 |
| 9 | Film | 0.0205971724156 |
| 9 | Earth | 0.0125386246077 |
| 9 | Television | 0.0102082224947 |
+-------+---------------+------------------+
OK, even without knowing anything about my corpus, these topics are at least somewhat comprehensible, insofar as the top terms per topic are more or less related.
But now, if I simply save and reload the model, the topics completely change (into nonsense, as far as I can tell):
lda_model.save('test')
lda_model = gl.load_model('test')
lda_model.get_topics(num_words=3).print_rows(30)
+-------+-----------------------+-------------------+
| topic | word | score |
+-------+-----------------------+-------------------+
| 0 | Cleanliness | 0.00468171463384 |
| 0 | Chicken_soup | 0.00326753275774 |
| 0 | The_Language_Instinct | 0.00314506174959 |
| 1 | Equalization | 0.0015724652078 |
| 1 | Financial_crisis | 0.00132675410371 |
| 1 | Tulsa,_Oklahoma | 0.00118899041288 |
| 2 | Batoidea | 0.00142300468887 |
| 2 | Abbottabad | 0.0013474225953 |
| 2 | Migration_humaine | 0.00124284781396 |
| 3 | Gewürztraminer | 0.00147470845039 |
| 3 | Indore | 0.00107223358321 |
| 3 | White_wedding | 0.00104791136102 |
| 4 | Bregenz | 0.00130871351963 |
| 4 | Carl_Jung | 0.000879345016186 |
| 4 | ภ | 0.000855001542873 |
| 5 | 18e_eeuw | 0.000950866105797 |
| 5 | Vesuvianite | 0.000832367570269 |
| 5 | Gary_Kirsten | 0.000806410748201 |
| 6 | Sunday_Bloody_Sunday | 0.000828552346797 |
| 6 | Linear_cryptanalysis | 0.000681188343324 |
| 6 | Clothing_sizes | 0.00066708652481 |
| 7 | Mile | 0.000759081990574 |
| 7 | Pinwheel_calculator | 0.000721971708181 |
| 7 | Third_Age | 0.000623010955132 |
| 8 | Tennessee_Williams | 0.000597449568381 |
| 8 | Levite | 0.000551338743949 |
| 8 | Time_Out_(company) | 0.000536667117994 |
| 9 | David_Deutsch | 0.000543813843275 |
| 9 | Honing_(metalworking) | 0.00044496051774 |
| 9 | Clearing_(finance) | 0.000431699705779 |
+-------+-----------------------+-------------------+
Any idea what could possibly be happening here? save should just pickle the model, so I don't see where the weirdness is coming from, but somehow the topic distributions are getting totally changed around in some non-obvious way. I've verified this on two different machines (Linux and Mac), with similarly weird results.
EDIT
Downgrading GraphLab from 1.7.1 to 1.6.1 seems to resolve this issue, but that's not a real solution. I don't see anything obvious in the 1.7.1 release notes to explain what happened, and I would like this to work in 1.7.1 if possible...
This is a bug in GraphLab Create 1.7.1. It has been fixed in GraphLab Create 1.8.
