group_by with Impala Ibis - python

I have an Impala table that I'd like to query using Ibis. The table looks like the following:
id | timestamp
-------------------
A | 5
A | 7
A | 3
B | 9
B | 5
I'd like to group_by this table according to unique combinations of id and timestamp range. The grouping operation should ultimately produce a single grouped object to which I can then apply aggregations. For example:
group1 conditions: id == A; 4 < timestamp < 11
group2 conditions: id == A; 1 < timestamp < 6
group3 conditions: id == B; 4 < timestamp < 7
yielding a grouped object with the following groups:
group1:
id | timestamp
-------------------
A | 5
A | 7
group2:
id | timestamp
-------------------
A | 5
A | 3
group3:
id | timestamp
-------------------
B | 5
Once I have the groups I'll perform various aggregations to get my final results. If anybody could help me figure this group_by out it would be greatly appreciated, even a regular pandas expression would be helpful!

So here is an example for groupby (no underscore):
df = pd.DataFrame({"id":["a","b","a","b","c","c"], "timestamp":[1,2,3,4,5,6]})
Create a grouper column for your timestamp:
df["my interval"] = (df["timestamp"] > 3) & (df["timestamp"] < 5)
You also need some data columns, i.e. ones that you do not use for grouping:
df["dummy"] = 1
df.groupby(["id", "my interval"]).agg("count")["dummy"]
Or you can use both:
df["something that I need"] = df["my interval"] & (df["id"] == "b")
df.groupby(["something that I need"]).agg("count")["dummy"]
You might also want to apply integer division to generate time intervals:
df = pd.DataFrame({"id": ["a","b","a","b","c","c"], "timestamp": [1,2,13,14,25,26], "sales": [0,4,2,3,6,7]})
epoch = 10
df["my interval"] = epoch * (df["timestamp"] // epoch)
df.groupby(["my interval"]).agg("sum")["sales"]
EDIT:
your example:
import pandas as pd
A = "A"
B = "B"
df = pd.DataFrame({"id":[A,A,A,B,B], "timestamp":[5,7,3,9,5]})
df["dummy"] = 1
Solution:
grouper = (df["id"] == A) & (4 < df["timestamp"]) & (df["timestamp"] < 11)
df.groupby(grouper).agg("sum")["dummy"]
or better:
df[grouper]["dummy"].sum()

Related

Check if specific values in column follow each other for each id

I have the following dataframe
id | status
____________
1 | reserved
2 | signed
1 | waiting
1 | signed
3 | waiting
2 | sold
3 | reserved
1 | sold
I want to check the hypothesis that the statuses reserved, waiting, signed always lead to the status sold. I only need to check that order; some statuses may be omitted, as for id == 2 in the dataframe.
I wonder if there's a way to look at the next row's value in a dataframe grouped by id.
Expected output is dataframe or list of ids that follow the above rule. For the dataframe above it would be this:
id
__
1
2
My attempt was to get all unique ids with those statuses and then, for each id, find the list of its statuses. Then I thought to filter it somehow, but there are a lot of combinations.
df = df[df.status.isin(['reserved', 'waiting', 'signed', 'sold'])]
df1 = df.groupby('id')['status'].unique()
df1.where('status' == ['reserved', 'waiting', 'signed', 'sold']
          or ['reserved', 'waiting', 'sold'] ...)
IIUC, you just want to check that 'sold' is the last value per group:
m = df.groupby('id')['status'].apply(lambda s: s.iloc[-1] == 'sold')
out = m[m].index.tolist()
output: [1, 2]
If you want to ensure there is something before 'sold':
m = df.groupby('id')['status'].apply(lambda s: len(s)>1 and s.iloc[-1] == 'sold')
And if you want to ensure that this something is in a specific list:
m = df.groupby('id')['status'].apply(
    lambda s: s.isin(['reserved', 'waiting', 'signed']).any()
              and s.iloc[-1] == 'sold')
m[m].index.tolist()
Alternative:
(df.drop_duplicates('id', keep='last')
   .loc[lambda d: d['status'].eq('sold'), 'id']
)
output:
5 2
7 1
Name: id, dtype: int64
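If you also want to verify that the statuses appear in the stated order (not just that 'sold' comes last), one option is to map each status to a rank and check that the ranks never decrease within each id. This is only a sketch, not part of the original answers, and the rank mapping is an assumption about how strictly the order should be enforced:

# map each status to a rank and require a non-decreasing sequence ending in 'sold'
order = {'reserved': 0, 'waiting': 1, 'signed': 2, 'sold': 3}
ok = df.assign(rank=df['status'].map(order)).groupby('id')['rank'].apply(
    lambda r: r.is_monotonic_increasing and r.iloc[-1] == order['sold'])
ok[ok].index.tolist()   # [1, 2] for the sample data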

Alter dataframe based on values in other rows

I'm trying to alter my dataframe to create a Sankey diagram.
I've 3 million rows like this:
client_id | start_date | end_date   | position
----------|------------|------------|---------
1234      | 16-07-2019 | 27-03-2021 | 3
1234      | 18-07-2021 | 09-10-2021 | 1
1234      | 28-03-2021 | 17-07-2021 | 2
1234      | 10-10-2021 | 20-11-2021 | 2
I want it to look like this:
client_id | start_date | end_date   | position | source | target
----------|------------|------------|----------|--------|-------
1234      | 16-07-2019 | 27-03-2021 | 3        | 3      | 2
1234      | 18-07-2021 | 09-10-2021 | 1        | 1      | 2
1234      | 28-03-2021 | 17-07-2021 | 2        | 2      | 1
1234      | 10-10-2021 | 20-11-2021 | 2        | 2      | 4
Value 4 is the value that I use as "exit" in the flow.
I have no idea how to do this.
Background: the source and target values contain the position values based on start_date and end_date. For example, in the first row the source is position value 3 but the target is position value 2, because after the end date the client changed from position 3 to 2.
Since source and target are determined by each client's date order, you can sort by start date and look up the next position for each row.
columns = ["client_id" ,"start_date","end_date","position"]
data = [
["1234","16-07-2019","27-03-2021",3],
["1234","18-07-2021","09-10-2021",1],
["1234","28-03-2021","17-07-2021",2],
["1234","10-10-2021","20-11-2021",2],
["5678","16-07-2019","27-03-2021",3],
["5678","18-07-2021","09-10-2021",1],
["5678","28-03-2021","17-07-2021",2],
["5678","10-10-2021","20-11-2021",2],
]
df = pd.DataFrame(
data,
columns=columns
)
df = df.assign(
start_date = pd.to_datetime(df["start_date"]),
end_date = pd.to_datetime(df["end_date"])
)
sdf = df.assign(
rank=df.groupby("client_id")["start_date"].rank()
)
sdf = sdf.assign(
next_rank=sdf["rank"] + 1
)
combine_result = pd.merge(sdf,
sdf[["client_id", "position", "rank"]],
left_on=["client_id", "next_rank"],
right_on=["client_id", "rank"],
how="left",
suffixes=["", "_next"]
).fillna({"position_next": 4})
combine_result[["client_id", "start_date", "end_date", "position", "position_next"]].rename(
{"position": "source", "position_next": "target"}, axis=1).sort_values(["client_id", "start_date"])

pandas: how to groupby / pivot retaining the NaNs? Converting float to str then back to float works but seems convoluted

I am tracking in which "month" a certain event has taken place. If it hasn't, the "month" field is a NaN. The starting table looks like this:
+-------+----------+---------+
| Month | Category | Balance |
+-------+----------+---------+
| 1 | a | 100 |
| nan | a | 300 |
| 2 | a | 200 |
+-------+----------+---------+
I am trying to build a crosstab like this:
+-------+----------------------------------+
| Month | Category a - cumulative % amount |
+-------+----------------------------------+
| 1 | 0.16 |
| 2 | 0.50 |
+-------+----------------------------------+
In month 1, the event has happened for 100/600, i.e. for 16%.
In month 2, the event has happened, cumulatively, for (100 + 200) / 600 = 50%, where 100 is in month 1 and 200 in month 2.
My issue is with NaNs. Pandas automatically removes NaNs from any groupby / pivot / crosstab. I could convert the month field to string, so that grouping won't remove the NaNs, but then pandas sorts the months as if they were strings, i.e. it would sort 10, 48, 5, 6.
Any suggestions?
The following works but seems extremely convoluted:
Convert "month" to string
Do a crosstab
Convert "month" back to float (can I do it without moving the index to a column, and then the column back to the index?)
Sort again
Do the cumsum
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame()
mylen = int(10e3)
df['ix'] = np.arange(0, mylen)
df['amount'] = np.random.uniform(10e3, 20e3, mylen)
df['category'] = np.where(df['ix'] <= 4000, 'a', 'b')
df['month'] = np.random.uniform(3, 48, mylen)
df['month'] = np.where(df['ix'] <= 1000, np.nan, df['month'])
df['month rounded'] = np.ceil(df['month'])

ct = pd.crosstab(df['month rounded'].astype(str), df['category'],
                 values=df['amount'], aggfunc='sum', margins=True,
                 normalize='columns', dropna=False)

# the index is 'month rounded'
ct = ct.reset_index()
ct['month rounded'] = ct['month rounded'].astype('float32')
ct = ct.sort_values('month rounded')
ct = ct.set_index('month rounded')
ct2 = ct.cumsum(axis=0)
Use:
new_df = df.assign(
    cumulative=df['Balance'].mask(df['Month'].isna())
                            .groupby(df['Category'])
                            .cumsum()
                            .div(df.groupby('Category')['Balance'].transform('sum'))
).dropna()
print(new_df)
Month Category Balance cumulative
0 1.0 a 100 0.166667
2 2.0 a 200 0.500000
If you want to create a DataFrame for each Category, you could build a dict:
df_category = {i:group for i,group in new_df.groupby('Category')}
df['Category a - cumulative % amount'] = (
    df.groupby(by=df.Month.fillna(np.inf))
      .apply(lambda x: x.Balance.cumsum().div(df.Balance.sum()))
      .reset_index(level=0, drop=True)
)
df.dropna()
Month Category Balance Category a - cumulative % amount
0 1 a 100 0.166667
2 2 a 200 0.333333
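As a side note (not part of the original answers): since pandas 1.1, groupby accepts dropna=False, which keeps NaN keys as their own group, so the string round-trip can often be avoided altogether. A minimal sketch on the small example table, assuming a recent pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Month': [1, np.nan, 2],
                   'Category': ['a', 'a', 'a'],
                   'Balance': [100, 300, 200]})

# the NaN month is kept as its own group and the numeric months sort correctly
by_month = df.groupby(['Category', 'Month'], dropna=False)['Balance'].sum()

# cumulative share of each category's total; the NaN group typically sorts last
share = by_month.groupby(level='Category').cumsum().div(
    df.groupby('Category')['Balance'].sum(), level='Category')
print(share)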

Using Pandas to join and append columns in a loop

I want to append columns from tables generated in a loop to a dataframe. I was hoping to accomplish this using pandas.merge, but it doesn't seem to be working out for me.
My code:
from datetime import date
from datetime import timedelta
import pandas
import numpy
import pyodbc

date1 = date(2017, 1, 1)   # Starting Date
date2 = date(2017, 1, 10)  # Ending Date
DateDelta = date2 - date1
DateAdd = DateDelta.days
StartDate = date1
count = 1

# Create the holding table
conn = pyodbc.connect('Server Information')
basetable = pandas.read_sql("SELECT....")

while count <= DateAdd:
    print(StartDate)
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...")
    finaltable = basetable.merge(datatable, how='left', left_on='OrganizationName', right_on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1

print(finaltable)
Shortened the select statements for brevity's sake, but the tables produced look like this:
Basetable
School_District
---------------
District_Alpha
District_Beta
...
District_Zed
Datatable
School_District|2016-01-01|
---------------|----------|
District_Alpha | 400 |
District_Beta | 300 |
... | 200 |
District_Zed | 100 |
I have the datatable written so the column takes the name of the date selected for that particular loop, so column names will be unique once I get this up and running. My problem, however, is that the above code only produces one column of data. I have a good guess as to why: only the last merge is being processed. I thought using pandas.append would be the way to get around that, but append doesn't "join" the way merge does. Is there some other way to accomplish a sort of join-and-append using pandas? My goal is to keep this flexible so that other dates can easily be used depending on our data needs.
In the end, what I want to see is:
School_District|2016-01-01|2016-01-02|... |2016-01-10|
---------------|----------|----------|-----|----------|
District_Alpha | 400 | 1 | | 45 |
District_Beta | 300 | 2 | | 33 |
... | 200 | 3 | | 5435 |
District_Zed | 100 | 4 | | 333 |
Your error is in the statement finaltable = basetable.merge(datatable, ...). At each loop iteration you merge the original basetable with the new datatable, store the result in finaltable... and then overwrite it on the next iteration, so only the last day's column survives. What you need is basetable = basetable.merge(datatable, ...), so that each day's column accumulates onto the base table. No finaltable needed.
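For illustration, the corrected loop might look roughly like this (a sketch continuing from the variables and placeholder SQL strings in the question):

basetable = pandas.read_sql("SELECT....")   # placeholder query from the question

while count <= DateAdd:
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...")
    # accumulate each day's column onto the growing base table
    basetable = basetable.merge(datatable, how='left', on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1

print(basetable)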

How to parse all the values in a column of a DataFrame?

DataFrame df has a column called Amount:
import pandas as pd
df = pd.DataFrame(['$3,000,000.00','$3,000.00', '$200.5', '$5.5'], columns = ['Amount'])
df:
ID | Amount
0 | $3,000,000.00
1 | $3,000.00
2 | $200.5
3 | $5.5
I want to parse all the values in column amount and extract the amount as a number and ignore the decimal points. End result is DataFrame that looks like this:
ID | Amount
0 | 3000000
1 | 3000
2 | 200
3 | 5
How do I do this?
You can use str.replace to strip the $ and thousands separators, then double cast with astype:
df['Amount'] = df.Amount.str.replace(r'[\$,]', '', regex=True).astype(float).astype(int)
print (df)
Amount
0 3000000
1 3000
2 200
3 5
You need to use the map function on the column and reassign to the same column:
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )
df.Amount = df.Amount.map(lambda s: int(locale.atof(s[1:])))
PS: This uses the code from How do I use Python to convert a string to a number if it has commas in it as thousands separators? to convert a string representing a number with thousands separator to an int
Code -
import pandas as pd

def format_amount(x):
    x = x[1:].split('.')[0]            # drop the leading '$' and the decimal part
    return int(''.join(x.split(',')))  # remove the thousands separators

df = pd.DataFrame(['$3,000,000.00', '$3,000.00', '$200.5', '$5.5'],
                  columns=['Amount'])
df['Amount'] = df['Amount'].apply(format_amount)
print(df)
Output -
Amount
0 3000000
1 3000
2 200
3 5
