I have 2 dataframes, df1 and df2 like this:
df1=
person_id
10001
...
10900
df2=
person_id month_1 place_1
10001 255 X
...
10900 2111 Y
10900 500 X
10900 200 X
I want to left join df2 onto df1 only where place_1 is X, with the final value being the sum of month_1.
Like this:
newdf=
person_id month_1 place_1
10900 700 X
So far, I've thought of constructing my sqlite3 code as follows:
import sqlite3
import pandas as pd
conn=sqlite3.connect(':memory:')
crsr=conn.cursor()
qry='''
SELECT df1.*
FROM df1
left join df2 on sum(month_1)
WHERE UPPER(place_1) like '%X%'
group by df2.person_id
on df1.person_id = df2.person_id;
'''
new_df=pd.read_sql(qry,conn)
What is going wrong in my query approach? How should I implement my query logic correctly?
I'm learning how to use SQL to manage my data within Python. Any help would be greatly appreciated!
If I got your question right, you are looking for all records in df2 with a place like X, summed up, and if that person also has some records in df1, then pull those as well.
To do that, the following would get you the record set. (When aggregating, the non-grouped columns should be wrapped in an aggregate function, such as MAX or MIN, etc.)
SELECT df2.person_id
,sum(df2.month_1)
,max(df1.person_name)
FROM df2
LEFT JOIN df1
ON df2.person_id=df1.person_id
WHERE UPPER(df2.place_1) like '%X%'
GROUP BY df2.person_id
This is your mistake:
left join df2 on sum(month_1)
ON must be followed by a condition on which to join rows. sum(month_1) is not a condition, but a single value.
And while, say, sum(month_1) > 0 is a condition, it wouldn't work either, because you are joining single rows, and sum(month_1) is not a row's value, but an aggregation over several rows.
You have on df1.person_id = df2.person_id later, but the ON clause belongs with the JOIN, not at the end of the query.
What you want is to select SUM(df2.month_1), so put it in the SELECT clause. The following query gives you all df1 rows along with their month_1 sum (or null, when there are no df2 entries for the person).
SELECT df1.*, SUM(df2.month_1)
FROM df1
left join df2 ON df2.person_id = df1.person_id
             AND UPPER(df2.place_1) = 'X'
GROUP BY df1.person_id;
I don't know whether SQLite supports grouping by a key and selecting its functionally dependent columns (df1.*), though. If you only want to show df1.person_id then you should replace df1.* by df1.person_id. If you want more df1 columns and SQLite doesn't allow df1.*, then you may want to aggregate before joining (which I consider good style anyway):
SELECT df1.*, d2.total
FROM df1
left join
(
SELECT person_id, SUM(month_1) AS total
FROM df2
WHERE UPPER(place_1) = 'X'
GROUP BY person_id
) d2 ON d2.person_id = df1.person_id;
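Note that in Python, pd.read_sql can only query tables that actually exist on the connection, so df1 and df2 have to be written there first (for example with to_sql) before either query runs. A minimal end-to-end sketch using the sample data from the question and the aggregate-before-join query above:

import sqlite3
import pandas as pd

# Sample data from the question; in practice df1 and df2 already exist.
df1 = pd.DataFrame({'person_id': [10001, 10900]})
df2 = pd.DataFrame({'person_id': [10001, 10900, 10900, 10900],
                    'month_1': [255, 2111, 500, 200],
                    'place_1': ['X', 'Y', 'X', 'X']})

conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)   # expose the dataframes as SQL tables
df2.to_sql('df2', conn, index=False)

qry = '''
SELECT df1.person_id, d2.total
FROM df1
LEFT JOIN (
    SELECT person_id, SUM(month_1) AS total
    FROM df2
    WHERE UPPER(place_1) = 'X'
    GROUP BY person_id
) d2 ON d2.person_id = df1.person_id;
'''
new_df = pd.read_sql(qry, conn)
print(new_df)   # person_id 10001 -> 255, 10900 -> 700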
Try the query below; it doesn't join the data, it just filters df2 by place and by the IDs present in df1:
select person_id, sum(month_1) from df2
where place_1 = 'X' and
exists(select 1 from df1
where person_id = df2.person_id)
group by person_id
or using in:
select person_id, sum(month_1) from df2
where place_1 = 'X' and
person_id in (select person_id from df1)
group by person_id
I assume that you want all the rows of df1 and this is why you use a LEFT join.
So the condition UPPER(df2.place_1) LIKE '%X%' should be set in the ON clause and not in the WHERE clause:
SELECT df1.person_id, SUM(month_1) AS month_1, MAX(place_1) place_1
FROM df1 LEFT JOIN df2
ON df1.person_id = df2.person_id AND UPPER(df2.place_1) LIKE '%X%'
GROUP BY df1.person_id;
If instead of NULLs you want 0s in the results for the non matching rows then change SUM(month_1) to:
COALESCE(SUM(month_1), 0)
Results:
| person_id | month_1 | place_1 |
| --------- | ------- | ------- |
| 10001 | 255 | X |
| 10900 | 700 | X |
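For reference, roughly the same numbers can be reproduced without SQL at all, staying in pandas; a small sketch assuming the column names from the question (the place_1 column is left out for brevity), with fillna(0) playing the role of COALESCE(..., 0):

import pandas as pd

df1 = pd.DataFrame({'person_id': [10001, 10900]})
df2 = pd.DataFrame({'person_id': [10001, 10900, 10900, 10900],
                    'month_1': [255, 2111, 500, 200],
                    'place_1': ['X', 'Y', 'X', 'X']})

# Filter to place X first (the "condition in the ON clause" step), then aggregate and left-join.
sums = (df2[df2['place_1'].str.upper() == 'X']
        .groupby('person_id', as_index=False)['month_1'].sum())
new_df = df1.merge(sums, on='person_id', how='left')
new_df['month_1'] = new_df['month_1'].fillna(0)   # pandas analogue of COALESCE(..., 0)
print(new_df)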
I have two dataframes with the exact same layout, just spanning different time periods.
DF1 represents the pre-period and DF2 represents the post-period
ID  | Product | Count
123 | 1111    | 2
123 | 2222    | 1
567 | 1111    | 5
789 | 2222    | 2
I want to isolate the rows in DF1, where the ID exists in DF2.
So, for example, if ID 123 was not present in the post period, DF2, I do not want it to appear in this new dataframe, DF3.
Due to the possibility of multiple ID records in both dataframes, my join logic is duplicating when I try a traditional dataframe join.
I am hoping for an easy way, like in SQL, where you can use the syntax WHERE df1.ID IN (select df2.id from df2)
What is the best way to accomplish this? Thanks in advance!
Use a leftsemi join, if I understand you correctly:
DF1.join(DF2, how='leftsemi', on='ID').show()
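The join above is Spark-style syntax (how='leftsemi', .show()); if these are plain pandas DataFrames instead, a semi-join can be emulated with isin, which avoids the duplication a normal merge would cause. A minimal sketch, assuming the column is literally named ID:

import pandas as pd

DF1 = pd.DataFrame({'ID': [123, 123, 567, 789],
                    'Product': [1111, 2222, 1111, 2222],
                    'Count': [2, 1, 5, 2]})
DF2 = pd.DataFrame({'ID': [123, 567],
                    'Product': [1111, 1111],
                    'Count': [3, 4]})

# Keep DF1 rows whose ID appears in DF2, without duplicating rows.
DF3 = DF1[DF1['ID'].isin(DF2['ID'])]
print(DF3)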
How to compare column1 from Df1 to column1 of different tables using Python (pandas)
Df1 contains the result of a SQL statement (select * from emp_table)
Df2 contains the result of a SQL statement (select * from company_table)
I am able to compare only one column from Df1 to one column of Df2. How do I compare column1 of Df1 to column1 of Df3, Df4, etc.?
Df1 = df.column1
Df2 = df.column1
Df1.compare(Df2,keep_shape=True, keep_equal=True)
Any help would be appreciated, as I am new to Python.
Since I can't upload sensitive data, I have created sample tables and columns: Table1, Table2, Table3, Table4, ... Table100 (there are many tables).
Now, I need to compare column1 of Table1 with column1 of Table2, column1 of Table3, ... column1 of Table100; in the future the number of tables might increase.
If the column values match, then I need to mark it as pass, or else fail.
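A rough sketch of one way this could be done, using made-up sample frames (the real ones would come from pd.read_sql as above) and Series.equals to produce a pass/fail per table:

import pandas as pd

# Hypothetical sample tables; in practice each would be loaded with pd.read_sql.
df1 = pd.DataFrame({"column1": [1, 2, 3]})
tables = {
    "Table2": pd.DataFrame({"column1": [1, 2, 3]}),
    "Table3": pd.DataFrame({"column1": [1, 2, 9]}),
    # ... Table4 through Table100 would be added the same way
}

# Mark each table "pass" if its column1 equals df1's column1, otherwise "fail".
results = {
    name: "pass" if df1["column1"].equals(t["column1"]) else "fail"
    for name, t in tables.items()
}
print(results)   # {'Table2': 'pass', 'Table3': 'fail'}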
I have multiple dfs with two common columns
Sample df
user_id | event_date
abc     | 1st June
abc     | 2nd June
cdf     | 15th July
dfg     | 17th July
I want to check if a user_id on a particular event_date in df1 also exists in df2, df3, df4, and df5
How do I find this ?
I tried the following methods, but they only take "user_id" into consideration, not "event_date":
method 1:
upi_sms =df1.assign(Insms=df2.user_id.isin(df1.user_id).astype(int))
method 2:
merging dataframes on = [user_id, event_date]
None of these gives me the expected results.
Expected Result:
Combination of abc and 1st June should exist in df2
How do I achieve this?
I would do it the following way; consider this simple example:
import pandas as pd
df1 = pd.DataFrame({'x':['A','B','C'],'y':[1,2,3]})
df2 = pd.DataFrame({'x':['C','A','B'],'y':[3,2,1]})
df3 = pd.DataFrame({'x':['A','B','C'],'y':[0,0,0]})
and say you are interested in the last row of df1, i.e. where x is C and y is 3. Such a row is also present in df2 (as its 1st row) but not in df3, where there is a row with x equal to C but a different y value.
row = tuple(df1.iloc[-1]) # get last row of df1 as tuple
print(row in df2.itertuples(index=False)) # True
print(row in df3.itertuples(index=False)) # False
Observe that it is important to pass index=False, as we do not want the row's position inside the pandas.DataFrame to be taken into account.
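To flag every (user_id, event_date) pair of df1 that also appears in df2, rather than testing a single row, one option is a left merge with indicator=True. A small sketch with made-up data, assuming the columns are literally named user_id and event_date:

import pandas as pd

df1 = pd.DataFrame({'user_id': ['abc', 'abc', 'cdf'],
                    'event_date': ['1st June', '2nd June', '15th July']})
df2 = pd.DataFrame({'user_id': ['abc', 'dfg'],
                    'event_date': ['1st June', '17th July']})

# '_merge' is 'both' when the (user_id, event_date) pair exists in df2.
flagged = df1.merge(df2.drop_duplicates(), on=['user_id', 'event_date'],
                    how='left', indicator=True)
flagged['in_df2'] = (flagged['_merge'] == 'both').astype(int)
print(flagged.drop(columns='_merge'))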
I am trying to make 2 new dataframes by using 2 given dataframe objects:
DF1 =
id  feature_text      length
1   "example text"    12
2   "example text2"   13
....

DF2 =
id  case_num
3   0
....
As you can see, both df1 and df2 have a column called "id". However, df1 has all id values, whereas df2 only has some of them. That is, df1 has 3200 rows, each with a unique id value (1~3200), whereas df2 has only some of them (e.g. id=[3,7,20,...]).
What I want to do is: 1) get a merged dataframe which contains all rows whose id values are included in both df1 and df2, and 2) get a dataframe which contains the rows in df1 whose id values are not included in df2.
I was able to find a solution for 1), however, have no idea how to do 2).
Thanks.
For the first case, you could use an inner merge:
out = df1.merge(df2, on='id')
For the second case, you could use isin with the negation operator, so that we filter out the rows in df1 that have ids that also exist in df2:
out = df1[~df1['id'].isin(df2['id'])]
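A tiny usage example of both on made-up data (the column names are assumed from the question):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'feature_text': ['a', 'b', 'c', 'd'],
                    'length': [1, 1, 1, 1]})
df2 = pd.DataFrame({'id': [3], 'case_num': [0]})

both = df1.merge(df2, on='id')               # 1) ids present in both frames
only_df1 = df1[~df1['id'].isin(df2['id'])]   # 2) df1 rows whose id is missing from df2
print(both)
print(only_df1)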
I have a dataframe of transactions:
id | type | date
453| online | 08-12-19
453| instore| 08-12-19
453| return | 10-5-19
There are 4 possible types: online, instore, return, other. I want to create boolean columns indicating, for each unique customer, whether they ever had a given transaction type.
I tried the following code but it was not giving me what I wanted.
transactions.groupby('id')['type'].transform(lambda x: x == 'online') == 'online'
Use get_dummies with the aggregate max for indicator columns per group, and lastly add DataFrame.reindex for a custom order and to add possible missing types filled with 0:
t = ['online', 'instore', 'return', 'other']
df = pd.get_dummies(df['type']).groupby(df['id']).max().reindex(t, axis=1, fill_value=0)
print (df)
online instore return other
id
453 1 1 1 0
Another idea, joining the values per group and using Series.str.get_dummies:
t = ['online', 'instore', 'return', 'other']
df.groupby('id')['type'].agg('|'.join).str.get_dummies().reindex(t, axis=1, fill_value=0)
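The question asked for boolean columns; if True/False is preferred over the 0/1 integers, the result of either approach could simply be cast afterwards, e.g. for the first one:

import pandas as pd

df = pd.DataFrame({'id': [453, 453, 453],
                   'type': ['online', 'instore', 'return']})
t = ['online', 'instore', 'return', 'other']

out = (pd.get_dummies(df['type']).groupby(df['id']).max()
         .reindex(t, axis=1, fill_value=0)
         .astype(bool))        # True/False instead of 1/0
print(out)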