Python SQL pandasql self join outer join

Using pandasql (PostgreSQL-style SQL), I am trying to implement a self join on a data table with shop ids, product ids, and levels. The shop ID and product ID together determine a level; unknown levels are indicated by null. The problem is as follows: for shop A, find the level of each of its products in every other shop. For a product of shop A that another shop does not sell, report a null value; likewise, for a product another shop sells but shop A does not, report a null value.
A solution using sqlalchemy with PostgreSQL would also work for me. For ease of running, I have included the sample data as pandas dataframes, and I have provided the complete pandasql code below along with the expected result.
from io import StringIO
import pandas as pd
import pandasql as ps
SOURCE_DATA = StringIO("""
shop_id,product_id,level
A,p,1
A,q,2
A,r,3
B,p,2
B,q,1
B,s,3
C,p,3
C,q,1
C,t,2
D,p,3
D,q,3
D,r,3
E,s,1
E,t,2
E,u,3
""")
EXPECTED = StringIO("""
target_shop_id,shop_id,product_id,target_level,level
A,B,p,1,2
A,B,q,2,1
A,B,r,3,NULL
A,B,s,NULL,3
A,C,p,1,3
A,C,q,2,1
A,C,r,3,NULL
A,C,t,NULL,2
A,D,p,1,3
A,D,q,2,3
A,D,r,3,3
A,E,p,1,NULL
A,E,q,2,NULL
A,E,r,3,NULL
A,E,s,NULL,1
A,E,t,NULL,2
A,E,u,NULL,3
""")
df_source = pd.read_csv(SOURCE_DATA, sep=",")
df_expected = pd.read_csv(EXPECTED, sep=",")
print(df_source)
print(df_expected)
df_target = ps.sqldf("select * from df_source where shop_id = 'A'")
df_non_target = ps.sqldf("select * from df_source where shop_id != 'A'")
df_result = ps.sqldf("""
    select t.shop_id as target_shop_id, t1.shop_id, t.product_id, t.level as target_level, t1.level
    from df_non_target as t
    left join df_target as t1 on t.product_id = t1.product_id;""")
df_result_union = ps.sqldf("""
    select t.shop_id as target_shop_id, t1.shop_id, t.product_id, t.level as target_level, t1.level
    from df_non_target as t
    left join df_target as t1 on t.product_id = t1.product_id
    union
    select t.shop_id as target_shop_id, t1.shop_id, t.product_id, t.level as target_level, t1.level
    from df_target as t
    left join df_non_target as t1 on t.product_id = t1.product_id;""")
print(df_result)
Expected Result:
target_shop_id shop_id product_id target_level level
0 A B p 1.0 2.0
1 A B q 2.0 1.0
2 A B r 3.0 NaN
3 A B s NaN 3.0
4 A C p 1.0 3.0
5 A C q 2.0 1.0
6 A C r 3.0 NaN
7 A C t NaN 2.0
8 A D p 1.0 3.0
9 A D q 2.0 3.0
10 A D r 3.0 3.0
11 A E p 1.0 NaN
12 A E q 2.0 NaN
13 A E r 3.0 NaN
14 A E s NaN 1.0
15 A E t NaN 2.0
16 A E u NaN 3.0
My Result:
target_shop_id shop_id product_id target_level level
0 B A p 2 1.0
1 B A q 1 2.0
2 B None s 3 NaN
3 C A p 3 1.0
4 C A q 1 2.0
5 C None t 2 NaN
6 D A p 3 1.0
7 D A q 3 2.0
8 D A r 3 3.0
9 E None s 1 NaN
10 E None t 2 NaN
11 E None u 3 NaN
My result is missing 5 rows where the 'level' is NaN.
Any suggestions as to how to fix my code?
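One possible fix (a sketch, not tested against PostgreSQL): pandasql runs on SQLite, which lacks FULL OUTER JOIN, so the missing rows have to be built explicitly. The query below pairs each of shop A's products with every other shop via a cross join, left joins that shop's level where it exists, and unions in the other shops' products that A does not sell; per other shop, this amounts to a FULL OUTER JOIN of A's products with that shop's products.
df_fixed = ps.sqldf("""
    select t.shop_id as target_shop_id, s.shop_id, t.product_id,
           t.level as target_level, n.level
    from df_target as t
    cross join (select distinct shop_id from df_non_target) as s
    left join df_non_target as n
           on n.shop_id = s.shop_id and n.product_id = t.product_id
    union
    select 'A', n.shop_id, n.product_id, null, n.level  -- 'A' mirrors the df_target filter
    from df_non_target as n
    where n.product_id not in (select product_id from df_target)
    order by shop_id, product_id;""")
print(df_fixed)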

Related

How to calculate pairwise co-occurrence matrix based on dataframe?

I have a dataframe of about 800,000 rows and 16 columns; below is an example from the data:
import pandas as pd
import datetime
start = datetime.datetime.now()
print('Starting time,'+str(start))
dict1 = {'id': ['person1', 'person2', 'person3', 'person4', 'person5'],
         'food1': ['A', 'A', 'A', 'C', 'D'],
         'food2': ['B', 'C', 'B', 'A', 'B'],
         'food3': ['', 'D', 'C', '', ''],
         'food4': ['', '', 'D', '', '']}
demo = pd.DataFrame(dict1)
demo
>>>Out[13]
Starting time,2022-11-30 12:08:41.414807
id food1 food2 food3 food4
0 person1 A B
1 person2 A C D
2 person3 A B C D
3 person4 C A
4 person5 D B
My ideal result format is as follows,
>>>Out[14]
A B C D
A 0 2 3 2
B 2 0 1 2
C 3 1 0 2
D 2 2 2 0
I did the following:
I've searched a bit through Stack Overflow and Google, but so far haven't come across an answer that helps with my problem.
I tried to code it myself; my idea was to first build each pairing, then combine everything into a string, and finally count the duplicates, but limited by my coding ability it is still a work in progress. Also, "new" combinations formed from the end of one pair and the start of another pair may cause errors when counting duplicates.
Thank you for your help.
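For what it's worth, a minimal sketch of that pairing idea (an assumption of mine: itertools.combinations to build the unordered pairs per person, collections.Counter to count them, empty cells dropped first):
from collections import Counter
from itertools import combinations

# per person: the set of foods, sorted so each unordered pair appears one way
foods = demo.iloc[:, 1:].apply(lambda row: sorted({v for v in row if v}), axis=1)
pair_counts = Counter(pair for items in foods for pair in combinations(items, 2))
# assemble the symmetric co-occurrence matrix (diagonal stays 0)
labels = sorted({food for pair in pair_counts for food in pair})
matrix = pd.DataFrame(0, index=labels, columns=labels)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = matrix.loc[b, a] = n
print(matrix)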
You could try this:
import numpy as np

# one-hot encode every food value, collapse the dummies back to one row per
# person, then a matrix product counts how often each pair co-occurs
out = pd.get_dummies(demo.iloc[:, 1:].stack()).groupby(level=0).sum().ne(0).astype(int)
final = out.T.dot(out).astype(float)
np.fill_diagonal(final.values, np.nan)
>>>final
A B C D
A NaN 2.0 3.0 2.0
B 2.0 NaN 1.0 2.0
C 3.0 1.0 NaN 2.0
D 2.0 2.0 2.0 NaN
If I understand your goal correctly, you can start from this:
uniques = demo[[x for x in demo.columns if 'id' not in x]].stack().unique()
pd.DataFrame(index=uniques, columns=uniques)   # an empty, all-NaN scaffold
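That only builds the empty scaffold, though. One way to fill it (my assumption of the intended continuation: for each pair of labels, count the rows containing both, with empty strings dropped):
labels = [u for u in uniques if u]              # drop the empty-string label
row_sets = demo.iloc[:, 1:].apply(set, axis=1)  # set of foods per person
filled = pd.DataFrame(0, index=labels, columns=labels)
for a in labels:
    for b in labels:
        if a != b:
            filled.loc[a, b] = sum(a in s and b in s for s in row_sets)
print(filled)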

Python Dataframe: filling up non-existing rows

I was wondering if there is an efficient way to add rows to a DataFrame that, for example, contain the average or a predefined value in case there are not enough rows for a specific value in another column. I guess this description of the problem is not the best, which is why you will find an example below:
Say we have the Dataframe
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
And we want to have 2 rows for each client A, B, C, D, no matter whether these 2 rows already exist or not. So for clients A and B we can just copy the rows. For C we want to add a row with Client = C, NumberOfProducts = average of the existing rows = 9, and an ID that is not of interest (we could set it to ID = smallest existing one - 1 = 0; any other value, even NaN, would also be possible). For client D there is not a single existing row, so we want to add 2 rows where NumberOfProducts equals the constant 2.5. The output should then look like this:
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
C 9 0
D 2.5 NaN
D 2.5 NaN
What I have done so far is to loop through the dataframe and add rows where necessary. Since this is pretty inefficient, any better solution would be highly appreciated.
Use:
clients = ['A','B','C','D']
N = 2
#test only values from list and also filter only 2 rows for each client if necessary
df = df[df['Client'].isin(clients)].groupby('Client').head(N)
#create helper counter and reshape by unstack
df1 = df.set_index(['Client',df.groupby('Client').cumcount()]).unstack()
#if a client has only 1 row, fill the second NumberOfProducts from the first
df1[('NumberOfProducts',1)] = df1[('NumberOfProducts',1)].fillna(df1[('NumberOfProducts',0)])
#... and the second ID from the first minus 1
df1[('ID',1)] = df1[('ID',1)].fillna(df1[('ID',0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#replace NumberOfProducts by constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print (df1)
NumberOfProducts ID
0 1 0 1
Client
A 1.0 5.0 2.0 1.0
B 1.0 6.0 2.0 1.0
C 9.0 9.0 1.0 0.0
D 2.5 2.5 NaN NaN
#last reshape to original
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print (df2)
Client NumberOfProducts ID
0 A 1.0 2.0
1 A 5.0 1.0
2 B 1.0 2.0
3 B 6.0 1.0
4 C 9.0 1.0
5 C 9.0 0.0
6 D 2.5 NaN
7 D 2.5 NaN

How to add a row to the top of a pandas dataframe?

I read my data like this:
dataset = pd.read_csv(r' ...\x.csv')
Then choose some of them like this:
dataset = dataset.loc[len(dataset)-data_length: , :]
Then do the shifting:
dataset_shifted = dataset.shift(1)
dataset_shifted = dataset_shifted.dropna()
And I would like to add a new row equal to 1 at the top of my dataset. But using the following command doesn't work, because my data indexes run from 3714 to 3722, so it adds an index 0 at the end of the dataframe, not at the top!
dataset_shifted = dataset_shifted .loc[0 , :] = 1
If there are no missing values in the DataFrame, you can simplify your solution by removing dropna and using DataFrame.fillna:
dataset = pd.DataFrame({
'B':[4,5,4],
'C':[7,8,9],
'D':[1,3,5],
}, index=[3714, 3715, 3716])
print (dataset)
B C D
3714 4 7 1
3715 5 8 3
3716 4 9 5
dataset_shifted = dataset.shift(1).fillna(1)
print (dataset_shifted)
B C D
3714 1.0 1.0 1.0
3715 4.0 7.0 1.0
3716 5.0 8.0 3.0
If missing values are possible, set only the first row, by position, with DataFrame.iloc:
dataset_shifted = dataset.shift(1)
dataset_shifted.iloc[0 , :] = 1
Your own solution should be changed to:
dataset_shifted = dataset.shift(1)
dataset_shifted = dataset_shifted.dropna()
dataset_shifted.loc[0 , :] = 1
dataset_shifted = dataset_shifted.sort_index()
print (dataset_shifted)
B C D
0 1.0 1.0 1.0
3715 4.0 7.0 1.0
3716 5.0 8.0 3.0

populate missing values for multiple columns with multiple values

I have gone through posts similar to filling multiple pandas columns in one go; however, my problem appears to be a little different, in the sense that I need to populate a missing column's value with a specific other column's value, and be able to do that for multiple columns in one go.
Eg: I can use the commands as below individually to fill the NA's
result1_copy['BASE_B'] = np.where(pd.isnull(result1_copy['BASE_B']), result1_copy['BASE_S'], result1_copy['BASE_B'])
result1_copy['QWE_B'] = np.where(pd.isnull(result1_copy['QWE_B']), result1_copy['QWE_S'], result1_copy['QWE_B'])
However, if I try populating them in one go, it does not work:
result1_copy['BASE_B','QWE_B'] = result1_copy['BASE_B', 'QWE_B'].fillna(result1_copy['BASE_S','QWE_S'])
Do we know why?
Please note I have only used 2 columns here for simplicity; in reality I have tens of columns to impute, and they are either object, float or datetime.
Are the datatypes the issue here?
You need to add [] to select a filtered DataFrame, and to align the columns, add rename:
d = {'BASE_S':'BASE_B', 'QWE_S':'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = (result1_copy[['BASE_B', 'QWE_B']]
                                      .fillna(result1_copy[['BASE_S','QWE_S']]
                                      .rename(columns=d)))
More dynamic solution:
L = ['BASE_','QWE_']
orig = ['{}B'.format(x) for x in L]
new = ['{}S'.format(x) for x in L]
d = dict(zip(new, orig))
result1_copy[orig] = (result1_copy[orig].fillna(result1_copy[new]
                                        .rename(columns=d)))
Another solution, if the B and S columns match up pairwise:
for x in ['BASE_','QWE_']:
    result1_copy[x + 'B'] = result1_copy[x + 'B'].fillna(result1_copy[x + 'S'])
Sample:
import numpy as np
import pandas as pd

result1_copy = pd.DataFrame({'A':list('abcdef'),
                             'BASE_B':[np.nan,5,4,5,5,np.nan],
                             'QWE_B':[np.nan,8,9,4,2,np.nan],
                             'BASE_S':[1,3,5,7,1,0],
                             'QWE_S':[5,3,6,9,2,4],
                             'F':list('aaabbb')})
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a NaN 1 a NaN 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f NaN 0 b NaN 4
d = {'BASE_S':'BASE_B', 'QWE_S':'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = (result1_copy[['BASE_B', 'QWE_B']]
                                      .fillna(result1_copy[['BASE_S','QWE_S']]
                                      .rename(columns=d)))
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a 1.0 1 a 5.0 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f 0.0 0 b 4.0 4

Merge the first row with the column headers in a dataframe

I am trying to clean up an Excel file for some further research. The problem I have: I want to merge the first and second rows. The code I have now:
xl = pd.ExcelFile("nanonose.xls")
df = xl.parse("Sheet1")
df = df.drop('Unnamed: 2', axis=1)
## Tried this line but no luck
##print(df.head().combine_first(df.iloc[[0]]))
The output of this is:
Nanonose Unnamed: 1 A B C D E \
0 Sample type Concentration NaN NaN NaN NaN NaN
1 Water 9200 95.5 21.0 6.0 11.942308 64.134615
2 Water 9200 94.5 17.0 5.0 5.484615 63.205769
3 Water 9200 92.0 16.0 3.0 11.057692 62.586538
4 Water 4600 53.0 7.5 2.5 3.538462 35.163462
F G H
0 NaN NaN NaN
1 21.498560 5.567840 1.174135
2 19.658560 4.968000 1.883444
3 19.813120 5.192480 0.564835
4 6.876207 1.641724 0.144654
So, my goal is to merge the first and second row to get: Sample type | Concentration | A | B | C | D | E | F | G | H
Could someone help me merge these two rows?
I think you need numpy.concatenate, using a similar principle to cᴏʟᴅsᴘᴇᴇᴅ's answer:
df.columns = np.concatenate([df.iloc[0, :2], df.columns[2:]])
df = df.iloc[1:].reset_index(drop=True)
print (df)
Sample type Concentration A B C D E F \
0 Water 9200 95.5 21.0 6.0 11.942308 64.134615 21.498560
1 Water 9200 94.5 17.0 5.0 5.484615 63.205769 19.658560
2 Water 9200 92.0 16.0 3.0 11.057692 62.586538 19.813120
3 Water 4600 53.0 7.5 2.5 3.538462 35.163462 6.876207
G H
0 5.567840 1.174135
1 4.968000 1.883444
2 5.192480 0.564835
3 1.641724 0.144654
Just reassign df.columns.
df.columns = np.append(df.iloc[0, :2], df.columns[2:])
Or,
df.columns = df.iloc[0, :2].tolist() + (df.columns[2:]).tolist()
Next, skip the first row.
df = df.iloc[1:].reset_index(drop=True)
df
Sample type Concentration A B C D E F \
0 Water 9200 95.5 21.0 6.0 11.942308 64.134615 21.498560
1 Water 9200 94.5 17.0 5.0 5.484615 63.205769 19.658560
2 Water 9200 92.0 16.0 3.0 11.057692 62.586538 19.813120
3 Water 4600 53.0 7.5 2.5 3.538462 35.163462 6.876207
G H
0 5.567840 1.174135
1 4.968000 1.883444
2 5.192480 0.564835
3 1.641724 0.144654
reset_index is optional if you want a 0-index for your final output.
Fetch all the column names present in the second header row, then those in the first header row, and combine them into one list of all column names. Now create a DataFrame from the Excel file using header=[0,1], and replace its headers with the combined list you created previously.
import pandas as pd
#read the columns named by the second header row
df1 = pd.read_excel('nanonose.xls', header=1)
cols1 = df1.columns.tolist()
SecondRowColumns = []
for c in cols1:
    if "Unnamed" not in str(c) and "NaN" not in str(c):
        SecondRowColumns.append(c)
#read the columns named by the first header row
df2 = pd.read_excel('nanonose.xls', header=0)
cols2 = df2.columns.tolist()
FirstRowColumns = []
for c in cols2:
    if "Unnamed" not in str(c) and "Nanonose" not in str(c):
        FirstRowColumns.append(c)
AllColumn = SecondRowColumns + FirstRowColumns
df = pd.read_excel('nanonose.xls', header=[0, 1])
df.columns = AllColumn
print(df)
