I've been learning pandas for a couple of days. I am migrating a SQL DB to Python and have encountered this SQL statement (example):
select * from
table_A a
left join table_B b
on a.ide = b.ide
and a.credit_type = case when b.type > 0 then b.credit_type else a.credit_type end
I've only been able to migrate the first condition; my difficulty is with the last line, and I don't know how to migrate it. The tables are actually SQL queries that I've stored in dataframes.
merge = pd.merge(df_query_a, df_query_b, on='ide', how='left')
Any suggestions, please?
The CASE condition is like an if-then-else statement, which you can implement in pandas using np.where(), as below.
Based on the dataframe merge resulting from the left join:
import numpy as np
merge['credit_type_x'] = np.where(merge['type_y'] > 0, merge['credit_type_y'], merge['credit_type_x'])
Here the column names credit_type_x and credit_type_y are created by the pandas merge function, which renames conflicting (identical) column names from the two source dataframes. If the dataframe merge doesn't have a type_y column because column type appears only in Table_B and not in Table_A, you can use the column name type here instead.
Alternatively, since you only need to modify credit_type_x when type_y > 0 and keep it unchanged otherwise, you can also do it simply with:
merge.loc[merge['type_y'] > 0, 'credit_type_x'] = merge['credit_type_y']
Below are two options to approach your problem:
1. You can add a column in df_query_a based on the condition you need, considering the two dataframes, and then perform the merge.
2. You can try a library such as pandasql.
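For example, a minimal sketch of the second option (assuming the pandasql package is installed and your two queries are already stored in df_query_a and df_query_b):
# Sketch only: run the original SQL against the dataframes with pandasql.
# Column names are taken from the question; pandasql executes the query in SQLite.
from pandasql import sqldf

query = """
select *
from df_query_a a
left join df_query_b b
  on a.ide = b.ide
 and a.credit_type = case when b.type > 0 then b.credit_type else a.credit_type end
"""
merge = sqldf(query, locals())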
So I have two tables, Table 1 and Table 2, and Table 2 is sorted by date, from recent to old. In Excel, when I do a VLOOKUP in Table 1 against Table 2, it only picks the first matching value from Table 2 and does not keep searching for further matches of the same value.
So I tried replicating it in Python with the merge function, but found that it repeats the value as many times as it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
(TABLE1, TABLE2, the merge result, and the expected result from the Excel VLOOKUP were posted as images.)
Is there any way this could be achieved with the merge function or any other Python function?
Typing this blind, as you included your data as images, not text.
# The index is a very important element in a DataFrame
# We will see that in a bit
result = table1.set_index('Country')
# For each country, only keep the first row
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to one or more columns of
# another DataFrame, the assignment is aligned based on the index of the two
# frames. This is the equivalent of VLOOKUP
result.loc[:, ['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
Edit: Since you want a straight-up VLOOKUP, just use join. It appears to find the very first match.
table1.join(table2, rsuffix='r', lsuffix='l')
The docs seem to indicate it performs similarly to a vlookup: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
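A minimal sketch of that (an assumption on my part: 'Country' is the lookup key, so both frames are indexed by it first, since join matches on the index; duplicates are dropped from table2 so only its topmost row per country survives, mimicking VLOOKUP):
# Index both frames by Country so join matches on it, and keep only the
# topmost (most recent) row per country from table2 before joining.
result = table1.set_index('Country').join(
    table2.drop_duplicates(subset='Country').set_index('Country'),
    how='left'
).reset_index()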
I'd recommend approaching this more like a SQL join than a VLOOKUP. VLOOKUP finds the first matching row, from top to bottom, which could be completely arbitrary depending on how you sort your table/array in Excel. "True" database systems and their related functions are more detailed than this, for good reason.
In order to join only one row from the right table onto one row of the left table, you'll need some kind of aggregation or selection - so in your case, that'd be either MAX or MIN.
The question is, which column is more important: the date or the age?
import pandas as pd
df1 = pd.DataFrame({
'Country':['GERM','LIB','ARG','BNG','LITH','GHAN'],
'Name':['Dave','Mike','Pete','Shirval','Kwasi','Delali']
})
df2 = pd.DataFrame({
'Country':['GERM','LIB','ARG','BNG','LITH','GHAN','LIB','ARG','BNG'],
'Age':[35,40,27,87,90,30,61,18,45],
'Date':['7/10/2020','7/9/2020','7/8/2020','7/7/2020','7/6/2020','7/5/2020','7/4/2020','7/3/2020','7/2/2020']
})
# Convert Date to real datetimes so 'max' picks the chronologically latest date,
# not the lexicographically largest string
df2['Date'] = pd.to_datetime(df2['Date'])

df1.set_index('Country').join(
    df2.groupby('Country').agg({'Age': 'max', 'Date': 'max'}),
    how='left', lsuffix='l', rsuffix='r')
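If what you actually want is the whole most-recent row per country (the per-column maxima above can come from different rows of df2), a minimal sketch under that assumption:
# Keep the single most recent row per country instead of aggregating each
# column independently; relies on df2['Date'] being a datetime as converted above.
latest = df2.sort_values('Date', ascending=False).drop_duplicates('Country')
df1.merge(latest, on='Country', how='left')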
I have a dataframe, call it current_data. This dataframe is generated by running statistical functions over another dataframe, current_data_raw. It has a compound index on the columns "Method" and "Request.Name".
current_data = current_data_raw.groupby(['Name', 'Request.Method']).size().reset_index().set_index(['Name', 'Request.Method'])
I then run a bunch of statistical functions over current_data_raw, adding new columns to current_data.
I then need to query that dataframe for specific values of columns. I would love to do something like:
val = df['Request.Name' == some_name, 'Method' = some_method]['Average']
However this isn't working, nor are the variants I have attempted above. .xs is returning a Series. I could grab the only row in the Series, but that doesn't seem proper.
If you want to select from a MultiIndex, you can use a tuple with the values in the order of the levels; here you don't specify an index level name like 'Request.Name':
val = df.loc[(some_name, some_method), 'Average']
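A tiny runnable demo of the tuple selection (the data here is made up; the real frame comes from the groupby step in the question):
import pandas as pd

# Hypothetical stand-in for the aggregated frame with a two-level index
df = (pd.DataFrame({'Name': ['a', 'a', 'b'],
                    'Request.Method': ['GET', 'POST', 'GET'],
                    'Average': [1.2, 3.4, 5.6]})
        .set_index(['Name', 'Request.Method']))

val = df.loc[('a', 'GET'), 'Average']   # scalar 1.2 when the pair is unique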
Another way is to use DataFrame.query, but if the level names contain spaces or a ., it is necessary to use backticks:
val = df.query("`Request.Name`=='some_name' & `Request.Method`=='some_method'")['Average']
If the level names are single words:
val = df.query("Name=='some_name' & Method=='some_method'")['Average']
I'm currently looking at a Reddit data set which has comments and subreddit type as two of its columns. As there are too many rows, my goal is to restrict the dataset to something smaller.
By looking at df['subreddit'].value_counts() > 10000, I am looking for subreddits with more than 10000 comments. How do I create a new dataframe that meets this condition? Would I use loc or set up some kind of if statement?
First, you are performing df['subreddit'].value_counts(). This returns a Series; what you might want to do is transform it into a dataframe so you can filter it afterwards.
What I would do is:
aux_df = df['subreddit'].value_counts().reset_index()
filtered_df = aux_df[aux_df['subreddit'] > 10000].rename(columns={'index':'subreddit','subreddit':'amount'})
Optionally with loc:
filtered_df = aux_df.loc[aux_df['subreddit'].gt(10000)].rename(columns={'index':'subreddit','subreddit':'amount'})
Edit
Based on the comment, I would first create the list of all subreddits with more than 10000 comments, as provided above, and then simply filter the original dataframe with those values:
df = df[df['subreddit'].isin(list(filtered_df['subreddit']))]
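An equivalent route (a sketch using groupby/transform instead of the value_counts frame above) filters in a single step:
# Keep only rows whose subreddit occurs more than 10000 times in df
df = df[df.groupby('subreddit')['subreddit'].transform('size') > 10000]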
I have two CSV files with 30 to 40 thousand records each.
I loaded the CSV files into two corresponding dataframes.
Now I want to perform this SQL operation on the dataframes instead of in SQLite:
update table1
set column1 = (select column1 from table2 where table1.Id == table2.Id),
    column2 = (select column2 from table2 where table1.Id == table2.Id)
where column3 = 'some_value';
I tried to perform the update on the dataframe in 4 steps (sketched in code below):
1. merging the dataframes on the common Id
2. getting the Ids from the dataframe where column3 has 'some_value'
3. filtering the dataframe from step 1 based on the Ids received in step 2
4. using a lambda function to insert into the dataframe where the Id matches
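A minimal sketch of those steps (assuming the two dataframes are named df1 and df2, the column names Id, column1, column2, column3 come from the SQL above, and Id is unique in df2; step 4's row-wise lambda is replaced by a vectorized assignment):
# Step 1: left-merge table2's columns onto table1 by Id; the suffix keeps them distinct
merged = df1.merge(df2[['Id', 'column1', 'column2']], on='Id', how='left', suffixes=('', '_t2'))

# Steps 2-4: only rows where column3 == 'some_value' take table2's values
mask = merged['column3'] == 'some_value'
merged.loc[mask, 'column1'] = merged.loc[mask, 'column1_t2']
merged.loc[mask, 'column2'] = merged.loc[mask, 'column2_t2']

df1 = merged.drop(columns=['column1_t2', 'column2_t2'])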
I just want to know other views on this approach and whether there are any better solutions. One important thing is that the dataframes are quite large, so I feel like using SQLite would be better than pandas, as it gives the result in a single query and is much faster.
Shall I use SQLite, or is there a better way to perform this operation on the dataframes?
Any views on this will be appreciated. Thank you.