I have a dataset that consists of categorical and numerical columns.
For instance, a salary dataset with the columns:
['job', 'country_origin', 'age', 'salary', 'degree', 'marital_status']
Four of the columns are categorical, two are numerical, and I want to use three aggregate functions:
cat_col = ['job', 'country_origin','degree','marital_status']
num_col = [ 'age', 'salary']
aggregate_function = ['avg','max','sum']
Currently I have Python code that uses a raw SQL query, and my objective is to get the group-by results for all combinations from the lists above.
My query template: "SELECT cat_col[0], aggregate_function[0](num_col[0]) FROM DB WHERE marital_status = 'married' GROUP BY cat_col[0]"
So the queries are:
q1 = select job, avg(age) from DB where marital_status='married' group by job
q2 = select job, avg(salary) from DB where marital_status='married' group by job
etc.
I used a for loop to get the results for all combinations.
My problem is that I want to translate those queries to pandas. I've spent a couple of hours on it but could not solve it; pandas has a different way of querying data.
Sample dataframe:
df2 = pd.DataFrame([['programmer', 'US', 28, 4000, 'master', 'unmarried'],
                    ['data scientist', 'UK', 30, 5000, 'PhD', 'unmarried'],
                    ['manager', 'US', 48, 9000, 'master', 'married']],
                   columns=['job', 'country_origin', 'age', 'salary', 'degree', 'marital_status'])
First, import the libraries:
import pandas as pd
Build the sample dataframe
df = pd.DataFrame( {
"job" : ["programmer","data scientist","manager"] ,
"country_origin" : ["US","UK","US"],
"age": [28,30,48],
"salary": [4000,5000,9000],
"degree": ["master","PhD","master"],
"marital_status": ["unmarried","unmarried","married"]} )
Apply the where clause and save the result as a new dataframe (not necessary, but easier to read); you can of course use the filtered df inside the groupby:
married=df[df['marital_status']=='married']
q1 = select job, avg(age) from DB where marital_status='married' group by job
married.groupby('job').agg( {"age":"mean"} )
or
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} )
age
job
manager 48
q2 = select job, avg(salary) from DB where marital_status='married' group by job
married.groupby('job').agg( {"salary":"mean"} )
salary
job
manager 9000
You can flatten the table by resetting the index
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} ).reset_index()
job age
0 manager 48
output the two stats together:
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean","salary":"mean"} ).reset_index()
job age salary
0 manager 48 9000
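For reference, pandas 0.25+ also supports named aggregation, which lets you label the output columns directly; a minimal sketch using the same filtered frame:

df[df['marital_status']=='married'].groupby('job').agg(
    avg_age=('age', 'mean'),
    avg_salary=('salary', 'mean'),
).reset_index()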
After you create your dataframe (df), the following command builds your desired table.
df.groupby(['job', 'country_origin', 'degree'])[['age', 'salary']].agg(['mean', 'max', 'sum'])
Here is a complete example:
import pandas as pd
df=pd.DataFrame()
df['job']=['tech','coder','admin','admin','admin','tech']
df['country_origin']=['japan','japan','US','US','India','India']
df['degree']=['cert','bs','bs','ms','bs','cert']
df['age']=[22,23,30,35,40,28]
df['salary']=[30,50,60,90,65,40]
df.groupby(['job', 'country_origin', 'degree'])[['age', 'salary']].agg(['mean', 'max', 'sum'])
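To mirror the original loop over all combinations, here is a minimal sketch, assuming the first sample dataframe (the one with a marital_status column) and the lists from the question; note that pandas spells the average 'mean':

cat_col = ['job', 'country_origin', 'degree']  # marital_status is used only as the filter
num_col = ['age', 'salary']
agg_map = {'avg': 'mean', 'max': 'max', 'sum': 'sum'}  # translate SQL names to pandas names

married = df[df['marital_status'] == 'married']
results = {}
for cat in cat_col:
    for num in num_col:
        for func in agg_map:
            # one small result frame per (categorical, numerical, function) combination
            results[(cat, num, func)] = married.groupby(cat)[num].agg(agg_map[func]).reset_index()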
Related
I'm trying to create a mapping file.
The main task is to compare two dataframes using one column, then return a file of all matching strings in both dataframes alongside some columns from the dataframes.
Example data
df1 = pd.DataFrame({
'Artist':
['50 Cent', 'Ed Sheeran', 'Celine Dion', '2 Chainz', 'Kendrick Lamar'],
'album':
['Get Rich or Die Tryin', '+', 'Courage', 'So Help Me God!', 'DAMN'],
'album_id': ['sdf34', '34tge', '34tgr', '34erg', '779uyj']
})
df2 = pd.DataFrame({
'Artist': ['Beyonce', 'Ed Sheeran', '2 Chainz', 'Kendrick Lamar', 'Jay-Z'],
'Artist_ID': ['frd345', '3te43', '32fh5', '235he', '345fgrt6']
})
So the main idea is to create a function that produces a mapping file: take each artist name from df1, check the artist name column of df2 for matches, and build a mapping dataframe that contains the matching Artist column, the album_id, and the Artist_ID.
I tried the code below, but I'm new to Python and got lost in the function. I would appreciate some help with a new function or a build-up of what I was trying to do.
Thanks!
Code I failed to build:
def get_mapping_file(df1, df2):
    # I don't know what I'm doing :'D
    for i in df2['Artist']:
        if i == df1['Artist'].any():
            name = i
            df1_id = df1.loc[df1['Artist'] == name, ['album_id']]
            id_to_use = df1_id.album_id[0]
            df2.loc[df2['Artist'] == i, 'Artist_ID'] = id_to_use
    return df2
The desired output is:
Artist          Artist_ID  album_id
Ed Sheeran      3te43      34tge
2 Chainz        32fh5      34erg
Kendrick Lamar  235he      779uyj
I am not sure if this is actually what you need, but your desired output is an inner join between the two dataframes:
pd.merge(df1, df2, on='Artist', how='inner')
This will give you the rows for Artists present in both dataframes.
For me, it's easiest to get that result this way:
frame = df1.merge(df2, how='inner')
frame = frame.drop('album', axis=1)
and then you'll have your result.
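Putting the two together, a one-line sketch that reproduces the desired output, column order included:

mapping = df1.merge(df2, on='Artist', how='inner')[['Artist', 'Artist_ID', 'album_id']]
print(mapping)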
I have this dataframe example:
match_id, map_type, server and duration_minutes are common variables of a match. In this example we have 5 different matches.
profile_id, country, rating, color, team, civ, won are specific variables for every player that played this specified match.
How can I obtain a new dataframe with this structure?
match_id, map_type, server, duration_minutes, profile_id_player1, country_player1, rating_player1, color_player1, team_player1, civ_player1, won_player1, profile_id_player2, country_player2, rating_player2, color_player2, team_player2, civ_player2, won_player2?
Only one row by match_id with all specific variables for every player.
EDIT: This is the result from #darth baba's solution; it's almost done.
Thank you in advance.
First group by match_id, then aggregate all the other columns into lists, and then expand those lists into columns. To achieve that, try this:
df = df.groupby(['match_id', 'map_type', 'server', 'duration_minutes'])[['profile_id', 'country', 'rating', 'color', 'team', 'civ', 'won']].agg(list)
df = pd.concat([df[i].apply(pd.Series).set_index(df.index) for i in df.columns], axis=1).reset_index()
# Rename the columns accordingly; note the expanded columns come out grouped
# per variable (player1 then player2 for each variable), not per player
df.columns = ['match_id', 'map_type', 'server', 'duration_minutes', 'profile_id_player1', 'profile_id_player2', 'country_player1', 'country_player2', 'rating_player1', 'rating_player2', 'color_player1', 'color_player2', 'team_player1', 'team_player2', 'civ_player1', 'civ_player2', 'won_player1', 'won_player2']
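If the set of per-player variables grows, a small sketch that builds the same names programmatically (assuming exactly two players per match and the column order noted above):

common_cols = ['match_id', 'map_type', 'server', 'duration_minutes']
player_cols = ['profile_id', 'country', 'rating', 'color', 'team', 'civ', 'won']
# each variable expands to player1 then player2 before the next variable starts
df.columns = common_cols + [f'{c}_player{i}' for c in player_cols for i in (1, 2)]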
Disclaimer: I am brand new to SO and Python.
I am trying to convert my SQL left join query to python.
Example:
df1 is a dataframe which contains the columns: City, Event, date
df2: City, Zip Code, State, Country, etc.
SQL:
SELECT Events.City, Events.Event, Events.Date, Masterlist.State, Masterlist.Country, Masterlist.[Zip Code]
FROM Events LEFT JOIN Masterlist ON Events.City = Masterlist.City
PYTHON:
df1 = pd.read_csv('Events.csv')
df2 = pd.read_csv('Masterlist.csv')
df3 = df1.join(df2, how='left')
df3 output:
City, Event, date, Zip Code, State, Country
Fremont, Charity, 6/11, 99999, CA, US
Oakland, Protest, 6/11, 99998, CA, US
Fremont, Concert, 6/12, null, null, null
Oakland, Concert, 6/12, null, null, null
The ideal output references df2 and returns the values based on City. Currently it is only returning them for the first found value with that City. How can I get it to populate each row with its respective State, Zip Code, and Country?
You didn't specify on; I believe the below works:
import pandas as pd
df1 = pd.DataFrame({'City':["Tucson","Tucson","Portland","San Diego"],
"Event":[1,5,3,2],
"date":[1,2,3,4]})
df2 = pd.DataFrame({"City":["San Diego", "Tucson", "Portland"],
"zip":[1,2,3], "state":["CA", "AZ", "OR"],
"country":["USA","USA","USA"]})
pd.merge(df1,df2, how="left", on="City")
It is also best to provide a minimal dataset (like I did above) to make it easy for people to work with and help you out.
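To match the SQL SELECT list from the question, a sketch (assuming the Events/Masterlist column names from the question rather than the lowercase ones in my sample data):

df3 = pd.merge(df1, df2, how='left', on='City')
df3 = df3[['City', 'Event', 'date', 'State', 'Country', 'Zip Code']]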
I have a pandas dataframe which looks like this:
Name Age
0 tom 10
1 nick 15
2 juli 14
I am trying to iterate over each name --> connect to a MySQL database --> match the name with a column in the database --> fetch the id for the name --> and replace the name with the id in the above dataframe. The desired output is as follows:
Name Age
0 1 10
1 2 15
2 4 14
The following is the code that I have tried:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def#localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
for index, rows in df.iterrows():
    cquery = "select id from students where studentsName=" + '"' + rows['Name'] + '"'
    sid = pd.read_sql(cquery, con=engine)
    df['Name'] = sid['id'].iloc[0]
    print(df[['Name','Age']])
The above code prints the following output:
Name Age
0 1 10
1 1 15
2 1 14
Name Age
0 2 10
1 2 15
2 2 14
Name Age
0 4 10
1 4 15
2 4 14
I understand that it iterates through the entire table for each matched name and prints it. How do I get each value replaced only once?
A slight rewrite of your code; if you want to do a transformation on a dataframe in general, this is a better way to go about it:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def#localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
def replace_name(name: str) -> int:
    cquery = "select id from students where studentsName='{}'".format(name)
    sid = pd.read_sql(cquery, con=engine)
    return sid['id'].iloc[0]

df['Name'] = df['Name'].apply(replace_name)
This should perform the transformation you're looking for
The problem in your code as written is the line:
df['Name'] = sid['id'].iloc[0]
This sets every value in the Name column to the first id entry in your query result.
To accomplish what you want, you want something like:
df.loc[index, 'Name'] = sid['id'].iloc[0]
This will set the value at index location index in column Name to the first id entry in your query result.
This will accomplish what you want to do, and you can stop reading here if you're in a hurry. If you're not in a hurry, and you'd like to become wiser, I encourage you to read on.
It is generally a mistake to loop over the rows in a dataframe. It's also generally a mistake to iterate through a list carrying out a single query on each item in the list. Both of these are slow and error-prone.
A more idiomatic (and faster) way of doing this would be to get all the relevant rows from the database in one query, merge them with your current dataframe, and then drop the column you no longer want. Something like the following:
names = ','.join("'{}'".format(n) for n in df['Name'])
query = f"select id, studentsName as Name from students where studentsName in ({names})"
student_ids = pd.read_sql(query, con=engine)
df_2 = df.merge(student_ids, on='Name', how='left')
df_with_ids = df_2[['id', 'Age']]
One query executed, no loops to worry about. Let the database engine and Pandas do the work for you.
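If you would rather not build the IN list by string formatting (which is fragile and open to injection), here is a sketch using SQLAlchemy bound parameters; it assumes SQLAlchemy 1.4+ and the same students table:

from sqlalchemy import text, bindparam

stmt = text("select id, studentsName as Name from students where studentsName in :names")
stmt = stmt.bindparams(bindparam('names', expanding=True))  # expands the list into the IN clause
student_ids = pd.read_sql(stmt, con=engine, params={'names': df['Name'].tolist()})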
You can do this kind of operation the following way; please follow the comments and feel free to ask questions:
import pandas as pd
# create frame
x = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "age": [1, 2, 3]
    }
)

# create some kind of db
mock_database = {"A": 10, "B": 20, "C": 30}

x["id"] = None  # add empty column
print(x)

# change values in the new column (use .loc to avoid chained-indexing issues)
for i in range(len(x["name"])):
    x.loc[i, "id"] = mock_database.get(x["name"][i])

print("*" * 100)
print(x)
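For this particular lookup pattern, the loop can also be replaced by a vectorized map over the same mock_database dict:

x["id"] = x["name"].map(mock_database)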
A good way to do that would be:
import pandas as pd
import MySQLdb
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://root:Abc#123def#localhost/aivu")
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)
name_ids = []
for student_name in df['Name']:
    cquery = "select id from students where studentsName='{}'".format(student_name)
    sid = pd.read_sql(cquery, con=engine)
    # take the first matching id, or None if the query returned no rows
    name_ids.append(sid['id'].iloc[0] if not sid.empty else None)

# DEBUGGED WITH name_ids = [1,2,3]
df['Name'] = name_ids
print(df)
I checked with an example list of ids and it works, so if the query format is correct this will work.
Performance-wise I could not think of a better solution, since you will have to do a lot of queries (one for each student), but there is probably some way to get all the ids with fewer queries.
I have two dataframes in pandas: one holds data coming from an external source and the other comes from a MySQL db. Both have the same columns (info, date, link).
I am running a merge on them to drop duplicates, in a sense comparing the external data with what is already in the database and dropping the duplicates from the external data before inserting it into the db.
column_titles = [
'Title',
'PublishDate',
'ExternalUrl'
]
params_list = ['2018-07-01 00:00:00']
df = dbData.read(select_article_news_data(), column_titles, params_list)
df_content = df['data'].rename(columns={'Title': 'info','PublishDate' : 'date','ExternalUrl': 'link'})
sql_col_titles = [
'info',
'date',
'link'
]
sql_df = dbData.read(select_article_news_mysql(), sql_col_titles, None, mysql=True)
sql_df = sql_df['data']
df_all = df_content.merge(sql_df.drop_duplicates(), on=['info', 'date', 'link'],
how='left', indicator=True )
df_all = df_all.loc[df_all['_merge'] == 'left_only']
new_df = df_all.drop(columns=['_merge'])
new_df.to_sql(con=cnx, name='news', if_exists='append', index=False)
dbData.read simply invokes the pandas read_sql method, performs some other unrelated processing, and returns a dictionary whose result set is df['data'].
The problem I am getting: when the script is first run I have 306 entries. When I run it again I get 404, and these extra entries are all duplicates. When comparing with a conditional, everything is True, so it should drop these duplicates.
I have used this method before and it works. Could it be to do with my renaming of the column names for df_content?
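For reference, here is a minimal, self-contained sketch of the left-merge/indicator anti-join described above (the in-memory frames stand in for the external source and the database):

import pandas as pd

df_content = pd.DataFrame({'info': ['a', 'b', 'c'],
                           'date': ['2018-07-01', '2018-07-02', '2018-07-03'],
                           'link': ['u1', 'u2', 'u3']})
sql_df = pd.DataFrame({'info': ['b'], 'date': ['2018-07-02'], 'link': ['u2']})

df_all = df_content.merge(sql_df.drop_duplicates(), on=['info', 'date', 'link'],
                          how='left', indicator=True)
new_df = df_all[df_all['_merge'] == 'left_only'].drop(columns=['_merge'])
# rows only count as duplicates if they match exactly: a datetime64 'date' on one
# side and a string 'date' on the other will not line up (newer pandas even raises
# on such merges), so check that the dtypes match on both sides
print(new_df)  # keeps only 'a' and 'c'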