I have this dataframe example:
match_id, map_type, server and duration_minutes are variables common to the whole match. In this example we have 5 different matches.
profile_id, country, rating, color, team, civ and won are per-player variables, one set for every player that played in that match.
How can I obtain a new dataframe with this structure?
match_id, map_type, server, duration_minutes, profile_id_player1, country_player1, rating_player1, color_player1, team_player1, civ_player1, won_player1, profile_id_player2, country_player2, rating_player2, color_player2, team_player2, civ_player2, won_player2?
Only one row per match_id, with all the player-specific variables for every player.
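For reference, a minimal sketch of what such input might look like (smaller than the 5 matches in the question, all values invented, two players per match):
import pandas as pd

df = pd.DataFrame({
    'match_id': [1, 1, 2, 2],
    'map_type': ['Arabia', 'Arabia', 'Arena', 'Arena'],
    'server': ['ukwest', 'ukwest', 'brazilsouth', 'brazilsouth'],
    'duration_minutes': [34, 34, 51, 51],
    'profile_id': [101, 102, 103, 104],
    'country': ['ES', 'FR', 'BR', 'US'],
    'rating': [1050, 1060, 1500, 1480],
    'color': [1, 2, 1, 2],
    'team': [1, 2, 1, 2],
    'civ': ['Aztecs', 'Franks', 'Goths', 'Huns'],
    'won': [True, False, False, True],
})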
EDIT: This is the result from #darth baba's solution, almost done:
Thank you in advance.
First group by match_id, then aggregate all the other columns into lists, and then expand those lists into columns. To achieve that, try this:
# Group per match and collect the per-player values into lists
df = df.groupby(['match_id', 'map_type', 'server', 'duration_minutes'])[['profile_id', 'country', 'rating', 'color', 'team', 'civ', 'won']].agg(list)
# Expand each list column into one column per player
df = pd.concat([df[i].apply(pd.Series).set_index(df.index) for i in df.columns], axis=1).reset_index()
# Rename the columns accordingly
df.columns = [ 'match_id', 'map_type', 'server', 'duration_minutes', 'profile_id_player1', 'country_player1', 'rating_player1', 'color_player1', 'team_player1', 'civ_player1', 'won_player1', 'profile_id_player2', 'country_player2', 'rating_player2', 'color_player2', 'team_player2', 'civ_player2', 'won_player2']
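As a possible alternative (just a sketch, assuming exactly the columns described in the question and a pandas version recent enough to accept a list of columns as the pivot index), you could number the players within each match and pivot to a wide layout:
# Number the players within each match (1, 2, ...)
df['player'] = df.groupby('match_id').cumcount() + 1
wide = df.pivot(index=['match_id', 'map_type', 'server', 'duration_minutes'],
                columns='player',
                values=['profile_id', 'country', 'rating', 'color', 'team', 'civ', 'won'])
# Collapse the (value, player) MultiIndex into names like 'profile_id_player1'
wide.columns = [f'{col}_player{num}' for col, num in wide.columns]
wide = wide.reset_index()
Note that the columns come out grouped by variable rather than by player, so you may want to reorder them afterwards.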
I'm trying to create a mapping file.
The main issue is to compare two dataframes using one column, then return a file of all matching strings in both dataframes alongside some columns from the dataframes.
Example data
df1 = pd.DataFrame({
    'Artist': ['50 Cent', 'Ed Sheeran', 'Celine Dion', '2 Chainz', 'Kendrick Lamar'],
    'album': ['Get Rich or Die Tryin', '+', 'Courage', 'So Help Me God!', 'DAMN'],
    'album_id': ['sdf34', '34tge', '34tgr', '34erg', '779uyj']
})
df2 = pd.DataFrame({
    'Artist': ['Beyonce', 'Ed Sheeran', '2 Chainz', 'Kendrick Lamar', 'Jay-Z'],
    'Artist_ID': ['frd345', '3te43', '32fh5', '235he', '345fgrt6']
})
So the main idea is to create a function that builds a mapping file: take each artist name from df1, check the Artist column of df2 for matches, and then create a mapping dataframe containing the matching artist, the album_id and the Artist_ID.
I tried the code below, but I'm new to Python so I got lost in the function. I would appreciate some help with a new function, or a build-up on what I was trying to do.
Thanks!
Code I failed to build:
def get_mapping_file(df1, df2):
    # I don't know what I'm doing :'D
    for i in df2['Artist']:
        if i == df1['Artist'].any():
            name = i
            df1_id = df1.loc[df1['Artist'] == name, ['album_id']]
            id_to_use = df1_id.album_id[0]
            df2.loc[df2['Artist'] == i, 'Artist_ID'] = id_to_use
    return df2
The desired output is:
Artist          Artist_ID  album_id
Ed Sheeran      3te43      34tge
2 Chainz        32fh5      34erg
Kendrick Lamar  235he      779uyj
I am not sure if this is actually what you need, but your desired output is an inner join between the two dataframes:
pd.merge(df1, df2, on='Artist', how='inner')
This will give you the rows for Artists present in both dataframes.
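If you want the output to contain exactly the three columns shown in the desired result, you could select them after the merge ('mapping' below is just an illustrative name):
mapping = pd.merge(df1, df2, on='Artist', how='inner')[['Artist', 'Artist_ID', 'album_id']]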
For me, the easiest way to get that result is this:
frame = df1.merge(df2, how='inner')
frame = frame.drop('album', axis=1)
and then you'll have your result. Thanks!
The program creates some random products and then creates orders by randomly choosing a product.
Right now every order only has one item; a future version will randomize the number of line items per order.
I've never used Python or Pandas before and I wanted to make sure that my approach is the most efficient way of adding a new row to a DataFrame and selecting a random row from a DataFrame.
Any suggestions?
Thank you
def get_random_products(count=500):
    x = 0
    df = pd.DataFrame(columns=['product_id', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
    while x < count:
        row = pd.DataFrame([[x,
                             get_random_SKU(),
                             get_price(),
                             get_category(),
                             get_size(),
                             get_color(),
                             get_style(),
                             get_gender()]],
                           columns=['product_id', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
        df = df.append(row, ignore_index=True)
        x += 1
    return df
#---
def get_random_orders(products, count=1000, start_order_id=1, number_of_customers=500):
    # CustomerID OrderID OrderDate Price Category Size Color Style Gender
    x = 0
    df = pd.DataFrame(columns=['customer_id', 'order_id', 'order_date', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
    while x < count:
        # Each time through, choose a random product to be in the order
        p = products.to_records()[random.randint(0, len(products)-1)]
        row = pd.DataFrame([[get_customer_id(number_of_customers),
                             x + 1,
                             get_order_date(),
                             p['SKU'],
                             p['price'],
                             p['category'],
                             p['size'],
                             p['color'],
                             p['style'],
                             p['gender']]],
                           columns=['customer_id', 'order_id', 'order_date', 'SKU', 'price', 'category', 'size', 'color', 'style', 'gender'])
        df = df.append(row, ignore_index=True)
        x += 1
    return df
#Main code here
catalog = get_random_products(1000)
orders = get_random_orders(catalog, 1000, 1, 500)
Toward an efficient answer:
My suggestion dives a bit into the rules of database normalization. The general idea of these rules is to reduce data redundancy (why enter the same data more than once?). That information is helpful in this scenario and will prepare your code for your end goal of multiple line items per order.
Luckily, customer/product/order databases are a typical example for these rules. For customers/orders/line-items, the typical recommendation is to have one table for each of these kinds of "entities", with the columns of each table pertaining only to that entity. If a table relates one entity to another, it holds just an identifier column for the foreign entity (e.g. 'customer_id' or 'SKU').
So, for your question, my initial setup would be the following:
# Customers DF, no foreign entities
c_df = pd.DataFrame(columns=['customer_id', 'name', 'gender'])
# Products DF, no foreign entities
p_df = pd.DataFrame(columns=['SKU', 'price', 'category', 'size', 'color', 'style'])
# Orders DF, with 'customer_id' being an identifier to tie into c_df
o_df = pd.DataFrame(columns=['order_id', 'order_date', 'customer_id'])
# Line-Items DF, with both order_id and SKU being foreign identities.
li_df = pd.DataFrame(columns=['order_id', 'SKU', 'quantity'])
Once things are set up, I would generate the entities for each DF separately.
def _gen_customers(df, num=1):
    new_customers = []
    for i in range(num):
        new_customers.append({
            'customer_id': SOME_ID,  # Not sure how you want these to be generated; you could just use the loop counter i if it is arbitrary
            'name': SOME_NAME,  # I made this assuming you want to name the customers. Could just be left out if unnecessary.
            'gender': random.choice(['m', 'f'])  # https://docs.python.org/3/library/random.html#random.choice
        })
    new_df = df.append(new_customers, ignore_index=True)  # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
    return new_df
c_df = _gen_customers(c_df, 500)
You can do similar functions for the other DFs. To choose from foreign identifiers randomly, you can set up lists of unique values to choose from as so:
all_customers = c_df['customer_id'].unique() # https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html?highlight=unique#pandas.Series.unique
def _pick_customers(num):
    """ Returns a list of customer_ids of length 'num'. """
    return random.choices(all_customers, k=num)
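A hypothetical usage, picking one (possibly repeating) customer_id for each of 1000 new orders:
order_customer_ids = _pick_customers(1000)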
For line items, just make one row for each SKU per order. You can start with one line per order, but it is expandable to as many as you would like.
Then, typically, you would make the columns that will be searched by most often the "index", to speed up the searching (see pandas.DataFrame.set_index).
c_df = c_df.set_index(keys='customer_id', drop=True)
p_df = p_df.set_index(keys='SKU', drop=True)
o_df = o_df.set_index(keys='order_id', drop=True)
li_df = li_df.set_index('order_id', drop=True)
You may then merge the DFs as applicable to whatever your scenario is.
pandas.DataFrame.merge Docs
Info on types of merges
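As a sketch of where this leads (assuming the indexes set above and the column names defined earlier), a flat view of the orders could be reassembled like this:
orders_full = (
    li_df
    .join(o_df)                     # both are indexed by order_id
    .join(p_df, on='SKU')           # look up product details by SKU
    .join(c_df, on='customer_id')   # look up customer details
)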
Problem: I have a dataframe whose column headers contain variations of several strings, e.g. 'Fee_code', 'zip_code', etc., and also some others like 'street_address', 'violation_street address', etc.
Expected Outcome: A list with all the column headers that match the keywords Fee, address, code, name, and possibly others, depending on the specific file I'll work on. Note that I DO want to keep the 'agency_name' column header.
Solution: I came up with this function to list all of the strings listed above - and some more-:
def drop_cols(df):
    list1 = list(df.filter(like='nam', axis=1))
    list1.remove('agency_name')
    list2 = list(df.filter(like='add', axis=1))
    list3 = list(df.filter(like='fee', axis=1))
    list4 = list(df.filter(like='code', axis=1))
    list5 = list(df.filter(like='status', axis=1))
    entry = list1 + list2 + list3 + list4 + list5
    return entry
Challenge: This code works, but it's bulky and I'm wondering if there are better ways to achieve the same result.
Sample of column headers: ['ticket_id', 'agency_name', 'inspector_name', 'violator_name', 'violation_street_number', 'violation_street_name', 'violation_zip_code', 'mailing_address_str_number', 'mailing_address_str_name', 'city', 'state', 'zip_code', 'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date', 'violation_code', 'violation_description', 'disposition', 'fine_amount', 'admin_fee', 'state_fee', 'late_fee', 'discount_amount', 'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due', 'payment_date', 'payment_status', 'collection_status', 'grafitti_status', 'compliance_detail', 'compliance']
One way you could go about it:
# build a regex search pattern from the relevant terms
search = '|'.join(['fee', 'address', 'code', 'name'])
# use the filter method in pandas with the regex option,
# then drop the 'agency_name' column
# (d is the dataframe)
d.filter(regex=search, axis=1).drop('agency_name', axis=1)
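If what you ultimately need is the plain list of column names to drop (everything matching the keywords except 'agency_name'), a small sketch building on the same pattern ('cols_to_drop' and 'd_clean' are just illustrative names):
cols_to_drop = [c for c in d.filter(regex=search, axis=1).columns if c != 'agency_name']
d_clean = d.drop(columns=cols_to_drop)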
I have a dataset consisting of categorical and numerical columns.
For instance: salary dataset
columns: ['job', 'country_origin', 'age', 'salary', 'degree','marital_status']
four categorical columns and two numerical columns and I want to use three aggregate functions:
cat_col = ['job', 'country_origin','degree','marital_status']
num_col = [ 'age', 'salary']
aggregate_function = ['avg','max','sum']
Currently, my Python code uses raw queries, while my objective is to get the group-by query results for all combinations from the lists above:
my query: "SELECT cat_col[0], aggregate_function[0](num_col[0]) from DB where marital_status = 'married' group by cat_col[0]"
So queries are:
q1 = select job, avg(age) from DB where marital_status='married' group by job
q2 = select job, avg(salary) from DB where marital_status='married' group by job
etc
I used for loop to get the result from all combinations.
My problem is that I want to translate those queries into Pandas. I've spent a couple of hours but could not solve it.
Pandas has a different way of querying data.
Sample dataframe:
df2 = pd.DataFrame(np.array([['programmer', 'US', 28, 4000, 'master', 'unmarried'],
                             ['data scientist', 'UK', 30, 5000, 'PhD', 'unmarried'],
                             ['manager', 'US', 48, 9000, 'master', 'married']]),
                   columns=['job', 'country_origin', 'age', 'salary', 'degree', 'marital_status'])
First import the libraries
import pandas as pd
Build the sample dataframe
df = pd.DataFrame( {
"job" : ["programmer","data scientist","manager"] ,
"country_origin" : ["US","UK","US"],
"age": [28,30,48],
"salary": [4000,5000,9000],
"degree": ["master","PhD","master"],
"marital_status": ["unmarried","unmarried","married"]} )
Apply the where clause and save the result as a new dataframe (not necessary, but easier to read); you can of course use the filtered df inside the groupby:
married=df[df['marital_status']=='married']
q1 = select job, avg(age) from DB where marital_status='married' group by job
married.groupby('job').agg( {"age":"mean"} )
or
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} )
age
job
manager 48
q2 = select job, avg(salary) from DB where marital_status='married' group by job
married.groupby('job').agg( {"salary":"mean"} )
salary
job
manager 9000
You can flatten the table by resetting the index
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean"} ).reset_index()
job age
0 manager 48
output the two stats together:
df[df['marital_status']=='married'].groupby('job').agg( {"age":"mean","salary":"mean"} ).reset_index()
job age salary
0 manager 48 9000
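If you want to reproduce the full loop over all combinations from the question, one possible sketch (using the married dataframe from above and the lists defined in the question; agg_map translates the SQL-style names to pandas aggregation names):
agg_map = {'avg': 'mean', 'max': 'max', 'sum': 'sum'}
results = {}
for cat in cat_col:
    for num in num_col:
        for fn in aggregate_function:
            results[(cat, num, fn)] = married.groupby(cat).agg({num: agg_map[fn]}).reset_index()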
After you create your dataframe (df), the following command builds your desired table.
df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])
Here is a complete example:
import numpy as np
import pandas as pd
df=pd.DataFrame()
df['job']=['tech','coder','admin','admin','admin','tech']
df['country_origin']=['japan','japan','US','US','India','India']
df['degree']=['cert','bs','bs','ms','bs','cert']
df['age']=[22,23,30,35,40,28]
df['salary']=[30,50,60,90,65,40]
df.groupby(['job', 'country_origin','degree'])[['age', 'salary']].agg([np.mean,max,sum])
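If you prefer a flat table rather than the MultiIndex columns that agg produces, one possible follow-up (a sketch building on the example above; 'result' is just an illustrative name):
result = df.groupby(['job', 'country_origin', 'degree'])[['age', 'salary']].agg([np.mean, max, sum])
# collapse the (column, aggregation) MultiIndex into single names like 'age_mean'
result.columns = ['_'.join(col) for col in result.columns]
result = result.reset_index()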
I have a function that takes in a dataframe and returns a (reduced) dataframe, e.g. like this:
def transforming_data(dataframe, col_1, col_2, normalized = True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns dataframe
    '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
    return df
For the following code, this gives me:
import pandas as pd
import numpy as np
np.random.seed(12)
def transforming_data(dataframe, col_1, col_2, normalized = True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns df '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize = normalized).unstack(fill_value = 0)
    return df
numrows = 1000
dataframe = pd.DataFrame({'Numerical': np.random.randn(numrows),
                          'Category': np.random.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
                          'Response 1': np.random.choice(['Yes', 'Maybe', 'No', 'Don\'t know'], numrows),
                          'Response 2': np.random.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows)})
test = transforming_data(dataframe, 'Response 1', 'Category')
print(test)
# Output
# Response 1 Don't know Maybe No Yes
# Category
# Anaconda 0.275229 0.232416 0.217125 0.275229
# Elephant 0.220588 0.270588 0.255882 0.252941
# Panda 0.258258 0.222222 0.273273 0.246246
So far, so good.
Now I want to use the function transforming_data inside a for loop for every column in dataframe (as I have lots of columns, not just two) and save the resulting dataframe to a new dataframe, e.g. test_response_1 and test_response_2 for this example.
Can someone point me in the right direction - i.e. how to implement the loop correctly?
So far, I am using something like this, but cannot figure out how to save the dataframe:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    # here, I need to save temp_df outside of the loop but don't know how to
Thanks a lot for pointers and help. (Note: the most similar question I found does not talk about actually saving the data frame, so it doesn't help me with this.)
If you want to save (in memory) all of the temp_df's from your loop, you can append them to a list that you can then index afterwards:
temp_dfs = []
for column in dataframe.columns.tolist():  # you don't actually need the tolist() method here
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs.append(temp_df)
If you'd rather be able to access these temp_df's by the column name that was used to transform them, you could assign each to a dictionary, using the column as the key:
temp_dfs = {}
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs[column] = temp_df
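You can then pull any individual result back out by its column name, for example:
response_1_counts = temp_dfs['Response 1']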
If by "save" you meant "write to disk", then you can use one of the many to_<file_format>() methods that pandas provides:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_df.to_csv('temp_df{}.csv'.format(column))
Here's the to_csv() docs.
The simplest solution would be to save the result dataframes into a list. Assuming that all columns that you want to loop over have the text Response in their column name:
result_dframes = []
for col_name in dataframe.filter(like='Response').columns:
    result_dframe = transforming_data(dataframe, col_name, 'Category')
    result_dframes.append(result_dframe)
Alternatively you can also obtain the exact same result with a list comprehension instead of a for-loop:
result_dframes = [
    transforming_data(dataframe, col_name, 'Category')
    for col_name in dataframe.filter(like='Response')
]
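If you later want all of those results stacked into one long dataframe, a possible sketch (columns that do not occur for a given response column will simply come out as NaN):
combined = pd.concat(result_dframes, keys=list(dataframe.filter(like='Response')))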