Related
I have a large CSV file of sports data, and I need to transform it so that teams with the same game_id are on the same row, creating new columns based on the homeAway column and the existing columns. Is there a way to do this with Pandas?
Existing format:
game_id school conference homeAway points
332410041 Connecticut American Athletic home 18
332410041 Towson CAA away 33
Desired format:
game_id home_school home_conference home_points away_school away_conference away_points
332410041 Connecticut American Athletic 18 Towson CAA 33
One way to solve this is to load the table into a Pandas DataFrame and filter it by 'homeAway' to create separate 'home' and 'away' DataFrames. The columns in each are relabelled with a 'home_' or 'away_' prefix, while the key column 'game_id' is preserved, and the two DataFrames are then merged on that key to produce the desired output.
import pandas as pd
data = {'game_id': [332410041, 332410041],
'school': ['Connecticut', 'Towson'],
'conference':['American Athletic', 'CAA'],
'homeAway': ['home', 'away'],
'points': [18, 33]
}
df = pd.DataFrame(data)
home = df[df['homeAway'] == 'home'].copy()
del home['homeAway']
home.columns = ['game_id', 'home_school', 'home_conference', 'home_points']
away = df[df['homeAway'] == 'away'].copy()
del away['homeAway']
away.columns = ['game_id', 'away_school', 'away_conference', 'away_points']
home.merge(away)
Create two dataframes selected by the unique values in the 'homeAway' column, 'home' and 'away', using Boolean indexing.
Drop the obsolete 'homeAway' column.
Rename the appropriate columns with a 'home_', and 'away_' prefix.
This can be done in a for-loop, with each dataframe appended to a list, and then consolidated into a simple list comprehension (an explicit for-loop version is sketched after the output below).
Use pd.merge to combine the two dataframes on the common 'game_id' column.
See Merge, join, concatenate and compare and Pandas Merging 101 for additional details.
import pandas as pd
# test dataframe
data = {'game_id': [332410041, 332410041, 662410041, 662410041, 772410041, 772410041],
'school': ['Connecticut', 'Towson', 'NY', 'CA', 'FL', 'AL'],
'conference': ['American Athletic', 'CAA', 'a', 'b', 'c', 'd'],
'homeAway': ['home', 'away', 'home', 'away', 'home', 'away'], 'points': [18, 33, 1, 2, 3, 4]}
df = pd.DataFrame(data)
# create list of dataframes
dfl = [(df[df.homeAway.eq(loc)]
.drop('homeAway', axis=1)
.rename({'school': f'{loc}_school',
'conference': f'{loc}_conference',
'points': f'{loc}_points'}, axis=1))
for loc in df.homeAway.unique()]
# combine the dataframes
df_new = pd.merge(dfl[0], dfl[1])
# display(df_new)
game_id home_school home_conference home_points away_school away_conference away_points
0 332410041 Connecticut American Athletic 18 Towson CAA 33
1 662410041 NY a 1 CA b 2
2 772410041 FL c 3 AL d 4
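For reference, a rough sketch of the explicit for-loop version mentioned in the steps above, using the same test df; it builds the same list of per-location dataframes as the list comprehension:
# build the list of per-location dataframes with an explicit loop
dfl = []
for loc in df.homeAway.unique():
    part = (df[df.homeAway.eq(loc)]
            .drop('homeAway', axis=1)
            .rename({'school': f'{loc}_school',
                     'conference': f'{loc}_conference',
                     'points': f'{loc}_points'}, axis=1))
    dfl.append(part)
df_new = pd.merge(dfl[0], dfl[1])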
How do I turn the headers inside the rows into columns?
For example, I have the DataFrame below (reproduced in the EDIT) and would like to reshape it into the format shown under "looking to get".
EDIT:
Code to produce current df example
import pandas as pd
df = pd.DataFrame({'Date':[2020,2021,2022], 'James':'', ' Sales': [3,4,5], ' City':'NY', ' DIV':'a', 'KIM':'', ' Sales ': [3,4,5], ' City ':'SF', ' DIV ':'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a label (e.g. James) has no indent, it becomes a value in a new column until the next non-indented label (KIM) is hit. In other words, I need a way to categorise each non-indented header into a new column, stopping when the next header comes up (KIM).
EDIT 2: There are not only two names (KIM or JAMES); there are around 20 names, and not only the three second levels (Sales, City, DIV). Different names have more than 3 second levels; some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
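A rough sketch of that idea, assuming the DataFrame built in the EDIT above (a label with no leading space is treated as a name and carried down to the rows below it):
labels = df['index']
is_name = ~labels.str.startswith(' ') & labels.ne('Date')   # James, KIM
df['Name'] = labels.where(is_name).ffill()                  # carry each name down its block
df['Account'] = labels.str.strip()                          # second-level label without the indent
df = df[~is_name & labels.ne('Date')]                       # keep only the indented rows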
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City',],
'2020': ['', '3', 'NY', '', '4', 'SF'],
'2021': ['', '4', 'NY', '', '5', 'SF'],
})
def rows_to_columns(group):
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group

df.loc[df['2020'] == '', 'person'] = df.date
df.person = df.person.fillna(method='ffill')
new_df = (df
          .groupby('person')
          .apply(lambda x: rows_to_columns(x))
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to
Copy the name into a separate column and fill that column using .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name. Otherwise it wreaks havoc.
All other values, such as 'City' and 'Div', are converted by rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies each row's value into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is the intended format once the column 'Sales' is dropped.
Note: This solution probably does not work well on larger datasets.
You gave more details, and I see you are not working with multi-level indexes. In that case, the best approach would be to create the DataFrame in the format you need from the start. The way you are creating the first DataFrame is not well structured: the information is not indexed by name (James/KIM), since the names are just columns with empty values and have no link to the other values, and the stacking relies on blank spaces in strings. Take a look at multi-indexing and generate a DataFrame you can work with, or create the DataFrame in the final format you need.
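For example, a minimal sketch of building the data directly in the layout asked for in the question:
import pandas as pd
df = pd.DataFrame({'Name': ['James', 'KIM'],
                   'City': ['NY', 'SF'],
                   'DIV': ['a', 'b'],
                   'Account': ['Sales', 'Sales'],
                   '2020': [3, 3],
                   '2021': [4, 4],
                   '2022': [5, 5]})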
-- Answer considering multi-level indexes --
Using the little information provided, I see your DataFrame is stacked, meaning it has multiple indexes. The first level is the person (James/KIM) and the second level is Sales/City/DIV. So your DataFrame should be created like this:
import pandas
multi_index = pandas.MultiIndex.from_tuples([
('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = { '2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi-level DataFrame, there are many ways to transform it. This is what we will do to flatten it to one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changes across years, so we can reduce the City and DIV DataFrames to Series, taking the first column as the reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
When I'm working in SQL, I find almost all the things I do with a column are related to the following four operations:
Add a column.
Remove a column.
Change a column type.
Rename a column.
What is the preferred way to do these four operations in pandas? For example, let's suppose I am starting with the following DataFrame:
import pandas as pd
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
How would I:
Change the df.id column from a string (or object as it says) to an int64 ?
Rename the column product to product_type ?
Add a new column called 'cost' with values [2.99, 3.99] ?
Remove the column called brand ?
Simple and complete:
import numpy as np
import pandas as pd
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
# Change the df.id column from a string (or object as it says) to an int64 ?
df['id'] = df['id'].astype(np.int64)
# Rename the column product to product_type ?
df = df.rename(columns={'product': 'product_type'})
# Add a new column called 'cost' with values [2.99, 3.99] ?
df['cost'] = pd.Series([2.99, 3.99])
# Remove the column called brand ?
df = df.drop(columns='brand')
These functions can also be chained together, though I would not recommend it, as a chained one-liner is harder to read and debug than the step-by-step version above:
# do all the steps above with a single line
df = (df.astype({'id': np.int64})
        .rename(columns={'product': 'product_type'})
        .assign(cost=[2.99, 3.99])
        .drop(columns='brand'))
There is also another way, in which you can use inplace=True. This mutates the DataFrame directly. I don't recommend it, as it is not as explicit as the first method:
# Using inplace=True
df['id'] = df['id'].astype(np.int64)  # astype has no inplace option, so assign back
df.rename(columns={'product': 'product_type'}, inplace=True)
# No change from previous
df['cost'] = pd.Series([2.99, 3.99])
# pop brand out
df.pop('brand')
print(df)
You can perform these steps like this (starting with your original data frame):
# add a column
df = pd.concat([df, pd.Series([2.99, 3.99], name='cost')], axis=1)
# change column name
df = df.rename(columns={'product': 'product_type'})
# remove brand
df = df.drop(columns='brand')
# change data type
df['id'] = df['id'].astype('int')
print(df)
product_type id cost
0 drink 19 2.99
1 cup 11 3.99
You could do:
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
df = (df.assign(cost=[2.99, 3.99],
id=lambda d: d.id.astype(int))
.drop(columns=['brand'])
.rename({"product": 'product_type'}, axis=1))
This should work
# change datatype
>>> df['id'] = df['id'].astype('int64')
>>> df.dtypes
brand object
id int64
product object
# rename column
df.rename(columns={'product': 'product_type'}, inplace=True)
>>> df
brand id product_type
0 spindrift 19 drink
1 None 11 cup
# create new column
df['Cost'] = pd.Series([2.99, 3.99])
>>> df
brand id product_type Cost
0 spindrift 19 drink 2.99
1 None 11 cup 3.99
# drop column
>>> df.drop(['brand'], axis=1, inplace=True)
>>> df
id product_type Cost
0 19 drink 2.99
1 11 cup 3.99
Below I have an object called westCountries, and right below it a DataFrame called countryDf.
westCountries = {'West': ['US', 'CA', 'PR']}
# countryDF
Country
0 [US]
1 [PR]
2 [CA]
3 [HK]
I am wondering how I can include the westCountries object in my dataframe as a new column called Location. I have tried merging, but that doesn't really work, because I need the value in this column to be the name of the matching key in my object, as seen below. NOTE: This output is only an example; I understand there are inconsistencies between the data I provided and my desired output.
Country Location
0 US West
1 CA West
I was thinking of doing a few things such as:
using .isin() and then doing a few more transformations/computations on that dataframe to populate mine, though this route seems a bit foggy to me (a rough sketch is shown below).
using df.loc[...] to compare my dataframe with the values in this list, and then creating my own column with the value of my choice.
converting my object into a dataframe, creating a new column in that temporary dataframe, and then merging by country so I can include the Location column in my countryDF dataframe.
However, I feel like there might be a more elegant solution than the approaches listed above, which is why I'm reaching out for help.
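A minimal sketch of the first two ideas, assuming the Country column has already been flattened to plain strings (e.g. with df.explode('Country'), as in the answers below):
df.loc[df['Country'].isin(westCountries['West']), 'Location'] = 'West'  # mark matching rows
df = df.dropna(subset=['Location'])                                     # keep only matched rows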
Use pandas.DataFrame.explode to unpack the values from the lists
Use a list comprehension to match values with the westCountries value list and return the key
For the example, the sample dataframe column values are created as strings and need to be converted to list type with ast.literal_eval
import pandas as pd
from ast import literal_eval # only for setting up the test dataframe
# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval) # only for the test data
westCountries = {'West': ['US', 'CA', 'PR']}
# remove the values from lists, with explode
df = df.explode('Country')
# create the Loc column using apply
df['Loc'] = df.Country.apply(lambda x: [k if x in v else None for k, v in westCountries.items()][0])
# drop rows with None
df = df.dropna()
# display(df)
Country Loc
0 US West
1 PR West
2 CA West
Option 2 (Better):
In the first option, for every row, .apply has to iterate through every key-value pair in westCountries using [k if x in v else None for k, v in westCountries.items()], which is slow.
It's better to reshape westCountries into a flat dict with region for the value and state as the key, using a dict comprehension.
Use pandas.Series.map to map the dict values into the new column
import pandas as pd
from ast import literal_eval # only for setting up the test dataframe
# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval) # only for the test data
# remove the values from lists, with explode
df = df.explode('Country')
# given
westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
# invert westCountries so each value becomes a key and its key becomes the value
mapped = {x: k for k, v in westCountries.items() for x in v}
# print(mapped)
{'US': 'West', 'CA': 'West', 'PR': 'West', 'NY': 'East', 'NC': 'East'}
# map the dict to the column
df['Loc'] = df.Country.map(mapped)
# dropna
df = df.dropna()
You can use pd.melt on a DataFrame built from westCountries, then use df.explode and df.merge:
westCountries = {'West': ['US', 'CA', 'PR']}
west = pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')
df.explode('Country').merge(west, on='Country')
Country Loc
0 US West
1 PR West
2 CA West
Details
pd.DataFrame(westCountries)
# West
#0 US
#1 CA
#2 PR
# Now melt the above dataframe
pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')
# Loc Country
#0 West US
#1 West CA
#2 West PR
# Now, merge `df` after exploding with `west` on `Country`
df.explode('Country').merge(west, on='Country') # how = 'left' by default in merge
# Country Loc
#0 US West
#1 PR West
#2 CA West
EDIT:
If your westCountries dict has value lists of unequal lengths, then try this:
import numpy as np
from itertools import zip_longest
westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
west = pd.DataFrame(zip_longest(*westCountries.values(),fillvalue = np.nan),
columns= westCountries.keys())
west = west.melt(var_name='Loc', value_name='Country').dropna()
df.explode('Country').merge(west, on='Country')
Example of the above:
df
Country
0 [US]
1 [PR]
2 [CA]
3 [HK]
4 [NY] #--> added `NY` from `East`.
westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
west = pd.DataFrame(zip_longest(*westCountries.values(),fillvalue = np.nan),
columns= westCountries.keys())
west = west.melt(var_name='Loc', value_name='Country').dropna()
df.explode('Country').merge(west, on='Country')
# Country Loc
#0 US West
#1 PR West
#2 CA West
#3 NY East
This is probably not the fastest approach in terms of run time, but it works:
import pandas as pd
westCountries = {'West': ['US', 'CA', 'PR']}
df = pd.DataFrame(["[US]","[PR]", "[CA]", "[HK]"], columns=["Country"])
df = df.assign(Location="")
for index, row in df.iterrows():
    if any(country in row['Country'] for country in westCountries.get('West')):
        # assign via .loc so the change is written back to the dataframe,
        # not just to the temporary row copy yielded by iterrows
        df.loc[index, 'Location'] = 'West'
west_df = df[df['Location'] != ""]
Example Code & Output:
import numpy as np
import pandas as pd
data_country1 = {'Country': [np.nan, 'India', 'Brazil'],
                 'Capital': [np.nan, 'New Delhi', 'Brasília'],
                 'Population': [np.nan, 1303171035, 207847528]}
df_country1 = pd.DataFrame(data_country1, columns=['Country', 'Capital', 'Population'])
data_country2= {'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasília'],
'Population': [102283932, 1303171035, 207847528]}
df_country2 = pd.DataFrame(data_country2, columns=['Country', 'Capital', 'Population'])
print(df_country1)
print(df_country2)
Country Capital Population
0 NaN NaN NaN
1 India New Delhi 1.303171e+09
2 Brazil Brasília 2.078475e+08
Country Capital Population
0 Belgium Brussels 102283932
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
In the first DataFrame, for every row that is comprised of ALL NaN, I want to replace the entire row with a row from another dataframe. In this example, row 0 from the second dataframe, so that the first df ends up with the same information as the second dataframe.
You can find the rows that have NaN for all elements, and replace them with the rows of the other dataframe using:
# find the indices that are all NaN
na_indices = df_country1.index[df_country1.isnull().all(axis=1)]
# replace those indices with the values of the other dataframe
df_country1.loc[na_indices,:] = df_country2.loc[na_indices,:]
This assumes that the data frames are the same shape and you want to match on the missing rows.
I would join the two dataframes:
data_complete=pd.merge(df_country1.dropna(),df_country2,on=['Country','Capital','Population'],how='outer')
You can combine them using pd.concat, drop any duplicates (rows that were in both data frames), and then remove all the indices where the values are NaN:
#combine into one data frame with unique values
df_country = pd.concat([df_country1, df_country2], ignore_index=True).drop_duplicates()
#filter out NaN rows
df_country = df_country.drop(df_country.index[df_country.isnull().all(axis=1)])
The ignore_index flag in concat gives each row a unique index, so that when you look up the indices of the all-NaN rows (which returns 0 here), you don't also delete the row from df_country2 that would otherwise share index 0.