I'm trying to compare two dataframes. If the postcodes from both dataframes match, then the corresponding ResidenceID (from dataframe 2) should be put into dataframe 1. I put this code together and it runs and gives me what I need, but I need to simplify it so that it runs faster. My dataset is 400k rows long and has 2 columns. Currently the code takes 4.5 minutes to run. How can I make it run faster? Any help much appreciated :)
import time
start = time.time()
import pandas as pd

df1 = pd.read_excel('Residence_CCG_Check_M9.xlsx', sheet_name='Sheet1', dtype={'Postcode': str})
df2 = pd.read_excel('CCG_Codes.xlsx', sheet_name='East of England', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df3 = pd.read_excel('CCG_Codes.xlsx', sheet_name='London Commissioning Region', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df4 = pd.read_excel('CCG_Codes.xlsx', sheet_name='Midlands Commissioning Region', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df5 = pd.read_excel('CCG_Codes.xlsx', sheet_name='North East and Yorkshire Commis', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df6 = pd.read_excel('CCG_Codes.xlsx', sheet_name='North West Commissioning Region', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df7 = pd.read_excel('CCG_Codes.xlsx', sheet_name='South East Commissioning Region', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df8 = pd.read_excel('CCG_Codes.xlsx', sheet_name='South West Commissioning Region', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df9 = pd.read_excel('CCG_Codes.xlsx', sheet_name='Channel Islands and Isle of Man', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
df10 = pd.read_excel('CCG_Codes.xlsx', sheet_name='Wales, Scotland and Northern Ir', usecols=['Postcode', 'CCG'], dtype={'Postcode': str})

dfAll = pd.concat([df2, df3, df4, df5, df6, df7, df8, df9, df10])
merged_df = pd.merge(df1, dfAll, on='Postcode', how='left')
merged_df = merged_df.rename(columns={'CCG_y': 'CCG'})
merged_df.to_excel(excel_writer="ResidenceCheck.xlsx", index=False, header=True)

end = time.time()
print(end - start)
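One likely speedup, offered as a hedged sketch: most of the 4.5 minutes is probably spent opening CCG_Codes.xlsx ten separate times. pd.read_excel accepts a list of sheet names and reads the workbook in a single pass, returning a dict of DataFrames, assuming every listed sheet has Postcode and CCG columns:

import pandas as pd

sheet_names = ['East of England', 'London Commissioning Region',
               'Midlands Commissioning Region', 'North East and Yorkshire Commis',
               'North West Commissioning Region', 'South East Commissioning Region',
               'South West Commissioning Region', 'Channel Islands and Isle of Man',
               'Wales, Scotland and Northern Ir']

# one pass over the workbook; returns {sheet_name: DataFrame}
sheets = pd.read_excel('CCG_Codes.xlsx', sheet_name=sheet_names,
                       usecols=['Postcode', 'CCG'], dtype={'Postcode': str})
dfAll = pd.concat(sheets.values(), ignore_index=True)

df1 = pd.read_excel('Residence_CCG_Check_M9.xlsx', sheet_name='Sheet1', dtype={'Postcode': str})
merged_df = df1.merge(dfAll, on='Postcode', how='left')

Writing the result with merged_df.to_csv instead of to_excel is also usually much faster, if the output format is negotiable.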
I have this dataframe
d = {'parameters': [{'Year': '2018',
'Median Age': 'nan',
'Total Non-Family Household Income': 289.0,
'Total Household Income': 719.0,
'Gini Index of Income Inequality': 0.4121}]}
df_sample = pd.DataFrame(data=d)
df_sample.head()
I want to convert that JSON into pandas columns. How do I do this? Assume I only have the dataframe, not the dict d.
I saw this example:
# which columns have json
json_cols = ['device', 'geoNetwork', 'totals', 'trafficSource']
for column in json_cols:
    c_load = test[column].apply(json.loads)
    c_list = list(c_load)
    c_dat = json.dumps(c_list)
    test = test.join(pd.read_json(c_dat))
    test = test.drop(column, axis=1)
But this does not seem too pythonic...
Use json_normalize:
df_sample = pd.json_normalize(data=d, record_path=['parameters'])
Resulting dataframe:
   Year Median Age  Total Non-Family Household Income  Total Household Income  Gini Index of Income Inequality
0  2018        nan                              289.0                   719.0                           0.4121
Update:
If you already have the dataframe loaded, then applying pd.Series should work:
df_sample = df_sample['parameters'].apply(pd.Series)
# or df_sample['parameters'].map(json.loads).apply(pd.Series) if the values are strings rather than dicts
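For completeness, a hedged alternative once the column already holds dicts: pd.json_normalize also accepts the column directly, which is typically faster than apply(pd.Series):

df_sample = pd.json_normalize(df_sample['parameters'])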
My goal below is to create 1 single column of all the individual words of each string in the 'Name' column.
Although I am achieving this, I am losing the column header on df = df['Name'].str.split(' ', expand=True). I would like to preserve the header if possible so that I can refer to it later in the script.
I am also ending up with a multi-level index, which is fine, but if there is a way to avoid it, that would be great.
Any help is appreciated greatly. Thank you
import pandas as pd
data = {'Name':['Tom Wilson', 'nick snyder', 'krish moham', 'jack oconnell']}
df = pd.DataFrame(data)
df = df['Name'].str.split(' ', expand=True)
df = df.stack(dropna=True)
print(df)
Try this:
data = {'Name': ['Tom Wilson', 'nick snyder', 'krish moham', 'jack oconnell']}
df = pd.DataFrame(data)
df = df['Name'].str.split(' ').explode().to_frame()
print(df)
Prints:
Name
0 Tom
0 Wilson
1 nick
1 snyder
2 krish
2 moham
3 jack
3 oconnell
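If the duplicate index values from the question are also unwanted, a hedged variant: chain reset_index(drop=True) to get a fresh 0..n-1 index:

df = df['Name'].str.split(' ').explode().to_frame().reset_index(drop=True)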
I am trying to merge two dataframes and I'm struggling to get this setup right. I Googled for a solution before posting here, but I'm still stuck. This is what I'm working with.
import pandas as pd
# Initialise data of lists
data1 = [{'ID': 577878, 'Year':2020, 'Type': 'IB', 'Expense':6500},
{'ID': 577878, 'Year':2019, 'Type': 'IB', 'Expense':16500}]
df1 = pd.DataFrame(data1)
df1
data2 = [{'ID': 577878, 'Year':2020, 'Type': 'IB', 'Expense':23000}]
df2 = pd.DataFrame(data2)
df2
df_final = pd.merge(df1,
df2,
left_on=['ID'],
right_on=['ID'],
how='inner')
df_final
This makes sense, but I don't want the 23000 duplicated onto both years.
If I do the merge like this:
df_final = pd.merge(df1,
df2,
left_on=['ID','Year'],
right_on=['ID','Year'],
how='inner')
df_final
This also makes sense, but now the 16500 is dropped off because there is no 2019 in df2.
How can I keep both records, but not duplicate the 23000?
My interpretation is that you just don't want to see 2 entries of 23000 for both 2019 and 2020. It should be for 2020 only.
You can use outer merge (with parameter how='outer') on 2 columns ID and Year, as follows:
df_final = pd.merge(df1,
df2,
on=['ID','Year'],
how='outer')
Result:
print(df_final)
ID Year Type_x Expense_x Type_y Expense_y
0 577878 2020 IB 6500 IB 23000.0
1 577878 2019 IB 16500 NaN NaN
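If a single Type column is then preferred, one hedged tidy-up (assuming Type agrees wherever both frames supplied the row):

df_final = df_final.drop(columns='Type_y').rename(columns={'Type_x': 'Type'})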
Try filtering the columns of df2 so that its Expense column is not merged in:
df1.merge(df2[['ID', 'Year', 'Type']], on=['ID'])
Output:
ID Year_x Type_x Expense Year_y Type_y
0 577878 2020 IB 6500 2020 IB
1 577878 2019 IB 16500 2020 IB
I have master data to which I wish to add columns via VLOOKUP-style lookups.
Here's what I am aiming to achieve:
Master Data:
Ctry Product
CN BTL
VN HP
Ref table 1:
Ctry Country
AU Australia
CN China
VN Vietnam
Ref table 2:
ProductID Product
BTL Bottles
HP Handphone
PRN Printer
How do I combine all of this into the Master Data as below?
Expected Output:
Ctry Product Country Product
CN BTL China Bottles
VN HP Vietnam Handphone
My code below only references one table and I'm stuck. How do I go about adding the additional columns to the existing Master Data sheet?
import pandas as pd
# IMPORT DATA
df1 = pd.read_excel("Masterdata.xlsx")
df2 = pd.read_excel("Ref_table_1.xlsx")
Left_join = pd.merge(df1,df2, on = 'Ctry', how ='left')
Left_join.to_excel("Output.xlsx", index = False)
You can use this and then drop whichever columns are not required:
import pandas as pd

df1 = pd.DataFrame({'Ctry': ['CN', 'VN'], 'Product': ['BTL', 'HP']})
df2 = pd.DataFrame({'Ctry': ['AU', 'CN', 'VN'], 'Country': ['Australia', 'China', 'Vietnam']})
df3 = pd.DataFrame({'ProductID': ['BTL', 'HP', 'PRN'], 'Product': ['Bottles', 'Handphone', 'Printer']})

# first lookup: country names join on the shared 'Ctry' column
m1 = df1.merge(df2, how='left')
# second lookup: the key columns have different names on each side
m2 = m1.merge(df3, how='left', left_on='Product', right_on='ProductID')
print(m2)
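m2 still carries the helper ProductID column and the suffixed Product_x/Product_y labels; a hedged clean-up to approximate the expected output (ProductName is an arbitrary name chosen here to avoid two columns both called Product):

m2 = m2.drop(columns='ProductID').rename(columns={'Product_x': 'Product', 'Product_y': 'ProductName'})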
Below I have an object called westCountries, and right below it a dataframe called countryDF.
westCountries = {'West': ['US', 'CA', 'PR']}
# countryDF
Country
0 [US]
1 [PR]
2 [CA]
3 [HK]
I am wondering how I can include the westCountries object in my dataframe as a new column called Location. I have tried merging, but that doesn't really do anything, because I need the value in this column to be the name of the key in my object, as seen below. NOTE: this output is only an example; I understand there are missing correlations between the data I provided and my desired output.
Country Location
0 US West
1 CA West
I was thinking of doing a few things such as:
using .isin() and then applying a few more transformations/computations to that dataframe to populate mine, but this route seems a bit foggy to me.
using df.loc[...] to compare my dataframe with the values in this list and then I can create my own column with the value of my choice.
converting my object into a dataframe, and then creating a new column in this temporary dataframe and then merging by country so I can include the locations column into my countryDF dataframe.
However, I feel like there might be a more sophisticated solution than all these approaches I listed above. Which is why I'm reaching out for help.
Use pandas.DataFrame.explode to unpack the single-element lists
Use a list comprehension to match values against the westCountries value list and return the key
For the example, the sample dataframe column values are created as strings and need to be converted to list type with ast.literal_eval
import pandas as pd
from ast import literal_eval # only for setting up the test dataframe
# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval) # only for the test data
westCountries = {'West': ['US', 'CA', 'PR']}
# remove the values from lists, with explode
df = df.explode('Country')
# create the Loc column using apply
df['Loc'] = df.Country.apply(lambda x: [k if x in v else None for k, v in westCountries.items()][0])
# drop rows with None
df = df.dropna()
# display(df)
Country Loc
0 US West
1 PR West
2 CA West
Option 2 (Better):
In the first option, for every row, .apply has to iterate through every key-value pair in westCountries using [k if x in v else None for k, v in westCountries.items()], which is slow.
It's better to reshape westCountries into a flat dict with the region as the value and the country code as the key, using a dict comprehension.
Use pandas.Series.map to map the dict values into the new column
import pandas as pd
from ast import literal_eval # only for setting up the test dataframe
# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval) # only for the test data
# remove the values from lists, with explode
df = df.explode('Country')
# given
westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
# unpack westCountries where all values are keys and key are values
mapped = {x: k for k, v in westCountries.items() for x in v}
# print(mapped)
{'US': 'West', 'CA': 'West', 'PR': 'West', 'NY': 'East', 'NC': 'East'}
# map the dict to the column
df['Loc'] = df.Country.map(mapped)
# dropna
df = df.dropna()
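If the non-matching rows should be kept instead of dropped, a hedged variant: fill the unmatched entries rather than calling dropna ('Other' is an arbitrary placeholder here):

df['Loc'] = df.Country.map(mapped).fillna('Other')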
You can use pd.melt to reshape westCountries into a dataframe, then combine df.explode and df.merge:
westCountries = {'West': ['US', 'CA', 'PR']}
west = pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')
df.explode('Country').merge(west, on='Country')
Country Loc
0 US West
1 PR West
2 CA West
Details
pd.DataFrame(westCountries)
# West
#0 US
#1 CA
#2 PR
# Now melt the above dataframe
pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')
# Loc Country
#0 West US
#1 West CA
#2 West PR
# Now, merge `df` after exploding with `west` on `Country`
df.explode('Country').merge(west, on='Country') # how = 'left' by default in merge
# Country Loc
#0 US West
#1 PR West
#2 CA West
EDIT:
If the lists in the westCountries dict have unequal lengths, then try this:
import numpy as np
from itertools import zip_longest

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
# pad the shorter lists with NaN so the dict becomes rectangular
west = pd.DataFrame(zip_longest(*westCountries.values(), fillvalue=np.nan),
                    columns=westCountries.keys())
west = west.melt(var_name='Loc', value_name='Country').dropna()
df.explode('Country').merge(west, on='Country')
Example of the above:
df
Country
0 [US]
1 [PR]
2 [CA]
3 [HK]
4 [NY] #--> added `NY` from `East`.
Running the same code as above on this df yields:
# Country Loc
#0 US West
#1 PR West
#2 CA West
#3 NY East
This is probably not the fastest approach in terms of run time, but it works:
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR']}
df = pd.DataFrame(["[US]", "[PR]", "[CA]", "[HK]"], columns=["Country"])
df = df.assign(Location="")

for index, row in df.iterrows():
    # each Country value here is a plain string like "[US]", so this is a substring check
    if any(country in row['Country'] for country in westCountries['West']):
        # write back through df.loc; assigning to `row` would only change a copy
        df.loc[index, 'Location'] = 'West'

west_df = df[df['Location'] != ""]
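A hedged vectorized alternative, assuming the Country values really are strings like "[US]": strip the brackets and test membership with isin, avoiding the row loop entirely:

# strip the literal brackets, then one vectorized membership test
in_west = df['Country'].str.strip('[]').isin(westCountries['West'])
df['Location'] = in_west.map({True: 'West', False: ''})
west_df = df[in_west]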