Iterate Pandas Chunks through preprocessing/cleaning before concatenating - python

Good day Python/Pandas Gurus:
I run into memory issues while performing data analysis on my local machine. I typically work with data of shape (15,000,000+, 50+). I usually read it in chunks with chunksize=1000000 in pd.read_csv(), and that always works well for me.
I am wondering how I can run each chunk through my entire data cleaning/preprocessing section, so that I do not have to push the whole data frame through that code at once; when I do, I hit system limitations and run out of memory.
I want to read the pandas chunks, run each one through a function or a series of steps that renames columns, filters the data frame, and assigns data types. Once this preprocessing is complete for all chunks, I would like the processed chunks to be concatenated together, creating the completed data frame.
df_chunks = pd.read_csv("File.path", chunksize=10000)
processed_chunks = []
for chunk in df_chunks:
    # Task 1: Rename columns
    # Task 2: Apply filter(s)
    # Task 3: Assign data types to non-object fields
    processed_chunks.append(chunk)
processed_df = pd.concat(processed_chunks)
Here is a sample of the code that I run the entire data frame through for preprocessing, but I hit system limitations for the volume of data that I have:
billing_docs_clean.columns = ['BillingDocument', 'BillingDocumentItem', 'BillingDocumentType', 'BillingCategory', 'DocumentCategory',
'DocumentCurrency', 'SalesOrganization', 'DistributionChannel', 'PricingProcedure',
'DocumentConditionNumber', 'ShippingConditions', 'BillingDate', 'CustomerGroup', 'Incoterms',
'PostingStatus', 'PaymentTerms', 'DestinationCountry', 'Region', 'CreatedBy', 'CreationTime',
'SoldtoNumber', 'Curr1', 'Divison', 'Curr2', 'ExchangeRate', 'BilledQuantitySUn', 'SalesUnits',
'Numerator', 'Denominator', 'BilledQuantityBUn', 'BaseUnits', 'RequiredQuantity', 'BUn1', 'ExchangeRate2',
'ItemNetValue', 'Curr3', 'ReferenceDocument', 'ReferenceDocumentItem', 'ReferencyDocumentCategory',
'SalesDocument', 'SalesDocumentItem', 'Material', 'MaterialDescription', 'MaterialGroup',
'SalesDocumentItemCategory', 'SalesProductHierarchy', 'ShippingPoint', 'Plant', 'PlantRegion',
'SalesGroup', 'SalesOffice', 'Returns', 'Cost', 'Curr4', 'GrossValue', 'Curr5', 'NetValue', 'Curr6',
'CashDiscount', 'Curr7', 'FreightCharges', 'Curr8', 'Rebate', 'Curr9', 'OVCFreight', 'Curr10', 'ProfitCenter',
'CreditPrice', 'Curr11', 'SDDocumentCategory']
# Filter data to obtain US, Canada, and Mexico industrial sales for IFS Profit Center
billing_docs_clean = billing_docs_clean[
    (billing_docs_clean['DistributionChannel'] == '02') &
    (billing_docs_clean['ProfitCenter'].str.startswith('00001', na=False)) &
    (billing_docs_clean['ReferenceDocumentItem'].astype(float) < 900000) &
    (billing_docs_clean['PostingStatus'] == 'C') &
    (billing_docs_clean['PricingProcedure'] != 'ZEZEFD') &
    (billing_docs_clean['SalesDocumentItemCategory'] != 'TANN')]
# Correct Field Formats and data types
Date_Fields_billing_docs_clean = ['BillingDate']
for datefields in Date_Fields_billing_docs_clean:
    billing_docs_clean[datefields] = pd.to_datetime(billing_docs_clean[datefields])
Trim_Zeros_billing_docs_clean = ['BillingDocument', 'BillingDocumentItem', 'ProfitCenter', 'Material', 'ReferenceDocument',
'ReferenceDocumentItem', 'SalesDocument', 'SalesDocumentItem']
for TrimFields in Trim_Zeros_billing_docs_clean:
    billing_docs_clean[TrimFields] = billing_docs_clean[TrimFields].str.lstrip('0')
Numeric_Fields_billing_docs_clean = ['ExchangeRate', 'BilledQuantitySUn', 'Numerator', 'Denominator', 'BilledQuantityBUn',
'RequiredQuantity', 'ExchangeRate2', 'ItemNetValue', 'Cost', 'GrossValue', 'NetValue',
'CashDiscount', 'FreightCharges', 'Rebate', 'OVCFreight', 'CreditPrice']
for NumericFields in Numeric_Fields_billing_docs_clean:
    billing_docs_clean[NumericFields] = billing_docs_clean[NumericFields].astype('str').str.replace(',','').astype(float)
I am still relatively new to Python coding for data analytics, but eager to learn! So I appreciate any and all explanations or other recommendations for the code in this post.
Thanks!

Task 1: Rename Columns
For this you can harness pandas.read_csv's optional arguments header and names. Consider the following simple example; let file.csv content be
A,B,C
1,2,3
4,5,6
then
import pandas as pd
df = pd.read_csv("file.csv", header=0, names=["X", "Y", "Z"])
print(df)
output
X Y Z
0 1 2 3
1 4 5 6
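The same idea extends to the other tasks: each chunk yielded by read_csv is an ordinary DataFrame, so you can run your whole cleaning block on it and only keep the filtered result. Below is a minimal sketch of the full loop; the column, filter, and dtype lists are abbreviated placeholders here, so plug in the full lists from your question where the comments indicate:
import pandas as pd

new_columns = ['BillingDocument', 'BillingDocumentItem']  # extend with your full column list
date_fields = ['BillingDate']
numeric_fields = ['ItemNetValue', 'NetValue']

def clean_chunk(chunk):
    # Task 2: filter rows (add the rest of your conditions here)
    chunk = chunk[
        (chunk['DistributionChannel'] == '02') &
        (chunk['PostingStatus'] == 'C')
    ]
    # Task 3: assign data types
    for col in date_fields:
        chunk[col] = pd.to_datetime(chunk[col])
    for col in numeric_fields:
        chunk[col] = chunk[col].astype(str).str.replace(',', '').astype(float)
    return chunk

processed_chunks = []
# Task 1: rename columns at read time via header/names, as shown above
for chunk in pd.read_csv("File.path", header=0, names=new_columns, chunksize=1_000_000):
    processed_chunks.append(clean_chunk(chunk))

# Only the filtered, typed chunks are concatenated
processed_df = pd.concat(processed_chunks, ignore_index=True)
Because rows are filtered inside the loop, the concatenated result holds only the data you keep, rather than the full 15M-row frame, which is where the memory savings come from.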

Related

Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?

newbie python learner here!
I have 20 participant csv files (P01.csv to P20.csv) containing dataframes with Stroop test data. The important columns in each are the condition column (a random mix of incongruent and congruent conditions), the reaction time column, and a column indicating whether the response was correct (True or False).
Here is an example of the dataframe for P01 (I'm not sure if this counts as a code snippet):
trialnum,colourtext,colourname,condition,response,rt,correct
1,blue,red,incongruent,red,0.767041,True
2,yellow,yellow,congruent,yellow,0.647259,True
3,green,blue,incongruent,blue,0.990185,True
4,green,green,congruent,green,0.720116,True
5,yellow,yellow,congruent,yellow,0.562909,True
6,yellow,yellow,congruent,yellow,0.538918,True
7,green,yellow,incongruent,yellow,0.693017,True
8,yellow,red,incongruent,red,0.679368,True
9,yellow,blue,incongruent,blue,0.951432,True
10,blue,blue,congruent,blue,0.633367,True
11,blue,green,incongruent,green,1.289047,True
12,green,green,congruent,green,0.668142,True
13,blue,red,incongruent,red,0.647722,True
14,red,blue,incongruent,blue,0.858307,True
15,red,red,congruent,red,1.820112,True
16,blue,green,incongruent,green,1.118404,True
17,red,red,congruent,red,0.798532,True
18,red,red,congruent,red,0.470939,True
19,red,blue,incongruent,blue,1.142712,True
20,red,red,congruent,red,0.656328,True
21,red,yellow,incongruent,yellow,0.978830,True
22,green,red,incongruent,red,1.316182,True
23,yellow,yellow,congruent,green,0.964292,False
24,green,green,congruent,green,0.683949,True
25,yellow,green,incongruent,green,0.583939,True
26,green,blue,incongruent,blue,1.474140,True
27,green,blue,incongruent,blue,0.569109,True
28,green,green,congruent,blue,1.196470,False
29,red,red,congruent,red,4.027546,True
30,blue,blue,congruent,blue,0.833177,True
31,red,red,congruent,red,1.019672,True
32,green,blue,incongruent,blue,0.879507,True
33,red,red,congruent,red,0.579254,True
34,red,blue,incongruent,blue,1.070518,True
35,blue,yellow,incongruent,yellow,0.723852,True
36,yellow,green,incongruent,green,0.978838,True
37,blue,blue,congruent,blue,1.038232,True
38,yellow,green,incongruent,yellow,1.366425,False
39,green,red,incongruent,red,1.066038,True
40,blue,red,incongruent,red,0.693698,True
41,red,blue,incongruent,blue,1.751062,True
42,blue,blue,congruent,blue,0.449651,True
43,green,red,incongruent,red,1.082267,True
44,blue,blue,congruent,blue,0.551023,True
45,red,blue,incongruent,blue,1.012258,True
46,yellow,green,incongruent,yellow,0.801443,False
47,blue,blue,congruent,blue,0.664119,True
48,red,green,incongruent,yellow,0.716189,False
49,green,green,congruent,yellow,0.630552,False
50,green,yellow,incongruent,yellow,0.721917,True
51,red,red,congruent,red,1.153943,True
52,blue,red,incongruent,red,0.571019,True
53,yellow,yellow,congruent,yellow,0.651611,True
54,blue,blue,congruent,blue,1.321344,True
55,green,green,congruent,green,1.159240,True
56,blue,blue,congruent,blue,0.861646,True
57,yellow,red,incongruent,red,0.793069,True
58,yellow,yellow,congruent,yellow,0.673190,True
59,yellow,red,incongruent,red,1.049320,True
60,red,yellow,incongruent,yellow,0.773447,True
61,red,yellow,incongruent,yellow,0.693554,True
62,red,red,congruent,red,0.933901,True
63,blue,blue,congruent,blue,0.726794,True
64,green,green,congruent,green,1.046116,True
65,blue,blue,congruent,blue,0.713565,True
66,blue,blue,congruent,blue,0.494177,True
67,green,green,congruent,green,0.626399,True
68,blue,blue,congruent,blue,0.711896,True
69,blue,blue,congruent,blue,0.460420,True
70,green,green,congruent,yellow,1.711978,False
71,blue,blue,congruent,blue,0.634218,True
72,yellow,blue,incongruent,yellow,0.632482,False
73,yellow,yellow,congruent,yellow,0.653813,True
74,green,green,congruent,green,0.808987,True
75,blue,blue,congruent,blue,0.647117,True
76,green,red,incongruent,red,1.791693,True
77,red,yellow,incongruent,yellow,1.482570,True
78,red,red,congruent,red,0.693132,True
79,red,yellow,incongruent,yellow,0.815830,True
80,green,green,congruent,green,0.614441,True
81,yellow,red,incongruent,red,1.080385,True
82,red,green,incongruent,green,1.198548,True
83,blue,green,incongruent,green,0.845769,True
84,yellow,blue,incongruent,blue,1.007089,True
85,green,blue,incongruent,blue,0.488701,True
86,green,green,congruent,yellow,1.858272,False
87,yellow,yellow,congruent,yellow,0.893149,True
88,yellow,yellow,congruent,yellow,0.569597,True
89,yellow,yellow,congruent,yellow,0.483542,True
90,yellow,red,incongruent,red,1.669842,True
91,blue,green,incongruent,green,1.158416,True
92,blue,red,incongruent,red,1.853055,True
93,green,yellow,incongruent,yellow,1.023785,True
94,yellow,blue,incongruent,blue,0.955395,True
95,yellow,yellow,congruent,yellow,1.303260,True
96,blue,yellow,incongruent,yellow,0.737741,True
97,yellow,green,incongruent,green,0.730972,True
98,green,red,incongruent,red,1.564596,True
99,yellow,yellow,congruent,yellow,0.978911,True
100,blue,yellow,incongruent,yellow,0.508151,True
101,red,green,incongruent,green,1.821969,True
102,red,red,congruent,red,0.818726,True
103,yellow,yellow,congruent,yellow,1.268222,True
104,yellow,yellow,congruent,yellow,0.585495,True
105,green,green,congruent,green,0.673404,True
106,blue,yellow,incongruent,yellow,1.407036,True
107,red,red,congruent,red,0.701050,True
108,red,green,incongruent,red,0.402334,False
109,red,green,incongruent,green,1.537681,True
110,green,yellow,incongruent,yellow,0.675118,True
111,green,green,congruent,green,1.004550,True
112,yellow,blue,incongruent,blue,0.627439,True
113,yellow,yellow,congruent,yellow,1.150248,True
114,blue,yellow,incongruent,yellow,0.774452,True
115,red,red,congruent,red,0.860966,True
116,red,red,congruent,red,0.499595,True
117,green,green,congruent,green,1.059725,True
118,red,red,congruent,red,0.593180,True
119,green,yellow,incongruent,yellow,0.855915,True
120,blue,green,incongruent,green,1.335018,True
But I am only interested in the 'condition', 'rt', and 'correct' columns.
I need to create a table that says the mean reaction time for the congruent conditions, and the incongruent conditions, and the percentage correct for each condition. But I want to create an overall table of these results for each participant. I am aiming to get something like this as an output table:
Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
01           Congruent      0.560966            80
01           Incongruent    0.890556            64
02           Congruent      0.460576            89
02           Incongruent    0.956556            55
Etc. for all 20 participants. This was just an example of my ideal output because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice!
I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe? And I'm assuming I need to do it in some kind of loop that can run over each participant csv file, and then concatenates the results in a table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the participants' dataframes. I hoped this would let me do the same analysis on all of them at once, but the problem is that it doesn't identify the individual participant for each row from each participant csv file (there are 120 rows per participant, like the example above) that I had put into one table:
import os
import glob
import pandas as pd
#set working directory
os.chdir('data')
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Perhaps I could do something to add a participant column to identify each participant's data set in the concatenated table and then perform the mean and percentage correct analysis on the two conditions for each participant in that big concatenated table?
Or would it be better to do the analysis and then loop it over all of the individual participant csv files of dataframes?
I'm sorry if this is a really obvious process, I'm new to python and trying to learn to analyse my data more efficiently, have been scouring the Internet and Panda tutorials but I'm stuck. Any help is welcome! I've also never used Stackoverflow before so sorry if I haven't formatted things correctly here but thanks for the feedback about including examples of the input data, code I've tried, and desired output data, I really appreciate the help.
Try this:
import pandas as pd
from pathlib import Path

# Use the Path class to represent a path. It offers more
# functionality when performing operations on paths
path = Path("./data").resolve()

# Create a dictionary whose keys are the Participant ID
# (the `01` in `P01.csv`, etc), and whose values are
# the data frames initialized from the CSV
data = {
    p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
}

# Create a master data frame by combining the individual
# data frames from each CSV file
df = pd.concat(data, keys=data.keys(), names=["participant", None])

# Calculate the statistics
result = (
    df.groupby(["participant", "condition"]).agg(**{
        "Mean Reaction Time": ("rt", "mean"),
        "correct": ("correct", "sum"),
        "size": ("trialnum", "size")
    }).assign(**{
        "Percentage Correct": lambda x: 100 * x["correct"] / x["size"]
    }).drop(columns=["correct", "size"])
    .reset_index()
)
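Since the goal is later to plot the means from each condition across participants, one possible follow-up, assuming the result frame produced above, is to pivot it and plot:
import matplotlib.pyplot as plt

# One group of bars per participant, one bar per condition
pivoted = result.pivot(index="participant", columns="condition",
                       values="Mean Reaction Time")
pivoted.plot(kind="bar")
plt.ylabel("Mean Reaction Time (s)")
plt.tight_layout()
plt.show()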

Sorting in pandas by multiple column without distorting index

I have just started out with Pandas and I am trying to do a multilevel sort of data by columns. I have four columns in my data: STNAME, CTYNAME, CENSUS2010POP, SUMLEV. I want to set the index of my data to the columns STNAME, CTYNAME and then sort the data by CENSUS2010POP. After I set the index, the data appears as in pic 1 (before sorting by CENSUS2010POP), and after I sort, it appears as in pic 2. You can see the indices are messy and no longer in serial order.
I have read a few posts, including this one (Sorting a multi-index while respecting its index structure), which dates back five years and does not work as written. I have yet to learn the groupby function.
Could you please tell me a way I can achieve this?
ps: I come from an accounting/finance background and am very new to coding. I have just completed two Python courses, including PY4E.com.
used this below code to set the index
census_dfq6 = census_dfq6.set_index(['STNAME','CTYNAME'])
and, used the below code to sort the data:
census_dfq6 = census_dfq6.sort_values (by = ['CENSUS2010POP'], ascending = [False] )
Sample data I am working with (I would love to share the csv file but I don't see a way to do that here):
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
Required End Result:
STNAME,CTYNAME,CENSUS2010POP,SUMLEV
Alabama,Autauga County,54571,50
Alabama,Baldwin County,182265,50
Alabama,Barbour County,27457,50
Alabama,Bibb County,22915,50
Alabama,Blount County,57322,50
Alaska,Aleutians East Borough,3141,50
Alaska,Aleutians West Census Area,5561,50
Alaska,Anchorage Municipality,291826,50
Alaska,Bethel Census Area,17013,50
Wyoming,Platte County,8667,50
Wyoming,Sheridan County,29116,50
Wyoming,Sublette County,10247,50
Wyoming,Sweetwater County,43806,50
Wyoming,Teton County,21294,50
Wyoming,Uinta County,21118,50
Wyoming,Washakie County,8533,50
Wyoming,Weston County,7208,50
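If the intent is to keep each state's counties together but rank them by population within the state, one minimal sketch (assuming census_dfq6 holds the sample data above) is to sort on the relevant keys before setting the index, so the index order is never distorted afterwards:
# Within each state, order counties by population (largest first),
# then set the two-level index so it reflects that order
census_dfq6 = (census_dfq6
               .sort_values(['STNAME', 'CENSUS2010POP'], ascending=[True, False])
               .set_index(['STNAME', 'CTYNAME']))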

Pandas dataframe CSV reduce disk size

for my university assignment, I have to produce a csv file with all the distances between the airports of the world... the problem is that my csv file weighs 151 MB. I want to reduce it as much as I can. This is my csv:
and this is my code:
import numpy as np
import pandas as pd

# drop all features we don't need
for attribute in df:
    if attribute not in ('NAME', 'COUNTRY', 'IATA', 'LAT', 'LNG'):
        df = df.drop(attribute, axis=1)

# create a dictionary of airports, each airport has the following structure:
# IATA : (NAME, COUNTRY, LAT, LNG)
airport_dict = {}
for airport in df.itertuples():
    airport_dict[airport[3]] = (airport[1], airport[2], airport[4], airport[5])

# From tutorial 4 solution:
airportcodes = list(airport_dict)
airportdists = pd.DataFrame()
for i, airport_code1 in enumerate(airportcodes):
    airport1 = airport_dict[airport_code1]
    dists = []
    for j, airport_code2 in enumerate(airportcodes):
        if j > i:
            airport2 = airport_dict[airport_code2]
            dists.append(distanceBetweenAirports(airport1[2], airport1[3], airport2[2], airport2[3]))
        else:
            # little edit: no need to calculate the distance twice, all duplicates are set to 0 distance
            dists.append(0)
    airportdists[i] = dists
airportdists.columns = airportcodes
airportdists.index = airportcodes

# set all 0 distance values to NaN
airportdists = airportdists.replace(0, np.nan)
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv')
I also tried re-indexing it before saving:
# remove all NaN values
airportdists = airportdists.stack().reset_index()
airportdists.columns = ['airport1','airport2','distance']
but the result is a dataframe with 3 columns and 17 million rows, and a disk size of 419Mb... not quite an improvement...
Can you help me shrink the size of my csv? Thank you!
I have done a similar application in the past; here's what I would do:
It is difficult to shrink your file, but if your application needs, for example, the distances from one airport to all the others, I suggest you create 9541 files, each holding one airport's distances to all the others and named after that airport.
In this case loading a file is really fast.
My suggestion would be: instead of storing as a CSV, try storing in a key-value data structure like JSON, which is very fast on retrieval. Or try the Parquet file format, which consumes about a quarter of the CSV file's storage.
import pandas as pd
import numpy as np
from pathlib import Path
from string import ascii_letters
# create a dataframe
df = pd.DataFrame(np.random.randint(0, 10000, size=(1000000, 52)), columns=list(ascii_letters))
df.to_csv('csv_store.csv', index=False)
print('CSV consumed {} MB'.format(Path('csv_store.csv').stat().st_size * 0.000001))
# CSV consumed 255.22423999999998 MB
df.to_parquet('parquet_store', index=False)
print('Parquet consumed {} MB'.format(Path('parquet_store').stat().st_size * 0.000001))
# Parquet consumed 93.221154 MB
The title of the question, "..reduce disk size" is solved by outputting a compressed version of the csv.
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv', compression='zip')
Or, better, with pandas 0.24.0+, where compression is inferred from the file extension:
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv.zip')
You will find the csv is hugely compressed.
This of course does not solve for optimizing load and save time and does nothing for working memory. But hopefully useful when disk space is at a premium or cloud storage is being paid for.
The best compression would be to instead store the latitude and longitude of each airport, and then compute the distance between any two pairs on demand. Say, two 32-bit floating point values for each airport and the identifier, which would be about 110K bytes. Compressed by a factor of about 1300.
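As a rough sketch of that last suggestion (the haversine helper and the sample coordinates below are illustrative, not from the original code), you would keep only one row of coordinates per airport and compute any distance on demand:
import numpy as np
import pandas as pd

def haversine_km(lat1, lng1, lat2, lng2):
    # Great-circle distance in kilometres between two points given in degrees
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Store only the per-airport coordinates (a few hundred KB instead of 151 MB)
airports = pd.DataFrame({
    'IATA': ['LHR', 'JFK'],
    'LAT': [51.4700, 40.6413],
    'LNG': [-0.4543, -73.7781],
}).set_index('IATA')

# Compute a single distance on demand
a, b = airports.loc['LHR'], airports.loc['JFK']
print(haversine_km(a.LAT, a.LNG, b.LAT, b.LNG))  # roughly 5540 km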

Matching cells in CSV to return calculation

I am trying to create a program that will take the most recent 30 CSV files of data within a folder and calculate totals of certain columns. There are 4 columns of data, with the first column being the identifier and the rest being the data related to the identifier. Here's an example:
file1
Asset X Y Z
12345 250 100 150
23456 225 150 200
34567 300 175 225
file2
Asset X Y Z
12345 270 130 100
23456 235 190 270
34567 390 115 265
I want to be able to match the asset# in both CSVs to return each columns value and then perform calculations on each column. Once I have completed those calculations I intend on graphing various data as well. So far the only thing I have been able to complete is extracting ALL the data from the CSV file using the following code:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv')
listData = []
for files in csvfile:
    df = pd.read_csv(files, index_col=0)
    listData.append(df)
concatenated_data = pd.concat(listData, sort=False)
group = concatenated_data.groupby('ASSET')['Slip Expense ($)', 'Net Win ($)'].sum()
group.to_csv("C:\\Users\\tdjones\\Desktop\\Python Work Files\\Test\\NewFDRConcat.csv", header=('Slip Expense', 'Net WIn'))
I am very new to Python so any and all direction is welcome. Thank you!
I'd probably also set the asset number as the index while you're reading the data, since this can help with sifting through data. So
rd = pd.read_csv(files, index_col=0)
Then you can do as Alex Yu suggested and just pick all the data from a specific asset number out when you're done using
asset_data = rd.loc[asset_number, column_name]
You'll generally need to format the data in the DataFrame before you append it to the list if you only want specific inputs. Exactly how to do that naturally depends specifically on what you want i.e. what kind of calculations you perform.
If you want a function that just returns all the data for one specific asset, you could do something along the lines of
def get_asset(asset_number):
    csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
    asset_data = []
    for file in csvfile:
        data = [line for line in open(file, 'r').read().splitlines()
                if line.split(',')[0] == str(asset_number)]
        for line in data:
            asset_data.append(line.split(','))
    return pd.DataFrame(asset_data, columns=['Asset', 'X', 'Y', 'Z'], dtype=float)
How well the above performs is going to depend on how large the dataset you're going through is. The method needs to search through every line and perform several high-level operations on each one, so it could be problematic if you have millions of lines of data in each file.
Also, the above assumes that all data elements are strings of numbers (so they can be cast to integers or floats). If that's not the case, leave the dtype argument out of the DataFrame definition, but keep in mind that everything returned will then be stored as a string.
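For example (hypothetical asset number, matching the sample data above):
asset_df = get_asset(12345)
print(asset_df[['X', 'Y', 'Z']].sum())  # column totals for that asset across all files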
I suppose that you need to add pandas.concat of your listData to your code.
So it will become:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
listData = []
for files in csvfile:
    rd = pd.read_csv(files)
    listData.append(rd)
concatenated_data = pd.concat(listData)
After that you can use aggregate functions with this concatenated_data DataFrame, such as concatenated_data['A'].max(), concatenated_data['A'].count(), groupby, etc.
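To cover the "most recent 30 CSV files" part of the question, a minimal sketch (paths and column names are taken from the question's own snippet; "recent" is assumed to mean file modification time) could look like this:
import glob
import os
import pandas as pd

# Pick the 30 most recently modified FDR CSVs
files = sorted(glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv'),
               key=os.path.getmtime, reverse=True)[:30]

# Read, concatenate, and total the columns of interest per asset
frames = [pd.read_csv(f) for f in files]
concatenated_data = pd.concat(frames, sort=False)
totals = concatenated_data.groupby('ASSET')[['Slip Expense ($)', 'Net Win ($)']].sum()
print(totals.head())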

How to read through large csv or database and join columns when memory is an issue?

I have a large dataset that I pulled from Data.Medicare.gov (https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6)
It's a csv of all physicians (2.4 million rows by 41 columns, 750MB); let's call this physician_df. However, I cannot load it into memory on my computer (memory error).
I have another df loaded in memory (summary_df) and I want to join columns (NPI, Last Name, First Name) from physician_df.
Is there any way to do this without having to load the data to memory? I first attempted by using their API but I get capped out (I have about 500k rows in my final df and this will always be changing). Would storing the physician_df into a SQL database make this easier?
Here are snippets of each df (fyi, the summary_df is all fake information).
summary_df
DOS Readmit SurgeonNPI
1-1-2018 1 1184809691
2-2-2018 0 1184809691
2-5-2017 1 1093707960
physician_df
NPI PAC ID Professional Enrollment LastName FirstName
1184809691 2668563156 I20120119000086 GOLDMAN SALUJA
1184809691 4688750714 I20080416000055 NOLTE KIMBERLY
1093707960 7618879354 I20040127000771 KHANDUJA KARAMJIT
Final df:
DOS Readmit SurgeonNPI LastName FirstName
1-1-2018 1 1184809691 GOLDMAN SALUJA
2-2-2018 0 1184809691 GOLDMAN SALUJA
2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If I could load the physician_df then I would use the below code..
pandas.merge(summary_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
For your desired output, you only need 3 columns from physician_df. It is more likely 2.4mio rows of 3 columns can fit in memory versus 5 (or, of course, all 41 columns).
So I would first try extracting what you need from a 3-column dataset, convert to a dictionary, then use it to map required columns.
Note, to produce your desired output, it is necessary to drop duplicates (keeping first) from physicians_df, so I have included this logic.
from operator import itemgetter as iget

d = pd.read_csv('physicians.csv', usecols=['NPI', 'LastName', 'FirstName'])\
      .drop_duplicates('NPI')\
      .set_index('NPI')[['LastName', 'FirstName']]\
      .to_dict(orient='index')

# {1093707960: {'FirstName': 'KARAMJIT', 'LastName': 'KHANDUJA'},
#  1184809691: {'FirstName': 'SALUJA', 'LastName': 'GOLDMAN'}}

df_summary['LastName'] = df_summary['SurgeonNPI'].map(d).map(iget('LastName'))
df_summary['FirstName'] = df_summary['SurgeonNPI'].map(d).map(iget('FirstName'))
# DOS Readmit SurgeonNPI LastName FirstName
# 0 1-1-2018 1 1184809691 GOLDMAN SALUJA
# 1 2-2-2018 0 1184809691 GOLDMAN SALUJA
# 2 2-5-2017 1 1093707960 KHANDUJA KARAMJIT
If your final dataframe is too large to store in memory, then I would consider these options:
Chunking: split your dataframe into small chunks and output as you go along (see the sketch after this list).
PyTables: based on numpy + HDF5.
dask.dataframe: based on pandas and uses out-of-core processing.
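A minimal sketch of the chunking option (the file name and chunk size are illustrative; NPI, LastName, FirstName and summary_df come from the question):
import pandas as pd

pieces = []
# Read only the three needed columns, 500k rows at a time
for chunk in pd.read_csv('physicians.csv',
                         usecols=['NPI', 'LastName', 'FirstName'],
                         chunksize=500000):
    # Keep only physicians that actually appear in summary_df
    pieces.append(chunk[chunk['NPI'].isin(summary_df['SurgeonNPI'])])

physician_small = pd.concat(pieces).drop_duplicates('NPI')
final_df = summary_df.merge(physician_small, how='left',
                            left_on='SurgeonNPI', right_on='NPI')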
I would try to import the data into a database and do the joins there (e.g. Postgres if you want a relational DB – there are pretty nice ORMs for it, like peewee). Maybe you can then use SQL operations to get the subset of the data you are most interested in, export it, and process it using Pandas. Also, take a look at Ibis for working with databases directly – another project that Wes McKinney, the author of Pandas, worked on.
It would be great to use Pandas with an on-disk storage system, but as far as I know that's not an entirely solved problem yet. There's PyTables (a bit more on using PyTables with Pandas here), but it doesn't support joins in the same SQL-like way that Pandas does.
Sampling!
import pandas as pd
import random

n = int(2.4E6)          # approximate number of rows in the file
n_sample = int(2.4E5)   # number of rows to keep
filename = "https://data.medicare.gov/Physician-Compare/Physician-Compare-National-Downloadable-File/mj5m-pzi6"
# randomly choose which rows to skip (keep row 0, the header)
skip = sorted(random.sample(range(1, n + 1), n - n_sample))
physician_df = pd.read_csv(filename, skiprows=skip)
Then this should work fine
summary_sample_df = summary_df[summary_df.SurgeonNPI.isin(physician_df.NPI)]
merge_sample_df = pd.merge(summary_sample_df, physician_df, how='left', left_on=['SurgeonNPI'], right_on=['NPI'])
Pickle your merge_sample_df. Sample again. Wash, rinse, repeat to desired confidence.
