I have two dataframes that I need to combine based on a key (an 'incident number'). The key, however, is repeated, as the database they will be ingested by requires a particular format for coordinates. How can I join the necessary columns based on a combination of keys?
For example, the two tables look like:
Incident_Number  Lat/Long  GPSCoordinates
AB123            Lat       32.123
AB123            Long      120.123
CD321            Lat       31.321
CD321            Long      121.321
and...
Incident_Number  Lat/Long  GeoCodeCoordinates
AB123            Lat       35.123
AB123            Long      125.123
CD321            Lat       36.321
CD321            Long      126.321
And I need to get to...
IncidentNumber  Lat/Long  GPSCoordinates  GeoCodeCoordinates
AB123           Lat       32.123          35.123
AB123           Long      120.123         125.123
CD321           Lat       31.321          36.321
CD321           Long      121.321         126.321
The numbers of records in the two tables are not exactly equal, so the join needs to allow for NaNs. I am essentially trying to add the column 'GeoCodeCoordinates' to the other dataframe on a combination of 'Incident Number' and 'Lat/Long', so it treats 'AB123 + Lat' and 'AB123 + Long' as single keys. Can this be specified within code, or do I need to create a new column that concatenates those two values to use as a key?
I imagine I went about this in a bit of a goofy way. The Lat and Long were originally stored in separate fields and I used .melt() to make the data longer. The database that will ultimately take this in requires the longer format for the Lat/Long field.
GPSColList = list(GPSRecords.columns)
GPSColList.remove('Latitude')
GPSColList.remove('Longitude')
GPSMelt = GPSRecords.melt(id_vars=GPSColList, value_vars=['Latitude', 'Longitude'], var_name='Lat/Long', value_name='GPSCoordinates')
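As a self-check, the melt step run on a hypothetical two-incident wide frame (column names as above, data invented for illustration) reproduces the long shape:

```python
import pandas as pd

# hypothetical wide records: one row per incident, Lat/Long in separate fields
GPSRecords = pd.DataFrame({'Incident_Number': ['AB123', 'CD321'],
                           'Latitude': [32.123, 31.321],
                           'Longitude': [120.123, 121.321]})

# everything except the coordinate fields becomes an id column
GPSColList = [c for c in GPSRecords.columns if c not in ('Latitude', 'Longitude')]
GPSMelt = GPSRecords.melt(id_vars=GPSColList,
                          value_vars=['Latitude', 'Longitude'],
                          var_name='Lat/Long', value_name='GPSCoordinates')
print(GPSMelt)  # one row per incident per coordinate
```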
As the two sets of coordinates were in separate fields I created two dataframes with each set of coordinates and melted them separately. My attempt to merge them looks like:
mergeMelt = pd.merge(GPSMelt, GeoCodeMelt[["GeoCodeCoordinates"]], on=['Incident_Number', 'Lat/Long'])
Result is KeyError: 'Incident_Number'
Adding samples as requested:
geocodeMelt:
print(geocodeMelt.head(10).to_dict())
{'OID_': {0: 5211, 1: 5212, 2: 5213, 3: 5214, 4: 5215, 5: 5216, 6: 5217, 7: 5218, 8: 5219, 9: 5220}, 'Unit_Level': {0: 'RRU (Riverside
Unit)', 1: 'RRU (Riverside Unit)', 2: 'RRU (Riverside Unit)', 3: 'RRU (Riverside Unit)', 4: 'RRU (Riverside Unit)', 5: 'RRU (Riverside
Unit)', 6: 'RRU (Riverside Unit)', 7: 'RRU (Riverside Unit)', 8: 'RRU (Riverside Unit)', 9: 'RRU (Riverside Unit)'}, 'Agency_FDID': {0: 33090, 1: 33051, 2: 33054, 3: 33054, 4: 33090, 5: 33070, 6: 33030, 7: 33054, 8: 33090, 9: 33052}, 'Incident_Number': {0: '21CARRU0000198', 1: '21CARRU0000564', 2: '21CARRU0000523', 3: '21CARRU0000624', 4: '21CARRU0000436', 5: '21CARRU0000439', 6: '21CARRU0000496', 7: '21CARRU0000422', 8: '21CARRU0000466', 9: '21CARRU0000016'}, 'Exposure': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'CAD_Incident_Type': {0: '71', 1: '67B01O', 2: '71C01', 3: '69D03', 4: '67', 5: '67', 6: '71', 7: '69D06', 8: '71C01', 9: '82B01'}, 'CALFIRS_Incident_Type': {0: 'Passenger vehicle fire', 1: 'Outside rubbish, trash or waste fire', 2: 'Passenger vehicle fire', 3: 'Building fire', 4: 'Outside rubbish, trash or waste fire', 5: 'Outside rubbish, trash or waste fire', 6: 'Passenger vehicle fire', 7: 'Dumpster or other outside trash receptacle fire', 8: 'Passenger vehicle fire', 9: 'Brush or brush-and-grass mixture fire'}, 'Incident_Date': {0: '1/1/2021 0:00:00', 1: '1/1/2021 0:00:00', 2: '1/1/2021 0:00:00', 3: '1/1/2021 0:00:00', 4: '1/1/2021 0:00:00', 5: '1/1/2021 0:00:00', 6: '1/1/2021 0:00:00', 7: '1/1/2021 0:00:00', 8: '1/1/2021 0:00:00', 9: '1/1/2021 0:00:00'}, 'Report_Date_Time': {0: nan, 1: '1/1/2021 20:34:00', 2: '1/1/2021 19:07:00', 3: '1/1/2021 23:33:00', 4: nan, 5: '1/1/2021 16:56:00', 6: '1/1/2021 18:28:00', 7: '1/1/2021 16:16:00', 8: '1/1/2021 17:40:00', 9: '1/1/2021 0:15:00'}, 'Day': {0: '06 - Friday', 1: '06 - Friday', 2: '06 - Friday', 3: '06 - Friday', 4: '06 - Friday', 5: '06 - Friday', 6: '06 - Friday', 7: '06 - Friday', 8: '06 - Friday', 9: '06 - Friday'}, 'Incident_Name': {0: 'HY 91 W/ SERFAS CLUB DR', 1: 'QUAIL PL MENI', 2: 'CAR', 3: 'SUNNY', 4: 'MARTINEZ RD SANJ', 5: 'W METZ RD / ALTURA DR', 6: 'PALM DR / BUENA VISTA AV', 7: 'DELL', 8: 'HY 74 E HEM', 9: 'MADISON ST / AVE 60'}, 'Address': {0: 'HY 91 W Corona CA 92880', 
1: '23880 KENNEDY LN Menifee CA 92587', 2: 'THEODORE ST/EUCALYPTUS AV Moreno Valley CA 92555', 3: '24490 SUNNYMEAD Moreno Valley CA 92553', 4: '40300 MARTINEZ San Jacinto CA 92583', 5: '1388 West METZ Perris CA 92570', 6: 'PALM DR/BUENA VISTA AV Desert hot springs CA 92240', 7: '25361 DELPHINIUM Moreno Valley CA 92553', 8: '43763 HY 74 East Hemet CA 92544', 9: 'MADISON ST/AVE 60 La Quinta CA 92253'}, 'Acres_Burned': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 0.01}, 'Wildland_Fire_Cause': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 'UU - Undetermined'}, 'Latitude_D': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7:
nan, 8: nan, 9: nan}, 'Longitude_D': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'Member_Making_Report': {0: 'Muhammad Nassar', 1: 'TODD PHILLIPS', 2: 'DAVID COLOMBO', 3: 'GREGORY MOWAT', 4: 'MICHAEL ESPARZA', 5: 'Benjamin Hall', 6: 'TIMOTHY CABRAL', 7: 'JORGE LOMELI', 8: 'JOSHUA BALBOA', 9: 'SETH SHIVELY'}, 'Battalion': {0: 4.0, 1: 13.0, 2: 9.0, 3: 9.0, 4: 5.0, 5: 1.0, 6: 10.0, 7: 9.0, 8: 5.0, 9: 6.0}, 'Incident_Status': {0: 'Submitted', 1: 'Submitted', 2: 'Submitted', 3: 'Submitted', 4: 'Submitted', 5: 'Submitted', 6: 'Submitted', 7: 'Submitted', 8: 'Submitted', 9: 'Submitted'}, 'DDLat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'DDLon': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'DiscrepancyDistanceFeet': {0: 4178.0, 1: 107.0, 2: 2388.0, 3: 233159.0, 4: 102.0, 5: 1768.0, 6: 1094.0, 7: 78.0, 8: 35603721.0, 9: 149143.0}, 'DiscrepancyDistanceMiles': {0: 1.0, 1: 0.0, 2: 0.0, 3: 44.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 6743.0, 9: 28.0}, 'DiscrepancyGreaterThan1000ft': {0: 1.0, 1: 2.0, 2: 1.0, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: 2.0, 8: 1.0, 9: 1.0}, 'LocationLegitimate': {0: nan, 1: 1.0, 2: nan, 3: nan, 4: 1.0, 5: nan, 6: nan, 7: 1.0, 8: nan, 9: nan}, 'LocationErrorCategory': {0: nan, 1: 7.0, 2: nan, 3: nan, 4: 7.0,
5: nan, 6: nan, 7: 7.0, 8: nan, 9: nan}, 'LocationErrorComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'LocationErrorResolution': {0: nan, 1: 6.0, 2: nan, 3: nan, 4: 6.0, 5: nan, 6: nan, 7: 6.0, 8: nan, 9: nan}, 'LocationErrorResolutionComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'CADLatitudeDDM': {0: '33 53.0746416', 1: '33 42.3811205', 2: '33 55.9728055', 3: '33 56.3706594', 4: '33 47.9788195', 5: '33 47.6486387', 6: '33 57.5747994', 7: '33 54.3721212', 8: '33 44.8499992', 9: '33 38.1589793'}, 'CADLongitudeDDM': {0: '-117 38.2368024', 1: '-117 14.5374611', 2: '-117 07.9119009', 3: '-117 14.1319211', 4: '-116 57.4446600', 5: '-117 15.4013420', 6: '-116 30.2784078', 7: '-117 13.2052213', 8: '-116 53.8524596',
9: '-116 15.0473995'}, 'GeocodeSymbology': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2}, 'Lat/Long': {0: 'Latitude', 1: 'Latitude', 2: 'Latitude', 3: 'Latitude', 4: 'Latitude', 5: 'Latitude', 6: 'Latitude', 7: 'Latitude', 8: 'Latitude', 9: 'Latitude'}, 'CAD_Coords': {0: '33 52.924', 1: '33 42.364', 2: '33 56.100', 3: '33 93.991', 4: '33 47.9629', 5: '33 47.390', 6: '33 57.573', 7: '33 54.385', 8: '33 44.859', 9: '33 61.269'}}
and GPSMelt:
print(GPSMelt.head(10).to_dict())
{'OID_': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}, 'Unit_Level': {0: 'RRU (Riverside Unit)', 1: 'RRU (Riverside Unit)', 2: 'RRU (Riverside Unit)', 3: 'RRU (Riverside Unit)', 4: 'RRU (Riverside Unit)', 5: 'RRU (Riverside Unit)', 6: 'RRU (Riverside Unit)', 7: 'RRU (Riverside Unit)', 8: 'RRU (Riverside Unit)', 9: 'RRU (Riverside Unit)'}, 'Agency_FDID': {0: 33090, 1: 33054, 2: 33030, 3: 33051, 4: 33054, 5: 33090, 6: 33070, 7: 33054, 8: 33090, 9: 33035}, 'Incident_Number': {0: '21CARRU0000198', 1: '21CARRU0000523', 2: '21CARRU0000496', 3: '21CARRU0000564', 4: '21CARRU0000624', 5: '21CARRU0000436', 6: '21CARRU0000439', 7: '21CARRU0000422', 8: '21CARRU0000466', 9: '21CARRU0000007'}, 'Exposure': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'CAD_Incident_Type': {0: '71', 1: '71C01', 2: '71', 3: '67B01O', 4: '69D03', 5: '67', 6: '67', 7: '69D06', 8: '71C01', 9: '82C03'}, 'CALFIRS_Incident_Type': {0: 'Passenger vehicle fire', 1: 'Passenger vehicle fire', 2: 'Passenger vehicle fire', 3: 'Outside rubbish, trash or waste fire', 4: 'Building fire', 5: 'Outside rubbish, trash or waste fire', 6: 'Outside rubbish, trash or waste fire', 7: 'Dumpster or other outside trash receptacle fire', 8: 'Passenger vehicle fire', 9: 'Brush or brush-and-grass mixture fire'}, 'Incident_Date': {0: '1/1/2021 0:00:00', 1: '1/1/2021 0:00:00', 2: '1/1/2021 0:00:00', 3: '1/1/2021 0:00:00', 4: '1/1/2021 0:00:00', 5: '1/1/2021 0:00:00', 6: '1/1/2021 0:00:00', 7: '1/1/2021 0:00:00', 8: '1/1/2021 0:00:00', 9: '1/1/2021 0:00:00'}, 'Report_Date_Time': {0: nan, 1: '1/1/2021 19:07:00', 2: '1/1/2021 18:28:00', 3: '1/1/2021 20:34:00', 4: '1/1/2021 23:33:00', 5: nan, 6: '1/1/2021 16:56:00', 7: '1/1/2021 16:16:00', 8: '1/1/2021 17:40:00', 9: '1/1/2021 0:07:00'}, 'Day': {0: '06 - Friday', 1: '06 - Friday', 2: '06 - Friday', 3: '06 - Friday', 4: '06 - Friday', 5: '06 - Friday', 6: '06 - Friday', 7: '06 - Friday', 8: '06 - Friday', 9: '06 - Friday'}, 'Incident_Name': {0: 'HY 91 W/ 
SERFAS CLUB DR', 1: 'CAR', 2: 'PALM DR / BUENA VISTA AV', 3: 'QUAIL PL MENI', 4: 'SUNNY', 5: 'MARTINEZ RD SANJ', 6: 'W METZ RD / ALTURA DR', 7: 'DELL', 8: 'HY 74 E HEM', 9: 'RIVERSIDE DR / JOY ST'}, 'Address': {0: 'HY 91 W Corona CA 92880', 1: 'THEODORE ST/EUCALYPTUS AV Moreno Valley CA 92555', 2: 'PALM DR/BUENA VISTA AV Desert hot springs CA 92240', 3: '23880 KENNEDY LN Menifee CA 92587', 4: '24490 SUNNYMEAD Moreno Valley CA 92553', 5: '40300 MARTINEZ San Jacinto CA 92583', 6: '1388 West METZ Perris CA 92570', 7: '25361 DELPHINIUM Moreno Valley CA 92553', 8: '43763 HY 74 East Hemet CA 92544', 9: 'RIVERSIDE DR/JOY ST Lake Elsinore CA 92530'}, 'Acres_Burned': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 1.0}, 'Wildland_Fire_Cause': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: 'Misuse of Fire by a Minor'}, 'Latitude_D': {0: 33.88206666666667, 1: 33.935, 2: 33.95955, 3: 33.706066666666665, 4: 34.566516666666665, 5: 33.79938166666667, 6: 33.789833333333334, 7: 33.906416666666665, 8: 33.74765, 9: 33.679883333333336}, 'Longitude_D': {0: -117.62385, 1: -117.13931666666667, 2: -116.50103333333333, 3: -117.2422, 4: -117.39321666666666, 5: -116.9573, 6: -117.254, 7: -117.22008333333332, 8: 116.89728333333332, 9: -117.37076666666665}, 'Member_Making_Report': {0: 'Muhammad Nassar', 1: 'DAVID COLOMBO', 2: 'TIMOTHY CABRAL', 3: 'TODD PHILLIPS', 4: 'GREGORY MOWAT', 5: 'MICHAEL ESPARZA', 6: 'Benjamin Hall', 7: 'JORGE LOMELI', 8: 'JOSHUA BALBOA', 9: 'KEVIN MERKH'}, 'Battalion': {0: 4.0, 1: 9.0, 2: 10.0, 3: 13.0, 4: 9.0, 5: 5.0, 6: 1.0, 7: 9.0, 8: 5.0, 9: 2.0}, 'Incident_Status': {0: 'Submitted', 1: 'Submitted', 2: 'Submitted', 3: 'Submitted', 4: 'Submitted', 5: 'Submitted', 6: 'Submitted', 7: 'Submitted', 8: 'Submitted', 9: 'Submitted'}, 'DDLat': {0: '33.88206667N', 1: '33.93500000N', 2: '33.95955000N', 3: '33.70606667N', 4: '34.56651667N', 5: '33.79938167N', 6: '33.78983333N', 7: '33.90641667N', 8: '33.74765000N', 9: 
'33.67988333N'}, 'DDLon': {0: '117.62385000W', 1: '117.13931667W', 2: '116.50103333W', 3: '117.24220000W', 4: '117.39321667W', 5: '116.95730000W', 6: '117.25400000W', 7: '117.22008333W', 8: '116.89728333E', 9: '117.37076667W'}, 'DiscrepancyDistanceFeet': {0: 4178.0, 1: 2388.0, 2: 1094.0, 3: 107.0, 4: 233159.0, 5: 102.0, 6: 1768.0, 7: 78.0, 8: 35603721.0, 9: 9298.0}, 'DiscrepancyDistanceMiles': {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 44.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 6743.0, 9: 2.0}, 'DiscrepancyGreaterThan1000ft': {0: 1.0, 1: 1.0, 2: 1.0, 3: 2.0, 4: 1.0, 5: 2.0, 6: 1.0, 7: 2.0, 8: 1.0, 9: 1.0}, 'LocationLegitimate': {0: nan, 1: nan, 2: nan, 3: 1.0, 4: nan, 5: 1.0, 6: nan, 7: 1.0, 8: nan, 9: nan}, 'LocationErrorCategory': {0: nan, 1: nan, 2: nan, 3: 7.0, 4: nan, 5: 7.0, 6: nan, 7: 7.0, 8: nan, 9: nan}, 'LocationErrorComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'LocationErrorResolution': {0: nan, 1: nan, 2: nan, 3: 6.0, 4: nan, 5: 6.0, 6: nan, 7: 6.0, 8: nan, 9: nan}, 'LocationErrorResolutionComment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'CADLatitudeDDM': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'CADLongitudeDDM': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}, 'GeocodeSymbology': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}, 'Lat/Long': {0: 'Latitude', 1: 'Latitude', 2: 'Latitude', 3: 'Latitude', 4: 'Latitude', 5: 'Latitude', 6: 'Latitude', 7: 'Latitude', 8: 'Latitude', 9: 'Latitude'}, 'CALFIRS_Coords': {0: '33 52.924', 1: '33 56.100', 2: '33 57.573', 3: '33 42.364', 4: '33 93.991', 5: '33 47.9629', 6: '33 47.390', 7: '33 54.385', 8: '33 44.859', 9: '33 40.793'}}
Try:
cols = ['Incident_Number', 'Lat/Long', 'GeoCodeCoordinates']
mergeMelt = GPSMelt.merge(GeoCodeMelt[cols], on=cols[:-1])
The KeyError: 'Incident_Number' is raised because you select GeoCodeMelt[['GeoCodeCoordinates']], so the key columns Incident_Number and Lat/Long no longer exist in the right-hand frame when you merge.
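A minimal runnable sketch with the toy frames from the question, showing that merge happily takes a list of key columns as long as they exist on both sides (how='outer' is an assumption here, to keep records that appear in only one table as NaN):

```python
import pandas as pd

gps = pd.DataFrame({
    'Incident_Number': ['AB123', 'AB123', 'CD321', 'CD321'],
    'Lat/Long': ['Lat', 'Long', 'Lat', 'Long'],
    'GPSCoordinates': [32.123, 120.123, 31.321, 121.321],
})
geo = pd.DataFrame({
    'Incident_Number': ['AB123', 'AB123', 'CD321', 'CD321'],
    'Lat/Long': ['Lat', 'Long', 'Lat', 'Long'],
    'GeoCodeCoordinates': [35.123, 125.123, 36.321, 126.321],
})

# keep the key columns in the right-hand frame so merge can find them;
# the pair (Incident_Number, Lat/Long) acts as a single compound key
merged = gps.merge(geo[['Incident_Number', 'Lat/Long', 'GeoCodeCoordinates']],
                   on=['Incident_Number', 'Lat/Long'], how='outer')
print(merged)
```

No concatenated helper key is needed; passing a list to `on=` is exactly the "combination of keys" behavior asked about.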
I am trying to inner merge two dask dataframes based on two ids, namely doi and pmid.
The datasets look like this (only the head; feel free to modify the doi and pmid values to construct an MWE):
dd_papers_all:
{'cited_by_count': {0: nan, 1: nan, 2: 9.0, 3: nan, 4: 30.0},
'cited_by_url': {0: 'None',
1: 'None',
2: "['W1968224982', 'W1977435724', 'W2003814720', 'W2006453929', 'W2015063028', 'W2139614344']",
3: 'None',
4: "['W181218938', 'W1969520123', 'W1970043627', 'W1977525191', 'W2006834484', 'W2057850214', 'W2062554850', 'W2070252209', 'W2098074569', 'W2123616561', 'W2150154625', 'W2408116868']"},
'authors': {0: "['Walczak M', 'Pawlaczyk J']",
1: "['Ioan Oliver Avram']",
2: "['T.M. Dmitrieva', 'T.P. Eremeeva', 'G.I. Alatortseva', 'Vadim I. Agol']",
3: "['Djurdjina Ružić', 'Tatjana Vujović', 'Gabriela Libiaková', 'Radosav Cerović', 'Alena Gajdošová']",
4: "['M. Harris']"},
'institutions': {0: '[]',
1: '[]',
2: "['Lomonosov Moscow State University', 'USSR Academy of Medical Sciences', 'Lomonosov Moscow State University', 'USSR Academy of Medical Sciences']",
3: "['Fruit Research Institute', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology', 'Fruit Research Institute', 'Institute of Plant Genetics and Biotechnology']",
4: "['Durban University of Technology']"},
'paper_id': {0: 'W155261221',
1: 'W145424619',
2: 'W1482328891',
3: 'W1581876373',
4: 'W1978891149'},
'pub_year_col': {0: 1969, 1: 2010, 2: 1980, 3: 2012, 4: 2008},
'level_0': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'level_1': {0: "['Ophthalmology', 'Pediatrics', 'Internal medicine', 'Endocrinology', 'Anatomy', 'Developmental psychology']",
1: "['Risk analysis (engineering)', 'Manufacturing engineering', 'Industrial engineering', 'Operations research', 'Mechanical engineering', 'Epistemology', 'Climatology', 'Macroeconomics', 'Operating system']",
2: "['Virology', 'Cell biology', 'Biochemistry', 'Quantum mechanics']",
3: "['Horticulture', 'Botany', 'Biochemistry']",
4: "['Pedagogy', 'Mathematics education', 'Paleontology', 'Social science', 'Neuroscience', 'Programming language', 'Quantum mechanics']"},
'level_2': {0: "['Craniopharyngioma', 'Fundus (uterus)', 'Girl', 'Ventricle', 'Fourth ventricle']",
1: "['Process (computing)', 'Production (economics)', 'Machine tool', 'Quality (philosophy)', 'Productivity', 'Multiple-criteria decision analysis', 'Machining', 'Forcing (mathematics)']",
2: "['Mechanism (biology)', 'Replication (statistics)', 'Virus', 'Gene']",
3: "['Vaccinium', 'In vitro']",
4: "['Negotiation', 'Reflective practice', 'Construct (python library)', 'Action research', 'Context (archaeology)', 'Reflective writing', 'Perception', 'Power (physics)']"},
'doi': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'pmid': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'mag': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
and dd_green_papers_frontier:
{'paperid': {0: 2006817976,
1: 2006817976,
2: 1972698438,
3: 1968223008,
4: 2149313415},
'uspto': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'doi': {0: '10.1016/J.AB.2003.07.015',
1: '10.1016/J.AB.2003.07.015',
2: '10.1007/S002170100404',
3: '10.1007/S002170100336',
4: '10.3324/%X'},
'pmid': {0: 14656521.0, 1: 14656521.0, 2: nan, 3: nan, 4: 12414351.0},
'publn_nr_x': {0: 2693, 1: 2693, 2: 2693, 3: 2693, 4: 2715},
'paperyear': {0: 2003.0, 1: 2003.0, 2: 2001.0, 3: 2001.0, 4: 2002.0},
'papertitle': {0: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
1: 'development of melting temperature based sybr green i polymerase chain reaction methods for multiplex genetically modified organism detection',
2: 'event specific detection of roundup ready soya using two different real time pcr detection chemistries',
3: 'characterisation of the roundup ready soybean insert',
4: 'hepatitis c virus infection in a hematology ward evidence for nosocomial transmission and impact on hematologic disease outcome'},
'magfieldid': {0: 149629.0,
1: 149629.0,
2: 143660080.0,
3: 40767140.0,
4: 2780572000.0},
'oecd_field': {0: '2. Engineering and Technology',
1: '2. Engineering and Technology',
2: '1. Natural Sciences',
3: '1. Natural Sciences',
4: '3. Medical and Health Sciences'},
'oecd_subfield': {0: '2.11 Other engineering and technologies',
1: '2.11 Other engineering and technologies',
2: '1.06 Biological sciences',
3: '1.06 Biological sciences',
4: '3.02 Clinical medicine'},
'wosfield': {0: 'Food Science & Technology',
1: 'Food Science & Technology',
2: 'Biochemical Research Methods',
3: 'Biochemistry & Molecular Biology',
4: 'Hematology'},
'author': {0: 2083283000.0, 1: 2808753700.0, 2: nan, 3: 2315123700.0, 4: nan},
'country_alpha3': {0: 'ESP', 1: 'ESP', 2: nan, 3: 'BEL', 4: nan},
'country_2': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'docdb_family_id': {0: 37137417,
1: 37137417,
2: 37137417,
3: 37137417,
4: 35462722},
'publn_nr_y': {0: 2693.0, 1: 2693.0, 2: 2693.0, 3: 2693.0, 4: 2715.0},
'cpc_class_interest': {0: 'Y02', 1: 'Y02', 2: 'Y02', 3: 'Y02', 4: 'Y02'}}
Specifically, doi is a string variable and pmid is a float.
Now, what I am trying to do (please feel free to suggest a smarter way to merge these dataframes, since they are very large) is the following:
dd_papall_green = dd_papers_all.merge(
    dd_green_papers_frontier,
    how="inner",
    on=["pmid", "doi"]
).persist()
but it fails with the error:
ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
Hence, I converted both doi and pmid to float:
dd_papers_all['pmid']=dd_papers_all['pmid'].astype(float)
dd_papers_all['doi']=dd_papers_all['doi'].astype(float)
dd_green_papers_frontier['pmid']=dd_green_papers_frontier['pmid'].astype(float)
dd_green_papers_frontier['doi']=dd_green_papers_frontier['doi'].astype(float)
but again the merge fails.
How can I perform the described merge?
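One sketch, under the assumption that the dtype mismatch is the only blocker: a DOI like '10.1016/J.AB.2003.07.015' can never be parsed as a float, so instead of forcing both keys to float, keep doi as a string and make pmid a nullable integer on both sides before merging. Plain pandas is shown for brevity with invented stand-in rows; the same `.astype(...)` calls work on dask dataframes:

```python
import pandas as pd

# tiny stand-ins for the two frames (values adapted from the question;
# the matching paper_id row is hypothetical)
papers_all = pd.DataFrame({
    'paper_id': ['W155261221', 'W2006817976'],
    'doi': [None, '10.1016/J.AB.2003.07.015'],
    'pmid': [float('nan'), 14656521.0],
})
green = pd.DataFrame({
    'paperid': [2006817976, 1972698438],
    'doi': ['10.1016/J.AB.2003.07.015', '10.1007/S002170100404'],
    'pmid': [14656521.0, float('nan')],
})

# align the key dtypes: doi stays text, pmid becomes nullable Int64 so NaN survives
for df in (papers_all, green):
    df['doi'] = df['doi'].astype('string')
    df['pmid'] = df['pmid'].astype('Int64')

merged = papers_all.merge(green, how='inner', on=['pmid', 'doi'])
print(merged[['paper_id', 'paperid']])
```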
I'm trying to access a column in a pandas dataframe in python.
Here is the dataframe I am using. I only printed the first row to keep it a manageable size. I would have selected a smaller number of columns, but as you will see, selecting columns is exactly my problem and the reason I'm asking this question.
{'OilSampleID': {0: 'TGSM3059'}, 'Type': {0: 'Oil Seep'}, 'Country': {0: 'Mexico'}, 'USGS Province': {0: 'Ocean'}, 'State / Province': {0: 'GOM'}, 'County / Block': {0: nan}, 'Block': {0: 'TGS Area 2'}, 'Section': {0: '11'}, 'Well': {0: 'VER115'}, 'Upper Depth (ft)': {0: nan}, 'Lower Depth (ft)': {0: nan}, 'Standardized Reservoir Age': {0: nan}, 'Reservoir Age': {0: nan}, 'Standardized Reservoir Formation': {0: nan}, 'Reservoir Formation': {0: nan}, 'Lithology': {0: nan}, 'Operator': {0: 'TDI-Brooks'}, 'Location': {0: nan}, 'Latitude': {0: 21.712965}, 'Longitude': {0: -96.280545}, 'Datum': {0: 'WGS84 UTM15N'}, 'API / UWI Well #': {0: nan}, 'Comments': {0: 'Piston Core Sediment'}, 'SampleID': {0: nan}, 'ClientID': {0: nan}, 'API Gravity': {0: nan}, '<C15': {0: nan}, '% S': {0: nan}, 'ppm Ni': {0: nan}, 'ppm V': {0: nan}, '% Sat': {0: 10.3448275862249}, '% Aro': {0: 13.7931034482896}, '% NSO': {0: 48.2758620689676}, '% Asph': {0: 27.5862068965179}, 'Sat / Aro': {0: 0.75}, '13Cs': {0: nan}, '13Ca': {0: -27.51}, '13Cwc': {0: nan}, '34Swc': {0: nan}, 'CV': {0: nan}, 'EOM': {0: 193.0}, 'Misc': {0: 'TSF = 150,000'}, 'Misc.1': {0: 150000.0}, 'SRType': {0: 'Marine Carbonate/Marl'}, 'SRTconf': {0: 'suspected'}, 'SRAge': {0: 'UJ/LK'}, 'SRAconf': {0: 'suspected'}, 'SRM': {0: nan}, 'BIOD': {0: 'Mild 4'}, 'BIOD_': {0: 4.0}, 'GM REF': {0: nan}, 'GM Fam': {0: nan}, 'C19/C23': {0: 0.018}, 'C22/C21': {0: 1.176}, 'C24/C23': {0: 0.405}, 'C26/C25': {0: 0.672}, 'Tet/C23': {0: 0.471}, 'C27T/C27': {0: 0.015}, 'C28/H': {0: 0.029}, 'C29/H': {0: 0.988}, 'X/H': {0: 0.013}, 'OL/H': {0: 0.009}, 'C31R/H': {0: 0.408}, 'GA/C31R': {0: 0.06}, 'C35S/C34S': {0: 1.131}, 'S/H': {0: 0.242}, '% C27': {0: 23.975}, '% C28': {0: 28.596}, '% C29': {0: 47.429}, 'S1/S6': {0: 0.887}, 'C29 20S/R': {0: 0.63}, 'C29 bbS/aaR': {0: 1.06}, 'C27 Ts/Tm': {0: 0.382}, 'C29 Ts/Tm': {0: 0.1}, 'DM/H': {0: 0.012}, 'C26(S+R) / Ts': {0: 0.316}, 'P': {0: 8.98723829264705}, '3MP': {0: 4.01903139633122}, '2MP': {0: 5.318803252166}, 
'9MP': {0: 5.38721229720994}, '1MP': {0: 3.85655991435188}, 'C4N': {0: 0.43610766215509}, 'DBT': {0: 0.0}, '4MDBT': {0: 0.256533918914759}, '32MDBT': {0: nan}, '1MDBT': {0: nan}, 'C20TAS': {0: 1.65036821168495}, 'C21TAS': {0: 2.32590753149381}, 'C26STAS': {0: 12.6898778556501}, 'C26RC27STAS': {0: 62.243679859351}, 'C28STAS': {0: 52.8801918189623}, 'C27RTAS': {0: 42.2254830533693}, 'C28RTAS': {0: 46.860195855096}, '#9b': {0: 8.0}, '#3e': {0: 11.0}, 'C18': {0: 0.0}, 'C19': {0: 1.0}, 'C20': {0: 1.0}, 'MPI': {0: 0.768292682926829}, 'F1': {0: 0.50253106304648}, 'F2': {0: 0.286240220892775}, 'P/DBT': {0: nan}, 'DBT/C4N': {0: nan}, 'MDR': {0: nan}, 'TAS1': {0: 0.0376145000974469}, 'TAS2': {0: 0.0472878998609179}, 'TAS3(CR)': {0: 0.0180023228803717}, 'TAS4': {0: 0.239974126778784}, 'TAS5': {0: 0.901094890510949}, 'DINO3/9': {0: 1.3740932642487}, 'C19T': {0: 0.181692648715433}, 'C20T': {0: 0.622946224167199}, 'C21T': {0: 1.62225579210208}, 'C22T': {0: 1.90777281151205}, 'C23T': {0: 9.92820544766473}, 'C24T': {0: 4.02319436441316}, 'C25S': {0: 2.336048340627}, 'C25R': {0: 2.336048340627}, 'TET': {0: 4.67209668125399}, 'C26S': {0: 1.60927774576526}, 'C26R': {0: 1.53140946774436}, 'ET_C28S': {0: 1.03824370694533}, 'ET_C28R': {0: 1.23291440199758}, 'ET_C29S': {0: 2.01159718220658}, 'ET_C29R': {0: 3.02388479647828}, 'ET_C30S': {0: 1.36269486536575}, 'ET_C30R': {0: 5.84012085156749}, 'ET_C31S': {0: 1.14206807763986}, 'ET_C31R': {0: 0.584012085156749}, 'Ts': {0: 9.92820544766473}, 'C27T': {0: 0.545077946146299}, 'Tm': {0: 26.0209829053174}, 'C28DM': {0: 2.40093857231108}, 'C28H': {0: 2.79027996241558}, 'C29DM': {0: 1.12909003130305}, 'C29H': {0: 96.2451916338322}, 'C29D': {0: 9.60375428924431}, 'C30X': {0: 1.29780463368166}, 'OL': {0: 0.882507150903532}, 'C30H': {0: 97.4391718968193}, 'C30M': {0: 8.70826909200397}, 'C31S': {0: 56.6751283528783}, 'C31R': {0: 39.7387778833326}, 'GA': {0: 2.38796052597426}, 'C32S': {0: 27.461546048704}, 'C32R': {0: 18.0135283155015}, 'C33S': {0: 
20.6610497682121}, 'C33R': {0: 12.6146610393858}, 'C34S': {0: 12.458924483344}, 'C34R': {0: 7.7219375704059}, 'C35S': {0: 14.0941583217829}, 'C35R': {0: 8.53955448962535}, 'S1': {0: 7.30404447836041}, 'S2': {0: 4.12442312584033}, 'S3': {0: 4.85119372070206}, 'S4': {0: 11.337621279843}, 'S4B': {0: 10.4109887713943}, 'S5': {0: 6.32290417529707}, 'S5B': {0: 7.54024492169047}, 'S6': {0: 8.23067698680911}, 'S7': {0: 2.54369708201606}, 'S8': {0: 3.83371488789564}, 'S9': {0: 6.23205785093935}, 'S9B': {0: 8.394200370653}, 'S10': {0: 7.19502888913115}, 'S10B': {0: 8.99378611141393}, 'S11': {0: 3.67019150405175}, 'S12': {0: 7.4675678622043}, 'S13': {0: 14.0085032159599}, 'S13B': {0: 16.0071223518296}, 'S14': {0: 12.4641157018787}, 'S14B': {0: 14.916966459537}, 'ISTD': {0: 500.0}, 'S15': {0: 11.7918529016316}, 'Biodegraded': {0: nan}, '15': {0: nan}, '16': {0: nan}, '17': {0: nan}, 'Pr': {0: nan}, '18': {0: nan}, 'Ph': {0: nan}, '19': {0: nan}, '20': {0: nan}, '21': {0: nan}, '22': {0: nan}, '23': {0: nan}, '24': {0: nan}, '25': {0: nan}, '26': {0: nan}, '27': {0: nan}, '28': {0: nan}, '29': {0: nan}, '30': {0: nan}, '31': {0: nan}, '32': {0: nan}, '33': {0: nan}, '34': {0: nan}, '35': {0: nan}, 'Pr/Ph': {0: nan}, 'Pr/nC17': {0: nan}, 'Ph/nC18': {0: nan}, 'nC27/nC17': {0: nan}, 'nC19*2/(nC18+nC20)': {0: nan}, 'CPI': {0: nan}, 'S1/S6.1': {0: 0.887}, 'DWW D/R': {0: 1.1739236190043156}, 'DWW Ts/(Ts+Tm)': {0: 0.27617328519855566}, 'DWW 35SH/34SH': {0: 1.131}, 'DWW 29H/H': {0: 0.988}, 'DWW OL/H': {0: 0.009}, 'DWW GA/H': {0: 0.02450719232818326}, 'DWW 31H/H': {0: 0.9894778902504008}, 'DWW 29H/31H': {0: 0.9982501009557132}, 'DWW %27 St': {0: 23.975}, 'DWW %28 St': {0: 28.596}, 'DWW %29 St': {0: 47.429}, 'DWW M/H': {0: 0.08937133724027711}, 'DWW St/Ho': {0: 0.242}, 'DWW 29S/R': {0: 0.63}, 'Family': {0: 2}}
I looked at this StackOverflow post to try to find the answer, but I think I'm running into problems because my data has headers with spaces in them.
I followed the other example to try and select just some of the columns. Here is my code to try and select only a few.
geology_data_selection = geology_data[['OilSampleID', 'Type', 'Country', 'USGS Province', 'Well', 'Latitude', 'Longitude', 'EOM', 'Misc...43', 'BIOD', '% Sat', '% Aro', '% NSO', '% Asph']]
I was thinking the problem is that some of these columns, like % Sat, have a space in the name. One of the comments under that SO post told me to use backticks, so I tried using backticks, but that didn't work either.
geology_data_selection = geology_data[['OilSampleID', 'Type', 'Country', '`USGS Province`', 'Well', 'Latitude', 'Longitude', 'EOM', 'Misc...43', 'BIOD', '`% Sat`', '`% Aro`', '`% NSO`', '`% Asph`']]
How can I select columns from a pandas dataframe explicitly when the column names aren't all that great (some of them with '...' and some of them with spaces)?
If you supply keys that exist... this works just fine.
df[['OilSampleID', 'Type', 'Country', 'USGS Province', 'Well', 'Latitude', 'Longitude', 'EOM', 'BIOD', '% Sat', '% Aro', '% NSO', '% Asph']]
Output:
OilSampleID Type Country USGS Province Well Latitude Longitude EOM BIOD % Sat % Aro % NSO % Asph
0 TGSM3059 Oil Seep Mexico Ocean VER115 21.712965 -96.280545 193.0 Mild 4 10.344828 13.793103 48.275862 27.586207
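When a selection raises a KeyError, the quickest check is to print the real labels and look for near-misses (the sample above has 'Misc' and 'Misc.1', not 'Misc...43'). A small sketch, with a hypothetical frame standing in for geology_data:

```python
import pandas as pd

# hypothetical stand-in; the real frame comes from the question's data
geology_data = pd.DataFrame({'OilSampleID': ['TGSM3059'], '% Sat': [10.34],
                             'Misc': ['TSF = 150,000'], 'Misc.1': [150000.0]})

# list the exact labels pandas knows about -- spaces and all
print(list(geology_data.columns))

# spaces in a label are fine inside a [[...]] selection; no backticks needed
# (backticks are only for df.query()/df.eval() expressions)
wanted = ['OilSampleID', '% Sat', 'Misc.1']
missing = [c for c in wanted if c not in geology_data.columns]
print('missing:', missing)          # fix these names before selecting
selection = geology_data[wanted]
```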
I am trying to merge two Excel files; my data is:
tabla muestra.xlsx
{'Mandante': {0: 400, 1: 400, 2: 400, 3: 400, 4: 400}, 'Usuario': {0: 152163681, 1: '162181297', 2: '144912861', 3: '140752630', 4: '167300316'}, 'Funcion': {0: 'COMPRADOR', 1: 'JEFE DE COMPRAS', 2: 'COMPRADOR', 3: 'COMPRADOR', 4: 'JEFE DE COMPRAS'}, 'Tipo usuario contractual': {0: 'SAP Application Professional', 1: 'SAP Application Professional', 2: 'SAP Application Professional', 3: 'SAP Application Professional', 4: 'SAP Application Professional'}}
and tabla usuarios roles.xlsx
{'Identificación mdte.': {0: 400, 1: 400, 2: 400, 3: 400, 4: 400}, 'Rol': {0: 'SAP_BC_WEBSERVICE_ADMIN', 1: 'SAP_BC_WEBSERVICE_CONSUMER', 2: 'SAP_BC_WEBSERVICE_SERVICE_USER', 3: 'SAP_J2EE_ADMIN', 4: 'SAP_SDCCN_ALL'}, 'Usuario': {0: 'WEBSERVICE', 1: 'WEBSERVICE', 2: 'WEBSERVICE', 3: 'SM_ADMIN_S4P', 4: 'ADMIN_SONDA'}, 'Fecha de inicio': {0: '01.03.2019', 1: '01.03.2019', 2: '01.03.2019', 3: '16.05.2019', 4: '06.08.2019'}, 'Fecha fin': {0: '31.12.9999', 1: '31.12.9999', 2: '31.12.9999', 3: '31.12.9999', 4: '31.12.9999'}, 'Excluido': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}, 'Fecha': {0: '01.03.2019', 1: '01.03.2019', 2: '01.03.2019', 3: '16.05.2019', 4: '06.08.2019'}, 'Hora': {0: datetime.time(16, 11, 6), 1: datetime.time(16, 11, 6), 2: datetime.time(16, 11, 6), 3: datetime.time(15, 27, 30), 4: datetime.time(9, 25, 57)}, 'Cronomarcador UTC en forma breve (AAAAMMDDhhmmss)': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}, 'Org.HR': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}, 'Asign.proviene de rol compuesto': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
Using the code
# importing the module
import pandas
# reading the files
f1 = pandas.read_excel("~/Desktop/tabla muestra.xlsx")
f2 = pandas.read_excel("~/Desktop/tabla usuarios roles.xlsx")
# merging the files
f3 = f1[["Usuario"]].merge(f2[["Usuario", "Rol"]],
                           on="Usuario",
                           how="outer")
# creating a new file
f3.to_excel("~/Desktop/Resultstest5.xlsx", index = False)
After running the code, the result is not what I expected. I checked and the IDs are in both tables; any clues what's happening?
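A hedged guess at a diagnosis: in the tabla muestra sample, 'Usuario' mixes an int (152163681) with strings, so equal-looking IDs can fail to match across the files. Normalizing the key to str on both sides and merging with indicator=True shows exactly which rows matched (the f2 values below are hypothetical, adapted to give one match):

```python
import pandas as pd

f1 = pd.DataFrame({'Usuario': [152163681, '162181297']})
f2 = pd.DataFrame({'Usuario': ['152163681', 'WEBSERVICE'],
                   'Rol': ['COMPRADOR', 'SAP_BC_WEBSERVICE_ADMIN']})

# normalize the key dtype so 152163681 matches '152163681'
f1['Usuario'] = f1['Usuario'].astype(str).str.strip()
f2['Usuario'] = f2['Usuario'].astype(str).str.strip()

f3 = f1.merge(f2, on='Usuario', how='outer', indicator=True)
print(f3['_merge'].value_counts())   # 'both' rows are the real matches
```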
Trying to resolve the error:
application.py:25: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
application.py:26: SettingWithCopyWarning:
but I can't figure out why I'm getting this warning or how to resolve it.
This is my code:
hr = hr_data[['Month', 'SalesSystemCode', 'TITULO', 'BirthDate', 'HireDate', 'SupervisorEmployeeID', 'BASE',
              'carallowance', 'Commission_Target', 'Area', 'Fulfilment %', 'Commission Accrued', 'Commission paid',
              'Características (D)', 'Características (I)', 'Características (S)', 'Características (C)',
              'Motivación (D)', 'Motivación (I)', 'Motivación (S)', 'Motivación (C)',
              'Bajo Stress (D)', 'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']]
sales = sales_data[['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']]
#report month to datetime
sales['Report month'] = pd.to_datetime(sales['Report month'])
hr['Month'] = pd.to_datetime(hr['Month'])
#remove sales where customer churned
sales_clean = sales.loc[sales['Cod. Motivo Desconexion'] == 0]
sales_clean = sales_clean[['Report month','Rental Charge','ID Vendedor']]
sales_clean2 = pd.DataFrame(sales_clean.groupby(['Report month','ID Vendedor'])['Rental Charge'].sum())
sales_clean2.reset_index(inplace=True)
hr_area = hr.loc[hr['Area'] == 'Area 1']
merged_hr = hr_area.merge(sales_clean, left_on=['SalesSystemCode','Month'],right_on=['ID Vendedor','Report month'],how='left')
#creating new features: months of employment
merged_hr['MonthsofEmploymentRounded'] = round((merged_hr['Month'] - merged_hr['HireDate'])/np.timedelta64(1,'M')).astype('int')
#filters for interaction
YEAR_MONTH = merged_hr['Month'].unique()
#css stylesheet
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
#html layout
app.layout = html.Div(children=[
    html.H1(children='SAC Challenge Level 2 Dashboard', style={
        'textAlign': 'center',
        'height': '10'
    }),
    html.Div(children='''
        Objective: Studying the impact of supervision on the performance of sales executives in Area 1
    '''),
    dcc.DatePickerRange(
        id='year_month',
        start_date=min(merged_hr['Month'].dt.date.tolist()),
        end_date='Select date'
    ),
    dcc.Graph(
        id='performancetable'
    )
])
@app.callback(dash.dependencies.Output('performancetable', 'figure'),
              [dash.dependencies.Input('year_month', 'start_date'),
               dash.dependencies.Input('year_month', 'end_date')])
def update_table(year_month):
    if year_month is None or year_month == []:
        year_month = YEAR_MONTH
    performance = merged_hr[merged_hr['Month'].isin(year_month)]
    return {
        'data': [
            go.Table(
                header=dict(values=list(performance.columns), fill_color='paleturquoise', align='left'),
                cells=dict(values=[performance['Month'], performance['SalesSystemCode'], performance['TITULO'],
                                   performance['HireDate'], performance['MonthsofEmploymentRounded'], performance['SupervisorEmployeeID'],
                                   performance['BASE'], performance['carallowance'], performance['Commission_Target'],
                                   performance['Fulfilment %'], performance['Commission Accrued'], performance['Commission paid'],
                                   performance['Características (D)'], performance['Características (I)'], performance['Características (S)'],
                                   performance['Características (C)'], performance['Motivación (D)'], performance['Motivación (I)'],
                                   performance['Motivación (S)'], performance['Motivación (C)'], performance['Bajo Stress (D)'],
                                   performance['Bajo Stress (I)'], performance['Bajo Stress (S)'], performance['Bajo Stress (C)'],
                                   performance['Rental Charge']])
            )],
    }

if __name__ == '__main__':
    app.run_server(debug=True)
Running this produces the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
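The pattern that triggers it can be isolated from the rest of the app. A minimal sketch with a hypothetical toy frame standing in for sales_data (a plain [[...]] column selection followed by an assignment, same as the code above):

```python
import pandas as pd

# Hypothetical toy frame standing in for sales_data
sales_data = pd.DataFrame({'Report month': ['2017-07', '2017-08'],
                           'Rental Charge': [22.5, 18.7]})

# Plain [[...]] selection followed by an assignment -- pandas cannot
# tell whether this is a chained assignment on a slice, so (depending
# on version) it may emit SettingWithCopyWarning here.
sales = sales_data[['Report month', 'Rental Charge']]
sales['Report month'] = pd.to_datetime(sales['Report month'])
```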
Here is a sample of hr_data:
{'Month': {0: Timestamp('2017-12-01 00:00:00'),
1: Timestamp('2017-12-01 00:00:00'),
2: Timestamp('2017-12-01 00:00:00'),
3: Timestamp('2017-12-01 00:00:00'),
4: Timestamp('2017-12-01 00:00:00')},
'EmployeeID': {0: 91868, 1: 1812496, 2: 1812430, 3: 700915, 4: 1812581},
'PayrollProviderName': {0: 'Tele',
1: 'People',
2: 'People',
3: 'Stratego',
4: 'People'},
'SalesSystemCode': {0: 91868.0,
1: 802496.0,
2: 2430.0,
3: 700915.0,
4: 802581.0},
'Payroll Type': {0: 'Insourcing',
1: 'Third Party',
2: 'Third Party',
3: 'Third Party',
4: 'Third Party'},
'Name': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'TITULO': {0: 'SALES SUPERVISOR',
1: 'SALES EXECUTIVE',
2: 'SALES EXECUTIVE',
3: 'SALES EXECUTIVE',
4: 'SALES EXECUTIVE'},
'Sexo': {0: 'M', 1: 'F', 2: 'F', 3: 'M', 4: 'F'},
'BirthDate': {0: Timestamp('1982-11-05 00:00:00'),
1: Timestamp('1987-09-24 00:00:00'),
2: Timestamp('1981-01-13 00:00:00'),
3: Timestamp('1986-04-18 00:00:00'),
4: Timestamp('1991-06-24 00:00:00')},
'HireDate': {0: Timestamp('2012-04-23 00:00:00'),
1: Timestamp('2017-04-10 00:00:00'),
2: Timestamp('2017-03-13 00:00:00'),
3: Timestamp('2015-01-22 00:00:00'),
4: Timestamp('2017-05-18 00:00:00')},
'SupervisorEmployeeID': {0: 7935, 1: 91868, 2: 91868, 3: 91868, 4: 91868},
'SupervisorName': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'BASE': {0: 895, 1: 700, 2: 700, 3: 700, 4: 700},
'carallowance': {0: 350, 1: 250, 2: 250, 3: 250, 4: 250},
'Commission_Target': {0: 708.33, 1: 583.33, 2: 583.33, 3: 583.33, 4: 583.33},
'Nacionalidad': {0: 'INT', 1: 'INT', 2: 'INT', 3: 'INT', 4: 'INT'},
'Area': {0: 'Area 1', 1: 'Area 1', 2: 'Area 1', 3: 'Area 1', 4: 'Area 1'},
'Comment': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Sales Quota (points)': {0: 1810.0, 1: 108.0, 2: 108.0, 3: 108.0, 4: 108.0},
'Real (points)': {0: 1855.0, 1: 86.0, 2: 245.0, 3: 149.0, 4: 91.0},
'Fulfilment %': {0: 1.0248618784530388,
1: 0.7962962962962963,
2: 2.2685185185185186,
3: 1.3796296296296295,
4: 0.8425925925925926},
'Commission Accrued': {0: 708.33, 1: 583.33, 2: 583.33, 3: 583.33, 4: 583.33},
'OA Commission Accrued': {0: 653.66,
1: 87.5,
2: 1494.79,
3: 794.79,
4: 160.42},
'Clawback': {0: 0.0, 1: 24.33, 2: 144.9, 3: 36.77, 4: 0.0},
'Other Commissions': {0: 0.0, 1: 0.0, 2: 9.16, 3: 9.16, 4: 0.0},
'Commission paid': {0: 1361.99, 1: 646.51, 2: 1942.38, 3: 1350.52, 4: 743.75},
'Exit Date': {0: NaT,
1: Timestamp('2018-04-13 00:00:00'),
2: NaT,
3: NaT,
4: Timestamp('2018-08-31 00:00:00')},
'Legal Motive': {0: nan,
1: 'Artículo No. 212',
2: nan,
3: nan,
4: 'Artículo No. 212'},
'Características (D)': {0: nan, 1: 70.0, 2: 70.0, 3: 60.0, 4: 67.0},
'Características (I)': {0: nan, 1: 95.0, 2: 62.0, 3: 25.0, 4: 15.0},
'Características (S)': {0: nan, 1: 20.0, 2: 48.0, 3: 75.0, 4: 40.0},
'Características (C)': {0: nan, 1: 25.0, 2: 34.0, 3: 85.0, 4: 94.0},
'Motivación (D)': {0: nan, 1: 85.0, 2: 75.0, 3: 40.0, 4: 59.0},
'Motivación (I)': {0: nan, 1: 95.0, 2: 74.0, 3: 74.0, 4: 25.0},
'Motivación (S)': {0: nan, 1: 11.0, 2: 58.0, 3: 65.0, 4: 65.0},
'Motivación (C)': {0: nan, 1: 7.0, 2: 33.0, 3: 84.0, 4: 93.0},
'Bajo Stress (D)': {0: nan, 1: 60.0, 2: 69.0, 3: 79.0, 4: 79.0},
'Bajo Stress (I)': {0: nan, 1: 86.0, 2: 60.0, 3: 6.0, 4: 18.0},
'Bajo Stress (S)': {0: nan, 1: 40.0, 2: 60.0, 3: 89.0, 4: 30.0},
'Bajo Stress (C)': {0: nan, 1: 60.0, 2: 48.0, 3: 84.0, 4: 92.0}}
sales_data:
{'Month': {0: Timestamp('2017-07-01 00:00:00'),
1: Timestamp('2017-07-01 00:00:00'),
2: Timestamp('2017-07-01 00:00:00'),
3: Timestamp('2017-07-01 00:00:00'),
4: Timestamp('2017-07-01 00:00:00')},
'Report month': {0: '2017-07',
1: '2017-07',
2: '2017-07',
3: '2017-07',
4: '2017-07'},
'Area': {0: 'Area 1', 1: 'Area 1', 2: 'Area 1', 3: 'Area 1', 4: 'Area 1'},
'Fecha de solicitud': {0: Timestamp('2017-07-25 14:49:51'),
1: Timestamp('2017-07-25 14:56:14'),
2: Timestamp('2017-06-30 13:07:10'),
3: Timestamp('2017-07-03 18:25:17'),
4: Timestamp('2017-07-04 09:56:24')},
'Fecha de salida': {0: Timestamp('2017-07-27 13:11:42'),
1: Timestamp('2017-07-27 15:08:39'),
2: Timestamp('2017-07-04 11:50:07'),
3: Timestamp('2017-07-07 16:40:44'),
4: Timestamp('2017-07-14 14:52:45')},
'Fecha de salida final': {0: Timestamp('2017-07-28 15:13:53'),
1: Timestamp('2017-07-27 15:46:16'),
2: Timestamp('2017-07-05 10:24:46'),
3: Timestamp('2017-07-08 08:36:43'),
4: Timestamp('2017-07-15 10:00:02')},
'Fecha de proceso': {0: Timestamp('2017-08-01 00:00:00'),
1: Timestamp('2017-08-01 00:00:00'),
2: Timestamp('2017-08-01 00:00:00'),
3: Timestamp('2017-08-01 00:00:00'),
4: Timestamp('2017-08-01 00:00:00')},
'Fecha de sistema': {0: Timestamp('2017-07-25 14:49:51'),
1: Timestamp('2017-07-25 14:56:14'),
2: Timestamp('2017-06-30 13:07:10'),
3: Timestamp('2017-07-03 18:25:17'),
4: Timestamp('2017-07-04 09:56:24')},
'Fecha de completada': {0: Timestamp('2017-07-28 15:13:52'),
1: Timestamp('2017-07-27 15:46:15'),
2: Timestamp('2017-07-05 10:24:45'),
3: Timestamp('2017-07-08 08:36:42'),
4: Timestamp('2017-07-15 10:00:02')},
'Fecha de creada': {0: Timestamp('2017-07-25 14:50:00'),
1: Timestamp('2017-07-25 14:56:00'),
2: Timestamp('2017-06-30 13:07:00'),
3: Timestamp('2017-07-03 18:25:00'),
4: Timestamp('2017-07-04 09:56:00')},
'Cod. de Distribucion': {0: 2302, 1: 2302, 2: 2302, 3: 91818, 4: 2302},
'Customer': {0: 19308378, 1: 19308378, 2: 27504455, 3: 27104497, 4: 17608676},
'Cod. Tipo Cliente': {0: 'R', 1: 'R', 2: 'R', 3: 'R', 4: 'R'},
'Tipo De Cliente': {0: 'Residencial ',
1: 'Residencial ',
2: 'Residencial ',
3: 'Residencial ',
4: 'Residencial '},
'Cuenta': {0: 193083780000,
1: 193083780000,
2: 275044550000,
3: 271044970000,
4: 176086760000},
'Status Cuenta': {0: 'W', 1: 'W', 2: 'W', 3: 'W', 4: 'W'},
'Tipo de Contabilidad': {0: 'RP', 1: 'RP', 2: 'RP', 3: 'RP', 4: 'RP'},
'Desc. Tipo Contabilidad': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Tos Cat': {0: 'K', 1: 'K', 2: 'K', 3: 'K', 4: 'K'},
'Desc. Tos Cat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Mktg Cat': {0: 990005.0, 1: 990005.0, 2: 990000.0, 3: 990000.0, 4: 990000.0},
'Desc. Mktg Cat': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Bill Sort': {0: 571.0, 1: 571.0, 2: 571.0, 3: 691.0, 4: 256.0},
'Orden de Servicio': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Comando': {0: 'PMO', 1: 'PFB', 2: 'PMO', 3: 'PMO', 4: 'PMO'},
'Desc. Comando': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Prioridad': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5},
'Cod. Línea': {0: 3, 1: 2, 2: 1, 3: 1, 4: 1},
'Número de Servicio': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Producto': {0: 1420, 1: 31000, 2: 1403, 3: 1404, 4: 1404},
'Desc. Producto': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Familia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Sub Familia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Rental Charge': {0: 22.5,
1: 18.7125,
2: 15.257499999999999,
3: 19.95,
4: 19.95},
'Inst Charge': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'Control': {0: 'CONEXIONES_COMPLETADAS_CT',
1: 'CONEXIONES_COMPLETADAS_CT',
2: 'CONEXIONES_COMPLETADAS',
3: 'CONEXIONES_COMPLETADAS',
4: 'CONEXIONES_COMPLETADAS'},
'Cod. Estatus': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A'},
'Status': {0: 'Por Acción ',
1: 'Por Acción ',
2: 'Por Acción ',
3: 'Por Acción ',
4: 'Por Acción '},
'Cod Razon Pendiente': {0: ' ', 1: ' ', 2: ' ', 3: ' ', 4: ' '},
'Razon Pendiente': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Motivo Desconexion': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Motivo Desconexion': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Cod. Agencia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Agencia': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'ID Vendedor': {0: 2352.0, 1: 2352.0, 2: 2352.0, 3: 2352.0, 4: 2352.0},
'ID Oficinista': {0: 229113.0,
1: 229113.0,
2: 224666.0,
3: 221532.0,
4: 224666.0},
'ID Acct Manager': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'Desc. Acct Manager': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Provincia': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'},
'Central': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Chrg Prod Ant': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Tipo Srv': {0: 'MO', 1: 'TI', 2: 'MO', 3: 'MO', 4: 'MO'},
'Tipo Srv Desc': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Diferencia ': {0: 2.5500000000000007,
1: 0.0,
2: 15.257499999999999,
3: 19.95,
4: 19.95},
'Puntos ': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
@QuanHoang was pointing in the right direction with his comment, but you need to add .copy() for both the hr and sales dataframes:
hr = hr_data[['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']].copy()
sales = sales_data[['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']].copy()
Using .copy() works because it creates a full copy of the data rather than a (possible) view, so subsequent assignments act on an independent DataFrame and pandas no longer warns about chained assignment.
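A minimal sketch of the fix, again with a hypothetical toy frame standing in for sales_data:

```python
import pandas as pd

# Hypothetical toy frame standing in for sales_data
sales_data = pd.DataFrame({'Report month': ['2017-07', '2017-08'],
                           'Rental Charge': [22.5, 18.7]})

# .copy() makes sales an independent frame, so the assignment below is
# an ordinary write, not a possible chained assignment on a view, and
# the original sales_data is left untouched.
sales = sales_data[['Report month', 'Rental Charge']].copy()
sales['Report month'] = pd.to_datetime(sales['Report month'])
```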
Another option is to use .loc[] indexing when you do the selection from hr_data and sales_data. This should also work:
hr = hr_data.loc[:, ['Month','SalesSystemCode','TITULO','BirthDate','HireDate','SupervisorEmployeeID','BASE','carallowance','Commission_Target','Area','Fulfilment %','Commission Accrued','Commission paid',
'Características (D)', 'Características (I)', 'Características (S)','Características (C)', 'Motivación (D)', 'Motivación (I)','Motivación (S)', 'Motivación (C)', 'Bajo Stress (D)',
'Bajo Stress (I)', 'Bajo Stress (S)', 'Bajo Stress (C)']]
sales = sales_data.loc[:, ['Report month', 'Area','Customer','Rental Charge','Cod. Motivo Desconexion','ID Vendedor']]
Note that selecting columns with .loc[] uses the form df.loc[:, [*columns*]] because .loc[] requires the rows to be specified explicitly; here : selects all rows.
Using .loc[] helps because the row and column selection happens in a single indexing operation rather than a chain, so pandas is less likely to flag later assignments as 'setting with copy'. Be aware, though, that list-based column selection still returns a copy of the data, and some pandas versions will still emit the warning on a later assignment, so appending .copy() remains the most reliable fix.
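The same toy sketch in the .loc[] form (hypothetical data; depending on your pandas version the assignment may still warn, which is why .copy() is the safer option):

```python
import pandas as pd

# Hypothetical toy frame standing in for sales_data
sales_data = pd.DataFrame({'Report month': ['2017-07', '2017-08'],
                           'Rental Charge': [22.5, 18.7]})

# Single-step .loc[] selection: all rows (:), explicit column list.
sales = sales_data.loc[:, ['Report month', 'Rental Charge']]
sales['Report month'] = pd.to_datetime(sales['Report month'])
```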